How ChatGPT and Claude Actually Work


Hi there! 👋 I'm a frontend developer with hands-on experience building intuitive and scalable user interfaces. I specialize in technologies like React, TypeScript, Next.js, and Redux, and I'm driven by a passion for crafting meaningful, user-friendly projects. Currently diving deep into problem-solving and algorithms to refine my developer mindset, I enjoy breaking down complex challenges into elegant solutions. When I'm not coding, you'll find me contributing to open-source projects or sharing insights about web development, design patterns, and tech trends here on Hashnode. Let's connect and learn together! 🚀

Part 3 of 3: The AI Series That Explains Everything Simply

In Part 1, we built a spam detector and understood how a neural network thinks. In Part 2, we watched it train: weights going from random to intelligent through thousands of loops. Now comes the payoff. The question that's been waiting since the beginning: what makes ChatGPT different? Why does it feel less like a calculator and more like a mind? The answer is one idea. And it came from a single research paper.


Let's start with a problem.

Our spam detector from Part 1 is smart. It reads 8 features from an email and makes a decision. But here's what it can't do:

Sentence: "I grew up in France, so I speak fluent ___"

Our spam detector: has no idea what goes in the blank.
ChatGPT: "French." Instantly. Correctly.

Why can ChatGPT do this when our spam detector can't?

It's not just scale, though scale matters enormously. It's a specific architectural breakthrough that completely changed what neural networks can do with language.

It's called Attention.


The Problem Attention Solves

To understand why Attention matters, we need to understand what came before it.

Before 2017, most language AI used a type of network called an RNN (Recurrent Neural Network). An RNN reads text word by word: sequentially, like reading a book one word at a time, left to right.

The problem: by the time it reached the end of a long sentence, it had largely forgotten the beginning.

RNN approach (word by word):
"I" → "grew" → "up" → "in" → "France" → "so" → "I" → "speak" → "fluent" → ???

By the time it reaches "fluent", "France" is already fading. The connection is lost.
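
One way to picture the fading is a toy model, sketched below. It is not a real RNN: it simply assumes that each new word keeps a fixed fraction of the older signal. The function name `remaining_influence` and the value `keep = 0.6` are made up for illustration; real RNNs use learned weights and nonlinear updates.

```python
def remaining_influence(words_since: int, keep: float = 0.6) -> float:
    """Fraction of a word's original signal left after later words are read.

    Assumption: every step keeps a fixed fraction `keep` of the old state.
    """
    return keep ** words_since


sentence = ["I", "grew", "up", "in", "France", "so", "I", "speak", "fluent"]
last = len(sentence) - 1

# Influence of each word by the time the model reaches "fluent".
influence = {
    f"{i}:{word}": remaining_influence(last - i)
    for i, word in enumerate(sentence)
}

for key, value in influence.items():
    print(f"{key:10s} {value:.3f}")
```

Under these assumptions, "France" sits four steps back and keeps only about 13% of its original signal, while "fluent" is still at full strength.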

This is a fundamental limitation. Language is full of long-range dependencies β€” words at the start of a paragraph that determine meaning at the end.

Attention solves this by doing something radical: reading the entire sentence at once, and letting every word directly examine every other word.


What Attention Actually Does

Here's the core idea, put as simply as possible:

When predicting the missing word, Attention lets the model ask: "Which words in this sentence are most relevant to what I'm trying to figure out right now?"

For our France sentence:

Attention weights for predicting the blank (illustrative values that sum to 1):

"I"      → 0.02  (barely relevant)
"grew"   → 0.01  (barely relevant)
"up in"  → 0.02  (barely relevant)
"France" → 0.70  ← THIS is what matters
"so"     → 0.02  (low)
"speak"  → 0.08  (medium: gives context)
"fluent" → 0.15  (medium: language hint)

The model paid the most attention to "France" when deciding what comes next.

That's the entire concept. Attention is the ability to look at everything simultaneously and decide what matters most, for each specific prediction.
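
In practice, attention weights are usually produced by a softmax, which turns raw relevance scores into positive numbers that sum to 1. The sketch below shows that step on its own. The raw scores are invented for this sentence; a real model computes them from learned weights.

```python
import math


def softmax(scores):
    """Turn raw relevance scores into positive weights that sum to 1."""
    biggest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


words = ["I", "grew", "up", "in", "France", "so", "I", "speak", "fluent"]
raw_scores = [0.1, 0.0, 0.0, 0.1, 4.0, 0.2, 0.1, 1.5, 2.0]  # made-up values

weights = softmax(raw_scores)
for word, weight in zip(words, weights):
    print(f"{word:8s} {weight:.2f}")
```

Because softmax exaggerates differences, the word with the highest raw score ("France" here) ends up with most of the weight.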


Self-Attention: Every Word Talks to Every Other Word

The specific type of Attention used in modern AI is called Self-Attention β€” because the sentence attends to itself.

Every word in the input generates three things:

  • A Query: "what am I looking for?"

  • A Key: "what do I contain?"

  • A Value: "what information do I pass forward?"

Then every word's Query is compared against every word's Key (its own included). The similarity scores are normalized into attention weights. The final output for each word is a weighted mixture of all the Values.

You don't need to memorize Q, K, V. What you need to understand is the result:

After Self-Attention, every word's representation has been enriched by the context of every other word in the sentence.

The word "bank" in "money in the bank" now knows it's near "money." The word "bank" in "bank of the river" now knows it's near "river." They started as the same token. After Attention, they're fundamentally different, and correctly so.
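
The Query, Key, Value recipe can be sketched in plain Python as scaled dot-product self-attention. This is a minimal illustration, not production code: the three token vectors are made up, and identity matrices stand in for the weight matrices `w_q`, `w_k`, and `w_v`, which a real model learns during training.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    biggest = max(row)
    exps = [math.exp(v - biggest) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over token vectors x (one row per token)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Compare every Query with every Key: scores[i][j] = q_i . k_j / sqrt(d_k)
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted mixture of all the Value vectors.
    return matmul(weights, v), weights


# Three tokens with 2-dimensional embeddings (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical weights; real ones are learned

output, weights = self_attention(x, identity, identity, identity)
for row in weights:
    print([round(w, 2) for w in row])
```

Each row of `weights` shows how much one token attends to every token in the sentence, and each row of `output` is that token's new, context-enriched vector.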


Multi-Head Attention: Multiple Perspectives at Once

One Attention operation is powerful. But modern transformers run multiple Attention operations in parallel, which is called Multi-Head Attention.

Each "head" looks at the sentence from a different angle:

Head 1 → looks for grammatical relationships
          (subject → verb → object)

Head 2 → looks for semantic similarity
          (words with related meanings)

Head 3 → looks for coreference
          (which "he/she/it" refers to what)

Head 4 → looks for positional patterns
          (what typically follows what)

... (large models use many more; GPT-3 has 96 heads per layer)

These roles are illustrative. Nobody assigns them: each head learns its own focus during training.

All heads run simultaneously.
Their outputs are combined.
Result: a rich, multi-dimensional understanding of the text.

Think of it like a panel of expert readers, each focusing on something different, then pooling their insights.
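
A simplified sketch of the idea: split each token vector into slices, run Attention on each slice independently, then join the results. The big assumption here is that the slices are used directly as Query, Key, and Value. Real models apply learned projection matrices per head and a final output projection, so treat `multi_head_attention` as an illustration of the split-and-combine pattern only.

```python
import math


def softmax(row):
    biggest = max(row)
    exps = [math.exp(v - biggest) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, k, v):
    """Single-head scaled dot-product attention on lists of row vectors."""
    scale = math.sqrt(len(k[0]))
    out = []
    for q_i in q:
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / scale for k_j in k]
        weights = softmax(scores)
        # Weighted mixture of the Value vectors.
        out.append([sum(w * v_j[d] for w, v_j in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


def multi_head_attention(x, num_heads):
    """Split each vector into num_heads slices, attend per slice, concatenate.

    Simplification: no learned projections (see the note above).
    """
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        piece = [row[h * d_head:(h + 1) * d_head] for row in x]
        head_outputs.append(attend(piece, piece, piece))
    # Concatenate the heads back into one vector per token.
    return [sum((head[i] for head in head_outputs), []) for i in range(len(x))]


# Three tokens with 4-dimensional embeddings (made-up numbers), two heads.
x = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
combined = multi_head_attention(x, num_heads=2)
print(combined)
```

Each head only sees its own slice of the vector, which is what lets different heads pick up on different relationships.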


The Transformer Architecture: Putting It All Together

In 2017, a team of researchers at Google published a paper titled "Attention Is All You Need."

The title was a provocation. They were claiming that you didn't need the sequential RNN architecture that had dominated language AI. Attention alone was sufficient and, in fact, superior.

They were right. The architecture they introduced, the Transformer, became the foundation for every major language model built since: BERT, GPT-2, GPT-3, GPT-4, Claude, Gemini, LLaMA, all of them.

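Here's what a Transformer looks like inside: the input text is split into tokens, each token becomes a vector (an embedding, plus information about its position), those vectors pass through a Transformer Block (Self-Attention followed by a small feedforward network), and the final layer turns the result into a probability for every possible next token.

The key part is that the Transformer Block repeats N times. Each block refines the model's understanding of the text: roughly speaking, the first few blocks catch basic patterns, the middle blocks capture grammar and semantics, and the later blocks deal with high-level meaning and context.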
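
The repeating block can be sketched as two steps, each added back onto its input: Attention, then a small feedforward network. The code below is a stripped-down illustration under several assumptions: Attention uses the token vectors directly as Query, Key, and Value, layer normalization is left out, every block reuses the same made-up weights `w1` and `w2`, and the names `transformer_block` and `run_stack` are hypothetical. Real models learn separate weights for every block.

```python
import math


def softmax(row):
    biggest = max(row)
    exps = [math.exp(v - biggest) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x):
    """Self-attention using x directly as Q, K and V (no learned projections)."""
    scale = math.sqrt(len(x[0]))
    out = []
    for q in x:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in x])
        out.append([sum(w * v[d] for w, v in zip(weights, x)) for d in range(len(q))])
    return out


def feed_forward(vec, w1, w2):
    """Two-layer network applied to one token: linear, ReLU, linear."""
    hidden = [max(0.0, sum(v * w for v, w in zip(vec, col))) for col in zip(*w1)]
    return [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w2)]


def add(a, b):
    """Element-wise sum of two lists of vectors (the residual connection)."""
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(a, b)]


def transformer_block(x, w1, w2):
    x = add(x, attention(x))                                 # attention + residual
    return add(x, [feed_forward(row, w1, w2) for row in x])  # feedforward + residual


def run_stack(x, blocks):
    """Pass the token vectors through each block in turn."""
    for w1, w2 in blocks:
        x = transformer_block(x, w1, w2)
    return x


w1 = [[0.5, -0.5, 0.1], [0.2, 0.3, -0.1]]      # 2 inputs -> 3 hidden (made up)
w2 = [[0.1, 0.2], [-0.3, 0.1], [0.2, -0.2]]    # 3 hidden -> 2 outputs (made up)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

refined = run_stack(x, blocks=[(w1, w2)] * 4)  # N = 4 blocks
print(refined)
```

The shape of the data never changes: three token vectors go in, three come out, and each pass through a block updates them a little more.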


How the Transformer Generates Text

Here's something that surprises most people:

ChatGPT generates text one token at a time (a token is a word or a piece of a word). It predicts the next token, appends it to the input, predicts the next token again, and repeats.

Prompt: "Write a story."

Step 1: "Write a story." → predicts → "Once"
Step 2: "Write a story. Once" → predicts → "upon"
Step 3: "Write a story. Once upon" → predicts → "a"
Step 4: "Write a story. Once upon a" → predicts → "time"
Step 5: "Write a story. Once upon a time" → predicts → "there"
...

Each prediction runs the entire Transformer (all the Attention heads, all the layers) to decide what single token comes next. Then that token gets added to the context and the whole process runs again.

It sounds slow. But with modern GPUs performing trillions of operations per second, it happens faster than you can read.
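
The generation loop itself is short. In the sketch below, a lookup table (`TOY_MODEL`, invented for this example) stands in for the Transformer, and `predict_next` only looks at the last token. A real model looks at the whole context and samples from a probability distribution over tens of thousands of tokens, but the loop around it has the same shape.

```python
# Hypothetical stand-in for a trained model: maps the last token to the next one.
TOY_MODEL = {
    "story.": "Once",
    "Once": "upon",
    "upon": "a",
    "a": "time",
    "time": "<end>",
}


def predict_next(tokens):
    """Return the next token for the given context (toy version)."""
    return TOY_MODEL.get(tokens[-1], "<end>")


def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)  # one full model run per token
        if next_token == "<end>":
            break
        tokens.append(next_token)          # feed the output back in as input
    return tokens


result = generate(["Write", "a", "story."])
print(" ".join(result))  # Write a story. Once upon a time
```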


From Spam Detector to ChatGPT: The Scale Bridge

Here's the moment where everything from all three parts comes together.

Our spam detector and ChatGPT are built on the same foundation:

  • Layers of neurons

  • Weights that were trained by gradient descent

  • A training loop that minimized a loss function

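The differences are a matter of size: our spam detector reads 8 features and has a small number of weights, while a model like GPT-3 has 175 billion.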

Same idea. Different universe of scale.

And that scale, combined with Attention, is what produces something that can write code, explain concepts, translate languages, and hold a conversation.


But ChatGPT Isn't Just a Transformer

There's one more step. After the base Transformer is trained on predicting the next token across the internet, it's not yet ready to be a helpful assistant. It would just complete text β€” sometimes usefully, sometimes not.

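To make it helpful, safe, and conversational, it goes through a second training phase called RLHF (Reinforcement Learning from Human Feedback). Human reviewers rank the model's answers, a separate reward model learns those preferences, and the base model is then tuned to produce answers the reward model scores highly.

ChatGPT, Claude, and most other commercial AI assistants have gone through a version of this process. The base model provides the capability. RLHF shapes the personality, safety, and helpfulness.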


The Three Branches of AI: Where Do We Land?

In the wider story of AI, there are three philosophical traditions:

Symbolic AI (1950s–80s): intelligence through explicit rules. IF fever AND cough THEN flu. Powerful for structured problems. Brittle in the real world.

Statistical AI (1980s–2010s): intelligence through pattern-finding in data. Learns from examples. The foundation of machine learning.

Connectionist AI (2010s–now): intelligence through brain-inspired neural networks. Deep learning. Transformers. This is where we live today.

ChatGPT and Claude are primarily Connectionist (Transformer neural networks at the core), but they're also Statistical (trained by predicting token probabilities) and carry traces of the Symbolic tradition (explicit safety rules and guidelines that shape behavior).

Modern AI isn't one thing. It's a synthesis of 70 years of ideas, all converging.


The Complete Picture

You've made it through all three parts. Let's zoom out and see everything at once:

The spam detector you understood in Part 1? It's the building block. ChatGPT is what happens when you scale that idea up to a network with 175 billion weights (the published size of GPT-3), trained on more text than any human could read in a thousand lifetimes, with Attention letting every token talk to every other token simultaneously.

That's the complete picture.


What You Learned Across This Series

Part 1 (Structure): A neural network is layers of neurons doing simple math. Each neuron: input × weight + bias → activation → output. Layers find increasingly complex patterns.

Part 2 (Learning): Training is a loop: forward pass → loss → backpropagation → update weights → repeat. The output is weights saved to a file. The weights are the intelligence.

Part 3 (Language): Attention lets every token look at every other token simultaneously. Transformers stack Attention with feedforward networks into deep stacks. Scale turns this into something that feels like understanding.

The thread running through all three: Same neuron. Same training loop. Same weights. Just scale, plus one revolutionary idea about how tokens should relate to each other.
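
As a compact reminder of Parts 1 and 2, here is a single neuron trained with the loop described above. It is a sketch under stated assumptions: the four data points, the sigmoid activation, the squared-error loss, and the learning rate of 0.5 are all chosen for illustration and are not taken from the earlier parts.

```python
import math


def neuron(x, weight, bias):
    """input × weight + bias → activation (sigmoid) → output."""
    return 1.0 / (1.0 + math.exp(-(x * weight + bias)))


# Toy data (made up): positive inputs should give 1, negative inputs 0.
data = [(-2.0, 0.0), (-1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
weight, bias, learning_rate = 0.0, 0.0, 0.5


def loss(weight, bias):
    """Mean squared error over the toy data."""
    return sum((neuron(x, weight, bias) - y) ** 2 for x, y in data) / len(data)


initial_loss = loss(weight, bias)

for _ in range(200):
    grad_w = grad_b = 0.0
    for x, y in data:
        out = neuron(x, weight, bias)            # forward pass
        delta = 2 * (out - y) * out * (1 - out)  # backpropagation (chain rule)
        grad_w += delta * x / len(data)
        grad_b += delta / len(data)
    weight -= learning_rate * grad_w             # update weights
    bias -= learning_rate * grad_b

final_loss = loss(weight, bias)
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

The loss falls as the loop runs, and the learned `weight` and `bias` are the only things the neuron keeps from training.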


Where to Go From Here

You now understand AI better than most people who use it professionally. Here's what to explore next, in order of logical progression:

If you want to go deeper into concepts:

  • Read the original "Attention Is All You Need" paper (it's surprisingly readable)

  • Explore BERT vs GPT: why some models only use the Encoder and others only the Decoder

  • Learn about fine-tuning: how you can take a pretrained model and adapt it to your specific task

If you want to go practical:

  • Try the Hugging Face Transformers library: you can run real models in 5 lines of Python

  • Experiment with prompt engineering: understanding the model helps you prompt it better

  • Build a simple text classifier using a pretrained BERT model

If you want to go further in AI theory:

  • Explore Diffusion Models: the architecture behind image generators like Midjourney

  • Learn about Reinforcement Learning from Human Feedback (RLHF) in depth

  • Study the emerging field of multimodal models: AI that understands text, images, and audio together


Thank you for reading all three parts. If this series helped you understand AI more clearly than anything else you've read, share it. The more people who understand how this technology actually works, the better conversations we'll all have about it.

And if you have questions, drop them in the comments. I read every one.


Series navigation:

  • Part 1 → What Is a Neural Network? → [link]

  • Part 2 → How AI Trains Itself → [link]

  • Part 3 ← You are here