Monday, May 25, 2026

How Transformers Work

A Complete and Corrected Guide: from "Attention Is All You Need" to ChatGPT

Based on the original Google paper (2017) and The Illustrated Transformer

1. Introduction: The Silent Revolution of 2017

In 2017, a team of researchers, most of them at Google, published a paper titled Attention Is All You Need. Behind that title lay an idea that reshaped artificial intelligence: instead of processing language one word at a time, a model could attend to all the words in a sequence at once and learn which ones are relevant to each other.

Shortly after, Jay Alammar published The Illustrated Transformer, a visual explanation that democratized understanding of this architecture and became essential reading for students, engineers, and curious minds around the world.

This document explains both ideas accurately and completely, correcting common simplifications and adding concepts that are frequently omitted: tokenization, positional encoding, training, RLHF, and real-world limitations.

 

2. The Original Problem: Language Is Context

Imagine trying to teach English to a machine. You show it these two sentences:

 "The dog bit the mailman"

"The mailman bit the dog" 


The words are identical; only their order changes. The meaning is reversed. Order matters. Context matters. For decades, computers struggled enormously with this.

 

3. Before the Transformer: RNNs and LSTMs

Before 2017, language models relied on architectures called RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory). Their logic was simple: the model read a sentence word by word, in order, maintaining a "memory" of previous context.

The Short Memory Problem

For long sentences, the model tended to forget important information from the beginning. Consider this sentence:

 "The cat that was under the dining room table we bought in Lima last year knocked over the glass."

 

By the time it reached "knocked over the glass", the model's memory of the main subject, "the cat", had faded. LSTMs eased this problem but did not eliminate it. In addition, these models processed tokens one after another, so the computation within a sequence could not be parallelized, which made training slow on modern hardware.

 

 

4. The Big Idea: The Transformer

The researchers proposed a radical question: what if, instead of reading word by word, the system could look at the entire sentence simultaneously and decide which relationships matter?

That is the Transformer. It is not an incremental improvement; it is a complete paradigm shift.

 

Core principle of the Transformer:

Look at all words at once and calculate how relevant each one is to understanding the others.

5. Tokenization: Machines Don't Read Words

⚠ Frequently omitted concept

Models do not process whole words, but fragments called tokens. This difference has important practical consequences.

A token is not necessarily a word. It can be a whole word, a word fragment (such as a prefix or suffix), a number, a punctuation mark, or even a single character. The tokenizer used by GPT-4, for example, has a vocabulary of roughly 100,000 tokens.

Tokenization Examples


 

 

 

 

 

 

 

This explains why models sometimes make mistakes on tasks that seem simple (like counting letters or performing arithmetic): they do not "see" individual characters, but tokens that may group several of them together.
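
To make the idea concrete, here is a minimal sketch of a greedy longest-match tokenizer. It is not how GPT-4 tokenizes text: the vocabulary `TOY_VOCAB` and the function `tokenize` are invented for illustration, and real tokenizers learn their vocabulary from data (for example with byte-pair encoding).

```python
# Toy greedy longest-match tokenizer. The vocabulary is made up for
# illustration; real models learn theirs from large amounts of text.
TOY_VOCAB = {"straw", "berry", "un", "believ", "able"}

def tokenize(text, vocab=TOY_VOCAB):
    """Split text into the longest vocabulary pieces, scanning left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary piece matches: emit the single character.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("strawberry"))    # ['straw', 'berry']
print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
```

With this toy vocabulary the model would receive two tokens for "strawberry", not ten letters, which is why a question like "how many r's are in this word?" is harder than it looks.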

 

6. Embeddings: Converting Words into Numbers

Before the Transformer can process any token, it must be converted into a numerical vector called an embedding. An embedding is a list of hundreds or thousands of numbers that represents the "meaning" of a token in a mathematical space.

The Map Analogy

Imagine each word occupying a position on a multidimensional map. Words with similar meanings end up close together. For example:

 King - Man + Woman ≈ Queen

This arithmetic holds approximately for well-trained embeddings, because they capture semantic relationships in language as directions in the space.

Embeddings are learned during training. The model adjusts these vectors millions of times until they accurately represent linguistic relationships.
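
The map analogy can be sketched with hand-made vectors. Everything in this example is an assumption made for illustration: the three dimensions, the numbers in `EMB`, and the helper names. Real embeddings are learned, have far more dimensions, and only approximately satisfy this kind of arithmetic.

```python
import math

# Hand-made 3-number vectors, purely illustrative.
# Assumed toy dimensions: [royalty, maleness, femaleness]
EMB = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(vec, exclude=()):
    """Return the word whose embedding is most similar to vec."""
    candidates = [w for w in EMB if w not in exclude]
    return max(candidates, key=lambda w: cosine(vec, EMB[w]))

# king - man + woman, computed component by component
target = [k - m + w for k, m, w in zip(EMB["king"], EMB["man"], EMB["woman"])]
print(nearest(target, exclude=("king", "man", "woman")))  # queen
```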

 

7. Positional Encoding: How Does Order Work if Everything Is Seen at Once?

❌ Common misconception

If the Transformer sees all words at the same time, how does it distinguish "dog bit mailman" from "mailman bit dog"? Without a position signal, it simply couldn't.

The solution is elegant: before processing embeddings, the model adds a mathematical signal that encodes the position of each token in the sequence. This signal is called Positional Encoding.

The Numbered Seats Analogy

Imagine a theater where everyone enters at the same time. Without seat numbers, chaos would ensue. Positional Encoding is the number on each seat: it lets the model know that the word at position 3 is different from the same word at position 7, even if they are identical.

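The authors of the original paper used trigonometric functions (sine and cosine) to generate these position signals. Many later models, such as BERT and GPT-2, learn position embeddings during training instead, and more recent ones often use relative schemes such as rotary position embeddings (RoPE).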
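
The sinusoidal scheme from the original paper can be written in a few lines. This is a sketch in plain Python: the function name `positional_encoding` and the 4-number example embedding are assumptions for illustration. The formula itself is the paper's: even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the cosine of the same angle.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal position signal for one position (original paper's formula)."""
    pe = []
    # i walks over the even dimension indices 0, 2, 4, ...
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))      # even dimension
        if i + 1 < d_model:
            pe.append(math.cos(angle))  # following odd dimension
    return pe

# The same (hypothetical) token embedding placed at two different positions.
embedding = [0.5, -0.2, 0.1, 0.7]
at_pos_3 = [e + p for e, p in zip(embedding, positional_encoding(3, 4))]
at_pos_7 = [e + p for e, p in zip(embedding, positional_encoding(7, 4))]

print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(at_pos_3 != at_pos_7)       # True
```

Because the signal is added to the embedding, the same word at position 3 and at position 7 reaches the attention layers as two different vectors.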

 

8. Self-Attention: The Heart of the Transformer

Self-Attention is the mechanism that allows each token to "look" at all others and decide how much attention to give them. It is the central concept of the paper.

How Does It Work Mathematically?

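For each token, the model generates three vectors from its embedding:

  • Query (Q): what this token is looking for in the others

  • Key (K): what this token offers for others to match against

  • Value (V): the information this token passes on when it is attended to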

The model calculates how compatible each token's Query is with the Keys of all others. The result is an attention weight: how much each token should "listen" to each other. It then uses those weights to combine the Values and produce a context-enriched representation.
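
The paper's formula for this step is softmax(Q K^T / sqrt(d_k)) V, where d_k is the length of each Key vector. Below is a minimal sketch of it in plain Python. The Query, Key, and Value numbers are hand-picked assumptions; in a real model they come from learned linear projections of the embeddings.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compatibility of this Query with every Key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the Value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three tokens with hand-picked 2-number Q, K and V vectors (illustrative only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

out, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token "listens" to every token in the sequence, and each row sums to 1.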

Concrete Example

Sentence: "Mary went to the bank because she needed money."

When processing the word "bank", the model assigns:

  • High attention → "money", "needed"

  • Low attention → "went", "to"

 

Because "bank" attends strongly to "money", the model can represent it as a financial institution and not a riverbank. The weights in this example are illustrative; in a trained model they are learned.

 

9. Multi-Head Attention: Multiple Perspectives

The Transformer does not use a single attention mechanism: it uses several in parallel, called "heads". Each head learns to pay attention to different aspects of language.

 


 

 

 

 

  

Each head produces its own contextualized representation. At the end, all are concatenated and passed through a final linear projection, yielding a single representation that combines these perspectives.
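
A rough sketch of the split, attend, and concatenate pattern follows. It makes two simplifying assumptions that are labeled in the code: the heads are formed by slicing the input vectors directly (a real model first applies learned projections), and the final learned output projection is omitted. The function names are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(Q, K, V):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                           for k in K])
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def split_heads(rows, num_heads):
    """Cut each vector into num_heads equal slices: one list of rows per head.

    Assumption: slicing stands in for the learned per-head projections.
    """
    d_head = len(rows[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in rows]
            for h in range(num_heads)]

def multi_head_attention(Q, K, V, num_heads):
    heads = [attend(q, k, v)
             for q, k, v in zip(split_heads(Q, num_heads),
                                split_heads(K, num_heads),
                                split_heads(V, num_heads))]
    # Concatenate the heads' outputs token by token.
    # A real model would apply a learned output projection here (omitted).
    return [[x for head in heads for x in head[t]] for t in range(len(Q))]

# Three tokens, each a 4-number vector (made-up values), split into 2 heads.
X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(X, X, X, num_heads=2)
print(len(out), len(out[0]))  # 3 4
```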

 

10. Encoder and Decoder: Different Architectures for Different Tasks

⚠ Important correction

The Encoder-Decoder architecture is not "the" architecture of all modern models. It is one of three variants. GPT uses Decoder only; BERT uses Encoder only; T5 uses both.

The Encoder: Understanding

The Encoder processes the entire input sentence and builds a rich representation of its meaning. Each layer allows tokens to "enrich" themselves with information from others. It is ideal for tasks requiring text comprehension: classification, semantic search, sentiment analysis.

The Decoder: Generation

The Decoder generates text token by token. It has one important constraint: each position can only attend to itself and to earlier positions (causal or masked attention). This prevents the model from "cheating" by looking at future tokens during training.
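
Causal masking is usually implemented by setting the scores for future positions to negative infinity before the softmax, so their weights become exactly zero. A minimal sketch, with an illustrative function name and made-up scores:

```python
import math

def causal_attention_weights(scores):
    """Mask a square score matrix causally, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may look at positions 0..i; later ones get -inf.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Equal raw scores for a 3-token sequence (made-up numbers).
scores = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]

for row in causal_attention_weights(scores):
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

The first token can only see itself, the second sees two tokens, and only the last sees the whole sequence.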


 

 

 

 

GPT (and by extension ChatGPT) uses a modified Decoder: the "cross-attention" layer that would connect it to an Encoder is removed (because there is no Encoder). What remains is a pure autoregressive Decoder, trained to predict the next token.

 

11. Training: How the Model Learns

⚠ Frequently omitted aspect

The architecture alone does not explain the model's intelligence. Training is what makes it useful. A Transformer without training is an empty shell.

Pre-training: Predicting the Next Token

GPT models are pre-trained with a simple but powerful objective: given a text, predict the next token. The model processes trillions of tokens of text (books, articles, code, web pages) and adjusts its parameters to minimize prediction error.
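
"Prediction error" here is typically measured with cross-entropy: the negative logarithm of the probability the model gave to the token that actually came next. The sketch below uses a hypothetical three-token vocabulary and made-up scores (logits); only the loss formula is standard.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy: -log of the probability given to the true next token."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical vocabulary for the context "The cat sat on the"
vocab = ["mat", "moon", "car"]

confident_right = next_token_loss([5.0, 0.0, 0.0], target_index=0)
unsure = next_token_loss([0.0, 0.0, 0.0], target_index=0)
confident_wrong = next_token_loss([5.0, 0.0, 0.0], target_index=1)

print(confident_right < unsure < confident_wrong)  # True
```

Training nudges the parameters so that this loss, averaged over an enormous number of positions, goes down.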

This process produces a base model that has "absorbed" an enormous amount of linguistic and factual knowledge. However, this base model does not know how to follow instructions, is not useful as an assistant, and may generate problematic content.

Supervised Fine-tuning

After pre-training, the model receives examples of ideal conversations (written by humans): instruction -> quality response. The model learns to imitate this pattern.

 

12. RLHF: The Difference Between a Base Model and ChatGPT

❌ Critical omission in most popular explanations

RLHF (Reinforcement Learning from Human Feedback) is what transforms a text predictor into a useful, aligned, and relatively safe assistant. Without this phase, ChatGPT as we know it would not exist.

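RLHF is commonly described as a three-step process applied after pre-training:

  • Step 1, supervised fine-tuning: the model is trained on human-written example responses.

  • Step 2, reward model: humans rank several model responses to the same prompt, and a separate model is trained to predict those rankings.

  • Step 3, reinforcement learning: the language model is optimized (typically with PPO) to produce responses that the reward model scores highly.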

The result is a model that not only predicts likely text, but tends to generate responses that people rate as useful and honest. More recent techniques such as DPO (Direct Preference Optimization) can achieve comparable results with a simpler training procedure.
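
The reward model used in RLHF is commonly trained on pairs of responses that humans have ranked. One widely used loss for this is -log(sigmoid(r_chosen - r_rejected)), sketched below. The function name and the reward numbers are assumptions for illustration, and actual systems differ in the details.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-diff)).

    Small when the human-preferred response gets the higher reward.
    """
    diff = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-diff))

# Made-up reward scores for a preferred and a rejected response.
agrees_with_humans = pairwise_preference_loss(2.0, 0.0)
no_opinion = pairwise_preference_loss(0.0, 0.0)
disagrees_with_humans = pairwise_preference_loss(0.0, 2.0)

print(agrees_with_humans < no_opinion < disagrees_with_humans)  # True
```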

 

13. Why Does ChatGPT "Seem" to Think?

ChatGPT does not think like a human. What it does is assign a probability to every possible next token, conditioned on all previous context, and then pick one, usually by sampling. But this prediction operates over extraordinarily rich representations of language, learned from trillions of tokens of human-generated text.
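
The last step of each prediction can be sketched as follows: raw scores (logits) become probabilities, and one token is drawn from them. The vocabulary, the scores, and the function names are made up for illustration. The temperature parameter is a common way to make sampling more or less adventurous, not something specific to ChatGPT.

```python
import math
import random

def next_token_probs(logits, temperature=1.0):
    """Turn raw scores (logits) into a probability distribution over tokens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(vocab, logits, temperature=1.0, rng=random):
    """Draw one token at random, weighted by its probability."""
    probs = next_token_probs(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical scores after the context "The capital of France is"
vocab = ["Paris", "London", "banana"]
logits = [4.0, 2.0, -1.0]

print([round(p, 3) for p in next_token_probs(logits)])
print(sample_next_token(vocab, logits, temperature=0.7, rng=random.Random(0)))
```

Lower temperatures concentrate probability on the top-scoring token, while higher ones spread it out. Generating a full answer means repeating this step, appending each chosen token to the context.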

By learning language patterns, the model indirectly acquires knowledge about history, science, programming, logic, emotions, and communication styles. It is like someone who has read an enormous library and can answer questions by extracting and recombining patterns from that knowledge.

The Fundamental Limitation

The model has no real understanding, no beliefs of its own, and no experiences. When it generates a convincing response on a topic, it is recombining statistical patterns from language, not reasoning from first principles. This explains its errors: confabulating facts, being inconsistent across conversations, or failing at reasoning tasks that require strict logical steps.

 

14. Real Limitations of Transformers

Popular explanations tend to ignore these limitations:

  • Hallucinations: the model generates fluent text even when the content is incorrect. It does not reliably distinguish between what it knows and what it does not know.

  • Inherited bias: if training data contains biases, the model reproduces or amplifies them.

  • Context window: the Transformer can only process a limited number of tokens at once. Recent models have raised this limit enormously, in some cases to around a million tokens or more, but the cost of standard self-attention grows quadratically with sequence length.

  • Computational cost: training a large model consumes massive amounts of energy and specialized hardware. Only organizations with significant resources can do it.

  • Opacity ("black box"): although we can view attention weights, explaining why the model made a specific decision remains an open research problem.

  • Lack of persistent memory: without additional tools, the model does not remember previous conversations. Each session starts from scratch.

 

15. Timeline of the Transformer Revolution


 

 

 

 

 

16. Beyond Text: Transformers Everywhere

The Transformer architecture is no longer limited to language. The attention mechanism works for finding relationships in any type of sequential or structured information:






 

17. Summary: The 10 Key Concepts


 

 

 

 

 

Conclusion

The Transformer was to artificial intelligence what the combustion engine was to the Industrial Revolution: not an incremental improvement, but a complete paradigm shift.

The central idea is almost philosophical in its simplicity: to understand something, you need to know what to pay attention to. Humans do this constantly. Now machines do too, though in a fundamentally different way from how we do it.

Understanding this architecture in depth, including its hidden pieces (tokenization, positional encoding, RLHF) and its real limitations, is essential for using these tools critically, detecting their errors, and anticipating their possibilities.

 

 


