designpattern.site

Transformer Architecture & Attention

How transformers work — self-attention, multi-head attention, positional encoding, and the encoder-decoder architecture.

Overview

Before 2017, the dominant approach to sequence modeling was recurrent neural networks (RNNs) and their variants — LSTMs and GRUs. They processed tokens one at a time, carrying a hidden state from left to right. This sequential nature created two fundamental problems: training could not be parallelized across a sequence, and long-range dependencies were hard to learn because information had to travel through many steps.

The Transformer architecture, introduced in Vaswani et al.'s "Attention Is All You Need" (2017), discarded recurrence entirely. Instead, it lets every token attend directly to every other token in the sequence — in parallel. This single change unlocked the large-scale pretraining that produced BERT, GPT, T5, and every major language model since.

Understanding the Transformer is the foundation for understanding modern LLMs. Once you see how self-attention works, everything else — fine-tuning, RAG, agents — becomes easier to reason about.

The Core Idea: Attention

Consider the sentence: "The animal didn't cross the street because it was too tired." Your brain instantly resolves that "it" refers to "animal," not "street." You do this by focusing on certain words more than others given the context of the word you are trying to understand. This is attention.

Self-attention is the mathematical version of that process. For each token in a sequence, the model asks: given this token, how much should I focus on every other token when computing its representation? Tokens that are contextually relevant get high attention weights; unrelated tokens get low weights.

The result is that each token's output embedding is a weighted combination of all other tokens' information — a context-aware representation rather than a context-free lookup.

Self-Attention Mechanism

The mechanism uses three learned linear projections of the input: Query (Q), Key (K), and Value (V). A useful analogy is a library search:

  • Query — what you are looking for ("books about distributed systems")
  • Keys — the titles and tags on every book in the library
  • Values — the actual content inside each book

You match your query against all keys to get relevance scores, then retrieve a blend of the values weighted by those scores.

Mathematically, for an input matrix X, you project into Q, K, V matrices and compute:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

The division by √d_k (the square root of the key dimension) prevents the dot products from growing so large that the softmax saturates into near-zero gradients.
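A quick numerical sketch (NumPy, with an illustrative d_k chosen here) shows what the scaling buys:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512   # illustrative key dimension
n = 1000    # number of query/key pairs to sample

# Dot products of unit-variance random vectors have variance ~ d_k,
# so raw scores grow with the key dimension, pushing softmax toward
# a near one-hot distribution with vanishing gradients.
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))
raw = (q * k).sum(axis=-1)
scaled = raw / np.sqrt(d_k)

print("raw std:", raw.std())        # close to sqrt(512) ≈ 22.6
print("scaled std:", scaled.std())  # close to 1.0
```

Dividing by √d_k keeps the score distribution at roughly unit variance regardless of the key dimension, so the softmax stays in its well-behaved range.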


The output for each token is a weighted sum over all value vectors, where the weights capture how relevant every other token is. Tokens the model deems irrelevant contribute almost nothing; tokens with high relevance contribute strongly.

Self-attention is not a lookup — it is a soft blend

A common misconception is that attention "selects" a single token. It does not. The softmax produces a probability distribution over all tokens, and the output is always a weighted combination of all values. Even the token with the highest attention weight shares the output with every other token — just proportionally less.
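The point is easy to verify directly. With illustrative scores where one token clearly dominates, the others still receive nonzero weight:

```python
import numpy as np

scores = np.array([4.0, 1.0, 0.0, -2.0])      # illustrative raw attention scores
weights = np.exp(scores) / np.exp(scores).sum()

print(weights)        # every weight is strictly positive
print(weights.sum())  # sums to 1 (up to float rounding): a distribution, not a selection
```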

Multi-Head Attention

A single attention head computes one set of Q, K, V projections and produces one blended output. But a sentence has many types of relationships simultaneously: syntactic dependencies, semantic similarity, coreference, positional proximity. A single head cannot capture all of them at once.

Multi-head attention runs h independent attention heads in parallel, each with its own learned projections. The outputs are concatenated and projected back to the model dimension.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O
where head_i = Attention(Q W_Qi, K W_Ki, V W_Vi)
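A minimal single-sequence sketch of these formulas in NumPy, with illustrative dimensions and random matrices standing in for the learned projections:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, params, num_heads):
    """Multi-head self-attention for one sequence X of shape (seq_len, d_model).

    params holds the projections W_Q, W_K, W_V, W_O; in a real model these
    are learned, here they are random for illustration.
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ params["W_Q"], X @ params["W_K"], X @ params["W_V"]
    heads = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)      # this head's slice of dims
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, sl])      # (seq_len, d_head)
    # Concatenate head outputs and project back to the model dimension
    return np.concatenate(heads, axis=-1) @ params["W_O"]

rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 16, 4, 5
params = {name: rng.standard_normal((d_model, d_model)) * 0.1
          for name in ("W_Q", "W_K", "W_V", "W_O")}
X = rng.standard_normal((seq_len, d_model))
out = multi_head_attention(X, params, num_heads)
print(out.shape)  # (5, 16)
```

Each head sees only its own d_model / h slice of the projected dimensions, which is why running h heads costs about the same as one full-width head.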


In practice, probing studies have found that different heads do specialize. Some heads track syntactic subject-verb agreement; others track coreference chains; others attend primarily to nearby tokens. This emergent specialization is not explicitly trained — it arises from the loss objective.

Positional Encoding

Self-attention is inherently order-agnostic. If you shuffle the tokens in a sentence, the attention computation produces the same results (just with shuffled outputs). That is a problem: word order carries meaning.

The Transformer injects positional information by adding a positional encoding vector to each token embedding before the first layer. The original paper uses sinusoidal encodings:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Each position gets a unique pattern of sine and cosine values across the embedding dimensions. The model can learn to read relative distances from these patterns because PE(pos + k) can be expressed as a linear function of PE(pos).
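The encoding is straightforward to compute; a NumPy sketch of the formulas above:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # even dims: 0, 2, ..., d_model-2
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions get cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=64)
print(pe.shape)    # (50, 64)
print(pe[0, :4])   # position 0 gives [sin 0, cos 0, ...] = [0, 1, 0, 1]
```

Low dimensions oscillate quickly with position while high dimensions vary slowly, so each position receives a distinct multi-frequency fingerprint.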

Modern models typically use learned positional embeddings (as in GPT-2 and GPT-3) or Rotary Position Embedding (RoPE, used by the LLaMA family) instead of fixed sinusoids, but the goal is the same: give the model a way to distinguish token order.

The Full Architecture

A full Transformer consists of an encoder stack and a decoder stack, each made of N identical layers. The encoder reads the input sequence; the decoder generates the output sequence.


Each encoder layer has two sub-layers: multi-head self-attention, then a position-wise feed-forward network (FFN). Each sub-layer is wrapped in a residual connection followed by Layer Normalization (Add & Norm). The decoder adds a third sub-layer: cross-attention over the encoder output, which is how the decoder "reads" the encoded input.

The FFN is two linear transformations with a ReLU (or GELU) activation between them. Its hidden dimension is typically 4x the model dimension. Research suggests the FFN layers act as a kind of key-value memory, storing factual associations learned during pretraining.

Residual connections pass the input of each sub-layer directly to its output via addition. This gives gradients a direct path back through the network during backpropagation and is a primary reason Transformers can be trained to hundreds of layers deep.
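Putting these pieces together, here is a minimal sketch of one post-norm encoder layer (NumPy, random illustrative weights; the learned LayerNorm scale/shift and biases are omitted for brevity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize across the feature dimension (scale/shift params omitted)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_layer(X, p):
    """One post-norm encoder layer: attention -> Add & Norm -> FFN -> Add & Norm."""
    Q, K, V = X @ p["W_Q"], X @ p["W_K"], X @ p["W_V"]
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    X = layer_norm(X + attn)                        # residual + LayerNorm
    ffn = np.maximum(0, X @ p["W_1"]) @ p["W_2"]    # ReLU FFN, 4x hidden dim
    return layer_norm(X + ffn)                      # residual + LayerNorm

rng = np.random.default_rng(1)
d_model, seq_len = 16, 6
shapes = {"W_Q": (d_model, d_model), "W_K": (d_model, d_model),
          "W_V": (d_model, d_model),
          "W_1": (d_model, 4 * d_model), "W_2": (4 * d_model, d_model)}
p = {name: rng.standard_normal(s) * 0.1 for name, s in shapes.items()}
out = encoder_layer(rng.standard_normal((seq_len, d_model)), p)
print(out.shape)  # (6, 16)
```

Note the input shape is preserved end to end, which is what lets N identical layers be stacked.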

Code Examples

The example below implements scaled dot-product attention, the core building block, from scratch in NumPy.

import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable row-wise softmax."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def scaled_dot_product_attention(
    Q: np.ndarray,
    K: np.ndarray,
    V: np.ndarray,
    mask: np.ndarray | None = None,
) -> tuple[np.ndarray, np.ndarray]:
    """
    Scaled dot-product attention.

    Args:
        Q: Query matrix  (seq_len, d_k)
        K: Key matrix    (seq_len, d_k)
        V: Value matrix  (seq_len, d_v)
        mask: Optional boolean mask — True means "ignore this position"

    Returns:
        output:  (seq_len, d_v)
        weights: (seq_len, seq_len)
    """
    d_k = Q.shape[-1]
    # Raw attention scores
    scores = Q @ K.T / np.sqrt(d_k)          # (seq_len, seq_len)

    if mask is not None:
        scores = np.where(mask, -1e9, scores) # masked positions get ~0 weight after softmax

    weights = softmax(scores)                 # (seq_len, seq_len)
    output = weights @ V                      # (seq_len, d_v)
    return output, weights


# --- Demo ---
np.random.seed(42)
seq_len, d_k, d_v = 4, 8, 8

Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_v)

output, weights = scaled_dot_product_attention(Q, K, V)

print("Output shape:", output.shape)          # (4, 8)
print("Attention weights (row sums):", weights.sum(axis=-1))  # [1. 1. 1. 1.]

# Causal (decoder) mask — each token can only attend to itself and earlier tokens
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
output_causal, weights_causal = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print("Causal weights[0]:", weights_causal[0])  # only position 0 has weight

Encoder-only, Decoder-only, and Encoder-Decoder

Not every application needs the full encoder-decoder stack. The field has converged on three architectural families, each suited to different tasks.

Architecture     | Attention style                         | Representative models        | Best for
Encoder-only     | Bidirectional self-attention            | BERT, RoBERTa, DeBERTa       | Classification, NER, embeddings, semantic search
Decoder-only     | Causal (masked) self-attention          | GPT-2, GPT-4, LLaMA, Mistral | Text generation, chat, code completion
Encoder-decoder  | Bidirectional encoder + causal decoder  | T5, BART, mT5                | Translation, summarization, question answering

Encoder-only models read the full input sequence bidirectionally — every token attends to every other token with no masking. This produces rich contextual representations ideal for understanding tasks. You cannot use them to generate text autoregressively.

Decoder-only models use a causal mask so each token can only attend to itself and earlier tokens. This enforces the autoregressive generation property: the model predicts the next token given only the previous ones. GPT and most modern chat models use this architecture.

Encoder-decoder models use an encoder to build a representation of the input, then a decoder that attends to that representation (via cross-attention) while generating output tokens. This is the original Transformer design, and it is still preferred for tasks with a distinct input-output structure like translation.

Why decoder-only models dominate today

Encoder-decoder models were state-of-the-art for many NLP tasks through 2021. The shift toward decoder-only came from two observations: scaling laws favor simple architectures, and a sufficiently large decoder-only model can perform "understanding" tasks via prompting without a separate encoder. GPT-3 demonstrated this in 2020. Since then, most frontier models (GPT-4, Claude, Gemini, LLaMA) are decoder-only.

Key Innovations

Several design decisions in the Transformer combine to make it trainable at scale. Each one solves a specific failure mode of earlier deep networks.

Residual connections add each sub-layer's input directly to its output. Without them, gradients in very deep networks vanish before reaching the early layers. Residuals give the gradient a shortcut path and also let early layers act as identity functions if that is optimal.

Layer Normalization normalizes activations across the feature dimension (not the batch dimension). Applied before or after each sub-layer, it stabilizes training by keeping activation scales bounded as depth increases.

Learned positional embeddings vs. fixed sinusoids — the original paper uses fixed sinusoidal encodings, which generalize to sequence lengths not seen in training. Modern models use learned embeddings (which perform similarly within the training range) or RoPE/ALiBi variants that extrapolate better to longer sequences.

The FFN as a knowledge store — Geva et al. (2021) showed that FFN layers behave like key-value memories. The first linear layer's rows act as "keys" that activate on certain input patterns; the second layer's columns act as "values" that add specific information to the residual stream. This may be where factual knowledge is primarily stored during pretraining.

Parallel training — because self-attention processes all tokens simultaneously (unlike RNNs), the full attention matrix for a sequence can be computed in a single matrix multiplication on a GPU. This parallelism is what makes pretraining on trillions of tokens feasible.

Key Papers & Resources

"Attention Is All You Need" — Vaswani et al., 2017 (arXiv:1706.03762). The original paper. Introduced the Transformer for machine translation. Compact and clearly written; the architecture section (Section 3) is the essential read.

"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" — Devlin et al., 2018 (arXiv:1810.04805). Showed that encoder-only pretraining with masked language modeling produces representations that fine-tune to state-of-the-art performance on many tasks. Defined the modern "pretrain then fine-tune" paradigm.

"Language Models are Unsupervised Multitask Learners" (GPT-2) — Radford et al., 2019 (OpenAI blog). Demonstrated that a large decoder-only language model picks up many tasks from pretraining alone, foreshadowing the few-shot capabilities of GPT-3.

"The Illustrated Transformer" — Jay Alammar (jalammar.github.io). The best visual walkthrough of the Transformer available. Animated diagrams show exactly how Q, K, V interact and how multi-head attention works in practice. Highly recommended as a companion to the paper.

"Transformer Feed-Forward Layers Are Key-Value Memories" — Geva et al., 2021 (arXiv:2012.14913). Provides an interpretability lens on what FFN layers store and retrieve, useful for understanding why scaling increases model knowledge.

Connected Topics

Tokenization & Embeddings — before the Transformer can attend to anything, text must be converted into token IDs and then embedded into vectors. How that conversion works — BPE, WordPiece, vocabulary size tradeoffs — directly affects what the attention mechanism sees.

Fine-tuning & RLHF — the pretrained Transformer is a general-purpose text predictor. Fine-tuning adapts it to specific behaviors. Understanding the base architecture is prerequisite to understanding what fine-tuning changes and what it leaves intact.
