GPT-2 is a decoder-only transformer that predicts the next token in a sequence. That one sentence describes the entire model. Everything else — the 12 attention heads, the causal mask, the weight tying, the LayerNorm placement — is engineering decisions that make it work well and train stably.
This post builds the GPT-2 (124M parameter) architecture from scratch in PyTorch, explains every decision, and trains a scaled-down version (about 10.8M parameters) locally on the TinyShakespeare dataset.

What GPT-2 Actually Is
GPT-2 is a language model: given a sequence of tokens, predict what comes next. Trained on enough text, the model learns grammar, facts, reasoning patterns, and style — because all of those are regularities in next-token prediction.
The architecture is a stack of transformer blocks, each with two sub-layers:
- Causal multi-head self-attention
- Position-wise MLP (feed-forward)
“Causal” means each token can only attend to tokens before it — no peeking at the future.
graph TD
TOK["Token IDs\n(B, T)"] --> EMB
POS["Position IDs\n0 … T-1"] --> PEMB
subgraph "Embeddings"
EMB["wte\nToken Embedding\nvocab × C"]
PEMB["wpe\nPosition Embedding\nT × C"]
ADD["+ add\n(B, T, C)"]
EMB --> ADD
PEMB --> ADD
end
ADD --> B1
subgraph "× 12 Transformer Blocks"
B1["Block\nLayerNorm → Attention → Residual\nLayerNorm → MLP → Residual"]
B2["Block ..."]
B12["Block 12"]
B1 --> B2 --> B12
end
B12 --> LNF["Final LayerNorm"]
LNF --> LMH["lm_head\nLinear C → vocab\n(weight tied to wte)"]
LMH --> LOGITS["Logits\n(B, T, vocab_size)"]
LOGITS --> LOSS["Cross-entropy loss\nvs shifted targets"]
style ADD fill:#d4edda,stroke:#009900,color:#111
style LMH fill:#d4edda,stroke:#009900,color:#111
style LOSS fill:#f8d7da,stroke:#990000,color:#111
GPT-2 small (124M) specs:
| Hyperparameter | Value |
|---|---|
| Layers (n_layer) | 12 |
| Attention heads (n_head) | 12 |
| Embedding dim (n_embd) | 768 |
| Context length (block_size) | 1024 |
| Vocabulary size | 50257 |
| MLP hidden dim | 4 × 768 = 3072 |
The Data Pipeline
We use TinyShakespeare — 1MB of Shakespeare plays, simple to load, complex enough to learn structure.
import torch
import torch.nn as nn
import torch.nn.functional as F
# Download dataset
import urllib.request
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
urllib.request.urlretrieve(url, "input.txt")
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()
print(f"Dataset: {len(text):,} characters")
Tokenizer. For a from-scratch local run, character-level tokenization is simplest, and it is what the code below uses. For true GPT-2 BPE tokenization, swap in tiktoken.
# Character-level (fast, local, no deps)
chars = sorted(set(text))
vocab_size = len(chars) # 65 for Shakespeare
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for i, c in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])
data = torch.tensor(encode(text), dtype=torch.long)
Train / val split and batch sampler:
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
block_size = 256 # context length (use 1024 for full GPT-2)
batch_size = 64
def get_batch(split):
    # Note: `device` is defined later, in the training loop section
    d = train_data if split == 'train' else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i : i + block_size] for i in ix])
    y = torch.stack([d[i + 1 : i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)
The targets y are the inputs x shifted left by one — token $t$ predicts token $t+1$.
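A tiny worked example may make the shift concrete. This is a sketch with plain Python lists instead of tensors; toy_data, toy_block and start are made-up names for illustration and only mirror the slicing that get_batch performs.

```python
# Toy illustration of the input/target shift used by get_batch.
# toy_data, toy_block and start are hypothetical names, not part of the post's code.
toy_data = [10, 11, 12, 13, 14, 15]  # pretend token ids
toy_block = 4                         # context length
start = 1                             # a sampled start index

x = toy_data[start : start + toy_block]          # inputs
y = toy_data[start + 1 : start + toy_block + 1]  # targets: same window, one step later

# At each position t the model sees x[: t + 1] and must predict y[t].
pairs = [(x[: t + 1], y[t]) for t in range(toy_block)]
for context, target in pairs:
    print(context, "->", target)
```

One window of length block_size therefore yields block_size training examples at once, one per position.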
Causal Self-Attention
This is the heart of GPT-2. Every token asks: “how much should I attend to every previous token?”
The math
graph LR
subgraph "Attention scores for one head"
X2["X\n(B,T,C)"] --> WQ["× W_Q"] & WK["× W_K"] & WV["× W_V"]
WQ --> Q["Q\n(B,T,d_k)"]
WK --> K["K\n(B,T,d_k)"]
WV --> V["V\n(B,T,d_v)"]
Q & K --> DOT["QKᵀ / √d_k\n(B,T,T)"]
DOT --> MASK["+ causal mask\n−∞ for future"]
MASK --> SM["softmax\n(B,T,T)"]
SM --> ATTN["× V\n(B,T,d_v)"]
end
style MASK fill:#f8d7da,stroke:#990000,color:#111
style SM fill:#d4edda,stroke:#009900,color:#111
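For each token position, compute query $Q$, key $K$, value $V$ vectors:
$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V$$
Attention scores (scaled dot-product):
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$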
where $M$ is the causal mask — $-\infty$ for future positions, $0$ elsewhere. After softmax, $-\infty$ becomes $0$, so future tokens get zero weight.
The $\frac{1}{\sqrt{d_k}}$ scaling prevents the dot products from growing large as $d_k$ increases. Without it, softmax saturates and gradients vanish.
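Both claims above, that masked positions end up with exactly zero weight and that unscaled dot products grow with $d_k$, can be checked numerically. The following is a minimal sketch in plain Python (no PyTorch); the score matrix is made up, and the random vectors assume unit-variance entries, which is a simplification of what real queries and keys look like.

```python
import math
import random

def softmax(row):
    # Numerically stable softmax; math.exp(-inf) evaluates to 0.0
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for T = 3 positions (row = query, column = key)
scores = [
    [0.5, 2.0, 1.0],
    [1.0, 0.3, 4.0],
    [0.2, 0.1, 0.7],
]
NEG_INF = float("-inf")
# Causal mask M: keep the score where key <= query, use -inf where the key is in the future
masked = [
    [s if j <= i else NEG_INF for j, s in enumerate(row)]
    for i, row in enumerate(scores)
]
weights = [softmax(row) for row in masked]

def std(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

# Dot products of unit-variance random vectors have standard deviation near sqrt(d_k)
random.seed(0)
d_k = 64
dots = []
for _ in range(2000):
    q = [random.gauss(0, 1) for _ in range(d_k)]
    k = [random.gauss(0, 1) for _ in range(d_k)]
    dots.append(sum(a * b for a, b in zip(q, k)))

raw_std = std(dots)                                   # near sqrt(64) = 8
scaled_std = std([d / math.sqrt(d_k) for d in dots])  # near 1
```

Each row of weights sums to 1 and is zero above the diagonal, and dividing by $\sqrt{d_k}$ brings the spread of the scores back to roughly 1 regardless of head size.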
Multi-head attention
Instead of one attention operation, do $h$ of them in parallel with different learned projections, then concatenate:
graph LR
X["Input X\n(B, T, C)"] --> QKV["Single Linear\n c_attn: C → 3C"]
QKV --> Q["Q split"]
QKV --> K["K split"]
QKV --> V["V split"]
Q --> R["Reshape\n(B,nh,T,hs)"]
K --> R
V --> R
R --> ATT["Scaled dot-product\n+ causal mask\n+ softmax"]
ATT --> OUT["Weighted sum\nof V"]
OUT --> PROJ["c_proj: C → C"]
PROJ --> Y["Output Y\n(B, T, C)"]
style ATT fill:#d4edda,stroke:#009900,color:#111
Implementation
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # Q, K, V projections fused into one matrix: one matmul instead of three
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=False)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=False)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.head_size = config.n_embd // config.n_head
        # causal mask: lower triangular, registered as buffer (not a parameter)
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(config.block_size, config.block_size))
                 .view(1, 1, config.block_size, config.block_size)
        )

    def forward(self, x):
        B, T, C = x.size()  # batch, sequence length, embedding dim
        # Project once, split into Q K V, each (B, T, C)
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # Reshape into (B, n_head, T, head_size) for parallel head computation
        k = k.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        q = q.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        # Scaled dot-product attention: (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (self.head_size ** -0.5)
        # Apply causal mask: future positions → -inf → 0 after softmax
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        # Weighted sum of values: (B, nh, T, hs)
        y = att @ v
        # Concatenate heads: (B, T, C)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
Key detail: the QKV projection is a single Linear(C, 3C) layer, then split. This is more efficient than three separate projections (one GEMM vs three).
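The split and the per-head reshape are mostly index bookkeeping, which is easier to see on a single token. A rough sketch with plain lists, using the small config's sizes; fused is a stand-in for one row of the c_attn output, and to_heads is a hypothetical helper, not something from the model code.

```python
# Shape bookkeeping for one token, using plain lists.
# n_embd and n_head match the post's small config; fused stands in for c_attn(x) at one position.
n_embd, n_head = 384, 6
head_size = n_embd // n_head  # 64

fused = list(range(3 * n_embd))  # pretend output of Linear(C, 3C) for one token

# split(n_embd, dim=2): three consecutive chunks of size C
q = fused[0 * n_embd : 1 * n_embd]
k = fused[1 * n_embd : 2 * n_embd]
v = fused[2 * n_embd : 3 * n_embd]

def to_heads(vec):
    # view(..., n_head, head_size): head h owns a contiguous slice of the C channels
    return [vec[h * head_size : (h + 1) * head_size] for h in range(n_head)]

q_heads = to_heads(q)

# Concatenating the heads again (the transpose + view at the end of forward) restores C channels
merged = [value for head in q_heads for value in head]
```

No data is copied or mixed by the reshape itself: each head simply reads its own 64-channel slice, and concatenation puts the slices back in order.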
The MLP Block
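After attention, each token’s representation passes through a position-wise feed-forward network, applied to every position independently. The MLP holds about two-thirds of each block’s weights ($8C^2$ versus $4C^2$ for attention), and it is often described as the place where much of the model’s “knowledge” is stored.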
Structure: expand to $4C$ → GELU → compress back to $C$.
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU(approximate='tanh')  # GPT-2 uses tanh approx
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))
Why GELU over ReLU? GELU is smooth near zero — it doesn’t hard-gate values, it softly weights them. This leads to better gradient flow and empirically better training for language models.
Why 4× expansion? The expanded space gives the MLP room to represent complex functions. It’s a heuristic from the original transformer paper that has stuck across essentially all modern LLMs.
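To make the GELU versus ReLU comparison concrete, here is a small sketch in plain Python that evaluates ReLU, the exact GELU ($x \cdot \Phi(x)$) and the tanh approximation at a few points. The sample points in xs are arbitrary.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation, the form selected by approximate='tanh'
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

xs = [-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
for x in xs:
    print(f"{x:5.1f}  relu={relu(x):.4f}  gelu={gelu_exact(x):.4f}  tanh-approx={gelu_tanh(x):.4f}")
```

Small negative inputs pass through GELU as small negative outputs rather than being clamped to zero, and the tanh form stays within about a thousandth of the exact one at these points.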
The Transformer Block
One block = LayerNorm → Attention → residual add, followed by LayerNorm → MLP → residual add.
GPT-2 uses pre-LayerNorm (normalize before each sub-layer, not after). This is a departure from the original “Attention is All You Need” paper and improves training stability significantly.
graph TD
X["x (input)"] --> LN1["LayerNorm 1"]
LN1 --> ATT["CausalSelfAttention"]
ATT --> ADD1["+ residual"]
X --> ADD1
ADD1 --> LN2["LayerNorm 2"]
LN2 --> MLP["MLP"]
MLP --> ADD2["+ residual"]
ADD1 --> ADD2
ADD2 --> OUT["x (output)"]
style ATT fill:#d4edda,stroke:#009900,color:#111
style MLP fill:#d4edda,stroke:#009900,color:#111
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # attention sub-layer
        x = x + self.mlp(self.ln_2(x))   # MLP sub-layer
        return x
The residual connections (x + ...) are critical: they give gradients a direct path to early layers that bypasses attention and MLP. Without them, a 12-layer stack would be much harder to train.
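The difference between pre-LayerNorm and post-LayerNorm comes down to where the normalization sits relative to the residual add. A minimal sketch in plain Python, assuming a LayerNorm without learned scale and shift and a stand-in sub-layer that outputs zeros; pre_ln, post_ln and zero_sublayer are illustrative names only.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize one vector to zero mean and unit variance (no learned scale/shift here)
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln(x, sublayer):
    # GPT-2 style: x + f(LN(x)); the residual path carries x through untouched
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_ln(x, sublayer):
    # Original transformer style: LN(x + f(x)); the residual sum itself is normalized
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def zero_sublayer(x):
    # Stand-in for an attention or MLP layer that currently contributes nothing
    return [0.0 for _ in x]

x = [1.0, 5.0, -2.0, 8.0]
pre_out = pre_ln(x, zero_sublayer)    # identical to x
post_out = post_ln(x, zero_sublayer)  # a rescaled, recentred version of x
```

With pre-LayerNorm the block reduces to the identity when the sub-layer outputs zero, so the input (and its gradient) passes straight through. With post-LayerNorm even this trivial case rewrites the residual stream.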
The Full GPT-2 Model
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 256  # context length
    vocab_size: int = 65   # 65 for char-level Shakespeare; 50257 for GPT-2 BPE
    n_layer: int = 6       # 12 for full GPT-2 small
    n_head: int = 6        # 12 for full GPT-2 small
    n_embd: int = 384      # 768 for full GPT-2 small

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),  # token embeddings
            wpe = nn.Embedding(config.block_size, config.n_embd),  # position embeddings
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),  # final LayerNorm
        ))
        # language model head: projects embeddings → vocab logits
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        # Weight tying: token embedding and lm_head share the same matrix
        self.transformer.wte.weight = self.lm_head.weight
        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.size()
        assert T <= self.config.block_size
        # Token + positional embeddings
        pos = torch.arange(T, device=idx.device)  # (T,)
        tok_emb = self.transformer.wte(idx)       # (B, T, C)
        pos_emb = self.transformer.wpe(pos)       # (T, C)
        x = tok_emb + pos_emb                     # (B, T, C)
        # Pass through transformer blocks
        for block in self.transformer.h:
            x = block(x)
        # Final LayerNorm + project to vocab
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )
        return logits, loss

    def get_num_params(self):
        return sum(p.numel() for p in self.parameters())
Weight tying
self.transformer.wte.weight = self.lm_head.weight makes both layers share the same matrix. For full GPT-2 small this saves ~38.6M parameters (768 × 50257), close to a third of the 124M total, generally without hurting quality.
The intuition: the token embedding maps token ID → vector, and lm_head maps vector → token logits. They’re doing the same job in opposite directions. Using the same matrix means a token’s embedding IS its “signature” — the model learns one consistent representation rather than two separate ones.
graph LR
subgraph "Without weight tying (2 separate matrices)"
WTE1["wte.weight\n50257 × 768\n38.6M params"] -->|"token id → vector"| H1["hidden\nstates"]
H1 --> LMH1["lm_head.weight\n768 × 50257\n38.6M params"] --> L1["logits"]
end
subgraph "With weight tying (1 shared matrix)"
WTE2["wte.weight\n50257 × 768\n38.6M params"] -->|"token id → vector"| H2["hidden\nstates"]
H2 -->|"same matrix\ntransposed"| L2["logits"]
end
style WTE2 fill:#d4edda,stroke:#009900,color:#111
style L2 fill:#d4edda,stroke:#009900,color:#111
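The "same job in opposite directions" intuition can be shown with a toy matrix. This is a sketch in plain Python with a made-up 3-token vocabulary and 2-dim embeddings; embed and lm_head here are illustrative functions over one shared list W, not the PyTorch modules.

```python
# Toy weight tying with plain lists: one matrix W (vocab x C) serves both directions.
# The sizes and values are made up for illustration.
W = [
    [1.0, 0.0],    # token 0
    [0.0, 1.0],    # token 1
    [-1.0, 0.0],   # token 2
]

def embed(token_id):
    # wte: token id -> vector (row lookup)
    return W[token_id]

def lm_head(hidden):
    # lm_head: vector -> one logit per token (dot product with every row of the same W)
    return [sum(h * w for h, w in zip(hidden, row)) for row in W]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# Feeding a token's own embedding to the head scores that token highest
round_trip = [argmax(lm_head(embed(t))) for t in range(len(W))]

# Parameters saved by tying in full GPT-2 small
saved = 768 * 50257
```

In the real model the hidden state is produced by 12 blocks rather than being the raw embedding, but the output logits are still dot products against the rows of the embedding table.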
Embeddings in Detail
GPT-2 uses two separate learned embeddings added together:
graph LR
subgraph "Input: 'Hello world'"
T1["Hello\ntoken 15496"] & T2["world\ntoken 995"]
end
subgraph "Token Embedding wte [50257 × 768]"
TE1["row 15496\n→ 768-dim vec"]
TE2["row 995\n→ 768-dim vec"]
end
subgraph "Position Embedding wpe [1024 × 768]"
PE1["row 0\n→ 768-dim vec"]
PE2["row 1\n→ 768-dim vec"]
end
T1 --> TE1
T2 --> TE2
T1 -->|pos 0| PE1
T2 -->|pos 1| PE2
TE1 & PE1 --> ADD1["+ add → x₀\n768-dim"]
TE2 & PE2 --> ADD2["+ add → x₁\n768-dim"]
style ADD1 fill:#d4edda,stroke:#009900,color:#111
style ADD2 fill:#d4edda,stroke:#009900,color:#111
Token embedding (wte): maps each token ID to a 768-dim vector. Vocabulary size 50257 for BPE, 65 for character-level.
Positional embedding (wpe): maps each position index $0, 1, \ldots, T-1$ to a 768-dim vector. Learned, not fixed sinusoidal like the original transformer. The model learns what “being at position 47” means.
Why add instead of concatenate? Adding means position information is baked directly into the token representation — the same representation that flows through all subsequent layers. Concatenating would double the dimension.
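A toy version of the addition, as a sketch in plain Python: the token ids follow the diagram above, while the 2-dim vectors in wte and wpe are invented (real GPT-2 uses 768 dims), and embed_sequence is a hypothetical helper.

```python
# Toy version of tok_emb + pos_emb with made-up 2-dim tables.
wte = {15496: [0.5, -1.0], 995: [2.0, 0.25]}    # token id -> vector
wpe = [[0.25, 0.0], [0.0, 0.5], [-0.5, 0.5]]    # position -> vector

def embed_sequence(token_ids):
    # x_t = wte[token_t] + wpe[t], elementwise; the dimension stays the same
    return [
        [a + b for a, b in zip(wte[tok], wpe[pos])]
        for pos, tok in enumerate(token_ids)
    ]

x = embed_sequence([15496, 995])
repeated = embed_sequence([995, 995])  # same token at positions 0 and 1
```

The same token at two positions gets two different input vectors, which is all the model needs to tell them apart, and the width stays at C instead of doubling.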
Training Loop
flowchart LR
DATA["get_batch\n(x, y)"] --> FWD["Forward pass\nlogits, loss"]
FWD --> ZERO["zero_grad"]
ZERO --> BWD["loss.backward\ncompute grads"]
BWD --> CLIP["clip_grad_norm_\nprevent explosion"]
CLIP --> STEP["optimizer.step\nupdate weights"]
STEP -->|"next iteration"| DATA
STEP -->|"every 500 steps"| EVAL["estimate_loss\ntrain + val"]
EVAL -->|"log"| LOG["step N | train X | val Y"]
device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
print(f"Using device: {device}")
config = GPTConfig()
model = GPT(config).to(device)
print(f"Parameters: {model.get_num_params():,}")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
@torch.no_grad()
def estimate_loss(eval_iters=100):
    model.eval()
    losses = {}
    for split in ['train', 'val']:
        total = 0.0
        for _ in range(eval_iters):
            x, y = get_batch(split)
            _, loss = model(x, y)
            total += loss.item()
        losses[split] = total / eval_iters
    model.train()
    return losses

max_iters = 5000
eval_every = 500
for step in range(max_iters):
    x, y = get_batch('train')
    logits, loss = model(x, y)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    # Gradient clipping: prevents exploding gradients in deep networks
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    if step % eval_every == 0 or step == max_iters - 1:
        losses = estimate_loss()
        print(f"step {step:5d} | train {losses['train']:.4f} | val {losses['val']:.4f}")
Example output on a MacBook M2 (~5 min), abridged; exact values vary from run to run:
step 0 | train 4.2891 | val 4.2934
step 500 | train 2.1034 | val 2.1712
step 1000 | train 1.7823 | val 1.9140
step 2000 | train 1.5601 | val 1.7832
step 5000 | train 1.3244 | val 1.6109
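A useful sanity check on the first line of that log: a model that predicts every token with equal probability has cross-entropy $\ln(V)$. A short sketch in plain Python, assuming the 65-character vocabulary from earlier; cross_entropy is a hand-written helper for illustration, not the PyTorch function.

```python
import math

vocab_size = 65  # character-level Shakespeare vocabulary from the post

# Cross-entropy of a uniform prediction: -log(1 / V) = log(V)
uniform_loss = math.log(vocab_size)

def cross_entropy(logits, target):
    # -log softmax(logits)[target], computed with the log-sum-exp trick
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

# All-equal logits give the same number, whatever the target is
flat = cross_entropy([0.0] * vocab_size, target=7)
print(f"uniform baseline: {uniform_loss:.4f}")
```

The baseline comes out near 4.17, so a step-0 loss slightly above that is plausible for a freshly initialized model whose logits are small but not exactly equal. A step-0 loss far above it usually points to a bug in initialization or in how targets are built.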
Text Generation
@torch.no_grad()
def generate(model, prompt, max_new_tokens=200, temperature=1.0, top_k=40):
    model.eval()
    idx = torch.tensor(encode(prompt), dtype=torch.long, device=device).unsqueeze(0)
    for _ in range(max_new_tokens):
        # Crop context to block_size
        idx_cond = idx[:, -config.block_size:]
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :]  # take last position: (1, vocab_size)
        # Temperature: scale logits before softmax
        logits = logits / temperature
        # Top-k: set everything below the k-th highest logit to -inf
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = float('-inf')
        probs = F.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_tok], dim=1)
    return decode(idx[0].tolist())

print(generate(model, prompt="\n", max_new_tokens=300, temperature=0.8))
Temperature controls randomness:
- $T < 1$ — sharper distribution, more predictable
- $T = 1$ — sample from the raw model distribution
- $T > 1$ — flatter distribution, more creative/random
Top-k restricts sampling to the k most likely tokens, so very low-probability tokens are never picked. That cuts down on nonsense while keeping variety.
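The effect of both knobs is easy to see on a four-token toy vocabulary. This is a sketch in plain Python that mirrors the order used in generate (temperature, then top-k, then softmax); toy_logits and sample_distribution are made up for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_distribution(logits, temperature=1.0, top_k=None):
    # Same order as generate(): scale by temperature, keep the top-k, then softmax
    scaled = [v / temperature for v in logits]
    if top_k is not None:
        kth = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [v if v >= kth else float("-inf") for v in scaled]
    return softmax(scaled)

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up logits for a 4-token vocabulary

p_cold = sample_distribution(toy_logits, temperature=0.5)  # sharper
p_base = sample_distribution(toy_logits, temperature=1.0)  # raw distribution
p_hot = sample_distribution(toy_logits, temperature=2.0)   # flatter
p_top2 = sample_distribution(toy_logits, top_k=2)          # only two tokens survive
```

Lower temperature pushes more mass onto the top token, higher temperature spreads it out, and top-k renormalizes over the surviving tokens so the probabilities still sum to 1.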
flowchart TD
P["prompt tokens\n(1, T)"] --> CROP["crop to block_size"]
CROP --> FWD["model forward pass\n→ logits (1, T, V)"]
FWD --> LAST["take last position\nlogits (1, V)"]
LAST --> TEMP["÷ temperature\nscale sharpness"]
TEMP --> TOPK["zero all but top-k\nlogits"]
TOPK --> SOFT["softmax → probs\n(1, V)"]
SOFT --> SAMPLE["multinomial sample\n→ next token id"]
SAMPLE --> CAT["cat to idx\n(1, T+1)"]
CAT -->|"repeat until\nmax_new_tokens"| CROP
style SAMPLE fill:#d4edda,stroke:#009900,color:#111
style SOFT fill:#d4edda,stroke:#009900,color:#111
Parameter Count Breakdown
For the small config (n_layer=6, n_head=6, n_embd=384, vocab=65):
| Component | Parameters |
|---|---|
| Token embedding wte | 65 × 384 = 24,960 |
| Position embedding wpe | 256 × 384 = 98,304 |
| Per block: attention QKV | 384 × (3×384) = 442,368 |
| Per block: attention proj | 384 × 384 = 147,456 |
| Per block: MLP fc | 384 × 1536 = 589,824 |
| Per block: MLP proj | 1536 × 384 = 589,824 |
| Per block: MLP biases | 1536 + 384 = 1,920 |
| Per block: 2× LayerNorm (weight + bias) | 2 × 768 = 1,536 |
| Per block total | ~1.77M |
| 6 blocks | ~10.6M |
| Final LayerNorm (weight + bias) | 768 |
| lm_head (tied to wte) | 0 extra |
| Total | ~10.8M |
For full GPT-2 small (12 layers, 768 embd, vocab 50257): 124M parameters.
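The same arithmetic as a small function, which makes it easy to try other configurations. A sketch in plain Python: count_params is a hypothetical helper that assumes the layer layout of the code in this post (biases in the MLP and LayerNorms, tied lm_head), with a flag for attention biases since the attention layers here are built with bias=False while the released GPT-2 checkpoint includes them.

```python
def count_params(n_layer, n_embd, vocab_size, block_size, attn_bias):
    C = n_embd
    wte = vocab_size * C                 # token embedding (shared with lm_head)
    wpe = block_size * C                 # position embedding
    layer_norm = 2 * C                   # weight + bias
    attn = C * 3 * C + C * C             # c_attn + c_proj weights
    if attn_bias:
        attn += 3 * C + C                # c_attn + c_proj biases
    mlp = (C * 4 * C + 4 * C) + (4 * C * C + C)  # c_fc and c_proj, both with bias
    block = 2 * layer_norm + attn + mlp
    return wte + wpe + n_layer * block + layer_norm  # last term is ln_f

# The post's small config (attention layers built with bias=False)
small = count_params(n_layer=6, n_embd=384, vocab_size=65, block_size=256, attn_bias=False)

# GPT-2 small shape, with and without attention biases
gpt2_small = count_params(n_layer=12, n_embd=768, vocab_size=50257, block_size=1024, attn_bias=True)
gpt2_small_no_attn_bias = count_params(12, 768, 50257, 1024, attn_bias=False)

print(f"{small:,}  {gpt2_small:,}  {gpt2_small_no_attn_bias:,}")
```

The small config comes to 10,761,600 parameters, and the GPT-2 small shape to 124,439,808 with attention biases or 124,402,944 without, all of which round to the figures quoted above.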
What Makes It GPT-2 Specifically
The original GPT-1 → GPT-2 changes were mostly scale, but also:
- Pre-LayerNorm instead of post-LayerNorm — better gradient flow, trains more stably
- LayerNorm on the final output after all blocks (the ln_f we have)
- Weight initialization scaled by depth: residual projections are initialized with std $\frac{0.02}{\sqrt{2 \cdot n_{\text{layer}}}}$ to prevent the residual stream from growing too large. The _init_weights method above uses a flat 0.02 and leaves this scaling out.
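- No bias in attention: this one is a simplification in this post's code (nanoGPT offers it as an option), not a GPT-2 change. The released GPT-2 weights do include biases in the attention projections.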
- Vocabulary expanded to 50257 via BPE
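The reasoning behind the depth-scaled initialization in the list above can be illustrated with a rough simulation. This sketch, in plain Python, assumes each residual add contributes an independent Gaussian term, which is a strong simplification of a real network; stream_std is a made-up helper.

```python
import math
import random

n_layer = 12
base_std = 0.02
# Each block adds to the residual stream twice (attention and MLP), hence 2 * n_layer
scaled_std = base_std / math.sqrt(2 * n_layer)

def stream_std(per_add_std, n_adds, trials=4000, seed=0):
    # Simplified model (an assumption): every residual add is an independent Gaussian term
    rng = random.Random(seed)
    totals = [sum(rng.gauss(0.0, per_add_std) for _ in range(n_adds)) for _ in range(trials)]
    mean = sum(totals) / trials
    return math.sqrt(sum((t - mean) ** 2 for t in totals) / trials)

unscaled = stream_std(base_std, 2 * n_layer)   # grows like 0.02 * sqrt(24), about 0.098
scaled = stream_std(scaled_std, 2 * n_layer)   # stays near 0.02
```

Summing 24 independent terms multiplies the standard deviation by $\sqrt{24}$, so dividing each term's scale by the same factor keeps the accumulated stream at roughly its original magnitude.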
The GPT-2 paper covers the architecture only briefly, as a short list of changes relative to GPT-1. For implementation details, Karpathy’s nanoGPT is the clearest reference.
Common Pitfalls
Forgetting to shift targets. The model predicts token $t+1$ from token $t$. If you feed x as both input and target without the shift, the model learns the trivial task of copying.
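Not zeroing gradients. PyTorch accumulates gradients by default. Call optimizer.zero_grad() before each .backward(). With set_to_none=True (the default since PyTorch 2.0) the gradient tensors are released instead of being filled with zeros, which is slightly faster.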
No gradient clipping. Without clip_grad_norm_, early training can produce occasional very large gradients, and one oversized update can destabilize the run (the loss spikes or turns into NaN).
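Clipping by global norm rescales all gradients together so their combined L2 norm does not exceed a limit. The following is a plain-Python sketch of that rule on made-up gradients; clip_by_global_norm is a hypothetical helper written for illustration, and in the training loop the PyTorch utility does this work on the real parameter gradients.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    # grads: list of per-parameter gradient lists.
    # Returns rescaled copies and the norm measured before clipping.
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in grad] for grad in grads], total_norm

# Made-up gradients for two parameters; global norm = sqrt(9 + 16 + 144) = 13
big = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(big, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))

small_grads = [[0.3, 0.4]]  # norm 0.5, already under the limit
untouched, _ = clip_by_global_norm(small_grads, max_norm=1.0)
```

Every gradient is multiplied by the same factor, so the update direction is preserved and only its length changes. Gradients already under the limit pass through unchanged.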
Context overflow. During generation, idx grows with each new token. Crop to block_size before the forward pass or you’ll get an index error in the positional embedding.
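Weight tying can silently break. With the default load_state_dict, values are copied in place into the existing parameters, so a model that ties its weights in __init__ (as GPT above does) stays tied after loading. The tie is lost when a parameter is replaced rather than copied into, for example if wte.weight or lm_head.weight is reassigned after construction. Check with model.transformer.wte.weight is model.lm_head.weight, and re-tie if it is False: model.transformer.wte.weight = model.lm_head.weight.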
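Whether a tie survives a reload is easy to verify rather than assume. A minimal sketch, assuming a recent PyTorch with default load_state_dict settings; TinyTied is a hypothetical two-layer stand-in for the GPT class, and is_tied is an illustrative helper.

```python
import torch
import torch.nn as nn

class TinyTied(nn.Module):
    # Hypothetical miniature of the GPT class: just the two tied layers
    def __init__(self, vocab_size=10, n_embd=4):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        self.wte.weight = self.lm_head.weight  # tie, as in GPT.__init__

def is_tied(model):
    # True only if both attributes point at the very same Parameter object
    return model.wte.weight is model.lm_head.weight

src = TinyTied()
dst = TinyTied()
dst.load_state_dict(src.state_dict())  # default behaviour: copy values into existing parameters

tied_after_load = is_tied(dst)
values_match = torch.equal(dst.wte.weight, src.wte.weight)

# Replacing one side with a fresh Parameter is what breaks the tie
dst.lm_head.weight = nn.Parameter(dst.lm_head.weight.detach().clone())
tied_after_replace = is_tied(dst)

# Re-tie
dst.wte.weight = dst.lm_head.weight
tied_again = is_tied(dst)
```

A one-line identity check like is_tied is cheap enough to run right after loading any checkpoint.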
Summary
GPT-2 is:
- Token + position embeddings added together
- 12 transformer blocks, each with causal self-attention → MLP, both with residual connections and pre-LayerNorm
- A final LayerNorm then linear projection to vocab logits
- Weight tying between the input embedding and output projection
- Trained with cross-entropy loss on next-token prediction
That’s it. The entire architecture in five sentences. Everything else is choosing the right hyperparameters and making the implementation efficient.
Watch It Built Live
Andrej Karpathy’s full lecture reproducing GPT-2 (124M) from scratch: https://www.youtube.com/watch?v=l8pRSuU81PU
References
Primary sources — code and papers
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. OpenAI Blog. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- Karpathy, A. (2023). build-nanoGPT: Video+code lecture on building nanoGPT from scratch. GitHub. https://github.com/karpathy/build-nanogpt
- Karpathy, A. (2023). nanoGPT: The simplest, fastest repository for training/finetuning medium-sized GPTs. GitHub. https://github.com/karpathy/nanoGPT
- Karpathy, A. (2023). Let’s reproduce GPT-2 (124M). YouTube. https://www.youtube.com/watch?v=l8pRSuU81PU
Implementation guides read
- Belaweid, A. (2025). GPT-2 Implementation From Scratch For Dummies! Substack. https://azizbelaweid.substack.com/p/gpt-2-implementation-from-scratch
- RecsysML. (2025). Learning ML by doing: Training GPT-2 on a budget. Substack. https://recsysml.substack.com/p/training-gpt-2-on-a-budget
- Cameron, W. (2024). Implementation of causal self-attention in PyTorch. GitHub Gist. https://gist.github.com/wolfecameron/26863dbbc322b15d2e224a2569868256
Theory and architecture
- Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS. https://arxiv.org/abs/1706.03762
- Raschka, S. (2023). Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch. https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html
- Ravi, S. (2025). GPT-2 Architecture Demystified: A Step-by-Step Breakdown. Medium. https://sararavi14.medium.com/gpt-2-architecture-demystified-a-step-by-step-breakdown-74b1c5c80d17
Official documentation
- HuggingFace. GPT-2 Model Documentation. HuggingFace Transformers. https://huggingface.co/docs/transformers/model_doc/gpt2