Build GPT-2 from Scratch in PyTorch: A Full Walkthrough

I understood what a GPT does at a high level and I could follow queries, keys, and values in isolation. But wiring all of it into a trainable 124M-parameter model — token embeddings, positional embeddings, causal masking, weight tying, a loss that actually shifts targets — that gap took me weeks. This post is the single walkthrough I wanted: one continuous story from “what is GPT-2?” through atom-by-atom construction with parameter math at every layer, a fully worked numeric forward pass on a toy sequence, and a training loop that runs on a laptop.

GPT-2 is a decoder-only transformer that predicts the next token. That one sentence is the whole objective. Everything else — 12 heads, the causal mask, pre-LayerNorm, weight tying — is engineering that makes next-token prediction trainable at scale.

[Figure: GPT-2 transformer block architecture, showing the attention, MLP, and LayerNorm layers with data streams]

This is part of the Transformer Deep Dive series. If attention or embeddings are still fuzzy, read those first — I’ll assume you know what a query and an embedding row are, and focus on how they snap together into GPT-2.

What is GPT-2?

GPT-2 is a language model: given a sequence of tokens, predict what comes next. Train on enough text and the model picks up grammar, facts, style, and rough reasoning — because all of those show up as regularities in next-token prediction.

The architecture is a stack of identical transformer blocks. Each block has two sub-layers:

Causal multi-head self-attention — each token gathers context from tokens before it only.
Position-wise MLP — each token passes through a small feed-forward network independently.

“Causal” means no peeking at the future. The token at position $t$ may attend to positions $0, 1, \ldots, t$. Positions $t+1$ and later are masked out.

graph TD
  TOK["Token IDs\n(B, T)"] --> EMB
  POS["Position IDs\n0 … T-1"] --> PEMB

  subgraph "Embeddings"
    EMB["wte\nToken Embedding\nvocab × C"]
    PEMB["wpe\nPosition Embedding\nT × C"]
    ADD["+ add\n(B, T, C)"]
    EMB --> ADD
    PEMB --> ADD
  end

  ADD --> B1

  subgraph "× 12 Transformer Blocks"
    B1["Block\nLayerNorm → Attention → Residual\nLayerNorm → MLP → Residual"]
    B2["Block ..."]
    B12["Block 12"]
    B1 --> B2 --> B12
  end

  B12 --> LNF["Final LayerNorm"]
  LNF --> LMH["lm_head\nLinear C → vocab\n(weight tied to wte)"]
  LMH --> LOGITS["Logits\n(B, T, vocab_size)"]
  LOGITS --> LOSS["Cross-entropy loss\nvs shifted targets"]

  style ADD fill:#d4edda,stroke:#009900,color:#111
  style LMH fill:#d4edda,stroke:#009900,color:#111
  style LOSS fill:#f8d7da,stroke:#990000,color:#111

GPT-2 small (124M) hyperparameters:

Hyperparameter	Symbol	Value
Layers	`n_layer`	12
Attention heads	`n_head`	12
Embedding dim	`n_embd`	768
Context length	`block_size`	1024
Vocabulary size	`vocab_size`	50257
MLP hidden dim	$4 \times n_{\text{embd}}$	3072
Head size	$n_{\text{embd}} / n_{\text{head}}$	64

I’ll build the full architecture below. For local training I shrink these numbers so a MacBook can finish in minutes; the math is identical.

How does the full GPT-2 pipeline fit together?

End to end, one training step looks like this:

flowchart LR
  TXT["Raw text"] --> TOK["Tokenize → IDs"]
  TOK --> BATCH["get_batch → (x, y)"]
  BATCH --> EMB["wte + wpe"]
  EMB --> BLOCKS["× n_layer Blocks"]
  BLOCKS --> LNF["ln_f"]
  LNF --> HEAD["lm_head → logits"]
  HEAD --> CE["cross_entropy vs y"]

The input x and target y differ by one position: token at index $t$ in x predicts token at index $t$ in y, which is token $t+1$ from the original stream. Get that shift wrong and the model learns to copy instead of predict.

How do I set up the data pipeline?

I use TinyShakespeare — about 1MB of text, enough structure to see loss drop, small enough to download instantly.

import urllib.request

import torch
import torch.nn as nn
import torch.nn.functional as F

# Download the dataset (about 1MB of plain text)
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
urllib.request.urlretrieve(url, "input.txt")

# Read with an explicit encoding so the result does not depend on the OS default
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

print(f"Dataset: {len(text):,} characters")

For a from-scratch local run, character-level tokenization is simplest. For true GPT-2 BPE, use tiktoken or the BPE from-scratch post:

# Character-level (fast, local, no deps)
chars = sorted(set(text))
vocab_size = len(chars)          # 65 for Shakespeare
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for i, c in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)

Train / val split and batch sampler:

n = int(0.9 * len(data))
train_data = data[:n]
val_data   = data[n:]

block_size = 256   # context length (use 1024 for full GPT-2)
batch_size = 64

def get_batch(split):
    d = train_data if split == 'train' else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i    : i + block_size    ] for i in ix])
    y = torch.stack([d[i + 1: i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)

If the input chunk is token IDs $[t_0, t_1, \ldots, t_{T-1}]$, the target chunk is $[t_1, t_2, \ldots, t_T]$. Position $i$ in x predicts position $i$ in y.
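
To make the shift concrete, here is a minimal sketch on a made-up integer stream. The IDs and the `start` offset are illustrative assumptions, not values from the dataset. It also shows that one window of length `block_size` yields `block_size` training examples, each with a longer context.

```python
# Toy token stream; the IDs are made up for illustration.
stream = [10, 11, 12, 13, 14, 15, 16, 17]
block_size = 4
start = 2  # hypothetical window offset

x = stream[start : start + block_size]          # inputs
y = stream[start + 1 : start + block_size + 1]  # targets, shifted by one

# Position t sees x[0..t] as context and must predict y[t].
examples = [(x[: t + 1], y[t]) for t in range(block_size)]
for context, target in examples:
    print(f"context {context} -> target {target}")
```

If `y` were built with the same slice as `x`, every target would already be visible in the context, which is the copy failure described above.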

Atom 1: How do token and position embeddings work?

The first layer turns integer IDs into vectors. I covered the token side in Token Embeddings Explained and the position side in Positional Encoding Explained. GPT-2 uses learned position embeddings (wpe), not sinusoids.

The combined input is:

$$ x = \text{Embed}_{\text{token}}(\text{idx}) + \text{Embed}_{\text{position}}(\text{pos}) $$

graph LR
  subgraph "Input: 'Hello world'"
    T1["Hello\ntoken 15496"] & T2["world\ntoken 995"]
  end

  subgraph "Token Embedding  wte  [50257 × 768]"
    TE1["row 15496\n→ 768-dim vec"] 
    TE2["row 995\n→ 768-dim vec"]
  end

  subgraph "Position Embedding  wpe  [1024 × 768]"
    PE1["row 0\n→ 768-dim vec"]
    PE2["row 1\n→ 768-dim vec"]
  end

  T1 --> TE1
  T2 --> TE2
  T1 -->|pos 0| PE1
  T2 -->|pos 1| PE2

  TE1 & PE1 --> ADD1["+ add → x₀\n768-dim"]
  TE2 & PE2 --> ADD2["+ add → x₁\n768-dim"]

  style ADD1 fill:#d4edda,stroke:#009900,color:#111
  style ADD2 fill:#d4edda,stroke:#009900,color:#111

Token embedding wte: matrix $W_{\text{te}} \in \mathbb{R}^{V \times C}$. Row $i$ is the vector for token ID $i$.

For GPT-2 small:

$$ V = 50257 $$

$$ C = 768 $$

Parameter count for wte:

$$ 50257 \times 768 = 38{,}597{,}376 $$

Position embedding wpe: matrix $W_{\text{pe}} \in \mathbb{R}^{L \times C}$ where $L$ is block_size.

$$ L = 1024 $$

$$ C = 768 $$

Parameter count for wpe:

$$ 1024 \times 768 = 786{,}432 $$

Embedding subtotal (before blocks):

$$ 38{,}597{,}376 + 786{,}432 = 39{,}383{,}808 $$

Why add instead of concatenate? Adding keeps width at $C$. The same $C$-dimensional tensor flows through every subsequent layer. Concatenating would double the width and force every later matrix to grow.
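
A small sketch of the add, assuming toy sizes I picked only for this demo (`V=65`, `L=16`, `C=8`). The position lookup has shape `(T, C)` and broadcasts across the batch, so the result keeps width `C`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
V, L, C = 65, 16, 8          # toy sizes, chosen only for this demo
wte = nn.Embedding(V, C)     # token table
wpe = nn.Embedding(L, C)     # position table

idx = torch.randint(0, V, (2, 5))   # (B=2, T=5) token IDs
pos = torch.arange(idx.size(1))     # (T,) positions 0..4

tok_emb = wte(idx)                  # (2, 5, 8)
pos_emb = wpe(pos)                  # (5, 8), broadcast over the batch
x = tok_emb + pos_emb               # (2, 5, 8): width stays at C

print(x.shape)
```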

For the local Shakespeare config ($V = 65$, $C = 384$, $L = 256$):

$$ \text{wte} = 65 \times 384 = 24{,}960 $$

$$ \text{wpe} = 256 \times 384 = 98{,}304 $$

$$ \text{embeddings total} = 24{,}960 + 98{,}304 = 123{,}264 $$

Atom 2: How does causal self-attention work?

This is the heart of GPT-2. Every token asks: “how much should I listen to each token before me?” I walked through queries, keys, values, and dot products in How Attention Works in Transformers. Here I implement the causal version GPT-2 actually uses.

What is the attention math?

graph LR
  subgraph "Attention scores for one head"
    X2["X\n(B,T,C)"] --> WQ["× W_Q"] & WK["× W_K"] & WV["× W_V"]
    WQ --> Q["Q\n(B,T,d_k)"]
    WK --> K["K\n(B,T,d_k)"]
    WV --> V["V\n(B,T,d_v)"]
    Q & K --> DOT["QKᵀ / √d_k\n(B,T,T)"]
    DOT --> MASK["+ causal mask\n−∞ for future"]
    MASK --> SM["softmax\n(B,T,T)"]
    SM --> ATTN["× V\n(B,T,d_v)"]
  end
  style MASK fill:#f8d7da,stroke:#990000,color:#111
  style SM fill:#d4edda,stroke:#009900,color:#111

For each token position:

$$ Q = X W_Q $$

$$ K = X W_K $$

$$ V = X W_V $$

Attention output:

$$ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right) V $$

$M$ is the causal mask. Future positions get $-\infty$. After softmax, $-\infty \rightarrow 0$, so future tokens receive zero weight.
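
Here is a minimal sketch of the mask in isolation. I use all-zero raw scores (an assumption made purely for readability), so each row of the result should come out uniform over the positions it is allowed to see.

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.zeros(T, T)                     # all-equal raw scores for clarity
mask = torch.tril(torch.ones(T, T))            # 1 = allowed, 0 = future position
masked = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(masked, dim=-1)            # each row sums to 1 over the visible past

print(weights)
```

Row $t$ spreads its weight evenly over positions $0$ through $t$, and every entry above the diagonal is exactly zero.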

Scaling factor for GPT-2 small:

$$ d_k = \frac{768}{12} = 64 $$

$$ \sqrt{d_k} = \sqrt{64} = 8 $$

How does multi-head attention change the picture?

Instead of one attention with $d_k = 768$, GPT-2 runs $h = 12$ heads in parallel, each with $d_k = 64$, then concatenates:

graph LR
  X["Input X\n(B, T, C)"] --> QKV["Single Linear\n c_attn: C → 3C"]
  QKV --> Q["Q split"]
  QKV --> K["K split"]
  QKV --> V["V split"]
  Q --> R["Reshape\n(B,nh,T,hs)"]
  K --> R
  V --> R
  R --> ATT["Scaled dot-product\n+ causal mask\n+ softmax"]
  ATT --> OUT["Weighted sum\nof V"]
  OUT --> PROJ["c_proj: C → C"]
  PROJ --> Y["Output Y\n(B, T, C)"]

  style ATT fill:#d4edda,stroke:#009900,color:#111

Parameter count: attention per block

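This implementation fuses Q, K, V into one matrix c_attn with no bias. (The released GPT-2 checkpoints do include attention biases, which would add $3C + C = 3{,}072$ parameters per block.)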

$$ \text{c\_attn} = C \times 3C = 768 \times 2304 = 1{,}769{,}472 $$

Output projection c_proj:

$$ \text{c\_proj} = C \times C = 768 \times 768 = 589{,}824 $$

Attention per block:

$$ 1{,}769{,}472 + 589{,}824 = 2{,}359{,}296 $$

Equivalently, this is exactly $4C^2 = 4 \times 768^2 = 2{,}359{,}296$ parameters per block for attention.

For the local config ($C = 384$):

$$ \text{c\_attn} = 384 \times 1152 = 442{,}368 $$

$$ \text{c\_proj} = 384 \times 384 = 147{,}456 $$

$$ \text{attention per block} = 442{,}368 + 147{,}456 = 589{,}824 $$

Implementation

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0

        # Q, K, V projections fused into one matrix: one matmul instead of three
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=False)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=False)

        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.head_size = config.n_embd // config.n_head

        # causal mask — lower triangular, registered as buffer (not a parameter)
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(config.block_size, config.block_size))
                 .view(1, 1, config.block_size, config.block_size)
        )

    def forward(self, x):
        B, T, C = x.size()  # batch, sequence length, embedding dim

        # Project once, split into Q K V — each (B, T, C)
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)

        # Reshape into (B, n_head, T, head_size) for parallel head computation
        k = k.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        q = q.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_size).transpose(1, 2)

        # Scaled dot-product attention: (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (self.head_size ** -0.5)

        # Apply causal mask: future positions → -inf → 0 after softmax
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)

        # Weighted sum of values: (B, nh, T, hs)
        y = att @ v

        # Concatenate heads: (B, T, C)
        y = y.transpose(1, 2).contiguous().view(B, T, C)

        return self.c_proj(y)

One fused Linear(C, 3C) beats three separate projections — one GEMM instead of three. The causal mask is a buffer, not a learned parameter: it never updates during training.
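
As a sanity check on the fusion, this sketch (toy width `C=8`, my choice) slices the fused weight into three blocks and confirms they give the same Q, K, V as the single projection. It relies on `nn.Linear` storing its weight as `(out_features, in_features)`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C = 8
fused = nn.Linear(C, 3 * C, bias=False)

# Rows of the fused weight split into the Q, K, and V projection matrices.
w_q, w_k, w_v = fused.weight.split(C, dim=0)

x = torch.randn(2, 5, C)
q, k, v = fused(x).split(C, dim=2)   # one matmul, then split the last dim

same = (
    torch.allclose(q, x @ w_q.T, atol=1e-5)
    and torch.allclose(k, x @ w_k.T, atol=1e-5)
    and torch.allclose(v, x @ w_v.T, atol=1e-5)
)
print(same)
```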

Atom 3: What does the MLP block do?

After attention, each token’s vector passes through a position-wise feed-forward network. A common interpretation is that much of the model’s factual “knowledge” ends up stored here: attention routes information; the MLP transforms it.

Structure: expand to $4C$, apply GELU, compress back to $C$:

$$ \text{MLP}(x) = W_2 \cdot \text{GELU}(W_1 x + b_1) + b_2 $$

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc   = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu   = nn.GELU(approximate='tanh')  # GPT-2 uses tanh approx
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))

Parameter count: MLP per block

GPT-2 small ($C = 768$):

$$ \text{c\_fc} = C \times 4C = 768 \times 3072 = 2{,}359{,}296 $$

$$ \text{c\_proj} = 4C \times C = 3072 \times 768 = 2{,}359{,}296 $$

Bias on each linear (GPT-2 MLP has bias):

$$ \text{bias\_fc} = 3072 $$

$$ \text{bias\_proj} = 768 $$

MLP per block:

$$ 2{,}359{,}296 + 2{,}359{,}296 + 3072 + 768 = 4{,}722{,}432 $$

Rounded: about $8C^2 = 8 \times 768^2 = 4{,}718{,}592$ plus biases.

For local config ($C = 384$, hidden $= 1536$):

$$ \text{c\_fc} = 384 \times 1536 = 589{,}824 $$

$$ \text{c\_proj} = 1536 \times 384 = 589{,}824 $$

$$ \text{MLP per block} = 589{,}824 + 589{,}824 + 1536 + 384 = 1{,}181{,}568 $$

GELU over ReLU: smooth near zero, better gradient flow for language models. The $4\times$ expansion is a heuristic from the original transformer paper that stuck — it gives the MLP room to represent nonlinear functions without widening the residual stream.
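
To see how close the tanh approximation is to exact GELU, here is a standard-library sketch. The grid of test points from -4 to 4 is my own choice.

```python
import math

def gelu_exact(x: float) -> float:
    # x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # tanh approximation of the same function
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def relu(x: float) -> float:
    return max(0.0, x)

xs = [i / 10 for i in range(-40, 41)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |exact - tanh| on [-4, 4]: {max_gap:.6f}")
print(f"gelu(-1) = {gelu_exact(-1.0):.4f}, relu(-1) = {relu(-1.0):.4f}")
```

Unlike ReLU, GELU lets small negative inputs through with a small negative output, which is the smoothness near zero mentioned above.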

Atom 4: How does one transformer block fit together?

One block = pre-LayerNorm → attention → residual → pre-LayerNorm → MLP → residual.

GPT-2 uses pre-LayerNorm (normalize before each sub-layer). The original “Attention is All You Need” paper used post-LayerNorm. Pre-LN trains more stably in deep stacks because gradients flow more cleanly through the residual path.

graph TD
  X["x (input)"] --> LN1["LayerNorm 1"]
  LN1 --> ATT["CausalSelfAttention"]
  ATT --> ADD1["+ residual"]
  X --> ADD1
  ADD1 --> LN2["LayerNorm 2"]
  LN2 --> MLP["MLP"]
  MLP --> ADD2["+ residual"]
  ADD1 --> ADD2
  ADD2 --> OUT["x (output)"]

  style ATT fill:#d4edda,stroke:#009900,color:#111
  style MLP fill:#d4edda,stroke:#009900,color:#111

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp  = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # attention sub-layer
        x = x + self.mlp(self.ln_2(x))   # MLP sub-layer
        return x
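
One way to see what a clean residual path means: in this sketch I zero out a stand-in sub-layer (a plain `nn.Linear`, my assumption, not the real attention or MLP). With pre-LN the block reduces to the identity, while post-LN still rewrites the residual stream.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C = 8
ln = nn.LayerNorm(C)
sublayer = nn.Linear(C, C)        # stand-in for attention or the MLP
nn.init.zeros_(sublayer.weight)   # force the sub-layer output to zero
nn.init.zeros_(sublayer.bias)

x = torch.randn(2, 3, C) * 5.0    # deliberately large activations

pre_ln = x + sublayer(ln(x))      # GPT-2 style: normalize the sub-layer input
post_ln = ln(x + sublayer(x))     # original transformer: normalize after the add

print(torch.equal(pre_ln, x))       # identity path is untouched
print(torch.allclose(post_ln, x))   # residual stream was renormalized
```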

Parameter count: LayerNorm per block

Each LayerNorm(C) has scale $\gamma$ and shift $\beta$, each length $C$:

$$ 2 \times C = 2 \times 768 = 1536 $$

Two LayerNorms per block:

$$ 2 \times 1536 = 3072 $$

Parameter count: full block

GPT-2 small per block:

$$ \text{attention} = 2{,}359{,}296 $$

$$ \text{MLP} = 4{,}722{,}432 $$

$$ \text{LayerNorm} = 3{,}072 $$

$$ \text{per block total} = 2{,}359{,}296 + 4{,}722{,}432 + 3{,}072 = 7{,}084{,}800 $$

Across 12 blocks:

$$ 12 \times 7{,}084{,}800 = 85{,}017{,}600 $$

For local config ($C = 384$, 6 blocks):

$$ \text{attention} = 589{,}824 $$

$$ \text{MLP} = 1{,}181{,}568 $$

$$ \text{LayerNorm} = 2 \times (2 \times 384) = 1{,}536 $$

$$ \text{per block} = 589{,}824 + 1{,}181{,}568 + 1{,}536 = 1{,}772{,}928 $$

$$ 6 \text{ blocks} = 6 \times 1{,}772{,}928 = 10{,}637{,}568 $$

The residual connections (x + ...) let gradients skip directly to early layers. Without them, a 12-layer network would be far harder to train.

Atom 5: How do I assemble the full GPT-2 model?

from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 256    # context length
    vocab_size: int = 65     # 65 for char-level Shakespeare; 50257 for GPT-2 BPE
    n_layer:    int = 6      # 12 for full GPT-2 small
    n_head:     int = 6      # 12 for full GPT-2 small
    n_embd:     int = 384    # 768 for full GPT-2 small


class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config

        self.transformer = nn.ModuleDict(dict(
            wte  = nn.Embedding(config.vocab_size, config.n_embd),  # token embeddings
            wpe  = nn.Embedding(config.block_size, config.n_embd),  # position embeddings
            h    = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),                     # final LayerNorm
        ))
        # language model head: projects embeddings → vocab logits
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        # Weight tying: token embedding and lm_head share the same matrix
        self.transformer.wte.weight = self.lm_head.weight

        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.size()
        assert T <= self.config.block_size

        # Token + positional embeddings
        pos = torch.arange(T, device=idx.device)          # (T,)
        tok_emb = self.transformer.wte(idx)                # (B, T, C)
        pos_emb = self.transformer.wpe(pos)                # (T, C)
        x = tok_emb + pos_emb                              # (B, T, C)

        # Pass through transformer blocks
        for block in self.transformer.h:
            x = block(x)

        # Final LayerNorm + project to vocab
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)                           # (B, T, vocab_size)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )

        return logits, loss

    def get_num_params(self):
        return sum(p.numel() for p in self.parameters())

Final LayerNorm and total parameter count

ln_f parameters:

$$ 2 \times 768 = 1536 $$

GPT-2 small grand total (with weight tying):

$$ \text{embeddings} = 39{,}383{,}808 $$

$$ \text{12 blocks} = 85{,}017{,}600 $$

$$ \text{ln\_f} = 1{,}536 $$

$$ \text{lm\_head extra} = 0 \quad \text{(tied to wte)} $$

$$ \text{total} = 39{,}383{,}808 + 85{,}017{,}600 + 1{,}536 = 124{,}402{,}944 $$

Rounded: 124M parameters.

Without weight tying you’d add a second $V \times C$ matrix:

$$ 50257 \times 768 = 38{,}597{,}376 $$

$$ \text{untied total} = 124{,}402{,}944 + 38{,}597{,}376 = 163{,}000{,}320 $$

Weight tying saves ~39M parameters with no accuracy loss I could measure.

How does weight tying work?

self.transformer.wte.weight = self.lm_head.weight makes both layers share one matrix. Token embedding maps ID → vector. lm_head maps vector → logits over the vocabulary. Same job, opposite direction — one consistent representation per token.
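
A tiny sketch of the tie in isolation, with toy sizes `V=10`, `C=4` that I picked for the demo. It checks three things: both modules hold the same `Parameter` object, the shared matrix is counted once, and the head computes `h @ W.T` with the embedding matrix.

```python
import torch
import torch.nn as nn

V, C = 10, 4                      # toy sizes for illustration
wte = nn.Embedding(V, C)          # weight shape (V, C)
lm_head = nn.Linear(C, V, bias=False)  # weight shape (V, C) as well
wte.weight = lm_head.weight       # tie: both modules now hold the same Parameter

tied = nn.ModuleDict({"wte": wte, "lm_head": lm_head})
n_params = sum(p.numel() for p in tied.parameters())  # shared tensor counted once

h = torch.randn(1, C)
logits = lm_head(h)               # same result as h @ wte.weight.T
print(wte.weight is lm_head.weight, n_params)
```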

graph LR
  subgraph "Without weight tying  (2 separate matrices)"
    WTE1["wte.weight\n50257 × 768\n38.6M params"] -->|"token id → vector"| H1["hidden\nstates"]
    H1 --> LMH1["lm_head.weight\n768 × 50257\n38.6M params"] --> L1["logits"]
  end

  subgraph "With weight tying  (1 shared matrix)"
    WTE2["wte.weight\n50257 × 768\n38.6M params"] -->|"token id → vector"| H2["hidden\nstates"]
    H2 -->|"same matrix\ntransposed"| L2["logits"]
  end

  style WTE2 fill:#d4edda,stroke:#009900,color:#111
  style L2 fill:#d4edda,stroke:#009900,color:#111

Local config total ($V=65$, $C=384$, $L=256$, $n_{\text{layer}}=6$):

$$ \text{embeddings} = 123{,}264 $$

$$ \text{blocks} = 10{,}637{,}568 $$

$$ \text{ln\_f} = 768 $$

$$ \text{total} = 123{,}264 + 10{,}637{,}568 + 768 = 10{,}761{,}600 $$

About 10.8M parameters — trainable on a laptop in minutes.
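
The running totals above can be reproduced with a closed-form counter. This is a sketch of the arithmetic in this post, not a substitute for `model.get_num_params()`. The `attn_bias` flag is my addition, for the case where the attention linears carry biases.

```python
def count_params(V, C, L, n_layer, attn_bias=False):
    """Closed-form parameter count for the architecture described in this post."""
    emb = V * C + L * C                        # wte + wpe
    attn = C * 3 * C + C * C                   # c_attn + c_proj weights
    if attn_bias:
        attn += 3 * C + C                      # optional attention biases
    mlp = C * 4 * C + 4 * C * C + 4 * C + C    # two linears, each with bias
    ln = 2 * (2 * C)                           # ln_1 and ln_2, scale + shift each
    block = attn + mlp + ln
    ln_f = 2 * C
    return emb + n_layer * block + ln_f        # lm_head is tied: 0 extra

gpt2_small = count_params(V=50257, C=768, L=1024, n_layer=12)
local = count_params(V=65, C=384, L=256, n_layer=6)
print(f"GPT-2 small: {gpt2_small:,}")
print(f"local:       {local:,}")
```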

Full worked example: forward pass on a toy sequence

I’ll trace one forward pass with numbers small enough to write by hand. This is the sanity check I run before trusting any implementation — same logic as numerical gradient checking, but for shapes and arithmetic instead of derivatives.

Toy config:

$$ V = 4 \quad \text{(vocab: tokens 0,1,2,3)} $$

$$ C = 4 \quad \text{(embedding dim)} $$

$$ n_{\text{head}} = 2 $$

$$ d_k = C / n_{\text{head}} = 4 / 2 = 2 $$

$$ T = 3 \quad \text{(sequence length)} $$

$$ B = 1 \quad \text{(batch size 1)} $$

Input token IDs (imagine characters a, b, c):

$$ \text{idx} = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix} $$

Shape $(B, T) = (1, 3)$.

Step 1 — Token embedding lookup

Suppose wte rows (toy values):

$$ W_{\text{te}}[1] = \begin{bmatrix} 0.1 \\ 0.2 \\ 0.0 \\ 0.1 \end{bmatrix} $$

$$ W_{\text{te}}[2] = \begin{bmatrix} 0.3 \\ 0.0 \\ 0.2 \\ 0.1 \end{bmatrix} $$

$$ W_{\text{te}}[3] = \begin{bmatrix} 0.0 \\ 0.4 \\ 0.1 \\ 0.2 \end{bmatrix} $$

Stack into $X_{\text{tok}} \in \mathbb{R}^{3 \times 4}$:

$$ X_{\text{tok}} = \begin{bmatrix} 0.1 & 0.2 & 0.0 & 0.1 \\ 0.3 & 0.0 & 0.2 & 0.1 \\ 0.0 & 0.4 & 0.1 & 0.2 \end{bmatrix} $$

Step 2 — Position embedding lookup

Positions $0, 1, 2$:

$$ W_{\text{pe}}[0] = \begin{bmatrix} 0.0 \\ 0.1 \\ 0.0 \\ 0.0 \end{bmatrix} $$

$$ W_{\text{pe}}[1] = \begin{bmatrix} 0.1 \\ 0.0 \\ 0.1 \\ 0.0 \end{bmatrix} $$

$$ W_{\text{pe}}[2] = \begin{bmatrix} 0.0 \\ 0.0 \\ 0.1 \\ 0.1 \end{bmatrix} $$

Step 3 — Add token and position embeddings

Position 0:

$$ x_0 = \begin{bmatrix} 0.1 \\ 0.2 \\ 0.0 \\ 0.1 \end{bmatrix} + \begin{bmatrix} 0.0 \\ 0.1 \\ 0.0 \\ 0.0 \end{bmatrix} = \begin{bmatrix} 0.1 \\ 0.3 \\ 0.0 \\ 0.1 \end{bmatrix} $$

Position 1:

$$ x_1 = \begin{bmatrix} 0.3 \\ 0.0 \\ 0.2 \\ 0.1 \end{bmatrix} + \begin{bmatrix} 0.1 \\ 0.0 \\ 0.1 \\ 0.0 \end{bmatrix} = \begin{bmatrix} 0.4 \\ 0.0 \\ 0.3 \\ 0.1 \end{bmatrix} $$

Position 2:

$$ x_2 = \begin{bmatrix} 0.0 \\ 0.4 \\ 0.1 \\ 0.2 \end{bmatrix} + \begin{bmatrix} 0.0 \\ 0.0 \\ 0.1 \\ 0.1 \end{bmatrix} = \begin{bmatrix} 0.0 \\ 0.4 \\ 0.2 \\ 0.3 \end{bmatrix} $$

Input to block 1:

$$ X = \begin{bmatrix} 0.1 & 0.3 & 0.0 & 0.1 \\ 0.4 & 0.0 & 0.3 & 0.1 \\ 0.0 & 0.4 & 0.2 & 0.3 \end{bmatrix} $$

Shape $(T, C) = (3, 4)$.

Step 4 — Pre-LayerNorm (schematic)

Real LayerNorm subtracts the mean and divides by the standard deviation of each row. To keep the hand arithmetic short, I’ll use illustrative zero-mean rows here rather than the exact normalized values of $X$:

$$ \tilde{x}_0 = \begin{bmatrix} 0.0 \\ 0.5 \\ -0.5 \\ 0.0 \end{bmatrix} $$

$$ \tilde{x}_1 = \begin{bmatrix} 0.6 \\ -0.4 \\ 0.4 \\ -0.6 \end{bmatrix} $$

$$ \tilde{x}_2 = \begin{bmatrix} -0.5 \\ 0.5 \\ 0.0 \\ 0.0 \end{bmatrix} $$
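
For comparison, this sketch computes the exact LayerNorm of the three rows of $X$ from Step 3. It assumes unit scale and zero shift (the values at initialization) and `eps=1e-5`. The exact values differ from the rounded schematic rows above, which exist only to keep the walkthrough readable.

```python
import math

def layer_norm(row, eps=1e-5):
    # Normalize one row to zero mean and (almost exactly) unit variance.
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)   # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in row]

X = [
    [0.1, 0.3, 0.0, 0.1],
    [0.4, 0.0, 0.3, 0.1],
    [0.0, 0.4, 0.2, 0.3],
]
normed = [layer_norm(r) for r in X]
for r in normed:
    print([round(v, 3) for v in r])
```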

Step 5 — Causal self-attention, head 1

Suppose after c_attn projection, head 1 has $Q, K, V \in \mathbb{R}^{3 \times 2}$:

$$ Q = \begin{bmatrix} 1.0 & 0.0 \\ 0.5 & 0.5 \\ 0.0 & 1.0 \end{bmatrix} $$

$$ K = \begin{bmatrix} 1.0 & 0.0 \\ 0.0 & 1.0 \\ 1.0 & 1.0 \end{bmatrix} $$

$$ V = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.1 \\ 0.2 & 0.3 \end{bmatrix} $$

Scale factor:

$$ \sqrt{d_k} = \sqrt{2} \approx 1.414 $$

Raw scores $S = Q K^\top / \sqrt{d_k}$.

Row 0, col 0 (token 0 attends to token 0):

$$ q_0 \cdot k_0 = 1.0 \times 1.0 + 0.0 \times 0.0 = 1.0 $$

$$ S_{0,0} = 1.0 / 1.414 = 0.707 $$

Row 0, col 1 (token 0 → token 1, future, will be masked):

$$ q_0 \cdot k_1 = 1.0 \times 0.0 + 0.0 \times 1.0 = 0.0 $$

$$ S_{0,1} = 0.0 / 1.414 = 0.0 $$

Row 1, col 0:

$$ q_1 \cdot k_0 = 0.5 \times 1.0 + 0.5 \times 0.0 = 0.5 $$

$$ S_{1,0} = 0.5 / 1.414 = 0.354 $$

Row 1, col 1:

$$ q_1 \cdot k_1 = 0.5 \times 0.0 + 0.5 \times 1.0 = 0.5 $$

$$ S_{1,1} = 0.5 / 1.414 = 0.354 $$

Row 2, col 0:

$$ q_2 \cdot k_0 = 0.0 \times 1.0 + 1.0 \times 0.0 = 0.0 $$

$$ S_{2,0} = 0.0 $$

Row 2, col 1:

$$ q_2 \cdot k_1 = 0.0 \times 0.0 + 1.0 \times 1.0 = 1.0 $$

$$ S_{2,1} = 1.0 / 1.414 = 0.707 $$

Row 2, col 2:

$$ q_2 \cdot k_2 = 0.0 \times 1.0 + 1.0 \times 1.0 = 1.0 $$

$$ S_{2,2} = 1.0 / 1.414 = 0.707 $$

Score matrix after applying the causal mask:

$$ S_{\text{masked}} = \begin{bmatrix} 0.707 & -\infty & -\infty \\ 0.354 & 0.354 & -\infty \\ 0.0 & 0.707 & 0.707 \end{bmatrix} $$

(Every entry above the diagonal is set to $-\infty$, including $S_{0,1}$, so each row sees only its own and earlier positions.)

Step 6 — Softmax per row (each query position)

Row 0 — only position 0 is valid:

$$ e^{0.707} = 2.028 $$

$$ Z_0 = 2.028 $$

$$ w_{0,0} = 2.028 / 2.028 = 1.0 $$

Row 1 — positions 0 and 1:

$$ e^{0.354} = 1.425 $$

$$ Z_1 = 1.425 + 1.425 = 2.850 $$

$$ w_{1,0} = 1.425 / 2.850 = 0.5 $$

$$ w_{1,1} = 1.425 / 2.850 = 0.5 $$

Row 2 — positions 0, 1, 2:

$$ e^{0} = 1.0 $$

$$ e^{0.707} = 2.028 $$

$$ Z_2 = 1.0 + 2.028 + 2.028 = 5.056 $$

$$ w_{2,0} = 1.0 / 5.056 = 0.198 $$

$$ w_{2,1} = 2.028 / 5.056 = 0.401 $$

$$ w_{2,2} = 2.028 / 5.056 = 0.401 $$

Step 7 — Weighted sum of values (head 1 output)

Output row 0:

$$ o_0 = 1.0 \times v_0 = \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix} $$

Output row 1:

$$ o_1 = 0.5 \times v_0 + 0.5 \times v_1 $$

$$ o_1 = 0.5 \times \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix} + 0.5 \times \begin{bmatrix} 0.3 \\ 0.1 \end{bmatrix} $$

$$ o_1 = \begin{bmatrix} 0.05 \\ 0.10 \end{bmatrix} + \begin{bmatrix} 0.15 \\ 0.05 \end{bmatrix} = \begin{bmatrix} 0.20 \\ 0.15 \end{bmatrix} $$

Output row 2:

$$ o_2 = 0.198 \times v_0 + 0.401 \times v_1 + 0.401 \times v_2 $$

$$ o_2 = 0.198 \times \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix} + 0.401 \times \begin{bmatrix} 0.3 \\ 0.1 \end{bmatrix} + 0.401 \times \begin{bmatrix} 0.2 \\ 0.3 \end{bmatrix} $$

$$ o_2 = \begin{bmatrix} 0.0198 \\ 0.0396 \end{bmatrix} + \begin{bmatrix} 0.1203 \\ 0.0401 \end{bmatrix} + \begin{bmatrix} 0.0802 \\ 0.1203 \end{bmatrix} $$

$$ o_2 = \begin{bmatrix} 0.2203 \\ 0.2000 \end{bmatrix} $$

Head 1 contributes half the channels. Head 2 runs the same math on the other half, then c_proj mixes them back to width $C = 4$. After attention + residual, the block feeds the MLP, another residual, and the next block (if any).
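
To check the hand arithmetic for head 1, here is a standard-library sketch using the same $Q$, $K$, $V$. Instead of adding $-\infty$, it skips future keys, which gives the same softmax result.

```python
import math

Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.3]]
d_k = 2

def causal_attention(Q, K, V, d_k):
    out, weights = [], []
    for t, q in enumerate(Q):
        # Only keys at positions 0..t are visible to query t.
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
                  for j in range(t + 1)]
        exps = [math.exp(s) for s in scores]
        z = sum(exps)
        w = [e / z for e in exps] + [0.0] * (len(K) - t - 1)  # future gets weight 0
        weights.append(w)
        out.append([sum(w[j] * V[j][c] for j in range(len(V)))
                    for c in range(len(V[0]))])
    return out, weights

out, weights = causal_attention(Q, K, V, d_k)
for t, row in enumerate(out):
    print(f"o_{t} = {[round(v, 4) for v in row]}")
```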

Step 8 — Final LayerNorm and logits

After all blocks and ln_f, suppose the hidden state at position 2 (last token, predicting the next character) is:

$$ h_2 = \begin{bmatrix} 0.2 \\ 0.1 \\ -0.1 \\ 0.3 \end{bmatrix} $$

lm_head is a linear map $W_{\text{head}} \in \mathbb{R}^{4 \times 4}$ (tied to wte in the real model). Toy logits:

$$ z = h_2^\top W_{\text{head}} $$

Suppose the four logits are:

$$ z = \begin{bmatrix} 1.0 \\ 0.5 \\ 2.0 \\ -0.5 \end{bmatrix} $$

Step 9 — Cross-entropy loss for one position

Target token ID at position 2 is $y = 0$ (the model should predict token 0 next).

Softmax:

$$ e^{1.0} = 2.718 $$

$$ e^{0.5} = 1.649 $$

$$ e^{2.0} = 7.389 $$

$$ e^{-0.5} = 0.607 $$

$$ Z = 2.718 + 1.649 + 7.389 + 0.607 = 12.363 $$

$$ p_0 = 2.718 / 12.363 = 0.220 $$

$$ p_1 = 1.649 / 12.363 = 0.133 $$

$$ p_2 = 7.389 / 12.363 = 0.598 $$

$$ p_3 = 0.607 / 12.363 = 0.049 $$

Loss (negative log likelihood of correct class):

$$ \mathcal{L} = -\log(p_0) = -\log(0.220) = 1.514 $$

The model assigns only 22% probability to the correct token — high loss, as expected for an untrained network. Training pushes $p_0$ toward 1.0 across millions of such steps.
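
The same loss can be reproduced in a few lines. This sketch subtracts the max logit before exponentiating (a standard stability trick that does not change the result) and also prints the loss of a uniform guess over the 4-token vocabulary for reference.

```python
import math

logits = [1.0, 0.5, 2.0, -0.5]
target = 0

m = max(logits)                               # subtract max for numerical stability
exps = [math.exp(z - m) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]
loss = -math.log(probs[target])
uniform_loss = math.log(len(logits))          # loss of a uniform guess over 4 tokens

print([round(p, 3) for p in probs])
print(f"loss = {loss:.3f}, uniform baseline = {uniform_loss:.3f}")
```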

Shape audit for this toy run:

$$ \text{idx}: (1, 3) $$

$$ \text{tok\_emb}: (1, 3, 4) $$

$$ \text{pos\_emb}: (3, 4) \rightarrow \text{broadcast to } (1, 3, 4) $$

$$ \text{after block}: (1, 3, 4) $$

$$ \text{logits}: (1, 3, 4) $$

$$ \text{loss}: \text{scalar (mean over } B \times T \text{ positions)} $$

How do I train GPT-2 locally?

flowchart LR
  DATA["get_batch\n(x, y)"] --> FWD["Forward pass\nlogits, loss"]
  FWD --> ZERO["zero_grad"]
  ZERO --> BWD["loss.backward\ncompute grads"]
  BWD --> CLIP["clip_grad_norm_\nprevent explosion"]
  CLIP --> STEP["optimizer.step\nupdate weights"]
  STEP -->|"next iteration"| DATA

  STEP -->|"every 500 steps"| EVAL["estimate_loss\ntrain + val"]
  EVAL -->|"log"| LOG["step N | train X | val Y"]

device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
print(f"Using device: {device}")

config = GPTConfig()
model  = GPT(config).to(device)
print(f"Parameters: {model.get_num_params():,}")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

@torch.no_grad()
def estimate_loss(eval_iters=100):
    model.eval()
    losses = {}
    for split in ['train', 'val']:
        total = 0.0
        for _ in range(eval_iters):
            x, y = get_batch(split)
            _, loss = model(x, y)
            total += loss.item()
        losses[split] = total / eval_iters
    model.train()
    return losses

max_iters   = 5000
eval_every  = 500

for step in range(max_iters):
    x, y = get_batch('train')
    logits, loss = model(x, y)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    # Gradient clipping — prevents exploding gradients in deep networks
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

    if step % eval_every == 0 or step == max_iters - 1:
        losses = estimate_loss()
        print(f"step {step:5d} | train {losses['train']:.4f} | val {losses['val']:.4f}")

Expected output on a MacBook M2 (~5 min):

step     0 | train 4.2891 | val 4.2934
step   500 | train 2.1034 | val 2.1712
step  1000 | train 1.7823 | val 1.9140
step  2000 | train 1.5601 | val 1.7832
step  5000 | train 1.3244 | val 1.6109

What each line does:

get_batch samples random windows — no epoch boundary, infinite stream.
Forward pass computes cross-entropy over all $B \times T$ positions at once.
zero_grad(set_to_none=True) clears old gradients (PyTorch accumulates by default).
clip_grad_norm_(..., 1.0) caps the global gradient norm — without this, early training can blow up weights.
AdamW with lr=3e-4 and weight_decay=0.1 is a common starting point for nanoGPT-style training and works out of the box here.
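
To show what capping the global gradient norm means, here is a plain-Python sketch of the idea. It is an illustration of the technique, not PyTorch's implementation; `clip_global_norm` and the toy gradients are hypothetical.

```python
import math

def clip_global_norm(grads, max_norm=1.0, eps=1e-6):
    # One norm over ALL gradient values, not one per tensor.
    total = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # Scale everything by the same factor so the direction is preserved.
    scale = min(1.0, max_norm / (total + eps))
    return [[g * scale for g in tensor] for tensor in grads], total

grads = [[3.0, 4.0], [12.0]]     # toy gradients: global norm is 13
clipped, norm_before = clip_global_norm(grads)
norm_after = math.sqrt(sum(g * g for t in clipped for g in t))
print(f"norm before = {norm_before:.2f}, after = {norm_after:.4f}")
```

Gradients whose global norm is already below `max_norm` pass through unchanged.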

For character-level Shakespeare, val loss around 1.6 means the model assigns a geometric-mean probability of roughly $e^{-1.6} \approx 0.20$ to the correct next character, which is enough to generate recognizable Shakespeare-ish text.
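
A quick way to turn a loss value into something interpretable, as a sketch. The uniform-guess loss $\ln 65$ also gives a reference point for the step-0 loss, which should land near it for a freshly initialized model.

```python
import math

val_loss = 1.6
geo_mean_prob = math.exp(-val_loss)   # about 0.20
perplexity = math.exp(val_loss)       # about 4.95 effective choices per character
chance_loss = math.log(65)            # about 4.17: uniform guess over 65 characters

print(f"p = {geo_mean_prob:.3f}, perplexity = {perplexity:.2f}, chance = {chance_loss:.3f}")
```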

How do I generate text after training?

@torch.no_grad()
def generate(model, prompt, max_new_tokens=200, temperature=1.0, top_k=40):
    model.eval()
    idx = torch.tensor(encode(prompt), dtype=torch.long, device=device).unsqueeze(0)

    for _ in range(max_new_tokens):
        # Crop context to block_size
        idx_cond = idx[:, -model.config.block_size:]

        logits, _ = model(idx_cond)
        logits = logits[:, -1, :]          # take last position: (1, vocab_size)

        # Temperature: scale logits before softmax
        logits = logits / temperature

        # Top-k: set all but the k highest logits to -inf (zero probability after softmax)
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = float('-inf')

        probs = F.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_tok], dim=1)

    return decode(idx[0].tolist())

print(generate(model, prompt="\n", max_new_tokens=300, temperature=0.8))

Temperature scales logits before softmax (written $\tau$ here to avoid a clash with the sequence length $T$):

$$ \text{logits}' = \text{logits} / \tau $$

$\tau < 1$: sharper distribution, more predictable
$\tau = 1$: raw model distribution
$\tau > 1$: flatter distribution, more random

Top-k sets all but the $k$ highest logits to $-\infty$ before sampling, so those tokens get zero probability. This cuts nonsense tokens while keeping variety.
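
Here is a standard-library sketch of both knobs on made-up logits. `top_k_filter` is a hypothetical helper that mirrors the masking step in `generate`.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(logits, k):
    # Keep the k largest logits; push the rest to -inf.
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [z if z >= cutoff else float("-inf") for z in logits]

logits = [2.0, 1.0, 0.5, -1.0]                 # made-up logits
p_sharp = softmax([z / 0.5 for z in logits])   # temperature 0.5
p_raw = softmax(logits)                        # temperature 1.0
p_flat = softmax([z / 2.0 for z in logits])    # temperature 2.0
p_top2 = softmax(top_k_filter(logits, 2))      # only two tokens can be sampled

for name, p in [("0.5", p_sharp), ("1.0", p_raw), ("2.0", p_flat), ("top2", p_top2)]:
    print(name, [round(v, 3) for v in p])
```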

flowchart TD
  P["prompt tokens\n(1, T)"] --> CROP["crop to block_size"]
  CROP --> FWD["model forward pass\n→ logits (1, T, V)"]
  FWD --> LAST["take last position\nlogits (1, V)"]
  LAST --> TEMP["÷ temperature\nscale sharpness"]
  TEMP --> TOPK["zero all but top-k\nlogits"]
  TOPK --> SOFT["softmax → probs\n(1, V)"]
  SOFT --> SAMPLE["multinomial sample\n→ next token id"]
  SAMPLE --> CAT["cat to idx\n(1, T+1)"]
  CAT -->|"repeat until\nmax_new_tokens"| CROP

  style SAMPLE fill:#d4edda,stroke:#009900,color:#111
  style SOFT fill:#d4edda,stroke:#009900,color:#111

During generation, idx grows every step. Crop to block_size before each forward pass or wpe will index out of range.

How many parameters does each piece of GPT-2 have?

Full breakdown for the local Shakespeare config ($n_{\text{layer}}=6$, $n_{\text{head}}=6$, $n_{\text{embd}}=384$, $V=65$):

Component	Parameters
Token embedding `wte`	$65 \times 384 = 24{,}960$
Position embedding `wpe`	$256 \times 384 = 98{,}304$
Per block: attention QKV	$384 \times 1152 = 442{,}368$
Per block: attention proj	$384 \times 384 = 147{,}456$
Per block: MLP fc	$384 \times 1536 = 589{,}824$
Per block: MLP proj	$1536 \times 384 = 589{,}824$
Per block: 2× LayerNorm	$2 \times 768 = 1{,}536$
Per block total	$1{,}771{,}008$
6 blocks	$10{,}626{,}048$
Final LayerNorm	$768$
`lm_head` (tied to `wte`)	$0$ extra
Total	$10{,}750{,}080 \approx 10.8\text{M}$

For full GPT-2 small (12 layers, 768 embd, vocab 50257): 124M parameters.
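The same bookkeeping can be sketched as a small function. `count_params` is a hypothetical helper, not part of the model code. It assumes a 4× MLP expansion, LayerNorm with both weight and bias, a tied `lm_head`, bias-free linear layers for the Shakespeare config, and biased linear layers plus a 1024-token context for GPT-2 small.

```python
def count_params(n_layer, n_embd, vocab_size, block_size, linear_bias):
    """Count parameters for a GPT-2-style model with a tied lm_head."""
    d = n_embd
    b = 1 if linear_bias else 0
    layernorm = 2 * d                                      # weight + bias (assumed)
    attn = (d * 3 * d + b * 3 * d) + (d * d + b * d)       # QKV + output projection
    mlp = (d * 4 * d + b * 4 * d) + (4 * d * d + b * d)    # fc + proj
    block = attn + mlp + 2 * layernorm
    embeddings = vocab_size * d + block_size * d           # wte + wpe
    return embeddings + n_layer * block + layernorm        # + final LayerNorm

shakespeare = count_params(6, 384, 65, 256, linear_bias=False)
gpt2_small = count_params(12, 768, 50257, 1024, linear_bias=True)
print(f"{shakespeare:,}  {gpt2_small:,}")
```

Under those assumptions the two results round to the 10.8M and 124M figures quoted above. Comparing the output against `sum(p.numel() for p in model.parameters())` on the real model is a quick way to catch a layer that was wired differently than intended.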

What makes this architecture GPT-2 specifically?

GPT-1 → GPT-2 was mostly scale, but a few design choices stuck:

Pre-LayerNorm instead of post-LayerNorm — better gradient flow in deep stacks
Final LayerNorm (ln_f) after all blocks
Weight initialization scaled by depth: residual projections use $\frac{0.02}{\sqrt{2 \cdot n_{\text{layer}}}}$ so the residual stream doesn’t explode
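No bias in the linear layers: a nanoGPT-style simplification that drops a few parameters, not an original GPT-2 choice (the released GPT-2 weights include biases)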
Vocabulary expanded to 50257 via BPE
Weight tying between wte and lm_head
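The depth-scaled initialization can be motivated with a toy simulation. This is a sketch under a strong simplifying assumption: each of the $2 \cdot n_{\text{layer}}$ residual additions is treated as an independent Gaussian draw, which a real network does not satisfy. It only shows why dividing by $\sqrt{2 \cdot n_{\text{layer}}}$ keeps the sum at the base scale.

```python
import math
import random
import statistics

random.seed(0)
n_layer = 12
n_adds = 2 * n_layer          # one attention + one MLP contribution per block
base_std = 0.02
n_samples = 5000

def stream_std(per_add_std):
    """Std of a sum of n_adds independent Gaussian contributions."""
    sums = [sum(random.gauss(0.0, per_add_std) for _ in range(n_adds))
            for _ in range(n_samples)]
    return statistics.pstdev(sums)

unscaled = stream_std(base_std)                        # grows like sqrt(n_adds)
scaled = stream_std(base_std / math.sqrt(n_adds))      # stays near base_std
print(f"unscaled: {unscaled:.4f}  scaled: {scaled:.4f}")
```

Without scaling, the summed contribution grows with the square root of the number of additions; with it, the sum stays close to the base 0.02.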

The GPT-2 paper describes these architecture changes only briefly, in a short model section, so most of the detail has to come from code. Most implementations I trust, including this one, trace back to Karpathy’s nanoGPT, which is the clearest reference code I’ve found.

What are the common implementation pitfalls?

Forgetting to shift targets. Token $t$ predicts token $t+1$. If the same x is used as both input and target, the model just learns to copy its input.
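A minimal sketch of the shift, using plain Python lists and a made-up character-level corpus. `get_example` is a hypothetical helper; the same two slices work unchanged on a 1-D tensor of token IDs.

```python
text = "hello world"                               # toy corpus
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
data = [stoi[ch] for ch in text]

def get_example(data, start, block_size):
    """Return (x, y) where y is x shifted one token to the left."""
    x = data[start : start + block_size]
    y = data[start + 1 : start + block_size + 1]
    return x, y

x, y = get_example(data, start=0, block_size=4)
for t in range(len(x)):
    print(f"input {chars[x[t]]!r} -> target {chars[y[t]]!r}")
```

A cheap guard in a real data loader is to assert that `y[:-1]` equals `x[1:]` on the first batch.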

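Not zeroing gradients. PyTorch accumulates gradients across .backward() calls by default. Always call optimizer.zero_grad() before .backward(). set_to_none=True (the default since PyTorch 2.0) frees the gradient tensors instead of filling them with zeros, which is a small speed win.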

No gradient clipping. Without clip_grad_norm_, a single oversized gradient early in training can push the weights somewhere the run never recovers from, typically showing up as a loss spike or NaNs.
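The last two pitfalls fit in one training step. This is a sketch with a stand-in `nn.Linear` model and deliberately oversized inputs to provoke a large gradient; the `max_norm` of 1.0 is an assumption for illustration, not a recommendation from the text above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)                          # stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x = torch.randn(8, 4) * 1000.0                   # deliberately huge inputs -> huge gradients
y = torch.zeros(8, 1)
max_norm = 1.0                                   # assumed clipping threshold

optimizer.zero_grad(set_to_none=True)            # clear gradients from the previous step
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# clip_grad_norm_ returns the total norm measured before clipping,
# then rescales the gradients in place.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
optimizer.step()

print(f"before: {norm_before.item():.1f}  after: {norm_after.item():.4f}")
```

Logging the returned pre-clip norm each step is a cheap way to see when training is close to blowing up.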

Context overflow during generation. idx grows each step. Crop to block_size before the forward pass.

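Weight tying silently lost on reload. The default load_state_dict copies values into the parameters that already exist, so a model that ties wte and lm_head in its constructor stays tied. The tie is lost if the model is rebuilt without it: you get two separate tensors with equal values that drift apart during training. After loading, check model.transformer.wte.weight is model.lm_head.weight, and re-tie if needed: model.transformer.wte.weight = model.lm_head.weight.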
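Whether the tie survives a reload depends on how the model is rebuilt, so it is worth checking rather than assuming. The sketch below uses `TinyLM`, a hypothetical two-layer stand-in with the same `wte` / `lm_head` relationship, and an in-memory buffer in place of a checkpoint file.

```python
import io
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Hypothetical stand-in with the same wte / lm_head tie."""
    def __init__(self, vocab_size=10, n_embd=4, tie=True):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        if tie:
            self.lm_head.weight = self.wte.weight   # one Parameter, two names

def is_tied(m):
    return m.lm_head.weight is m.wte.weight

# Save a tied model's state_dict to an in-memory buffer.
buffer = io.BytesIO()
torch.save(TinyLM(tie=True).state_dict(), buffer)

buffer.seek(0)
tied = TinyLM(tie=True)
tied.load_state_dict(torch.load(buffer))
tied_after_load = is_tied(tied)                     # constructor tie survives the load

buffer.seek(0)
untied = TinyLM(tie=False)                          # constructor forgot the tie
untied.load_state_dict(torch.load(buffer))
untied_after_load = is_tied(untied)                 # equal values, separate tensors

untied.lm_head.weight = untied.wte.weight           # re-tie by hand
print(tied_after_load, untied_after_load, is_tied(untied))
```

An `assert is_tied(model)` right after loading a checkpoint costs nothing and catches the untied case before training quietly diverges the two matrices.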

Trusting loss alone. A subtly broken model can still show falling loss. I verify with: (1) tensor-shape assertions, (2) gradient check against PyTorch on a tiny model, (3) deliberate overfit on one batch — loss should approach zero.
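The overfit-one-batch check can be sketched on a tiny stand-in model. Here a bigram table (`nn.Embedding` used as a logit lookup) replaces the GPT, and the batch, learning rate, and step count are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 4
model = nn.Embedding(vocab_size, vocab_size)     # bigram table as a stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

x = torch.tensor([[0, 1, 2, 3]])                 # one fixed batch, shape (B=1, T=4)
y = torch.tensor([[1, 2, 3, 0]])                 # a deterministic next-token mapping

for step in range(200):
    logits = model(x)                            # (B, T, vocab_size)
    assert logits.shape == (1, 4, vocab_size)    # shape assertion on every step
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

final_loss = loss.item()
print(f"final loss: {final_loss:.4f}")
```

If the loss on a single memorizable batch plateaus well above zero, the bug is usually in the data shift, the mask, or the loss reshaping rather than in the optimizer.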

What are the tensor shapes during full-scale training?

For GPT-2 small with $B=8$, $T=1024$, $d=768$, $h=12$:

$$ \text{token IDs}: (8, 1024) $$

$$ \text{hidden states}: (8, 1024, 768) $$

$$ d_h = 768 / 12 = 64 $$

$$ Q, K, V \text{ per head}: (8, 12, 1024, 64) $$

$$ \text{attention scores}: (8, 12, 1024, 1024) $$

One score tensor holds:

$$ 8 \times 12 \times 1024^2 = 100{,}663{,}296 \text{ values} $$

At 2 bytes per value (fp16):

$$ 100{,}663{,}296 \times 2 = 201{,}326{,}592 \text{ bytes} $$

$$ 201{,}326{,}592 / 2^{20} \approx 192 \text{ MiB} $$

That’s one intermediate in one layer, before gradients and optimizer state. The score tensor grows with $T^2$, so sequence length, not just parameter count, drives training memory. This is why Flash Attention and context-length tricks matter at scale.
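The same arithmetic as a small function makes the quadratic growth easy to see. `attention_scores_mib` is a hypothetical helper, and the sequence lengths other than 1024 are only there for comparison.

```python
def attention_scores_mib(batch, n_head, seq_len, bytes_per_value=2):
    """Size in MiB of one (B, h, T, T) attention-score tensor."""
    n_values = batch * n_head * seq_len * seq_len
    return n_values * bytes_per_value / 2**20

for T in (256, 1024, 2048, 4096):
    print(f"T={T:5d}: {attention_scores_mib(8, 12, T):8.1f} MiB per layer")
```

Doubling the sequence length quadruples the figure, and a naive implementation holds one such tensor per layer for the backward pass.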

Per block parameter budget:

$$ \text{attention} \approx 4d^2 = 4 \times 768^2 = 2{,}359{,}296 $$

$$ \text{MLP} \approx 8d^2 = 8 \times 768^2 = 4{,}718{,}592 $$

$$ \text{per block} \approx 7.08 \text{M} $$

$$ 12 \text{ blocks} \approx 85 \text{M} $$

Plus ~39M in embeddings. Weight tying avoids a second $V \times d$ output matrix.
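As a rough check, adding the two embedding tables to the block total recovers the headline figure. This sketch assumes the 1024-token context from the shapes above and ignores biases and LayerNorm parameters, as the $4d^2$ and $8d^2$ approximations do:

$$ \text{embeddings} = 50257 \times 768 + 1024 \times 768 = 38{,}597{,}376 + 786{,}432 = 39{,}383{,}808 $$

$$ \text{total} \approx 12 \times 7{,}077{,}888 + 39{,}383{,}808 = 84{,}934{,}656 + 39{,}383{,}808 = 124{,}318{,}464 \approx 124\text{M} $$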

Building GPT-2 from scratch was the point where transformers stopped being a diagram and became code I could break. Attention in isolation is a weighted sum. Embeddings in isolation are a lookup table. Snap a stack of blocks together with causal masking, pre-LayerNorm, weight tying, and a shifted cross-entropy target, and you have a model that writes Shakespeare on a laptop. I still re-run the toy forward pass whenever I change anything. The numbers don’t lie.

Watch It Built Live

Andrej Karpathy’s full lecture reproducing GPT-2 (124M) from scratch: https://www.youtube.com/watch?v=l8pRSuU81PU

References

Primary sources — code and papers

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. OpenAI Blog. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Karpathy, A. (2024). build-nanoGPT: Video+code lecture on building nanoGPT from scratch. GitHub. https://github.com/karpathy/build-nanogpt
Karpathy, A. (2023). nanoGPT: The simplest, fastest repository for training/finetuning medium-sized GPTs. GitHub. https://github.com/karpathy/nanoGPT
Karpathy, A. (2024). Let’s reproduce GPT-2 (124M). YouTube. https://www.youtube.com/watch?v=l8pRSuU81PU

Implementation guides read

Belaweid, A. (2025). GPT-2 Implementation From Scratch For Dummies! Substack. https://azizbelaweid.substack.com/p/gpt-2-implementation-from-scratch
RecsysML. (2025). Learning ML by doing: Training GPT-2 on a budget. Substack. https://recsysml.substack.com/p/training-gpt-2-on-a-budget
Wolfe, C. (2024). Implementation of causal self-attention in PyTorch. GitHub Gist. https://gist.github.com/wolfecameron/26863dbbc322b15d2e224a2569868256

Theory and architecture

Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS. https://arxiv.org/abs/1706.03762
Raschka, S. (2023). Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch. https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html
Ravi, S. (2025). GPT-2 Architecture Demystified: A Step-by-Step Breakdown. Medium. https://sararavi14.medium.com/gpt-2-architecture-demystified-a-step-by-step-breakdown-74b1c5c80d17

Official documentation

HuggingFace. GPT-2 Model Documentation. HuggingFace Transformers. https://huggingface.co/docs/transformers/model_doc/gpt2