Building GPT-2 from Scratch

A complete walkthrough of GPT-2's architecture — tokenization, causal self-attention, transformer blocks, weight tying — implemented in pure PyTorch and trained locally on your machine.

GPT-2 is a decoder-only transformer that predicts the next token in a sequence. That one sentence describes the entire model. Everything else (the 12 attention heads, the causal mask, the weight tying, the LayerNorm placement) is a set of engineering decisions that make it work well and train stably.

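This post builds the GPT-2 architecture from scratch in PyTorch, explains every decision, and trains a scaled-down configuration (about 10.8M parameters) locally on the TinyShakespeare dataset. The same code scales to the full GPT-2 small (124M parameters) by changing the config values.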

GPT-2 transformer block architecture — attention, MLP, and LayerNorm layers with data streams


What GPT-2 Actually Is

GPT-2 is a language model: given a sequence of tokens, predict what comes next. Trained on enough text, the model learns grammar, facts, reasoning patterns, and style — because all of those are regularities in next-token prediction.

The architecture is a stack of transformer blocks, each with two sub-layers:

  1. Causal multi-head self-attention
  2. Position-wise MLP (feed-forward)

“Causal” means each token can attend only to itself and the tokens before it, with no peeking at the future.

graph TD
  TOK["Token IDs\n(B, T)"] --> EMB
  POS["Position IDs\n0 … T-1"] --> PEMB

  subgraph "Embeddings"
    EMB["wte\nToken Embedding\nvocab × C"]
    PEMB["wpe\nPosition Embedding\nT × C"]
    ADD["+ add\n(B, T, C)"]
    EMB --> ADD
    PEMB --> ADD
  end

  ADD --> B1

  subgraph "× 12 Transformer Blocks"
    B1["Block\nLayerNorm → Attention → Residual\nLayerNorm → MLP → Residual"]
    B2["Block ..."]
    B12["Block 12"]
    B1 --> B2 --> B12
  end

  B12 --> LNF["Final LayerNorm"]
  LNF --> LMH["lm_head\nLinear C → vocab\n(weight tied to wte)"]
  LMH --> LOGITS["Logits\n(B, T, vocab_size)"]
  LOGITS --> LOSS["Cross-entropy loss\nvs shifted targets"]

  style ADD fill:#d4edda,stroke:#009900,color:#111
  style LMH fill:#d4edda,stroke:#009900,color:#111
  style LOSS fill:#f8d7da,stroke:#990000,color:#111

GPT-2 small (124M) specs:

| Hyperparameter | Value |
|---|---|
| Layers (n_layer) | 12 |
| Attention heads (n_head) | 12 |
| Embedding dim (n_embd) | 768 |
| Context length (block_size) | 1024 |
| Vocabulary size | 50257 |
| MLP hidden dim | 4 × 768 = 3072 |

The Data Pipeline

We use TinyShakespeare — 1MB of Shakespeare plays, simple to load, complex enough to learn structure.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Download dataset
import urllib.request
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
urllib.request.urlretrieve(url, "input.txt")

with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

print(f"Dataset: {len(text):,} characters")

Tokenizer. For a from-scratch local run, character-level tokenization is simplest, and it is what the rest of this post uses. To match GPT-2’s real BPE tokenization (vocabulary size 50257), swap in tiktoken.

# Character-level (fast, local, no deps)
chars = sorted(set(text))
vocab_size = len(chars)          # 65 for Shakespeare
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for i, c in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)
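
As a quick sanity check, the character-level tokenizer should round-trip any string built from characters it has seen. The sketch below uses a short stand-in string (`sample_text`, a hypothetical placeholder rather than the real dataset) so it runs without downloading anything.

```python
# Minimal sketch of the character-level tokenizer on a toy corpus.
# `sample_text` is a stand-in for the real dataset, used only for illustration.
sample_text = "to be or not to be"

chars = sorted(set(sample_text))                 # unique characters, sorted
stoi = {c: i for i, c in enumerate(chars)}       # char -> integer ID
itos = {i: c for i, c in enumerate(chars)}       # integer ID -> char

def encode(s):
    """Map a string to a list of integer token IDs."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of integer token IDs back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode("not to be")
roundtrip = decode(ids)
print(len(chars), ids, repr(roundtrip))
```

A character that never appears in the corpus raises a KeyError in `encode`, which is one practical limitation of this scheme compared with a byte-level BPE tokenizer.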

Train / val split and batch sampler:

n = int(0.9 * len(data))
train_data = data[:n]
val_data   = data[n:]

block_size = 256   # context length (use 1024 for full GPT-2)
batch_size = 64

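def get_batch(split):
    # Sample batch_size random windows of length block_size from the chosen split.
    # Note: `device` is set in the Training Loop section; define it before calling this.
    d = train_data if split == 'train' else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i    : i + block_size    ] for i in ix])
    y = torch.stack([d[i + 1: i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)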

The targets y are the inputs x shifted left by one — token $t$ predicts token $t+1$.
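
To make the shift concrete, here is a toy version using plain Python lists instead of tensors. The token IDs and the start offset are made up for illustration; only the slicing pattern matches get_batch.

```python
# Toy illustration of how inputs and targets line up (plain lists, no PyTorch).
tokens = [10, 11, 12, 13, 14, 15]   # hypothetical token IDs
block_size = 4
i = 1                                # hypothetical start offset of the window

x = tokens[i     : i + block_size]       # inputs
y = tokens[i + 1 : i + block_size + 1]   # targets: same window shifted by one

# At position t the model has seen x[:t+1] and must predict y[t].
for t in range(block_size):
    print(f"context {x[:t + 1]} -> target {y[t]}")
```

Each window therefore yields block_size training examples at once, one per position.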


Causal Self-Attention

This is the heart of GPT-2. Every token asks: “how much should I attend to myself and to each previous token?”

The math

graph LR
  subgraph "Attention scores for one head"
    X2["X\n(B,T,C)"] --> WQ["× W_Q"] & WK["× W_K"] & WV["× W_V"]
    WQ --> Q["Q\n(B,T,d_k)"]
    WK --> K["K\n(B,T,d_k)"]
    WV --> V["V\n(B,T,d_v)"]
    Q & K --> DOT["QKᵀ / √d_k\n(B,T,T)"]
    DOT --> MASK["+ causal mask\n−∞ for future"]
    MASK --> SM["softmax\n(B,T,T)"]
    SM --> ATTN["× V\n(B,T,d_v)"]
  end
  style MASK fill:#f8d7da,stroke:#990000,color:#111
  style SM fill:#d4edda,stroke:#009900,color:#111

For each token position, compute query $Q$, key $K$, value $V$ vectors:

$$ Q = XW_Q, \quad K = XW_K, \quad V = XW_V $$

Attention scores (scaled dot-product):

$$ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V $$

where $M$ is the causal mask — $-\infty$ for future positions, $0$ elsewhere. After softmax, $-\infty$ becomes $0$, so future tokens get zero weight.
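
The effect of the mask can be checked by hand on a tiny example. The following sketch uses plain Python and made-up scores for a 3-token sequence; it is an illustration of the formula, not the batched PyTorch code used later.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats (may contain -inf)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

T = 3
# Hypothetical pre-softmax scores QK^T / sqrt(d_k) for a 3-token sequence.
scores = [[0.5, 1.0, 2.0],
          [1.0, 0.0, 3.0],
          [0.2, 0.4, 0.6]]

# Causal mask M: 0 where key position j <= query position i, -inf otherwise.
mask = [[0.0 if j <= i else float("-inf") for j in range(T)] for i in range(T)]

weights = [softmax([scores[i][j] + mask[i][j] for j in range(T)]) for i in range(T)]
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its row is entirely concentrated on position 0, and every entry above the diagonal is exactly zero.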

The $\frac{1}{\sqrt{d_k}}$ scaling prevents the dot products from growing large as $d_k$ increases. Without it, softmax saturates and gradients vanish.
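
One way to see why the scaling matters: if the components of a query and key are independent with unit variance, their dot product has variance roughly $d_k$. The sketch below estimates this empirically with random Gaussian vectors (an assumption made for illustration; real queries and keys are not exactly Gaussian).

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable

def dot_product_variance(d_k, scaled, n_samples=2000):
    """Empirical variance of q.k for random unit-variance q, k of size d_k."""
    values = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        if scaled:
            dot /= d_k ** 0.5
        values.append(dot)
    return statistics.pvariance(values)

for d_k in (4, 64):
    raw_var = dot_product_variance(d_k, scaled=False)
    scaled_var = dot_product_variance(d_k, scaled=True)
    print(f"d_k={d_k:3d}  raw variance ~ {raw_var:6.1f}  scaled variance ~ {scaled_var:4.2f}")
```

The raw variance should grow roughly in proportion to d_k, while the scaled variance stays near 1 regardless of head size, which keeps the softmax inputs in a range where it does not saturate.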

Multi-head attention

Instead of one attention operation, do $h$ of them in parallel with different learned projections, then concatenate:

graph LR
  X["Input X\n(B, T, C)"] --> QKV["Single Linear\n c_attn: C → 3C"]
  QKV --> Q["Q split"]
  QKV --> K["K split"]
  QKV --> V["V split"]
  Q --> R["Reshape\n(B,nh,T,hs)"]
  K --> R
  V --> R
  R --> ATT["Scaled dot-product\n+ causal mask\n+ softmax"]
  ATT --> OUT["Weighted sum\nof V"]
  OUT --> PROJ["c_proj: C → C"]
  PROJ --> Y["Output Y\n(B, T, C)"]

  style ATT fill:#d4edda,stroke:#009900,color:#111

Implementation

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0

        # Q, K, V projections fused into one matrix: one matmul instead of three
        # (bias=False is a simplification; the released GPT-2 weights include biases here)
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=False)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=False)

        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.head_size = config.n_embd // config.n_head

        # causal mask — lower triangular, registered as buffer (not a parameter)
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(config.block_size, config.block_size))
                 .view(1, 1, config.block_size, config.block_size)
        )

    def forward(self, x):
        B, T, C = x.size()  # batch, sequence length, embedding dim

        # Project once, split into Q K V — each (B, T, C)
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)

        # Reshape into (B, n_head, T, head_size) for parallel head computation
        k = k.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        q = q.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_size).transpose(1, 2)

        # Scaled dot-product attention: (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (self.head_size ** -0.5)

        # Apply causal mask: future positions → -inf → 0 after softmax
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)

        # Weighted sum of values: (B, nh, T, hs)
        y = att @ v

        # Concatenate heads: (B, T, C)
        y = y.transpose(1, 2).contiguous().view(B, T, C)

        return self.c_proj(y)

Key detail: the QKV projection is a single Linear(C, 3C) layer, then split. This is more efficient than three separate projections (one GEMM vs three).
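
The equivalence between one fused projection and three separate ones can be verified on a toy example. This sketch uses small integer matrices and a hand-written `matmul` helper (both made up for illustration), and follows the $XW$ convention from the math above rather than nn.Linear's internal weight layout.

```python
def matmul(a, b):
    """Multiply matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Toy sizes: T=2 tokens, C=2 channels. All values are made up for illustration.
x   = [[1, 2], [3, 4]]
w_q = [[1, 0], [0, 1]]
w_k = [[2, 1], [0, 1]]
w_v = [[0, 1], [1, 0]]

# Fused weight: concatenate W_Q, W_K, W_V along the output dimension -> (C, 3C).
w_qkv = [w_q[r] + w_k[r] + w_v[r] for r in range(2)]

fused = matmul(x, w_qkv)                       # one (T, C) x (C, 3C) product
q = [row[0:2] for row in fused]                # split back into three (T, C) pieces
k = [row[2:4] for row in fused]
v = [row[4:6] for row in fused]

print(q == matmul(x, w_q), k == matmul(x, w_k), v == matmul(x, w_v))
```

The arithmetic is identical either way; the benefit of fusing is that the hardware runs one larger matrix multiply instead of three smaller ones.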


The MLP Block

After attention, each token’s representation passes through a position-wise feed-forward network, applied to every position independently. Much of the model’s factual “knowledge” is often attributed to these layers.

Structure: expand to $4C$ → GELU → compress back to $C$.

$$ \text{MLP}(x) = W_2 \cdot \text{GELU}(W_1 x + b_1) + b_2 $$

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc   = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu   = nn.GELU(approximate='tanh')  # GPT-2 uses tanh approx
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))

Why GELU over ReLU? GELU is smooth near zero — it doesn’t hard-gate values, it softly weights them. This leads to better gradient flow and empirically better training for language models.
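
For a feel of the difference, the sketch below evaluates ReLU, exact GELU, and the tanh approximation in plain Python. The formulas are the standard definitions; the sample points are arbitrary.

```python
import math

def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU (the form selected by approximate='tanh')."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def relu(x):
    return max(0.0, x)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu_exact(x):.4f}  gelu_tanh={gelu_tanh(x):.4f}")
```

Unlike ReLU, GELU lets small negative inputs through with a small negative output, and the tanh form tracks the exact function closely over this range.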

Why 4× expansion? The expanded space gives the MLP room to represent complex functions. It’s a heuristic from the original transformer paper that has stuck across essentially all modern LLMs.


The Transformer Block

One block = LayerNorm → Attention → residual add, followed by LayerNorm → MLP → residual add.

GPT-2 uses pre-LayerNorm (normalize before each sub-layer, not after). This is a departure from the original “Attention is All You Need” paper and improves training stability significantly.

graph TD
  X["x (input)"] --> LN1["LayerNorm 1"]
  LN1 --> ATT["CausalSelfAttention"]
  ATT --> ADD1["+ residual"]
  X --> ADD1
  ADD1 --> LN2["LayerNorm 2"]
  LN2 --> MLP["MLP"]
  MLP --> ADD2["+ residual"]
  ADD1 --> ADD2
  ADD2 --> OUT["x (output)"]

  style ATT fill:#d4edda,stroke:#009900,color:#111
  style MLP fill:#d4edda,stroke:#009900,color:#111

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp  = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # attention sub-layer
        x = x + self.mlp(self.ln_2(x))   # MLP sub-layer
        return x

The residual connections (x + ...) are critical: they give gradients a direct path to early layers that bypasses attention and the MLP. Without them, a 12-layer stack would be far harder to train.
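
A deliberately simplified scalar model shows the effect. Assume each "layer" is f(x) = gain · tanh(x) with a small gain (a made-up stand-in for a sub-layer, not anything GPT-2 actually computes) and compare the end-to-end derivative with and without the skip connection.

```python
import math

def end_to_end_gradient(n_layers, residual, x0=0.5, gain=0.1):
    """d(output)/d(input) for a stack of scalar layers f(x) = gain * tanh(x)."""
    x, grad = x0, 1.0
    for _ in range(n_layers):
        f = gain * math.tanh(x)
        df = gain * (1.0 - math.tanh(x) ** 2)   # derivative of f at x
        if residual:
            grad *= 1.0 + df    # d/dx [x + f(x)] = 1 + f'(x)
            x = x + f
        else:
            grad *= df          # d/dx [f(x)] = f'(x)
            x = f
    return grad

print("with residual:   ", end_to_end_gradient(12, residual=True))
print("without residual:", end_to_end_gradient(12, residual=False))
```

With the skip connection every layer contributes a factor of at least 1, so the gradient survives all 12 layers; without it, the per-layer factors multiply down toward zero. Real networks are far more complicated, but the "1 +" term is the same mechanism.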


The Full GPT-2 Model

from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 256    # context length
    vocab_size: int = 65     # 65 for char-level Shakespeare; 50257 for GPT-2 BPE
    n_layer:    int = 6      # 12 for full GPT-2 small
    n_head:     int = 6      # 12 for full GPT-2 small
    n_embd:     int = 384    # 768 for full GPT-2 small


class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config

        self.transformer = nn.ModuleDict(dict(
            wte  = nn.Embedding(config.vocab_size, config.n_embd),  # token embeddings
            wpe  = nn.Embedding(config.block_size, config.n_embd),  # position embeddings
            h    = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),                     # final LayerNorm
        ))
        # language model head: projects embeddings → vocab logits
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        # Weight tying: token embedding and lm_head share the same matrix
        self.transformer.wte.weight = self.lm_head.weight

        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.size()
        assert T <= self.config.block_size

        # Token + positional embeddings
        pos = torch.arange(T, device=idx.device)          # (T,)
        tok_emb = self.transformer.wte(idx)                # (B, T, C)
        pos_emb = self.transformer.wpe(pos)                # (T, C)
        x = tok_emb + pos_emb                              # (B, T, C)

        # Pass through transformer blocks
        for block in self.transformer.h:
            x = block(x)

        # Final LayerNorm + project to vocab
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)                           # (B, T, vocab_size)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )

        return logits, loss

    def get_num_params(self):
        return sum(p.numel() for p in self.parameters())

Weight tying

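self.transformer.wte.weight = self.lm_head.weight makes both layers share the same matrix. At GPT-2 scale this saves ~38.6M parameters (768 × 50257 = 38.6M), close to a third of the 124M total, and in practice costs little or no accuracy. In the character-level config the saving is only 65 × 384 = 24,960 parameters.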

The intuition: the token embedding maps token ID → vector, and lm_head maps vector → token logits. They’re doing the same job in opposite directions. Using the same matrix means a token’s embedding IS its “signature” — the model learns one consistent representation rather than two separate ones.
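
The "same matrix, opposite directions" idea can be sketched with a tiny hand-made table. The values below are invented for illustration, and the transformer blocks are skipped entirely: the hidden state is simply a token's own embedding.

```python
# Toy embedding table: 3 tokens, 2 dimensions. Values are made up for illustration.
wte = [[1.0, 0.0],
       [0.0, 1.0],
       [-1.0, 0.0]]

def embed(token_id):
    """Input side: token ID -> vector (a row lookup in wte)."""
    return wte[token_id]

def lm_head(hidden):
    """Output side with tying: logits[j] = hidden . wte[j], i.e. hidden @ wte^T."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in wte]

hidden = embed(1)            # pretend the blocks passed this vector through unchanged
logits = lm_head(hidden)
print(logits)                # the highest logit belongs to the token whose row matches best
```

With tying, a token scores highly at the output exactly when the final hidden state points in the direction of that token's embedding row.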

graph LR
  subgraph "Without weight tying  (2 separate matrices)"
    WTE1["wte.weight\n50257 × 768\n38.6M params"] -->|"token id → vector"| H1["hidden\nstates"]
    H1 --> LMH1["lm_head.weight\n768 × 50257\n38.6M params"] --> L1["logits"]
  end

  subgraph "With weight tying  (1 shared matrix)"
    WTE2["wte.weight\n50257 × 768\n38.6M params"] -->|"token id → vector"| H2["hidden\nstates"]
    H2 -->|"same matrix\ntransposed"| L2["logits"]
  end

  style WTE2 fill:#d4edda,stroke:#009900,color:#111
  style L2 fill:#d4edda,stroke:#009900,color:#111

Embeddings in Detail

GPT-2 uses two separate learned embeddings added together:

$$ x = \text{Embed}_{\text{token}}(\text{idx}) + \text{Embed}_{\text{position}}(\text{pos}) $$

graph LR
  subgraph "Input: 'Hello world'"
    T1["Hello\ntoken 15496"] & T2["world\ntoken 995"]
  end

  subgraph "Token Embedding  wte  [50257 × 768]"
    TE1["row 15496\n→ 768-dim vec"] 
    TE2["row 995\n→ 768-dim vec"]
  end

  subgraph "Position Embedding  wpe  [1024 × 768]"
    PE1["row 0\n→ 768-dim vec"]
    PE2["row 1\n→ 768-dim vec"]
  end

  T1 --> TE1
  T2 --> TE2
  T1 -->|pos 0| PE1
  T2 -->|pos 1| PE2

  TE1 & PE1 --> ADD1["+ add → x₀\n768-dim"]
  TE2 & PE2 --> ADD2["+ add → x₁\n768-dim"]

  style ADD1 fill:#d4edda,stroke:#009900,color:#111
  style ADD2 fill:#d4edda,stroke:#009900,color:#111

Token embedding (wte): maps each token ID to a 768-dim vector. Vocabulary size 50257 for BPE, 65 for character-level.

Positional embedding (wpe): maps each position index $0, 1, \ldots, T-1$ to a 768-dim vector. Learned, not fixed sinusoidal like the original transformer. The model learns what “being at position 47” means.

Why add instead of concatenate? Adding means position information is baked directly into the token representation — the same representation that flows through all subsequent layers. Concatenating would double the dimension.
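
A small sketch of the addition, using made-up 2-dimensional tables in place of the real 768-dimensional ones, shows how the same token ends up with a different input vector at each position.

```python
# Toy tables: 2 token types and 3 positions, 2 dimensions each (made-up values).
wte = {7: [1.0, 0.0], 9: [0.0, 1.0]}                 # token ID -> vector
wpe = [[0.1, 0.1], [0.2, -0.2], [0.3, 0.0]]          # position -> vector

def embed_sequence(token_ids):
    """x[t] = wte[token_ids[t]] + wpe[t], added elementwise."""
    return [[a + b for a, b in zip(wte[tok], wpe[t])]
            for t, tok in enumerate(token_ids)]

x = embed_sequence([7, 9, 7])
print(x[0], x[2])   # same token (ID 7) at positions 0 and 2 -> different vectors
```

The output keeps the same width as either table, which is why adding does not change the model dimension.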


Training Loop

flowchart LR
  DATA["get_batch\n(x, y)"] --> FWD["Forward pass\nlogits, loss"]
  FWD --> ZERO["zero_grad"]
  ZERO --> BWD["loss.backward\ncompute grads"]
  BWD --> CLIP["clip_grad_norm_\nprevent explosion"]
  CLIP --> STEP["optimizer.step\nupdate weights"]
  STEP -->|"next iteration"| DATA

  STEP -->|"every 500 steps"| EVAL["estimate_loss\ntrain + val"]
  EVAL -->|"log"| LOG["step N | train X | val Y"]

device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
print(f"Using device: {device}")

config = GPTConfig()
model  = GPT(config).to(device)
print(f"Parameters: {model.get_num_params():,}")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

@torch.no_grad()
def estimate_loss(eval_iters=100):
    model.eval()
    losses = {}
    for split in ['train', 'val']:
        total = 0.0
        for _ in range(eval_iters):
            x, y = get_batch(split)
            _, loss = model(x, y)
            total += loss.item()
        losses[split] = total / eval_iters
    model.train()
    return losses

max_iters   = 5000
eval_every  = 500

for step in range(max_iters):
    x, y = get_batch('train')
    logits, loss = model(x, y)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    # Gradient clipping — prevents exploding gradients in deep networks
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

    if step % eval_every == 0 or step == max_iters - 1:
        losses = estimate_loss()
        print(f"step {step:5d} | train {losses['train']:.4f} | val {losses['val']:.4f}")

Example output on a MacBook M2 (~5 min), abridged; exact values vary with the random seed and hardware:

step     0 | train 4.2891 | val 4.2934
step   500 | train 2.1034 | val 2.1712
step  1000 | train 1.7823 | val 1.9140
step  2000 | train 1.5601 | val 1.7832
step  4999 | train 1.3244 | val 1.6109
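
A useful sanity check on the first logged value: an untrained model that spreads probability evenly over the vocabulary has a cross-entropy of ln(vocab_size). The sketch below computes that baseline; the helper name is made up for illustration.

```python
import math

def uniform_guess_loss(vocab_size):
    """Cross-entropy (in nats) of a model that assigns probability 1/vocab_size to every token."""
    return -math.log(1.0 / vocab_size)

print(f"char-level, vocab 65:    {uniform_guess_loss(65):.3f}")
print(f"GPT-2 BPE, vocab 50257: {uniform_guess_loss(50257):.3f}")
```

If the loss at step 0 is far above this baseline, the initialization or the loss computation is a likely place to look for a bug.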

Text Generation

@torch.no_grad()
def generate(model, prompt, max_new_tokens=200, temperature=1.0, top_k=40):
    model.eval()
    idx = torch.tensor(encode(prompt), dtype=torch.long, device=device).unsqueeze(0)

    for _ in range(max_new_tokens):
        # Crop context to block_size
        idx_cond = idx[:, -config.block_size:]

        logits, _ = model(idx_cond)
        logits = logits[:, -1, :]          # take last position: (1, vocab_size)

        # Temperature: scale logits before softmax
        logits = logits / temperature

        # Top-k: mask out all but the k highest logits (set them to -inf)
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = float('-inf')

        probs = F.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_tok], dim=1)

    return decode(idx[0].tolist())

print(generate(model, prompt="\n", max_new_tokens=300, temperature=0.8))

Temperature controls randomness:

  • $T < 1$ — sharper distribution, more predictable
  • $T = 1$ — sample from the raw model distribution
  • $T > 1$ — flatter distribution, more creative/random

Top-k prevents the model from ever sampling tokens outside the k most likely ones, which cuts down on nonsense while keeping variety.
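
Both knobs are easy to inspect on a handful of made-up logits. The following plain-Python sketch mirrors the order of operations in generate (temperature first, then top-k, then softmax); the function name and the example logits are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sampling_distribution(logits, temperature=1.0, top_k=None):
    """Probabilities after temperature scaling and optional top-k filtering."""
    scaled = [v / temperature for v in logits]
    if top_k is not None:
        kth = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        # Anything below the k-th largest logit gets -inf, hence probability 0.
        scaled = [v if v >= kth else float("-inf") for v in scaled]
    return softmax(scaled)

logits = [2.0, 1.0, 0.5, -1.0]   # hypothetical next-token logits
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in sampling_distribution(logits, temperature=t)])
print("top_k=2", [round(p, 3) for p in sampling_distribution(logits, top_k=2)])
```

Lower temperatures push more probability onto the top token, and top-k renormalizes over the surviving tokens so the probabilities still sum to 1.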

flowchart TD
  P["prompt tokens\n(1, T)"] --> CROP["crop to block_size"]
  CROP --> FWD["model forward pass\n→ logits (1, T, V)"]
  FWD --> LAST["take last position\nlogits (1, V)"]
  LAST --> TEMP["÷ temperature\nscale sharpness"]
  TEMP --> TOPK["mask all but top-k\nlogits (set to -inf)"]
  TOPK --> SOFT["softmax → probs\n(1, V)"]
  SOFT --> SAMPLE["multinomial sample\n→ next token id"]
  SAMPLE --> CAT["cat to idx\n(1, T+1)"]
  CAT -->|"repeat until\nmax_new_tokens"| CROP

  style SAMPLE fill:#d4edda,stroke:#009900,color:#111
  style SOFT fill:#d4edda,stroke:#009900,color:#111

Parameter Count Breakdown

For the small config (n_layer=6, n_head=6, n_embd=384, vocab=65):

| Component | Parameters |
|---|---|
| Token embedding wte | 65 × 384 = 24,960 |
| Position embedding wpe | 256 × 384 = 98,304 |
| Per block: attention QKV | 384 × (3×384) = 442,368 |
| Per block: attention proj | 384 × 384 = 147,456 |
| Per block: MLP fc | 384 × 1536 = 589,824 (+ 1,536 bias) |
| Per block: MLP proj | 1536 × 384 = 589,824 (+ 384 bias) |
| Per block: 2× LayerNorm | 2 × 768 = 1,536 |
| Per block total | ~1.77M |
| 6 blocks | ~10.6M |
| Final LayerNorm | 768 |
| lm_head (tied to wte) | 0 extra |
| Total | ~10.8M |

For full GPT-2 small (12 layers, 768 embd, vocab 50257): 124M parameters.
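
The totals can be reproduced with a few lines of arithmetic. This sketch assumes the layer shapes used in this post (fused QKV, 4× MLP, LayerNorm with weight and bias, tied lm_head); the `attn_bias` flag is a hypothetical switch that adds the attention biases present in the released GPT-2 weights.

```python
def count_params(n_layer, n_embd, vocab_size, block_size, attn_bias):
    """Parameter count for a GPT-2 style model with tied wte / lm_head."""
    wte = vocab_size * n_embd
    wpe = block_size * n_embd
    layer_norm = 2 * n_embd                            # weight + bias
    attn = n_embd * 3 * n_embd + n_embd * n_embd       # c_attn + c_proj weights
    if attn_bias:
        attn += 3 * n_embd + n_embd                    # c_attn + c_proj biases
    mlp_fc = n_embd * 4 * n_embd + 4 * n_embd          # weight + bias
    mlp_proj = 4 * n_embd * n_embd + n_embd            # weight + bias
    per_block = 2 * layer_norm + attn + mlp_fc + mlp_proj
    return wte + wpe + n_layer * per_block + layer_norm   # last term is ln_f

small = count_params(6, 384, 65, 256, attn_bias=False)        # this post's config
full  = count_params(12, 768, 50257, 1024, attn_bias=True)    # GPT-2 small, biases included
print(f"{small:,}  {full:,}")
```

This gives 10,761,600 for the small config and 124,439,808 for GPT-2 small, matching the ~10.8M and 124M figures above.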


What Makes It GPT-2 Specifically

The original GPT-1 → GPT-2 changes were mostly scale, but also:

  1. Pre-LayerNorm instead of post-LayerNorm — better gradient flow, trains more stably
  2. LayerNorm on the final output after all blocks (the ln_f we have)
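  3. Weight initialization scaled by depth: residual projections are initialized with standard deviation $\frac{0.02}{\sqrt{2 \cdot n_{\text{layer}}}}$ to keep the residual stream from growing too large (the _init_weights above uses a flat 0.02 and does not apply this scaling)
  4. Context length doubled from 512 to 1024 tokens. (Dropping the attention biases, as the code in this post does, is a later simplification; the original GPT-2 keeps them)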
  5. Vocabulary expanded to 50257 via BPE
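
The depth-scaled initialization in item 3 is simple to compute. The sketch below assumes the convention that each block adds two contributions to the residual stream (one from attention, one from the MLP), which is where the factor of 2 comes from; the helper name is made up.

```python
import math

def residual_init_std(n_layer, base_std=0.02):
    """Std for residual-path projections: base_std / sqrt(2 * n_layer)."""
    # Each block adds two contributions to the residual stream (attention and MLP).
    return base_std / math.sqrt(2 * n_layer)

for n_layer in (6, 12):
    std = residual_init_std(n_layer)
    # If 2*n_layer independent terms each have std `std`, their sum has std
    # std * sqrt(2 * n_layer), which brings it back to base_std.
    summed = std * math.sqrt(2 * n_layer)
    print(f"n_layer={n_layer:2d}  std={std:.5f}  std of summed contributions={summed:.3f}")
```

In the model code this would apply to the two c_proj layers in each block, since those are the projections that write back into the residual stream.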

The GPT-2 paper describes the architecture only briefly, as a short list of changes relative to GPT-1. Most of the implementation detail people associate with “GPT-2” comes from the released code and from reimplementations such as Karpathy’s nanoGPT, which is the clearest reference.


Common Pitfalls

Forgetting to shift targets. The model predicts token $t+1$ from token $t$. If you feed x as both input and target without the shift, the model learns the trivial task of copying.

Not zeroing gradients. PyTorch accumulates gradients by default. Always clear them with optimizer.zero_grad() before calling .backward() on a new batch. Passing set_to_none=True (the default since PyTorch 2.0) gives a small speedup.

No gradient clipping. Without clip_grad_norm_, early training can produce very large gradients, and a single oversized update can destabilize training badly enough that the loss never recovers.
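
Conceptually, norm clipping measures the L2 norm of all gradients taken together and rescales them if that norm exceeds the limit. The following is a plain-Python sketch of that idea on a flat list of numbers, not PyTorch's actual implementation; the function name and the `eps` default are assumptions.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scales gradients up
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, [round(g, 3) for g in clipped], round(norm_after, 3))

small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)   # already under the limit
print(small)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length is capped.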

Context overflow. During generation, idx grows with each new token. Crop to block_size before the forward pass; otherwise the assert in forward fails, and without that assert the positional embedding lookup would raise an index error.

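Weight tying can silently break. With the default load_state_dict, values are copied into the existing parameters in place, so a model that ties its weights in __init__ (as GPT does above) stays tied after loading. Tying can be lost if you assign a fresh tensor to either weight afterwards, or if you load with assign=True in recent PyTorch versions. Check with model.transformer.wte.weight is model.lm_head.weight, and re-tie if it is False: model.transformer.wte.weight = model.lm_head.weight.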


Summary

GPT-2 is:

  1. Token + position embeddings added together
  2. 12 transformer blocks, each with causal self-attention → MLP, both with residual connections and pre-LayerNorm
  3. A final LayerNorm then linear projection to vocab logits
  4. Weight tying between the input embedding and output projection
  5. Trained with cross-entropy loss on next-token prediction

That’s it. The entire architecture in five sentences. Everything else is choosing the right hyperparameters and making the implementation efficient.


Watch It Built Live

Andrej Karpathy’s full lecture reproducing GPT-2 (124M) from scratch: https://www.youtube.com/watch?v=l8pRSuU81PU


References

Primary sources — code and papers

  1. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. OpenAI Blog. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

  2. Karpathy, A. (2024). build-nanoGPT: Video+code lecture on building nanoGPT from scratch. GitHub. https://github.com/karpathy/build-nanogpt

  3. Karpathy, A. (2023). nanoGPT: The simplest, fastest repository for training/finetuning medium-sized GPTs. GitHub. https://github.com/karpathy/nanoGPT

  4. Karpathy, A. (2024). Let’s reproduce GPT-2 (124M). YouTube. https://www.youtube.com/watch?v=l8pRSuU81PU

Implementation guides

  1. Belaweid, A. (2025). GPT-2 Implementation From Scratch For Dummies! Substack. https://azizbelaweid.substack.com/p/gpt-2-implementation-from-scratch

  2. RecsysML. (2025). Learning ML by doing: Training GPT-2 on a budget. Substack. https://recsysml.substack.com/p/training-gpt-2-on-a-budget

  3. Wolfe, C. (2024). Implementation of causal self-attention in PyTorch. GitHub Gist. https://gist.github.com/wolfecameron/26863dbbc322b15d2e224a2569868256

Theory and architecture

  1. Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS. https://arxiv.org/abs/1706.03762

  2. Raschka, S. (2023). Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch. https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html

  3. Ravi, S. (2025). GPT-2 Architecture Demystified: A Step-by-Step Breakdown. Medium. https://sararavi14.medium.com/gpt-2-architecture-demystified-a-step-by-step-breakdown-74b1c5c80d17

Official documentation

  1. HuggingFace. GPT-2 Model Documentation. HuggingFace Transformers. https://huggingface.co/docs/transformers/model_doc/gpt2