← Back

Building Autograd from Scratch

How modern deep learning frameworks compute gradients automatically. We build a working autograd engine in ~80 lines of Python — the same core idea behind PyTorch.

Every time you call .backward() in PyTorch, a small miracle happens: gradients flow backward through millions of operations, each one correctly applying calculus without you writing a single derivative by hand. This post tears that miracle apart and rebuilds it from first principles.

By the end you will have a working autograd engine in ~80 lines of Python that can train a neural network.

Autograd computation graph — nodes pulse orange on forward pass, blue on backward pass


The Problem: We Need Derivatives

Training a neural network means minimizing a loss function $L$ by nudging parameters in the direction that decreases $L$. That direction is the gradient: a vector of partial derivatives $\frac{\partial L}{\partial w_i}$ for every weight $w_i$.

For a network with millions of parameters, computing these derivatives by hand is impossible. We need a machine to do it.

Three approaches exist:

Method How Problem
Symbolic diff Algebra: manipulate expressions Expression explosion
Numerical diff Finite differences: $\frac{f(x+h) - f(x)}{h}$ Slow, numerically unstable
Automatic diff Track ops at runtime, apply chain rule This is what we build

Automatic differentiation (autograd) is exact (up to floating point), fast, and works for any Python control flow — loops, conditionals, recursion.

graph LR
  subgraph "Differentiation methods"
    S["Symbolic\ndiff"] -->|"Expression\nexplosion"| X1["✗"]
    N["Numerical\ndiff"] -->|"Slow +\nunstable"| X2["✗"]
    A["Automatic\ndiff"] -->|"Exact +\nfast"| OK["✓"]
  end
  style A fill:#d4edda,stroke:#009900,color:#111
  style OK fill:#d4edda,stroke:#009900,color:#111
  style X1 fill:#f8d7da,stroke:#990000,color:#111
  style X2 fill:#f8d7da,stroke:#990000,color:#111

The Chain Rule Is All You Need

Every neural network computation decomposes into elementary operations: add, multiply, exponentiate, etc. The chain rule tells us how to compose derivatives through a sequence of operations.

If $L = f(g(x))$, then:

$$ \frac{dL}{dx} = \frac{dL}{df} \cdot \frac{df}{dg} \cdot \frac{dg}{dx} $$

More generally, for a node $v$ in a computation graph with successors $s_1, s_2, \ldots$:

$$ \bar{v} = \sum_i \bar{s_i} \cdot \frac{\partial s_i}{\partial v} $$

where $\bar{v} = \frac{\partial L}{\partial v}$ is called the adjoint of $v$.

This is the entire mathematical foundation. Everything else is implementation.


Computation Graphs

Every expression you write builds a directed acyclic graph (DAG) of operations. Consider:

a = 2.0
b = 3.0
c = a * b   # c = 6
d = c + b   # d = 9
L = d * 2   # L = 18

The graph looks like:

graph LR
  a("a = 2.0") --> mul["× multiply"]
  b("b = 3.0") --> mul
  mul --> c("c = 6.0")
  c --> add["+ add"]
  b --> add
  add --> d("d = 9.0")
  d --> mul2["× 2"]
  mul2 --> L("L = 18.0")

  style L fill:#d4edda,stroke:#009900,color:#111
  style a fill:#fff,stroke:#dddddd,color:#111
  style b fill:#fff,stroke:#dddddd,color:#111

Leaf nodes (a, b) are inputs. The root (L) is the output. Every interior node stores:

  1. Its value (forward pass result)
  2. Which children created it
  3. How to backprop gradients to those children

The Value Class

We wrap every scalar in a Value object. Each Value stores its data, its gradient, its parent nodes, and a closure _backward that knows how to distribute gradients to parents.

classDiagram
  class Value {
    +float data
    +float grad
    +set _prev
    +string _op
    +_backward()
    +__add__(other)
    +__mul__(other)
    +__pow__(exp)
    +relu()
    +backward()
  }
  note for Value "grad starts at 0.0\n_backward is a closure\nstored at creation time"
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = float(data)
        self.grad = 0.0
        self._backward = lambda: None  # no-op by default
        self._prev = set(_children)
        self._op = _op  # for debugging

    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"

The grad field starts at zero. It accumulates contributions from all downstream nodes during the backward pass.


Operations: Forward + Local Gradient

For each operation, we:

  1. Compute the output value (forward pass)
  2. Define a _backward closure that accumulates gradients into the inputs using the chain rule

Addition

Mathematically: if $c = a + b$, then $\frac{\partial c}{\partial a} = 1$ and $\frac{\partial c}{\partial b} = 1$.

So when the backward pass tells us $\frac{\partial L}{\partial c}$ (stored in c.grad), we accumulate:

$$ \frac{\partial L}{\partial a} \mathrel{+}= \frac{\partial L}{\partial c} \cdot 1 $$
def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data + other.data, (self, other), '+')

    def _backward():
        self.grad  += 1.0 * out.grad
        other.grad += 1.0 * out.grad
    out._backward = _backward
    return out

Note the +=. A node can be used multiple times in a computation graph — gradients accumulate, they don’t overwrite.

Multiplication

If $c = a \cdot b$, then $\frac{\partial c}{\partial a} = b$ and $\frac{\partial c}{\partial b} = a$.

def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data * other.data, (self, other), '*')

    def _backward():
        self.grad  += other.data * out.grad
        other.grad += self.data  * out.grad
    out._backward = _backward
    return out

Power

If $y = x^n$, then $\frac{\partial y}{\partial x} = n \cdot x^{n-1}$.

def __pow__(self, exponent):
    assert isinstance(exponent, (int, float))
    out = Value(self.data ** exponent, (self,), f'**{exponent}')

    def _backward():
        self.grad += exponent * (self.data ** (exponent - 1)) * out.grad
    out._backward = _backward
    return out

ReLU

$$ \text{ReLU}(x) = \max(0, x) \qquad \frac{\partial \text{ReLU}}{\partial x} = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases} $$
def relu(self):
    out = Value(max(0, self.data), (self,), 'ReLU')

    def _backward():
        self.grad += (out.data > 0) * out.grad
    out._backward = _backward
    return out

Convenience wrappers

def __neg__(self):         return self * -1
def __sub__(self, other):  return self + (-other)
def __truediv__(self, other): return self * other**-1
def __radd__(self, other): return self + other
def __rmul__(self, other): return self * other
def __rsub__(self, other): return other + (-self)

The r variants handle expressions like 2 * v where the left operand is a Python number.


The Backward Pass: Topological Sort

Now the key: how do we call _backward on every node in the right order?

Right order means: before processing a node, all nodes that depend on it must already have their gradients computed. That is exactly reverse topological order.

graph LR
  subgraph "Forward order (topo)"
    direction LR
    A["a"] --> C["c=a*b"] --> E["L=c+b"]
    B["b"] --> C
    B --> E
  end
  subgraph "Backward order (reversed)"
    direction RL
    E2["L"] --> C2["c"] --> A2["a"]
    E2 --> B2["b"]
    C2 --> B2
  end
  style E fill:#d4edda,stroke:#009900,color:#111
  style E2 fill:#d4edda,stroke:#009900,color:#111

We find this order with a depth-first search, then walk it in reverse:

def backward(self):
    topo = []
    visited = set()

    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)

    build_topo(self)

    # Seed: dL/dL = 1
    self.grad = 1.0
    for node in reversed(topo):
        node._backward()

The DFS appends a node after recursing into its children — so topo comes out in topological order (leaves first, root last). We reverse it so we start at the root and work toward leaves.


Putting It Together: Full Engine

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = float(data)
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad  += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad  += other.data * out.grad
            other.grad += self.data  * out.grad
        out._backward = _backward
        return out

    def __pow__(self, exponent):
        out = Value(self.data ** exponent, (self,), f'**{exponent}')
        def _backward():
            self.grad += exponent * (self.data ** (exponent - 1)) * out.grad
        out._backward = _backward
        return out

    def relu(self):
        out = Value(max(0, self.data), (self,), 'ReLU')
        def _backward():
            self.grad += (out.data > 0) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev: build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

    def __neg__(self):          return self * -1
    def __sub__(self, other):   return self + (-other)
    def __truediv__(self, other): return self * other**-1
    def __radd__(self, other):  return self + other
    def __rmul__(self, other):  return self * other
    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"

That is the entire engine. 82 lines. Let’s verify it works.


Verification: Checking Against Calculus

Let $L = (a \cdot b + c)^2$ with $a=2, b=3, c=1$.

Forward: $L = (6 + 1)^2 = 49$.

By hand: $\frac{\partial L}{\partial a} = 2(ab+c) \cdot b = 2 \cdot 7 \cdot 3 = 42$ $\frac{\partial L}{\partial b} = 2(ab+c) \cdot a = 2 \cdot 7 \cdot 2 = 28$ $\frac{\partial L}{\partial c} = 2(ab+c) \cdot 1 = 14$

a = Value(2.0)
b = Value(3.0)
c = Value(1.0)

L = (a * b + c) ** 2
L.backward()

print(a.grad)  # 42.0 ✓
print(b.grad)  # 28.0 ✓
print(c.grad)  # 14.0 ✓

Building a Neural Network on Top

With gradients working, we can build neurons and layers. A single neuron computes $y = \text{ReLU}(w \cdot x + b)$ for a weight vector $w$ and bias $b$.

graph TD
  subgraph "MLP(2, [4, 4, 1])"
    I1((x₁)) & I2((x₂))

    subgraph "Layer 1 — 4 neurons"
      N1((n)) & N2((n)) & N3((n)) & N4((n))
    end

    subgraph "Layer 2 — 4 neurons"
      N5((n)) & N6((n)) & N7((n)) & N8((n))
    end

    subgraph "Output — 1 neuron"
      OUT((y))
    end

    I1 & I2 --> N1 & N2 & N3 & N4
    N1 & N2 & N3 & N4 --> N5 & N6 & N7 & N8
    N5 & N6 & N7 & N8 --> OUT
  end

  style OUT fill:#d4edda,stroke:#009900,color:#111
import random

class Neuron:
    def __init__(self, n_inputs):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(n_inputs)]
        self.b = Value(0.0)

    def __call__(self, x):
        activation = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return activation.relu()

    def parameters(self):
        return self.w + [self.b]


class Layer:
    def __init__(self, n_inputs, n_outputs):
        self.neurons = [Neuron(n_inputs) for _ in range(n_outputs)]

    def __call__(self, x):
        return [n(x) for n in self.neurons]

    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]


class MLP:
    def __init__(self, n_inputs, layer_sizes):
        sizes = [n_inputs] + layer_sizes
        self.layers = [Layer(sizes[i], sizes[i+1]) for i in range(len(layer_sizes))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x[0] if len(x) == 1 else x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]

Training Loop

# XOR dataset
X = [[0,0], [0,1], [1,0], [1,1]]
y = [0,     1,     1,     0    ]

model = MLP(2, [4, 4, 1])

for step in range(100):
    # Forward pass
    preds = [model(x) for x in X]

    # Mean squared error loss
    loss = sum((pred - target)**2 for pred, target in zip(preds, y))

    # Backward pass
    for p in model.parameters():
        p.grad = 0.0   # zero gradients before backward
    loss.backward()

    # Gradient descent
    for p in model.parameters():
        p.data -= 0.05 * p.grad

    if step % 10 == 0:
        print(f"step {step:3d} | loss {loss.data:.4f}")

Output (approximate):

step   0 | loss 2.1847
step  10 | loss 1.3204
step  20 | loss 0.8951
step  30 | loss 0.5623
...
step  90 | loss 0.0412

The loss drops. Our engine is computing real gradients.


What PyTorch Does Differently

Our engine works on scalars. PyTorch operates on tensors (n-dimensional arrays), which is the same idea but vectorised with C++/CUDA under the hood. The concepts map exactly:

graph LR
  subgraph "Our engine"
    V["Value\n(scalar float)"] --> VD["value.data"]
    V --> VG["value.grad"]
    V --> VB["value._backward\n(closure)"]
    V --> VP["value._prev\n(set of parents)"]
  end
  subgraph "PyTorch"
    T["Tensor\n(n-dim array)"] --> TD["tensor.data"]
    T --> TG["tensor.grad"]
    T --> TB["tensor.grad_fn\n(C++ Function obj)"]
    T --> TP["grad_fn.next_functions\n(tuple of parents)"]
  end
  V -. "same concept\nscaled up" .-> T
  style V fill:#fff,stroke:#dddddd,color:#111
  style T fill:#d4edda,stroke:#009900,color:#111
Our engine PyTorch
Value.data tensor.data
Value.grad tensor.grad
Value._backward tensor.grad_fn
Value._prev tensor.grad_fn.next_functions
loss.backward() loss.backward()
p.grad = 0.0 optimizer.zero_grad()

The key difference: PyTorch’s DAG stores functions (AddBackward, MulBackward, etc.) not values. But the topological traversal and chain rule application are identical.


Common Pitfalls

Forgetting to zero gradients. Because grad accumulates with +=, calling .backward() twice without zeroing will double-count. Always zero before each backward pass.

Reusing leaf nodes. If a appears in two branches of the computation, its gradient correctly receives contributions from both branches via accumulation. This is a feature, not a bug.

In-place operations. Modifying .data directly after using a value in a computation breaks the graph. Never do v.data = new_val mid-computation.

Circular graphs. The topological sort assumes a DAG. Feedback loops aren’t supported (and shouldn’t be — use RNNs through time which unroll into a DAG).


Summary

The whole autograd idea in one sentence: wrap every number in an object that remembers how it was created, then walk creation history in reverse applying the chain rule.

The implementation has three parts:

  1. Value class — wraps scalar, stores grad and backward closure
  2. Operation overloads — each op computes output and registers local gradient rule
  3. backward() — topological sort + chain rule traversal from root to leaves

Everything in PyTorch, JAX, and every other framework is this idea scaled up to tensors and compiled to run on GPUs.


Watch It Built Live

Andrej Karpathy’s lecture that inspired this post — watch him build micrograd from scratch in real time:


References

Primary sources — code and implementations

  1. Karpathy, A. (2020). micrograd: A tiny scalar-valued autograd engine. GitHub. https://github.com/karpathy/micrograd

  2. Karpathy, A. (2022). The spelled-out intro to neural networks and backpropagation: building micrograd. YouTube, Neural Networks: Zero to Hero series. https://www.youtube.com/watch?v=VMj-3S1tku0

  3. Smit, J. (2022). DIY auto-grad engine: A step-by-step guide to calculating derivatives automatically. jordismit.com. https://jordismit.com/blog/diy-auto-grad-engine-a-step-by-step-guide-to-calculating-derivatives-automatically/

  4. Loh, J. (2024). Reverse-mode autodiff from scratch. jlohding.github.io. https://jlohding.github.io/posts/autodiff/

  5. Sankhla, B. (2023). DIY Deep Learning: Crafting your own autograd engine from scratch. Medium / SFU CSPMP. https://medium.com/sfu-cspmp/diy-deep-learning-crafting-your-own-autograd-engine-from-scratch-for-effortless-backpropagation-ddab167faaf5

Official documentation

  1. PyTorch. Automatic Differentiation with torch.autograd. PyTorch Tutorials. https://docs.pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html

  2. JAX. Forward- and reverse-mode autodiff in JAX. JAX Documentation. https://docs.jax.dev/en/latest/jacobian-vector-products.html

  3. Dive into Deep Learning. 2.5 Automatic Differentiation. d2l.ai. https://d2l.ai/chapter_preliminaries/autograd.html

Theory

  1. Wikipedia. Automatic differentiation. Wikimedia Foundation. https://en.wikipedia.org/wiki/Automatic_differentiation

  2. University of Amsterdam. Module 3: The Reverse Mode of Automatic Differentiation. auto-ed.readthedocs.io. https://auto-ed.readthedocs.io/en/latest/mod3.html