Every time you call .backward() in PyTorch, a small miracle happens: gradients flow backward through millions of operations, each one correctly applying calculus without you writing a single derivative by hand. This post tears that miracle apart and rebuilds it from first principles.
By the end you will have a working autograd engine in ~80 lines of Python that can train a neural network.

The Problem: We Need Derivatives
Training a neural network means minimizing a loss function $L$ by nudging parameters in the direction that decreases $L$. That direction is the gradient: a vector of partial derivatives $\frac{\partial L}{\partial w_i}$ for every weight $w_i$.
For a network with millions of parameters, computing these derivatives by hand is impossible. We need a machine to do it.
Three approaches exist:
| Method | How | Problem |
|---|---|---|
| Symbolic diff | Algebra: manipulate expressions | Expression explosion |
| Numerical diff | Finite differences: $\frac{f(x+h) - f(x)}{h}$ | Slow, numerically unstable |
| Automatic diff | Track ops at runtime, apply chain rule | This is what we build |
Automatic differentiation (autograd) is exact (up to floating point), fast, and works for any Python control flow — loops, conditionals, recursion.
graph LR
subgraph "Differentiation methods"
S["Symbolic\ndiff"] -->|"Expression\nexplosion"| X1["✗"]
N["Numerical\ndiff"] -->|"Slow +\nunstable"| X2["✗"]
A["Automatic\ndiff"] -->|"Exact +\nfast"| OK["✓"]
end
style A fill:#d4edda,stroke:#009900,color:#111
style OK fill:#d4edda,stroke:#009900,color:#111
style X1 fill:#f8d7da,stroke:#990000,color:#111
style X2 fill:#f8d7da,stroke:#990000,color:#111
The Chain Rule Is All You Need
Every neural network computation decomposes into elementary operations: add, multiply, exponentiate, etc. The chain rule tells us how to compose derivatives through a sequence of operations.
If $L = f(g(x))$, then:
More generally, for a node $v$ in a computation graph with successors $s_1, s_2, \ldots$:
where $\bar{v} = \frac{\partial L}{\partial v}$ is called the adjoint of $v$.
This is the entire mathematical foundation. Everything else is implementation.
Computation Graphs
Every expression you write builds a directed acyclic graph (DAG) of operations. Consider:
a = 2.0
b = 3.0
c = a * b # c = 6
d = c + b # d = 9
L = d * 2 # L = 18
The graph looks like:
graph LR
a("a = 2.0") --> mul["× multiply"]
b("b = 3.0") --> mul
mul --> c("c = 6.0")
c --> add["+ add"]
b --> add
add --> d("d = 9.0")
d --> mul2["× 2"]
mul2 --> L("L = 18.0")
style L fill:#d4edda,stroke:#009900,color:#111
style a fill:#fff,stroke:#dddddd,color:#111
style b fill:#fff,stroke:#dddddd,color:#111
Leaf nodes (a, b) are inputs. The root (L) is the output. Every interior node stores:
- Its value (forward pass result)
- Which children created it
- How to backprop gradients to those children
The Value Class
We wrap every scalar in a Value object. Each Value stores its data, its gradient, its parent nodes, and a closure _backward that knows how to distribute gradients to parents.
classDiagram
class Value {
+float data
+float grad
+set _prev
+string _op
+_backward()
+__add__(other)
+__mul__(other)
+__pow__(exp)
+relu()
+backward()
}
note for Value "grad starts at 0.0\n_backward is a closure\nstored at creation time"
class Value:
def __init__(self, data, _children=(), _op=''):
self.data = float(data)
self.grad = 0.0
self._backward = lambda: None # no-op by default
self._prev = set(_children)
self._op = _op # for debugging
def __repr__(self):
return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"
The grad field starts at zero. It accumulates contributions from all downstream nodes during the backward pass.
Operations: Forward + Local Gradient
For each operation, we:
- Compute the output value (forward pass)
- Define a
_backwardclosure that accumulates gradients into the inputs using the chain rule
Addition
Mathematically: if $c = a + b$, then $\frac{\partial c}{\partial a} = 1$ and $\frac{\partial c}{\partial b} = 1$.
So when the backward pass tells us $\frac{\partial L}{\partial c}$ (stored in c.grad), we accumulate:
def __add__(self, other):
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data + other.data, (self, other), '+')
def _backward():
self.grad += 1.0 * out.grad
other.grad += 1.0 * out.grad
out._backward = _backward
return out
Note the +=. A node can be used multiple times in a computation graph — gradients accumulate, they don’t overwrite.
Multiplication
If $c = a \cdot b$, then $\frac{\partial c}{\partial a} = b$ and $\frac{\partial c}{\partial b} = a$.
def __mul__(self, other):
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data * other.data, (self, other), '*')
def _backward():
self.grad += other.data * out.grad
other.grad += self.data * out.grad
out._backward = _backward
return out
Power
If $y = x^n$, then $\frac{\partial y}{\partial x} = n \cdot x^{n-1}$.
def __pow__(self, exponent):
assert isinstance(exponent, (int, float))
out = Value(self.data ** exponent, (self,), f'**{exponent}')
def _backward():
self.grad += exponent * (self.data ** (exponent - 1)) * out.grad
out._backward = _backward
return out
ReLU
def relu(self):
out = Value(max(0, self.data), (self,), 'ReLU')
def _backward():
self.grad += (out.data > 0) * out.grad
out._backward = _backward
return out
Convenience wrappers
def __neg__(self): return self * -1
def __sub__(self, other): return self + (-other)
def __truediv__(self, other): return self * other**-1
def __radd__(self, other): return self + other
def __rmul__(self, other): return self * other
def __rsub__(self, other): return other + (-self)
The r variants handle expressions like 2 * v where the left operand is a Python number.
The Backward Pass: Topological Sort
Now the key: how do we call _backward on every node in the right order?
Right order means: before processing a node, all nodes that depend on it must already have their gradients computed. That is exactly reverse topological order.
graph LR
subgraph "Forward order (topo)"
direction LR
A["a"] --> C["c=a*b"] --> E["L=c+b"]
B["b"] --> C
B --> E
end
subgraph "Backward order (reversed)"
direction RL
E2["L"] --> C2["c"] --> A2["a"]
E2 --> B2["b"]
C2 --> B2
end
style E fill:#d4edda,stroke:#009900,color:#111
style E2 fill:#d4edda,stroke:#009900,color:#111
We find this order with a depth-first search, then walk it in reverse:
def backward(self):
topo = []
visited = set()
def build_topo(v):
if v not in visited:
visited.add(v)
for child in v._prev:
build_topo(child)
topo.append(v)
build_topo(self)
# Seed: dL/dL = 1
self.grad = 1.0
for node in reversed(topo):
node._backward()
The DFS appends a node after recursing into its children — so topo comes out in topological order (leaves first, root last). We reverse it so we start at the root and work toward leaves.
Putting It Together: Full Engine
class Value:
def __init__(self, data, _children=(), _op=''):
self.data = float(data)
self.grad = 0.0
self._backward = lambda: None
self._prev = set(_children)
self._op = _op
def __add__(self, other):
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data + other.data, (self, other), '+')
def _backward():
self.grad += out.grad
other.grad += out.grad
out._backward = _backward
return out
def __mul__(self, other):
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data * other.data, (self, other), '*')
def _backward():
self.grad += other.data * out.grad
other.grad += self.data * out.grad
out._backward = _backward
return out
def __pow__(self, exponent):
out = Value(self.data ** exponent, (self,), f'**{exponent}')
def _backward():
self.grad += exponent * (self.data ** (exponent - 1)) * out.grad
out._backward = _backward
return out
def relu(self):
out = Value(max(0, self.data), (self,), 'ReLU')
def _backward():
self.grad += (out.data > 0) * out.grad
out._backward = _backward
return out
def backward(self):
topo, visited = [], set()
def build_topo(v):
if v not in visited:
visited.add(v)
for child in v._prev: build_topo(child)
topo.append(v)
build_topo(self)
self.grad = 1.0
for node in reversed(topo):
node._backward()
def __neg__(self): return self * -1
def __sub__(self, other): return self + (-other)
def __truediv__(self, other): return self * other**-1
def __radd__(self, other): return self + other
def __rmul__(self, other): return self * other
def __repr__(self):
return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"
That is the entire engine. 82 lines. Let’s verify it works.
Verification: Checking Against Calculus
Let $L = (a \cdot b + c)^2$ with $a=2, b=3, c=1$.
Forward: $L = (6 + 1)^2 = 49$.
By hand: $\frac{\partial L}{\partial a} = 2(ab+c) \cdot b = 2 \cdot 7 \cdot 3 = 42$ $\frac{\partial L}{\partial b} = 2(ab+c) \cdot a = 2 \cdot 7 \cdot 2 = 28$ $\frac{\partial L}{\partial c} = 2(ab+c) \cdot 1 = 14$
a = Value(2.0)
b = Value(3.0)
c = Value(1.0)
L = (a * b + c) ** 2
L.backward()
print(a.grad) # 42.0 ✓
print(b.grad) # 28.0 ✓
print(c.grad) # 14.0 ✓
Building a Neural Network on Top
With gradients working, we can build neurons and layers. A single neuron computes $y = \text{ReLU}(w \cdot x + b)$ for a weight vector $w$ and bias $b$.
graph TD
subgraph "MLP(2, [4, 4, 1])"
I1((x₁)) & I2((x₂))
subgraph "Layer 1 — 4 neurons"
N1((n)) & N2((n)) & N3((n)) & N4((n))
end
subgraph "Layer 2 — 4 neurons"
N5((n)) & N6((n)) & N7((n)) & N8((n))
end
subgraph "Output — 1 neuron"
OUT((y))
end
I1 & I2 --> N1 & N2 & N3 & N4
N1 & N2 & N3 & N4 --> N5 & N6 & N7 & N8
N5 & N6 & N7 & N8 --> OUT
end
style OUT fill:#d4edda,stroke:#009900,color:#111
import random
class Neuron:
def __init__(self, n_inputs):
self.w = [Value(random.uniform(-1, 1)) for _ in range(n_inputs)]
self.b = Value(0.0)
def __call__(self, x):
activation = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
return activation.relu()
def parameters(self):
return self.w + [self.b]
class Layer:
def __init__(self, n_inputs, n_outputs):
self.neurons = [Neuron(n_inputs) for _ in range(n_outputs)]
def __call__(self, x):
return [n(x) for n in self.neurons]
def parameters(self):
return [p for n in self.neurons for p in n.parameters()]
class MLP:
def __init__(self, n_inputs, layer_sizes):
sizes = [n_inputs] + layer_sizes
self.layers = [Layer(sizes[i], sizes[i+1]) for i in range(len(layer_sizes))]
def __call__(self, x):
for layer in self.layers:
x = layer(x)
return x[0] if len(x) == 1 else x
def parameters(self):
return [p for layer in self.layers for p in layer.parameters()]
Training Loop
# XOR dataset
X = [[0,0], [0,1], [1,0], [1,1]]
y = [0, 1, 1, 0 ]
model = MLP(2, [4, 4, 1])
for step in range(100):
# Forward pass
preds = [model(x) for x in X]
# Mean squared error loss
loss = sum((pred - target)**2 for pred, target in zip(preds, y))
# Backward pass
for p in model.parameters():
p.grad = 0.0 # zero gradients before backward
loss.backward()
# Gradient descent
for p in model.parameters():
p.data -= 0.05 * p.grad
if step % 10 == 0:
print(f"step {step:3d} | loss {loss.data:.4f}")
Output (approximate):
step 0 | loss 2.1847
step 10 | loss 1.3204
step 20 | loss 0.8951
step 30 | loss 0.5623
...
step 90 | loss 0.0412
The loss drops. Our engine is computing real gradients.
What PyTorch Does Differently
Our engine works on scalars. PyTorch operates on tensors (n-dimensional arrays), which is the same idea but vectorised with C++/CUDA under the hood. The concepts map exactly:
graph LR
subgraph "Our engine"
V["Value\n(scalar float)"] --> VD["value.data"]
V --> VG["value.grad"]
V --> VB["value._backward\n(closure)"]
V --> VP["value._prev\n(set of parents)"]
end
subgraph "PyTorch"
T["Tensor\n(n-dim array)"] --> TD["tensor.data"]
T --> TG["tensor.grad"]
T --> TB["tensor.grad_fn\n(C++ Function obj)"]
T --> TP["grad_fn.next_functions\n(tuple of parents)"]
end
V -. "same concept\nscaled up" .-> T
style V fill:#fff,stroke:#dddddd,color:#111
style T fill:#d4edda,stroke:#009900,color:#111
| Our engine | PyTorch |
|---|---|
Value.data |
tensor.data |
Value.grad |
tensor.grad |
Value._backward |
tensor.grad_fn |
Value._prev |
tensor.grad_fn.next_functions |
loss.backward() |
loss.backward() |
p.grad = 0.0 |
optimizer.zero_grad() |
The key difference: PyTorch’s DAG stores functions (AddBackward, MulBackward, etc.) not values. But the topological traversal and chain rule application are identical.
Common Pitfalls
Forgetting to zero gradients. Because grad accumulates with +=, calling .backward() twice without zeroing will double-count. Always zero before each backward pass.
Reusing leaf nodes. If a appears in two branches of the computation, its gradient correctly receives contributions from both branches via accumulation. This is a feature, not a bug.
In-place operations. Modifying .data directly after using a value in a computation breaks the graph. Never do v.data = new_val mid-computation.
Circular graphs. The topological sort assumes a DAG. Feedback loops aren’t supported (and shouldn’t be — use RNNs through time which unroll into a DAG).
Summary
The whole autograd idea in one sentence: wrap every number in an object that remembers how it was created, then walk creation history in reverse applying the chain rule.
The implementation has three parts:
Valueclass — wraps scalar, stores grad and backward closure- Operation overloads — each op computes output and registers local gradient rule
backward()— topological sort + chain rule traversal from root to leaves
Everything in PyTorch, JAX, and every other framework is this idea scaled up to tensors and compiled to run on GPUs.
Watch It Built Live
Andrej Karpathy’s lecture that inspired this post — watch him build micrograd from scratch in real time:
References
Primary sources — code and implementations
-
Karpathy, A. (2020). micrograd: A tiny scalar-valued autograd engine. GitHub. https://github.com/karpathy/micrograd
-
Karpathy, A. (2022). The spelled-out intro to neural networks and backpropagation: building micrograd. YouTube, Neural Networks: Zero to Hero series. https://www.youtube.com/watch?v=VMj-3S1tku0
-
Smit, J. (2022). DIY auto-grad engine: A step-by-step guide to calculating derivatives automatically. jordismit.com. https://jordismit.com/blog/diy-auto-grad-engine-a-step-by-step-guide-to-calculating-derivatives-automatically/
-
Loh, J. (2024). Reverse-mode autodiff from scratch. jlohding.github.io. https://jlohding.github.io/posts/autodiff/
-
Sankhla, B. (2023). DIY Deep Learning: Crafting your own autograd engine from scratch. Medium / SFU CSPMP. https://medium.com/sfu-cspmp/diy-deep-learning-crafting-your-own-autograd-engine-from-scratch-for-effortless-backpropagation-ddab167faaf5
Official documentation
-
PyTorch. Automatic Differentiation with torch.autograd. PyTorch Tutorials. https://docs.pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html
-
JAX. Forward- and reverse-mode autodiff in JAX. JAX Documentation. https://docs.jax.dev/en/latest/jacobian-vector-products.html
-
Dive into Deep Learning. 2.5 Automatic Differentiation. d2l.ai. https://d2l.ai/chapter_preliminaries/autograd.html
Theory
-
Wikipedia. Automatic differentiation. Wikimedia Foundation. https://en.wikipedia.org/wiki/Automatic_differentiation
-
University of Amsterdam. Module 3: The Reverse Mode of Automatic Differentiation. auto-ed.readthedocs.io. https://auto-ed.readthedocs.io/en/latest/mod3.html