I’m still chugging through chapter 3 of
Sebastian Raschka’s
“Build a Large Language Model (from Scratch)”.
Last time I covered causal attention,
which was pretty simple when it came down to it. Today it’s another
quick and easy one — dropout.
The concept is pretty simple: you want knowledge to be spread broadly across your
model, not concentrated in a few places. Doing that means that all
of your parameters are pulling their weight, and you don’t have a bunch of them
sitting there doing nothing.
So, while you’re training (but, importantly, not during inference)
you randomly ignore certain parts — neurons, weights, whatever — each time
around, so that their “knowledge” gets spread over to other bits.
Simple enough! But the implementation is a little more fun, and there were a
couple of oddities that
I needed to think through.
Code-wise, it’s really easy: PyTorch
provides a useful torch.nn.Dropout
class that you create with the dropout rate
that you want — 0.5 in the example in the book — and if you call it as a function on a
matrix, it will zero out that proportion of the values. Raschka mentions
that the dropout of 0.5 — that is, half of the attention scores
are ignored — is an example, and says that 0.1 – 0.2 would be more typical in a real-world
training run. That seemed surprisingly high to me, but Claude agrees:
> For training large language models (LLMs), a typical dropout rate for attention
> scores usually falls in the range of 10-15%.
So there you go! If the LLMs agree, it must be true…
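Just to see the class in action before worrying about where it goes in the model, here’s a minimal sketch (the seed and the all-ones matrix are just there to make the effect easy to eyeball, not anything from a real model):

```python
import torch

torch.manual_seed(123)

dropout = torch.nn.Dropout(0.5)  # drop each element with probability 0.5
example = torch.ones(6, 6)       # an all-ones matrix makes the zeroing easy to see

print(dropout(example))
# Roughly half of the entries come back as zero. Note that dropout is only
# active in training mode (the default for a freshly created module); after
# calling .eval() it becomes a no-op.
```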
So how do you use it? With a normal neural network, you might ignore
a subset of your neurons during one batch of your training run, then a different
subset the next time. So you’d
call the dropout function on the activations from each layer, zeroing out some at random
so that they don’t contribute to the “downstream”
calculations. (As I understand it, this means that they are also not adjusted during
back-propagation — if nothing else, it would be terribly unfair to the poor ignored
neurons to have their weights changed when they didn’t contribute to the error.)
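As a rough sketch of what that placement looks like in an ordinary feed-forward network (the layer sizes here are invented, not anything from the book), you just slot a Dropout module in after each layer’s activations:

```python
import torch.nn as nn

# Invented layer sizes; the point is just where the Dropout modules sit,
# i.e. after each layer's activations.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.2),  # zero out 20% of these activations, differently each batch
    nn.Linear(256, 64),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(64, 10),
)

model.train()  # dropout is active while training...
model.eval()   # ...and is automatically a no-op at inference time
```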
For LLMs like the one we’re working on in this book, we can either run the dropout
function on the attention weights
or “after applying the attention weights to the value vectors”. I was a bit confused by
the latter, but after a bit of research (I asked Claude, ChatGPT and Grok 3 again 😉)
it turns out that it just means that you run dropout on the matrix — the one
that has one row per input token, each row being that token’s context vector — with
random elements in the context vector being zeroed out for each token.
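In rough PyTorch terms, the two options look something like this (attn_weights and values here are just random stand-ins for the tensors we built up in earlier sections, without the causal mask):

```python
import torch

torch.manual_seed(123)

# Random stand-ins: 7 tokens, 4-dimensional value vectors, no causal mask.
attn_weights = torch.softmax(torch.randn(7, 7), dim=-1)
values = torch.randn(7, 4)

dropout = torch.nn.Dropout(0.5)

# Option 1 (the book's example): dropout on the attention weights themselves,
# before they are applied to the value vectors.
context_1 = dropout(attn_weights) @ values

# Option 2: apply the weights to the value vectors first, then run dropout
# on the resulting context-vector matrix (one row per token).
context_2 = dropout(attn_weights @ values)
```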
The book uses the example of doing dropout on the attention weights, and the code
was simple enough. But one thing that did confuse me was the way it rebalances the matrix
post-dropout. Let’s start with this causal attention weight matrix:
Token | A(“The”) | A(“fat”) | A(“cat”) | A(“sat”) | A(“on”) | A(“the”) | A(“mat”) |
---|---|---|---|---|---|---|---|
The | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
fat | 0.4633 | 0.5367 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
cat | 0.3221 | 0.3324 | 0.3454 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
sat | 0.2355 | 0.2334 | 0.2613 | 0.2698 | 0.0000 | 0.0000 | 0.0000 |
on | 0.1893 | 0.1910 | 0.1974 | 0.2031 | 0.2192 | 0.0000 | 0.0000 |
the | 0.1613 | 0.1613 | 0.1613 | 0.1613 | 0.1630 | 0.1918 | 0.0000 |
mat | 0.1344 | 0.1344 | 0.1369 | 0.1489 | 0.1463 | 0.1440 | 0.1551 |
After a 50% dropout it might look like this:
Token | A(“The”) | A(“fat”) | A(“cat”) | A(“sat”) | A(“on”) | A(“the”) | A(“mat”) |
---|---|---|---|---|---|---|---|
The | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
fat | 0.4633 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
cat | 0.0000 | 0.3324 | 0.3454 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
sat | 0.2355 | 0.0000 | 0.0000 | 0.2698 | 0.0000 | 0.0000 | 0.0000 |
on | 0.0000 | 0.1910 | 0.0000 | 0.2031 | 0.0000 | 0.0000 | 0.0000 |
the | 0.0000 | 0.1613 | 0.0000 | 0.1613 | 0.0000 | 0.1918 | 0.0000 |
mat | 0.1344 | 0.1344 | 0.0000 | 0.1489 | 0.0000 | 0.0000 | 0.1551 |
So far we’ve treated it as super-important that every row sums up to 1. But the Dropout
class doesn’t know anything about that — indeed, it knows nothing about what the
structure of the matrix is. It just zeros out random values.
But after that, it has to do something to rebalance the matrix — so it divides what’s left
by \( 1 - p \), where \( p \) is the dropout value. That’s \( 1 - 0.5 = 0.5 \) in this
case, so that means that the remaining numbers are all
doubled, like this:
Token | A(“The”) | A(“fat”) | A(“cat”) | A(“sat”) | A(“on”) | A(“the”) | A(“mat”) |
---|---|---|---|---|---|---|---|
The | 2.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
fat | 0.9266 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
cat | 0.0000 | 0.6648 | 0.6908 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
sat | 0.4710 | 0.0000 | 0.0000 | 0.5396 | 0.0000 | 0.0000 | 0.0000 |
on | 0.0000 | 0.3820 | 0.0000 | 0.4062 | 0.0000 | 0.0000 | 0.0000 |
the | 0.0000 | 0.3226 | 0.0000 | 0.3226 | 0.0000 | 0.3836 | 0.0000 |
mat | 0.2688 | 0.2688 | 0.0000 | 0.2978 | 0.0000 | 0.0000 | 0.3102 |
The first row now sums to 2, and none of the others sum to 1 either! That scaling is exactly what
the Dropout
class is meant to do, but it definitely feels like we must be using it wrong
in the light of what we’ve been doing so far.
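A quick check makes the effect obvious (again with a random stand-in matrix rather than the one from the book):

```python
import torch

torch.manual_seed(123)

attn_weights = torch.softmax(torch.randn(7, 7), dim=-1)
dropout = torch.nn.Dropout(0.5)

print(attn_weights.sum(dim=-1))           # every row sums to 1.0
print(dropout(attn_weights).sum(dim=-1))  # the survivors are divided by 1 - p,
                                          # so the rows no longer sum to 1
```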
That surprised me enough that I reread the section and checked
the code in the next section to make sure that the dropout was not meant to be applied
to the attention scores, pre-softmax, rather than the attention weights, but it’s
definitely not. I don’t have a strong intuition about why that might be, or why it
might not matter (apart from the fact that if you were working with the attention
scores you’d need to replace the dropped-out values with \( -\infty \) rather
than zero, and the Dropout class doesn’t seem to support that).
While finishing off this post, I ran it past a few LLMs to check for accuracy.
ChatGPT tells me that in real-world scenarios, people often do run dropout on
the attention scores (using something other than PyTorch’s Dropout so that they
can put in \( -\infty \) rather than zero) and then run softmax. That’s interesting!
Either it’s wrong on that (though it did seem very certain, and other LLMs agreed
when queried) or this is more of a pedagogical example for the sake of the book.
Another one for the “further investigation needed” list.
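I haven’t tried to confirm that against any real training code, but if I’ve understood the description correctly, a pre-softmax version would look roughly like this (attn_scores is just a random stand-in, and the 10% masking is hand-rolled rather than anything from a library):

```python
import torch

torch.manual_seed(123)

attn_scores = torch.randn(7, 7)  # random stand-in for the pre-softmax scores

# Hand-rolled "dropout" on the scores: replace a random 10% of them with
# -inf instead of zero, then softmax as usual.
drop_mask = torch.rand_like(attn_scores) < 0.1
attn_scores = attn_scores.masked_fill(drop_mask, float("-inf"))

attn_weights = torch.softmax(attn_scores, dim=-1)
print(attn_weights.sum(dim=-1))  # each row still sums to 1
```

If that is what people do, it would also sidestep the rebalancing issue above, since the softmax runs after the masking and so each row still sums to 1.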
But I guess that in practice, with a 10% dropout rate, it probably doesn’t matter
too much. The attention weights for the first “The” summing to 2 in the example
is obviously crazy, but with 10% we’d be dividing by \( 1 - 0.1 = 0.9 \), and a sum
of 1.111 would be much less obviously weird.
So that’s it for dropout. I was originally
going to combine this one with the next section, which brings everything together to
show a full causal attention class with dropout in PyTorch — but the book glosses
over something for that, something that I want to dig into in a little more depth
than the book does — how do we work with the third-order tensors that are required
to handle batches in an LLM?
All of the maths I’ve blogged about so far has topped
out at second-order tensors — matrices — and so this is a big jump. I don’t
think there is any super-heavy intellectual lifting to do to get past it, but at
the same time it feels like something worth — for me — spending a little more
time on than I could in a post that covered dropout too.
So: more soon 🙂