# Attention Is Off By One

By Evan Miller

*July 24, 2023*

About which one cannot speak, one must pass over in silence. –Wittgenstein

Do you see the off-by-one error in this formula?

\[ \textrm{Attention}(Q, K, V) = \textrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V \]
The attention formula is the central equation of modern AI,
but there’s a bug in it that has been driving me nuts the last week. I tried
writing a serious-looking research paper about the bug and my proposed fix, but
I lost a series of pitched battles against Pytorch and `biblatex`

,
so I figured I’d just write a blog post instead. (History is written by the
winners; blogs are written by…)

In this post I’ll explain how (IMHO) the current generation of AI models have an off-by-one error in a crucial place, and it’s making everyone’s Transformer models needlessly difficult to compress and deploy. Consider this an opinion piece, but if anyone out there on the Internet wants to run some experiments and prove me right, we can collaborate and call ourselves scientists.

## It’s All About Outliers

First let’s talk about why this off-by-one error matters. *ChatGPT works
great, what’s your problem bro?* I first caught the scent of something
amiss whilst minding my business and perusing research papers on quantization,
the technique by which LLM Edgers cram
big models onto Mac Minis and Rasberry Pis and
jailbroken home thermostats.
Everybody in AI is limited by
RAM, so the less RAM you use the more cool stuff you can do, both Up
There In The Cloud and down here on the edge. LLMs have billions of weights,
and if we can make these weights tighten their haunches and suck in their
paunches, we can compose better sonnets or plagiarize superior essays or
accelerate the end of days, whatever your personal motivation for using
language may be.

RAM stores information. This sounds like a tautology, but hang with me. Information is negative log-probability, and is how many bits we need to store things. If a stream of numbers is highly predictable, for example is always contained in a limited range, we need fewer bits to store them. If a stream of numbers is not predictable, like once in a blue moon a mega-number shows up, we need more binary digits to encode the Colossus.

This is what’s been happening in LLMs – for reasons that are only partially
understood, Transformer models contain these outlier weights and are
emitting Black Swan mega-activations that are much, much, much larger, like
orders of magnitude larger, than their peers. *But no one can get rid of them*;
the megalodons seem to be critical to the operation of these models, and their
existence is contrary to everything we thought we knew about neural networks
prior to building ones that worked so well. Many papers have been written about
these outlier values, and people have been cooking up all kinds of bit-burning
schemes to encode them with fewer ones and zeroes, because right now we get
pretty gnarly performance degradation with vanilla scale-and-bias integer
quantization. Georgi can tell you more; I’m getting ahead of myself.

The best analysis of all of this comes from a team at Qualcomm AI Research in this paper:
**Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing**. The authors traced the existence of these outlier values to the
attention mechanism’s
softmax function,
the seemingly innocent exponentiator that no one thought capable of such
kurtotic barbarities. The researchers came *this close* to finding the
off-by-one error, like killer-in-the-closet close, but they must all be on
summer vacation in Italy as none of them are responding to my email overtures, and so I
must appeal to the international community of scholars the old-fashioned way.

If you read the linked paper, just ignore their proposals. Sounds harsh, but hear me out. The clipped softmax comes with a wheel-spinning zero gradient, and their gated attention proposal, while workable, introduces millions of new parameters to solve what is really just a failure to increment. There’s a simple and hindsight-obvious solution here that, from all of my reading, no one has thought to try.

All right, I’ve talked up this silly error enough. I’ve alluded to its location. Let’s talk about the softmax function and why it’s not-quite-the-right tool for the job when it comes to attention.

## The Trouble With Softmax

Now to explain the bug you really have to understand what the attention mechanism is trying to do. Most numerical bugs are programmers implementing equations incorrectly. However, when you’re dealing not with bad code, but with bad math, you need to understand where that equation comes from and what it’s supposed to be doing before you have any hope of fixing it.

I had to read like 50 arXiV papers to understand all this, and probably I should have taken a Udemy course or binge-watched Andrej Karpathy’s YouTube channel instead, but let’s start with what is called the input embedding, which is a floating-point vector that represents a word in the input string.

This vector seems to get taller every model year, for example the recent LLaMA
2 model from Meta uses an embedding vector of length 3,204, which works out to
6KB+ in half-precision floating-point, just to represent *one word* in
the vocabulary, which typically contains 30,000 - 50,000 entries.

Now if you’re a memory-miserly C programmer like me, you might wonder, why in the world are these AI goobers using 6KB to represent something that ought to take, like 2 bytes tops? If their vocabulary is less than \(2^{16}\)=65,384, we only need 16 bits to represent an entry, yeah?

Well, here is what the Transformer is actually doing: it *transforms*
(eh?) that input vector to an output vector *of the same size*, and that
final 6KB output vector needs to encode *absolutely everything needed to
predict the token after the current one*. The job of each layer of the
Transformer is quite literally adding information to the original, single-word
vector. This is where the residual (née skip) connections come in: all of the
attention machinery is just adding supplementary material to that original two
bytes’ worth of information, analyzing the larger context to indicate, for
instance, that the word *pupil* is referring to a student, and not to the
hole in your eye. Repeat the attention mechanism a few dozen times and you’ve mastered
the English language and all its vasty contents.

Now the final step of Transformer is to multiply this output vector by a rectangular matrix, and cram the resulting vocabulary-length vector into a softmax, treating those exponentiated outputs as next-token probabilities. This is reasonable, but everyone knows it’s not quite right, as no sane and respected model out there regards those output probabilities as correct, but instead every implementation and its sister uses a sampling mechanism to hide the fact that softmax is over-representing low probabilities.

That’s all fine and workable; softmax in the output step gives us a gradient for every word in the vocabulary, and it’s a reasonable choice until something better comes along.

But what I want to argue is that sauce for the goose shouldn’t be slathered on
the gander; the Transformer’s *output* softmax serves a very different
purpose from the attention mechanism’s *internal* softmax, and we’d all
do well to get rid of the latter, or at least prop up its denominator with something
handy ⛱️.

### An exponential peg…

So what is softmax? It started out in statistical mechanics as a way to predict the distribution of states based on their energy levels:

\[ p_i \propto \exp\left(-\frac{\varepsilon_i}{kT}\right) \]Then the economists got a hold of it and realized that if the noise term in people’s otherwise linear utility functions happened to follow a Gumbel distribution (doesn’t yours?), then the probability someone will choose an item will be proportional to the exponent of the utility inputs:

\[ \Pr(Y_i=k)=\frac{\exp(X_i \beta_k)}{\sum_j \exp(X_i \beta_j)} \]This gave softmax a life in multinomial logistic functions; this is where I got to know the old chap, as in I hand-derived the Hessian of this sucker in grad school and coded it up to run on GPUs using linear fixed effects, which to my knowledge no one else has been foolish enough to attempt before or since. I just mention this because I know the softmax function better than I know many of my humanoid friends, and I can recognize when it’s being used for things it oughtn’t.

Softmax is a kind of cheat code to map real-valued numbers to probabilities that sum to one. In physics, it works quite well; in economics, it’s a little bit of a lie, but by the time it got to machine learning, it just became a thing that seemed to work whenever a discrete choice was involved. Softmax the vector and pick something, all right?

\[ (\textrm{softmax(x)})_i =\frac{\exp(x_i)}{\sum_j \exp(x_j)} \]
Here’s the core mechanic of softmax: *it forces a choice among competing
alternatives*, whether it’s particles picking an energy state or consumers
choosing a car. That is, if a softmax mechanism *doesn’t want to choose
anything at all*, softmax will require modification, or else we would
expect the softmax to produce distortions once it encounters actual data.

In the context of LLMs, one of those distortions is to heavily weight
non-semantic tokens (commas and such), and those beefy weights become those
difficult-to-compress outlier values that make Life on the Edge harder
than it should be. *The Qualcomm AI researchers found that 97%+ of outlier
activations in LLMs occur in whitespace and punctuation positions.*
Something’s fishy around here… and I thought I ordered the chicken.

### …in a linear hole

Let’s dive in to how softmax is used inside of attention and see exactly where it goes wrong. Here’s the equation again:

\[ \textrm{Attention}(Q, K, V) = \textrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V \]Breaking it down: in decoder-only models (i.e., everything since ChatGPT), \(Q\), \(K\), and \(V\) all originate from the same input sequence. They’re not identical, as they’ve been projected in different ways along the way, but in each layer they all start with the same annotated (added-to) embedding vector.

Now: \(QK^T\) is looking for correlations between token (embedding) vectors in different positions, essentially building up a square matrix of correlation (dot-product scaled by \(1/\sqrt{d}\)) values, where each column and row corresponds to a token position. Then each row of this square matrix is softmaxed, with the resulting probabilities used as a mixing function for the value vectors in the \(V\) matrix. The probability-mixed \(V\) gets added to the input vector, with the sum passed on down the neural network for further processing.

Multi-head attention goes through this process several times, in parallel, per layer. Essentially it divides up the embedding vector, and each head uses information in the entire vector to annotate one (non-overlapping) segment of the output vector. If you were confused by the Concatenation operation in the original Transformer paper, that’s all that’s happening: Head 1 adds information to Segment 1, Head 2 adds information to Segment 2, and so on.

*The problem with using softmax is that it forces each attention head to
make an annotation, even if it has no information to add to the output
vector.* Using softmax to choose among discrete alternatives is great;
using it for optional annotation (i.e. as input into addition) is, like,
not cool, man. The problem here is exacerbated with multi-head attention, as a
specialized head is more likely to want to “pass” than a general-purpose one.
These attention heads are needlessly noisy, a deafening democracy where
abstention is disallowed.

Now it’s possible that softmax should be replaced wholesale, but it’s worked pretty well for the most part, except for this one wee little bug that prevents attention heads from saying nothing. So I propose a very small tweak on which I am willing to stake all future Internet claims to being correct. The tweak is so small, yet so obvious, and it’s been sitting here under everyone’s noses ever since attention was invented (2014).

Are you ready?

Are you *ready-ready*?

I can’t hear you.

## Softmax One and Quiet Attention

All right, here it is and you are welcome, the long-awaited Softmax Super-Mod that is soon to set the LLM hacker channels aflame:

\[ (\textrm{softmax}_1(x))_i = \frac{\exp(x_i)}{1+\sum_j \exp(x_j)} \]Bit of a let-down, eh? All I did was added one to the denominator. This lets the vector as a whole tend to zero if it wants, but otherwise just shrinks the values by a small amount, a shrinkage which will be made up for during normalization, which happens right after attention.

The key difference is in the negative limit, when entries in \(x\) are significantly less than zero and the model is trying to avoid an annotation altogether. Compare the limiting behavior of the original softmax

\[ \lim_{x_1 \to -\infty} \ldots \lim_{x_k \to -\infty} (\textrm{softmax}(x))_i = \frac{1}{k} \gt 0 \]
to that of the new and improved softmax_{1}:

Vanilla softmax will always emit the same total weight; softmax_{1}
mostly looks the same, but comes with an escape hatch in the negative orthant.
To be clear: the core issue here is *mathematical* and not
*numerical* in nature. Extra precision won’t save the softmax; all
Transformers are affected.

You’ll notice a couple of other things about softmax_{1}. The
derivative is positive, so we always have a non-zero gradient, and its sum is
between zero and one, so the output isn’t out of control. The function maintains the property

i.e. relative values in the output vector are unchanged.

Originally I wanted to call this
function *ghostmax*, as you can think of there being an extra
zero-valued entry in \(x\) (as \(\exp(0)=1\)), as well as a zero vector in the
\(V\) matrix that attenuates the result. But I didn’t want to scare anyone.

Even though softmax_{1} is facially quite boring, I’m 99.44% sure that it will
resolve the outlier feedback loop that’s making quantization the subject of
cascades of research. If you want to run some experiments and prove me right,
DM me on Twitter and we’ll get a
paper going.

We can call the improved mechanism *QuietAttention*, as it allows
attention heads to keep their yaps shut. So may I propose to a receptive public

I think a test could be hacked together fairly quickly: if you prefix every
input context with a zero vector, and ensure that your neural network of choice
doesn’t add any bias (including with the positional encoding), then the zero
should pass through unaltered and will have the effect of adding unity to every
subsequent softmax denominator; that way you won’t lose your mind mucking with
gradient code. I *think* this could be done with a LLaMA model that uses
fixed embeddings and a special prefix token, but
I’d very much like to get
back to my non-existent hobbies, and
I’ve already spent more time
on this problem than my therapist needs to know about.

You’d still have to re-train the model, so don’t try this on an RPi just yet.
But do let me know how those weight kurtoses and activation infinity norms are
looking after a few runs. I’m thinking those numbers will make for a handsome
table in a soon-to-be influential arXiV paper, either when those Qualcomm AI
researchers step off the plane from Italy, or someone in an LLM hacker channel
figures out `biblatex`

, whichever happens first.

*You’re reading evanmiller.org, a random collection of math, tech, and musings. If you liked this you might also enjoy:
*

- How Not To Sort By Average Rating
- Winkel Tripel Warping Trouble
- You Can’t Spell CUPED Without Frisch-Waugh-Lovell
- Evaluating Splatoon’s Ranking System

*Get new articles as they’re published, via LinkedIn, Twitter, or RSS.*

*Want to look for statistical patterns in your MySQL, PostgreSQL, or SQLite database? My desktop statistics software Wizard can help you analyze more data in less time and communicate discoveries visually without spending days struggling with pointless command syntax. Check it out!*

Back to Evan Miller’s home page – Subscribe to RSS – LinkedIn – Twitter