Attention Is Off By One

By Evan Miller

July 24, 2023

About which one cannot speak, one must pass over in silence. –Wittgenstein

Do you see the off-by-one error in this formula?

\[ \textrm{Attention}(Q, K, V) = \textrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V \]

The attention formula is the central equation of modern AI, but there’s a bug in it that has been driving me nuts the last week. I tried writing a serious-looking research paper about the bug and my proposed fix, but I lost a series of pitched battles against Pytorch and biblatex, so I figured I’d just write a blog post instead. (History is written by the winners; blogs are written by…)

In this post I’ll explain how (IMHO) the current generation of AI models have an off-by-one error in a crucial place, and it’s making everyone’s Transformer models needlessly difficult to compress and deploy. Consider this an opinion piece, but if anyone out there on the Internet wants to run some experiments and prove me right, we can collaborate and call ourselves scientists.

It’s All About Outliers

First let’s talk about why this off-by-one error matters. ChatGPT works great, what’s your problem bro? I first caught the scent of something amiss whilst minding my business and perusing research papers on quantization, the technique by which LLM Edgers cram big models onto Mac Minis and Rasberry Pis and jailbroken home thermostats. Everybody in AI is limited by RAM, so the less RAM you use the more cool stuff you can do, both Up There In The Cloud and down here on the edge. LLMs have billions of weights, and if we can make these weights tighten their haunches and suck in their paunches, we can compose better sonnets or plagiarize superior essays or accelerate the end of days, whatever your personal motivation for using language may be.

RAM stores information. This sounds like a tautology, but hang with me. Information is negative log-probability, and is how many bits we need to store things. If a stream of numbers is highly predictable, for example is always contained in a limited range, we need fewer bits to store them. If a stream of numbers is not predictable, like once in a blue moon a mega-number shows up, we need more binary digits to encode the Colossus.

This is what’s been happening in LLMs – for reasons that are only partially understood, Transformer models contain these outlier weights and are emitting Black Swan mega-activations that are much, much, much larger, like orders of magnitude larger, than their peers. But no one can get rid of them; the megalodons seem to be critical to the operation of these models, and their existence is contrary to everything we thought we knew about neural networks prior to building ones that worked so well. Many papers have been written about these outlier values, and people have been cooking up all kinds of bit-burning schemes to encode them with fewer ones and zeroes, because right now we get pretty gnarly performance degradation with vanilla scale-and-bias integer quantization. Georgi can tell you more; I’m getting ahead of myself.

The best analysis of all of this comes from a team at Qualcomm AI Research in this paper: Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing. The authors traced the existence of these outlier values to the attention mechanism’s softmax function, the seemingly innocent exponentiator that no one thought capable of such kurtotic barbarities. The researchers came this close to finding the off-by-one error, like killer-in-the-closet close, but they must all be on summer vacation in Italy as none of them are responding to my email overtures, and so I must appeal to the international community of scholars the old-fashioned way.

If you read the linked paper, just ignore their proposals. Sounds harsh, but hear me out. The clipped softmax comes with a wheel-spinning zero gradient, and their gated attention proposal, while workable, introduces millions of new parameters to solve what is really just a failure to increment. There’s a simple and hindsight-obvious solution here that, from all of my reading, no one has thought to try.

All right, I’ve talked up this silly error enough. I’ve alluded to its location. Let’s talk about the softmax function and why it’s not-quite-the-right tool for the job when it comes to attention.

The Trouble With Softmax

Now to explain the bug you really have to understand what the attention mechanism is trying to do. Most numerical bugs are programmers implementing equations incorrectly. However, when you’re dealing not with bad code, but with bad math, you need to understand where that equation comes from and what it’s supposed to be doing before you have any hope of fixing it.

I had to read like 50 arXiV papers to understand all this, and probably I should have taken a Udemy course or binge-watched Andrej Karpathy’s YouTube channel instead, but let’s start with what is called the input embedding, which is a floating-point vector that represents a word in the input string.

This vector seems to get taller every model year, for example the recent LLaMA 2 model from Meta uses an embedding vector of length 3,204, which works out to 6KB+ in half-precision floating-point, just to represent one word in the vocabulary, which typically contains 30,000 - 50,000 entries.

Now if you’re a memory-miserly C programmer like me, you might wonder, why in the world are these AI goobers using 6KB to represent something that ought to take, like 2 bytes tops? If their vocabulary is less than \(2^{16}\)=65,384, we only need 16 bits to represent an entry, yeah?

Well, here is what the Transformer is actually doing: it transforms (eh?) that input vector to an output vector of the same size, and that final 6KB output vector needs to encode absolutely everything needed to predict the token after the current one. The job of each layer of the Transformer is quite literally adding information to the original, single-word vector. This is where the residual (née skip) connections come in: all of the attention machinery is just adding supplementary material to that original two bytes’ worth of information, analyzing the larger context to indicate, for instance, that the word pupil is referring to a student, and not to the hole in your eye. Repeat the attention mechanism a few dozen times and you’ve mastered the English language and all its vasty contents.

Now the final step of Transformer is to multiply this output vector by a rectangular matrix, and cram the resulting vocabulary-length vector into a softmax, treating those exponentiated outputs as next-token probabilities. This is reasonable, but everyone knows it’s not quite right, as no sane and respected model out there regards those output probabilities as correct, but instead every implementation and its sister uses a sampling mechanism to hide the fact that softmax is over-representing low probabilities.

That’s all fine and workable; softmax in the output step gives us a gradient for every word in the vocabulary, and it’s a reasonable choice until something better comes along.

But what I want to argue is that sauce for the goose shouldn’t be slathered on the gander; the Transformer’s output softmax serves a very different purpose from the attention mechanism’s internal softmax, and we’d all do well to get rid of the latter, or at least prop up its denominator with something handy ⛱️.

An exponential peg…

So what is softmax? It started out in statistical mechanics as a way to predict the distribution of states based on their energy levels:

\[ p_i \propto \exp\left(-\frac{\varepsilon_i}{kT}\right) \]

Then the economists got a hold of it and realized that if the noise term in people’s otherwise linear utility functions happened to follow a Gumbel distribution (doesn’t yours?), then the probability someone will choose an item will be proportional to the exponent of the utility inputs:

\[ \Pr(Y_i=k)=\frac{\exp(X_i \beta_k)}{\sum_j \exp(X_i \beta_j)} \]

This gave softmax a life in multinomial logistic functions; this is where I got to know the old chap, as in I hand-derived the Hessian of this sucker in grad school and coded it up to run on GPUs using linear fixed effects, which to my knowledge no one else has been foolish enough to attempt before or since. I just mention this because I know the softmax function better than I know many of my humanoid friends, and I can recognize when it’s being used for things it oughtn’t.

Softmax is a kind of cheat code to map real-valued numbers to probabilities that sum to one. In physics, it works quite well; in economics, it’s a little bit of a lie, but by the time it got to machine learning, it just became a thing that seemed to work whenever a discrete choice was involved. Softmax the vector and pick something, all right?

\[ (\textrm{softmax(x)})_i =\frac{\exp(x_i)}{\sum_j \exp(x_j)} \]

Here’s the core mechanic of softmax: it forces a choice among competing alternatives, whether it’s particles picking an energy state or consumers choosing a car. That is, if a softmax mechanism doesn’t want to choose anything at all, softmax will require modification, or else we would expect the softmax to produce distortions once it encounters actual data.

In the context of LLMs, one of those distortions is to heavily weight non-semantic tokens (commas and such), and those beefy weights become those difficult-to-compress outlier values that make Life on the Edge harder than it should be. The Qualcomm AI researchers found that 97%+ of outlier activations in LLMs occur in whitespace and punctuation positions. Something’s fishy around here… and I thought I ordered the chicken.

…in a linear hole

Let’s dive in to how softmax is used inside of attention and see exactly where it goes wrong. Here’s the equation again:

\[ \textrm{Attention}(Q, K, V) = \textrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V \]

Breaking it down: in decoder-only models (i.e., everything since ChatGPT), \(Q\), \(K\), and \(V\) all originate from the same input sequence. They’re not identical, as they’ve been projected in different ways along the way, but in each layer they all start with the same annotated (added-to) embedding vector.

Now: \(QK^T\) is looking for correlations between token (embedding) vectors in different positions, essentially building up a square matrix of correlation (dot-product scaled by \(1/\sqrt{d}\)) values, where each column and row corresponds to a token position. Then each row of this square matrix is softmaxed, with the resulting probabilities used as a mixing function for the value vectors in the \(V\) matrix. The probability-mixed \(V\) gets added to the input vector, with the sum passed on down the neural network for further processing.

Multi-head attention goes through this process several times, in parallel, per layer. Essentially it divides up the embedding vector, and each head uses information in the entire vector to annotate one (non-overlapping) segment of the output vector. If you were confused by the Concatenation operation in the original Transformer paper, that’s all that’s happening: Head 1 adds information to Segment 1, Head 2 adds information to Segment 2, and so on.

The problem with using softmax is that it forces each attention head to make an annotation, even if it has no information to add to the output vector. Using softmax to choose among discrete alternatives is great; using it for optional annotation (i.e. as input into addition) is, like, not cool, man. The problem here is exacerbated with multi-head attention, as a specialized head is more likely to want to “pass” than a general-purpose one. These attention heads are needlessly noisy, a deafening democracy where abstention is disallowed.

Now it’s possible that softmax should be replaced wholesale, but it’s worked pretty well for the most part, except for this one wee little bug that prevents attention heads from saying nothing. So I propose a very small tweak on which I am willing to stake all future Internet claims to being correct. The tweak is so small, yet so obvious, and it’s been sitting here under everyone’s noses ever since attention was invented (2014).

Are you ready?

Are you ready-ready?

I can’t hear you.

Softmax One and Quiet Attention

All right, here it is and you are welcome, the long-awaited Softmax Super-Mod that is soon to set the LLM hacker channels aflame:

\[ (\textrm{softmax}_1(x))_i = \frac{\exp(x_i)}{1+\sum_j \exp(x_j)} \]

Bit of a let-down, eh? All I did was added one to the denominator. This lets the vector as a whole tend to zero if it wants, but otherwise just shrinks the values by a small amount, a shrinkage which will be made up for during normalization, which happens right after attention.

The key difference is in the negative limit, when entries in \(x\) are significantly less than zero and the model is trying to avoid an annotation altogether. Compare the limiting behavior of the original softmax

\[ \lim_{x_1 \to -\infty} \ldots \lim_{x_k \to -\infty} (\textrm{softmax}(x))_i = \frac{1}{k} \gt 0 \]

to that of the new and improved softmax1:

\[ \lim_{x_1 \to -\infty} \ldots \lim_{x_k \to -\infty} (\textrm{softmax}_1(x))_i = 0 \]

Vanilla softmax will always emit the same total weight; softmax1 mostly looks the same, but comes with an escape hatch in the negative orthant. To be clear: the core issue here is mathematical and not numerical in nature. Extra precision won’t save the softmax; all Transformers are affected.

You’ll notice a couple of other things about softmax1. The derivative is positive, so we always have a non-zero gradient, and its sum is between zero and one, so the output isn’t out of control. The function maintains the property

\[ \frac{(\textrm{softmax}_1(x))_i}{(\textrm{softmax}_1(x))_j} = \frac{(\textrm{softmax}(x))_i}{(\textrm{softmax}(x))_j} = \frac{\exp(x_i)}{\exp(x_j)} \quad \forall \ i, j \]

i.e. relative values in the output vector are unchanged.

Originally I wanted to call this function ghostmax, as you can think of there being an extra zero-valued entry in \(x\) (as \(\exp(0)=1\)), as well as a zero vector in the \(V\) matrix that attenuates the result. But I didn’t want to scare anyone.

Even though softmax1 is facially quite boring, I’m 99.44% sure that it will resolve the outlier feedback loop that’s making quantization the subject of cascades of research. If you want to run some experiments and prove me right, DM me on Twitter and we’ll get a paper going.

We can call the improved mechanism QuietAttention, as it allows attention heads to keep their yaps shut. So may I propose to a receptive public

\[ \textrm{QuietAttention}(Q, K, V) := \textrm{softmax}_1 \left(\frac{QK^T}{\sqrt{d}}\right)V \]

I think a test could be hacked together fairly quickly: if you prefix every input context with a zero vector, and ensure that your neural network of choice doesn’t add any bias (including with the positional encoding), then the zero should pass through unaltered and will have the effect of adding unity to every subsequent softmax denominator; that way you won’t lose your mind mucking with gradient code. I think this could be done with a LLaMA model that uses fixed embeddings and a special prefix token, but I’d very much like to get back to my non-existent hobbies, and I’ve already spent more time on this problem than my therapist needs to know about.

You’d still have to re-train the model, so don’t try this on an RPi just yet. But do let me know how those weight kurtoses and activation infinity norms are looking after a few runs. I’m thinking those numbers will make for a handsome table in a soon-to-be influential arXiV paper, either when those Qualcomm AI researchers step off the plane from Italy, or someone in an LLM hacker channel figures out biblatex, whichever happens first.

You’re reading, a random collection of math, tech, and musings. If you liked this you might also enjoy:

Get new articles as they’re published, via LinkedIn, Twitter, or RSS.

Want to look for statistical patterns in your MySQL, PostgreSQL, or SQLite database? My desktop statistics software Wizard can help you analyze more data in less time and communicate discoveries visually without spending days struggling with pointless command syntax. Check it out!

Statistics the Mac way

Back to Evan Miller’s home pageSubscribe to RSSLinkedInTwitter