Coding Theory

In this final chapter, we will see how entropy connects to compression and what that tells us about large language models.

Story time
💡 If there is one thing you remember from this chapter...

Entropy is the unbeatable lower bound for lossless compression, if we treat letters independently.

Code intro

Say we have a long DNA string built from the letters $\{\mathsf{A},\mathsf{C},\mathsf{G},\mathsf{T}\}$. We want to store it using as little disk space as possible. Here's a simple plan: we assign a binary code name to each letter and encode the string using that code.

For example, we can use $\mathsf{A} \rightarrow \textsf{00}, \mathsf{C} \rightarrow \textsf{01}, \mathsf{G} \rightarrow \textsf{10}, \mathsf{T} \rightarrow \textsf{11}$. The string gets stored using just 2 bits per letter. Done!
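Here's a minimal sketch of this plan in Python (the particular 2-bit code names above are just one possible choice):

```python
# A minimal sketch of the naive plan: a fixed 2-bit code name per letter.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}

def encode(dna: str) -> str:
    return "".join(CODE[ch] for ch in dna)

print(encode("AACATACG"))  # '0000010011000110' -- 16 bits for 8 letters
```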

Can we do better? Sometimes, yes! Here's a riddle: In the following widget, try to build a code that encodes the following string with just 1.75 bits per letter on average.

While playing, notice that whenever you give a short code name to a letter, e.g. $\mathsf{A} \rightarrow \textsf{0}$, no other letter can get a code name starting with that same $\textsf{0}$. The reason is that otherwise, the resulting code sometimes wouldn't be decodable. E.g., if you use the code names $\mathsf{A} \rightarrow \textsf{0}$ and $\mathsf{C} \rightarrow \textsf{00}$, the message "$\textsf{00}$" has two meanings.

[Interactive widget: drag the letters A, C, G, T onto the nodes of a 4-layer binary tree to build a code for the text AACATACG.]
SPOILER

Here's the solution: $\mathsf{A} \rightarrow \textsf{0}, \mathsf{C} \rightarrow \textsf{10}, \mathsf{T} \rightarrow \textsf{110}, \mathsf{G} \rightarrow \textsf{111}$. Since the letter frequencies in the text are $\frac12, \frac14, \frac18, \frac18$, this encoding only uses

$$\frac12 \cdot 1 + \frac14 \cdot 2 + \frac18 \cdot 3 + \frac18 \cdot 3 = 1.75$$

bits per letter.

OK, using shorter code names for more frequent letters makes sense, but what's the best way to do this? Coding theory has the answer. Roughly speaking, it tells us to try to give a letter with frequency $p_i$ a code name of length $\log 1/p_i$.

For example, looking at the computation above, you can rewrite its left-hand side as:

$$\frac12 \cdot \log 2 + \frac14 \cdot \log 4 + \frac18 \cdot \log 8 + \frac18 \cdot \log 8 = 1.75.$$

Every letter with frequency $p$ got a code name of length exactly $\log 1/p$! Notice that for a general string of length $N$, the total length of an encoding satisfying the $p \rightarrow \log 1/p$ rule is this:

$$N \cdot \sum_{i} p_i \log \frac{1}{p_i} = N \cdot H(p).$$

That is, if you manage to construct a code following the $p \rightarrow \log 1/p$ rule, you spend $H(p)$ bits per letter on average.
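As a quick sanity check, here's the entropy of the riddle's letter frequencies computed in Python; it matches the 1.75 bits per letter we found by hand:

```python
import math

# Entropy of the letter frequencies from the riddle's text "AACATACG".
freqs = [1/2, 1/4, 1/8, 1/8]
H = sum(p * math.log2(1 / p) for p in freqs)
print(H)  # 1.75
```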

If we implement our $p \rightarrow \log 1/p$ rule of thumb with the most direct algorithm possible, we get what's known as Shannon's code. This code sorts the letter frequencies from the largest one down and assigns to each letter of frequency $p$ a code name of length $\lceil \log \frac{1}{p} \rceil$ (we round up because we can't have a code name with 1.5 bits). Here it is for the English alphabet:
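If you're curious how the code names themselves can be built, here's a minimal Python sketch of one standard construction (each letter's code name is the first $\lceil \log 1/p \rceil$ bits of the binary expansion of the cumulative frequency of the letters before it); the exact construction isn't important for the rest of the chapter:

```python
import math

def shannon_code(freqs):
    """Sort letters by decreasing frequency; give the letter with frequency p
    a code name of length ceil(log2(1/p)), taken from the binary expansion
    of the cumulative frequency of the letters that came before it."""
    items = sorted(freqs.items(), key=lambda kv: -kv[1])
    code, cumulative = {}, 0.0
    for letter, p in items:
        length = math.ceil(math.log2(1 / p))
        bits, frac = [], cumulative
        for _ in range(length):
            frac *= 2
            bit, frac = divmod(frac, 1)   # peel off one binary digit
            bits.append(str(int(bit)))
        code[letter] = "".join(bits)
        cumulative += p
    return code

# The riddle's DNA example: frequencies from the string "AACATACG".
print(shannon_code({"A": 0.5, "C": 0.25, "T": 0.125, "G": 0.125}))
# -> {'A': '0', 'C': '10', 'T': '110', 'G': '111'}
```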

[Interactive widget: Shannon Code Constructor. Drag the bars to adjust letter probabilities. English letter frequencies used: E 12.7%, T 9.1%, A 8.2%, O 7.5%, I 7.0%, N 6.8%, S 6.3%, H 6.1%, R 6.0%, D 4.3%, L 4.0%, C 2.8%, U 2.8%, M 2.4%, W 2.4%, F 2.2%, G 2.0%, Y 2.0%, P 1.9%, B 1.3%, V 1.0%, K 0.8%, J 0.2%, X 0.2%, Q 0.1%, Z 0.1%.]

You can notice that the number of bits per letter (also called the rate) is a bit north of the entropy of the underlying distribution. This is because of the rounding $\log 1/p \rightarrow \lceil \log 1/p \rceil$. For example, if there is a letter with frequency $p = 0.49$, we would like to give it a code name of length $\log \frac{1}{0.49} \approx 1.03$. Too bad that code-name lengths have to be integers, so Shannon's code assigns it a code name of length $2$. As a general rule, Shannon's code can use up to 1 bit per letter more than the entropy. 1

Unfortunately, there are situations where Shannon's code really is almost 1 bit worse than the entropy. Take a string that uses just two characters $\mathsf{A}, \mathsf{B}$, but the frequency of $\mathsf{B}$ is close to zero. Then the entropy is close to zero, but Shannon's code can't do better than 1 bit per letter.
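Here's a small computation illustrating that gap (the 99%/1% split is just an example distribution):

```python
import math

# Entropy vs. Shannon-code rate for a two-letter alphabet where B is rare.
p = [0.99, 0.01]                                            # frequencies of A and B
entropy = sum(q * math.log2(1 / q) for q in p)              # ~0.08 bits per letter
shannon = sum(q * math.ceil(math.log2(1 / q)) for q in p)   # 0.99*1 + 0.01*7 = 1.06 bits per letter
print(entropy, shannon)
```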

The source coding theorem

The source coding theorem is the most important theorem in coding theory. It basically says that on one hand, there are codes that get very close to the entropy bound. On the other hand, you can't beat that bound.

Here's the setup of the theorem more precisely. We model texts like this: there is a source that has a distribution $p$ over letters. This source keeps sampling letters independently and sends them to us. This goes on for a long time, say $n$ steps; our goal is to save a binary string to memory so that later on, we can recover from that string which $n$ letters the source sent.

This setup, with the source sampling independent letters, models that we only want to think about the problem of giving code names to letters based on their frequencies. We are not trying to use correlations like '$\mathsf{th}$ is often followed by $\mathsf{e}$' to get better compression.

The source coding theorem says that, first, there's a way to store the emitted string using close to $H(p)$ bits per letter. We have come close to that bound above: Shannon's code is only 1 bit worse than the entropy. But with more complicated codes, we can even get rid of this 1-bit slack (at the expense that the code is no longer as simple as mapping single letters to code names).

[Expand box: More on better codes]

The second part of the theorem says that we can't beat the rate $H(p)$. Details are in the expand box below.

[Expand box: Beating entropy is impossible]

Prediction = Compression

The fact that good codes can achieve a rate close to the entropy gives us a lot of intuition!

Let's go through the terms in our dictionary and see how we can interpret them using the coding-theory language.

We defined the surprisal $\log 1/p$ as a way of measuring how surprised we are by an outcome that happens with probability $p$. But equivalently, it's also how many bits you need to store that outcome in memory, using the optimal code.

We introduced entropy as the "average surprisal" of a distribution. But equivalently, it's the rate of how many bits on average we need to store outcomes sampled from a distribution $p$: we can interpret the formula $\sum_{i = 1}^n p_i \log 1/p_i$ as "the $i$-th symbol comes with frequency $p_i$, and whenever it comes, we need $\log 1/p_i$ bits to store it".

The cross-entropy is very similar: it's how many bits we need when the data comes from $p$ but we store it using the code optimized for a different distribution $q$. Finally, KL divergence (relative entropy) is how much worse our mismatched code is compared to the optimal one.

Here it is again, more concisely in a table:

| Concept | Formula | Bayesian detective | Compression |
|---|---|---|---|
| Surprisal | $\log 1/p$ | Surprise of an outcome | Bits to store a letter |
| Entropy | $H(p) = \sum_i p_i \log 1/p_i$ | Expected surprisal | Average bits per letter |
| Cross-entropy | $H(p, q) = \sum_i p_i \log 1/q_i$ | Expected surprisal under model $q$ | Average bits per letter for a code optimized for $q$ |
| KL divergence | $D(p \,\Vert\, q) = \sum_i p_i \log \frac{p_i}{q_i}$ | Additional surprise from $p \neq q$ | Additional storage from $p \neq q$ |
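Here's a small sketch computing all four quantities for the riddle's distribution $p$ and a mismatched uniform model $q$ (the uniform $q$ is just an example of a bad model):

```python
import math

# Surprisal, entropy, cross-entropy, and KL divergence, all read as "bits of storage".
# p is the true letter distribution, q the (mismatched) model our code is optimized for.
p = {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}
q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # code optimized for uniform letters

surprisal = {x: math.log2(1 / p[x]) for x in p}              # bits to store each letter
entropy = sum(p[x] * math.log2(1 / p[x]) for x in p)         # 1.75 bits per letter
cross_entropy = sum(p[x] * math.log2(1 / q[x]) for x in p)   # 2.0 bits per letter
kl = cross_entropy - entropy                                 # 0.25 extra bits per letter

print(surprisal, entropy, cross_entropy, kl)
```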

LLM compression

So far, all our discussion has been about encoding each letter independently of its context. The real fun begins when we want to compress a given text as much as possible, period. Now we definitely want to use the fact that $\textsf{th}$ is usually followed by $\textsf{e}$ as much as possible.

I want to explain a hypothetical compression algorithm based on an LLM.

First, a technicality we have to go through quickly. Although we discussed that LLMs predict the next letter, that's a simplification. Before training any LLM, we split all texts into tokens, and the inputs and outputs of LLMs consist of tokens, not letters. There is no fixed definition of what a token is, but for example GPT-2 works with around $50\,000$ different tokens. That's enough to cover all frequent syllables, medium-length words, common names, and special symbols.
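If you want to see tokens in action, here's a tiny sketch (assuming the Hugging Face transformers package is installed):

```python
# A quick look at GPT-2's tokenizer.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(tokenizer.vocab_size)   # roughly 50k tokens
print(tokenizer.tokenize("Texts about Paris often involve France."))
# common words come out as single tokens, rarer words get split into pieces
```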

LLMs can be used to compress text. Here's how: we will assume that both the compression and the decompression algorithm have access to the same LLM, say GPT-2. The compression algorithm runs GPT-2 on the input text; for each position $i$ in the text, GPT-2 generates a probability distribution $q_{LLM}(t_i)$ over the $i$-th token, based on the previous tokens $t_1, \dots, t_{i-1}$. We can use a code like Shannon's code to turn this distribution into a code assignment. For example, if GPT-2 predicts that $\textsf{the}$ has a 50% probability of coming next, Shannon's code would assign the code name $\textsf{0}$ to $\textsf{the}$. We encode the actual next token $t_i$ using this code.

That is, our compression algorithm uses a different code for each token, each code corresponding to the distribution GPT-2 predicts for that token. The decoding algorithm works analogously: for each position, it uses GPT-2 and Shannon's code to reconstruct the code and read off the next token.

The overall length of the encoded text is $\sum_i \lceil \log 1/q_{LLM}(t_i) \rceil$ bits if we use Shannon's code. In the next widget, you can see how it works.

[Interactive widget: GPT-2 compression visualization.]
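If you'd like to try this at home, here's a rough, unoptimized sketch that only computes how many bits the scheme would use (it assumes the Hugging Face transformers and torch packages, and builds the code lengths $\lceil \log 1/q_{LLM}(t_i) \rceil$ rather than the actual code names; the sample sentence is arbitrary):

```python
# A rough sketch: how many bits would the GPT-2 + Shannon's code scheme use?
# Only the code *lengths* are computed here, not the actual code names.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def compressed_bits(text: str) -> int:
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]      # (num_tokens, vocab_size)
    total = 0
    # token i is predicted from tokens 0..i-1, so its distribution sits at logits[i-1]
    # (the very first token is not counted here)
    for i in range(1, len(ids)):
        q = torch.softmax(logits[i - 1], dim=-1)[ids[i]].item()
        total += math.ceil(math.log2(1 / q))            # Shannon code-name length
    return total

text = "The quick brown fox jumps over the lazy dog."
print(compressed_bits(text), "bits, vs", 8 * len(text), "bits as plain ASCII")
```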

Our compression algorithm isn't very practical: it's very slow. More importantly, both parties have to have GPT-2 stored on their computers, so we only gain some compression if the text is much larger than the size of GPT-2, which is measured in gigabytes.

But here's the punchline. Using more complicated codes, we can get rid of the annoying rounding $\lceil \cdot \rceil$ and encode the input text with length very close to

$$\sum_{i} \log \frac{1}{q_{LLM}(t_i)}.$$

This is exactly the cross-entropy loss used to train neural nets, as we discussed earlier. We are now getting an equivalent description of what neural nets are being trained for: LLMs are trained to compress the internet as much as possible!

I think this is a powerful intuition that can help us understand where the knowledge / intelligence of LLMs comes from. If you want to compress text efficiently, you have to learn the patterns in it, so that you can recreate the same text just by storing a few hints about what it was about. To be good at compression, you have to learn that $\textsf{th}$ is often followed by $\textsf{e}$, that texts about $\textsf{Paris}$ often involve $\textsf{France}$, and what kinds of talking points a virtue ethicist might use to critique consequentialist ethics.

This complements our earlier intuition that being good at the text-prediction game requires knowledge. Indeed, prediction and compression are two sides of the same coin!

[Riddle: Chinese room]

What's next?

You are done, congrats! If you liked the mini-course, continue to the Graveyard, which contains bonus chapters that I dropped from the main flow. Check out the resources, too. I will be grateful if you leave feedback here.