Coding Theory

In this final chapter, we will see how entropy connects to compression and what that tells us about large language models.

Story time
💡 If there is one thing you remember from this chapter...

Entropy is the unbeatable lower bound for lossless compression, if we treat letters independently.

Code intro

Say we have a long DNA string built from the letters $\{\mathsf{A},\mathsf{C},\mathsf{G},\mathsf{T}\}$. We want to store it using as little disk space as possible. Here's a simple plan: we assign a binary code name to each letter and encode the string using that code.

For example, we can use $\mathsf{A} \rightarrow \textsf{00}, \mathsf{C} \rightarrow \textsf{01}, \mathsf{G} \rightarrow \textsf{10}, \mathsf{T} \rightarrow \textsf{11}$. The string gets stored using just 2 bits per letter. Done!
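Here's a minimal sketch of this plan in Python (the particular 2-bit code names above are just one possible choice):

```python
# A minimal sketch of the naive plan: a fixed 2-bit code name per letter.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}

def encode(dna: str) -> str:
    return "".join(CODE[ch] for ch in dna)

print(encode("AACATACG"))  # '0000010011000110' -- 16 bits for 8 letters
```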

Can we do better? Sometimes, yes! Here's a riddle: In the following widget, try to build a code that encodes the following string with just 1.75 bits per letter on average.

While playing, notice that whenever you give a short code name to a letter, e.g. $\mathsf{A} \rightarrow \textsf{0}$, no other letter can get a code name starting with that same $\textsf{0}$. The reason is that otherwise, the resulting code sometimes wouldn't be decodable. E.g., if you use the code names $\mathsf{A} \rightarrow \textsf{0}$ and $\mathsf{C} \rightarrow \textsf{00}$, the message "$\textsf{00}$" has two meanings.

[Interactive widget: drag the letters A, C, G, T onto the nodes of a 4-layer binary tree to build a code for the text AACATACG.]
SPOILER

Here's the solution: $\mathsf{A} \rightarrow \textsf{0}, \mathsf{C} \rightarrow \textsf{10}, \mathsf{T} \rightarrow \textsf{110}, \mathsf{G} \rightarrow \textsf{111}$. Since the letter frequencies in the text are $\frac12, \frac14, \frac18, \frac18$, this encoding only uses

$$\frac12 \cdot 1 + \frac14 \cdot 2 + \frac18 \cdot 3 + \frac18 \cdot 3 = 1.75$$

bits per letter.

OK, using shorter code names for more frequent letters makes sense, but what's the best way to do this? Coding theory has the answer. Roughly speaking, it tells us to try to give a letter with frequency $p_i$ a code name of length $\log 1/p_i$.

For example, looking at the computation above, you can rewrite its left-hand side as:

$$\frac12 \cdot \log 2 + \frac14 \cdot \log 4 + \frac18 \cdot \log 8 + \frac18 \cdot \log 8 = 1.75.$$

Every letter with frequency $p$ got a code name of length exactly $\log 1/p$! Notice that for a general string of length $N$, the total length of an encoding satisfying the $p \rightarrow \log 1/p$ rule is this:

$$N \cdot \sum_{i} p_i \log \frac{1}{p_i} = N \cdot H(p).$$

That is, if you manage to construct a code following the $p \rightarrow \log 1/p$ rule, you spend $H(p)$ bits per letter on average.
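As a quick sanity check, here's the entropy of the riddle's letter frequencies computed in Python; it matches the 1.75 bits per letter we found by hand:

```python
import math

# Entropy of the letter frequencies from the riddle's text "AACATACG".
freqs = [1/2, 1/4, 1/8, 1/8]
H = sum(p * math.log2(1 / p) for p in freqs)
print(H)  # 1.75
```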

If we implement our $p \rightarrow \log 1/p$ rule of thumb with the most direct algorithm possible, we get what's known as Shannon's code. This code sorts the letter frequencies from the largest one down and assigns to each letter of frequency $p$ a code name of length $\lceil \log \frac{1}{p} \rceil$ (we round up because we can't have a code name with 1.5 bits). Here it is for the English alphabet:
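If you're curious how the code names themselves can be built, here's a minimal Python sketch of one standard construction (each letter's code name is the first $\lceil \log 1/p \rceil$ bits of the binary expansion of the cumulative frequency of the letters before it); the exact construction isn't important for the rest of the chapter:

```python
import math

def shannon_code(freqs):
    """Sort letters by decreasing frequency; give the letter with frequency p
    a code name of length ceil(log2(1/p)), taken from the binary expansion
    of the cumulative frequency of the letters that came before it."""
    items = sorted(freqs.items(), key=lambda kv: -kv[1])
    code, cumulative = {}, 0.0
    for letter, p in items:
        length = math.ceil(math.log2(1 / p))
        bits, frac = [], cumulative
        for _ in range(length):
            frac *= 2
            bit, frac = divmod(frac, 1)   # peel off one binary digit
            bits.append(str(int(bit)))
        code[letter] = "".join(bits)
        cumulative += p
    return code

# The riddle's DNA example: frequencies from the string "AACATACG".
print(shannon_code({"A": 0.5, "C": 0.25, "T": 0.125, "G": 0.125}))
# -> {'A': '0', 'C': '10', 'T': '110', 'G': '111'}
```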

[Interactive widget: Shannon Code Constructor. Drag the bars to adjust letter probabilities. English letter frequencies used: E 12.7%, T 9.1%, A 8.2%, O 7.5%, I 7.0%, N 6.8%, S 6.3%, H 6.1%, R 6.0%, D 4.3%, L 4.0%, C 2.8%, U 2.8%, M 2.4%, W 2.4%, F 2.2%, G 2.0%, Y 2.0%, P 1.9%, B 1.3%, V 1.0%, K 0.8%, J 0.2%, X 0.2%, Q 0.1%, Z 0.1%.]

You can notice that the number of bits per letter (also called the rate) is a bit north of the entropy of the underlying distribution. This is because of the rounding $\log 1/p \rightarrow \lceil \log 1/p \rceil$. For example, if there is a letter with frequency $p = 0.49$, we would like to give it a code name of length $\log \frac{1}{0.49} \approx 1.03$. Too bad that code-name lengths have to be integers, so Shannon's code assigns it a code name of length $2$. As a general rule, Shannon's code can use up to 1 bit per letter more than the entropy. 1

Unfortunately, there are situations where Shannon's code really is almost 1 bit worse than the entropy. Take a string that uses just two characters $\mathsf{A}, \mathsf{B}$, but the frequency of $\mathsf{B}$ is close to zero. Then the entropy is close to zero, but Shannon's code can't do better than 1 bit per letter.
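Here's a small computation illustrating that gap (the 99%/1% split is just an example distribution):

```python
import math

# Entropy vs. Shannon-code rate for a two-letter alphabet where B is rare.
p = [0.99, 0.01]                                            # frequencies of A and B
entropy = sum(q * math.log2(1 / q) for q in p)              # ~0.08 bits per letter
shannon = sum(q * math.ceil(math.log2(1 / q)) for q in p)   # 0.99*1 + 0.01*7 = 1.06 bits per letter
print(entropy, shannon)
```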

The source coding theorem

The source coding theorem is the most important theorem in coding theory. It basically says that on one hand, there are codes that get very close to the entropy bound. On the other hand, you can't beat that bound.

Here's the setup of the theorem more precisely. We model texts like this: there is a source that has a distribution $p$ over letters. This source keeps sampling letters independently and sends them to us. This goes on for a long time, say $n$ steps; our goal is to save a binary string to memory so that later on, we can recover from that string which $n$ letters the source sent.

This setup, with the source sampling independent letters, models that we only want to think about the problem of giving code names to letters based on their frequencies. We are not trying to use correlations like '$\mathsf{th}$ is often followed by $\mathsf{e}$' to get better compression.

The source coding theorem says that, first, there's a way to store the emitted string using close to $H(p)$ bits per letter. We have come close to that bound above: Shannon's code is only 1 bit worse than the entropy. But with more complicated codes, we can even get rid of this 1-bit slack (at the expense that the code is no longer as simple as mapping single letters to code names).

[Expand box: More on better codes]

The second part of the theorem says that we can't beat the rate $H(p)$. Details are in the expand box below.

[Expand box: Beating entropy is impossible]

Prediction = Compression

The fact that good codes can achieve a rate close to the entropy gives us a lot of intuition!

Let's go through the terms in our dictionary and see how we can interpret them using the coding-theory language.

We defined the surprisal $\log 1/p$ as a way of measuring how surprised we are by an outcome that happens with probability $p$. But equivalently, it's also how many bits you need to store that outcome in memory, using the optimal code.

We introduced entropy as the "average surprisal" of a distribution. But equivalently, it's the rate of how many bits on average we need to store outcomes sampled from a distribution $p$: we can interpret the formula $\sum_{i = 1}^n p_i \log 1/p_i$ as "the $i$-th symbol comes with frequency $p_i$, and whenever it comes, we need $\log 1/p_i$ bits to store it".

The cross-entropy is very similar: it's how many bits we need when the data comes from $p$ but we store it using the code optimized for a different distribution $q$. Finally, KL divergence (relative entropy) is how much worse our mismatched code is compared to the optimal one.

Here it is again, more concisely in a table:

| Concept | Formula | Bayesian detective | Compression |
|---|---|---|---|
| Surprisal | $\log 1/p$ | Surprise of an outcome | Bits to store a letter |
| Entropy | $H(p) = \sum_i p_i \log 1/p_i$ | Expected surprisal | Average bits per letter |
| Cross-entropy | $H(p, q) = \sum_i p_i \log 1/q_i$ | Expected surprisal under model $q$ | Average bits per letter for a code optimized for $q$ |
| KL divergence | $D(p \,\Vert\, q) = \sum_i p_i \log \frac{p_i}{q_i}$ | Additional surprise from $p \neq q$ | Additional storage from $p \neq q$ |
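Here's a small sketch computing all four quantities for the riddle's distribution $p$ and a mismatched uniform model $q$ (the uniform $q$ is just an example of a bad model):

```python
import math

# Surprisal, entropy, cross-entropy, and KL divergence, all read as "bits of storage".
# p is the true letter distribution, q the (mismatched) model our code is optimized for.
p = {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}
q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # code optimized for uniform letters

surprisal = {x: math.log2(1 / p[x]) for x in p}              # bits to store each letter
entropy = sum(p[x] * math.log2(1 / p[x]) for x in p)         # 1.75 bits per letter
cross_entropy = sum(p[x] * math.log2(1 / q[x]) for x in p)   # 2.0 bits per letter
kl = cross_entropy - entropy                                 # 0.25 extra bits per letter

print(surprisal, entropy, cross_entropy, kl)
```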

LLM compression

So far, all our discussion has been about encoding each letter independently of its context. The real fun begins when we want to compress a given text as much as possible, period. Now we definitely want to use the fact that $\textsf{th}$ is usually followed by $\textsf{e}$ as much as possible.

I want to explain a hypothetical compression algorithm based on an LLM.

First, a technicality we have to go through quickly. Although we discussed that LLMs predict the next letter, that's a simplification. Before training any LLM, we split all texts into tokens, and the inputs and outputs of LLMs consist of tokens, not letters. There is no fixed definition of what a token is, but for example GPT-2 works with around $50\,000$ different tokens. That's enough to cover all frequent syllables, medium-length words, common names, and special symbols.
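If you want to see tokens in action, here's a tiny sketch (assuming the Hugging Face transformers package is installed):

```python
# A quick look at GPT-2's tokenizer.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(tokenizer.vocab_size)   # roughly 50k tokens
print(tokenizer.tokenize("Texts about Paris often involve France."))
# common words come out as single tokens, rarer words get split into pieces
```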

LLMs can be used to compress text. Here's how: we will assume that both the compression and the decompression algorithm have access to the same LLM, say GPT-2. The compression algorithm runs GPT-2 on the input text; for each position $i$ in the text, GPT-2 generates a probability distribution $q_{LLM}(t_i)$ over the $i$-th token, based on the previous tokens $t_1, \dots, t_{i-1}$. We can use a code like Shannon's code to turn this distribution into a code assignment. For example, if GPT-2 predicts that $\textsf{the}$ has a 50% probability of coming next, Shannon's code would assign the code name $\textsf{0}$ to $\textsf{the}$. We encode the actual next token $t_i$ using this code.

That is, our compression algorithm uses a different code for each token, each code corresponding to the distribution GPT-2 predicts for that token. The decoding algorithm works analogously: for each position, it uses GPT-2 and Shannon's code to reconstruct the code and read off the next token.

The overall length of the encoded text is $\sum_i \lceil \log 1/q_{LLM}(t_i) \rceil$ bits if we use Shannon's code. In the next widget, you can see how it works.

[Interactive widget: GPT-2 compression visualization.]
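If you'd like to try this at home, here's a rough, unoptimized sketch that only computes how many bits the scheme would use (it assumes the Hugging Face transformers and torch packages, and builds the code lengths $\lceil \log 1/q_{LLM}(t_i) \rceil$ rather than the actual code names; the sample sentence is arbitrary):

```python
# A rough sketch: how many bits would the GPT-2 + Shannon's code scheme use?
# Only the code *lengths* are computed here, not the actual code names.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def compressed_bits(text: str) -> int:
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]      # (num_tokens, vocab_size)
    total = 0
    # token i is predicted from tokens 0..i-1, so its distribution sits at logits[i-1]
    # (the very first token is not counted here)
    for i in range(1, len(ids)):
        q = torch.softmax(logits[i - 1], dim=-1)[ids[i]].item()
        total += math.ceil(math.log2(1 / q))            # Shannon code-name length
    return total

text = "The quick brown fox jumps over the lazy dog."
print(compressed_bits(text), "bits, vs", 8 * len(text), "bits as plain ASCII")
```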

Our compression algorithm isn't very practical: it's very slow. More importantly, both parties have to have GPT-2 stored on their computers, so we only gain some compression if the text is much larger than the size of GPT-2, which is measured in gigabytes.

But here's the punchline. Using more complicated codes, we can get rid of the annoying rounding $\lceil \cdot \rceil$ and encode the input text with length very close to

$$\sum_{i} \log \frac{1}{q_{LLM}(t_i)}.$$

This is exactly the cross-entropy loss used to train neural nets, as we discussed earlier. We are now getting an equivalent description of what neural nets are being trained for: LLMs are trained to compress the internet as much as possible!

I think this is a powerful intuition that can help us understand where the knowledge / intelligence of LLMs comes from. If you want to compress text efficiently, you have to learn the patterns in it, so that you can recreate the same text just by storing a few hints about what it was about. To be good at compression, you have to learn that $\textsf{th}$ is often followed by $\textsf{e}$, that texts about $\textsf{Paris}$ often involve $\textsf{France}$, and what kinds of talking points a virtue ethicist might use to critique consequentialist ethics.

This complements our earlier intuition that being good at the text-prediction game requires knowledge. Indeed, prediction and compression are two sides of the same coin!

[Riddle: Chinese room]

What's next?

You are done, congrats! If you liked the mini-course, continue to the Graveyard, which contains bonus chapters that I dropped from the main flow. Check out the resources, too. I will be grateful if you leave feedback here.