Kolmogorov Complexity

In the previous chapter, we saw what coding theory tells us about compressing strings. If we treat the letters of the text independently, coding theory is especially powerful. But that's an assumption good compression algorithms can't afford to make.

Kolmogorov complexity is a different approach to the same compression question. It's the ultimate limit on how much a file can be compressed. This makes it a very powerful theoretical concept to have in our vocabulary!

💡If there is one thing you remember from this chapter...

Kolmogorov complexity of an object is the length of its shortest specification. This is the ultimate limit of compression.

The Mandelbrot Set

Take a good look at the following picture.1 It shows the Mandelbrot set—we color each pixel of the plane based on how quickly a certain sequence of numbers shoots to infinity.

If you take a screenshot of this picture and save it on your disk, it will take a few MB. But what if you instead save the instruction for how to create this image? All relevant ingredients are stored in the two boxes below the picture—the formula used to create it, and a bunch of numbers like the coordinates you are looking at. We are talking about less than a kilobyte of memory now. 2

↓ Kolmogorov complexity of the picture ↓
Mandelbrot Set Formula
z₀ = 0, zₙ₊₁ = zₙ² + c
c ∈ M if |zₙ| remains bounded
Max iterations: 50
Viewport Coordinates
Left: -2.50
Right: 1.50
Top: -1.50
Bottom: 1.50
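
To make this concrete, here is a minimal Python sketch (my own code, not the widget's actual source, and only a few hundred bytes of it) that regenerates the whole picture from exactly the data in the two boxes: the formula, the iteration cap, and the viewport.

```python
# Rebuild the image from "the instructions" rather than storing the pixels.
def mandelbrot_image(width=800, height=600, max_iter=50,
                     left=-2.5, right=1.5, top=-1.5, bottom=1.5):
    """Return a grid of escape counts; a renderer then maps counts to colors."""
    pixels = []
    for row in range(height):
        line = []
        for col in range(width):
            # Map the pixel to a point c of the complex plane (the viewport box).
            c = complex(left + (right - left) * col / (width - 1),
                        top + (bottom - top) * row / (height - 1))
            z, count = 0j, 0
            # Iterate z <- z^2 + c until |z| escapes or we hit max_iter.
            while abs(z) <= 2 and count < max_iter:
                z = z * z + c
                count += 1
            line.append(count)
        pixels.append(line)
    return pixels
```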

This is the gist of Kolmogorov complexity. For any given object—represented as a binary string—its Kolmogorov complexity is the length of the shortest program that prints that string. This is the limit on how much a file can be compressed.

Here are a few more examples.

  • Although the digits of $\pi$ have many random-like properties, the Kolmogorov complexity of its first million digits is extremely small. That's because there are some extremely short programs printing them.
  • Larger numbers (written down in binary) typically have larger Kolmogorov complexity. But there are huge numbers, like $2^{2^{100}}$, with very small Kolmogorov complexity.
  • Whenever you can zip a file to a size of 100 MB, you can say that "the Kolmogorov complexity of the file is at most 100 MB" (see the sketch after this list).
  • If you flip a fair coin $n$ times, the resulting sequence is likely to have Kolmogorov complexity of about $n$. There's no good way of compressing it.
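
To make the zipping bullet concrete, here is a small Python sketch (my own illustration; it ignores the constant-size overhead of the decompressor): any compressed size you actually achieve is an upper bound on Kolmogorov complexity, never a lower bound.

```python
import os
import zlib

def compressed_size_bits(data: bytes) -> int:
    # The length of one particular compressed encoding is an upper bound
    # on the Kolmogorov complexity of the data (up to the decompressor).
    return 8 * len(zlib.compress(data, 9))

structured = b"01" * 500_000        # a very regular 1 MB string
random_ish = os.urandom(1_000_000)  # 1 MB of (pseudo)random bytes

print(compressed_size_bits(structured))  # tiny: the string is very compressible
print(compressed_size_bits(random_ish))  # close to 8,000,000: barely compressible
```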

And here's one more example, the final riddle we haven't discussed yet: Hutter's challenge asks you to upper bound the Kolmogorov complexity of 1GB of Wikipedia as tightly as possible.

Definition of Kolmogorov complexity

Kolmogorov complexity is a great concept that helps me think clearly about all kinds of things. The unfortunate part is that it takes a bit of time to get accustomed to it, and to understand some subtleties. The two main ones: How do we define it formally, and how do we compute it?

Kolmogorov vs. entropy

Both Kolmogorov complexity and entropy try to measure something very similar: complexity, information, compression limit. Naturally, they are closely connected! It's a bit hard to see this at first glance, though: entropy talks about probabilities and their logarithms, while Kolmogorov complexity talks about computer programs and their lengths.

The connection goes like this: If you have a reasonable distribution $p$ over bit strings and you sample $x \sim p$ from it, then the entropy of the distribution is roughly equal to the expected Kolmogorov complexity of the sample. I.e., $H(p) \approx E_{x \sim p}[K(x)]$.

Example: uniform distribution

Flip a fair coin $n$ times. The result is a uniform distribution over binary strings of length $n$. The entropy of this distribution is $n$ bits. Now let's examine the Kolmogorov complexity of a random $n$-bit string.

On one hand, the Kolmogorov complexity of any $n$-bit string is at most roughly $n$. That's because in any (reasonable) language, you can just print the string:

print('01000...010110')

But can the Kolmogorov complexity of an $n$-bit string be much smaller than $n$? Sure -- notice that our uniform distribution contains many strings with very small Kolmogorov complexity, like the all-zeros string $00\dots0$. However, those highly compressible strings don't really matter on average, and the reason is coding theory.

Let's construct a concrete code that assigns a code name to each $n$-bit string in our distribution. For any string $x$, its code name is the shortest program that prints it.

The average length of this code is exactly $E_{x \sim p}[K(x)]$. But we have seen that any code has average length at least as large as the entropy. Hence, $E_{x \sim p}[K(x)] \ge H(p) = n$. We conclude that $E_{x \sim p}[K(x)] \approx n$.

If you look at the above example, you can notice that we did not really use that the distribution is uniform; the argument works pretty much for any distribution.3 I.e., you can think of the relation between the two as: the entropy is the average Kolmogorov complexity.

The case of a uniform distribution is the most interesting one: it tells us that most $n$-bit strings can't really be compressed, since their Kolmogorov complexity is close to $n$. In fact, $n$-bit strings with Kolmogorov complexity at least $n$ are called Kolmogorov random, to highlight the insight that bits are random if we can't compress them.
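
A standard counting argument (not spelled out in the text, but worth keeping in mind) says the same thing: there are fewer than $2^{n-k}$ programs shorter than $n-k$ bits, so

$$
\#\{\, x \in \{0,1\}^n : K(x) < n-k \,\} < 2^{n-k},
$$

i.e., at most a $2^{-k}$ fraction of all $n$-bit strings can be compressed by more than $k$ bits.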

Solomonoff prior

Let's finish this minicourse where we started—with our Bayesian detective.

Bayes' rule and derived formulas like KL divergence are great for understanding what's happening as we get more and more evidence. Yet, there's a big elephant in the room here—how should the detective set up her prior probabilities?

Solomonoff prior is the right answer to this question. It says:

A hypothesis $h$ should start with probability proportional to $2^{-K(h)}$.

Before we try to justify it, let's see it on an example.

A fishy situation

This time, our Bayesian detective isn't flipping a coin, but she has a black box with a button -- each press gives back one bit (a red or a blue chip), possibly correlated with the past outputs of the box.

Our detective pressed the button ten times and this is what she sees:

R B R R R B R B R B

What now? You can notice that our detective is now in the opposite situation to the one in the first chapter. In that chapter, she started with a fixed prior and we just focused on the evolution of that prior as she was getting evidence.

Now, she has the data and has to come up with some hypotheses and their priors. This is the situation that we find ourselves in much more often than the first-chapter situation. It's also much more dangerous: this is where we can fall into the trap of overfitting (discussed later).

Solomonoff gives the detective the following advice for what to do in our situation: just start with all possible hypotheses you can imagine and give each hypothesis $h$ a prior proportional to $2^{-K(h)}$. Then compute the posterior.

In theory, this is an extremely appealing approach, and pretty much the answer for how to do statistics, if you ask me. In practice, there are two problems with this: First, we can't work with infinitely many hypotheses. Second, we can't compute their Kolmogorov complexity.

Computing priors

This sounds pretty bad but it's not the end of the world. To overcome the first problem, our detective may simply come up with a few plausible hypotheses after she sees the data. Most hypotheses have probabilities close to zero anyway.

Here are four of them:

  • $H_1$: Each dot is 50/50 red or blue, independently
  • $H_2$: Each dot is independently red with 60% probability
  • $H_3$: The sequence alternates red and blue—though each time there's a 10% probability of the other color
  • $H_4$: The sequence alternates blocks of length five that are either "mostly red" (2/3 red probability) or "mostly blue" (2/3 blue probability)

The second problem is that we can't compute Kolmogorov complexity; and it's not even well-defined unless we fix some programming language. That's also not necessarily that bad. We can often get some reasonable approximation. Let me try to do it here:

  • $H_1$: I give this one zero bits of penalty, i.e., $K(H_1) = 0$. This looks weird since all programs that generate this hypothesis as their output have to spend at least a few bits to do stuff like "print(0.5)". But this kind of constant overhead is the same for all four hypotheses, and adding a constant to all log-priors does not change anything, so we might as well forget this and give 0 penalty to the simplest hypothesis—this one.
  • $H_2$: The difference between this and the previous hypothesis is that now we have to specify a number—0.6. If you asked me to start listing different numbers between 0 and 1 in order of how "simple" they are, I would probably list 0.6 somewhere around 20-50th place, so let me give this hypothesis a penalty of 5 bits.
  • $H_3$: This hypothesis is more complicated than treating each dot independently, but we still have only two types of dots. I give this 3 bits of penalty since I can think of a bunch of different mechanisms roughly as simple as this one, even if this one is among the simplest. The number 0.9 seems roughly as "simple" as 0.6 to me, so I give it 5 bits. In total: 8 bits.
  • $H_4$: This one seems the most complicated. I give 3 bits for the alternations, and an additional 3-4 bits for the blocks having length 5. Finally, the number 2/3 is actually pretty simple, so a couple of bits for that. In total: 9 bits.

The following widget shows how these hypotheses fare on the data: 4

Observed sequence: R B R R R B R B R B

Posterior (log-prior + log-likelihood = total, in bits):

  • Uniform ($H_1$): 0.00 + (-10.00) = -10.00
  • Biased ($H_2$): -5.00 + (-9.71) = -14.71
  • Alternating ($H_3$): -8.00 + (-4.69) = -12.69
  • Half-half ($H_4$): -9.00 + (-8.85) = -17.85
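
For the record, here is a short Python sketch (my own reading of the four models and of the bit penalties above, not the widget's actual code) that reproduces these scores:

```python
import math

seq = "RBRRRBRBRB"  # the observed sequence

def loglik_uniform(s):
    # H1: each dot is 50/50 red or blue, independently.
    return len(s) * math.log2(0.5)

def loglik_biased(s, p_red=0.6):
    # H2: each dot is independently red with 60% probability.
    return sum(math.log2(p_red if c == "R" else 1 - p_red) for c in s)

def loglik_alternating(s, p_flip=0.1):
    # H3: ideal pattern RBRB... (assumed to start with R); each dot is
    # flipped to the other color with probability 10%.
    pattern = "RB" * len(s)
    return sum(math.log2(p_flip if c != pattern[i] else 1 - p_flip)
               for i, c in enumerate(s))

def loglik_half_half(s, p_major=2 / 3):
    # H4: blocks of five, assumed to start with a "mostly red" block,
    # then alternating "mostly blue" / "mostly red".
    total = 0.0
    for i, c in enumerate(s):
        majority = "R" if (i // 5) % 2 == 0 else "B"
        total += math.log2(p_major if c == majority else 1 - p_major)
    return total

# (prior penalty in bits, log-likelihood in bits) for each hypothesis
scores = {
    "Uniform":     (0, loglik_uniform(seq)),
    "Biased":      (5, loglik_biased(seq)),
    "Alternating": (8, loglik_alternating(seq)),
    "Half-half":   (9, loglik_half_half(seq)),
}

for name, (penalty, ll) in scores.items():
    print(f"{name:12s} {-penalty:6.2f} + {ll:6.2f} = {ll - penalty:7.2f} bits")
```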

Some observations:

  • If you compare the $H_1$ and $H_2$ hypotheses, they both have almost the same likelihoods (-10 vs -9.71 bits). Seeing 10 data points is simply not enough to start making meaningful inferences about whether the bias is 50%, or only "close to 50% but not quite".
  • The $H_3$ hypothesis is making very strong predictions. Since it matches the data pretty well, it has accumulated a lot of evidence in its favor. With just 10 data points, the evidence is roughly offset by its complexity. But if we generate a few more data points from either the $H_1$ or the $H_3$ distribution, the $H_3$ hypothesis quickly gets to the top, or to the bottom.
  • The $H_4$ hypothesis looks a lot like overfitting5: we got a modest improvement in evidence at the cost of a large complexity of the hypothesis.

Intuition

If a detective starts with the Solomonoff prior and uses Bayes' theorem to update her beliefs, we say she does Solomonoff induction. Why is this the right approach? There are some more complicated and rigorous justifications,6 but here's how I think about this.

First of all, the formula shouldn't feel so shocking. After all, we spent an entire chapter arguing that if you care about some quantity $E$, the natural distributions look like $2^{-\lambda \cdot E}$. So using this prior really represents a belief that the complexity of the hypothesis is the important factor.

But here's a better justification that is more down to earth.

Let's think of an experiment where our Bayesian detective draws $n$ samples from an unknown distribution $p$; we have different models $q_1, \dots, q_k$ to choose from.

Remember: under the Solomonoff prior, a hypothesis $q$ starts with prior $2^{-K(q)}$, and if the data really come from $p$, the evidence accumulated for $q$ is roughly $2^{-n \cdot H(p, q)}$. So, choosing the most likely hypothesis means choosing the $q$ that minimizes $K(q) + n \cdot H(p, q)$.

But we know what that formula means! $K(q)$ is how many bits of memory we need to store the model $q$, and $n \cdot H(p, q)$ is how many bits of memory we need to store the $n$ samples from $p$, once we know $q$. 7

That is, using the Solomonoff prior means that we are simply supposed to prefer the models that compress the data the most. This is also called the Minimum description length principle.
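
To see "prefer the model that compresses the data the most" in action, here is a toy Python sketch (the source and the complexity numbers are made up for illustration): with few samples the simpler model wins, and only with enough data does the better-fitting model overtake it.

```python
import math

def cross_entropy_bits(p_true, q_model):
    # H(p, q) for a biased coin: expected bits per sample when samples from p
    # are encoded with a code built for q.
    return -(p_true * math.log2(q_model)
             + (1 - p_true) * math.log2(1 - q_model))

p = 0.6  # the unknown source: heads with probability 0.6

# Each model: (its probability of heads, an assumed complexity K(q) in bits).
models = {"fair coin": (0.5, 1), "60% coin": (0.6, 6)}

for n in (10, 100, 1000):
    # Total description length K(q) + n * H(p, q), in bits.
    lengths = {name: k + n * cross_entropy_bits(p, q)
               for name, (q, k) in models.items()}
    best = min(lengths, key=lengths.get)
    print(n, {name: round(v, 1) for name, v in lengths.items()}, "->", best)
```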

At last, philosophy is solved.

What's next?

Nothing! If you enjoyed this journey through probability and information theory, let me know. You can also leave feedback. Also, check out the resources page.