Kolmogorov Complexity

In the previous chapter, we saw what entropy tells us about compressing strings. Coding theory is especially powerful if we treat letters independently, but that's an assumption good compression algorithms can't afford to make.

Kolmogorov complexity is the ultimate limit for how much a file can be compressed. This makes it a very powerful theoretical concept to have in our dictionary!

πŸ’‘If there is one thing you remember from this chapter...

The Kolmogorov complexity of an object is the length of its shortest specification. This is the ultimate limit of compression.

The Mandelbrot Set

Take a good look at the following picture.1 It shows the so-called Mandelbrot set -- we color each pixel of the plane based on how quickly a certain sequence of numbers shoots off to infinity.

If you take a screenshot of this picture and save it on your disk, it's going to take a few MB. But what if you instead save the instructions for how to create the image? All the relevant ingredients are stored in the two boxes below the picture - the formula used to create it, and the coordinates of the part of the plane you are looking at. We are talking about less than a kilobyte of memory now. (A short program that does exactly this is sketched below the boxes.)

[Interactive Mandelbrot viewer: drag to pan, scroll to zoom]

↓ Kolmogorov complexity of the picture ↓

Mandelbrot Set Formula: $z_0 = 0$, $z_{n+1} = z_n^2 + c$; the point $c$ belongs to the set $M$ if $|z_n|$ remains bounded. Max iterations: 50.

Viewport Coordinates: Left: -2.50, Right: 1.50, Top: -1.50, Bottom: 1.50.
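
To make this concrete, here is a minimal sketch of such a program. It only assumes the formula and the viewport from the boxes above; the resolution, the color scheme, the use of the Pillow library, and the file name are arbitrary choices of mine.

# Regenerate the picture from its short description (assumes Pillow: pip install pillow).
from PIL import Image

WIDTH, HEIGHT, MAX_ITER = 800, 600, 50
LEFT, RIGHT, TOP, BOTTOM = -2.50, 1.50, -1.50, 1.50

img = Image.new("RGB", (WIDTH, HEIGHT))
for px in range(WIDTH):
    for py in range(HEIGHT):
        # map the pixel to a point c of the viewport
        c = complex(LEFT + (RIGHT - LEFT) * px / WIDTH,
                    TOP + (BOTTOM - TOP) * py / HEIGHT)
        z, n = 0, 0
        while abs(z) <= 2 and n < MAX_ITER:
            z = z * z + c          # z_{n+1} = z_n^2 + c
            n += 1
        # color by how quickly the sequence escapes to infinity
        shade = 255 * n // MAX_ITER
        img.putpixel((px, py), (shade, shade, 255 - shade))

img.save("mandelbrot.png")

The script is a few hundred bytes, yet it can regenerate the picture at whatever resolution you like.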

This is the gist of Kolmogorov complexity. For any given object - say, represented as a binary string - its Kolmogorov complexity is the length of the shortest program that prints that string.

Here are a few more examples.

  • Although the digits of $\pi$ have many random-like properties, the Kolmogorov complexity of its first million digits is extremely small. That's because there are extremely short programs printing them.
  • Larger numbers (written down in binary) typically have larger Kolmogorov complexity. But there are huge numbers, like $2^{2^{100}}$, with very small Kolmogorov complexity.
  • Whenever you can ZIP a file down to 100MB, you can say that the Kolmogorov complexity of the file is at most 100MB (plus the small, fixed size of the unzipping program).
  • Hutter's challenge from the coding theory chapter is about estimating the Kolmogorov complexity of 1GB of Wikipedia.
  • If you keep flipping a coin $n$ times, the resulting sequence is likely to have Kolmogorov complexity of about $n$. There's no good way of compressing it.

Choosing the language

There is an awkward problem with the definition of Kolmogorov complexity. It's the length of the shortest program -- but in what programming language? Python? C? Assembly? A Turing machine? Do we allow external libraries? Printing a million digits of $\pi$ could then reduce to this:

import sympy
# evaluate Ο€ to one million significant digits and print it
print(sympy.N(sympy.pi, 1000000))

The important insight is that, at least if we stay on the theoretical side of things, the choice does not matter that much. The trick is that in any (Turing-complete) programming language, we can build an interpreter of any other programming language. An interpreter is a piece of code that reads code written in another language and executes its instructions.

In any reasonable programming language, you can write an interpreter for any other reasonable language in at most, say, 1MB. But this means that the Kolmogorov complexity of any object is fixed 'up to 1MB': if you have a 100MB Python script that prints the file, then you also have a 101MB C, Assembly, Java, ... script printing the same file - just write code whose first 1MB is an interpreter of Python tasked with executing the remaining 100MB.
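
Written as a formula, this is the usual invariance argument; the notation $K_L(x)$ for "Kolmogorov complexity of $x$ measured in language $L$" is mine, but the content is exactly the interpreter trick above:

$$K_{L_1}(x) \le K_{L_2}(x) + c_{L_1, L_2}$$

where the constant $c_{L_1, L_2}$ does not depend on $x$ - it is roughly the length of an interpreter of $L_2$ written in $L_1$.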

So for large objects (like the 1GB Wikipedia file from Hutter's prize), there's nothing awkward about using Kolmogorov complexity. The flip side is that it's pretty meaningless to argue whether the Kolmogorov complexity of $\pi$ is 200 or 300 bytes - that difference depends too much on the choice of programming language.

Kolmogorov vs entropy: $E[K(x)] \approx H(X)$

Both Kolmogorov complexity and entropy are trying to measure something very similar: complexity, information, the limit of compression. Naturally, they are closely connected. The connection goes like this: if you have a reasonable distribution $X$ over bit strings and you sample $x$ from it, then the entropy of the distribution is roughly equal to the expected Kolmogorov complexity of the sample. I.e., $E[K(x)] \approx H(X)$.
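
For the record, a more precise version of this statement exists in the literature (e.g., in Li & Vitányi's textbook); it is not needed for what follows, but it bounds the gap by the complexity of the distribution itself, assuming the distribution $p$ is computable and $K$ is the prefix-free variant:

$$H(X) \le E[K(x)] \le H(X) + K(p) + O(1)$$

where $K(p)$ is the length of the shortest program computing $p$; for simple distributions like "$n$ fair coin flips", this extra term is negligible compared to $H(X)$.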

Example: uniform distribution

Let's see why this is true with an example. Take a sequence of $n$ fair coin flips. This is a uniform distribution over the $2^n$ binary strings of length $n$. The entropy of this distribution is $n$. Now let's examine the Kolmogorov complexity.

On one hand, the Kolmogorov complexity of any $n$-bit string is at most roughly $n$ - say, $n + C$ for some small constant $C$.

That's because in any (reasonable) language, you can just print the string:

print('01000...010110')

But can the average Kolmogorov complexity be much smaller than $n$? Notice that our uniform distribution contains many strings with very small Kolmogorov complexity, like the string $000\dots0$. The reason why those special snowflakes don't really matter on average is coding theory. We can construct a concrete code like this: for any string in our distribution, its code name is the shortest program that prints it.2 The average length of this code is exactly $E[K(x)]$. But we have seen that any code has average length at least as large as the entropy. Hence, $E[K(x)] \ge H(X) = n$, and combining with the upper bound above we get the formula $E[K(x)] \approx n = H(X)$.

If you look at the above proof sketch, you may notice that we did not really use that the distribution is uniform; it works pretty much for any distribution.3

However, the case of a uniform distribution is the most interesting: it tells us that most $n$-bit strings can't really be compressed, since their Kolmogorov complexity is close to $n$! In fact, $n$-bit strings with Kolmogorov complexity $\ge n$ are called Kolmogorov random, to highlight the insight that bits are random if we can't compress them.
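
You can see a shadow of this claim empirically by letting an off-the-shelf compressor stand in for the (unknowable) shortest program; zlib and the 10,000-byte length below are my choices, not part of the argument.

import os
import zlib

# A typical random string barely compresses at all...
random_bytes = os.urandom(10_000)
print(len(zlib.compress(random_bytes)))  # roughly 10,000 bytes, often a bit more

# ...while a highly structured string of the same length compresses enormously.
zeros = bytes(10_000)
print(len(zlib.compress(zeros)))         # a few dozen bytes

Of course, zlib only gives an upper bound on Kolmogorov complexity; the point is that for almost all strings even the true optimum is barely below $n$.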

Long runs

Solomonoff prior & induction

What's the most natural distribution over all (finite) binary strings, if what we care about is their Kolmogorov complexity? The maximum entropy principle says we should consider the distribution $p(x) \propto e^{-\lambda K(x)}$, for some $\lambda$. In a minute, we will see that the right $\lambda$ is such that $p(x) \propto 2^{-K(x)}$.4 This distribution is called the Solomonoff prior.5
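
Spelled out with a normalizing constant (my notation; the sum converges when $K$ is the prefix-free variant of Kolmogorov complexity, thanks to Kraft's inequality):

$$p(x) = \frac{2^{-K(x)}}{Z}, \qquad Z = \sum_{y \in \{0,1\}^*} 2^{-K(y)} \le 1$$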

It's called a prior because this is how we typically want to use this distribution. Similarly to other maximum entropy distributions, it is where we should start before we make observations and update it. I find it philosophically very appealing; feel free to check the expand boxes.

⚠️ Solomonoff induction
⚠️ Solving epistemology
⚠️ 🐘 Von Neumann's elephant & overfitting
⚠️ AIC & BIC model selection
⚠️ Uncomputability
⚠️ Clustering by relative Kolmogorov complexity
⚠️ What the hell is randomness?