Entropy properties

In this chapter, we'll go over the fundamental properties of KL-divergence, entropy, and cross-entropy. We've already seen most of these properties, but let's dive deeper to understand them better.

This chapter contains a few exercises. I encourage you to try them to check your understanding.

💡If there is one thing you remember from this chapter...

KL divergence $D(p,q)$ is always nonnegative. Equivalently, $H(p, q) \ge H(p)$.

KL divergence can blow up

Recall that KL divergence is algebraically defined like this:

$$D(p, q) = \sum_i p_i \log \frac{p_i}{q_i}$$

Here's the key difference between KL and more standard geometric distance measures like the $\ell_1$ norm ($\sum_i |p_i - q_i|$) or the $\ell_2$ norm ($\sqrt{\sum_i (p_i - q_i)^2}$). Consider these two possibilities:

  1. $p_i = 0.5$, $q_i = 0.49$
  2. $p_i = 0.01$, $q_i = 0.0$

Regular norms ($\ell_1$, $\ell_2$) treat these errors as roughly equivalent. But KL knows better: the first situation is basically fine, while the second is catastrophic! For example, the letters "God" are rarely followed by "zilla", but any model of language should understand that this may sometimes happen. If $q(\textrm{'zilla'} \mid \textrm{'God'}) = 0.0$, the model will be infinitely surprised when 'Godzilla' appears!
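If you want to play with these numbers outside the widget, here's a minimal Python sketch (the `kl` helper and the two scenarios are my own illustration, not part of the widget) comparing KL with the $\ell_1$ and $\ell_2$ distances:

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p, q) in bits. A term with p_i > 0 and q_i = 0 is infinite;
    terms with p_i = 0 contribute nothing."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log2(p / q), 0.0)
    return terms.sum()

# Scenario 1: the model is off by 0.01 on a likely outcome.
# Scenario 2: the model is off by 0.01 on a rare outcome -- by assigning it zero.
scenarios = [([0.5, 0.5], [0.49, 0.51]),
             ([0.01, 0.99], [0.0, 1.0])]

for p, q in scenarios:
    p, q = np.array(p), np.array(q)
    print(f"KL = {kl(p, q):.4f}   "
          f"L1 = {np.abs(p - q).sum():.4f}   "
          f"L2 = {np.sqrt(((p - q) ** 2).sum()):.4f}")
# The L1/L2 distances are tiny and nearly identical in both scenarios,
# but KL jumps from about 0.0003 to infinity.
```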

Try making KL divergence infinite in the widget below. Next level: try to make it infinite while keeping the $\ell_1$ and $\ell_2$ norms close to zero (say $< 0.1$).

KL Divergence Explorer (interactive widget): drag the bars to adjust the probabilities and watch the KL divergence $D(\text{Blue}, \text{Red})$, the $\ell_1$ (Manhattan) distance, and the $\ell_2$ (Euclidean) distance update.

KL divergence is asymmetrical

The KL formula isn't symmetrical—in general, $D(p,q) \neq D(q,p)$. Sometimes this is described as a disadvantage, especially when comparing KL to simple symmetric distance functions like $\ell_1$ or $\ell_2$. But I want to stress that the asymmetry is a feature, not a bug! KL measures how well a model $q$ fits the true distribution $p$. That's inherently asymmetrical, so we need an asymmetrical formula—and that's perfectly fine.

In fact, that's why people call it a divergence instead of a distance. Divergences are kind of wonky distance measures that are not necessarily symmetric.

Example

Imagine the true probability $p$ is 50%/50% (fair coin), but our model $q$ says 100%/0%. KL divergence is ...

... infinite. That's because there's a 50% chance we gain infinitely many bits of evidence toward $p$ (our posterior jumps to 100% fair, 0% biased).

Now flip it around: truth is 100%/0%, model is 50%/50%. Then

$$D(p, q) = 1 \cdot \log\frac{1}{1/2} + 0 \cdot \log\frac{0}{1/2} = 1 \textrm{ bit}.$$

Every flip gives us heads, so we gain one bit of evidence that the coin is biased. As we keep flipping, our belief in fairness drops exponentially fast, but it never hits zero. We've gotta account for the (exponentially unlikely) possibility that a fair coin just coincidentally came heads in all our past flips.
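Here's the same two-coin story as a quick numerical check (a self-contained sketch; the `kl` helper is the same illustrative one as before):

```python
import numpy as np

def kl(p, q):
    """KL divergence in bits; a term with p_i > 0 and q_i = 0 is infinite."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(p > 0, p * np.log2(p / q), 0.0).sum()

fair   = [0.5, 0.5]   # fair coin
always = [1.0, 0.0]   # "always heads"

print(kl(fair, always))   # inf -- truth is fair, model rules out tails
print(kl(always, fair))   # 1.0 -- truth is "always heads", model is fair: one bit per flip
```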

Here's a question: The following widget contains two distributions—one peaky and one broad. Which KL is larger?

KL Divergence Asymmetry

KL divergence between a broad and a peaky distribution. The blue distribution is the "truth", the red one the "model".

KL is nonnegative

If you plug the same distribution into KL twice, you get

$$D(p, p) = \sum_i p_i \log \frac{p_i}{p_i} = 0,$$

because $\log 1 = 0$. This makes sense—you can't distinguish the truth from itself. 🤷

This is the only case in which KL can equal zero. Otherwise, KL divergence is always positive. This fact is sometimes called Gibbs' inequality. I think we built up a pretty good intuition for this in the first chapter: imagine sampling from $p$ while Bayes' rule increasingly convinces you that you're sampling from some other distribution $q$. That would be really messed up!

This is not a proof though, just an argument that the world with possibly negative KL is not worth living in. Check out the formal proof if you're curious.

Proof of nonnegativity

Since KL can be written as the difference between cross-entropy and entropy, we can equivalently rewrite $D(p, q) \ge 0$ as

$$H(p, q) \ge H(p).$$

That is, the best model of $p$, the one that accumulates surprisal at the lowest possible rate, is ... 🥁 🥁 🥁 ... $p$ itself.
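This is still not a proof either, but here's a quick numerical sanity check (a sketch that samples random distribution pairs with a Dirichlet sampler, my own choice of setup) verifying $H(p, q) \ge H(p)$ on each pair:

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    return -(p * np.log2(p)).sum()

def cross_entropy(p, q):
    return -(p * np.log2(q)).sum()

# Sample many random pairs of distributions over 5 outcomes and check H(p, q) >= H(p).
for _ in range(10_000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    assert cross_entropy(p, q) >= entropy(p) - 1e-12
print("H(p, q) >= H(p) held in every sampled case")
```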

Additivity

Whenever we keep flipping coins, the total entropy/cross-entropy/relative entropy just keeps adding up. This property is called additivity, and it's so natural that it's easy to forget how important it is. We've used it implicitly in earlier chapters, whenever we talked about repeating the flipping experiment and summing surprisals.

More formally: Say you've got a distribution pair $p$ and $q$ (think of $q$ as a model of $p$) and another pair $p'$ and $q'$. Let's use $p \otimes p'$ for the product distribution -- a joint distribution with marginals $p, p'$ where they are independent. In this setup, we have this:

$$H(p \otimes p') = H(p) + H(p'), \quad H(p \otimes p', q \otimes q') = H(p, q) + H(p', q'), \quad D(p \otimes p', q \otimes q') = D(p, q) + D(p', q').$$
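Here's a small sketch (with randomly generated distributions of my own choosing) that checks the KL additivity identity numerically:

```python
import numpy as np

def kl(p, q):
    """KL divergence in bits (all entries assumed positive here)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return (p * np.log2(p / q)).sum()

rng = np.random.default_rng(1)
p, q = rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3))    # first pair: q models p
pp, qq = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))  # second pair: q' models p'

# Product distributions p (x) p' and q (x) q': outer products, flattened into joint distributions.
p_joint = np.outer(p, pp).ravel()
q_joint = np.outer(q, qq).ravel()

print(kl(p_joint, q_joint))    # ...
print(kl(p, q) + kl(pp, qq))   # ... the same number, up to floating-point error
```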

Entropy also has an even stronger property called subadditivity: Imagine any joint distribution $r$ with marginals $p, q$. Then,

$$H(r) \le H(p) + H(q).$$

For example, imagine you flip a coin and record the same outcome twice. Then, the entropy of each record is 1 bit, and subadditivity says that the total entropy is at most 2 bits. In this case, it's actually still just 1 bit.
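The coin-recorded-twice example, spelled out as a short sketch:

```python
import numpy as np

def entropy(p):
    """Entropy in bits; zero-probability outcomes are skipped."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Flip a fair coin once and write the outcome down twice.
# The joint distribution over (record 1, record 2) puts mass only on (H, H) and (T, T).
joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])

marginal_1 = joint.sum(axis=1)   # [0.5, 0.5]
marginal_2 = joint.sum(axis=0)   # [0.5, 0.5]

print(entropy(joint.ravel()))                     # 1.0 bit for the joint distribution
print(entropy(marginal_1) + entropy(marginal_2))  # 2.0 bits -- the subadditivity upper bound
```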

Anthem battle

I collected the national anthems of the USA, UK, and Australia, and put them into one file. The other text file contains anthems of various other countries. For both text files, I compute the frequencies of the 26 letters 'a' to 'z'. So there are two distributions: $p_1$ (English-speaking) and $p_2$ (others). The question is: which one has larger entropy? And which of the two KL divergences $D(p_1, p_2)$, $D(p_2, p_1)$ is larger?

Make your guess before revealing the answer.
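If you want to reproduce this experiment on your own anthem files, here's a rough sketch; the file names `anthems_english.txt` and `anthems_other.txt` are placeholders, not the actual files used above:

```python
from collections import Counter
import math
import string

def letter_distribution(text):
    """Relative frequencies of the letters 'a' to 'z'; everything else is ignored."""
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    total = sum(counts.values())
    return [counts[c] / total for c in string.ascii_lowercase]

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl(p, q):
    total = 0.0
    for x, y in zip(p, q):
        if x > 0:
            if y == 0:
                return math.inf   # the model rules out a letter the truth uses
            total += x * math.log2(x / y)
    return total

# Placeholder file names -- substitute your own anthem collections.
p1 = letter_distribution(open("anthems_english.txt").read())
p2 = letter_distribution(open("anthems_other.txt").read())

print("H(p1) =", entropy(p1), " H(p2) =", entropy(p2))
print("D(p1, p2) =", kl(p1, p2), " D(p2, p1) =", kl(p2, p1))
```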


Next steps

We now understand pretty well what KL divergence and cross-entropy stand for.

We will now ponder what happens if we try to make KL divergence small. This will explain a lot about ML loss functions and include some fun applications of probability to several of our riddles. See you in the next chapter about maximum likelihood!