Entropy properties
In this chapter, we'll go over the fundamental properties of KL-divergence, entropy, and cross-entropy. We've already seen most of these properties, but let's dive deeper to understand them better.
This chapter contains a few exercises. I encourage you to try them to check your understanding.
KL divergence is always nonnegative. Equivalently, $H(p, q) \ge H(p)$.
KL divergence can blow up
Recall that KL divergence is algebraically defined like this:
$$D(p \| q) = \sum_i p_i \log_2 \frac{p_i}{q_i}$$
Here's the key difference between KL and more standard geometric distance measures like the $L_1$ norm ($\sum_i |p_i - q_i|$) or the $L_2$ norm ($\sqrt{\sum_i (p_i - q_i)^2}$). Consider these two possibilities: the model $q$ is slightly off on a likely outcome, or the model assigns (nearly) zero probability to an outcome that actually happens every now and then.
Regular norms ($L_1$, $L_2$) treat these errors as roughly equivalent. But KL knows better: the first situation is basically fine, while the second is catastrophic! For example, the letters "God" are rarely followed by "zilla", but any model of language should understand that this may sometimes happen. If the model sets $q(\textrm{zilla} \mid \textrm{God}) = 0$, it will be infinitely surprised when 'Godzilla' appears!
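If you like code more than widgets, here's a minimal sketch of the same point in Python. The two situations and their numbers are my own illustrative choices, not the widget's:

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in bits; infinite if q is 0 somewhere p isn't."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Situation 1: the model is slightly off on a likely outcome.
p1, q1 = np.array([0.5, 0.5]), np.array([0.45, 0.55])
# Situation 2: the model assigns zero to a rare but possible outcome.
p2, q2 = np.array([0.99, 0.01]), np.array([1.0, 0.0])

for p, q in [(p1, q1), (p2, q2)]:
    print(f"L1 = {np.sum(np.abs(p - q)):.3f},  "
          f"L2 = {np.sqrt(np.sum((p - q) ** 2)):.3f},  "
          f"KL = {kl(p, q):.3f} bits")
```

Both situations look tiny to the $L_1$ and $L_2$ norms, but KL is small in the first and infinite in the second.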
Try making KL divergence infinite in the widget below. Next level: try to make it infinite while keeping both the $L_1$ and $L_2$ norms of $p - q$ close to zero.
KL Divergence Explorer
KL divergence is asymmetrical
The KL formula isn't symmetrical—in general, $D(p \| q) \neq D(q \| p)$. Sometimes this is described as a disadvantage, especially when comparing KL to simple symmetric distance functions like $L_1$ or $L_2$. But I want to stress that the asymmetry is a feature, not a bug! KL measures how well a model $q$ fits the true distribution $p$. That's inherently asymmetrical, so we need an asymmetrical formula—and that's perfectly fine.
In fact, that's why people call it a divergence instead of a distance. Divergences are kind of wonky distance measures that are not necessarily symmetric.
Imagine the true distribution $p$ is 50%/50% (a fair coin), but our model $q$ says 100%/0%. The KL divergence $D(p \| q)$ is ...
... infinite. That's because there's a 50% chance the coin comes up tails, an outcome the model deemed impossible, so we gain infinitely many bits of evidence toward fairness (our posterior jumps to 100% fair, 0% biased).
Now flip it around: truth $p$ is 100%/0%, model $q$ is 50%/50%. Then
$$D(p \| q) = 1 \cdot \log_2 \frac{1}{1/2} + 0 \cdot \log_2 \frac{0}{1/2} = 1 \text{ bit}.$$
Every flip gives us heads, so we gain one bit of evidence that the coin is biased. As we keep flipping, our belief in fairness drops exponentially fast, but it never hits zero. We've gotta account for the (exponentially unlikely) possibility that a fair coin just coincidentally came heads in all our past flips.
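Here are both directions of the coin example in code (a small sketch, with the same kind of kl helper as above, working in bits):

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in bits; infinite if q is 0 somewhere p isn't."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

fair = [0.5, 0.5]          # heads / tails
always_heads = [1.0, 0.0]

print(kl(fair, always_heads))   # truth fair, model 100%/0%   -> inf
print(kl(always_heads, fair))   # truth 100%/0%, model fair   -> 1.0 (bit)
```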
Here's a question: The following widget contains two distributions—one peaky and one broad. Which of the two KL divergences is larger?
KL Divergence Asymmetry
KL is nonnegative
If you plug in the same distribution into KL twice, you get:
$$D(p \| p) = \sum_i p_i \log_2 \frac{p_i}{p_i} = 0,$$
because $\log_2 1 = 0$. This makes sense—you can't distinguish the truth from itself. 🤷
This ($q = p$) is the only occasion on which KL can be equal to zero. Otherwise, KL divergence is always positive. This fact is sometimes called Gibbs' inequality. I think we built up a pretty good intuition for this in the first chapter. Imagine sampling from $p$ while Bayes' rule increasingly convinces you that you're sampling from some other distribution $q$. That would be really messed up!
This is not a proof though, just an argument that the world with possibly negative KL is not worth living in. Check out the formal proof if you're curious.
Since KL can be written as the difference between cross-entropy and entropy, $D(p \| q) = H(p, q) - H(p)$, we can equivalently rewrite Gibbs' inequality as
$$H(p, q) \ge H(p).$$
That is, the best model of $p$ that accumulates the surprisal at the least possible rate is ... 🥁 🥁 🥁 ... $p$ itself.
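A quick numeric sanity check of this inequality (a sketch with random distributions; entropies measured in bits):

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log2(q))

p = rng.dirichlet(np.ones(5))            # a random "true" distribution

# Random models never beat the entropy of p ...
for _ in range(1000):
    q = rng.dirichlet(np.ones(5))
    assert cross_entropy(p, q) >= entropy(p)     # Gibbs' inequality

# ... and the model q = p achieves it exactly.
print(np.isclose(cross_entropy(p, p), entropy(p)))   # True
```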
Additivity
Whenever we keep flipping coins, the total entropy/cross-entropy/relative entropy just keeps adding up. This property is called additivity, and it's so natural that it's easy to forget how important it is. We've used it implicitly in earlier chapters, whenever we talked about repeating the flipping experiment and summing surprisals.
More formally: Say you've got a distribution pair $p_1$ and $q_1$—think of $q_1$ as a model of $p_1$—and another pair $p_2$ and $q_2$. Let's use $p_1 \otimes p_2$ for the product distribution—a joint distribution with marginals $p_1, p_2$ where they are independent. In this setup, we have this:
$$D(p_1 \otimes p_2 \,\|\, q_1 \otimes q_2) = D(p_1 \| q_1) + D(p_2 \| q_2),$$
and analogously for entropy and cross-entropy:
$$H(p_1 \otimes p_2) = H(p_1) + H(p_2), \qquad H(p_1 \otimes p_2,\, q_1 \otimes q_2) = H(p_1, q_1) + H(p_2, q_2).$$
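Here's a quick check of the KL identity in code (a sketch; np.outer builds the product distribution):

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    """KL divergence in bits (all entries assumed positive here)."""
    return np.sum(p * np.log2(p / q))

# Two independent pairs: q1 models p1, q2 models p2.
p1, q1 = rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3))
p2, q2 = rng.dirichlet(np.ones(2)), rng.dirichlet(np.ones(2))

# Product (independent) joint distributions, flattened into vectors.
p12 = np.outer(p1, p2).ravel()
q12 = np.outer(q1, q2).ravel()

print(np.isclose(kl(p12, q12), kl(p1, q1) + kl(p2, q2)))   # True
```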
Entropy also has an even stronger property called subadditivity: Imagine any joint distribution $r$ with marginals $p_1, p_2$ (the two coordinates no longer need to be independent). Then,
$$H(r) \le H(p_1) + H(p_2).$$
For example, imagine you flip a coin and record the same outcome twice. Then, the entropy of each record is 1 bit and subadditivity says that the total entropy is at most $1 + 1 = 2$ bits. In this case, it's actually still just 1 bit.
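The same example in code (a sketch; the joint distribution puts all its weight on "both heads" and "both tails"):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

# Flip a fair coin once and write the result down twice.
# Joint distribution over (record 1, record 2): HH, HT, TH, TT.
joint = [0.5, 0.0, 0.0, 0.5]
marginal = [0.5, 0.5]                 # each record alone looks like a fair coin

print(entropy(joint))                         # 1.0 bit
print(entropy(marginal) + entropy(marginal))  # 2.0 bits, the subadditivity bound
```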
Anthem battle
I collected the national anthems of the USA, UK, and Australia, and put them into one file. The other text file contains the anthems of various other countries. For both text files, I computed the frequencies of the 26 letters 'a' to 'z'. So there are two distributions: $p$ (English-speaking) and $q$ (others). The question is: which one has larger entropy? And which of the two KL divergences, $D(p \| q)$ or $D(q \| p)$, is larger?
Make your guess before revealing the answer.
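If you want to try this on your own text files, here's roughly the computation (a sketch in Python; the file names are hypothetical placeholders, not the files behind the widget):

```python
from collections import Counter
import math
import string

def letter_distribution(path):
    """Frequencies of the 26 letters 'a'-'z' in a text file."""
    text = open(path, encoding="utf-8").read().lower()
    counts = Counter(c for c in text if c in string.ascii_lowercase)
    total = sum(counts.values())
    return [counts[c] / total for c in string.ascii_lowercase]

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl(p, q):
    if any(x > 0 and y == 0 for x, y in zip(p, q)):
        return math.inf
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

# Hypothetical file names -- substitute your own two anthem collections.
p = letter_distribution("anthems_english_speaking.txt")
q = letter_distribution("anthems_other.txt")

print("H(p) =", entropy(p), " H(q) =", entropy(q))
print("D(p||q) =", kl(p, q), " D(q||p) =", kl(q, p))
```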
Next steps
We now understand pretty well what KL divergence and cross-entropy stand for.
We will now ponder what happens if we try to make KL divergence small. This will explain a lot about ML loss functions, and includes some fun applications of probability to several of our riddles. See you in the next chapter about maximum likelihood!