KL properties & (cross-)entropy

In this chapter, we'll see how KL divergence can be split into two pieces called cross-entropy and entropy.

💡If there is one thing you remember from this chapter...

$$D(p \,\|\, q) = H(p, q) - H(p)$$

Cross-entropy measures the average surprisal of model $q$ on data from $p$. When $p = q$, we call it entropy. In words:

KL = cross-entropy - entropy

Quick refresher: In the previous chapter, we saw how KL divergence comes from repeatedly using Bayes' theorem with log-space updating:

[Interactive widget: Bayes Sequence Explorer (Log Space) — step through a coin-flip sequence and watch the prior and posterior log-odds, odds, and probabilities update; the running total surprisal under each hypothesis is shown in green and orange.]

Each step adds surprisals ($\log 1/p$) to track evidence. Last time, we focused on the differences between surprisals to see how much evidence we got for each hypothesis. Our Bayesian detective just keeps adding up these differences.

Alternatively, the detective could add up the total surprisal for each hypothesis (the green and orange numbers in the widget above) and only then compare the two totals. This corresponds to writing KL divergence like this:

$$D(p \,\|\, q) = \underbrace{\sum_{i=1}^n p_i \log \frac{1}{q_i}}_{H(p,q)} - \underbrace{\sum_{i=1}^n p_i \log \frac{1}{p_i}}_{H(p)}$$

These two pieces on the right are super important: $H(p,q)$ is called cross-entropy and $H(p)$ is entropy. Let's build intuition for what they mean.
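To see that the two bookkeeping styles agree, here is a minimal Python sketch (not the widget's code; the flip sequence and the two heads-probabilities are made up for illustration). Summing the per-flip surprisal differences gives exactly the same number as totaling the surprisal under each hypothesis and subtracting at the end.

```python
import math

flips = "HTTHT"   # an arbitrary example sequence of coin flips
p_heads = 0.5     # hypothesis p: the coin is fair
q_heads = 0.25    # hypothesis q: the coin lands heads 25% of the time

def surprisal(prob_heads, flip):
    """Surprisal log2(1/prob) of a single flip under a given heads-probability."""
    prob = prob_heads if flip == "H" else 1 - prob_heads
    return math.log2(1 / prob)

# Bookkeeping style 1: add up the per-flip surprisal differences.
evidence = sum(surprisal(q_heads, f) - surprisal(p_heads, f) for f in flips)

# Bookkeeping style 2: total surprisal under each hypothesis, compared at the end.
total_q = sum(surprisal(q_heads, f) for f in flips)
total_p = sum(surprisal(p_heads, f) for f in flips)

print(evidence)           # accumulated evidence, in bits
print(total_q - total_p)  # the same number, computed the second way
```

Averaged over many flips drawn from $p$, the two totals per flip approach $H(p,q)$ and $H(p)$, so their difference per flip approaches the KL divergence above.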

Cross-entropy

Think of cross-entropy this way: how surprised are you, on average, when seeing data from $p$ while modeling it as $q$?
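In formula form, this is the first sum from the decomposition above (the surprisal of $q$ averaged over outcomes drawn from $p$):

$$H(p, q) = \sum_{i=1}^n p_i \log \frac{1}{q_i}$$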

Explore this in the widget below. The widget shows what happens when our Bayesian detective from the previous chapter keeps flipping her coin. The red dashed line shows cross-entropy—the expected surprisal of the model $q$ as we keep flipping the coin with bias $p$. The orange line shows the entropy, which is the expected surprisal when both the model and the actual bias are $p$. KL divergence is the difference between cross-entropy and entropy. Notice that the cross-entropy line never drops below the entropy line (equivalently, KL divergence is never negative).

If you let the widget run, you will also see a blue and a green curve: the surprisal actually measured by our detective in the flipping simulation. We could also say that these curves measure cross-entropy—it's the cross-entropy between the empirical distribution $\hat{p}$ (the actual outcomes of the flips) and the model $q$ (blue curve) or $p$ (green curve). The empirical cross-entropies track the dashed lines thanks to the law of large numbers.
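Here is a minimal simulation in the spirit of the widget (not its actual code). The bias $p(\textsf{H}) = 0.2$ and model $q(\textsf{H}) = 0.8$ are assumptions chosen to reproduce the $H(p) = 0.722$ and $H(p,q) = 1.922$ bits/flip readouts shown below; by the law of large numbers, the empirical averages should land close to those values.

```python
import math
import random

p_heads = 0.2  # true coin bias (an assumption matching the readout below)
q_heads = 0.8  # the model's belief about the bias

def surprisal(prob_heads, flip):
    """Surprisal log2(1/prob) of one flip under a given heads-probability."""
    prob = prob_heads if flip == "H" else 1 - prob_heads
    return math.log2(1 / prob)

n_flips = 100_000
total_p = total_q = 0.0
for _ in range(n_flips):
    flip = "H" if random.random() < p_heads else "T"
    total_p += surprisal(p_heads, flip)  # green curve: surprisal under p
    total_q += surprisal(q_heads, flip)  # blue curve: surprisal under q

print(f"empirical H(p,q) ~ {total_q / n_flips:.3f} bits/flip  (theory: 1.922)")
print(f"empirical H(p)   ~ {total_p / n_flips:.3f} bits/flip  (theory: 0.722)")
```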

[Interactive widget: Cross-Entropy Simulator — example readout: $H(p) = 0.722$ bits/flip, $H(p,q) = 1.922$ bits/flip.]

Bottom line: Better models are less surprised by the data and have smaller cross-entropy. KL divergence measures how far our model is from the best one.
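Plugging in the example readout from the simulator above:

$$D(p \,\|\, q) = H(p,q) - H(p) = 1.922 - 0.722 = 1.2 \text{ bits per flip}$$

That 1.2 bits per flip is the extra surprisal the detective pays, on average, for modeling the coin as $q$ instead of its true bias $p$.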

Entropy

The term $H(p) = H(p, p) = \sum_{i = 1}^n p_i \log 1 / p_i$ is a special case of cross-entropy called just plain entropy. It's the best possible cross-entropy you can get for distribution $p$—when you model it perfectly as itself.

Intuitively, entropy tells you how much surprisal or uncertainty is baked into $p$. Even if you know you're flipping a fair coin and hence $p = q = \frac12$, you still don't know which way the coin will land. There's inherent uncertainty in that—the outcome still carries surprisal, even if you know the coin's bias. This is what entropy measures.
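For the fair coin, that baked-in uncertainty works out to exactly one bit per flip:

$$H\left(\left\{\textsf{H}: \tfrac12, \textsf{T}: \tfrac12\right\}\right) = \tfrac12 \log_2 2 + \tfrac12 \log_2 2 = 1 \text{ bit}$$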

So the fair coin's entropy is 1 bit. But entropy can get way smaller than 1 bit. If you flip a biased coin where heads are very unlikely—say $p(\textsf{H}) = 0.05$—the entropy of the flip gets close to zero. Makes sense! Sure, if you happen to flip heads, that's super surprising ($\log_2 1/0.05 \approx 4.32$). However, most flips are boringly predictable tails, so the average surprisal ends up well below 1 bit. You can check in the widget below that $H(\{\textsf{H}: 0.05, \textsf{T}: 0.95\}) \approx 0.29$ bits per flip. Entropy hits zero when one outcome has 100% probability.
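Here is that number worked out by hand (per-term surprisals rounded):

$$H(\{\textsf{H}: 0.05, \textsf{T}: 0.95\}) = 0.05 \log_2 \tfrac{1}{0.05} + 0.95 \log_2 \tfrac{1}{0.95} \approx 0.05 \cdot 4.32 + 0.95 \cdot 0.074 \approx 0.29 \text{ bits}$$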

Entropy can also get way bigger than 1 bit. Rolling a die has entropy $\log_2(6) \approx 2.6$ bits. In general, a uniform distribution over $k$ options has entropy $\log_2 k$, which is the maximum entropy possible for $k$ options. Makes sense—you're most surprised on average when the distribution is, in a sense, most uncertain.

[Interactive widget: Entropy of die-rolling — drag the bars to adjust the six outcome probabilities and watch the entropy change. The uniform die gives 2.585 bits, the maximum possible. Preset example: correct horse battery staple.]
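If you want to check these numbers without the widget, here is a short sketch (the helper name `entropy_bits` is made up for this example):

```python
import math

def entropy_bits(probs):
    """Entropy sum_i p_i * log2(1/p_i), in bits; terms with p_i = 0 contribute 0."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))    # fair coin: 1.0 bit
print(entropy_bits([0.05, 0.95]))  # lopsided coin: ~0.29 bits
print(entropy_bits([1 / 6] * 6))   # fair die: ~2.585 bits
print(entropy_bits([1.0, 0.0]))    # certain outcome: 0.0 bits
```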

Relative entropy

KL divergence can be interpreted as the gap between cross-entropy and entropy. It tells you how far your average surprisal (cross-entropy) is from the best possible one (entropy). That's why in some communities, people call KL divergence the relative entropy between $p$ and $q$. 1

What's next?

We're getting the hang of KL divergence, cross-entropy, and entropy! Quick recap:

$$D(p \,\|\, q) = \underbrace{\sum_{i=1}^n p_i \log \frac{1}{q_i}}_{\text{cross-entropy } H(p,q)} \;-\; \underbrace{\sum_{i=1}^n p_i \log \frac{1}{p_i}}_{\text{entropy } H(p)}$$

In the next chapter, we'll go over the key properties of these functions, and then we'll be ready for the cool stuff.