Bayes' rule & KL divergence

In this chapter, we'll discuss Bayes' rule and use it to introduce KL divergence.1

💡If there is one thing you remember from this chapter...

KL divergence measures how well a distribution p is fitted by a model distribution q.

Bayes' rule

Let's review Bayes' rule—it's how you update your beliefs when you get new information. Here's a setup that we will try to understand very deeply in the next few chapters:

Bayesian detective

A detective has a coin that might, or might not, be rigged. To keep it simple, let's say there are only two options: it's either fair (50/50 for heads/tails) or it's biased toward tails (25/75 for heads/tails).

The detective keeps flipping this coin, hoping to learn the truth.

Let's see how Bayes' rule helps our detective. To use the rule, she needs a starting guess—a prior—for the probability of each hypothesis. Let's say that the detective thinks there's a 2/3 chance the coin is fair and 1/3 chance it's biased.

She flips the coin: Heads! This is evidence for the \textsf{fair} hypothesis, since heads are more likely with a fair coin.

Our detective can use Bayes' rule to calculate the new probability—the posterior—that the coin is fair. At this point, let's have a quick refresher on how Bayes' rule works.

The point of the rule is that if you have two events A, B, then you can write down the probability that both A and B happen in two ways, using conditional probabilities:

P(A \cap B) = P(A) \cdot P(B \mid A) = P(B) \cdot P(A \mid B)

For example, let's say that A = 'the coin is fair' and B = 'we flipped heads'. Then, our detective is especially interested in the value of P(A \mid B): that's her posterior probability that the coin is fair, after she flipped heads. She can rearrange the equation above a bit to get:

P(A \mid B) = \frac{P(A) \cdot P(B \mid A)}{P(B)}

This formula is typically called Bayes' rule.
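If you'd rather see the arithmetic spelled out, here's the detective's first update in plain Python, a minimal sketch using the numbers from the setup above:

```python
# Bayes' rule for the detective's first flip.
prior_fair = 2/3          # P(fair)
prior_biased = 1/3        # P(biased)
heads_given_fair = 1/2    # P(H | fair)
heads_given_biased = 1/4  # P(H | biased)

# Law of total probability: P(H) sums over both hypotheses.
p_heads = prior_fair * heads_given_fair + prior_biased * heads_given_biased

# Bayes' rule: P(fair | H) = P(fair) * P(H | fair) / P(H).
posterior_fair = prior_fair * heads_given_fair / p_heads
print(posterior_fair)  # -> 0.8
```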

Divide and conquer

There's a different way to write down Bayes' rule that I find much more intuitive.

To get the message, though, we have to get used to thinking in odds instead of probabilities. Gamblers love this—instead of "1/3 chance of this, 2/3 chance of that," they say "odds are 1:2."

Odds are like probabilities that don't need to sum to one (1:2 is the same as 2:4). This often comes in pretty handy - it allows us to focus on the more important parts of the picture. Multiplying the odds by a constant so that they add up to one is called normalization; for example, odds 1:2 normalize to probabilities 1/3 and 2/3. So, thinking with odds is like thinking with probabilities whenever we don't want to worry about normalization.

With odds, Bayes' formula is super clean! First, let's take Bayes' rule and write it one more time, this time for the biased hypothesis:

P(\textsf{biased} \mid \textsf{H}) = \frac{P(\textsf{biased}) \cdot P(\textsf{H} \mid \textsf{biased})}{P(\textsf{H})}

If we divide Bayes' rule for the fair hypothesis by this one, the P(\textsf{H}) in the denominators cancels and we get this equation:

\frac{P(\textsf{fair} \mid \textsf{H})}{P(\textsf{biased} \mid \textsf{H})} = \frac{P(\textsf{fair})}{P(\textsf{biased})} \cdot \frac{P(\textsf{H} \mid \textsf{fair})}{P(\textsf{H} \mid \textsf{biased})}

This is actually a very relatable formula! The ratio P(\textsf{fair}) : P(\textsf{biased}) is simply the prior probability that our detective has on the two hypotheses, written as odds. For example, if P(\textsf{fair}) = 2/3, this ratio is just 2/3 : 1/3 = 2:1.

The fraction on the left-hand side, P(\textsf{fair} \mid \textsf{H}) : P(\textsf{biased} \mid \textsf{H}), is pretty much the same - it is the posterior probability after we've seen the heads, written as odds.

Finally, there's the ratio P(\textsf{H} \mid \textsf{fair}) : P(\textsf{H} \mid \textsf{biased})—that's how much more likely heads are under the \textsf{fair} hypothesis than under the \textsf{biased} one. Conditional probabilities like these are typically called likelihoods, so this ratio is the likelihood ratio.

Here's how it works in practice. 2 Applying Bayes' rule is as simple as

  1. Write down the prior odds
  2. Multiply with likelihood ratios
  3. Profit!

You can also test in the following widget that it works the same for more than two hypotheses (we won't explicitly need this right now, but it's good to know).

Bayes Calculator

[Interactive widget: with prior odds 2:1 for \textsf{fair} vs \textsf{biased} and heads likelihoods 1/2 vs 1/4, the posterior odds come out 1.00 : 0.25, i.e., a posterior probability of 80% vs 20%.]
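Here's that recipe as a few lines of Python, a minimal sketch (the helper names are mine, and the numbers match the widget above). Since we normalize only at the very end, the exact same code handles any number of hypotheses:

```python
def update_odds(odds, likelihoods):
    """One Bayesian update in odds form: multiply entrywise."""
    return [o * l for o, l in zip(odds, likelihoods)]

def normalize(odds):
    """Turn odds into probabilities that sum to one."""
    total = sum(odds)
    return [o / total for o in odds]

# Step 1: prior odds 2 : 1 for (fair, biased).
odds = [2, 1]
# Step 2: multiply by the likelihood of heads under each hypothesis.
odds = update_odds(odds, [1/2, 1/4])
print(odds)             # [1.0, 0.25] -- the widget's 1.00 : 0.25
# Step 3: normalize only when you want probabilities back.
print(normalize(odds))  # [0.8, 0.2]  -- 80% fair, 20% biased
```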

In our detective's case, she increased her probability of a fair coin from 66.7% to 80%. Her suspicion that the coin is fair got boosted a bit. But it could also be a fluke. She has to flip the coin a few more times.

Go forth and multiply

Let's see what happens when our detective keeps flipping over and over, updating her beliefs each time using Bayes' rule. For example, if she gets \{H, T, T, H, T\}, here's what's happening:

Bayes sequence calculator

[Interactive widget: step through the sequence \{H, T, T, H, T\} starting from prior odds 2:1; the widget multiplies in one likelihood ratio per flip and reports the running posterior odds and posterior probability.]

With every flip, she multiplies her odds by a likelihood ratio: 2/1 for heads, 2/3 for tails.

In this example, the two heads outweigh the three tails, so the sequence slightly favors the fair-coin hypothesis. After converting back to probabilities, her initial 66.7% belief in fairness increased to about 70%. She's going to need way more flips to know for sure!
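As a sanity check, here's that run as a short Python loop, a sketch using the likelihood ratios from above:

```python
# Likelihood ratios P(flip | fair) / P(flip | biased).
ratio = {"H": (1/2) / (1/4),   # = 2,   heads favor fair
         "T": (1/2) / (3/4)}   # = 2/3, tails favor biased

odds = 2.0                     # prior odds of fair : biased
for flip in "HTTHT":
    odds *= ratio[flip]

print(odds)               # ~2.37, i.e. posterior odds of about 2.37 : 1
print(odds / (odds + 1))  # ~0.70, i.e. about 70% that the coin is fair
```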

Before we start pondering what's gonna happen long-term, let's simplify our calculator a bit. Things get clearer if we take logarithms of everything. Instead of multiplying odds and likelihood ratios, we will be adding so-called log-odds.

Here's the same step-by-step calculator after taking logarithms. For example, "prior 1:0" in the calculator means that the prior is 2^1 : 2^0. 3

Bayes Sequence Explorer (Log Space)

[Interactive widget: the same sequence \{H, T, T, H, T\} in log space, starting from prior log-odds 1 vs 0; each flip adds its evidence in bits, and the widget reports the posterior log-odds, posterior odds, and posterior probability.]

Notice that all the log-likelihoods in the calculator are now negative numbers. That makes sense - probabilities are smaller than one, so after taking the log, they become negative. It's often useful to talk about the absolute values of those numbers, which is why people define a quantity called surprisal: whenever something happens with probability p, the expression \log 1/p is called its surprisal, and its units are bits (because we use the binary logarithm).

This is a logarithmic viewpoint on how surprised you are when something happens. Getting heads when you thought it was 1% likely? Super surprising (\log 1/0.01 \approx 6.6 bits). Getting heads on a fair coin? Meh (\log 1/0.5 = 1 bit).

Notice how in the widget, I replaced ':' with 'vs'. This is because in the logarithmic world, it's no longer the ratio that's important, it's the difference. For example, on heads, the \textsf{fair} hypothesis is surprised by 1 bit, while the \textsf{biased} hypothesis is surprised by 2 bits. The difference of the two surprisals - 1 bit - is the most important takeaway from the flip. This difference measures the bits of evidence that the flipped heads provide. We are essentially adding up these differences in the posterior log-odds, and they ultimately translate into the posterior probability.
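Here's the log-space version of the earlier loop, a sketch in Python: multiplication of odds becomes addition of evidence in bits, and we recover the same posterior odds at the end.

```python
from math import log2

def surprisal(p):
    """Surprisal in bits of an outcome that has probability p."""
    return log2(1 / p)

# Bits of evidence for fair over biased from one flip:
# (surprisal under biased) minus (surprisal under fair).
evidence = {"H": surprisal(1/4) - surprisal(1/2),  # 2 - 1    = +1 bit
            "T": surprisal(3/4) - surprisal(1/2)}  # 0.42 - 1 = -0.58 bits

log_odds = log2(2 / 1)  # prior odds 2 : 1 -> 1 bit of prior log-odds
for flip in "HTTHT":
    log_odds += evidence[flip]

print(log_odds)       # ~1.25 bits
print(2 ** log_odds)  # ~2.37 : 1, the same posterior odds as before
```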

Expected evidence

Remember, whenever our detective flips heads, she is getting 2 - 1 = 1 bit of evidence that the coin is fair. Whenever she flips tails, she's getting 1 - 0.42 \approx 0.58 bits of evidence that the coin is biased.

This holds regardless of which one of the two hypotheses is true. Now, let's say the coin actually is biased. How fast will our Bayesian detective figure this out?

This all comes down to adding bits of evidence. We can calculate the average number of bits that she learns per flip: with probability 0.75 she flips tails and gains 0.58 bits toward \textsf{biased}, and with probability 0.25 she flips heads and loses 1 bit. That averages out to

0.75 \cdot 0.58 - 0.25 \cdot 1 \approx 0.19

bits of evidence toward the truth.

What does this mean in practice? After five flips, the detective gets 5 \cdot 0.19 \approx 1 bit of evidence on average. So if she starts thinking 2:1 the coin is fair, after about five flips she'll be at 1:1. Another five flips gets her to 2:1 the coin is biased, the next five flips to 4:1, and so on.

The actual odds fluctuate around this average. But thanks to the law of large numbers, after N flips the total evidence will be close to 0.19 \cdot N. 4

Try it yourself below! I recommend checking edge cases like 50% vs 51% to get intuition about when the law of large numbers kicks in, or comparing 50% vs 1% with 1% vs 50%.

Evidence Accumulation Simulator

[Interactive widget: simulates flipping the coin and plots the accumulated evidence. For the detective's coin, it computes]

D(p,q)=0.25log2(0.250.50)+0.75log2(0.750.50)=0.1887 bits per flip\begin{align*} D(p, q) &= 0.25\log_2\left(\frac{0.25}{0.50}\right) + 0.75\log_2\left(\frac{0.75}{0.50}\right) \\ &= 0.1887\text{ bits per flip} \end{align*}
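If you'd like to replicate the simulator, here's a minimal sketch in Python, assuming the coin really is biased: it adds up the per-flip evidence and prints it next to the expected 0.19 \cdot N line.

```python
import random
from math import log2

random.seed(0)  # reproducible run

# The coin really is biased: P(H) = 0.25, P(T) = 0.75.
# Bits of evidence toward "biased" contributed by each outcome:
evidence = {"H": log2(0.25 / 0.50),  # -1 bit    (heads point toward fair)
            "T": log2(0.75 / 0.50)}  # +0.58 bits

total = 0.0
for n in range(1, 501):
    flip = "H" if random.random() < 0.25 else "T"
    total += evidence[flip]
    if n % 100 == 0:
        # accumulated evidence vs. the law-of-large-numbers line 0.19 * n
        print(n, round(total, 1), round(0.1887 * n, 1))
```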

KL divergence

KL divergence is just the general formula for expected evidence accumulation. Say you've got two distributions p and q, where p is what's really happening, but you only know it's either p or q.

You can keep sampling from the unknown distribution and play the Bayesian game: whenever you sample outcome i, compare the likelihoods p_i and q_i and update your beliefs. In classical Bayes' rule, you work with the so-called likelihood ratio p_i / q_i, but over many updates, it's easier to work with the bits of evidence, or more technically the log-likelihood ratio \log p_i / q_i.

On average, each sample from the true distribution p gives you:5

D(p,q)=i=1npilogpiqiD(p,q) = \sum_{i = 1}^n p_i \cdot \log \frac{p_i}{q_i}

bits of evidence toward the truth.

When D(p, q) is small, the distribution q is a good probabilistic model of p.

That is, it takes many samples before our detective figures out that she's indeed sampling from p, and not from the imposter q.

If D(p, q) is small (less than one), you can also think of 1/D(p, q) as the number of samples it takes to get one bit of evidence. One bit of evidence means that the odds for the true hypothesis doubled. 6
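The KL formula itself is two lines of Python; here's a sketch evaluated on the detective's coin (note that the argument order matters, since D(p, q) \ne D(q, p) in general):

```python
from math import log2

def kl_divergence(p, q):
    """D(p, q) in bits: expected evidence per sample when p is true."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q))

biased = [0.25, 0.75]  # the true distribution p (heads, tails)
fair = [0.50, 0.50]    # the model distribution q

d = kl_divergence(biased, fair)
print(d)      # ~0.1887 bits per flip
print(1 / d)  # ~5.3 flips per bit of evidence
```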

Notice that KL divergence is about the evidence, not your starting beliefs. It tells you how fast beliefs change, no matter whether the detective started with 1:2 or 1000:1.

What's next?

In the next section, we'll dig deeper into the KL formula and see how it connects to entropy and cross-entropy.