Bayes' rule & KL divergence

In this chapter, we'll discuss Bayes' rule and use it to introduce KL divergence.1

💡If there is one thing you remember from this chapter...

KL divergence measures how well a distribution p is fitted by a model distribution q.

Bayes' rule

Let's review Bayes' rule—it's how you update your beliefs when you get new information. Here's a setup that we will try to understand very deeply in the next few chapters:

Bayesian detective

A detective has a coin that might, or might not, be rigged. To keep it simple, let's say there are only two options: it's either fair (50/50 for heads/tails) or it's biased toward tails (25/75 for heads/tails).

The detective keeps flipping this coin, hoping to learn the truth.

Let's see how Bayes' rule helps our detective. To use the rule, she needs a starting guess—a prior—for the probability of each hypothesis. Let's say that the detective thinks there's a 2/3 chance the coin is fair and 1/3 chance it's biased.

She flips the coin: Heads! This is evidence for the \textsf{fair} hypothesis, since heads are more likely with a fair coin.

Our detective can use Bayes' rule to calculate the new probability—the posterior—that the coin is fair. At this point, let's have a quick refresher on how Bayes' rule works.

The point of the rule is that if you have two events A, B, then you can write down the probability that both A and B happen in two ways, using conditional probabilities:

P(A \cap B) = P(A) \cdot P(B \mid A) = P(B) \cdot P(A \mid B)

For example, let's say that A = 'the coin is fair' and B = 'we flipped heads'. Then, our detective is especially interested in the value of P(A \mid B): that's her posterior probability that the coin is fair, after she flipped heads. She can rearrange the equation above a bit to get:

P(A \mid B) = \frac{P(A) \cdot P(B \mid A)}{P(B)}

This formula is typically called Bayes' rule.
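If you'd rather see the arithmetic spelled out, here's the detective's first update in plain Python, a minimal sketch using the numbers from the setup above:

```python
# Bayes' rule for the detective's first flip.
prior_fair = 2/3          # P(fair)
prior_biased = 1/3        # P(biased)
heads_given_fair = 1/2    # P(H | fair)
heads_given_biased = 1/4  # P(H | biased)

# Law of total probability: P(H) sums over both hypotheses.
p_heads = prior_fair * heads_given_fair + prior_biased * heads_given_biased

# Bayes' rule: P(fair | H) = P(fair) * P(H | fair) / P(H).
posterior_fair = prior_fair * heads_given_fair / p_heads
print(posterior_fair)  # -> 0.8
```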

Divide and conquer

There's a different way to write down Bayes' rule that I find much more intuitive.

To get the message, though, we have to get used to thinking in odds instead of probabilities. Gamblers love this—instead of "1/3 chance of this, 2/3 chance of that," they say "odds are 1:2."

Odds are like probabilities that don't need to sum to one (1:2 is the same as 2:4). This often comes in pretty handy - it allows us to focus on the more important parts of the picture. Multiplying the odds by a constant so that they add up to one is called normalization; for example, odds 1:2 normalize to probabilities 1/3 and 2/3. So, thinking with odds is like thinking with probabilities whenever we don't want to worry about normalization.

With odds, Bayes' formula is super clean! First, let's take Bayes' rule and write it one more time, this time for the biased hypothesis:

P(\textsf{biased} \mid \textsf{H}) = \frac{P(\textsf{biased}) \cdot P(\textsf{H} \mid \textsf{biased})}{P(\textsf{H})}

If we divide Bayes' rule for the fair hypothesis by this one, the P(\textsf{H}) in the denominators cancels and we get this equation:

\frac{P(\textsf{fair} \mid \textsf{H})}{P(\textsf{biased} \mid \textsf{H})} = \frac{P(\textsf{fair})}{P(\textsf{biased})} \cdot \frac{P(\textsf{H} \mid \textsf{fair})}{P(\textsf{H} \mid \textsf{biased})}

This is actually a very relatable formula! The ratio P(\textsf{fair}) : P(\textsf{biased}) is simply the prior probability that our detective has on the two hypotheses, written as odds. For example, if P(\textsf{fair}) = 2/3, this ratio is just 2/3 : 1/3 = 2:1.

The fraction on the left-hand side, P(\textsf{fair} \mid \textsf{H}) : P(\textsf{biased} \mid \textsf{H}), is pretty much the same - it is the posterior probability after we've seen the heads, written as odds.

Finally, there's the ratio P(\textsf{H} \mid \textsf{fair}) : P(\textsf{H} \mid \textsf{biased})—that's how much more likely heads are under the \textsf{fair} hypothesis than under the \textsf{biased} one. Conditional probabilities like these are typically called likelihoods, so this ratio is the likelihood ratio.

Here's how it works in practice. 2 Applying Bayes' rule is as simple as

  1. Write down the prior odds
  2. Multiply with likelihood ratios
  3. Profit!

You can also test in the following widget that it works the same for more than two hypotheses (we won't explicitly need this right now, but it's good to know).

Bayes Calculator

[Interactive widget: with prior odds 2:1 for \textsf{fair} vs \textsf{biased} and heads likelihoods 1/2 vs 1/4, the posterior odds come out 1.00 : 0.25, i.e., a posterior probability of 80% vs 20%.]
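Here's that recipe as a few lines of Python, a minimal sketch (the helper names are mine, and the numbers match the widget above). Since we normalize only at the very end, the exact same code handles any number of hypotheses:

```python
def update_odds(odds, likelihoods):
    """One Bayesian update in odds form: multiply entrywise."""
    return [o * l for o, l in zip(odds, likelihoods)]

def normalize(odds):
    """Turn odds into probabilities that sum to one."""
    total = sum(odds)
    return [o / total for o in odds]

# Step 1: prior odds 2 : 1 for (fair, biased).
odds = [2, 1]
# Step 2: multiply by the likelihood of heads under each hypothesis.
odds = update_odds(odds, [1/2, 1/4])
print(odds)             # [1.0, 0.25] -- the widget's 1.00 : 0.25
# Step 3: normalize only when you want probabilities back.
print(normalize(odds))  # [0.8, 0.2]  -- 80% fair, 20% biased
```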

In our detective's case, she increased her probability of a fair coin from 66.7% to 80%. Her suspicion that the coin is fair got boosted a bit. But it could also be a fluke. She has to flip the coin a few more times.

Go forth and multiply

Let's see what happens when our detective keeps flipping over and over, updating her beliefs each time using Bayes' rule. For example, if she gets \{H, T, T, H, T\}, here's what's happening:

Bayes sequence calculator

[Interactive widget: step through the sequence \{H, T, T, H, T\} starting from prior odds 2:1; the widget multiplies in one likelihood ratio per flip and reports the running posterior odds and posterior probability.]

With every flip, she multiplies her odds by a likelihood ratio: 2/1 for heads, 2/3 for tails.

In this example, the two heads outweigh the three tails, so the sequence slightly favors the fair-coin hypothesis. After converting back to probabilities, her initial 66.7% belief in fairness increased to about 70%. She's going to need way more flips to know for sure!
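As a sanity check, here's that run as a short Python loop, a sketch using the likelihood ratios from above:

```python
# Likelihood ratios P(flip | fair) / P(flip | biased).
ratio = {"H": (1/2) / (1/4),   # = 2,   heads favor fair
         "T": (1/2) / (3/4)}   # = 2/3, tails favor biased

odds = 2.0                     # prior odds of fair : biased
for flip in "HTTHT":
    odds *= ratio[flip]

print(odds)               # ~2.37, i.e. posterior odds of about 2.37 : 1
print(odds / (odds + 1))  # ~0.70, i.e. about 70% that the coin is fair
```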

Before we start pondering what's gonna happen long-term, let's simplify our calculator a bit. Things get clearer if we take logarithms of everything. Instead of multiplying odds and likelihood ratios, we will be adding so-called log-odds.

Here's the same step-by-step calculator after taking logarithms. For example, "prior 1:0" in the calculator means that the prior is 2^1 : 2^0. 3

Bayes Sequence Explorer (Log Space)

[Interactive widget: the same sequence \{H, T, T, H, T\} in log space, starting from prior log-odds 1 vs 0; each flip adds its evidence in bits, and the widget reports the posterior log-odds, posterior odds, and posterior probability.]

Notice that all the log-likelihoods in the calculator are now negative numbers. That makes sense - probabilities are smaller than one, so after taking the log, they become negative. It's often useful to talk about the absolute values of those numbers, which is why people define a quantity called surprisal: whenever something happens with probability p, the expression \log 1/p is called its surprisal, and its units are bits (because we use the binary logarithm).

This is a logarithmic viewpoint on how surprised you are when something happens. Getting heads when you thought it was 1% likely? Super surprising (\log 1/0.01 \approx 6.6 bits). Getting heads on a fair coin? Meh (\log 1/0.5 = 1 bit).

Notice how in the widget, I replaced ':' with 'vs'. This is because in the logarithmic world, it's no longer the ratio that's important, it's the difference. For example, on heads, the \textsf{fair} hypothesis is surprised by 1 bit, while the \textsf{biased} hypothesis is surprised by 2 bits. The difference of the two surprisals - 1 bit - is the most important takeaway from the flip. This difference measures the bits of evidence that the flipped heads provide. We are essentially adding up these differences in the posterior log-odds, and they ultimately translate into the posterior probability.
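Here's the log-space version of the earlier loop, a sketch in Python: multiplication of odds becomes addition of evidence in bits, and we recover the same posterior odds at the end.

```python
from math import log2

def surprisal(p):
    """Surprisal in bits of an outcome that has probability p."""
    return log2(1 / p)

# Bits of evidence for fair over biased from one flip:
# (surprisal under biased) minus (surprisal under fair).
evidence = {"H": surprisal(1/4) - surprisal(1/2),  # 2 - 1    = +1 bit
            "T": surprisal(3/4) - surprisal(1/2)}  # 0.42 - 1 = -0.58 bits

log_odds = log2(2 / 1)  # prior odds 2 : 1 -> 1 bit of prior log-odds
for flip in "HTTHT":
    log_odds += evidence[flip]

print(log_odds)       # ~1.25 bits
print(2 ** log_odds)  # ~2.37 : 1, the same posterior odds as before
```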

Expected evidence

Remember, whenever our detective flips heads, she is getting 2 - 1 = 1 bit of evidence that the coin is fair. Whenever she flips tails, she's getting 1 - 0.42 \approx 0.58 bits of evidence that the coin is biased.

This holds regardless of which one of the two hypotheses is true. Now, let's say the coin actually is biased. How fast will our Bayesian detective figure this out?

This all comes down to adding bits of evidence. We can calculate the average number of bits that she learns per flip: with probability 0.75 she flips tails and gains 0.58 bits toward \textsf{biased}, and with probability 0.25 she flips heads and loses 1 bit. That averages out to

0.75 \cdot 0.58 - 0.25 \cdot 1 \approx 0.19

bits of evidence toward the truth.

What does this mean in practice? After five flips, the detective gets 5 \cdot 0.19 \approx 1 bit of evidence on average. So if she starts thinking 2:1 the coin is fair, after about five flips she'll be at 1:1. Another five flips gets her to 2:1 the coin is biased, the next five flips to 4:1, and so on.

The actual odds fluctuate around this average. But thanks to the law of large numbers, after N flips the total evidence will be close to 0.19 \cdot N. 4

Try it yourself below! I recommend checking edge cases like 50% vs 51% to get intuition about when the law of large numbers kicks in, or comparing 50% vs 1% with 1% vs 50%.

Evidence Accumulation Simulator

[Interactive widget: simulates flipping the coin and plots the accumulated evidence. For the detective's coin, it computes]

D(p,q)=0.25log2(0.250.50)+0.75log2(0.750.50)=0.1887 bits per flip\begin{align*} D(p, q) &= 0.25\log_2\left(\frac{0.25}{0.50}\right) + 0.75\log_2\left(\frac{0.75}{0.50}\right) \\ &= 0.1887\text{ bits per flip} \end{align*}
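If you'd like to replicate the simulator, here's a minimal sketch in Python, assuming the coin really is biased: it adds up the per-flip evidence and prints it next to the expected 0.19 \cdot N line.

```python
import random
from math import log2

random.seed(0)  # reproducible run

# The coin really is biased: P(H) = 0.25, P(T) = 0.75.
# Bits of evidence toward "biased" contributed by each outcome:
evidence = {"H": log2(0.25 / 0.50),  # -1 bit    (heads point toward fair)
            "T": log2(0.75 / 0.50)}  # +0.58 bits

total = 0.0
for n in range(1, 501):
    flip = "H" if random.random() < 0.25 else "T"
    total += evidence[flip]
    if n % 100 == 0:
        # accumulated evidence vs. the law-of-large-numbers line 0.19 * n
        print(n, round(total, 1), round(0.1887 * n, 1))
```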

KL divergence

KL divergence is just the general formula for expected evidence accumulation. Say you've got two distributions p and q, where p is what's really happening, but you only know it's either p or q.

You can keep sampling from the unknown distribution and play the Bayesian game: whenever you sample outcome i, compare the likelihoods p_i and q_i and update your beliefs. In classical Bayes' rule, you work with the so-called likelihood ratio p_i / q_i, but over many updates, it's easier to work with the bits of evidence, or more technically the log-likelihood ratio \log p_i / q_i.

On average, each sample from the true distribution p gives you:5

D(p,q)=i=1npilogpiqiD(p,q) = \sum_{i = 1}^n p_i \cdot \log \frac{p_i}{q_i}

bits of evidence toward the truth.

When D(p, q) is small, the distribution q is a good probabilistic model of p.

That is, it takes many samples before our detective figures out that she's indeed sampling from p, and not from the imposter q.

If D(p, q) is small (less than one), you can also think of 1/D(p, q) as the number of samples it takes to get one bit of evidence. One bit of evidence means that the odds for the true hypothesis doubled. 6
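The KL formula itself is two lines of Python; here's a sketch evaluated on the detective's coin (note that the argument order matters, since D(p, q) \ne D(q, p) in general):

```python
from math import log2

def kl_divergence(p, q):
    """D(p, q) in bits: expected evidence per sample when p is true."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q))

biased = [0.25, 0.75]  # the true distribution p (heads, tails)
fair = [0.50, 0.50]    # the model distribution q

d = kl_divergence(biased, fair)
print(d)      # ~0.1887 bits per flip
print(1 / d)  # ~5.3 flips per bit of evidence
```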

Notice that KL divergence is about the evidence, not your starting beliefs. It tells you how fast beliefs change, no matter whether the detective started with 1:2 or 1000:1.

What's next?

In the next section, we'll dig deeper into the KL formula and see how it connects to entropy and cross-entropy.