Bayes' rule & KL divergence
In this chapter, we'll discuss Bayes' rule and use it to introduce KL divergence.1
KL divergence measures how well a distribution $p$ is fitted by a model distribution $q$.
Bayes' rule
Let's review Bayes' rule—it's how you update your beliefs when you get new information. Here's a setup that we will try to understand very deeply in the next few chapters:
A detective has a coin that might or might not be rigged. To keep it simple, let's say there are only two options: it's either fair (50/50 for heads/tails) or it's biased toward tails (25/75 for heads/tails).
The detective keeps flipping this coin, hoping to learn the truth.
Let's see how Bayes' rule helps our detective. To use the rule, she needs a starting guess—a prior—for the probability of each hypothesis. Let's say that the detective thinks there's a 2/3 chance the coin is fair and 1/3 chance it's biased.
She flips the coin: Heads! This is evidence for the fair-coin hypothesis, since heads are more likely with a fair coin.
Our detective can use Bayes' rule to calculate the new probability—the posterior—that the coin is fair. At this point, let's have a quick refresher on how Bayes' rule works.
The point of the rule is that if you have two events $A$ and $B$, then you can write down the probability that both $A$ and $B$ happen in two ways, using conditional probabilities:

$$P(A \cap B) = P(A) \cdot P(B \mid A) = P(B) \cdot P(A \mid B)$$
For example, let's say that $A$ stands for 'the coin is fair' and $B$ for 'we flipped heads'. Then, our detective is especially interested in the value of $P(A \mid B) = P(\mathrm{fair} \mid \mathrm{heads})$: That's her posterior probability that the coin is fair, after she flipped heads. She can rearrange the equation above a bit to get:

$$P(\mathrm{fair} \mid \mathrm{heads}) = \frac{P(\mathrm{fair}) \cdot P(\mathrm{heads} \mid \mathrm{fair})}{P(\mathrm{heads})}$$
This formula is typically called Bayes' rule.
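As a quick sanity check, we can plug in the detective's numbers (prior $2/3$ vs $1/3$, heads probability $1/2$ for the fair coin vs $1/4$ for the biased one), expanding the denominator $P(\mathrm{heads})$ over the two hypotheses:

$$P(\mathrm{fair} \mid \mathrm{heads}) = \frac{\tfrac{2}{3} \cdot \tfrac{1}{2}}{\tfrac{2}{3} \cdot \tfrac{1}{2} + \tfrac{1}{3} \cdot \tfrac{1}{4}} = \frac{1/3}{5/12} = \frac{4}{5}$$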
Divide and conquer
There's a different way to write down Bayes' rule that I find much more intuitive.
To get the message, though, we have to get used to thinking in odds instead of probabilities. Gamblers love this—instead of "1/3 chance of this, 2/3 chance of that," they say "odds are 1:2."
Odds are like probabilities that don't need to sum to one (1:2 is the same as 2:4). This often comes in pretty handy - it allows us to focus on the more important parts of the picture. Multiplying the odds by a common factor so that they add up to one is called normalization. So, thinking with odds is like thinking with probabilities whenever we don't want to worry about normalization.
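If you want to see normalization spelled out, here's a minimal Python sketch (the function name `normalize` is just for illustration):

```python
def normalize(odds):
    """Scale odds by a common factor so they sum to one, turning them into probabilities."""
    total = sum(odds)
    return [x / total for x in odds]

print(normalize([1, 2]))  # [0.333..., 0.666...]
print(normalize([2, 4]))  # identical result: odds only matter up to scaling
```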
With odds, Bayes' formula is super clean! First, let's take the formula above and write it one more time, this time for the biased hypothesis:

$$P(\mathrm{biased} \mid \mathrm{heads}) = \frac{P(\mathrm{biased}) \cdot P(\mathrm{heads} \mid \mathrm{biased})}{P(\mathrm{heads})}$$
If we divide the formula for the fair hypothesis by the one for the biased hypothesis, the $P(\mathrm{heads})$ terms cancel and we get this equation:

$$\frac{P(\mathrm{fair} \mid \mathrm{heads})}{P(\mathrm{biased} \mid \mathrm{heads})} = \frac{P(\mathrm{fair})}{P(\mathrm{biased})} \cdot \frac{P(\mathrm{heads} \mid \mathrm{fair})}{P(\mathrm{heads} \mid \mathrm{biased})}$$
This is actually a very relatable formula! The ratio $P(\mathrm{fair}) / P(\mathrm{biased})$ is simply the prior probability that our detective has on the two hypotheses, written as odds. For example, if $P(\mathrm{fair}) = 2/3$ and $P(\mathrm{biased}) = 1/3$, this ratio is just $2:1$.
The fraction on the left-hand side is pretty much the same - it is the posterior probability after we've seen the heads, written as odds.
Finally, there's the ratio $P(\mathrm{heads} \mid \mathrm{fair}) / P(\mathrm{heads} \mid \mathrm{biased})$—that's how much more likely heads are under the fair-coin hypothesis. Conditional probabilities like $P(\mathrm{heads} \mid \mathrm{fair})$ are typically called likelihoods, so this ratio is the likelihood ratio.
Here's how it works in practice. 2 Applying Bayes' rule is as simple as
- Write down the prior odds
- Multiply by the likelihood ratios
- Profit!
You can also test in the following widget that it works the same for more than two hypotheses (we won't explicitly need this right now, but it's good to know).
Bayes Calculator
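If you prefer code to widgets, here's a minimal Python sketch of the same update (this is not the code behind the widget; the dictionary names are just for illustration, and the numbers are the detective's):

```python
# Prior odds and likelihoods of the observed outcome (heads) under each hypothesis.
prior_odds = {"fair": 2, "biased": 1}              # 2:1 in favor of the fair coin
likelihood_heads = {"fair": 0.5, "biased": 0.25}   # P(heads | hypothesis)

# Bayes' rule in odds form: multiply the prior odds by the likelihoods, entry by entry.
posterior_odds = {h: prior_odds[h] * likelihood_heads[h] for h in prior_odds}

# Normalize only at the end, when we actually want probabilities.
total = sum(posterior_odds.values())
posterior = {h: o / total for h, o in posterior_odds.items()}
print(posterior)  # {'fair': 0.8, 'biased': 0.2}
```

Handling more than two hypotheses is just one more entry in each dictionary.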
In our detective's case, she increased her probability of a fair coin from $2/3 \approx 67\%$ to $4/5 = 80\%$. Her suspicion that the coin is fair got boosted a bit. But it could also be a fluke. She has to flip the coin a few more times.
Go forth and multiply
Let's see what happens when our detective keeps flipping over and over, updating her beliefs each time using Bayes' rule. For example, if she flips five times and gets two heads and three tails, here's what's happening:
Bayes sequence calculator
With every flip, she multiplies her odds by a likelihood ratio: $\frac{0.5}{0.25} = 2$ for heads, $\frac{0.5}{0.75} = \frac{2}{3}$ for tails.
In this example, even three tails out of five flips slightly favor the fair-coin hypothesis. After converting back to probabilities, her initial 66.7% belief in fairness increased to about 70%. She's going to need way more flips to know for sure!
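Spelled out in odds, two heads and three tails turn her prior of $2:1$ into

$$\underbrace{2 : 1}_{\text{prior}} \;\times\; \underbrace{2 \cdot 2 \cdot \tfrac{2}{3} \cdot \tfrac{2}{3} \cdot \tfrac{2}{3}}_{\text{likelihood ratios}} \;=\; \tfrac{64}{27} : 1 \;\approx\; 2.37 : 1,$$

which corresponds to a probability of $\tfrac{64}{64+27} = \tfrac{64}{91} \approx 70\%$ that the coin is fair.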
Before we start pondering what's gonna happen long-term, let's simplify our calculator a bit. Things get clearer if we take logarithms of everything. Instead of multiplying odds and likelihoods, we will be adding so-called log-odds.
Here's the same step-by-step calculator after taking logarithms. For example, "prior 1:0" in the calculator means that the prior is $2^1 : 2^0$, i.e., $2:1$. 3
Bayes Sequence Explorer (Log Space)
Notice that all the likelihoods (numbers in orange rows) are now negative numbers. That makes sense - probabilities are smaller than one, so after taking the log, they become negative. It's often useful to talk about the absolute values of those numbers, which is why people define a quantity called surprisal: Whenever you have something happening with probability $p$, the expression $\log_2 \frac{1}{p}$ can be called its surprisal and its units are bits (because we use the binary logarithm).
This is a logarithmic viewpoint on how surprised you are when something happens. Getting heads when you thought it was 1% likely? Super surprising ($\log_2 \frac{1}{0.01} \approx 6.6$ bits). Getting heads on a fair coin? Meh ($\log_2 \frac{1}{0.5} = 1$ bit).
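Here's the same idea as a tiny Python helper (a sketch; `surprisal` is just an illustrative name):

```python
from math import log2

def surprisal(p):
    """Surprisal, in bits, of an event that has probability p."""
    return -log2(p)

print(surprisal(0.01))  # ~6.64 bits: super surprising
print(surprisal(0.5))   # 1.0 bit: meh
```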
Notice how in the widget, I replaced ':' with 'vs'. This is because in the logarithmic world, it's no longer the ratio that's important, it's the difference. For example, on heads, the fair hypothesis is surprised by $1$ bit, while the biased hypothesis is surprised by $2$ bits. The difference of the two surprisals - 1 bit - is the most important takeaway from the flip. This difference measures the bits of evidence that the flipped heads provide. We are essentially adding up these differences in the posterior log-odds, and they ultimately translate into the posterior probability.
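To see that adding differences of surprisals really is the same calculation as before, here's a log-space sketch of the five-flip example in Python (two heads and three tails; the order doesn't matter, and the flip sequence is just for illustration):

```python
from math import log2

log_odds_fair = log2(2 / 1)          # prior odds 2:1 for "fair" -> 1 bit
for flip in ["H", "H", "T", "T", "T"]:
    if flip == "H":
        log_odds_fair += log2(0.5) - log2(0.25)   # heads: +1 bit of evidence for "fair"
    else:
        log_odds_fair += log2(0.5) - log2(0.75)   # tails: about -0.58 bits

odds = 2 ** log_odds_fair
print(log_odds_fair)        # ~1.25 bits
print(odds / (1 + odds))    # ~0.70 -- the same ~70% as before
```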
Expected evidence
Remember, whenever our detective flips heads, she is getting $1$ bit of evidence that the coin is fair. Whenever she flips tails, she's getting $\log_2 \frac{0.75}{0.5} \approx 0.58$ bits of evidence that the coin is biased.
This holds regardless of which one of the two hypotheses is true. Now, let's say the coin actually is biased. How fast will our Bayesian detective figure this out?
This all comes down to adding bits of evidence. We can calculate the average number of bits that she learns per flip:

$$0.25 \cdot \log_2 \frac{0.25}{0.5} + 0.75 \cdot \log_2 \frac{0.75}{0.5} \approx -0.25 + 0.44 \approx 0.19$$

bits of evidence toward the truth.
What does this mean in practice? After five flips, the detective gets about $1$ bit of evidence on average ($5 \times 0.19 \approx 1$). So if she starts thinking 2:1 the coin is fair, after about five flips she'll be at 1:1. Another five flips gets her to 2:1 the coin is biased, the next five flips to 4:1, and so on.
The actual odds fluctuate around this average. But thanks to the law of large numbers, after $N$ flips the total evidence will be close to $0.19 \cdot N$ bits. 4
Try it yourself below! I recommend checking edge cases like 50% vs 51% to get intuition about when the law of large numbers kicks in or what's the difference between 50% vs 1% and 1% vs 50%.
Evidence Accumulation Simulator
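Here's a rough Python simulation of the same experiment, under the assumption that the coin really is biased 25/75 (a sketch, not the simulator's code; names are illustrative):

```python
import random
from math import log2

random.seed(0)

p_heads_fair, p_heads_biased = 0.5, 0.25
n_flips = 100

# Accumulate bits of evidence for "biased", sampling from the biased coin (the truth).
evidence_for_biased = 0.0
for _ in range(n_flips):
    heads = random.random() < p_heads_biased
    if heads:
        evidence_for_biased += log2(p_heads_biased / p_heads_fair)              # -1 bit
    else:
        evidence_for_biased += log2((1 - p_heads_biased) / (1 - p_heads_fair))  # ~ +0.58 bits

print(evidence_for_biased)   # fluctuates around the prediction below
print(0.19 * n_flips)        # law-of-large-numbers estimate: ~0.19 bits per flip
```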
KL divergence
KL divergence is just the general formula for expected evidence accumulation. Say you've got two distributions $p$ and $q$, where $p$ is what's really happening, but you only know it's either $p$ or $q$.
You can keep sampling from the unknown distribution and play the Bayesian game: Whenever you sample an outcome $i$, compare the likelihoods $p_i$ and $q_i$ and update your beliefs. In classical Bayes' rule, you compare the so-called likelihood ratio $p_i / q_i$, but if there are more updates, it's easier to work with the bits of evidence, or more technically the log-likelihood ratio $\log_2 \frac{p_i}{q_i}$.
On average, each sample from the true distribution $p$ gives you:5

$$D(p, q) = \sum_{i} p_i \cdot \log_2 \frac{p_i}{q_i}$$

bits of evidence toward the truth.
When $D(p, q)$ is small, the distribution $q$ is a good probabilistic model of $p$.
That is, it takes many samples before our detective figures out that she's indeed sampling from $p$, and not from the imposter $q$.
If $D(p, q)$ is small (less than one), you can also think of $1 / D(p, q)$ as how many samples it takes until I get one bit of evidence. One bit of evidence means that the odds for the true hypothesis doubled. 6
Notice that KL divergence is about the evidence, not your starting beliefs. It tells you how fast beliefs change, no matter what prior odds the detective started with.
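To tie this back to the coins, here's a minimal Python sketch of the formula (the function name `kl_divergence` is just for illustration):

```python
from math import log2

def kl_divergence(p, q):
    """D(p, q) = sum_i p_i * log2(p_i / q_i), in bits.

    p is the true distribution, q the model; both are lists of probabilities.
    Terms with p_i = 0 are skipped, following the convention 0 * log(0) = 0.
    """
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# The detective's coins: the biased coin is the truth, the fair coin is the model.
biased = [0.25, 0.75]   # heads, tails
fair   = [0.5, 0.5]

d = kl_divergence(biased, fair)
print(d)        # ~0.19 bits of evidence per flip
print(1 / d)    # ~5.3 flips per bit of evidence, matching the "about five flips" above
```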
What's next?
In the next section, we'll dig deeper into the KL formula and see how it connects to entropy and cross-entropy.