Fisher information & statistics

In this chapter, we will explain a bit more about how KL divergence connects to statistics. In particular, one of the key statistical concepts is the so-called Fisher information, and we will see how to derive it using KL divergence.

💡If there is one thing you remember from this chapter...

Fisher information is the limit of KL divergence between two close distributions.

The polling problem

In one of our riddles, we discussed the problem of polling: if you want to learn the percentage of the vote for a party up to $\eps$, you need to sample about $1/\eps^2$ voters.

[Interactive widget: Polling Error Calculator -- shows the required sample size, estimated cost, and a few quick scenarios for a given polling error.]

The law of large numbers

One way to understand why we need $\Omega(1/\eps^2)$ samples is via the law of large numbers. This law tells you that if you flip a fair coin $n$ times, you typically get around $n/2 \pm O(\sqrt{n})$ heads.1 If your coin is biased and heads come up with probability $1/2 + \eps$, then the law says you will typically see around $n/2 + \eps n \pm O(\sqrt{n})$ heads. If you want to be able to separate the two options, you need the two intervals $n/2 \pm O(\sqrt{n})$ and $n/2 + \eps n \pm O(\sqrt{n})$ to be disjoint, which means that $\eps \cdot n$ has to be at least around $\sqrt{n}$, i.e., $n$ should be at least around $1/\eps^2$.
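Here is a quick simulation sketch of this argument (with a made-up bias $\eps = 0.05$ and a simple midpoint threshold as the decision rule): at $n \approx 1/\eps^2$ flips, the head counts of a fair coin and of a coin with bias $\eps$ become distinguishable, though not yet with overwhelming confidence.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.05
n = int(1 / eps**2)            # ~400 flips: the scale at which separation kicks in

trials = 100_000
fair   = rng.binomial(n, 0.5,       size=trials)   # head counts of the fair coin
biased = rng.binomial(n, 0.5 + eps, size=trials)   # head counts of the biased coin

# Decide "biased" iff the head count exceeds the midpoint n/2 + eps*n/2.
threshold = n / 2 + eps * n / 2
error_rate = ((fair > threshold).mean() + (biased <= threshold).mean()) / 2
print(f"n = {n}, average error rate of the midpoint test: {error_rate:.1%}")
```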

KL intuition

But I want to offer a complementary intuition for the polling problem. Let's go all the way back to the first chapter where we introduced KL. Our running example was a Bayesian experimenter who wanted to distinguish whether his coin is fair or biased. If the two distributions to distinguish are $(50\%, 50\%)$ and $(50\% + \eps, 50\% - \eps)$, then the problem of our experimenter becomes extremely similar to the problem of estimating the mean up to $\eps$!

Remember, if we have two very similar distributions, KL between them is going to be positive, but very small. It turns out that

$$D((0.5 + \eps, 0.5 - \eps), (0.5, 0.5)) \approx 2\eps^2$$

The intuition we should take from this is that each time we flip the coin, we get around "$2\eps^2$ bits of information" about whether the coin is fair or biased by $\eps$. Intuitively, once we have accumulated around 1 bit of information, we begin to be able to tell which of the two coins we are looking at. We conclude that the number of flips we need to estimate the bias of a coin should scale like $n = 1/\eps^2$.
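A quick numeric sanity check of this approximation (computing KL with the natural logarithm, the convention in which the $2\eps^2$ formula holds to leading order):

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p, q) between two discrete distributions (natural log, i.e. nats)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

for eps in [0.1, 0.01, 0.001]:
    exact = kl([0.5 + eps, 0.5 - eps], [0.5, 0.5])
    print(f"eps = {eps}: exact KL = {exact:.2e},  2*eps^2 = {2 * eps**2:.2e}")
```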

Fisher information

The trick with looking at KL divergence between two very similar distributions can be generalized further. Think of a setup where you have a family of distributions characterized by some number. For example:

  • biased coins with heads-probability $p$, for all $p \in [0,1]$.
  • Gaussians $N(\mu, 1)$, for all $\mu \in \mathbb{R}$.
  • Gaussians $N(0, \sigma^2)$, for all $\sigma^2 \ge 0$.

More generally, think of a set of distributions $p_\theta$ parameterized by a number $\theta$.

One of the most basic tasks in statistics is that we have an empirical distribution $p$ and we want to find the $\theta$ so that $p_\theta$ fits the data the best, like below.

[Figure: fit]

We have already seen this task a few times, and in the chapter about minimizing KL we saw why maximum likelihood estimators are a reasonable approach to it.
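As a small illustration of this task, here is a sketch of fitting the heads-probability of a coin by maximum likelihood on made-up data (the closed-form answer is just the sample mean, but the numerical optimization shows the general recipe):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
data = rng.binomial(1, 0.37, size=1000)   # 0/1 coin flips, true heads-probability 0.37

def neg_log_likelihood(theta):
    # Negative log-likelihood of the flips under a coin with heads-probability theta.
    return -np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

fit = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(f"MLE estimate: {fit.x:.3f}  (closed form: sample mean = {data.mean():.3f})")
```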

But we can go a bit further -- if we assume that $p$ has literally been generated by one of the distributions $p_\theta$ for some $\theta$, we may ask how many samples we need to estimate $\theta$ up to additive $\eps$. Our polling question is just a special case of this more general question.

Our intuition about KL tells us that if the data were generated by $p_\theta$, then to estimate $\theta$ it's key to understand the value of $D(p_{\theta + \eps}, p_\theta)$ for small $\eps$. This is the general (and a bit opaque) formula we thus need to understand:

$$D(p_{\theta + \eps}, p_\theta) = \int p_{\theta + \eps}(x) \log \frac{p_{\theta + \eps}(x)}{p_\theta(x)} \, dx$$

We will estimate the value of the right-hand side by computing the derivatives $\frac{\partial}{\partial \theta} \log p_\theta(x)$ and $\frac{\partial^2}{\partial \theta^2} \log p_\theta(x)$. In particular, we will estimate $\log p_{\theta + \eps}(x)$ with the following Taylor-polynomial approximation:

$$\log p_{\theta + \eps}(x) \approx \log p_\theta(x) + \eps \cdot \frac{\partial}{\partial \theta} \log p_\theta(x) + \frac{\eps^2}{2} \cdot \frac{\partial^2}{\partial \theta^2} \log p_\theta(x)$$

After plugging it in and some calculations2, we get

$$D(p_{\theta + \eps}, p_\theta) \approx \frac{1}{2} I(\theta) \cdot \eps^2$$

for

$$I(\theta) = \int p_\theta(x) \left( \frac{\partial}{\partial \theta} \log p_\theta(x) \right)^2 dx.$$

The term $I(\theta)$ is called Fisher information. Let's not try to understand what this integral means. The short story is that you can approximate the KL divergence of two similar distributions as $\frac{1}{2} I(\theta) \cdot \eps^2$, and $I(\theta)$ can often be computed by integration.
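Here is a small numeric sketch of this approximation for the biased-coin family, comparing the exact KL divergence of two close coins with $\frac{1}{2} I(\theta) \cdot \eps^2$:

```python
import numpy as np

def fisher_information_coin(p):
    # I(p) = sum over outcomes of p(x) * (d/dp log p(x))^2:
    # the score is 1/p for heads and -1/(1-p) for tails.
    return p * (1 / p) ** 2 + (1 - p) * (1 / (1 - p)) ** 2   # simplifies to 1/(p*(1-p))

def kl_coin(p, q):
    # exact KL divergence between coins with heads-probabilities p and q (in nats)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta, eps = 0.3, 1e-3
print(f"exact KL:             {kl_coin(theta + eps, theta):.3e}")
print(f"(1/2) I(theta) eps^2: {0.5 * fisher_information_coin(theta) * eps**2:.3e}")
```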

Our KL intuition is telling us that estimating a parameter $\theta$ up to $\eps$ should require about $n \approx 1/(I(\theta) \cdot \eps^2)$ samples. Let's see what this intuition is saying in some cases:

  1. If you want to estimate the bias $p$ of a coin, $I(p) = \frac{1}{p(1-p)}$. Thus, our KL intuition suggests that estimating the value of $p$ up to $\eps$ (with constant probability of success) should take about $p(1-p)/\eps^2$ samples. This is indeed the right formula (see the widget above).

  2. If you want to find the mean of a normal random variable with known variance $\sigma^2$, we have $I(\mu) = 1/\sigma^2$. Hence, our KL intuition suggests that estimating $\mu$ up to $\eps$ requires about $\sigma^2 / \eps^2$ samples.

  3. If you want to find the variance of a normal random variable with known mean $\mu$, we have $I(\sigma^2) = 1/(2\sigma^4)$. Hence, our KL intuition suggests that estimating $\sigma^2$ up to $\eps$ requires about $\sigma^4 / \eps^2$ samples. (A quick symbolic check of these two Gaussian formulas follows below.)
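If you don't feel like doing these integrals by hand, here is a sketch that checks the two Gaussian formulas symbolically (using sympy, with $v$ standing for $\sigma^2$):

```python
import sympy as sp

x, mu = sp.symbols("x mu", real=True)
v = sp.symbols("v", positive=True)           # v stands for the variance sigma^2
gaussian = sp.exp(-(x - mu) ** 2 / (2 * v)) / sp.sqrt(2 * sp.pi * v)

def fisher_information(pdf, param):
    # I(param) = integral of pdf(x) * (d/dparam log pdf(x))^2 over x
    log_pdf = sp.expand_log(sp.log(pdf), force=True)
    score = sp.diff(log_pdf, param)
    return sp.simplify(sp.integrate(pdf * score**2, (x, -sp.oo, sp.oo)))

print(fisher_information(gaussian, mu))   # 1/v,        i.e. I(mu)      = 1/sigma^2
print(fisher_information(gaussian, v))    # 1/(2*v**2), i.e. I(sigma^2) = 1/(2*sigma^4)
```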

In all of the above cases, it turns out that our intuition is correct and really describes how many samples you need to solve the given problem. In fact, there are two important results in statistics: One of them is the Cramér-Rao lower bound that basically says that our formula of $n = 1/(I(\theta) \cdot \eps^2)$ is indeed a lower bound. On the other hand, the so-called asymptotic normality theorem for the MLE says that the maximum likelihood estimator is as good as our formula.3
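As a quick simulation sketch of this for the coin: the MLE of the bias is simply the fraction of heads, and its typical error indeed matches $\sqrt{1/(n \cdot I(p))} = \sqrt{p(1-p)/n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.3, 10_000            # true bias and number of flips per experiment

# Run many experiments; in each one, the MLE of p is the fraction of heads.
estimates = rng.binomial(n, p, size=100_000) / n

print(f"empirical std of the MLE: {estimates.std():.5f}")
print(f"sqrt(p*(1-p)/n):          {np.sqrt(p * (1 - p) / n):.5f}")
```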

The upshot is that computing KL between two close distributions basically answers the most fundamental question of (frequentist) statistics -- how many samples do I need to get a good estimate of a parameter.

Fisher metric

Although KL divergence is asymmetric, it becomes symmetric in the limit. That is, if we did the computation for $D(p_\theta, p_{\theta + \eps})$ instead of $D(p_{\theta + \eps}, p_\theta)$, we would get the same expression up to some $O(\eps^3)$ terms. So, in this setup where we try to understand two distributions that are very close together, we can really think of KL as being symmetric. This way, we can define a proper distance function on the space of distributions -- it's called the Fisher information metric and it can give useful intuitions. For example, for the case of biased-coin-flipping distributions where coins have bias $p \in [0,1]$, we can represent all of them using a segment like this:

[Figure: Fisher metric]

The Fisher information metric says that locally, the KL distance around a point $p$ is proportional to $I(p) = 1/(p(1-p))$. That is, the probabilities around $1/2$ are "closer together" than the probabilities at the edges. This is a supercharged-abstract-fancy way of saying what we have already seen many times -- probabilities should often be handled multiplicatively, not additively.
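Here is a small numeric sketch of both points for the coin family: KL is nearly symmetric for close coins, and the same step $\eps$ corresponds to a much larger KL near the edge of the segment than near $1/2$.

```python
import numpy as np

def kl_coin(p, q):
    # KL divergence between coins with heads-probabilities p and q (in nats)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

eps = 1e-3

# Near-symmetry: the two directions agree up to O(eps^3) terms.
print(kl_coin(0.5 + eps, 0.5), kl_coin(0.5, 0.5 + eps))

# The same step eps is "longer" near the edge than near 1/2,
# in line with I(p) = 1/(p*(1-p)) blowing up at the edges.
print(kl_coin(0.5 + eps, 0.5), kl_coin(0.99 + eps, 0.99))
```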