Fisher information
Fisher information tells you how much information a potential observation of a random variable would provide about a parameter of that variable's probability distribution. It can be considered "information" in the sense that it is additive across independent random samples.
Setup
Let X be a random variable, \( X : \Omega \to Z \), where \( Z \subseteq \mathbb{R}^n \). Let \( X \) have a probability distribution \( X \sim p_{\theta}(x) \) taken from a family of such distributions, parameterized by \( \theta \in \Theta \). We will assume \( \theta \) to be a real, but the ideas apply even if \( \theta \) is a vector of reals.
I write \( p(x; \theta) \) interchangeably as \( p_{\theta}(x) \). The latter is useful when emphasizing a distribution for a fixed \( \theta \). For Fisher information, though, we will be interested in viewing \( p \) as a function of \( \theta \) as well.
See the reverse for a more precise setup.
There are three commonly used and equivalent ways to define Fisher information. We start with the relative entropy perspective.
The entropy of a single distribution, \( p_{\theta} \) is:
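\[ H(p_{\theta}) = -\int_{Z} \dd{z} \; p(z; \theta) \log p(z; \theta) \]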
The relative entropy of another member distribution \( p_{\phi} \) with respect to \( p_{\theta} \) is:
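\[ H(p_{\phi} \| p_{\theta}) = \int_{Z} \dd{z} \; p(z; \phi) \log \frac{p(z; \phi)}{p(z; \theta)} \]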
Then we can define Fisher Information:
Fisher Information
Fisher Information \( I(\theta) \) is the second derivative of the relative entropy \( H(p_{\phi} || p_{\theta}) \) with respect to \( \phi \) evaluated at \( \theta \):
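\[ I(\theta) = \left. \pdv[2]{\phi} H(p_{\phi} \| p_{\theta}) \right|_{\phi = \theta} \]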
From the perspective of how quickly \( p_{\phi} \) deviates from \( p_{\theta} \):
Consider the map \( \phi \mapsto H(p_{\phi} \| p_{\theta}) \) for fixed \( \theta \). This function is minimized at \( \phi = \theta \), with a value of zero, and its first derivative is zero at this point also. The second derivative gives a measure of how quickly the distribution \( p_{\phi} \) deviates from \( p_{\theta} \) as \( \phi \) moves away from \( \theta \). This is the Fisher information \( I(\theta) \) of the family of distributions, evaluated at \( \theta \). All of this without ever observing data.
Can you remember two other equivalent forms of the definition?
Equivalence statements
Equivalent to the above definition, Fisher information \( I(\theta) \) is also:
- The expected value of the squared score function. This is also the variance of the score function, as the expected value of the score function is zero.
\[ \mathbb{E}_{z \sim p_{\theta}(z)}\left[\left( \pdv{\phi} \log p(z; \phi) \right)^2\right] \Bigg\rvert_{\phi = \theta} \]
Or written as:
\[ \int_{Z} \dd{z} \; p(z; \theta) \left( \pdv{\phi} \log p(z; \phi) \right)^2 \Bigg\rvert_{\phi = \theta} \]
- The curvature of the log-likelihood function at \( \theta \). This is the expected value of the log-likelihood's second derivative, negated so that it is a positive value.
\[ -\mathbb{E}_{z \sim p_{\theta}(z)}\left[ \pdv[2]{\phi} \log p(z; \phi) \right] \Bigg\rvert_{\phi = \theta} \]
Or written as:
\[ -\int_{Z} \dd{z} \; p(z; \theta) \pdv[2]{\phi} \log p(z; \phi) \Bigg\rvert_{\phi = \theta} \]
When the likelihood at \( \theta \) is a maximum, which is the case when \( \theta \) is arrived at by maximum likelihood estimation, the second derivative is negative, and its negative is a positive quantity that represents the curvature around \( \theta \).
Equivalences. Proof that definition ⇔ (1)
We will first build up some lemmas.
Lemma 1. Weighted score is equal to the derivative of probability
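\[ p(z; \phi) \pdv{\phi} \log p(z; \phi) = \pdv{\phi} p(z; \phi) \]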
Proof. By the chain rule, we have:
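\[ \pdv{\phi} \log p(z; \phi) = \frac{1}{p(z; \phi)} \pdv{\phi} p(z; \phi) \]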
Rewriting the RHS of the target equation, we have:
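\[ \pdv{\phi} p(z; \phi) = p(z; \phi) \, \frac{1}{p(z; \phi)} \pdv{\phi} p(z; \phi) = p(z; \phi) \pdv{\phi} \log p(z; \phi) \]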
Lemma 2. Integral of weighted score is zero
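\[ \int_{Z} \dd{z} \; p(z; \phi) \pdv{\phi} \log p(z; \phi) = 0 \]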
Proof. As we have:
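\[ \int_{Z} \dd{z} \; p(z; \phi) = 1 \quad \text{for all } \phi, \qquad \text{so} \qquad \pdv{\phi} \int_{Z} \dd{z} \; p(z; \phi) = 0 \]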
This implies that integrating the RHS of Lemma 1 also gives zero (moving the derivative inside the integral), and hence, by Lemma 1, so does integrating the LHS, the weighted score.
Lemma 3. Derivative of prob times derivative of log-prob is weighted score-squared
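\[ \pdv{\phi} p(z; \phi) \; \pdv{\phi} \log p(z; \phi) = p(z; \phi) \left( \pdv{\phi} \log p(z; \phi) \right)^2 \]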
This is such a short jump from Lemma 1 that we will include the single intermediate step:
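\[ \pdv{\phi} p(z; \phi) \; \pdv{\phi} \log p(z; \phi) = \left( p(z; \phi) \pdv{\phi} \log p(z; \phi) \right) \pdv{\phi} \log p(z; \phi) = p(z; \phi) \left( \pdv{\phi} \log p(z; \phi) \right)^2 \]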
Integrating that term gives the expected value of the squared score, which is our target. Next we will go halfway, calculating the first derivative of the relative entropy:
Lemma 4. First derivative of relative entropy
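\[ \pdv{\phi} H(p_{\phi} \| p_{\theta}) = \int_{Z} \dd{z} \; \left( \pdv{\phi} p(z; \phi) \right) \log \frac{p(z; \phi)}{p(z; \theta)} \]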
It is perhaps surprising that, as the derivative passes through the integral, it ends up applying only to the \( p(z; \phi) \) weighting term.
Proof. We will apply the product rule, and see that the second term is zero. It is zero both because \( \pdv{\phi} \log p(z; \theta) = 0 \) and because of Lemma 2:
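\[ \pdv{\phi} H(p_{\phi} \| p_{\theta}) = \int_{Z} \dd{z} \; \left( \pdv{\phi} p(z; \phi) \right) \log \frac{p(z; \phi)}{p(z; \theta)} + \int_{Z} \dd{z} \; p(z; \phi) \left( \pdv{\phi} \log p(z; \phi) - \pdv{\phi} \log p(z; \theta) \right) \]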
The second derivative of the relative entropy is the variance of the score
Proof.
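\[ \pdv[2]{\phi} H(p_{\phi} \| p_{\theta}) = \int_{Z} \dd{z} \; \left( \pdv[2]{\phi} p(z; \phi) \right) \log \frac{p(z; \phi)}{p(z; \theta)} + \int_{Z} \dd{z} \; \pdv{\phi} p(z; \phi) \left( \pdv{\phi} \log p(z; \phi) - \pdv{\phi} \log p(z; \theta) \right) \]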
The first term is zero because \( \log \frac{p(z; \phi)}{p(z; \theta)} = 0 \) at \( \phi = \theta \). The second term loses \( \pdv{\phi} \log p(z; \theta) \), as it is zero, leaving the integral of the LHS of Lemma 3:
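\[ \left. \pdv[2]{\phi} H(p_{\phi} \| p_{\theta}) \right|_{\phi = \theta} = \int_{Z} \dd{z} \; \pdv{\phi} p(z; \phi) \, \pdv{\phi} \log p(z; \phi) \Bigg\rvert_{\phi = \theta} = \int_{Z} \dd{z} \; p(z; \theta) \left( \pdv{\phi} \log p(z; \phi) \right)^2 \Bigg\rvert_{\phi = \theta} \]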
Equivalences. Proof that (1) ⇔ (2)
We know that the expected score is zero for all distributions \( p_{\theta} \), and so the derivative of the expected score wrt \( \theta \) is also zero. Applying the derivative produces two terms (through the product rule); one of the terms is the variance of the score, and the other is the expected value of the second derivative of the log-likelihood. They must sum to zero, so are the negative of each other.
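\[ 0 = \pdv{\theta} \int_{Z} \dd{z} \; p(z; \theta) \pdv{\theta} \log p(z; \theta) = \int_{Z} \dd{z} \; p(z; \theta) \pdv[2]{\theta} \log p(z; \theta) + \int_{Z} \dd{z} \; \pdv{\theta} p(z; \theta) \, \pdv{\theta} \log p(z; \theta) \]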
The first term has the form of an expectation, and the second term is the variance of the score function, following Lemma 3.
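That is,
\[ \mathbb{E}_{z \sim p_{\theta}(z)}\left[ \left( \pdv{\theta} \log p(z; \theta) \right)^2 \right] = -\mathbb{E}_{z \sim p_{\theta}(z)}\left[ \pdv[2]{\theta} \log p(z; \theta) \right] \]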
A measure of sensitivity of the likelihood
Fisher information tells you how much information a potential observation of a random variable would provide about a parameter of its own probability distribution. If changing \( \theta \) causes a big change in the likelihood \( p (x; \theta ) \) of an observation, then the observation provides a lot of information about \( \theta \). If, on the other hand, changing \( \theta \) barely affects the likelihood, then an observation of \( X \) does not provide much information about \( \theta \). This "sensitivity" is captured by the Fisher information.
Drawbacks and comparison to the 2nd derivative of the data log-likelihood
After estimating a parameter \( \theta \) via MLE, one could inspect the Fisher information, \( I(\theta) \); however, this \( \theta \) is only an estimate, not the true parameter value, and so our estimate of the curvature won't be around the true parameter either. Given that you have carried out MLE, you must have data available, so why not just calculate the realized curvature of the log-likelihood rather than the expected curvature?
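Writing \( x_1, \dots, x_N \) for the observed data and \( \hat{\theta} \) for the MLE, this realized curvature is the observed information:
\[ -\pdv[2]{\theta} \sum_{i=1}^{N} \log p(x_i; \theta) \Bigg\rvert_{\theta = \hat{\theta}} \]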
From this perspective, Fisher information can be thought of as asking: assuming \( \theta \) is the true parameter, what curvature would we see from hypothetical data, computed in expectation?
2nd Derivative of relative entropy
Let \( \Omega \) be a measure space, and let \( (f_{\theta})_{\theta \in \Theta} \) be a smooth family of probability density functions on \( \Omega \), indexed by the real \( \theta \) over an interval \( \Theta \subset \mathbb{R} \). We treat only the scalar case, but the idea extends to vector parameters.
With both \( \phi \in \Theta \) and \( \theta \in \Theta \), the relative entropy of one distribution, \( f_{\phi} \), with respect to another, \( f_{\theta} \), is:
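\[ H(f_{\phi} \| f_{\theta}) = \int_{\Omega} \dd{\omega} \; f_{\phi}(\omega) \log \frac{f_{\phi}(\omega)}{f_{\theta}(\omega)} \]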
Setup (more thorough)
Let \( (\Omega, \mathcal{F}, \mathbb{P}_\theta) \) be a family of probability spaces indexed by \( \theta \in \Theta \subseteq \mathbb{R}^d \). Let \( X : \Omega \to \mathbb{R}^n \) be a measurable random variable, where \( \mathbb{R}^n \) is equipped with the Borel σ-algebra \( \mathcal{B}(\mathbb{R}^n) \). The distribution of \( X \) under \( \mathbb{P}_\theta \), denoted \( \mathbb{P}_\theta^X := \mathbb{P}_\theta \circ X^{-1} \), is a probability measure on \( (\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n)) \). We assume that \( \mathbb{P}_\theta^X \) admits a density \( p(x; \theta) \), so that for any \( A \in \mathcal{B}(\mathbb{R}^n) \),
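\[ \mathbb{P}_\theta^X(A) = \int_A \dd{x} \; p(x; \theta) \]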
Example
Let's consider the family of probability distributions parameterized by \( \phi \), where the family is the set of Gaussians fixed at zero mean, with a variance of \( \phi \):
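\[ p(z; \phi) = \frac{1}{\sqrt{2\pi\phi}} \exp\left( -\frac{z^2}{2\phi} \right) \]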
The derivative of the probability with respect to \( \phi \) is given by:
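\[ \pdv{\phi} p(z; \phi) = p(z; \phi) \left( \frac{z^2}{2\phi^2} - \frac{1}{2\phi} \right) = \frac{1}{\sqrt{2\pi\phi}} \exp\left( -\frac{z^2}{2\phi} \right) \frac{z^2 - \phi}{2\phi^2} \]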
This is a result of the product rule.
From Lemma 1, this is also the weighted score function: \( p(z; \phi) \pdv{\phi} \log p(z; \phi) \).
The score function is similar, just drop the weighting:
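\[ \pdv{\phi} \log p(z; \phi) = \frac{z^2}{2\phi^2} - \frac{1}{2\phi} = \frac{z^2 - \phi}{2\phi^2} \]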
The score-squared is then:
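\[ \left( \pdv{\phi} \log p(z; \phi) \right)^2 = \left( \frac{z^2 - \phi}{2\phi^2} \right)^2 \]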
The plots of these are shown below:
Derivative of probability (i.e. weighted score):

Score:

Squared score:

Weighted squared score:

On interpreting the figures
I think it's best to read the weighted squared score figure as assigning an expected contribution to each \( z \) value. Once an actual \( z \) is observed, the score function (or squared score function) should be referenced instead.
For example, given an observation \( z = -4 \), a \( \mathcal{N}(0, \phi = 1) \) hypothesis would assign the observation a dramatically higher likelihood if \( \phi \) were increased. This can be seen by looking at the score function. The squared score function is only really interesting here because, once weighted, it integrates to something non-zero; it's interesting in aggregate, but it doesn't say anything more about a particular \( z \) than the score function does. Although \( z = -4 \) would have a dramatic effect, before observing \( z = -4 \) we don't actually think such an observation is very likely, so its contribution to the weighted score (or weighted squared score) is still small.
You could view the weighted score-squared as representing how much a frequentist stays awake worrying about the prospect of observing a specific \( z \) value and how it will screw up their hypothesis of \( \phi = 1 \). The Fisher information is then the net worry about all possible outcomes.
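To make the "net worry" concrete, here is a minimal numerical sketch (assuming NumPy is available) that integrates the weighted score and the weighted squared score on a grid for \( \phi = 1 \), and compares the latter against \( 1/(2\phi^2) \), the closed-form Fisher information for this zero-mean Gaussian family:

```python
import numpy as np

# Fisher information of the N(0, phi) family (phi = variance), at phi = 1,
# estimated by numerically integrating over z on a wide grid.
phi = 1.0
z = np.linspace(-10.0, 10.0, 20001)
dz = z[1] - z[0]

p = np.exp(-z**2 / (2 * phi)) / np.sqrt(2 * np.pi * phi)  # density p(z; phi)
score = z**2 / (2 * phi**2) - 1 / (2 * phi)               # d/dphi log p(z; phi)

print(np.sum(p * score) * dz)     # weighted score integrates to ~0 (Lemma 2)
print(np.sum(p * score**2) * dz)  # weighted squared score integrates to ~0.5
print(1 / (2 * phi**2))           # closed form for this family: 0.5
```

The second printed value is the aggregate of the per-\( z \) contributions shown in the weighted squared score figure, i.e. the Fisher information at \( \phi = 1 \).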
\( \phi =1 \) vs \( \phi = 2\)
Here is a comparison of the functions for two different values of \( \phi \).

Source
Statlect has a nice video with some visualizations: https://www.statlect.com/glossary/information-matrix
See also: https://math.stackexchange.com/a/5080215/52454