4  Entropies

Published: September 24, 2025

4.1 Surprisal or self-information

See Chris Olah’s brilliant post Visual Information Theory for a visual introduction to the subject.

Say we have a discrete random variable \(X\) taking values in \(\mathcal X\). We want to define a surprise function \(S(A)\) for every event \(A \subseteq \mathcal X\), using the following axioms:

- \(S(A)\) depends only on the probability \(\mathbb P(X \in A)\), and is continuous in it.
- It is decreasing in that probability: the rarer the event, the greater the surprise.
- A certain event carries no surprise: \(S(A) = 0\) whenever \(\mathbb P(X \in A) = 1\).
- Surprise adds over independent events: \(S(A \cap B) = S(A) + S(B)\) whenever \(A\) and \(B\) are independent.

This set of conditions yields (up to the choice of base of the logarithm) the function: \[S(A) = -\log \mathbb P(X \in A)\]
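As a quick sanity check, here is a minimal Python sketch (the function name `surprisal` is mine, not from the post) that computes the self-information of an event, in bits, from its probability:

```python
import math

def surprisal(p: float) -> float:
    """Self-information, in bits, of an event with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    return -math.log2(p)

# A fair coin landing heads: exactly 1 bit of surprise.
print(surprisal(0.5))        # 1.0
# A near-certain event carries almost no information.
print(surprisal(0.999))      # ~0.0014
# Independent events: surprisals add, because probabilities multiply.
print(surprisal(0.5 * 0.5))  # 2.0 == surprisal(0.5) + surprisal(0.5)
```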

This is also known as the information content, self-information or Shannon information, since in a sense it quantifies how much information we gain from the actual occurrence of that event.

4.2 Shannon entropy

Now that we have a way of measuring the surprise of a single outcome, we can quantify the randomness of the random variable itself as the expected surprise. Let \(p(\cdot)\) denote the probability mass function of \(X\): \[ \mathbb H[X] = \mathbb E [S(X)] = - \sum_{x \in \mathcal X} p(x) \log p(x)\] By convention the base of the \(\log\) is 2, in which case the unit of entropy is the bit (with the natural logarithm it is the nat). Note that \(\mathbb H\) is really a functional: it depends not on the realised value of \(X\) but on its distribution, or “law”.
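A minimal sketch of this expectation in Python (the helper `entropy` and the example pmfs are illustrative, not from the post):

```python
import math

def entropy(pmf: list[float]) -> float:
    """Shannon entropy, in bits, of a probability mass function."""
    assert abs(sum(pmf) - 1.0) < 1e-9, "probabilities must sum to 1"
    # Terms with p(x) = 0 contribute nothing, by the convention 0 log 0 = 0.
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: a fair 4-sided die
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0 bits: no randomness at all
```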

For a Bernoulli random variable with probability of success \(p\), the entropy is the binary entropy function \(\mathbb H[X] = -p \log p - (1-p)\log(1-p)\). The randomness peaks at \(p = 1/2\), when both outcomes are equally likely, and vanishes when one of the outcomes is certain.
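A small sketch tracing that curve (the function name `binary_entropy` is mine):

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy, in bits, of a Bernoulli(p) random variable."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no surprise
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    print(f"p = {p:.1f}  H = {binary_entropy(p):.3f}")
# The maximum, 1 bit, occurs at p = 0.5; the curve is symmetric about it.
```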

4.3 Relation with Gibbs entropy

4.4 Cross entropy

\[H(p,q) = - \mathbb E_{X\sim p}[\log q(X)]\]
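A hedged sketch of this expectation for two pmfs over the same finite alphabet (the names `cross_entropy`, `p`, and `q` are mine). Cross entropy satisfies \(H(p,q) \ge \mathbb H[p]\), with equality when \(q = p\):

```python
import math

def cross_entropy(p: list[float], q: list[float]) -> float:
    """Cross entropy H(p, q) = -E_{X~p}[log2 q(X)], in bits.

    Assumes q(x) > 0 wherever p(x) > 0; otherwise the value is infinite.
    """
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
print(cross_entropy(p, p))                # equals the entropy of p: 1.5 bits
print(cross_entropy(p, [0.6, 0.2, 0.2]))  # larger: ~1.53 bits
```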

4.5 Differential entropy