Information Geometry
Probability distributions are not just functions. They are points on a curved manifold with a natural geometry that determines how beliefs should change. This is an interactive exploration of that geometry.
Scroll down, or click the dots on the right
Consider the family of all Gaussian distributions N(μ, σ). Each is determined by two parameters: the mean μ and the standard deviation σ > 0.
We can represent each Gaussian as a point in the upper half-plane — μ on the horizontal axis, σ on the vertical. This is the statistical manifold of Gaussians.
Every point in this plane is a probability distribution. But what is the “distance” between two distributions? Euclidean distance gets this profoundly wrong.
Click anywhere in the plane to place a Gaussian. Its PDF appears below.
The Gaussian manifold is two-dimensional. More complex families — mixtures, exponential families — give higher-dimensional manifolds. The ideas generalize.
Consider two pairs of Gaussians, each separated by the same Euclidean distance in (μ, σ) space:
N(0, 0.1) and N(0.5, 0.1) — means differ by 0.5, both very precise. Their PDFs barely overlap. Statistically, these are completely different distributions.
N(0, 10) and N(0.5, 10) — same shift in mean, both very spread. Their PDFs are nearly identical. You could not distinguish them from finite samples.
Same Euclidean distance, vastly different statistical distance. The geometry of uncertainty is not flat.
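The mismatch can be quantified with the closed-form KL divergence between Gaussians, KL = ln(σ₂/σ₁) + (σ₁² + (μ₁ − μ₂)²)/(2σ₂²) − ½. A short Python sketch (the function name is mine) applied to the two pairs above:

```python
import math

def kl_gauss(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, sigma1) || N(mu2, sigma2)) in nats, closed form."""
    return (math.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

# Two pairs with identical Euclidean separation 0.5 in (mu, sigma) space:
precise = kl_gauss(0.0, 0.1, 0.5, 0.1)    # 12.5 nats: easily told apart
vague   = kl_gauss(0.0, 10.0, 0.5, 10.0)  # 0.00125 nats: nearly identical
```

The precise pair is 12.5 nats apart; the vague pair only 0.00125 nats. A factor of 10,000 in statistical distinguishability at identical Euclidean distance.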
The overlap area tells the truth. Any honest metric must weight shifts by the precision of the distributions involved.
The Fisher information metric for Gaussians is:

dℓ² = (dμ² + 2 dσ²) / σ²

or equivalently, as a matrix, G(μ, σ) = diag(1/σ², 2/σ²).
Substituting s = σ√2 gives:

dℓ² = 2 (dμ² + ds²) / s²

Up to the constant factor 2, which only rescales distances, this is the Poincaré half-plane metric: the standard model of hyperbolic geometry, with constant negative curvature.
Geodesics (shortest paths) are vertical lines of constant μ, and semicircular arcs that meet the σ = 0 axis at right angles.
The Fisher metric turns probability space into a hyperbolic surface. Geodesics curve toward regions of high uncertainty — it is “cheaper” to travel through vague distributions than to cross between precise ones.
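A geodesic between two Gaussians can be traced numerically: map each to half-plane coordinates (μ, σ√2), find the semicircle centered on the horizontal axis that passes through both points, and sample along the arc. A sketch (the helper name `geodesic_points` is mine):

```python
import math

def geodesic_points(mu1, sigma1, mu2, sigma2, n=101):
    """Sample the Fisher geodesic between two Gaussians.

    Works in half-plane coordinates (x, y) = (mu, sigma * sqrt(2)),
    where geodesics are vertical lines or semicircles centered on y = 0.
    Returns a list of (mu, sigma) points along the path.
    """
    x1, y1 = mu1, sigma1 * math.sqrt(2)
    x2, y2 = mu2, sigma2 * math.sqrt(2)
    if abs(x1 - x2) < 1e-12:
        # Vertical-line geodesic: pure change of scale
        ys = [y1 + (y2 - y1) * i / (n - 1) for i in range(n)]
        return [(x1, y / math.sqrt(2)) for y in ys]
    # Semicircle centered at (c, 0) passing through both points
    c = (x2**2 + y2**2 - x1**2 - y1**2) / (2 * (x2 - x1))
    r = math.hypot(x1 - c, y1)
    t1 = math.atan2(y1, x1 - c)
    t2 = math.atan2(y2, x2 - c)
    pts = []
    for i in range(n):
        t = t1 + (t2 - t1) * i / (n - 1)
        pts.append((c + r * math.cos(t), r * math.sin(t) / math.sqrt(2)))
    return pts
```

For two equally precise Gaussians such as N(0, 1) and N(2, 1), the sampled path rises through σ ≈ 1.22 at its midpoint: the bow toward high uncertainty visible in the interactive view.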
Click two points to compare the Euclidean line vs. the Fisher geodesic. Notice how the geodesic always bows upward.
The Fisher distance between two Gaussians N(μ₁, σ₁) and N(μ₂, σ₂) is, up to a factor of √2, the hyperbolic distance in the Poincaré half-plane between (μ₁, σ₁√2) and (μ₂, σ₂√2).
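In code, this uses the standard arcosh formula for distance in the upper half-plane; with the Fisher metric normalized as dℓ² = (dμ² + 2dσ²)/σ², the Fisher length is √2 times the half-plane distance. A sketch (the function name is mine):

```python
import math

def fisher_distance(mu1, sigma1, mu2, sigma2):
    """Fisher-Rao distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    # Half-plane coordinates (x, y) = (mu, sigma * sqrt(2))
    x1, y1 = mu1, sigma1 * math.sqrt(2)
    x2, y2 = mu2, sigma2 * math.sqrt(2)
    # Hyperbolic distance: arcosh(1 + (dx^2 + dy^2) / (2 * y1 * y2))
    cosh_d = 1 + ((x2 - x1)**2 + (y2 - y1)**2) / (2 * y1 * y2)
    return math.sqrt(2) * math.acosh(cosh_d)
```

Sanity checks: a pure scale change from σ = 1 to σ = e is distance √2; the precise pair from earlier, N(0, 0.1) vs N(0.5, 0.1), sits about 3.8 units apart, while the vague pair N(0, 10) vs N(0.5, 10) is only about 0.05.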
Gradient descent in parameter space ignores the manifold’s curvature. The natural gradient corrects this by premultiplying with the inverse Fisher matrix:
For Gaussians, the inverse Fisher matrix is:

G⁻¹(μ, σ) = diag(σ², σ²/2)
So the natural gradient steps are:

Δμ = −η σ² ∂L/∂μ,  Δσ = −η (σ²/2) ∂L/∂σ
When σ is small, steps shrink: small parameter changes matter more for precise distributions. When σ is large, steps scale up: the same parameter change barely moves an imprecise distribution.
Loss: KL divergence to target N(3, 2), starting from N(0, 0.5).
The natural gradient respects the geometry. It converges faster because it takes equal-sized steps in distribution space, not parameter space.
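The comparison can be reproduced in a few lines. A sketch (the function names and hyperparameters, η = 0.1 for 200 steps, are my choices, not taken from the demo):

```python
import math

def kl_to_target(mu, sigma, mu_t=3.0, sigma_t=2.0):
    """Loss: KL(N(mu, sigma) || N(mu_t, sigma_t))."""
    return (math.log(sigma_t / sigma)
            + (sigma**2 + (mu - mu_t)**2) / (2 * sigma_t**2) - 0.5)

def kl_grads(mu, sigma, mu_t=3.0, sigma_t=2.0):
    """Analytic gradients of the loss w.r.t. mu and sigma."""
    return (mu - mu_t) / sigma_t**2, sigma / sigma_t**2 - 1.0 / sigma

def optimize(natural, lr=0.1, steps=200):
    mu, sigma = 0.0, 0.5                   # start at N(0, 0.5)
    for _ in range(steps):
        g_mu, g_sig = kl_grads(mu, sigma)
        if natural:
            # Premultiply by the inverse Fisher matrix diag(sigma^2, sigma^2/2)
            g_mu, g_sig = sigma**2 * g_mu, (sigma**2 / 2) * g_sig
        mu, sigma = mu - lr * g_mu, sigma - lr * g_sig
    return kl_to_target(mu, sigma)
```

With these settings, the natural-gradient run drives the KL loss many orders of magnitude below the vanilla run at the same step count and learning rate.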
The Fisher metric is not arbitrary. Chentsov (1972) proved it is the unique Riemannian metric on statistical models (up to a constant factor) that is invariant under sufficient statistics — under any information-preserving transformation of the data.
There is no other geometry of probability. This one is forced on us.
The KL divergence between nearby distributions is, to second order, half the squared Fisher distance. KL is the infinitesimal form of this geometry.
The visualization shows this convergence: as the perturbation d shrinks, KL and the quadratic form ½ dᵀG d (with G the Fisher matrix) become indistinguishable.
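The same convergence can be checked numerically: fix a perturbation direction, shrink its scale ε, and compare the exact KL against ½ dᵀG d with G = diag(1/σ², 2/σ²). A sketch (names mine):

```python
import math

def kl_gauss(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, sigma1) || N(mu2, sigma2)), closed form."""
    return (math.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5)

mu, sigma = 0.0, 1.0                      # base distribution N(0, 1)
ratios = []
for eps in (1e-1, 1e-2, 1e-3):
    d_mu, d_sig = 0.3 * eps, 0.4 * eps    # fixed direction, shrinking scale
    kl = kl_gauss(mu, sigma, mu + d_mu, sigma + d_sig)
    half_quad = 0.5 * (d_mu**2 + 2 * d_sig**2) / sigma**2
    ratios.append(kl / half_quad)         # -> 1 as eps -> 0
```

At ε = 0.1 the ratio is still about 0.93; by ε = 0.001 it agrees with 1 to better than a tenth of a percent.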
Brains minimize free energy. Free energy includes KL divergence between beliefs and observations. The geometry of that minimization is information geometry. Perception is geodesic motion on a statistical manifold.
When you update a belief, you are not moving in a flat space. You are moving along geodesics of a curved manifold where the curvature is determined by how much information each parameter carries.
Information geometry is the foundation of: natural gradient optimization (Amari 1998), variational inference, the EM algorithm, optimal experiment design, neural network loss landscapes, thermodynamic geometry, and the geometry of quantum states. The Fisher metric is the bridge between statistics and differential geometry.
The space of all possible beliefs has a shape. That shape is not Euclidean. It is hyperbolic, curved by information, and everything that learns — brains, algorithms, evolution — navigates it.