The previous essay showed that the family of Gaussian distributions forms a curved space—the Poincaré half-plane, equipped with the Fisher information metric. Geodesics on this space are semicircles, distances encode statistical distinguishability, and curvature is constant and negative.
But there is something stranger lurking. This manifold admits two incompatible notions of “straight.” Both are natural. Neither is wrong. A line that is perfectly straight in one sense is curved in the other. The duality between these two straightnesses is the deepest structure in information geometry, and it underpins everything from the EM algorithm to variational inference to the thermodynamics of phase transitions.
Given two Gaussians, there are (at least) three natural paths between them:
The red path is an e-geodesic—straight when we parameterize by the natural (canonical) parameters of the exponential family. The blue path is an m-geodesic—straight when we parameterize by expectation parameters, meaning it corresponds to a mixture interpolation. The gold path is the Levi-Civita geodesic—the one that minimizes Fisher-Rao distance, the Poincaré semicircle.
Three different geodesics, three different connections, one manifold. The relationship between the red and the blue is the subject of this essay.
For an exponential family, every distribution can be written as p(x|θ) = exp(θ·t(x) − A(θ)) where θ are the natural parameters and A(θ) is the log-partition function (cumulant generating function). The expectation parameters are η = ∇A(θ)—they are the expected values of the sufficient statistics under the model.
For the Gaussian family with sufficient statistics t(x) = (x, x²), the natural parameters are θ = (μ/σ², −1/(2σ²)) and the expectation parameters are η = (E[x], E[x²]) = (μ, μ² + σ²).
The two coordinate systems are related by the Legendre transform of A(θ). In natural coordinates, the e-connection is flat: straight lines are exponential arcs. In expectation coordinates, the m-connection is flat: straight lines are mixture arcs. A line that is straight in one coordinate system is curved in the other.
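This is easy to verify numerically. The sketch below (plain NumPy; function names are my own) converts between the two coordinate systems for the Gaussian family and shows that the midpoint of the e-geodesic and the midpoint of the m-geodesic between the same pair of distributions are different Gaussians:

```python
import numpy as np

# Gaussian exponential family with t(x) = (x, x^2):
#   natural params      theta = (mu/sigma^2, -1/(2 sigma^2))
#   expectation params  eta   = (mu, mu^2 + sigma^2)

def to_theta(mu, s2):
    return np.array([mu / s2, -1.0 / (2.0 * s2)])

def to_eta(mu, s2):
    return np.array([mu, mu**2 + s2])

def theta_to_moments(th):
    s2 = -1.0 / (2.0 * th[1])
    return th[0] * s2, s2

def eta_to_moments(et):
    return et[0], et[1] - et[0]**2

p, q = (0.0, 1.0), (4.0, 1.0)          # N(0,1) and N(4,1) as (mu, sigma^2)

# e-geodesic midpoint: average in natural coordinates
e_mid = theta_to_moments((to_theta(*p) + to_theta(*q)) / 2)
# m-geodesic midpoint: average in expectation coordinates
m_mid = eta_to_moments((to_eta(*p) + to_eta(*q)) / 2)

print("e-midpoint (mu, sigma^2):", e_mid)   # (2.0, 1.0)
print("m-midpoint (mu, sigma^2):", m_mid)   # (2.0, 5.0)
```

The m-midpoint is much wider because averaging expectation parameters matches the moments of the literal 50/50 mixture (which itself leaves the Gaussian family), while averaging natural parameters just slides the density over.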
This is the essence of dually flat geometry. The manifold is “flat” in two incompatible ways simultaneously, and the tension between the two flatnesses encodes all the asymmetry of the KL divergence—the fact that D_KL(p||q) ≠ D_KL(q||p).
On dually flat manifolds, the KL divergence obeys a Pythagorean theorem. Given a submanifold M that is e-flat (defined by linear constraints in natural parameters), a point p off M, and its m-projection r onto M (the point that minimizes D_KL(p||q) over q ∈ M), then for any q ∈ M:

D_KL(p||q) = D_KL(p||r) + D_KL(r||q)

The cross term vanishes. This is exactly the Pythagorean theorem, with KL divergence playing the role of squared distance. The m-geodesic from p to r meets the e-flat submanifold M at a right angle (in the dual sense).
This is not a metaphor or a loose analogy. It is a precise mathematical identity. It is the geometric content of the EM algorithm, variational inference, maximum entropy, and exponential family regression.
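A minimal numeric check, using a classic instance of the setup: on a 2×2 product space, the independent (product) distributions form an e-flat family, and the m-projection of a joint p onto it is the product of its marginals (the code and names below are my own sketch, not a general-purpose routine):

```python
import numpy as np

def kl(p, q):
    # KL divergence between discrete distributions (natural log)
    return float(np.sum(p * np.log(p / q)))

# Correlated joint distribution over a 2x2 grid (not a product)
p = np.array([[0.4, 0.1],
              [0.1, 0.4]])

# m-projection of p onto the e-flat family of product distributions:
# the product of its marginals
px, py = p.sum(axis=1), p.sum(axis=0)
r = np.outer(px, py)

# Any other point q in the product family
q = np.outer([0.3, 0.7], [0.6, 0.4])

lhs = kl(p.ravel(), q.ravel())
rhs = kl(p.ravel(), r.ravel()) + kl(r.ravel(), q.ravel())
print(lhs, rhs)  # equal: the cross term vanishes
```

Changing q to any other product distribution leaves the identity intact; changing r to anything other than the product of marginals breaks it.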
The two connections yield two natural notions of projection onto a submanifold, corresponding to minimizing the two “directions” of KL divergence:
M-projection (moment projection, forward KL): argmin_q D_KL(p||q). This is the m-projection onto an e-flat submanifold. It is zero-avoiding: the approximation q avoids placing zero probability where p has mass. Result: q covers all modes of p, potentially spreading too wide.
I-projection (information projection, reverse KL): argmin_q D_KL(q||p). This is the e-projection onto an m-flat submanifold. It is zero-forcing: q may place zero probability where p has mass. Result: q locks onto one mode of p, ignoring others.
This is the geometric reason variational inference (which minimizes reverse KL) is mode-seeking, while maximum likelihood (which minimizes forward KL) is mode-covering.
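The contrast is easy to reproduce. The sketch below (plain NumPy, brute-force grid search rather than any particular VI algorithm) fits a single Gaussian to a bimodal target under each direction of KL:

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Bimodal target: two well-separated narrow modes
p = 0.5 * gauss(x, -3, 0.5) + 0.5 * gauss(x, 3, 0.5)

def kl(a, b):
    a = np.maximum(a, 1e-300)   # dodge 0*log(0) from float underflow
    b = np.maximum(b, 1e-300)
    return np.sum(a * np.log(a / b)) * dx

best_fwd = best_rev = (np.inf, 0.0, 1.0)
for mu in np.linspace(-4, 4, 81):
    for s in np.linspace(0.4, 4, 73):
        q = gauss(x, mu, s)
        best_fwd = min(best_fwd, (kl(p, q), mu, s))
        best_rev = min(best_rev, (kl(q, p), mu, s))

print("forward KL argmin (mu, sigma):", best_fwd[1:])  # mode-covering: mu ~ 0, broad sigma
print("reverse KL argmin (mu, sigma):", best_rev[1:])  # mode-seeking: mu ~ +/-3, sigma ~ 0.5
```

The forward-KL fit sits between the modes with a large variance (it is exactly moment matching); the reverse-KL fit collapses onto one mode with the mode's own width.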
Amari’s α-connections form a one-parameter family that interpolates continuously between the two extremes:

∇(α) = ((1+α)/2) ∇(1) + ((1−α)/2) ∇(−1)

Here α = +1 gives the e-connection, α = −1 the m-connection, and α = 0 the Levi-Civita connection of the Fisher metric.
For any α, the ∇(α) and ∇(−α) connections are dual with respect to the Fisher metric. The geodesic equation changes continuously with α, and with it the notion of “straight.”
Watch the geodesic morph as α moves from −1 to +1. At the extremes it straightens in one coordinate system; in between it curves through both. The α=0 path is the unique geodesic that minimizes arc length—the Poincaré semicircle.
At singular points of a statistical model, the Fisher information matrix degenerates—eigenvalues collapse to zero. Information geometry assumes a positive-definite metric everywhere. When this assumption fails, the entire framework breaks: distances become zero in certain directions, geodesics are undefined, and the dual structure collapses.
Consider a model p(x|w₁, w₂) where the output depends on the product w₁ · w₂. When either weight is zero, the other is unidentifiable—you can change it freely without changing the distribution. The Fisher matrix has a zero eigenvalue along the unidentifiable direction.
This is precisely the territory of Singular Learning Theory. Where classical information geometry fails, SLT picks up: the learning coefficient λ—the real log canonical threshold—measures effective complexity even at singularities, and it is always less than or equal to d/2, where d is the number of parameters. The degeneracy of the Fisher metric is not a bug but a feature: it means the model has fewer effective parameters than its nominal dimension suggests.
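The toy model above can be computed in a few lines. Taking the concrete case x ~ N(w₁w₂, 1) (my choice of a minimal singular model consistent with the text), the Fisher matrix is the outer product of the mean's gradient, and its null directions are exactly the unidentifiable ones:

```python
import numpy as np

def fisher(w1, w2):
    # For x ~ N(m(w), 1) the Fisher matrix is (grad m)(grad m)^T;
    # here m(w) = w1 * w2, so grad m = (w2, w1).
    g = np.array([w2, w1])
    return np.outer(g, g)

for w in [(1.0, 2.0), (0.0, 2.0), (0.0, 0.0)]:
    print(w, np.linalg.eigvalsh(fisher(*w)))
# generic point: rank 1 (only the product w1*w2 is identifiable)
# w1 = 0:       null direction along w2, which no longer affects the model
# origin:       the metric vanishes entirely
```

The nominal dimension is 2, but the Fisher matrix never has rank above 1, and at the origin the metric is identically zero: exactly the degeneracy that classical information geometry cannot handle.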
The Boltzmann distribution p(x|β) = exp(−βE(x))/Z(β) is an exponential family with natural parameter β (inverse temperature) and sufficient statistic −E(x) (negative energy). The log-partition function is A(β) = ln Z(β).
The Fisher information of β is therefore the variance of the energy: I(β) = A″(β) = Var(E) = ⟨E²⟩ − ⟨E⟩². But in thermodynamics, this is exactly the heat capacity (up to a factor of β²). The Fisher-Rao distance is the thermodynamic length, a lower bound on the dissipation of a finite-time process. Phase transitions, where the heat capacity diverges, are curvature singularities of the statistical manifold.
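For a concrete instance, take a two-level system with energies 0 and ε (a minimal sketch; the variable names are mine). Differentiating A(β) = ln Z(β) twice numerically recovers the energy variance:

```python
import numpy as np

eps, beta = 1.0, 0.7
E = np.array([0.0, eps])          # two-level system: energies 0 and eps

def logZ(b):
    return np.log(np.sum(np.exp(-b * E)))   # A(beta) = ln Z(beta)

# Fisher information I(beta) = A''(beta), via central finite differences
h = 1e-4
I = (logZ(beta + h) - 2 * logZ(beta) + logZ(beta - h)) / h**2

# Energy variance under the Boltzmann distribution at the same beta
p = np.exp(-beta * E) / np.sum(np.exp(-beta * E))
varE = np.sum(p * E**2) - np.sum(p * E)**2

print(I, varE, beta**2 * varE)   # I matches Var(E); C = beta^2 Var(E) (k_B = 1)
```

The same three lines work for any finite energy spectrum: replace `E` with the energy levels and the identity I(β) = Var(E) persists.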
The duality between e-connection and m-connection is not a curiosity of differential geometry. It is a structural feature that appears everywhere learning meets probability:
The EM algorithm is, geometrically, an alternating projection scheme: the E-step is an e-projection (computing the posterior over latent variables, i.e. the expected sufficient statistics), and the M-step is an m-projection (fitting parameters to the resulting moments). Convergence follows from the Pythagorean theorem—each step reduces KL divergence, and the cross terms vanish.
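A bare-bones EM loop for a 1-D two-component Gaussian mixture (a sketch under my own variable names) makes the consequence of the alternating projections visible: the log-likelihood computed at each E-step never decreases.

```python
import numpy as np

rng = np.random.default_rng(0)
# Data from a two-component 1-D Gaussian mixture with means -2 and +2
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 300)])

def gauss(x, mu, s2):
    return np.exp(-0.5 * (x - mu)**2 / s2) / np.sqrt(2 * np.pi * s2)

pi, mu, s2 = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
prev = -np.inf
for _ in range(50):
    # E-step: posterior responsibilities (expected sufficient statistics
    # of the latent assignments)
    r = pi * np.stack([gauss(x, m, s) for m, s in zip(mu, s2)], axis=1)
    ll = np.sum(np.log(r.sum(axis=1)))
    assert ll >= prev - 1e-7       # log-likelihood never decreases
    prev = ll
    r /= r.sum(axis=1, keepdims=True)
    # M-step: refit each component to the responsibility-weighted moments
    n = r.sum(axis=0)
    pi = n / len(x)
    mu = (r * x[:, None]).sum(axis=0) / n
    s2 = (r * (x[:, None] - mu)**2).sum(axis=0) / n

print("final means:", np.sort(mu))
```

The in-loop assertion is the Pythagorean theorem at work: because the cross term vanishes at each projection, neither step can overshoot and undo the other's progress.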
The singularities in the loss landscape live precisely where the duality collapses—where the two notions of straight become ill-defined because the metric itself degenerates. SLT tells us that these singularities are not pathological; they are where the most efficient learning happens, where models achieve the lowest effective complexity.
The two straightnesses are two aspects of the same manifold, and the tension between them—the irreducible gap between the e-geodesic and the m-geodesic, between forward KL and reverse KL, between energy and entropy—is what drives learning.
Part 1: The Shape of Uncertainty (information geometry and the Fisher metric)
Part 3: The Geometry of Learning (where the geometry breaks and SLT takes over)