Neural networks have far more parameters than data points, yet they generalize. A GPT-scale model has billions of parameters trained on data that, while vast, is finite. Classical statistics says this should not work. The bias-variance tradeoff is supposed to punish you for every extra parameter—more parameters means more overfitting, more noise memorized as signal.
But it does work. And it works because of the extra parameters, not despite them. Singular Learning Theory, developed by Sumio Watanabe over two decades, explains why: the effective complexity of a model is not its parameter count d but its learning coefficient λ, which measures the intrinsic dimensionality of the loss landscape near its minima. For singular models—which include virtually all neural networks—λ can be far less than d/2.
The key insight is geometric. When many different parameter configurations produce the same function, the map from parameters to predictions is many-to-one. The set of optimal parameters is not a point but a variety—a curved, possibly singular, geometric object. The singularities of this variety determine how the model generalizes.
A regular statistical model has a one-to-one map from parameters to distributions. Its loss landscape has isolated quadratic minima, and the posterior concentrates in a Gaussian ball. A singular model has symmetries, redundancies, or degenerate parameterizations that create flat directions in the loss—singularities where the Hessian degenerates. The posterior is decidedly non-Gaussian.
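The contrast between an isolated quadratic minimum and a degenerate one can be seen numerically. A minimal sketch (the two toy losses here are illustrative, not from the text): the regular loss w₁² + w₂² has a full-rank Hessian at its minimum, while the singular loss (w₁w₂)², whose minimum set is the union of the two axes, has a Hessian that vanishes at the origin.

```python
import numpy as np

def hessian(f, w, eps=1e-4):
    """Finite-difference Hessian of f at point w."""
    d = len(w)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            w_pp = w.copy(); w_pp[i] += eps; w_pp[j] += eps
            w_pm = w.copy(); w_pm[i] += eps; w_pm[j] -= eps
            w_mp = w.copy(); w_mp[i] -= eps; w_mp[j] += eps
            w_mm = w.copy(); w_mm[i] -= eps; w_mm[j] -= eps
            H[i, j] = (f(w_pp) - f(w_pm) - f(w_mp) + f(w_mm)) / (4 * eps**2)
    return H

regular  = lambda w: w[0]**2 + w[1]**2   # isolated quadratic minimum
singular = lambda w: (w[0] * w[1])**2    # minimum set = the two coordinate axes

w0 = np.zeros(2)
print(np.linalg.eigvalsh(hessian(regular, w0)))   # both eigenvalues positive
print(np.linalg.eigvalsh(hessian(singular, w0)))  # eigenvalues ~0: degenerate Hessian
```

The degenerate eigenvalues are exactly the "flat directions" in the loss: moving along either axis leaves the singular loss at zero.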
Heisuke Hironaka proved in 1964 that every algebraic singularity over a field of characteristic zero can be resolved—transformed into a smooth space through a sequence of blowups. Each blowup replaces a singular point with a smooth exceptional divisor, untangling the geometry. Hironaka received the Fields Medal for this result.
In Singular Learning Theory, this theorem is the engine. The loss function near its minimum is an analytic function K(w) with possibly degenerate zeros. Resolution of singularities transforms it via a proper birational map g into normal crossing form:

$$K(g(u)) = u_1^{2k_1} u_2^{2k_2} \cdots u_d^{2k_d}, \qquad |g'(u)| = b(u)\,\lvert u_1^{h_1} \cdots u_d^{h_d}\rvert,$$

where b(u) > 0 and the k_j, h_j are nonnegative integers, in each coordinate chart of the resolved space.
The real log canonical threshold (RLCT), also called the learning coefficient λ, emerges as the minimum, over coordinate charts and over coordinates j within each chart, of (h_j + 1)/(2k_j), where the h_j are the multiplicities of the Jacobian determinant and the k_j are the exponents of K in normal crossing form. This single number controls generalization.
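Once the normal crossing data is in hand, computing λ is simple arithmetic on exponent vectors. A minimal sketch (the chart data below is illustrative, not derived from an actual blowup):

```python
# In each chart, K(g(u)) = prod_j u_j^(2 k_j) with Jacobian
# |g'(u)| = b(u) * prod_j |u_j|^(h_j). The chart contributes
# min_j (h_j + 1) / (2 k_j); λ is the minimum over charts.
def chart_rlct(k, h):
    """Minimum over coordinates with k_j > 0 of (h_j + 1) / (2 k_j)."""
    return min((hj + 1) / (2 * kj) for kj, hj in zip(k, h) if kj > 0)

def rlct(charts):
    """λ = minimum of the chart contributions over all charts."""
    return min(chart_rlct(k, h) for k, h in charts)

# Illustrative: a single chart where K = u1^2 * u2^4 (k = (1, 2))
# and the Jacobian has exponents h = (1, 0).
print(rlct([([1, 2], [1, 0])]))  # min(2/2, 1/4) = 0.25
```

For a regular one-dimensional direction (k = 1, h = 0) the formula gives 1/2, recovering the familiar half-a-parameter of complexity per coordinate.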
Watanabe’s main theorem gives the asymptotic expansion of the free energy—the negative log marginal likelihood—of a singular model:

$$F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log\log n + O_p(1)$$
Here n is the sample size, Lₙ(w₀) is the empirical loss at the optimal parameter, λ is the learning coefficient, and m is the multiplicity (pole order) of the largest pole of the zeta function ζ(z) = ∫ K(w)ᶻ φ(w) dw, where φ is the prior.
For regular models, λ = d/2 and m = 1, recovering the Bayesian Information Criterion (BIC). But for singular models, λ < d/2. The model is less complex than its parameter count suggests. The gap between BIC and the true free energy is the “bonus” that singularities provide—free generalization capacity from geometric structure.
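The size of that bonus is easy to compute from the expansion above. A sketch with hypothetical numbers (d, λ, and n below are illustrative, not taken from any real model):

```python
import math

def free_energy(n, nL0, lam, m=1):
    """Watanabe's expansion: F_n ≈ n·L_n(w0) + λ·log n − (m−1)·log log n."""
    return nL0 + lam * math.log(n) - (m - 1) * math.log(math.log(n))

def bic(n, nL0, d):
    """Regular-model approximation: n·L_n(w0) + (d/2)·log n."""
    return nL0 + (d / 2) * math.log(n)

# Hypothetical singular model: d = 1000 parameters but λ = 50 ≪ d/2 = 500.
n, nL0 = 10_000, 0.0
gap = bic(n, nL0, d=1000) - free_energy(n, nL0, lam=50.0, m=1)
print(gap)  # (500 − 50) · log(10000) ≈ 4145 nats of "free" capacity
```

The gap grows like (d/2 − λ)·log n, so the geometric bonus compounds as more data arrives.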
During training, a model’s effective complexity—its local learning coefficient—can change discontinuously. These are phase transitions. The model discovers new structure in the data, jumping from one singularity type to another with a different λ. This is why training loss sometimes plateaus for long stretches and then drops sharply: each plateau is a phase of stable geometry, each drop is a geometric phase transition.
The local learning coefficient (LLC), estimated via sampling, tracks these transitions in real time. A high LLC means the model is effectively complex—it sits near a mild singularity with high effective dimension. When the LLC drops, the model has found a more degenerate singularity—a more structured, lower-dimensional solution. Learning is the discovery of singularities.
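The sampling-based estimator can be sketched in a few lines. This is a toy version under stated assumptions: it uses the tempered-posterior estimator λ̂ = n·β·(E_β[Lₙ(w)] − Lₙ(w*)) with β = 1/log n, a plain Metropolis sampler in place of the SGLD samplers used in practice, and a one-dimensional loss L(w) = w⁴ whose true RLCT is 1/4.

```python
import math, random

def estimate_llc(loss, w_star, n=100_000, steps=200_000, step_size=0.1, seed=0):
    """LLC estimate λ̂ = n·β·(E_β[L(w)] − L(w*)) at inverse temperature β = 1/log n."""
    rng = random.Random(seed)
    beta = 1.0 / math.log(n)
    w, lw = w_star, loss(w_star)
    total = 0.0
    for _ in range(steps):
        prop = w + rng.gauss(0.0, step_size)
        lp = loss(prop)
        # Metropolis step targeting the tempered posterior ∝ exp(−n·β·L(w)).
        if rng.random() < math.exp(min(0.0, -n * beta * (lp - lw))):
            w, lw = prop, lp
        total += lw
    return n * beta * (total / steps - loss(w_star))

print(estimate_llc(lambda w: w**4, 0.0))  # ≈ 0.25, the RLCT of w^4
```

Tracking this estimate over training checkpoints is what makes the phase transitions visible: a plateau in the loss curve with a falling LLC signals that the model is sliding toward a more degenerate singularity.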
Different singularity types have different learning coefficients; each has its own curve, normal crossing form, and RLCT.
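For losses that are already monomials in normal crossing form with trivial Jacobian, the RLCT and multiplicity come straight from the exponents. A small illustrative catalogue (standard textbook examples, computed from the formula rather than taken from the text):

```python
from fractions import Fraction

# For K(w) = prod_j w_j^(2 k_j) in normal crossing form with h_j = 0:
# λ = min_j 1/(2 k_j), and the multiplicity m counts how many
# coordinates attain that minimum.
def monomial_rlct(k):
    contribs = [Fraction(1, 2 * kj) for kj in k if kj > 0]
    lam = min(contribs)
    return lam, contribs.count(lam)

for name, k in [("w^2", [1]), ("w^4", [2]),
                ("(w1*w2)^2", [1, 1]), ("w1^2 * w2^4", [1, 2])]:
    print(name, monomial_rlct(k))
# w^2 → (1/2, 1); w^4 → (1/4, 1); (w1*w2)^2 → (1/2, 2); w1^2·w2^4 → (1/4, 1)
```

Note that (w₁w₂)² has the same λ as w² but multiplicity 2, which changes the log log n term in the free energy.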
Singular Learning Theory resolves one of the deepest puzzles in modern machine learning. Classical statistical learning theory—VC dimension, Rademacher complexity, PAC-Bayes bounds—consistently predicts that overparameterized models should overfit catastrophically. They don’t. SLT explains why: the effective complexity of a neural network is governed not by its parameter count but by the geometry of its loss landscape.
Phase transitions during training are not bugs or instabilities. They are the mechanism of learning. When a model discovers that certain features can be compressed—that multiple neurons can be collapsed without changing predictions—it transitions to a more degenerate singularity with lower λ. The model becomes simpler in the ways that matter while retaining the capacity to express complex functions.
The learning coefficient λ is a better complexity measure than parameter count, AIC, or BIC. It is intrinsic to the model-data pair, not just the model architecture. Two networks with identical architectures but trained on different data will have different local learning coefficients, reflecting the different structures they have discovered.
This connects to developmental interpretability: understanding how structure emerges through training, not just what structure exists in a trained model. The sequence of phase transitions—the developmental trajectory—tells us something about both the data and the inductive biases of the architecture.
The same mathematical structure appears in trust networks. When agents have redundant attestation pathways—many-to-one maps from behavior to reputation—the effective dimensionality of the trust signal is lower than the raw attestation count. Singularities in the reputation landscape—where multiple distinct behaviors produce identical trust scores—are precisely the loci where Sybil attacks exploit the system. Understanding the RLCT of a reputation function tells you how much redundancy an attacker can hide behind, and resolution of those singularities reveals the true degrees of freedom in the trust network.