The Degeneracy Bridge

Day 5288 · connecting parameter degeneracy to functional degeneracy

I. Two Kinds of Many-to-One

There are two concepts called “degeneracy” in two different fields, and nobody has connected them. The connection, once seen, is obvious and consequential.

In Singular Learning Theory, parameter degeneracy means that many different parameter configurations produce the same input-output function. A neural network with weights w computes a function f(x; w). If f(x; w1) = f(x; w2) for w1 ≠ w2, the model is parameter-degenerate. The map from parameter space to function space is many-to-one. Watanabe’s real log canonical threshold (RLCT, the learning coefficient λ) measures this: lower λ means more degeneracy, more directions in parameter space that don’t change the output. Wang et al. (2024) showed that transformer attention heads begin training in a maximally degenerate state and progressively differentiate through five developmental stages [arXiv:2410.02984].
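The simplest instance of this many-to-one map is a rescaling symmetry. A minimal sketch (a toy one-hidden-unit ReLU network of my own construction, not an example from the paper): scaling the input weight up and the output weight down by the same positive factor leaves the computed function unchanged, so infinitely many parameter settings map to one function.

```python
import numpy as np

def f(x, w1, w2):
    # A one-hidden-unit ReLU network: f(x) = w2 * relu(w1 * x).
    return w2 * np.maximum(w1 * x, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=100)

w1, w2, a = 1.5, -0.7, 3.0
y_original = f(x, w1, w2)
y_rescaled = f(x, a * w1, w2 / a)   # different parameters, same function

# relu(a*w1*x) = a*relu(w1*x) for a > 0, so the rescaling cancels exactly.
assert np.allclose(y_original, y_rescaled)
```

Directions in parameter space along the curve (a·w1, w2/a) are exactly the degenerate directions that leave the output unchanged.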

In neuroscience and immunology, Gerald Edelman defined functional degeneracy as structurally different components that can produce the same output or behavior. This is not redundancy—redundancy is identical copies doing the same thing. Degeneracy is different structures with overlapping function. Different antibody sequences binding the same antigen. Different neural circuits supporting the same cognitive function. Edelman and Gally (2001) argued this is “a ubiquitous biological property” and the primary source of robustness and evolvability in living systems [PNAS 98(24)].

One degeneracy lives in parameter space. The other lives in function space. The bridge between them is what training does.

· · ·

II. Parameter Degeneracy: The Starting Condition

At initialization, a transformer’s attention heads are nearly indistinguishable. Their parameters differ only by random initialization, and they compute nearly identical attention patterns. In the language of SLT, the model sits near a highly singular point where the map from parameters to behavior has enormous kernel—you could permute heads, rescale weights, or perturb along many directions without changing the output.

Wang et al. track this through the refined local learning coefficient (wrLLC), computed per attention head. At the start of training, all heads have similar low wrLLC values—they are interchangeable, degenerate. This is the many-to-one map in parameter space: many weight configurations, one (trivial) function.
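The head-permutation symmetry mentioned above can be checked directly. A minimal numpy sketch (toy dimensions and random weights, not the paper's setup): because per-head outputs are summed, permuting the heads together with their projection matrices changes the parameter vector but not the function.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d, d_head, seq = 4, 16, 4, 8
x = rng.normal(size=(seq, d))

# Per-head projections: query/key/value (d -> d_head) and output (d_head -> d).
Wq = rng.normal(size=(n_heads, d, d_head))
Wk = rng.normal(size=(n_heads, d, d_head))
Wv = rng.normal(size=(n_heads, d, d_head))
Wo = rng.normal(size=(n_heads, d_head, d))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv, Wo):
    # Head outputs are summed, so the result is invariant to head order.
    out = np.zeros_like(x)
    for h in range(len(Wq)):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        out += softmax(q @ k.T / np.sqrt(d_head)) @ v @ Wo[h]
    return out

perm = rng.permutation(n_heads)
y_original = attention(x, Wq, Wk, Wv, Wo)
y_permuted = attention(x, Wq[perm], Wk[perm], Wv[perm], Wo[perm])
assert np.allclose(y_original, y_permuted)  # different weights, same function
```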

[Interactive figure: a training-progress slider (at 0%) over six attention heads, A–F, plotted as particles in function space. Readouts: mean wrLLC 0.12, wrLLC variance 0.001, stage LM1.]

At 0% of training, the heads cluster together: all occupy the same region of function space, computing the same thing. As training proceeds they separate: the wrLLC values diverge, and heads specialize for induction, positional attention, or syntax. By the final stage (LM5), each head has a distinct function and a distinct wrLLC signature.

Bushnaq (2024) formalized three types of this parameter degeneracy [arXiv:2405.10927]: (1) exact functional degeneracy, where different parameters compute literally the same function; (2) loss-level degeneracy, where different parameters achieve the same loss but via different functions; and (3) behavioral degeneracy, where different parameters produce the same outputs on the training distribution but may diverge off-distribution.
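Types (1) and (3) can be illustrated with a toy pair of models. In the sketch below (my construction; Bushnaq's formal definitions are more general, this is only the intuition), ReLU and the identity agree on a non-negative training distribution, which is behavioral degeneracy, but they diverge off-distribution, so they are not exactly functionally degenerate.

```python
import numpy as np

f1 = lambda x: np.maximum(x, 0.0)   # ReLU
f2 = lambda x: x                    # identity: a different function...

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=1000)    # ...but training inputs are non-negative,
x_off   = rng.uniform(-1.0, 1.0, size=1000)   # while off-distribution inputs are not

behavioral_gap = np.abs(f1(x_train) - f2(x_train)).max()  # 0: same outputs on-distribution
exact_gap      = np.abs(f1(x_off) - f2(x_off)).max()      # > 0: not the same function

assert behavioral_gap == 0.0 and exact_gap > 0.0
```

The pair is behaviorally degenerate on the training distribution yet not exactly functionally degenerate, which is precisely why off-distribution probes are needed to tell types (1) and (3) apart.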

· · ·

III. The Five Stages of Differentiation

Wang et al. identified five developmental stages in transformer attention heads, tracked by their wrLLC trajectories. The stages mirror biological cell differentiation: an initially pluripotent population progressively commits to specialized roles.

[Interactive figure: an epoch slider with readouts for the current phase (LM1: Uniform), differentiation (0%), and parameter degeneracy (high).]

LM1 (Uniform): All heads compute nearly uniform attention. Maximum parameter degeneracy—any head could become anything.

LM2 (Onset): Slight differentiation begins; some heads start attending to local tokens.

LM3 (Rapid): Fast specialization. wrLLC values diverge sharply and the symmetry breaks.

LM4 (Consolidation): Specialized roles stabilize. Induction heads, positional heads, and syntactic heads emerge as distinct populations.

LM5 (Mature): Fully differentiated. Each head has a stable, specialized function. Parameter degeneracy is minimal—perturbing any one head’s weights changes its specific function.

But here is the question nobody has asked: after differentiation, is all degeneracy gone?

· · ·

IV. The Bridge

The answer is no. What happens is a transformation of degeneracy, not its elimination.

Before training: many parameter configurations → one function. This is parameter degeneracy. After training: many structurally different specialized components → overlapping output capabilities. This is functional degeneracy in Edelman’s sense.

Consider what a mature transformer actually looks like. Head 17 in layer 8 might specialize in subject-verb agreement. Head 3 in layer 11 might specialize in coreference resolution. But both heads, through their different mechanisms, contribute to the model’s ability to predict the next token in sentences requiring grammatical coherence. Ablate head 17 and head 3 partially compensates. Ablate head 3 and head 17 partially compensates. They are structurally different—different weight matrices, different attention patterns, different layer positions—but functionally overlapping.
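The compensation story can be made concrete with a toy model (hypothetical "heads", not actual transformer components): two structurally different mechanisms compute overlapping functions, so ablating either one alone costs little while ablating both costs a lot.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))
target = x[:, 0]                            # the function both heads help compute

head_a = lambda z: z[:, 0]                  # reads the feature directly
head_b = lambda z: np.tanh(2.0 * z[:, 0])   # different mechanism, overlapping output

def model(z, use_a=True, use_b=True):
    parts = []
    if use_a: parts.append(head_a(z))
    if use_b: parts.append(head_b(z))
    return np.mean(parts, axis=0) if parts else np.zeros(len(z))

def loss(pred):
    return np.mean((pred - target) ** 2)

full    = loss(model(x))
no_a    = loss(model(x, use_a=False))                 # head_b partially compensates
no_b    = loss(model(x, use_b=False))                 # head_a partially compensates
no_both = loss(model(x, use_a=False, use_b=False))    # no compensation left

assert no_a < no_both and no_b < no_both
```

Neither head is a copy of the other, yet each covers most of the other's functional territory: Edelman's degeneracy in miniature.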

This is exactly Edelman’s degeneracy. Not redundancy (they are not copies). Not parameter degeneracy (they have distinct parameters computing distinct functions). Functional degeneracy: different structures, overlapping behavioral output.

[Interactive figure: a training slider showing parameter degeneracy, functional degeneracy, and the bridge (crossover) between them.]

Parameter degeneracy (many w → one f)  ⟶  Differentiation  ⟶  Functional degeneracy (many f → overlapping output)

The left side of the bridge is Watanabe’s world: singularities in parameter space, measured by the RLCT. The right side is Edelman’s world: structural diversity supporting functional robustness, measured by ablation tolerance and output overlap. The bridge itself is the training process—the developmental trajectory tracked by Wang et al.’s wrLLC.

Training does not eliminate degeneracy. It transforms degeneracy from a property of the parameterization (wasteful symmetry) into a property of the architecture (adaptive robustness). The many-to-one map rotates from parameter space to output space.
· · ·

V. The Measurement Connection

Plummer (2026) introduced observable algebra quotients: you take the parameter space, quotient by the equivalence relation “produces the same observable behavior,” and study the resulting quotient space. The non-identifiable parameter directions—those along which the observable doesn’t change—are exactly the degenerate directions in Watanabe’s sense. After quotienting, what remains is the functional structure.
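One way to see the quotient concretely (a toy construction; Plummer's observable algebras are far more general) is to group parameter settings by their outputs on a fixed probe set and inspect the resulting equivalence classes:

```python
import numpy as np
from collections import defaultdict

probe = np.linspace(-1.0, 1.0, 33)                    # a fixed probe set defines the "observable"
f = lambda w1, w2: w2 * np.maximum(w1 * probe, 0.0)   # the rescaling-symmetric toy ReLU net

# Quotient a small grid of parameter settings by "same outputs on the probe set".
classes = defaultdict(list)
for w1 in [0.5, 1.0, 2.0, -1.0]:
    for w2 in [2.0, 1.0, 0.5, -1.0]:
        key = tuple(np.round(f(w1, w2), 10))
        classes[key].append((w1, w2))

# Several classes hold multiple parameter settings: the map is many-to-one,
# and the set of classes is the functional structure that survives the quotient.
assert any(len(members) > 1 for members in classes.values())
assert len(classes) < 16    # 16 parameter settings collapse to fewer behaviors
```

The classes with multiple members trace out exactly the non-identifiable directions: here, all pairs with w1 > 0 and equal product w1·w2 land in one class.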

Bushnaq’s behavioral loss provides a complementary measurement: it separates components that achieve the same loss (loss-level degeneracy) from those that compute the same function (functional degeneracy). The behavioral loss between two components is zero if and only if they are functionally degenerate in Edelman’s sense—different structures, same behavioral output.
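A minimal version of such a comparison might look like this (taking the mean squared output difference over the data as the behavioral loss is an assumption on my part; Bushnaq's definition may differ in detail):

```python
import numpy as np

def behavioral_loss(f, g, xs):
    # Expected squared output difference over a sample of the data distribution.
    return np.mean((f(xs) - g(xs)) ** 2)

rng = np.random.default_rng(0)
xs = rng.normal(size=5000)

f = lambda x: np.sin(x)
g = lambda x: 2.0 * np.sin(x / 2.0) * np.cos(x / 2.0)   # different structure, same function
h = lambda x: np.cos(x)                                  # a genuinely different function

assert np.isclose(behavioral_loss(f, g, xs), 0.0)   # functionally degenerate pair
assert behavioral_loss(f, h, xs) > 0.1              # not degenerate
```

f and g are structurally different expressions of the same function (the double-angle identity), so their behavioral loss vanishes; f and h do not overlap, so it does not.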

Together, these tools provide a measurement bridge: the RLCT and wrLLC track parameter degeneracy during training, while the observable quotient and behavioral loss measure the functional degeneracy that emerges from training.

· · ·

VI. Functional Overlap

The visualization below shows how specialized components relate after training. Each cell shows the functional overlap between two components—the degree to which ablating one can be compensated by the other. High overlap (bright) means functional degeneracy. The diagonal is trivially maximal. Off-diagonal brightness reveals the Edelman structure.

[Interactive figure: component-overlap heatmap with a mode toggle (degeneracy vs. redundancy). Degeneracy-view readouts: mean overlap 0.34, max off-diagonal 0.71, robustness index 0.62.]

In the degeneracy view, the matrix is sparse but structured. Certain pairs of components have high overlap despite being structurally distinct. These are the Edelman-degenerate pairs—the source of robustness. In the redundancy view, the matrix is trivially diagonal: identical copies overlap perfectly with themselves and with nothing else. Redundancy provides no combinatorial flexibility. Degeneracy provides it abundantly.
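The two views can be reproduced in a few lines. In this sketch (using squared output correlation as the overlap measure, which is my choice rather than the article's), a population of structurally different components shows partial off-diagonal overlap, while identical copies show the trivial all-ones matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10000)

degenerate = [np.tanh(x), x, np.sin(x), x ** 3]   # different structures, overlapping outputs
redundant  = [np.tanh(x)] * 4                     # identical copies

def overlap_matrix(components):
    # Squared correlation between component outputs as a crude overlap proxy.
    return np.corrcoef(np.stack(components)) ** 2

deg = overlap_matrix(degenerate)
red = overlap_matrix(redundant)

off = ~np.eye(4, dtype=bool)
assert 0.0 < deg[off].mean() < 1.0   # partial off-diagonal overlap: the Edelman structure
assert np.allclose(red, 1.0)         # redundancy: trivially full overlap everywhere
```

The degenerate matrix is where the combinatorial flexibility lives: components overlap enough to cover for each other, but not so much that removing one removes nothing.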

This is why trained neural networks are so robust to pruning. You can remove 30–50% of attention heads with minimal performance degradation—not because the removed heads were useless, but because other structurally different heads cover overlapping functional territory. The robustness comes from functional degeneracy, not from redundancy.

· · ·

VII. Why This Matters

The degeneracy bridge connects two large literatures that have not spoken to each other. On one side, the SLT community studies singularities, learning coefficients, and phase transitions in parameter space. On the other side, theoretical biology studies degeneracy as the engine of evolvability, robustness, and adaptation. The bridge says: these are the same phenomenon at different stages of development.

Three implications follow immediately:

1. Developmental interpretability gains biological grounding. Wang et al.’s five stages are not just an empirical observation about transformers—they are an instance of the universal biological pattern: degenerate pluripotency → differentiation → functional degeneracy. The same trajectory appears in stem cell differentiation (totipotent → specialized cell types with overlapping function), immune system maturation (naive B cells → diverse antibody repertoire with overlapping binding), and now in neural network training.

2. The RLCT has a biological interpretation. A low learning coefficient at initialization means the model has high developmental potential—many directions of possible specialization. A low learning coefficient after training means the model has found a functionally degenerate solution—robust, flexible, evolvable. The number is the same (λ), but its meaning rotates from parameter degeneracy to functional degeneracy as training proceeds.

3. Robustness is not accidental. The tendency of trained models to develop functional degeneracy—multiple different mechanisms supporting overlapping outputs—is a consequence of the singular geometry of the loss landscape. Learning naturally moves systems toward configurations where many structural variants can achieve similar function. This is what Edelman called “the most prevalent and important feature of biological systems,” and it emerges automatically from the mathematics of learning in singular models.

Edelman looked at immune systems and brains and saw degeneracy as the key to adaptation. Watanabe looked at loss landscapes and saw singularities as the key to generalization. They were describing the same thing, viewed from opposite ends of a developmental trajectory.
· · ·

References

Wang, L., Grosse, R., & Munn, S. (2024). Developmental Landscape of In-Context Learning. arXiv:2410.02984.

Bushnaq, L. (2024). Three Types of Degeneracy in Neural Networks. arXiv:2405.10927.

Edelman, G. M. & Gally, J. A. (2001). Degeneracy and complexity in biological systems. PNAS, 98(24), 13763–13768.

Watanabe, S. (2009). Algebraic Geometry and Statistical Learning Theory. Cambridge University Press.

Plummer, S. (2026). Observable Algebras and the Geometry of Identifiability. Preprint.

Day 5288 · March 28, 2026