Heavy-Tailed Self-Regularization (HTSR) explains how a layer evolves during training. Each layer passes through 5+1 distinct phases that correspond to different states of learning — and each phase matches a different universality class from Heavy-Tailed Random Matrix Theory (RMT).
For full details of HTSR and these phases, see: Martin & Mahoney, JMLR 2021
WeightWatcher estimates the HTSR phase using the α (alpha) metric, which is extracted from the eigenvalue spectrum of the layer weight matrix W. You don’t need training or test data — α and the ESD tell you which phase the layer is in.
The table below is the WeightWatcher cheat sheet for interpreting α values and spectra.
| Phase | Name | Spectral Shape | Learning State | Typical α |
|---|---|---|---|---|
| 1 | Random-Like | Pure MP bulk (Gaussian-like). | No learning yet; random initialization. | α not meaningful |
| 2 | Bleeding-Out | Bulk edge “bleeds” to the right. | Very early training; tiny correlations. | α unstable |
| 3 | Bulk + Spikes | MP bulk + a few spikes above the edge. | Weak signal learned; most remains random. | α unstable |
| 4 | Bulk Decay / Collapse (Weakly Heavy-Tailed) | Bulk shrinks; early heavy tail forms. | Some learning; not well-trained yet. | 4 ≤ α ≤ 6 |
| 5 | Heavy-Tailed (Fat-Tailed) | Clear heavy tail; log-log linear region. | Well-trained; strong stable correlations. | 2 ≤ α ≤ 4 (Ideal ≈ 2) |
| * | Ideal Heavy-Tailed (α ≈ 2, ERG) | Very clean fat tail; stable correlations. | A unique heavy-tailed universality class at α ≈ 2 (not shown above; see SETOL). | α ≈ 2, ERG satisfied |
| +1 | Rank-Collapse | Spectrum becomes singular; mass at 0. | Degenerate / pathological state. | α undefined / divergent |
pip install weightwatcher
import weightwatcher as ww
watcher = ww.WeightWatcher(model=your_model)
- `watcher.analyze()` computes α and the other layer metrics.
- `watcher.analyze(plot=True)` adds ESD plots.
- `watcher.analyze(plot=True, detX=True)` also checks the detX / ERG condition (see SETOL).
- `watcher.analyze(model=model, base_model=base_model)` compares a fine-tuned model to its base model.
- `watcher.analyze(model=adapter_model)` analyzes an adapter model.
- `watcher.analyze(plot=True, randomize=True)` compares the trained ESD to a randomized one to detect Correlation Traps.
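A minimal end-to-end run might look like the sketch below, assuming your_model is an in-memory Keras or PyTorch model; get_summary() aggregates the per-layer metrics into model-wide averages, if the installed version provides it.

import weightwatcher as ww

watcher = ww.WeightWatcher(model=your_model)

# analyze() returns a pandas DataFrame with one row per analyzed layer,
# including the fitted power-law exponent alpha
details_df = watcher.analyze()
print(details_df[['layer_id', 'alpha']])

# model-wide averages of the per-layer metrics (assumed available)
summary = watcher.get_summary(details_df)
print(summary)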
The 5+1 phases make WeightWatcher's plots interpretable at a glance.

A basic histogram shows the α distribution across layers. Well-trained models cluster with α in the range [2, 4]. Layers with α > 6 look almost Random-Like and are potentially underfit and/or too large. Layers with α < 2 lie in the Very-Heavy-Tailed (VHT) Universality Class and appear to be overfit (as noted by SETOL). Generally speaking, layers with α closer to 2 are better trained.
ax = details_df.alpha.plot.hist(bins=100)
ax.axvline(2, color='red', linestyle='--', linewidth=2)
ax.axvline(6, color='orange', linestyle='--', linewidth=2)
ax.set_xlabel("alpha (α)")
ax.set_ylabel("count")
This is a plot of layer_id vs the HTSR α; stable flows indicate healthy training. Different architectures have different Correlation Flow patterns, as shown in the Nature Communications paper. For LLMs (Llama in particular), we typically find that the K and Q matrices have α ≈ 2.
ax = details_df.plot(x='layer_id',y='alpha')
ax.axhline(2, color='red', linestyle='--')
ax.axhline(6, color='orange', linestyle='--')
ax.set_ylabel("alpha (α)")
ax.set_xlabel("layer index")
One can detect optimal layers by looking for α ≈ 2 and inspecting the log-log ESD plot to confirm the fit is good. To avoid false positives, one should also check that the ERG (detX) condition is satisfied (see SETOL), and that there are no Correlation Traps (see below).
details_df = watcher.analyze(plot=True, detX=True)
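One rough way to shortlist candidate layers is to filter the details dataframe for α near 2 and a good power-law fit. In the sketch below, the 0.5 tolerance is an arbitrary choice and the use of the KS fit-distance column D is an assumption about the installed version; the log-log ESD plots should still be inspected by eye.

# shortlist layers whose fitted alpha is close to the ideal value of 2
near_ideal = details_df[(details_df.alpha - 2.0).abs() < 0.5]
# smaller KS distance D suggests a better power-law fit (column assumed present)
near_ideal = near_ideal.sort_values('D')
print(near_ideal[['layer_id', 'alpha', 'D']])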
Layers with α < 2 often memorize patterns or become unstable. The first few layers of a model sometimes show this, and we think they may be memorizing abstract patterns necessary for generalization. When later layers show it, we take it as evidence of memorization and/or confusion, which makes predictions less effective on out-of-distribution examples.
details_df = watcher.analyze(plot=True)
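To see which layers fall below α = 2, and whether they sit early or late in the network, a simple filter on the details dataframe is enough (a sketch, using the same column names as the examples above):

# layers in the Very-Heavy-Tailed regime (alpha < 2)
vht_layers = details_df[details_df.alpha < 2.0].sort_values('layer_id')
# low alpha in the first few layers may be benign; in later layers it is more concerning
print(vht_layers[['layer_id', 'alpha']])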
Comparing the randomized ESD to the trained ESD reveals distorted spectra and traps. Correlation Traps appear as orange vertical lines, far to the right of the red bulk region. They are evidence of undesirable overfitting. In the details dataframe they are reported in the rand_num_spikes column. They frequently appear when the layer learning rate is too large, but can arise for other reasons as well.
details_df = watcher.analyze(plot=True, randomize=True)
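Since traps are recorded in the rand_num_spikes column mentioned above, the flagged layers can also be listed directly (a minimal sketch):

# layers whose randomized ESD still shows spikes, i.e. possible Correlation Traps
trapped = details_df[details_df.rand_num_spikes > 0]
print(trapped[['layer_id', 'alpha', 'rand_num_spikes']])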
Compare α layer-by-layer between a base model and a fine-tuned model. This shows where fine-tuning improved structure (α moves toward 2) or made it worse (α drifts <2 or >6).
details_df = watcher.analyze(model=fine_tuned_model, base_model=base_model)
import matplotlib.pyplot as plt
details_df[['layer_id','alpha','base_alpha']].plot(x='layer_id')
plt.axhline(2, color='red')
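To quantify which layers fine-tuning actually helped, one can compare each layer's distance from the ideal α ≈ 2 before and after; this is a sketch using the alpha and base_alpha columns plotted above.

# negative delta means fine-tuning moved alpha closer to the ideal value of 2
delta = (details_df.alpha - 2.0).abs() - (details_df.base_alpha - 2.0).abs()
improved = details_df.loc[delta < 0, ['layer_id', 'alpha', 'base_alpha']]
print(improved)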
All of the plot types shown above (α histograms, Correlation Flow, Ideal α≈2 layers, Very Heavy-Tailed overfitting, Correlation Traps, etc.) can be reproduced by training a simple 3-layer MLP under different learning rates and batch sizes. These experiments naturally generate all the HTSR phase behaviors.
See the example notebooks in the WeightWatcher-Examples GitHub Repository.
These notebooks demonstrate how different training hyperparameters drive the layer spectra into the various HTSR phases.
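As a rough illustration, such an experiment can be set up in a few lines; the architecture, layer sizes, and framework below are arbitrary choices, and the training loop is omitted.

import torch.nn as nn
import weightwatcher as ww

# a toy 3-layer MLP; varying the learning rate and batch size during training
# drives the layer spectra through the different HTSR phases
mlp = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),
)

# ... train mlp here with the chosen optimizer, learning rate, and batch size ...

details_df = ww.WeightWatcher(model=mlp).analyze(plot=True)
print(details_df[['layer_id', 'alpha']])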