Heavy-Tailed Self-Regularization (HTSR) explains how a layer evolves during training. Each layer passes through 5+1 distinct phases that correspond to different states of learning — and each phase matches a different universality class from Heavy-Tailed Random Matrix Theory (RMT).
For full details of HTSR and these phases, see: Martin & Mahoney, JMLR 2021
WeightWatcher estimates the HTSR phase using the α (alpha) metric, which is extracted from the eigenvalue spectrum of the layer weight matrix W. You don’t need training or test data — α and the ESD tell you which phase the layer is in.
The table below is the WeightWatcher cheat sheet for interpreting α values and spectra.
| Phase | Name | Spectral Shape | Learning State | Typical α |
|---|---|---|---|---|
| 1 | Random-Like | Pure MP bulk (Gaussian-like). | No learning yet; random initialization. | α not meaningful |
| 2 | Bleeding-Out | Bulk edge “bleeds” to the right. | Very early training; tiny correlations. | α unstable |
| 3 | Bulk + Spikes | MP bulk + a few spikes above the edge. | Weak signal learned; most remains random. | α unstable |
| 4 | Bulk Decay / Collapse (Weakly Heavy-Tailed) | Bulk shrinks; early heavy tail forms. | Some learning; not well-trained yet. | 4 ≤ α ≤ 6 |
| 5 | Heavy-Tailed (Fat-Tailed) | Clear heavy tail; log-log linear region. | Well-trained; strong stable correlations. | 2 ≤ α ≤ 4 (Ideal ≈ 2) |
| * | Ideal Heavy-Tailed (α ≈ 2, ERG) | Very clean fat tail; stable correlations. | A unique heavy-tailed universality class at α ≈ 2 (not shown above; see SETOL). | α ≈ 2, ERG satisfied |
| +1 | Rank-Collapse | Spectrum becomes singular; mass at 0. | Degenerate / pathological state. | α undefined / divergent |
pip install weightwatcher
import weightwatcher as ww
watcher = ww.WeightWatcher(model=your_model)
- `watcher.analyze()` computes α and the other layer metrics.
- `watcher.analyze(plot=True)` adds ESD plots.
- `watcher.analyze(plot=True, detX=True)` also checks the detX / ERG condition (see SETOL).
- `watcher.analyze(model=model, base_model=base_model)` compares a fine-tuned model to its base model.
- `watcher.analyze(model=adapter_model)` analyzes an adapter model.
- `watcher.analyze(plot=True, randomize=True)` compares the trained ESD to a randomized one to detect Correlation Traps.
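A minimal end-to-end run might look like the sketch below, assuming your_model is an in-memory Keras or PyTorch model; get_summary() aggregates the per-layer metrics into model-wide averages, if the installed version provides it.

import weightwatcher as ww

watcher = ww.WeightWatcher(model=your_model)

# analyze() returns a pandas DataFrame with one row per analyzed layer,
# including the fitted power-law exponent alpha
details_df = watcher.analyze()
print(details_df[['layer_id', 'alpha']])

# model-wide averages of the per-layer metrics (assumed available)
summary = watcher.get_summary(details_df)
print(summary)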
The 5+1 phases make WeightWatcher's plots interpretable at a glance.

A basic histogram shows the α distribution across layers. Well-trained models cluster with α in the range [2, 4]. Layers with α > 6 look almost Random-Like and are potentially underfit and/or too large. Layers with α < 2 lie in the Very-Heavy-Tailed (VHT) Universality Class and appear to be overfit (as noted by SETOL). Generally speaking, layers with α closer to 2 are better trained.
ax = details_df.alpha.plot.hist(bins=100)
ax.axvline(2, color='red', linestyle='--', linewidth=2)
ax.axvline(6, color='orange', linestyle='--', linewidth=2)
ax.set_xlabel("alpha (α)")
ax.set_ylabel("count")
This is a plot of layer_id vs the HTSR α; stable flows indicate healthy training. Different architectures have different Correlation Flow patterns, as shown in the Nature Communications paper. For LLMs (Llama in particular), we typically find that the K and Q matrices have α ≈ 2.
ax = details_df.plot(x='layer_id',y='alpha')
ax.axhline(2, color='red', linestyle='--')
ax.axhline(6, color='orange', linestyle='--')
ax.set_ylabel("alpha (α)")
ax.set_xlabel("layer index")
One can detect optimal layers by looking for α ≈ 2 and inspecting the log-log ESD plot to confirm the fit is good. To avoid false positives, one should also check that the ERG (detX) condition is satisfied (see SETOL), and that there are no Correlation Traps (see below).
details_df = watcher.analyze(plot=True, detX=True)
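One rough way to shortlist candidate layers is to filter the details dataframe for α near 2 and a good power-law fit. In the sketch below, the 0.5 tolerance is an arbitrary choice and the use of the KS fit-distance column D is an assumption about the installed version; the log-log ESD plots should still be inspected by eye.

# shortlist layers whose fitted alpha is close to the ideal value of 2
near_ideal = details_df[(details_df.alpha - 2.0).abs() < 0.5]
# smaller KS distance D suggests a better power-law fit (column assumed present)
near_ideal = near_ideal.sort_values('D')
print(near_ideal[['layer_id', 'alpha', 'D']])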
Layers with α < 2 often memorize patterns or become unstable. The first few layers of a model sometimes show this, and we think they may be memorizing abstract patterns necessary for generalization. When later layers show it, we take it as evidence of memorization and/or confusion, which makes predictions less effective on out-of-distribution examples.
details_df = watcher.analyze(plot=True)
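To see which layers fall below α = 2, and whether they sit early or late in the network, a simple filter on the details dataframe is enough (a sketch, using the same column names as the examples above):

# layers in the Very-Heavy-Tailed regime (alpha < 2)
vht_layers = details_df[details_df.alpha < 2.0].sort_values('layer_id')
# low alpha in the first few layers may be benign; in later layers it is more concerning
print(vht_layers[['layer_id', 'alpha']])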
Comparing the randomized ESD to the trained ESD reveals distorted spectra and traps. Correlation Traps appear as orange vertical lines, far to the right of the red bulk region. They are evidence of undesirable overfitting. In the details dataframe they are reported in the rand_num_spikes column. They frequently appear when the layer learning rate is too large, but can arise for other reasons as well.
details_df = watcher.analyze(plot=True, randomize=True)
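Since traps are recorded in the rand_num_spikes column mentioned above, the flagged layers can also be listed directly (a minimal sketch):

# layers whose randomized ESD still shows spikes, i.e. possible Correlation Traps
trapped = details_df[details_df.rand_num_spikes > 0]
print(trapped[['layer_id', 'alpha', 'rand_num_spikes']])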
Compare α layer-by-layer between a base model and a fine-tuned model. This shows where fine-tuning improved structure (α moves toward 2) or made it worse (α drifts <2 or >6).
details_df = watcher.analyze(model=fine_tuned_model, base_model=base_model)
import matplotlib.pyplot as plt
details_df[['layer_id','alpha','base_alpha']].plot(x='layer_id')
plt.axhline(2, color='red')
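To quantify which layers fine-tuning actually helped, one can compare each layer's distance from the ideal α ≈ 2 before and after; this is a sketch using the alpha and base_alpha columns plotted above.

# negative delta means fine-tuning moved alpha closer to the ideal value of 2
delta = (details_df.alpha - 2.0).abs() - (details_df.base_alpha - 2.0).abs()
improved = details_df.loc[delta < 0, ['layer_id', 'alpha', 'base_alpha']]
print(improved)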
All of the plot types shown above (α histograms, Correlation Flow, Ideal α≈2 layers, Very Heavy-Tailed overfitting, Correlation Traps, etc.) can be reproduced by training a simple 3-layer MLP under different learning rates and batch sizes. These experiments naturally generate all the HTSR phase behaviors.
See the example notebooks in the WeightWatcher-Examples GitHub Repository.
These notebooks demonstrate how different training hyperparameters drive the layer spectra into the various HTSR phases.
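As a rough illustration, such an experiment can be set up in a few lines; the architecture, layer sizes, and framework below are arbitrary choices, and the training loop is omitted.

import torch.nn as nn
import weightwatcher as ww

# a toy 3-layer MLP; varying the learning rate and batch size during training
# drives the layer spectra through the different HTSR phases
mlp = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),
)

# ... train mlp here with the chosen optimizer, learning rate, and batch size ...

details_df = ww.WeightWatcher(model=mlp).analyze(plot=True)
print(details_df[['layer_id', 'alpha']])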