Authors: Hari K. Prakash · Charles H. Martin | Date: June 2025 | arXiv 2506.04434
A depth-3 MLP trained on a 1k-sample MNIST subset shows the classic
grokking signature—late surging test accuracy—followed by a newly
identified anti-grokking phase where generalization collapses even as
training accuracy remains perfect. Using Heavy-Tailed Self-Regularization
(HTSR) diagnostics—chiefly the layer-quality exponent α from
WeightWatcher—the study isolates three distinct regimes and
introduces Correlation Traps (spectral outliers in shuffled weights) as a
data-free early-warning signal of impending collapse.
We first replicate the grokking curves (training vs. test accuracy) and
then extend the training budget.
Observation → after the well-known grokking jump, test
accuracy collapses to a plateau near chance. This is the newly
characterized anti-grokking phase.
HTSR posits that well-trained layers have heavy-tailed eigenvalue
spectra.
● Left panel below: the raw weight matrix's empirical spectral density (ESD) fits a power-law tail with exponent α (a minimal fitting sketch follows this list).
● Right panel: the same layer after random element-wise shuffling follows a Marchenko–Pastur (MP) bulk, confirming the shuffle destroyed the correlations.
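To make the α fit concrete, here is a minimal sketch, not the paper's exact pipeline: it builds the correlation matrix X = WᵀW/N, computes its eigenvalues, and fits a power-law tail with the `powerlaw` package (the same fitter WeightWatcher uses internally). The `layer_alpha` helper, the normalization, and the automatic xmin selection are simplifying assumptions.

```python
import numpy as np
import powerlaw

def layer_alpha(weight):
    """Fit a power-law tail to the ESD of one layer's correlation matrix."""
    W = np.asarray(weight, dtype=np.float64)
    if W.shape[0] < W.shape[1]:
        W = W.T                                  # orient so N >= M
    N = W.shape[0]
    evals = np.linalg.eigvalsh(W.T @ W / N)      # ESD of X = W^T W / N
    evals = evals[evals > 1e-12]                 # discard numerical zeros
    fit = powerlaw.Fit(evals, verbose=False)     # xmin chosen by KS distance
    return fit.power_law.alpha, fit.xmin

# A random Gaussian matrix has no heavy tail, so the fitted alpha typically
# comes out large; HTSR associates well-trained layers with roughly 2-6.
alpha, xmin = layer_alpha(np.random.randn(1000, 300))
print(f"alpha = {alpha:.2f}, tail starts at xmin = {xmin:.3g}")
```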
Shuffling should demolish correlations, yet near anti-grokking we observe
one or more huge eigenvalue spikes—“Correlation Traps”.
Their sudden emergence offers a data-free alert that the model is sliding
into memorization.
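To make the detection concrete, here is a minimal sketch, not the paper's exact procedure: shuffle the weight elements, compute the shuffled ESD, and flag any eigenvalue well beyond the Marchenko–Pastur bulk edge λ₊ = σ²(1 + 1/√Q)², with Q = N/M. The `correlation_trap` helper and its `slack` and `n_shuffles` parameters are illustrative assumptions.

```python
import numpy as np

def correlation_trap(W, n_shuffles=5, slack=1.25, seed=0):
    """Return True if a shuffled copy of W still shows an ESD spike
    beyond the Marchenko-Pastur bulk edge (a Correlation Trap)."""
    rng = np.random.default_rng(seed)
    W = np.asarray(W, dtype=np.float64)
    if W.shape[0] < W.shape[1]:
        W = W.T                                       # orient so N >= M
    N, M = W.shape
    Q = N / M
    # MP bulk edge for i.i.d. entries of variance sigma^2:
    lam_plus = W.var() * (1.0 + 1.0 / np.sqrt(Q)) ** 2
    for _ in range(n_shuffles):
        Ws = rng.permutation(W.ravel()).reshape(N, M)  # element-wise shuffle
        evals = np.linalg.eigvalsh(Ws.T @ Ws / N)
        if evals.max() > slack * lam_plus:             # spike outside the bulk
            return True
    return False
```

An element-wise shuffle preserves the distribution of the individual weights, so a spike that survives it traces back to a few anomalously large elements rather than to correlation structure, which is why it reads as a data-free sign of sliding into memorization.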
Tracking α per layer reveals a universal threshold: once any layer's α slips below ≈ 2, anti-grokking follows within a few hundred steps. Earlier HTSR work hinted that α ≈ 2 is optimal; anti-grokking confirms it marks the boundary between productive representation building and over-fitting.
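A minimal monitoring sketch against the public WeightWatcher API (`watcher.analyze()` returns a pandas DataFrame with per-layer 'alpha' values); the 500-step cadence and the `train_one_step` helper are assumptions, not from the paper:

```python
import weightwatcher as ww

def any_alpha_below(model, threshold=2.0):
    """Flag layers whose HTSR alpha has slipped below the threshold."""
    watcher = ww.WeightWatcher(model=model)
    details = watcher.analyze()                 # one row per analyzable layer
    bad = details[details["alpha"] < threshold]
    if len(bad) > 0:
        print(bad[["layer_id", "alpha"]])
        return True
    return False

# Hypothetical training loop with a data-free early-warning check:
# for step in range(max_steps):
#     train_one_step(model)                     # assumed helper
#     if step % 500 == 0 and any_alpha_below(model):
#         break                                 # stop before anti-grokking hits
```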
Common progress signals—activation sparsity, weight-norm growth, circuit complexity—track grokking but stay flat during collapse. HTSR’s α and correlation traps uniquely warn of the coming failure.
Take-away. Monitoring HTSR α plus correlation-trap spikes supplies a practical, dataset-free early-stopping criterion that can save GPU hours and avoid silent over-fitting in larger models.
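As a sketch of how both signals can come from a single call: `watcher.analyze(randomize=True)` also analyzes each layer's element-wise-shuffled ESD. The column names used below ('alpha', 'rand_num_spikes', 'layer_id') follow recent WeightWatcher releases and should be verified against the installed version; `model` is assumed to be the network from your run.

```python
import weightwatcher as ww

watcher = ww.WeightWatcher(model=model)         # model: your trained network
details = watcher.analyze(randomize=True)       # also analyzes shuffled ESDs

# Alarm if any layer is in the very-heavy-tailed regime or shows a trap.
alarm = (details["alpha"] < 2.0) | (details["rand_num_spikes"] > 0)
if alarm.any():
    print("Early warning: possible slide toward anti-grokking")
    print(details.loc[alarm, ["layer_id", "alpha", "rand_num_spikes"]])
```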
Full PDF on arXiv:
Grokking & Generalization Collapse: Insights from HTSR Theory
All experiments in the paper (grokking, anti-grokking, layer-wise α tracking, correlation traps, and α-vs-α̂ dynamics) can be replicated using the publicly available WeightWatcher example notebooks.
The notebook used for the grokking experiments is:
Grokking-MNIST.ipynb — Full Reproducible Grokking Experiment
This notebook trains the small MLP, applies data augmentation, and logs per-epoch HTSR α metrics, exactly matching the figures in the paper (epoch-wise double descent, α drift, the very-heavy-tailed (VHT) over-fitting regime, correlation traps, etc.).