Authors: Hari K. Prakash · Charles H. Martin | Date: June 2025 | arXiv 2506.04434
A depth-3 MLP trained on a 1k-sample MNIST subset shows the classic grokking signature (late-surging test accuracy), followed by a newly identified anti-grokking phase in which generalization collapses even as training accuracy remains perfect. Using Heavy-Tailed Self-Regularization (HTSR) diagnostics, chiefly the layer-quality exponent α computed by WeightWatcher, the study isolates three distinct training regimes and introduces Correlation Traps (spectral outliers in element-wise-shuffled weights) as a data-free early-warning signal of impending collapse.
We first replicate the grokking curves (training vs. test accuracy), then extend the training budget far past the grokking point.
Observation → after the well-known grokking jump, test accuracy crashes to a plateau near chance level. This is the newly characterized anti-grokking phase.
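A minimal PyTorch sketch of this replication setup follows. The hidden width, optimizer, loss, and training length are placeholder assumptions for illustration; the paper's exact hyperparameters (initialization scale, loss function, weight decay) may differ.

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms

torch.manual_seed(0)

# 1k-sample MNIST subset
mnist = datasets.MNIST(".", train=True, download=True,
                       transform=transforms.ToTensor())
subset = torch.utils.data.Subset(mnist, range(1000))
loader = torch.utils.data.DataLoader(subset, batch_size=200, shuffle=True)

# Depth-3 MLP; the 200-unit hidden width is an assumption
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 200), nn.ReLU(),
    nn.Linear(200, 200), nn.ReLU(),
    nn.Linear(200, 10),
)

opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Train far past 100% train accuracy: grokking appears late, and
# anti-grokking only shows up when the budget is extended further still.
for epoch in range(10_000):
    for x, y in loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
```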
HTSR posits that well-trained layers develop heavy-tailed eigenvalue spectra in their weight correlation matrices.
● Left panel of the paper's ESD figure: the raw weight ESD fits a power-law tail with exponent α.
● Right panel: the same layer after random element-wise shuffling follows a Marchenko–Pastur (MP) bulk, confirming that the correlations were destroyed (a numpy sketch of both panels follows).
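A rough numpy sketch of the two-panel diagnostic, under stated assumptions: the Hill estimator below is a crude stand-in for WeightWatcher's actual power-law fitting procedure, and the random matrix stands in for a trained layer's weights.

```python
import numpy as np

def esd(W):
    """Eigenvalues of the correlation matrix X = W^T W / N (N = rows of W)."""
    N = W.shape[0]
    return np.linalg.svd(W, compute_uv=False) ** 2 / N

def hill_alpha(eigs, k=50):
    """Crude Hill estimate of the power-law tail exponent from the top-k eigenvalues."""
    tail = np.sort(eigs)[-k:]          # top-k eigenvalues, ascending
    return 1.0 + k / np.sum(np.log(tail / tail[0]))

# Stand-in for a trained layer's weight matrix (use the real one in practice)
W = np.random.randn(784, 200)

# Left panel: power-law tail fit on the raw weights
print("alpha (as trained):", hill_alpha(esd(W)))

# Right panel: element-wise shuffling destroys correlations, so the
# shuffled ESD should collapse to an MP bulk (large alpha, no heavy tail)
W_shuffled = np.random.permutation(W.ravel()).reshape(W.shape)
print("alpha (shuffled):", hill_alpha(esd(W_shuffled)))
```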
Shuffling should demolish all correlations, yet near anti-grokking the shuffled ESD exhibits one or more huge eigenvalue spikes: the Correlation Traps. Their sudden emergence offers a data-free alert that the model is sliding into memorization.
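In practice both diagnostics come from the WeightWatcher library. A minimal sketch, assuming a recent weightwatcher release; the randomize option and column names such as rand_num_spikes exist in current versions but may vary.

```python
import weightwatcher as ww

# model is the MLP from the training sketch above
watcher = ww.WeightWatcher(model=model)
details = watcher.analyze(randomize=True)  # also fits element-wise-shuffled weights

# Per-layer power-law exponents alpha
print(details[["layer_id", "alpha"]])

# Correlation traps: spikes surviving in the shuffled (randomized) ESD
traps = details[details["rand_num_spikes"] > 0]
if len(traps):
    print("Correlation trap(s) in layers:", traps["layer_id"].tolist())
```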
Tracking α per layer reveals a universal threshold: once any layer slips below α ≈ 2, anti-grokking follows within a few hundred steps. Earlier HTSR work hinted that α ≈ 2 is optimal; anti-grokking confirms that it marks the boundary between productive representation building and over-fitting.
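This suggests a simple monitoring rule. A hedged sketch, reusing the WeightWatcher call above; only the α < 2 threshold comes from the paper, while the polling interval and checkpoint name are illustrative assumptions.

```python
ALPHA_FLOOR = 2.0  # threshold reported in the paper

def any_layer_below_floor(model):
    details = ww.WeightWatcher(model=model).analyze()
    return bool((details["alpha"].dropna() < ALPHA_FLOOR).any())

# Inside the training loop, e.g. every 500 epochs (interval is an assumption):
# if epoch % 500 == 0 and any_layer_below_floor(model):
#     torch.save(model.state_dict(), "pre_anti_grokking.pt")  # hypothetical name
#     break
```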
Common progress signals (activation sparsity, weight-norm growth, circuit complexity) track grokking but stay flat during the collapse. HTSR's α and the correlation traps are the only signals that warn of the coming failure.
Take-away. Monitoring HTSR α plus correlation-trap spikes supplies a practical, dataset-free early-stopping criterion that can save GPU hours and avoid silent over-fitting in larger models.