
Grokking & Generalization Collapse: Insights from HTSR Theory

Authors: Hari K. Prakash · Charles H. Martin  |  Date: June 2025  |  arXiv 2506.04434

Abstract

A depth-3 MLP trained on a 1,000-sample MNIST subset shows the classic grokking signature—late surging test accuracy—followed by a newly identified anti-grokking phase where generalization collapses even as training accuracy remains perfect. Using Heavy-Tailed Self-Regularization (HTSR) diagnostics—chiefly the layer-quality exponent α from WeightWatcher—the study isolates three distinct regimes and introduces Correlation Traps (spectral outliers in shuffled weights) as a data-free early-warning signal of impending collapse.


4 · Results & Analysis (with Integrated Figures)

4.1 Three Training Phases

We first replicate the grokking curves (training vs. test accuracy) and then extend the training budget.
Observation: after the well-known grokking jump, test accuracy crashes to a plateau near chance. This is the newly characterized anti-grokking phase.

Figure 1 – Training and test accuracy across three phases
Figure 1. Accuracy trajectories reveal three phases: pre-grokking (grey), grokking (yellow), and anti-grokking (green).
Training accuracy (red) saturates quickly; test accuracy (purple) first lags, then peaks, then collapses.

4.2 Heavy-Tailed Spectra vs. Random Baseline

HTSR posits that well-trained layers have heavy-tailed eigenvalue spectra.
● Left panel below: the raw weight matrix's empirical spectral density (ESD) fits a power-law tail with exponent α.
● Right panel: the same layer after random element-wise shuffling follows a Marchenko–Pastur (MP) bulk, confirming the learned correlations were destroyed.

Figure 2 – ESD with power-law and MP fits
Figure 2. Power-law vs. MP fits.
The exponent α quantifies how much learned structure the layer contains, while the MP baseline acts as a null model of pure noise.
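
To make this diagnostic concrete, here is a minimal sketch (not the paper's code) of computing a layer's ESD, fitting the power-law exponent α, and comparing against an element-wise shuffled copy. It assumes the open-source `powerlaw` package for the tail fit; the random matrix `W_layer` is only a stand-in for a trained FC layer's weights.

```python
import numpy as np
import powerlaw

def esd(W):
    """Empirical spectral density: eigenvalues of the correlation matrix X = W^T W / N."""
    N = W.shape[0]
    return np.linalg.eigvalsh(W.T @ W / N)

def fit_alpha(eigs):
    """Fit a power-law tail to the ESD and return the exponent alpha."""
    fit = powerlaw.Fit(eigs[eigs > 0])
    return fit.power_law.alpha

rng = np.random.default_rng(0)
W_layer = rng.standard_normal((512, 256))   # stand-in for a trained FC layer's weight matrix

# Element-wise shuffling keeps the element distribution but destroys learned correlations,
# so the shuffled ESD should collapse back toward a Marchenko-Pastur bulk.
W_shuffled = rng.permutation(W_layer.ravel()).reshape(W_layer.shape)

print("alpha (layer)    :", fit_alpha(esd(W_layer)))
print("alpha (shuffled) :", fit_alpha(esd(W_shuffled)))
```

For the Gaussian stand-in both spectra look alike; for a genuinely trained layer, the unshuffled fit should yield a markedly smaller α than its shuffled counterpart.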

4.3 Correlation Traps Signal Over-Correlation

Shuffling should destroy all learned correlations, yet near anti-grokking the shuffled spectra still contain one or more huge eigenvalue spikes—“Correlation Traps”.
Their sudden emergence offers a data-free alert that the model is sliding into memorization.

Figure 3 – Spectral spikes showing correlation traps
Figure 3. Outlier eigenvalues (red spikes) in the shuffled weight spectra just before—and even more so after—generalization collapse.
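
A rough way to flag such traps automatically is sketched below, assuming the simple criterion that an eigenvalue of the shuffled-weight ESD escapes the theoretical Marchenko–Pastur bulk edge; the function name `correlation_traps`, the tolerance factor, and the toy weight matrix are illustrative choices, not the paper's implementation.

```python
import numpy as np

def correlation_traps(W, tol=1.10, seed=0):
    """Return eigenvalues of the shuffled-weight ESD lying beyond the MP bulk edge."""
    rng = np.random.default_rng(seed)
    N, M = W.shape                                    # assumes N >= M
    W_shuf = rng.permutation(W.ravel()).reshape(N, M)
    eigs = np.linalg.eigvalsh(W_shuf.T @ W_shuf / N)
    sigma2 = np.var(W_shuf)                           # element variance sets the MP scale
    lambda_plus = sigma2 * (1 + np.sqrt(M / N)) ** 2  # theoretical MP bulk edge
    return eigs[eigs > tol * lambda_plus]             # tol adds a small safety margin

# Toy check: a handful of unusually large elements survive the shuffle
# (the shuffle preserves the element distribution) and produce outlier eigenvalues.
W = np.random.default_rng(1).standard_normal((784, 200))
W.flat[:3] = 40.0
print("trap eigenvalues:", correlation_traps(W))
```

Because the shuffle preserves the element distribution, any spike that survives it reflects unusually large individual weights rather than learned structure, which is what makes it useful as a data-free alarm.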

4.4 Layer-wise α Trajectories

Tracking α per layer reveals a consistent threshold: once any layer slips below α ≈ 2, anti-grokking follows within a few hundred steps. Earlier work hinted that α ≈ 2 is optimal; anti-grokking confirms it marks the boundary between productive representation building and over-fitting.

Figure 4 – α vs. steps for layers
Figure 4. Average α (top) and per-layer α (FC1, FC2). The dashed line marks α = 2; layers crossing below it predict the imminent test-accuracy crash.
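
Because the paper's α values come from WeightWatcher, a layer-wise monitoring hook could look like the sketch below. Only the standard `WeightWatcher(model=...).analyze()` call and its per-layer `alpha` column are relied on; the depth-3 MLP, the check frequency, and the helper name `layers_below_threshold` are placeholder assumptions.

```python
import torch.nn as nn
import weightwatcher as ww

model = nn.Sequential(                    # stand-in depth-3 MLP, in the spirit of the paper's setup
    nn.Linear(784, 200), nn.ReLU(),
    nn.Linear(200, 200), nn.ReLU(),
    nn.Linear(200, 10),
)

def layers_below_threshold(model, threshold=2.0):
    """Return per-layer rows whose fitted power-law exponent alpha is below the threshold."""
    details = ww.WeightWatcher(model=model).analyze()   # DataFrame with one row per analyzable layer
    return details[details["alpha"] < threshold][["layer_id", "alpha"]]

# Any layer slipping below alpha = 2 is treated as an early warning
# that anti-grokking (generalization collapse) is imminent.
risky = layers_below_threshold(model)
if len(risky):
    print("Warning: alpha < 2 in layers\n", risky)
```

Running this check every few hundred steps, as the trajectories in Figure 4 suggest, turns the α ≈ 2 threshold into a concrete stopping or checkpoint-rollback trigger.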

4.5 Why Competing Metrics Miss Anti-Grokking

Common progress signals—activation sparsity, weight-norm growth, circuit complexity—track grokking but stay flat during collapse. HTSR’s α and correlation traps uniquely warn of the coming failure.

Figure 5 – Alternative metrics across training
Figure 5. Competing metrics plateau once grokking peaks, giving no hint of the catastrophic drop that α and traps forecast.

Take-away. Monitoring HTSR α plus correlation-trap spikes supplies a practical, dataset-free early-stopping criterion that can save GPU hours and avoid silent over-fitting in larger models.


Read the full paper: arXiv:2506.04434 (https://arxiv.org/abs/2506.04434).