Late-Stage Generalization Collapse in Grokking: Detecting Anti-Grokking with WeightWatcher

Authors: Hari K. Prakash · Charles H. Martin  |  Date: Feb 2026  |  arXiv:2602.02859

Abstract

Memorization in neural networks lacks a precise operational definition and is often inferred from the grokking regime, where training accuracy saturates while test accuracy remains very low. We identify a previously unreported third phase of grokking in this training regime: anti-grokking, a late-stage collapse of generalization.

We revisit two canonical grokking setups, a 3-layer MLP trained on a subset of MNIST and a transformer trained on modular addition, and extend training far beyond the standard budget. In both cases, after models transition from pre-grokking to successful generalization, test accuracy collapses back to chance while training accuracy remains perfect, indicating a distinct post-generalization failure mode.

To diagnose anti-grokking, we use the open-source WeightWatcher tool based on HTSR/SETOL theory. The primary signal is the emergence of Correlation Traps: anomalously large eigenvalues beyond the Marchenko–Pastur bulk in the empirical spectral density of shuffled weight matrices, which are predicted to impair generalization. As a secondary signal, anti-grokking corresponds to the average HTSR layer quality metric α deviating from 2.0. Neither metric requires access to the test or training data.

We compare these signals to alternative grokking diagnostics, including ℓ2 norms, Activation Sparsity, Absolute Weight Entropy, and Local Circuit Complexity. These track pre-grokking and grokking but fail to identify anti-grokking. Finally, we show that Correlation Traps can induce catastrophic forgetting and/or prototype memorization, and observe similar pathologies in large-scale LLMs such as GPT-OSS 20B/120B.


4 · Results & Analysis (with Integrated Figures)

4.1 Three Training Phases

We first replicate the grokking curves (training vs. test accuracy) and then extend the training budget.
Observation → after the well-known grokking jump, test accuracy collapses to a plateau near chance. This is the newly characterized anti-grokking phase.

Figure 1 – Training and test accuracy across three phases
Figure 1. Accuracy trajectories for the MLP3 MNIST experiment reveal three phases:
• Pre-grokking (grey) • Grokking (yellow) • Anti-grokking (green).
Training accuracy (red) saturates quickly; test accuracy (purple) first lags, then peaks, then collapses.
The Modular Addition transformer experiment shows the same qualitative phase structure.
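
For concreteness, a minimal PyTorch sketch of this kind of extended-training run is shown below. The subset size, layer widths, optimizer, weight decay, and epoch count are illustrative assumptions, not the paper's exact settings; the point is simply to train a small 3-layer MLP on a reduced MNIST subset far past the point where training accuracy saturates, saving checkpoints for later spectral analysis.

```python
# Minimal sketch of an extended-training grokking run on a small MNIST subset.
# Hyperparameters (subset size, widths, weight decay, epochs) are illustrative
# assumptions, not the paper's exact values.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

torch.manual_seed(0)
tfm = transforms.ToTensor()
train_full = datasets.MNIST("data", train=True, download=True, transform=tfm)
train_set = Subset(train_full, range(1000))              # small subset -> grokking regime
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)

model = nn.Sequential(                                   # "MLP3": three Linear layers
    nn.Flatten(),
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(2000):                                # far beyond the usual budget
    for x, y in train_loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    if epoch % 50 == 0:
        torch.save(model.state_dict(), f"ckpt_{epoch}.pt")  # checkpoints for spectral analysis
```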

4.2 Heavy-Tailed Spectra vs. Random Baseline

HTSR predicts that well-trained layers exhibit heavy-tailed spectra, while randomized weights follow a clean Marchenko–Pastur (MP) bulk. Below we show the comparison:

Figure 2 – Trained vs. randomized ESDs
(a) Trained layer: heavy-tailed ESD with power-law exponent α (power-law fit shown).
(b) Randomized layer: MP bulk distribution, used as a noise baseline.
Figure 2. Comparing true vs randomized ESDs: the trained layer shows heavy-tailed structure (α), while the randomized baseline collapses to an MP bulk—confirming that correlations were real and not noise.
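
The comparison can be reproduced with a few lines of NumPy. In the sketch below, the file name fc1_weights.npy is a placeholder for any saved layer weight matrix, and the ESD/MP-edge conventions (eigenvalues of WᵀW/N, aspect ratio Q = N/M, bulk edge σ²(1 + 1/√Q)²) follow the standard HTSR setup; the exact fitting pipeline in the paper may differ.

```python
# Sketch: compare the ESD of a trained layer with an element-shuffled copy.
# The shuffled copy should fall under a Marchenko-Pastur (MP) bulk if the
# original heavy tail reflects real correlations rather than noise.
import numpy as np
import matplotlib.pyplot as plt

W = np.load("fc1_weights.npy")            # placeholder: a saved trained weight matrix
if W.shape[0] < W.shape[1]:               # orient so that N >= M (Q >= 1)
    W = W.T
N, M = W.shape
Q = N / M

def esd(X):
    """Eigenvalues of the layer correlation matrix X^T X / N."""
    return np.linalg.eigvalsh(X.T @ X / N)

W_rand = np.random.permutation(W.ravel()).reshape(W.shape)   # element-wise shuffle
sigma2 = np.var(W_rand)                                      # noise-scale estimate
mp_edge = sigma2 * (1 + 1 / np.sqrt(Q)) ** 2                 # upper MP bulk edge

plt.hist(esd(W), bins=100, density=True, alpha=0.5, label="trained (heavy-tailed)")
plt.hist(esd(W_rand), bins=100, density=True, alpha=0.5, label="shuffled (MP bulk)")
plt.axvline(mp_edge, color="k", ls="--", label="MP bulk edge")
plt.xlabel("eigenvalue"); plt.ylabel("density"); plt.legend(); plt.show()
```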

4.3 Correlation Traps Signal Over-Fitting

Element-wise shuffling of a weight matrix should destroy its correlations, yet near anti-grokking the shuffled spectra still contain one or more large eigenvalue spikes: “Correlation Traps”.
Their sudden emergence offers a data-free alert that the model is sliding into over-fitting, even when training accuracy remains perfect.

Figure 3 – Spectral spikes showing correlation traps
Figure 3. Outlier eigenvalues (red spikes) in the shuffled weight spectra just before—and even more so after—generalization collapse.
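
A simple way to operationalize this signal, under the same conventions as the previous sketch, is to shuffle the weights element-wise a few times and count eigenvalues that land beyond the MP bulk edge. The shuffle count and tolerance below are illustrative choices, not values from the paper.

```python
# Sketch: count "Correlation Traps", i.e. eigenvalues of an element-shuffled
# weight matrix that land beyond the MP bulk edge. Tolerance and shuffle count
# are illustrative assumptions.
import numpy as np

def count_traps(W, n_shuffles=10, tol=1.05, seed=0):
    """Average number of shuffled-spectrum eigenvalues beyond the MP bulk edge."""
    rng = np.random.default_rng(seed)
    if W.shape[0] < W.shape[1]:
        W = W.T
    N, M = W.shape
    Q = N / M
    counts = []
    for _ in range(n_shuffles):
        W_rand = rng.permutation(W.ravel()).reshape(W.shape)
        evals = np.linalg.eigvalsh(W_rand.T @ W_rand / N)
        mp_edge = np.var(W_rand) * (1 + 1 / np.sqrt(Q)) ** 2
        counts.append(int(np.sum(evals > tol * mp_edge)))    # spikes beyond the bulk
    return float(np.mean(counts))

# e.g. count_traps(np.load("fc1_weights.npy")) > 0 flags a Correlation Trap
```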

Correlation Traps Track Anti-Grokking Across Tasks

The key empirical finding is that the onset of correlation traps is tightly aligned with the anti-grokking phase in both canonical settings: (i) the MLP3 MNIST setup and (ii) a small transformer trained on Modular Addition. In other words, over-fitting leaves clear signatures directly in the layer weight matrices—visible via shuffled-spectrum spikes—without needing access to data, labels, or accuracy curves.

Figure 4 – Avg. randomized spikes with phases (MLP3 MNIST)
Figure 4 (MLP3 MNIST). Average number of Correlation Traps across training, with the three phases marked.
Figure 8 – Traps and accuracy (Modular Addition transformer)
Figure 8 (Modular Addition). Correlation Traps for each layer, shown alongside training and test accuracy.
Take-away: across architectures and tasks, the emergence of correlation traps is directly correlated with the anti-grokking regime, indicating that late-stage over-fitting here has a spectral “fingerprint” in the learned weight matrices.
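
In practice, this kind of data-free monitoring can be run directly with the WeightWatcher tool. A sketch is below; the exact keyword arguments and details-DataFrame column names (e.g. the randomized-spike count) vary across WeightWatcher versions, so treat the names here as assumptions and check details.columns on your install.

```python
# Sketch: data-free monitoring of saved checkpoints with WeightWatcher.
# Column/summary names may differ between versions; inspect details.columns.
import torch
import weightwatcher as ww

def analyze_checkpoint(path, make_model):
    model = make_model()                        # rebuild the architecture (user-supplied)
    model.load_state_dict(torch.load(path))
    watcher = ww.WeightWatcher(model=model)
    details = watcher.analyze(randomize=True)   # also analyzes element-randomized layers
    summary = watcher.get_summary(details)
    return details, summary

# e.g. scan the checkpoints saved during training:
# for epoch in range(0, 2000, 50):
#     details, summary = analyze_checkpoint(f"ckpt_{epoch}.pt", make_model)
#     print(epoch, summary.get("alpha"), details.get("rand_num_spikes"))
```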

4.4 Layer-wise α Trajectories (MLP3 MNIST)

We next track α per layer for the MLP3 MNIST experiment. HTSR/SETOL predicts that well-trained layers tend to organize near α ≈ 2, and that departures from this regime reflect degraded spectral quality. Empirically, anti-grokking coincides with α drifting away from the optimal band.

Figure – α vs. steps for MLP3 MNIST layers
Layer-wise α (MLP3 MNIST). Average α (top) and per-layer α (FC1, FC2). Deviations from the α≈2 regime accompany the transition into anti-grokking.
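
For readers who want to compute α without WeightWatcher, the sketch below estimates it as the tail exponent of a layer's ESD using the powerlaw package. The automatic xmin selection here will not necessarily match WeightWatcher's internal fitting procedure, so the numbers may differ slightly from the reported α values.

```python
# Sketch: estimate the HTSR layer quality metric alpha as the power-law tail
# exponent of a layer's ESD, using the `powerlaw` package (pip install powerlaw).
import numpy as np
import powerlaw

def layer_alpha(W):
    """Tail exponent of the ESD of W (roughly what WeightWatcher reports as alpha)."""
    if W.shape[0] < W.shape[1]:
        W = W.T
    N, _ = W.shape
    evals = np.linalg.eigvalsh(W.T @ W / N)
    fit = powerlaw.Fit(evals[evals > 0])        # selects xmin and fits the tail exponent
    return fit.power_law.alpha

# e.g. track layer_alpha(fc1_W) and layer_alpha(fc2_W) per checkpoint;
# drift away from alpha ~ 2 accompanies the transition into anti-grokking.
```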

Prototype Overfitting: Traps with Interpretable Structure (MLP3 MNIST)

In the MLP3 MNIST setting, correlation traps are not merely “large eigenvalues”—they can correspond to interpretable, localized structure in the dominant singular vectors. This supports a concrete mechanism we call Prototype Overfitting: the model collapses from a smooth, global template to a small number of digit-like prototypes, consistent with the observed late-stage generalization collapse.

Figure 7 – Principal right singular vector of W1 in pixel space across the three phases
(a) Pre-grokking: unstructured noise.
(b) Grokking: smooth, global ring-like template.
(c) Anti-grokking: localized, digit-like prototype.
Figure 7. Largest right singular vector v(1) of W1 in pixel space evolves from (i) unstructured noise (pre-grokking), to (ii) a smooth global template (grokking), to (iii) localized digit-shaped prototypes (anti-grokking). This provides an interpretable example of how correlation traps can manifest as prototype memorization in the weights.
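
This visualization is easy to reproduce: take the SVD of the first-layer weight matrix and reshape the top right singular vector back to the 28×28 input grid. The sketch below assumes a PyTorch Linear(784, hidden) first layer stored as a (hidden, 784) NumPy array.

```python
# Sketch: inspect the largest right singular vector of W1 in pixel space.
import numpy as np
import matplotlib.pyplot as plt

def plot_top_right_singular_vector(W1):
    """Show v^(1) of W1 reshaped to 28x28.
    W1: (hidden, 784) array, e.g. model[1].weight.detach().numpy() in the earlier sketch."""
    _, _, Vt = np.linalg.svd(W1, full_matrices=False)
    v1 = Vt[0].reshape(28, 28)                # back to the 28x28 MNIST input grid
    plt.imshow(v1, cmap="bwr")
    plt.title("largest right singular vector of W1 (pixel space)")
    plt.colorbar()
    plt.show()
```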

4.5 Why Competing Metrics Miss Anti-Grokking

Common progress signals—activation sparsity, weight-norm growth, circuit complexity—track grokking but stay flat during collapse. HTSR’s α and correlation traps uniquely warn of the coming failure.

Figure 5 – Alternative metrics across training
Figure 5. Competing metrics plateau once grokking peaks, giving no hint of the catastrophic drop that α and traps forecast.
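
For reference, minimal versions of some of these competing metrics are sketched below (ℓ2 norm, absolute weight entropy, and activation sparsity). These are common, simple variants and may not match the paper's exact definitions; note also that activation sparsity requires a data batch, unlike the spectral signals.

```python
# Sketch implementations of the competing metrics; definitions are simple,
# common variants and may differ from the paper's exact choices.
import torch
import torch.nn as nn

def l2_norm(model):
    """Sum of squared weights across all parameters."""
    return sum((p ** 2).sum().item() for p in model.parameters())

def weight_entropy(model, bins=100):
    """Entropy of the histogram of absolute weight values."""
    w = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    hist = torch.histc(w, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * p.log()).sum())

def activation_sparsity(model, x, threshold=0.0):
    """Fraction of post-ReLU activations at/below threshold on a batch x.
    Assumes an nn.Sequential model, as in the training sketch above."""
    fractions = []
    h = x
    for layer in model:
        h = layer(h)
        if isinstance(layer, nn.ReLU):
            fractions.append((h <= threshold).float().mean().item())
    return sum(fractions) / len(fractions)
```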

Take-away. Monitoring HTSR α plus correlation-trap spikes supplies a practical, dataset-free early-stopping criterion that can save GPU hours and avoid silent over-fitting in larger models.
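
A minimal sketch of such a stopping rule is shown below; the thresholds are illustrative assumptions, and in practice one would roll back to the last checkpoint that still satisfies both conditions.

```python
# Sketch of a data-free early-stopping rule built on the two spectral signals:
# flag the run when correlation traps appear or when the average alpha drifts
# too far from 2. Thresholds are illustrative, not values from the paper.
def should_stop(avg_alpha, avg_trap_count,
                alpha_target=2.0, alpha_tol=0.5, trap_tol=0.0):
    alpha_drift = abs(avg_alpha - alpha_target) > alpha_tol
    traps_present = avg_trap_count > trap_tol
    return alpha_drift or traps_present

# e.g., with the WeightWatcher summary and a trap count from the earlier sketches:
# if should_stop(summary.get("alpha", 2.0), trap_count):
#     print("spectral warning: likely entering anti-grokking; stop or roll back")
```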


Reproduce the Experiments

All experiments in the paper — including grokking, anti-grokking, layer-wise α tracking, correlation traps, and α-vs-α dynamics — can be replicated using the publicly available WeightWatcher example notebooks.

The notebook used for the grokking experiments is:

Grokking-MNIST.ipynb — Full Reproducible Grokking Experiment

This notebook trains the small MLP, applies data augmentation, and logs per-epoch HTSR α-metrics — exactly matching the figures in the paper (epoch-wise double descent, α drift, VHT overfitting regime, correlation traps, etc.).