Authors: Hari K. Prakash · Charles H. Martin | Date: Feb 2026 | arXiv:2602.02859
Memorization in neural networks lacks a precise operational definition and is often inferred from the grokking regime, where training accuracy saturates while test accuracy remains very low. We identify a previously unreported third phase of grokking in this training regime: anti-grokking, a late-stage collapse of generalization.
We revisit two canonical grokking setups, a 3-layer MLP trained on a subset of MNIST and a transformer trained on modular addition, but extend training far beyond the standard budget. In both cases, after models transition from pre-grokking to successful generalization, test accuracy collapses back to chance while training accuracy remains perfect, indicating a distinct post-generalization failure mode.
To diagnose anti-grokking, we use the open-source WeightWatcher tool based on HTSR/SETOL theory. The primary signal is the emergence of Correlation Traps: anomalously large eigenvalues beyond the Marchenko–Pastur bulk in the empirical spectral density of shuffled weight matrices, which are predicted to impair generalization. As a secondary signal, anti-grokking corresponds to the average HTSR layer quality metric α deviating from 2.0. Neither metric requires access to the test or training data.
We compare these signals to alternative grokking diagnostics, including ℓ2 norms, Activation Sparsity, Absolute Weight Entropy, and Local Circuit Complexity. These track pre-grokking and grokking but fail to identify anti-grokking. Finally, we show that Correlation Traps can induce catastrophic forgetting and/or prototype memorization, and we observe similar pathologies in large-scale LLMs such as GPT-OSS 20B/120B.
We first replicate the grokking curves (training vs. test accuracy) and then extend the training budget. Observation: after the well-known grokking jump, test accuracy crashes to a plateau near chance. This is the newly characterized anti-grokking phase.
HTSR predicts that well-trained layers exhibit heavy-tailed spectra, while randomized weights follow a clean Marchenko–Pastur (MP) bulk. Below we show the comparison:
(a) Trained layer: heavy-tailed ESD with power-law exponent α.
(b) Randomized layer: MP bulk distribution, used as a noise baseline.
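As a minimal sketch of this comparison (using a synthetic NumPy matrix as a stand-in for a trained layer; on real trained weights the trained ESD develops a heavy tail while the shuffled copy stays inside the MP bulk):

```python
import numpy as np

def esd(W):
    """Empirical spectral density: eigenvalues of X = W^T W / N, N = larger dimension."""
    if W.shape[0] < W.shape[1]:
        W = W.T                      # ensure N x M with N >= M
    N, M = W.shape
    return np.linalg.eigvalsh(W.T @ W / N)

def mp_bulk_edge(W):
    """Upper Marchenko-Pastur edge lambda_+ = sigma^2 * (1 + 1/sqrt(Q))^2, Q = N/M."""
    if W.shape[0] < W.shape[1]:
        W = W.T
    N, M = W.shape
    return np.var(W) * (1.0 + np.sqrt(M / N)) ** 2

rng = np.random.default_rng(0)
W = 0.05 * rng.standard_normal((512, 128))                 # stand-in for a trained layer
W_shuffled = rng.permutation(W.ravel()).reshape(W.shape)   # element-wise shuffle destroys correlations

print("largest trained eigenvalue :", esd(W).max())
print("largest shuffled eigenvalue:", esd(W_shuffled).max())
print("MP bulk edge (shuffled)    :", mp_bulk_edge(W_shuffled))
```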
Shuffling should destroy correlations, yet near anti-grokking we observe one or more huge eigenvalue spikes in the shuffled spectrum: the "Correlation Traps." Their sudden emergence offers a data-free alert that the model is sliding into over-fitting, even when training accuracy remains perfect.
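In practice this check can be run directly with WeightWatcher. The sketch below is a hedged example: the toy MLP is a stand-in for the paper's MLP3 model, and the `randomize`/`mp_fit` options and column names follow the current WeightWatcher documentation and may differ across versions.

```python
import torch.nn as nn
import weightwatcher as ww

# Toy 3-layer MLP as a stand-in for the paper's MLP3 model (architecture is illustrative).
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(),
                      nn.Linear(300, 100), nn.ReLU(),
                      nn.Linear(100, 10))

watcher = ww.WeightWatcher(model=model)
details = watcher.analyze(randomize=True, mp_fit=True)   # shuffles each layer and fits the MP bulk

# Per-layer signals: `alpha` (HTSR power-law exponent) and `rand_num_spikes`
# (shuffled-matrix eigenvalues outside the MP bulk, i.e. Correlation Traps).
# No training or test data are touched at any point.
cols = [c for c in ("layer_id", "name", "alpha", "rand_num_spikes") if c in details.columns]
print(details[cols])
```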
The key empirical finding is that the onset of correlation traps is tightly aligned with the anti-grokking phase in both canonical settings: (i) the MLP3 MNIST setup and (ii) a small transformer trained on Modular Addition. In other words, over-fitting leaves clear signatures directly in the layer weight matrices—visible via shuffled-spectrum spikes—without needing access to data, labels, or accuracy curves.
Figure 4 (MLP3 MNIST): average number of Correlation Traps across training.
Figure 8 (Modular Addition): Correlation Traps for each layer.
We next track α per layer for the MLP3 MNIST experiment. HTSR/SETOL predicts that well-trained layers tend to organize near α ≈ 2, and that departures from this regime reflect degraded spectral quality. Empirically, anti-grokking coincides with α drifting away from the optimal band.
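A minimal per-layer α tracker, assuming the `powerlaw` package for the Clauset-style tail fit (WeightWatcher performs a similar fit internally); run per layer per epoch, this reproduces the α-drift curves:

```python
import numpy as np
import powerlaw  # pip install powerlaw; Clauset-style power-law fitting

def layer_alpha(W):
    """HTSR layer quality: power-law exponent alpha fit to the tail of the layer ESD."""
    N = max(W.shape)
    evals = np.linalg.svd(W, compute_uv=False) ** 2 / N   # eigenvalues of W^T W / N
    fit = powerlaw.Fit(evals)                              # xmin selected automatically
    return fit.power_law.alpha

# HTSR/SETOL: alpha near 2 marks a well-trained layer; drift below 2 (very heavy
# tails) or far above it signals degraded spectral quality. Synthetic demo only:
rng = np.random.default_rng(0)
print(layer_alpha(rng.standard_normal((512, 128))))
```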
In the MLP3 MNIST setting, correlation traps are not merely “large eigenvalues”—they can correspond to interpretable, localized structure in the dominant singular vectors. This supports a concrete mechanism we call Prototype Overfitting: the model collapses from a smooth, global template to a small number of digit-like prototypes, consistent with the observed late-stage generalization collapse.
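A hedged sketch of how one could inspect this in the MNIST setting: take the first-layer weight matrix (input dimension 784), compute its top singular vectors, and reshape the right-singular vectors into 28x28 images. If prototype overfitting has set in, they resemble individual digits rather than generic stroke filters. The `W1` name and the extraction line are placeholders, not code from the paper.

```python
import numpy as np
import matplotlib.pyplot as plt

def show_prototypes(W1, k=5):
    """Plot the top-k right-singular vectors of the (hidden x 784) first-layer weight
    matrix as 28x28 images; digit-like images suggest prototype overfitting."""
    U, S, Vt = np.linalg.svd(W1, full_matrices=False)
    fig, axes = plt.subplots(1, k, figsize=(2 * k, 2))
    for i, ax in enumerate(axes):
        ax.imshow(Vt[i].reshape(28, 28), cmap="gray")
        ax.set_title(f"sv {i}: {S[i]:.2f}")
        ax.axis("off")
    plt.show()

# Placeholder: W1 = model[0].weight.detach().numpy()   # first Linear layer of your trained MLP
# show_prototypes(W1)
```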
Common progress signals (ℓ2 norms, activation sparsity, absolute weight entropy, local circuit complexity) track pre-grokking and grokking but give no warning as the collapse approaches. HTSR's α and correlation traps uniquely warn of the coming failure.
Take-away. Monitoring HTSR α plus correlation-trap spikes supplies a practical, dataset-free early-stopping criterion that can save GPU hours and avoid silent over-fitting in larger models.
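As a hedged sketch of the resulting stopping rule (the threshold values and the WeightWatcher column names are illustrative assumptions, not settings taken from the paper):

```python
import weightwatcher as ww

def should_stop(model, alpha_band=(2.0, 6.0), max_traps=0):
    """Data-free check: flag the run once any layer's HTSR alpha leaves the healthy
    band or a Correlation Trap appears in the shuffled spectrum. Thresholds are
    illustrative, not values prescribed by the paper."""
    details = ww.WeightWatcher(model=model).analyze(randomize=True, mp_fit=True)
    bad_alpha = ((details["alpha"] < alpha_band[0]) |
                 (details["alpha"] > alpha_band[1])).any()
    traps = details["rand_num_spikes"].sum() if "rand_num_spikes" in details.columns else 0
    return bool(bad_alpha) or traps > max_traps

# In the training loop, e.g. every 50 epochs:
#     if should_stop(model): save_checkpoint(model); break
```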
All experiments in the paper — including grokking, anti-grokking, layer-wise α tracking, correlation traps, and α-vs-α dynamics — can be replicated using the publicly available WeightWatcher example notebooks.
The notebook used for the grokking experiments is:
Grokking-MNIST.ipynb — Full Reproducible Grokking Experiment
This notebook trains the small MLP, applies data augmentation, and logs per-epoch HTSR α-metrics — exactly matching the figures in the paper (epoch-wise double descent, α drift, VHT overfitting regime, correlation traps, etc.).