This page summarizes a small experiment originally posted on Reddit: “Observed a sharp epoch-wise double descent in a small MNIST MLP, associated with overfitting the augmented training data”.
We train a simple 3-layer MLP on MNIST using standard “good practice” tricks: light affine augmentation, label smoothing, learning-rate warmup, etc. The model is trained for 100 epochs, with augmentation applied on each batch (small random rotations, translations, and rescalings).
The interesting behavior: the model reaches its best test accuracy fairly early, then the test accuracy declines for a while, even though training accuracy continues to improve. This is a clear example of epoch-wise Double Descent.
The left panel shows the training loss; the right panel shows training vs test accuracy. Test accuracy (orange) rises to a local maximum, dips, then rises again. The augmented training accuracy (blue) climbs and then collapses, without ever reaching 100% on the augmented data (as expected, since the augmentation generates fresh samples on every batch).
To understand what was happening, we used WeightWatcher to analyze the layer weight matrices over the course of training. For each epoch, we computed the HTSR / WeightWatcher power-law layer quality metric α for each layer.
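To illustrate what this metric measures, here is a simplified, self-contained sketch of the α computation: a power-law (Hill-type MLE) fit to the tail of a layer's empirical spectral density. The helper names `power_law_alpha` and `layer_alpha` are hypothetical, and the `xmin` choice here is crude; the actual WeightWatcher library selects `xmin` by minimizing a Kolmogorov–Smirnov distance and exposes all of this via `watcher.analyze()`:

```python
import numpy as np

def power_law_alpha(evals, xmin=None):
    """MLE estimate of the power-law exponent alpha for the tail of an
    eigenvalue distribution: rho(lam) ~ lam^(-alpha) for lam >= xmin."""
    evals = np.asarray(evals, dtype=float)
    if xmin is None:
        xmin = np.median(evals)  # crude choice; WeightWatcher picks xmin via a KS fit
    tail = evals[evals >= xmin]
    return 1.0 + len(tail) / np.sum(np.log(tail / xmin))

def layer_alpha(W):
    """Alpha for one weight matrix W: fit the eigenvalues of the
    correlation matrix W^T W / N (i.e., squared singular values of W)."""
    sv = np.linalg.svd(W, compute_uv=False)
    evals = sv ** 2 / W.shape[0]
    return power_law_alpha(evals)
```

Running `layer_alpha` on each layer's weight matrix at the end of every epoch gives the per-layer α trajectories discussed below.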
At the point of peak test accuracy, every layer's α is close to the theoretically predicted ideal value α ≈ 2; the model is essentially optimal here. As training continues, the α values drop well below 2, precisely when test accuracy starts declining.
The figure shows test accuracy vs α for three layers at different epochs. When α is near 2, test accuracy is high. As α drops into the Very Heavy-Tailed (VHT) regime (α < 2), test accuracy declines and stays lower.
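For reference, the α ranges can be mapped to the qualitative HTSR phases described by Martin and Mahoney. The helper below is a hypothetical sketch of that mapping, with the approximate phase boundaries used in the HTSR literature:

```python
def htsr_regime(alpha: float) -> str:
    """Classify a layer's power-law exponent alpha into approximate
    HTSR phases (boundaries are the rough values from the HTSR literature)."""
    if alpha < 2.0:
        return "VHT: very heavy-tailed, possible over-fitting"
    if alpha <= 4.0:
        return "HT: heavy-tailed, well-trained; ideal is near alpha = 2"
    if alpha <= 6.0:
        return "weakly heavy-tailed"
    return "random-like: under-trained"
```

In this experiment the layers sit in the HT phase (near α ≈ 2) at peak test accuracy, then cross into the VHT phase as training continues.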
The key observation is that the drop in α lines up almost perfectly with overfitting to the augmented training distribution.
Importantly, the model does not fail on the “standard” test set; it continues to do well there. The real out-of-distribution data in this setup is actually the augmented training data. When α < 2, the model can no longer describe this on-the-fly augmented distribution effectively.
This makes the example a nice demonstration of how WeightWatcher and the HTSR/SETOL metrics can detect overfitting to an augmented training distribution, not just to the usual held-out test set.
The full experiment is available in the notebook Epoch-Wise-DoubleDescent.ipynb (from the WeightWatcher examples repo).
This is another example of WeightWatcher in action: linking spectral diagnostics (α, heavy tails) directly to training dynamics and Double Descent behavior in a real model.