Epoch-Wise Double Descent in a Small MNIST MLP

This page summarizes a small experiment originally posted on Reddit: “Observed a sharp epoch-wise double descent in a small MNIST MLP, associated with overfitting the augmented training data.”

We train a simple 3-layer MLP on MNIST using standard “good practice” tricks: light affine augmentation, label smoothing, learning-rate warmup, etc. The model is trained for 100 epochs, with augmentation applied on each batch (small random rotations, translations, and rescalings).
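The full training script lives in the notebook; as a minimal framework-agnostic sketch, here is how two of the named tricks, label smoothing and learning-rate warmup, are commonly implemented. The specific hyperparameters (`eps=0.1`, 500 warmup steps) are illustrative assumptions, not the notebook's values:

```python
import numpy as np

def smooth_labels(y, num_classes, eps=0.1):
    """One-hot targets with label smoothing: the true class gets
    1 - eps, and the remaining eps mass is spread over all classes."""
    targets = np.full((len(y), num_classes), eps / num_classes)
    targets[np.arange(len(y)), y] += 1.0 - eps
    return targets

def warmup_lr(step, base_lr=1e-3, warmup_steps=500):
    """Linear learning-rate warmup: ramp up to base_lr over
    warmup_steps, then hold base_lr constant."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)
```

Each smoothed target row still sums to 1, and the warmup schedule simply scales the base learning rate during the first few hundred steps.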

The interesting behavior: the model reaches its best test accuracy fairly early, then the test accuracy declines for a while, even though training accuracy continues to improve. This is a clear example of epoch-wise Double Descent.
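This early-peak-then-dip shape can be located programmatically. A toy sketch (the accuracy numbers below are made up for illustration, not taken from the experiment) that finds the first local maximum of a test-accuracy curve, i.e. the early peak before the dip:

```python
import numpy as np

def first_local_max(curve):
    """Index of the first point that is >= its predecessor and
    strictly greater than its successor (the early accuracy peak)."""
    for i in range(1, len(curve) - 1):
        if curve[i] >= curve[i - 1] and curve[i] > curve[i + 1]:
            return i
    return int(np.argmax(curve))  # monotone curve: no early peak

# Made-up curve with a double-descent shape: rise, dip, second rise
test_acc = [0.90, 0.95, 0.97, 0.96, 0.94, 0.95, 0.96, 0.965]
peak_epoch = first_local_max(test_acc)
```

In a genuine double-descent run, the global maximum can land either at this early peak or late in the second rise, which is why looking only at the final accuracy hides the effect.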

Figure: Train loss and accuracy vs epoch for the MNIST MLP (epoch-wise double descent).

The left panel shows the training loss; the right panel shows training vs test accuracy. Test accuracy (orange) rises to a local maximum, dips, then rises again. The augmented training accuracy (blue) climbs and then falls off late in training; as expected, it never reaches 100%, since fresh augmented samples are generated for every batch.

Layer Spectra and α During Training

To understand what was happening, we used WeightWatcher to analyze the layer weight matrices over the course of training. For each epoch, we computed the HTSR / WeightWatcher power-law layer quality metric α for each layer.
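In the notebook this analysis is done with WeightWatcher itself (`watcher.analyze()` returns a per-layer table that includes an `alpha` column). As a rough, self-contained sketch of the underlying idea, and not WeightWatcher's actual fitting procedure, one can estimate a power-law tail exponent from a layer's eigenvalue spectrum with a Hill estimator:

```python
import numpy as np

def esd_tail_alpha(W, tail_frac=0.25):
    """Crude power-law tail exponent of a layer's empirical spectral
    density (eigenvalues of W^T W, i.e. squared singular values),
    via the Hill estimator on the largest eigenvalues.

    Sketch only: WeightWatcher fits a proper power law (including
    choosing x_min) rather than fixing the tail size in advance."""
    evals = np.linalg.svd(W, compute_uv=False) ** 2
    evals = np.sort(evals)[::-1]              # descending order
    k = max(int(len(evals) * tail_frac), 2)   # tail size (assumed 25%)
    tail = evals[:k]
    x_min = tail[-1]
    return 1.0 + k / np.sum(np.log(tail / x_min))
```

Applied to each layer's weight matrix at each epoch, this kind of estimate traces out the per-layer α trajectories discussed below.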

At the point of peak test accuracy, all of the layer α values are close to their theoretically predicted ideal value α ≈ 2. The model is essentially optimal here. As training continues, the α values drop significantly below 2, right when test accuracy starts declining.

Figure: Test accuracy vs α per layer across epochs.

The figure shows test accuracy vs α for three layers at different epochs. When α is near 2, test accuracy is high. As α drops into the Very Heavy-Tailed (VHT) regime (α < 2), test accuracy declines and stays lower.
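Once per-layer α values are in hand, flagging layers that have entered the VHT regime is a one-liner. A sketch with made-up layer names and numbers (in practice these would come from the `alpha` column of `watcher.analyze()`):

```python
# Per-layer alpha values (made-up numbers for illustration; in practice
# these come from WeightWatcher's watcher.analyze() results)
layer_alphas = {"fc1": 1.6, "fc2": 2.1, "fc3": 1.8}

# Flag layers in the Very Heavy-Tailed (VHT) regime, alpha < 2
vht_layers = [name for name, a in layer_alphas.items() if a < 2.0]
```

Tracking this set across epochs shows when each layer drops below the α ≈ 2 threshold, and how that timing lines up with the decline in test accuracy.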

Overfitting the Augmented Training Distribution

The key observation is that the drop in α lines up almost perfectly with the onset of overfitting to the augmented training distribution.

Importantly, it’s not on the “standard” test set that the model fails; it still does well there. The real out-of-distribution data in this setup is the augmented training data itself. When α < 2, the model can no longer describe this on-the-fly augmented distribution effectively.

This makes the example a nice demonstration of how WeightWatcher and the HTSR/SETOL metrics can detect overfitting to an augmented training distribution, not just to the usual held-out test set.

Notebook

The full experiment is available in the notebook Epoch-Wise-DoubleDescent.ipynb (from the WeightWatcher examples repo).

This is another example of WeightWatcher in action: linking spectral diagnostics (α, heavy tails) directly to training dynamics and Double Descent behavior in a real model.