WeightWatcher: Data-Free Diagnostics for Deep Learning

The WeightWatcher analysis of the Llama3.2 instruct models (1B, 3B, and 11B) shows:
1. Average Alpha and Dks: Both metrics tend to decrease with model size and are smallest for the 11B model, indicating it is better conditioned and has fewer overfit layers. The exception is the 3B model, whose average alpha is higher than the 1B model's; this is unusual and might reflect training or architecture differences.
2. Alpha Distribution: Most layer alphas fall within the HTSR safe range (2-6), indicating strong stability overall. The 11B model's alphas sit in a tighter range around lower values, which suggests fewer overfit layers, though some layers may be slightly underfit.
In summary, the 11B model shows better stability with less overfitting, while the 3B model’s higher alpha compared to the 1B model is an anomaly worth noting.
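
The comparison above comes down to per-layer alpha statistics checked against the HTSR safe range. The sketch below shows one way such a summary could be computed from a list of layer alphas, for example the `alpha` column of the details table that WeightWatcher's `analyze()` returns. The function name `summarize_alphas` and the returned keys are hypothetical, chosen for illustration; this is not WeightWatcher's own summary code. The labels for alphas outside the range follow the usual HTSR reading (below 2 possibly overfit, above 6 possibly underfit).

```python
from statistics import mean

# HTSR "safe" band quoted in the text above: 2 <= alpha <= 6.
ALPHA_MIN, ALPHA_MAX = 2.0, 6.0


def summarize_alphas(alphas):
    """Summarize per-layer alpha values against the HTSR safe range.

    `alphas` is any iterable of floats, one per analyzed layer.
    The function name and the returned keys are illustrative only.
    """
    alphas = list(alphas)
    if not alphas:
        raise ValueError("need at least one layer alpha")

    below = [a for a in alphas if a < ALPHA_MIN]  # possibly overfit layers
    above = [a for a in alphas if a > ALPHA_MAX]  # possibly underfit layers
    n = len(alphas)

    return {
        "num_layers": n,
        "mean_alpha": mean(alphas),
        "min_alpha": min(alphas),
        "max_alpha": max(alphas),
        "frac_in_range": (n - len(below) - len(above)) / n,
        "num_below": len(below),
        "num_above": len(above),
    }
```

Running this once per model and comparing `mean_alpha`, the min/max spread, and `frac_in_range` side by side would reproduce the kind of comparison made in points 1 and 2.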

Llama3.2 Models