WeightWatcher (w|w) is an open-source diagnostic tool for analyzing Deep Neural Networks (DNNs), without needing access to training or even test data. It is based on theoretical research into Why Deep Learning Works, using the new Theory of Heavy-Tailed Self-Regularization (HT-SR), published in JMLR and Nature Communications.
The weightwatcher HT-SR theory tells us if and when a specific DNN layer has converged properly; it is the only theory capable of this. When you run watcher.analyze(), you obtain a pandas dataframe containing several layer quality metrics. In particular, the weightwatcher alpha metric can tell us whether a layer is well trained. Specifically, the layer quality metric alpha should be between 2 and 6.
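As a minimal sketch of how the 2 to 6 rule might be applied, the snippet below flags layers whose alpha falls outside that range. The helper names layer_alphas and flag_layers are hypothetical, the example alphas are made up for illustration, and the sketch assumes the dataframe returned by watcher.analyze() has layer_id and alpha columns.

```python
def layer_alphas(model):
    """Return {layer_id: alpha} for a model supported by weightwatcher.

    Assumes the details dataframe has 'layer_id' and 'alpha' columns.
    """
    import weightwatcher as ww  # imported lazily; only needed for real models

    watcher = ww.WeightWatcher(model=model)
    details = watcher.analyze()  # pandas dataframe, one row per analyzed layer
    return dict(zip(details["layer_id"], details["alpha"]))


def flag_layers(alphas, low=2.0, high=6.0):
    """Split layers into those inside and outside the [low, high] alpha range."""
    ok = {lid: a for lid, a in alphas.items() if low <= a <= high}
    # NaN alphas fail the range check above, so they end up here as well.
    suspect = {lid: a for lid, a in alphas.items() if not (low <= a <= high)}
    return ok, suspect


# Made-up alphas for illustration only; in practice use layer_alphas(model).
example = {0: 3.1, 1: 2.4, 2: 6.8, 3: 1.7, 4: 4.9}
ok, suspect = flag_layers(example)
print("in range:", sorted(ok))
print("outside range:", sorted(suspect))
```

The thresholds are parameters so that a stricter cutoff can be tried without changing the code.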
Here we have run weightwatcher on two of the recently popular Bloom models and plotted a histogram of the layer alphas. Notice that both models have several layers with alpha > 6; this is not great.
If you see such layers in your model, you may need to decrease the size of the layers or add data, and/or do a better job of optimizing your hyperparameters.
The best models have layer alphas that lie between 2 and 6. This can be seen by comparing the layer alphas for BERT, RoBERTa, and XLNet.
The WeightWatcher Power-Law (PL) metric alpha is a DNN model quality metric; smaller is better.
The plot above displays all the layer alpha values for the 3 models. It is immediately clear that the XLNet layers look much better than those of BERT or RoBERTa: the XLNet alpha values are smaller on average, and there are no alphas larger than 5, whereas the BERT and RoBERTa alphas are much larger on average, and both models have too many large alphas.
This is consistent with the published results: in the original XLNet paper (from Carnegie Mellon University and Google Brain), XLNet outperforms BERT on 20 different NLP tasks.
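The kind of comparison described above can be reduced to a few summary numbers per model. The sketch below is an illustration only: the summarize helper is hypothetical, and the alpha lists are placeholders, not measured values for any of the models named here.

```python
from statistics import mean


def summarize(alphas, threshold=5.0):
    """Summarize a model's layer alphas: mean, max, and count above threshold."""
    return {
        "mean_alpha": mean(alphas),
        "max_alpha": max(alphas),
        "n_large": sum(a > threshold for a in alphas),
    }


# Placeholder layer alphas, NOT measurements of real models.
models = {
    "model_a": [2.5, 3.0, 3.5, 4.0],
    "model_b": [3.0, 4.5, 6.5, 8.0],
}

for name, alphas in models.items():
    s = summarize(alphas)
    print(f"{name}: mean={s['mean_alpha']:.2f} max={s['max_alpha']:.2f} large={s['n_large']}")
```

With real data, each list would be the alpha column of the dataframe returned by watcher.analyze() for that model.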
Weightwatcher provides several different layer quality metrics, such as alpha. From these, we can make a model quality metric by simply taking a layer average.
One particularly useful model metric is the average alpha-hat, which is a weighted average of the weightwatcher alpha layer quality metric. The average alpha-hat metric is correlated with the reported test accuracies for many production Computer Vision (CV) models like the VGG series, the ResNet series, etc.
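A rough sketch of how such a weighted average could be computed is shown below. It assumes the weighting used in the published papers, where each layer's alpha is multiplied by the log10 of that layer's largest eigenvalue (lambda_max) before averaging; the text above does not spell this out, so treat the formula, the function name average_alpha_hat, and the input values as assumptions for illustration.

```python
import math


def average_alpha_hat(alphas, lambda_maxes):
    """Layer average of alpha * log10(lambda_max).

    Assumed weighting: log10 of each layer's largest eigenvalue.
    """
    if len(alphas) != len(lambda_maxes):
        raise ValueError("alphas and lambda_maxes must have the same length")
    if not alphas:
        raise ValueError("need at least one layer")
    weighted = [a * math.log10(lm) for a, lm in zip(alphas, lambda_maxes)]
    return sum(weighted) / len(weighted)


# Made-up per-layer values, for illustration only.
alphas = [2.0, 3.0, 4.0]
lambda_maxes = [10.0, 100.0, 1000.0]
print(average_alpha_hat(alphas, lambda_maxes))
```

Because the weight depends on the scale of each layer's spectrum, two models with identical alphas can still differ in average alpha-hat.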
The weightwatcher Nature Communications paper shows that the average alpha-hat metric is remarkably well correlated with test accuracies for over 100 different CV models. Here, we show how the average alpha-hat metric tracks the reported top-1 (and top-5) test accuracies for the open-source VGG models.
And, again, this does NOT require access to the test or even the training data!
You can reproduce this yourself using this Notebook.
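To check how well a model metric tracks reported accuracy across a model series, one simple option is a Pearson correlation. The sketch below is not taken from the Notebook; the pearson helper is written out by hand, and the numbers are placeholders chosen for illustration, not VGG results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Placeholder values, NOT real measurements: one entry per model in a series.
avg_alpha_hat = [4.0, 3.5, 3.0, 2.5]
top1_accuracy = [70.0, 72.0, 73.0, 75.0]

r = pearson(avg_alpha_hat, top1_accuracy)
print(f"Pearson r = {r:.3f}")
```

In this made-up series accuracy rises as the metric falls, so r comes out negative; with real data, the per-model metric would come from weightwatcher and the accuracies from the published model cards.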
Plotting the layer id against the layer quality metric alpha gives what we call the Correlation Flow. In models with less optimal architectures, the layer alphas may increase with layer id, as with the VGG models, and may even behave more erratically.
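One crude way to summarize a Correlation Flow numerically is the least-squares slope of alpha against layer id. The sketch below uses a hypothetical flow_slope helper and made-up alpha profiles; a single slope will not capture erratic behavior, so it complements the plot rather than replacing it.

```python
def flow_slope(layer_ids, alphas):
    """Least-squares slope of alpha as a function of layer id."""
    if len(layer_ids) != len(alphas) or len(layer_ids) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(layer_ids)
    mx = sum(layer_ids) / n
    my = sum(alphas) / n
    num = sum((x - mx) * (y - my) for x, y in zip(layer_ids, alphas))
    den = sum((x - mx) ** 2 for x in layer_ids)
    return num / den


# Made-up profiles for illustration only.
layer_ids = [0, 1, 2, 3, 4]
flat_profile = [3.0, 3.2, 2.9, 3.1, 3.0]
rising_profile = [2.5, 3.0, 3.5, 4.0, 4.5]

print("flat slope:", round(flow_slope(layer_ids, flat_profile), 3))
print("rising slope:", round(flow_slope(layer_ids, rising_profile), 3))
```

A clearly positive slope corresponds to alphas that increase with layer id, the pattern described above for the VGG models.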
Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning
Charles H. Martin & Michael W. Mahoney; JMLR 22(165):1-73, 2021
Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data
Charles H. Martin, Tongsu (Serena) Peng & Michael W. Mahoney; Nature Communications 12(4122), 2021
SETOL: SemiEmpirical Theory of (Deep) Learning
Charles H. Martin & Michael W. Mahoney; (in press)