WeightWatcher (w|w) is an open-source diagnostic tool for analyzing Deep Neural Networks (DNN), without needing access to training or even test data. It is based on theoretical research into Why Deep Learning Works, using the new Theory of Heavy-Tailed Self-Regularization (HT-SR), published in JMLR, Nature Communications, and NeurIPS 2023.
The weightwatcher HTSR theory tells us if and when a specific DNN layer has converged properly; it is the only theory capable of this. When running watcher.analyze(), you will obtain a pandas dataframe containing several layer quality metrics. In particular, the weightwatcher alpha metric can tell us whether a layer is well trained or not. Specifically, the layer quality metric alpha should be between 2 and 6.
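The 2-to-6 rule can be checked mechanically once the per-layer results are available. The following is a minimal sketch, not part of weightwatcher itself: flag_layers is a hypothetical helper, the column names layer_id and alpha are assumptions about the dataframe layout, and the example values are made up. It works on plain dictionaries, such as those produced by calling to_dict("records") on a pandas dataframe.

```python
# Hypothetical helper (not a weightwatcher function): flag layers whose
# alpha falls outside the recommended range of 2 to 6.
# `rows` stands in for the per-layer results, for example
# details.to_dict("records") on the dataframe returned by watcher.analyze().
# The keys "layer_id" and "alpha" are assumed column names.

def flag_layers(rows, low=2.0, high=6.0):
    """Return (layer_id, alpha) pairs whose alpha lies outside [low, high]."""
    flagged = []
    for row in rows:
        alpha = row["alpha"]
        if alpha < low or alpha > high:
            flagged.append((row["layer_id"], alpha))
    return flagged


# Made-up values, for illustration only.
example_rows = [
    {"layer_id": 0, "alpha": 3.1},
    {"layer_id": 1, "alpha": 6.8},
    {"layer_id": 2, "alpha": 1.7},
    {"layer_id": 3, "alpha": 4.4},
]

print(flag_layers(example_rows))  # [(1, 6.8), (2, 1.7)]
```

Layers returned by this check are the ones worth inspecting first.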
Here we have run weightwatcher on 2 of the recently popular Bloom models. We plot a histogram of the layer alpha values. Notice that both models have several layers with alpha > 6; this is not great. If you see such layers in your model, you may need to decrease the size of the layers or add data, and/or do a better job of optimizing your hyperparameters.
The best models have layer alphas that lie between 2 and 6. This can be seen by comparing the layer alphas for BERT and XLNet.
The WeightWatcher Power-Law (PL) metric alpha is a DNN model quality metric; smaller is better. This plot displays all the layer alpha values for these 2 popular models. It is immediately clear that the XLNet layers look much better than the BERT layers: the alpha values are smaller on average, and there are no alphas larger than 5 (alpha <= 5). In contrast, the BERT alphas are much larger on average, and the model has too many large alphas.
This is totally consistent with the published results: in the original paper (from Carnegie Mellon University and Google Brain), XLNet outperforms BERT on 20 different NLP tasks.
Weightwatcher provides several different layer quality metrics, such as alpha, alpha-hat, etc. From these, we can make a model quality metric by simply taking a layer average. One particularly useful model metric is the average alpha-hat, which is a weighted average of the weightwatcher alpha layer quality metric.
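To make the weighted average concrete, here is a small sketch. It assumes the weighting described in the Nature Communications paper, where each layer's alpha is multiplied by the base-10 logarithm of that layer's largest eigenvalue (lambda_max) before averaging over layers. The function name average_alpha_hat and the input values are hypothetical; weightwatcher computes its own version of this metric internally.

```python
import math

# Hypothetical helper (not a weightwatcher function).
# Assumption: alpha-hat for a layer is alpha * log10(lambda_max), and the
# model-level metric is the plain mean of alpha-hat over all layers.

def average_alpha_hat(layers):
    """Mean over layers of alpha * log10(lambda_max).

    `layers` is a sequence of (alpha, lambda_max) pairs, one per layer.
    """
    if not layers:
        raise ValueError("need at least one layer")
    total = 0.0
    for alpha, lambda_max in layers:
        total += alpha * math.log10(lambda_max)
    return total / len(layers)


# Made-up values, for illustration only.
example_layers = [(3.0, 100.0), (4.0, 10.0)]

# (3.0 * 2 + 4.0 * 1) / 2, so approximately 5.0
print(average_alpha_hat(example_layers))
```

Because the weights depend only on each layer's weight matrix, nothing in this calculation needs training or test data.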
The weightwatcher average alpha-hat metric is correlated with the reported test accuracies for many production Computer Vision (CV) models like the VGG series, the ResNet series, etc. The weightwatcher Nature Communications paper shows that the average alpha-hat metric is remarkably well correlated with test accuracies for over 100 different CV models. Here, we show how the average alpha-hat metric tracks the reported top 1 (and top 5) test accuracies for the open-source VGG models. And, again, this does NOT require access to the test or even the training data! You can reproduce this yourself using this Notebook.
Plotting the layer id against the layer quality metric alpha gives a plot we call the Correlation Flow. In models with less optimal architectures, the layer alphas may increase with layer id, as with the VGG models, and may even behave more erratically.
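One rough way to quantify whether alpha increases with layer id is to fit a straight line through the (layer id, alpha) points and look at its slope. This is only an illustrative sketch: alpha_trend is a hypothetical helper, not something weightwatcher provides, and the values are made up.

```python
# Hypothetical helper (not a weightwatcher function): ordinary least-squares
# slope of alpha against layer id. A clearly positive slope suggests that
# alpha grows with depth.

def alpha_trend(layer_ids, alphas):
    """Return the least-squares slope of alpha as a function of layer id."""
    n = len(layer_ids)
    if n < 2 or n != len(alphas):
        raise ValueError("need two or more points, with matching lengths")
    mean_x = sum(layer_ids) / n
    mean_y = sum(alphas) / n
    sxx = sum((x - mean_x) ** 2 for x in layer_ids)
    if sxx == 0:
        raise ValueError("layer ids must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(layer_ids, alphas))
    return sxy / sxx


# Made-up values, for illustration only: alpha rises by 1 per layer.
slope = alpha_trend([0, 1, 2, 3], [2.0, 3.0, 4.0, 5.0])
print(slope)  # approximately 1.0
```

A single slope cannot capture erratic behavior, so it complements the plot rather than replacing it.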
Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning
Charles H. Martin, Michael W. Mahoney; JMLR 22(165):1−73, 2021
Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data
Charles H. Martin, Tongsu (Serena) Peng & Michael W. Mahoney; Nature Communications 12(4122), 2021
NeurIPS 2023 Invited Talk
Workshop on Heavy Tails in ML: Structure, Stability, Dynamics
Presentation starts at 6:45 min
Reach out for an early draft of the upcoming monograph:
SETOL: SemiEmpirical Theory of (Deep) Learning
The weightwatcher tool has been developed by Calculation Consulting. We provide consulting to companies looking to implement Data Science, Machine Learning, and/or AI solutions. Reach out today to learn how to get started with your own AI project. Email: Info@CalculationConsulting.com