WeightWatcher: Data-Free Diagnostics for Deep Learning

RoBERTa Models

The very popular BERT model is known to be undertrained. The RoBERTa study (https://arxiv.org/abs/1907.11692) sought to understand how to improve BERT pre-training by adjusting the hyperparameters and other factors. The resulting model is called RoBERTa. RoBERTa was shown to do better than BERT and achieved, for the time, state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks for NLP.

Here, we use weightwatcher to look at 3 variants of RoBERTa: the original RoBERTa, XLM-RoBERTa, and the distilled DistilRoBERTa. Looking at the distribution of layer alphas, we notice that all are similar, except that the original RoBERTa has 1 very large alpha outlier, whereas the other 2 models also have alpha outliers, but nowhere near as large. Also, the best quality model appears to be DistilRoBERTa, which has the smallest mean layer alpha (dashed green line). But be careful...
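A minimal sketch of how the per-layer alphas discussed above might be collected and summarized. It assumes the weightwatcher Python package, whose analyze() call returns a table with an alpha column. The helper names layer_alphas and summarize_alphas, and the outlier cutoff of 6.0, are illustrative assumptions and not part of the original analysis.

```python
from statistics import mean


def layer_alphas(model):
    """Return the per-layer alpha values for an already-loaded model.

    Assumes the weightwatcher package is installed. It is imported lazily
    so that summarize_alphas below can be used without it.
    """
    import weightwatcher as ww

    watcher = ww.WeightWatcher(model=model)
    details = watcher.analyze()  # pandas DataFrame, one row per analyzed layer
    return details["alpha"].dropna().tolist()


def summarize_alphas(alphas, outlier_cutoff=6.0):
    """Summarize a list of layer alphas.

    outlier_cutoff is an assumed, adjustable threshold; layers whose alpha
    exceeds it are reported as outliers.
    """
    if not alphas:
        raise ValueError("need at least one layer alpha")
    return {
        "mean_alpha": mean(alphas),
        "max_alpha": max(alphas),
        "outliers": [a for a in alphas if a > outlier_cutoff],
    }
```

Calling summarize_alphas(layer_alphas(model)) for each of the 3 models would give the mean alpha and the size of the largest outlier side by side. A single very large outlier can pull the mean up, which is one reason to look at both numbers.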

If we drill down a little more and look at the layer rand distance metric, we find that the XLM-RoBERTa model has the largest mean layer rand distance.

Generally speaking, it is hard to resolve the quality difference between 3 such similar models. What we want is for the best quality model to have both the smallest mean layer alpha and the largest mean rand distance metric. In cases where it is hard to tell the difference, we can provide expert consulting to help resolve these tricky issues.
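The selection rule described above (smallest mean layer alpha and largest mean rand distance) can be sketched as a small comparison helper. This is only an illustration under stated assumptions: the function name compare_models and the input layout are hypothetical, and the per-layer values would have to come from a weightwatcher run on each model.

```python
from statistics import mean


def compare_models(metrics):
    """Compare models by mean layer alpha and mean rand distance.

    metrics maps a model name to a dict with two lists of per-layer values:
    {"alpha": [...], "rand_distance": [...]} (hypothetical layout).
    """
    means = {
        name: {
            "mean_alpha": mean(m["alpha"]),
            "mean_rand_distance": mean(m["rand_distance"]),
        }
        for name, m in metrics.items()
    }
    # Lower mean alpha is treated as better; higher mean rand distance is better.
    best_by_alpha = min(means, key=lambda n: means[n]["mean_alpha"])
    best_by_rand = max(means, key=lambda n: means[n]["mean_rand_distance"])
    return {
        "means": means,
        "best_by_alpha": best_by_alpha,
        "best_by_rand_distance": best_by_rand,
        # True only when both criteria pick the same model.
        "agree": best_by_alpha == best_by_rand,
    }
```

When agree is False, as with the 3 RoBERTa variants here, the two metrics point to different models and the ranking cannot be settled from these summary statistics alone.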

Primary Reference: https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/
Secondary Reference: https://www.youtube.com/watch?v=-MCYbmU9kfg
Paper: RoBERTa: A Robustly Optimized BERT Pretraining Approach

RoBERTa Models Included

RoBERTa Model Set Plots

RoBERTa Rand Distance Metric Plots