WeightWatcher: Data-Free Diagnostics for Deep Learning

FlanT5 Models

Flan stands for Fine-tuned LAnguage Net (FLAN), and T5 is the Text-To-Text Transfer Transformer (get it, 5 Ts). The FlanT5 model is an encoder-decoder Large Language Model (LLM) from Google that has been specifically fine-tuned using instruction tuning. Google introduced FLAN in Oct 2021, and the FlanT5 checkpoints were released in 2022 with the paper listed below.

While FlanT5 has been trained on massive data sets, there are also some smaller checkpoints for the common user, which we analyze below. Specifically, we look at the 5 models t5-small, t5-base, t5-large, t5-xl, and t5-xxl, and we compare the weightwatcher metrics against each model's Massive Multitask Language Understanding (MMLU) score.

First, looking at the bar plots, we see that almost all the models have a similar average weightwatcher alpha metric, except for the xxl model, which has a significantly lower average power law exponent alpha. Also, and maybe more importantly, the Dks metric (the Kolmogorov-Smirnov distance of the power law fit) gets smaller with better models, which means that the power law fits get significantly better as the models improve. This is a fantastic example of the HTSR / weightwatcher theory working in action.
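To make the two metrics above concrete, here is a minimal sketch of a power law fit that returns both an exponent alpha and a Kolmogorov-Smirnov distance. It is an illustration only, not the weightwatcher implementation: the function names `fit_power_law` and `sample_power_law` are hypothetical, the lower cutoff `xmin` is fixed by hand (real tools typically search for it), and the inputs are synthetic numbers rather than eigenvalues from any FlanT5 layer.

```python
import math
import random


def fit_power_law(values, xmin):
    """Fit p(x) ~ x^(-alpha) to the tail x >= xmin by maximum likelihood.

    Returns (alpha, ks_distance). A smaller ks_distance means the
    fitted power law describes the data better.
    """
    tail = sorted(v for v in values if v >= xmin)
    n = len(tail)
    if n < 2:
        raise ValueError("need at least two values at or above xmin")

    log_sum = sum(math.log(v / xmin) for v in tail)
    if log_sum <= 0:
        raise ValueError("all tail values equal xmin; cannot fit an exponent")

    # Maximum likelihood estimate of the exponent for a continuous power law.
    alpha = 1.0 + n / log_sum

    # KS distance: largest gap between the empirical CDF (a step function)
    # and the fitted CDF, checked on both sides of each step.
    ks = 0.0
    for i, v in enumerate(tail):
        model_cdf = 1.0 - (v / xmin) ** (1.0 - alpha)
        ks = max(ks, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
    return alpha, ks


def sample_power_law(alpha, xmin, n, rng):
    """Draw n synthetic samples from p(x) ~ x^(-alpha), x >= xmin."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


# Synthetic demo data (assumption: not taken from any real model).
rng = random.Random(0)
heavy_tailed = sample_power_law(alpha=3.0, xmin=1.0, n=5000, rng=rng)
not_heavy_tailed = [1.0 + rng.random() for _ in range(5000)]  # uniform on [1, 2)

alpha_good, ks_good = fit_power_law(heavy_tailed, xmin=1.0)
alpha_bad, ks_bad = fit_power_law(not_heavy_tailed, xmin=1.0)
print(f"power law data: alpha={alpha_good:.2f}, KS distance={ks_good:.3f}")
print(f"uniform data:   alpha={alpha_bad:.2f}, KS distance={ks_bad:.3f}")
```

The fit always returns some alpha, even for data that is not heavy-tailed, which is why the KS distance is worth reading alongside it: only a small distance says the power law is a good description.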

Second, looking at the line plots (below), we see that the average weightwatcher alpha metric is pretty well correlated with the 5 MMLU scores, and the rand-distance metric is almost perfectly correlated. But the alpha-hat metric is not. Also, notice that most of the layer alphas lie between 2 and 6; however, all the models have a few outlier layers with alpha greater than 6, mostly towards the later layers (closer to the data). Importantly, as the model accuracy improves, there are fewer large alphas. This is typical of many high quality models.
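The two checks described above, correlating a per-model metric with a benchmark score and counting layers whose alpha falls outside the 2 to 6 range, can be sketched in a few lines. This is a hedged illustration: `pearson` and `alpha_outlier_summary` are hypothetical helper names, and the numbers in the demo are made-up placeholders, not measurements from the FlanT5 checkpoints.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def alpha_outlier_summary(layer_alphas, low=2.0, high=6.0):
    """Summarize per-layer alphas: the mean, and counts outside [low, high]."""
    return {
        "mean_alpha": sum(layer_alphas) / len(layer_alphas),
        "num_above": sum(1 for a in layer_alphas if a > high),
        "num_below": sum(1 for a in layer_alphas if a < low),
    }


# Placeholder inputs for illustration only (NOT real FlanT5 results).
placeholder_layer_alphas = [2.5, 3.0, 7.0, 1.5]
placeholder_metric = [1.0, 2.0, 3.0]   # one value per model
placeholder_score = [3.0, 2.0, 1.0]    # one benchmark score per model

print(alpha_outlier_summary(placeholder_layer_alphas))
print(pearson(placeholder_metric, placeholder_score))
```

The sign of the coefficient gives the direction of the relationship and its magnitude the strength. With only 5 models, a coefficient near 1 in magnitude is suggestive rather than conclusive.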

Primary Reference: https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints
Secondary Reference: https://ai.googleblog.com/2021/10/introducing-flan-more-generalizable.html
Paper: Scaling Instruction-Finetuned Language Models

FlanT5 Models Included

FlanT5 Model Set Plots

FlanT5 % Randomness Metric Plots