The OpenAI GPT models are among the best-known and most widely used pretrained (commercial) Large Language Models (LLMs). We examined the OpenAI GPT and GPT2 models in our Nature paper; here we review those results and offer a deep dive into the specific weightwatcher results.
In particular, the OpenAI GPT and GPT2 models let us examine the difference between the same architecture trained with widely different amounts of data. The original OpenAI GPT model was deliberately trained with insufficient data to prevent its misuse. The later GPT2 model was trained with just enough data to make it sensible, though nowhere near as good as the commercial GPT3 offering. Using weightwatcher, we can readily see the effects of not having enough data when training a model from scratch.
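To reproduce this kind of comparison, one can run weightwatcher directly on the HuggingFace checkpoints. Below is a minimal sketch, assuming the public weightwatcher API and the HuggingFace transformers model names "openai-gpt" and "gpt2"; exact column names may vary across weightwatcher versions.

```python
# Minimal sketch: analyze both GPT checkpoints with weightwatcher.
# Assumes HuggingFace transformers and the public weightwatcher API;
# column names (e.g. "alpha") may vary across weightwatcher versions.
import weightwatcher as ww
from transformers import OpenAIGPTModel, GPT2Model

for name, cls in [("openai-gpt", OpenAIGPTModel), ("gpt2", GPT2Model)]:
    model = cls.from_pretrained(name)
    watcher = ww.WeightWatcher(model=model)
    # analyze() returns a pandas DataFrame with one row per layer,
    # including the fitted power-law exponent alpha
    details = watcher.analyze()
    print(name, details["alpha"].describe())
```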
First, consider the first plot, showing the distribution of the layer alphas. The original OpenAI GPT has many outlier alphas past the threshold for being well trained (i.e., to the right of the red dashed line). In contrast, all but 1 of the GPT2 layer alphas lie within the thresholds of 2 and 6 (and that 1 is very close to 6). Additionally, in the Correlation Flow plots, looking at the OpenAI GPT data (purple dots), there is a subset of layers that are increasingly undertrained, moving left to right, with the layer alphas growing from 6 to above 12.
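As a rough check on these outlier counts, one could tabulate how many layer alphas fall outside the well-trained range of 2 to 6. This is a hypothetical helper, assuming `details` is the per-layer DataFrame returned by `watcher.analyze()` as in the sketch above.

```python
# Hypothetical helper: count layers whose fitted alpha falls outside
# the well-trained range [2, 6] discussed in the text.
def count_outlier_alphas(details, lo=2.0, hi=6.0):
    alphas = details["alpha"].dropna()
    return int(((alphas < lo) | (alphas > hi)).sum())

# e.g. compare count_outlier_alphas(details_gpt) to count_outlier_alphas(details_gpt2)
```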
If we drill down into the GPT Rand Distance metric plots, we can also see signatures of undertraining in the OpenAI GPT model when compared to GPT2. Looking at the layer rand_distances (or randD) in all 3 plots, we see that the openai-gpt layer metric has much smaller values than the corresponding gpt2 layers. As expected, poorly trained layers look more random than their well-trained counterparts.
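The rand_distance metric is computed by randomizing each layer's weight matrix and comparing the resulting eigenvalue spectrum to the original. A sketch follows, assuming that passing `randomize=True` to `analyze()` adds a `rand_distance` column to the details DataFrame (the exact option and column names here are assumptions, and may differ across weightwatcher releases):

```python
# Sketch: compare layer rand_distances for the two models.
# Assumes analyze(randomize=True) adds a "rand_distance" column;
# smaller values mean the layer spectrum is closer to pure noise.
import weightwatcher as ww
from transformers import OpenAIGPTModel, GPT2Model

for name, cls in [("openai-gpt", OpenAIGPTModel), ("gpt2", GPT2Model)]:
    watcher = ww.WeightWatcher(model=cls.from_pretrained(name))
    details = watcher.analyze(randomize=True)
    print(name, "mean rand_distance:", details["rand_distance"].mean())
```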
Just using weightwatcher, we can clearly see that GPT2 is much better trained than the OpenAI GPT model! (For more details, see this post from the CalculatedContent blog.)