The Llama Guard models have been developed by Meta to safeguard LLMs in Human-AI conversations. They are Instruct Fine-Tuned models.
Here, we compare the Llama-Guard-3-1B and Llama-Guard-3-8B Instruct Fine-Tuned models, and they show an interesting pattern.
Both models show a similar pattern: the Alpha Histograms indicate that many of the layers are overfit. But it gets even more interesting than this.
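As a rough sketch, here is how one could compute the per-layer alphas and an Alpha Histogram with the open-source weightwatcher package. The Hugging Face model id and the threshold lines at alpha = 2 and alpha = 6 are assumptions following the conventions discussed here; the Llama Guard weights are gated, so you would need access approved on Hugging Face.

```python
# Minimal sketch: compute per-layer HTSR alphas and plot an Alpha Histogram.
# Assumes weightwatcher, transformers, and matplotlib are installed and that
# you have access to the gated meta-llama/Llama-Guard-3-1B weights.
import matplotlib.pyplot as plt
import weightwatcher as ww
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-Guard-3-1B")

watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()   # per-layer DataFrame; includes an 'alpha' column

# Alpha Histogram: layers with alpha < 2 are flagged as potentially overfit
details["alpha"].hist(bins=50)
plt.axvline(2.0, color="red", linestyle="--", label="alpha = 2")
plt.axvline(6.0, color="orange", linestyle="--", label="alpha = 6")
plt.xlabel("layer alpha")
plt.ylabel("number of layers")
plt.legend()
plt.show()
```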
Looking at the Correlation Flow plot, we see that the first few layers (near the data) have layer alphas in the HTSR safe zone. However, as the layers get closer to the labels, they penetrate deeper and deeper into the red zone, with alpha less than 2, indicating that these layers may be overfit. This is quite unique, and suggests that these Guard models are essentially overfitting to known 'bad' conversations, and will only generalize to conversations similar to these in some abstract way.
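A Correlation Flow style plot can be sketched from the same details DataFrame by plotting alpha against layer depth. The red-zone and safe-zone cutoffs below follow the alpha < 2 and 2-6 conventions used in the text; the column names are those reported by weightwatcher.

```python
# Minimal sketch of a Correlation Flow plot: layer alpha vs. layer depth,
# reusing the 'details' DataFrame from the previous snippet.
import matplotlib.pyplot as plt

plt.scatter(details["layer_id"], details["alpha"])
plt.axhline(2.0, color="red", linestyle="--", label="alpha = 2 (red zone below)")
plt.axhline(6.0, color="orange", linestyle="--", label="alpha = 6")
plt.xlabel("layer_id (depth: data -> labels)")
plt.ylabel("layer alpha")
plt.title("Correlation Flow: Llama-Guard-3-1B")
plt.legend()
plt.show()
```

If the pattern described above holds, the points should sit inside the safe zone at small layer_id and drift below the alpha = 2 line as layer_id increases toward the labels.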
Notice this is exactly the opposite of the Segment Anything Models (SAM), which overfit the layers closest to the data (as opposed to those closest to the labels).