LLM Leaderboard


We rank some of the most popular open-source LLMs using the average weightwatcher quality metric, alpha. A smaller alpha indicates a better-trained base LLM. As a secondary check, we provide the quality of the power-law fit (Dks); a smaller Dks also indicates a better base model.
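
For reference, here is a minimal sketch of how these numbers can be computed with the open-source weightwatcher package. The model name is just an example; any HuggingFace model works the same way.

    import weightwatcher as ww
    from transformers import AutoModelForCausalLM

    # Example checkpoint only; substitute any HuggingFace model.
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

    watcher = ww.WeightWatcher(model=model)
    details = watcher.analyze()              # per-layer dataframe: alpha, D, ...
    summary = watcher.get_summary(details)   # averages over all analyzed layers

    print(summary["alpha"])                      # the average alpha reported above
    print(details[["layer_id", "alpha", "D"]])   # D is the KS distance (Dks)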

We also provide some of the LLM quality metrics from the popular HuggingFace Open LLM Leaderboard: ARC (25-shot), HellaSwag (10-shot), MMLU (5-shot), and TruthfulQA (0-shot).

LLM Truthfulness
Generally, better-trained models are less truthful; it is as if the smarter the model, the better it is at lying.
See below for more: WW Alpha gauges Truthfulness.

Evaluating Base Models
Many base-model LLMs appear to be significantly overparameterized, with no apparent gain in performance. Why is this? Results on the Falcon LLM suggest an answer.
See below for more: Comparison of Llama to Falcon.

LLM Quality Metrics and Deltas
If you fine-tune your own models, you can apply weightwatcher directly to your LLM deltas.
See below for more: LLM Deltas.


  rank model version alpha Dks ARC HellaSwag MMLU TruthfulQA
1 GPT-NeoX gpt-neox-20b 2.95 0.02 45.20 73.40 33.30 31.70
2 BLOOMChat bloomchat-176b 3.03 0.01
3 OPT opt-30b 3.04 0.02
4 OPT opt-13b 3.05 0.02 40.50 71.30 30.40 34.00
5 Dolly dolly-v2-12b 3.22 0.02 41.20 72.30 31.70 34.30
6 Dolly dolly-v2-7b 3.26 0.02 43.70 69.30 30.20 34.50
7 Galactica galactica-120b 3.28 0.02 46.80 66.40 50.40 41.30
8 OPT opt-1.3b 3.30 0.03 29.60 54.60 27.70 38.70
9 GLM glm-10b 3.35 0.03
10 StableLM stablelm-tuned-alpha-7b 3.36 0.02 31.90 53.60 27.40 40.20
11 T-Zero t0p 3.37 0.03
12 Guanaco guanaco-65b 3.40 0.02
13 OPT opt-2.7b 3.41 0.03
14 LLAMA llama-65b 3.43 0.03 57.80 84.20 48.80 42.30
15 Falcon falcon-7b-instruct 3.45 0.02 45.90 70.80 32.80 44.10
16 StableLM stablelm-tuned-alpha-3b 3.45 0.02
17 Dolly dolly-v2-3b 3.51 0.03 39.80 65.20 29.70 33.70
18 LLAMA llama-30b 3.53 0.02 57.10 82.60 45.70 42.30
19 GLM glm-2b 3.55 0.03
20 FLAN-UL2 flan-ul2 3.56 0.03
21 OPT opt-125m 3.58 0.05 23.10 31.50 27.40 42.90
22 GPT4ALL gpt4all-13b-snoozy 3.69 0.02
23 Vicuna vicuna-13b-1.1 3.70 0.02
24 Alpaca alpaca-13b 3.70 0.02 51.90 77.60 37.60 39.60
25 Stable-Vicuna stable-vicuna-13b 3.70 0.02 48.10 76.40 38.80 46.50
26 Guanaco guanaco-13b 3.70 0.02
27 LLAMA llama-13b 3.71 0.02 50.80 78.90 37.70 39.90
28 Koala koala-13b-details 3.71 0.02
29 Airoboros airoboros-13b 3.71 0.02
30 GPT4ALL gpt4all-mpt 3.73 0.02
31 Guanaco guanaco-7b 3.76 0.03
32 Koala koala-7b-details 3.76 0.03
33 GPT4ALL gpt4all-j 3.76 0.02 41.20 64.50 33.30 45.60
34 Gorilla gorilla-7b 3.77 0.03
35 LLAMA llama-7b 3.77 0.03 46.60 75.60 34.20 34.10
36 Alpaca alpaca-7b 3.78 0.03
37 Vicuna vicuna-7b-1.1 3.78 0.03 47.00 75.20 37.50 48.90
38 Falcon falcon-40b-instruct 3.81 0.02 61.60 84.40 54.10 52.50
39 Falcon falcon-40b 3.81 0.02 61.90 85.30 52.70 41.70
40 RedPajama redpajama-instruct-3b-v1 3.83 0.03
41 RedPajama redpajama-chat-3b-v1 3.87 0.03
42 RedPajama redpajama-base-3b-v1 3.87 0.03

Best Base Models

WW Alpha gauges Truthfulness


Generally, better-trained models are less truthful; it is as if the smarter the model, the better it is at lying (just like people). Below, we show how the WW alpha correlates with model truthfulness.

In fact, the weightwatcher alpha is the only metric that correlates well with Truthfulness.

The weightwatcher alpha is not strongly correlated with the other LLM quality metrics; still, we provide these comparisons below.

The weightwatcher metrics tell us how well the base model is trained, but they are not specifically correlated with the other LLM metrics.
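
As a simple illustration, the correlation can be checked directly from the table above. The sketch below uses a subset of the rows that report TruthfulQA (values copied from the leaderboard), with Spearman rank correlation chosen purely as an example.

    import pandas as pd

    # alpha / TruthfulQA pairs copied from the leaderboard above
    # (subset of rows that report TruthfulQA, shown for brevity).
    df = pd.DataFrame({
        "alpha":      [2.95, 3.05, 3.22, 3.28, 3.36, 3.45, 3.58, 3.70, 3.77, 3.81],
        "TruthfulQA": [31.7, 34.0, 34.3, 41.3, 40.2, 44.1, 42.9, 46.5, 34.1, 52.5],
    })

    # A positive rank correlation means smaller alpha (better-trained base
    # model) tends to go with a lower TruthfulQA score.
    print(df["alpha"].corr(df["TruthfulQA"], method="spearman"))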

WW Alpha vs TruthfulQA

Comparison of Llama to Falcon


Many LLMs appear to be significantly overparameterized, with no apparent gain in performance. Why is this? Results on the Falcon LLM suggest an answer.

It appears that many text datasets contain too many duplicates, effectively lowering their size. This likely causes the large alphas we see in models like LLAMA.

Falcon, however, was trained on an extremely clean, carefully deduplicated dataset. And the proof is in the pudding, so to speak.

Most people think that overparameterized models should always do better (due to Double Descent or some other argument), but this is not the case. In almost every case we have looked at, higher-quality models have, on average, smaller, not larger, alphas.
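
To reproduce this kind of comparison yourself, the sketch below analyzes both models with weightwatcher and compares their average alphas. The checkpoint IDs are assumptions; substitute whichever models you have locally.

    import weightwatcher as ww
    from transformers import AutoModelForCausalLM

    # Checkpoint IDs are assumptions; note that loading 7B-parameter
    # models requires substantial memory.
    for name in ["tiiuae/falcon-7b", "huggyllama/llama-7b"]:
        model = AutoModelForCausalLM.from_pretrained(name)
        details = ww.WeightWatcher(model=model).analyze()
        print(name, details["alpha"].mean())  # smaller average alpha ~ better trained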


Applying weightwatcher to fine-tuned LLMs


Avalanche LLM Talk: June 2023

Applying weightwatcher to LLMs
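
The plots below come from analyzing fine-tuned models and their deltas. As a minimal sketch (not necessarily the exact procedure used for these plots), one way to form a delta model is to subtract the base weights from the fine-tuned weights, then analyze the result like any other model. The checkpoint names here are hypothetical.

    import copy
    import weightwatcher as ww
    from transformers import AutoModelForCausalLM

    # Hypothetical checkpoint names; use your own base / fine-tuned pair
    # (they must share the same architecture).
    base  = AutoModelForCausalLM.from_pretrained("my-base-model")
    tuned = AutoModelForCausalLM.from_pretrained("my-fine-tuned-model")

    # Build a "delta model" whose weights are (fine-tuned - base).
    delta = copy.deepcopy(tuned)
    delta.load_state_dict({
        k: tuned.state_dict()[k] - base.state_dict()[k]
        for k in base.state_dict()
    })

    # Analyze the delta exactly like a normal model.
    details = ww.WeightWatcher(model=delta).analyze()
    print(details[["layer_id", "alpha", "D"]])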


Vicuna: Well-Tuned

Vicuna Delta

Dromedary: Possibly Overfit

LoRA updates
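
For LoRA, the update to each layer is a low-rank product W_delta = B A. The sketch below fits a power law to the eigenvalue spectrum of such an update using the powerlaw package that weightwatcher itself builds on; random matrices stand in for real LoRA factors, so treat this only as an illustration of the fitting step.

    import numpy as np
    import powerlaw  # the fitting library weightwatcher builds on

    # Random stand-ins for the LoRA factors of one layer: W_delta = B @ A.
    # Note: a rank-r update has only r nonzero eigenvalues, so the fit is
    # only meaningful for reasonably large ranks.
    rng = np.random.default_rng(0)
    r, m, n = 64, 4096, 4096
    A = rng.standard_normal((r, n))
    B = rng.standard_normal((m, r))
    W_delta = B @ A

    # Empirical spectral density: eigenvalues of W^T W = squared singular values.
    evals = np.linalg.svd(W_delta, compute_uv=False) ** 2
    fit = powerlaw.Fit(evals[evals > 1e-10])
    print(fit.power_law.alpha, fit.power_law.D)  # alpha and KS distance (Dks)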