We rank some of the most popular open-source LLMs using the average weightwatcher quality metric alpha. A smaller alpha indicates the base LLM has been trained better. As a secondary check, we provide the quality of the power-law fit (Dks, the Kolmogorov-Smirnov distance); a smaller Dks also indicates a better base model.
We also provide some of the LLM quality metrics from the popular HuggingFace Open LLM Leaderboard: ARC (25-shot), HellaSwag (10-shot), MMLU (5-shot), and TruthfulQA (0-shot).
rank | model | version | alpha | Dks | ARC | HellaSwag | MMLU | TruthfulQA |
---|---|---|---|---|---|---|---|---|
1 | GPT-NeoX | gpt-neox-20b | 2.95 | 0.02 | 45.20 | 73.40 | 33.30 | 31.70 |
2 | BLOOMChat | bloomchat-176b | 3.03 | 0.01 | ||||
3 | OPT | opt-30b | 3.04 | 0.02 | ||||
4 | OPT | opt-13b | 3.05 | 0.02 | 40.50 | 71.30 | 30.40 | 34.00 |
5 | Dolly | dolly-v2-12b | 3.22 | 0.02 | 41.20 | 72.30 | 31.70 | 34.30 |
6 | Dolly | dolly-v2-7b | 3.26 | 0.02 | 43.70 | 69.30 | 30.20 | 34.50 |
7 | Galactica | galactica-120b | 3.28 | 0.02 | 46.80 | 66.40 | 50.40 | 41.30 |
8 | OPT | opt-1.3b | 3.30 | 0.03 | 29.60 | 54.60 | 27.70 | 38.70 |
9 | GLM | glm-10b | 3.35 | 0.03 | ||||
10 | StableLM | stablelm-tuned-alpha-7b | 3.36 | 0.02 | 31.90 | 53.60 | 27.40 | 40.20 |
11 | T-Zero | t0p | 3.37 | 0.03 | ||||
12 | Guanaco | guanaco-65b | 3.40 | 0.02 | ||||
13 | OPT | opt-2.7b | 3.41 | 0.03 | ||||
14 | LLAMA | llama-65b | 3.43 | 0.03 | 57.80 | 84.20 | 48.80 | 42.30 |
15 | Falcon | falcon-7b-instruct | 3.45 | 0.02 | 45.90 | 70.80 | 32.80 | 44.10 |
16 | StableLM | stablelm-tuned-alpha-3b | 3.45 | 0.02 | ||||
17 | Dolly | dolly-v2-3b | 3.51 | 0.03 | 39.80 | 65.20 | 29.70 | 33.70 |
18 | LLAMA | llama-30b | 3.53 | 0.02 | 57.10 | 82.60 | 45.70 | 42.30 |
19 | GLM | glm-2b | 3.55 | 0.03 | ||||
20 | FLAN-UL2 | flan-ul2 | 3.56 | 0.03 | ||||
21 | OPT | opt-125m | 3.58 | 0.05 | 23.10 | 31.50 | 27.40 | 42.90 |
22 | GPT4ALL | gpt4all-13b-snoozy | 3.69 | 0.02 | ||||
23 | Vicuna | vicuna-13b-1.1 | 3.70 | 0.02 | ||||
24 | Alpaca | alpaca-13b | 3.70 | 0.02 | 51.90 | 77.60 | 37.60 | 39.60 |
25 | Stable-Vicuna | stable-vicuna-13b | 3.70 | 0.02 | 48.10 | 76.40 | 38.80 | 46.50 |
26 | Guanaco | guanaco-13b | 3.70 | 0.02 | ||||
27 | LLAMA | llama-13b | 3.71 | 0.02 | 50.80 | 78.90 | 37.70 | 39.90 |
28 | Koala | koala-13b-details | 3.71 | 0.02 | ||||
29 | Airoboros | airoboros-13b | 3.71 | 0.02 | ||||
30 | GPT4ALL | gpt4all-mpt | 3.73 | 0.02 | ||||
31 | Guanaco | guanaco-7b | 3.76 | 0.03 | ||||
32 | Koala | koala-7b-details | 3.76 | 0.03 | ||||
33 | GPT4ALL | gpt4all-j | 3.76 | 0.02 | 41.20 | 64.50 | 33.30 | 45.60 |
34 | Gorilla | gorilla-7b | 3.77 | 0.03 | ||||
35 | LLAMA | llama-7b | 3.77 | 0.03 | 46.60 | 75.60 | 34.20 | 34.10 |
36 | Alpaca | alpaca-7b | 3.78 | 0.03 | ||||
37 | Vicuna | vicuna-7b-1.1 | 3.78 | 0.03 | 47.00 | 75.20 | 37.50 | 48.90 |
38 | Falcon | falcon-40b-instruct | 3.81 | 0.02 | 61.60 | 84.40 | 54.10 | 52.50 |
39 | Falcon | falcon-40b | 3.81 | 0.02 | 61.90 | 85.30 | 52.70 | 41.70 |
40 | RedPajama | redpajama-instruct-3b-v1 | 3.83 | 0.03 | ||||
41 | RedPajama | redpajama-chat-3b-v1 | 3.87 | 0.03 | ||||
42 | RedPajama | redpajama-base-3b-v1 | 3.87 | 0.03 | ||||
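For readers who want to reproduce the alpha and Dks columns, here is a minimal sketch using the weightwatcher Python package. The choice of opt-125m as the example model and the layer averaging shown are our assumptions for illustration; weightwatcher reports the KS distance in its `D` column.

```python
import weightwatcher as ww
from transformers import AutoModelForCausalLM

# opt-125m stands in for the larger models in the table above
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Fit a power law to the eigenvalue spectrum of each layer's weight matrix
watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()  # pandas DataFrame, one row per analyzed layer

# The table reports the layer-averaged values
print(f"avg alpha: {details['alpha'].mean():.2f}")
print(f"avg Dks:   {details['D'].mean():.2f}")
```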
The weightwatcher alpha tells us how well the base model has been trained, but it is not strongly correlated with the other LLM quality metrics; still, we provide these comparisons in the table above for reference.
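One way to check the (lack of) correlation yourself is a rank correlation between alpha and each leaderboard metric. This is a hedged sketch: the file name `llm_rankings.csv` is hypothetical, standing in for the table above with the missing leaderboard cells left as NaN.

```python
import pandas as pd

# Hypothetical CSV holding the table above; missing leaderboard cells become NaN
df = pd.read_csv("llm_rankings.csv")

# Pairwise-complete Spearman rank correlation between alpha and each metric
for metric in ["ARC", "HellaSwag", "MMLU", "TruthfulQA"]:
    rho = df["alpha"].corr(df[metric], method="spearman")
    print(f"alpha vs {metric}: Spearman rho = {rho:+.2f}")
```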
Many LLMs appear to be significantly overparameterized, yet show no apparent gain in performance. Why is this? Results on the Falcon LLM suggest an answer.
It appears that many text datasets contain too many duplicates, effectively lowering the amount of unique training data. This likely causes the large alphas we see in models like LLAMA.
Falcon, however, was trained on an extremely clean, deduplicated dataset, and the proof is in the pudding, so to speak.
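As a toy illustration of how duplicates shrink the effective dataset size, a crude exact-duplicate rate can be estimated by hashing lightly normalized documents. Everything here is a hypothetical sketch; real pipelines rely on fuzzy deduplication (e.g. MinHash) rather than exact matching.

```python
import hashlib

def duplicate_rate(docs):
    """Fraction of documents that are exact duplicates of an earlier one."""
    seen, dupes = set(), 0
    for doc in docs:
        # Normalize before hashing so trivial variants collapse together
        h = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if h in seen:
            dupes += 1
        else:
            seen.add(h)
    return dupes / max(len(docs), 1)

corpus = ["the cat sat", "The cat sat", "a new document"]
print(duplicate_rate(corpus))  # 0.33: one variant collapses after normalization
```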
Most people think that overparameterized models should always do better (due to Double Descent or some other argument), but this is not the case: in almost every model we have examined, higher-quality models have smaller, not larger, alphas on average.