LLM Leaderboard


We rank some of the most popular open-source LLMs using the average weightwatcher quality metric, alpha. A smaller alpha indicates a better-trained base LLM. As a secondary check, we provide the quality of the power-law fit (Dks); a smaller Dks also indicates a better base model.
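
For reference, here is a minimal sketch of how these numbers can be computed with the open-source weightwatcher package. The model name is just an example; any HuggingFace model works the same way.

    import weightwatcher as ww
    from transformers import AutoModelForCausalLM

    # Example checkpoint only; substitute any HuggingFace model.
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

    watcher = ww.WeightWatcher(model=model)
    details = watcher.analyze()              # per-layer dataframe: alpha, D, ...
    summary = watcher.get_summary(details)   # averages over all analyzed layers

    print(summary["alpha"])                      # the average alpha reported above
    print(details[["layer_id", "alpha", "D"]])   # D is the KS distance (Dks)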

We also provide some of the LLM quality metrics from the popular HuggingFace Open LLM Leaderboard: ARC (25-shot), HellaSwag (10-shot), MMLU (5-shot), and TruthfulQA (0-shot).

LLM Truthfulness
Generally, better-trained models are less truthful; it is as if the smarter the model, the better it is at lying.
See below for more: WW Alpha gauges Truthfulness.

Evaluating Base Models
Many base-model LLMs appear to be significantly overparameterized, with no apparent gain in performance. Why is this? Results on the Falcon LLM suggest an answer.
See below for more: Comparison of Llama to Falcon.

LLM Quality Metrics and Deltas
If you fine-tune your own models, you can apply weightwatcher directly to your LLM deltas.
See below for more: LLM Deltas.


  rank model version alpha Dks ARC HellaSwag MMLU TruthfulQA
1 GPT-NeoX gpt-neox-20b 2.95 0.02 45.20 73.40 33.30 31.70
2 BLOOMChat bloomchat-176b 3.03 0.01
3 OPT opt-30b 3.04 0.02
4 OPT opt-13b 3.05 0.02 40.50 71.30 30.40 34.00
5 Dolly dolly-v2-12b 3.22 0.02 41.20 72.30 31.70 34.30
6 Dolly dolly-v2-7b 3.26 0.02 43.70 69.30 30.20 34.50
7 Galactica galactica-120b 3.28 0.02 46.80 66.40 50.40 41.30
8 OPT opt-1.3b 3.30 0.03 29.60 54.60 27.70 38.70
9 GLM glm-10b 3.35 0.03
10 StableLM stablelm-tuned-alpha-7b 3.36 0.02 31.90 53.60 27.40 40.20
11 T-Zero t0p 3.37 0.03
12 Guanaco guanaco-65b 3.40 0.02
13 OPT opt-2.7b 3.41 0.03
14 LLAMA llama-65b 3.43 0.03 57.80 84.20 48.80 42.30
15 Falcon falcon-7b-instruct 3.45 0.02 45.90 70.80 32.80 44.10
16 StableLM stablelm-tuned-alpha-3b 3.45 0.02
17 Dolly dolly-v2-3b 3.51 0.03 39.80 65.20 29.70 33.70
18 LLAMA llama-30b 3.53 0.02 57.10 82.60 45.70 42.30
19 GLM glm-2b 3.55 0.03
20 FLAN-UL2 flan-ul2 3.56 0.03
21 OPT opt-125m 3.58 0.05 23.10 31.50 27.40 42.90
22 GPT4ALL gpt4all-13b-snoozy 3.69 0.02
23 Vicuna vicuna-13b-1.1 3.70 0.02
24 Alpaca alpaca-13b 3.70 0.02 51.90 77.60 37.60 39.60
25 Stable-Vicuna stable-vicuna-13b 3.70 0.02 48.10 76.40 38.80 46.50
26 Guanaco guanaco-13b 3.70 0.02
27 LLAMA llama-13b 3.71 0.02 50.80 78.90 37.70 39.90
28 Koala koala-13b-details 3.71 0.02
29 Airoboros airoboros-13b 3.71 0.02
30 GPT4ALL gpt4all-mpt 3.73 0.02
31 Guanaco guanaco-7b 3.76 0.03
32 Koala koala-7b-details 3.76 0.03
33 GPT4ALL gpt4all-j 3.76 0.02 41.20 64.50 33.30 45.60
34 Gorilla gorilla-7b 3.77 0.03
35 LLAMA llama-7b 3.77 0.03 46.60 75.60 34.20 34.10
36 Alpaca alpaca-7b 3.78 0.03
37 Vicuna vicuna-7b-1.1 3.78 0.03 47.00 75.20 37.50 48.90
38 Falcon falcon-40b-instruct 3.81 0.02 61.60 84.40 54.10 52.50
39 Falcon falcon-40b 3.81 0.02 61.90 85.30 52.70 41.70
40 RedPajama redpajama-instruct-3b-v1 3.83 0.03
41 RedPajama redpajama-chat-3b-v1 3.87 0.03
42 RedPajama redpajama-base-3b-v1 3.87 0.03

Best Base Models

WW Alpha gauges Truthfulness


Generally, better-trained models are less truthful; it is as if the smarter the model, the better it is at lying (just like people). Below, we show how the WW alpha correlates with model truthfulness.

In fact, the weightwatcher alpha is the only metric that correlates well with Truthfulness.

The weightwatcher alpha is not strongly correlated with the other LLM quality metrics; still, we provide these comparisons below.

The weightwatcher metrics tell us how well the base model is trained, but they are not specifically correlated with the other LLM metrics.
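
As a simple illustration, the correlation can be checked directly from the table above. The sketch below uses a subset of the rows that report TruthfulQA (values copied from the leaderboard), with Spearman rank correlation chosen purely as an example.

    import pandas as pd

    # alpha / TruthfulQA pairs copied from the leaderboard above
    # (subset of rows that report TruthfulQA, shown for brevity).
    df = pd.DataFrame({
        "alpha":      [2.95, 3.05, 3.22, 3.28, 3.36, 3.45, 3.58, 3.70, 3.77, 3.81],
        "TruthfulQA": [31.7, 34.0, 34.3, 41.3, 40.2, 44.1, 42.9, 46.5, 34.1, 52.5],
    })

    # A positive rank correlation means smaller alpha (better-trained base
    # model) tends to go with a lower TruthfulQA score.
    print(df["alpha"].corr(df["TruthfulQA"], method="spearman"))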

WW Alpha vs TruthfulQA

Comparison of Llama to Falcon


Many LLMs appear to be significantly overparameterized, with no apparent gain in performance. Why is this? Results on the Falcon LLM suggest an answer.

It appears that many text datasets contain too many duplicates, effectively lowering their size. This likely causes the large alphas we see in models like LLAMA.

Falcon, however, was trained on an extremely clean, carefully deduplicated dataset. And the proof is in the pudding, so to speak.

Most people think that overparameterized models should always do better (due to Double Descent or some other argument), but this is not the case. In almost every case we have looked at, higher-quality models have, on average, smaller, not larger, alphas.
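
To reproduce this kind of comparison yourself, the sketch below analyzes both models with weightwatcher and compares their average alphas. The checkpoint IDs are assumptions; substitute whichever models you have locally.

    import weightwatcher as ww
    from transformers import AutoModelForCausalLM

    # Checkpoint IDs are assumptions; note that loading 7B-parameter
    # models requires substantial memory.
    for name in ["tiiuae/falcon-7b", "huggyllama/llama-7b"]:
        model = AutoModelForCausalLM.from_pretrained(name)
        details = ww.WeightWatcher(model=model).analyze()
        print(name, details["alpha"].mean())  # smaller average alpha ~ better trained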


Applying weightwatcher to fine-tuned LLMs


Avalanche LLM Talk: June 2023

Applying weightwatcher to LLMs
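
The plots below come from analyzing fine-tuned models and their deltas. As a minimal sketch (not necessarily the exact procedure used for these plots), one way to form a delta model is to subtract the base weights from the fine-tuned weights, then analyze the result like any other model. The checkpoint names here are hypothetical.

    import copy
    import weightwatcher as ww
    from transformers import AutoModelForCausalLM

    # Hypothetical checkpoint names; use your own base / fine-tuned pair
    # (they must share the same architecture).
    base  = AutoModelForCausalLM.from_pretrained("my-base-model")
    tuned = AutoModelForCausalLM.from_pretrained("my-fine-tuned-model")

    # Build a "delta model" whose weights are (fine-tuned - base).
    delta = copy.deepcopy(tuned)
    delta.load_state_dict({
        k: tuned.state_dict()[k] - base.state_dict()[k]
        for k in base.state_dict()
    })

    # Analyze the delta exactly like a normal model.
    details = ww.WeightWatcher(model=delta).analyze()
    print(details[["layer_id", "alpha", "D"]])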


Vicuna: Well-Tuned

Vicuna Delta

Dromedary: Possibly Overfit

LoRA updates
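
For LoRA, the update to each layer is a low-rank product W_delta = B A. The sketch below fits a power law to the eigenvalue spectrum of such an update using the powerlaw package that weightwatcher itself builds on; random matrices stand in for real LoRA factors, so treat this only as an illustration of the fitting step.

    import numpy as np
    import powerlaw  # the fitting library weightwatcher builds on

    # Random stand-ins for the LoRA factors of one layer: W_delta = B @ A.
    # Note: a rank-r update has only r nonzero eigenvalues, so the fit is
    # only meaningful for reasonably large ranks.
    rng = np.random.default_rng(0)
    r, m, n = 64, 4096, 4096
    A = rng.standard_normal((r, n))
    B = rng.standard_normal((m, r))
    W_delta = B @ A

    # Empirical spectral density: eigenvalues of W^T W = squared singular values.
    evals = np.linalg.svd(W_delta, compute_uv=False) ** 2
    fit = powerlaw.Fit(evals[evals > 1e-10])
    print(fit.power_law.alpha, fit.power_law.D)  # alpha and KS distance (Dks)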