OLMo Models


OLMo 2 is a family of 7B and 13B models from Allen AI (AI2), trained on up to 5T tokens. OLMo 2 7B outperforms Llama-3.1 8B, and OLMo 2 13B outperforms Qwen 2.5 7B, despite lower total training FLOPs.
Key improvements include:
1. Enhanced architecture with RMSNorm, QK-Norm, auxiliary Z-loss, and rotary positional embeddings
2. Two-stage curriculum training approach using OLMo-Mix-1124 and Dolmino-Mix-1124
3. Model souping technique for final checkpoints (aka merging)
4. State-of-the-art post-training methodology from Tülu 3, with a three-stage pipeline of instruction tuning, preference tuning with DPO, and reinforcement learning with verifiable rewards (RLVR)
5. Evaluated on the OLMES suite
6. Instruct variants that are competitive with the best open-weight models, with OLMo 2 13B Instruct outperforming Qwen 2.5 14B Instruct, Tülu 3 8B, and Llama 3.1 8B Instruct
Here we see that the OLMo 7B base model has only a few overfit layers, with an average layer alpha of 3.70. The Instruct version looks even better, with an even lower average layer alpha of 2.88. Notice this is better than the average alpha for Llama 3.1 8B Instruct (3.65) and even Qwen 2.5 7B (3.12)! The OLMo 7B Instruct model, however, has 50 overfit layers. That's a lot, and a little surprising; it suggests that perhaps this model was overfit to the existing evaluation metrics. Time will tell.
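For readers who want to reproduce this kind of per-layer alpha analysis, here is a minimal sketch of the underlying idea: fit a power-law exponent alpha to the heavy tail of a layer's eigenvalue spectral density (the eigenvalues of X = WᵀW, i.e. the squared singular values of the weight matrix W). This sketch uses a simple Hill estimator as a stand-in for the full power-law fit; the `tail_frac` parameter and the random test matrix are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def layer_alpha(W: np.ndarray, tail_frac: float = 0.2) -> float:
    """Hill estimate of the power-law tail exponent alpha of the
    eigenvalue spectral density (ESD) of a weight matrix W.

    Note: illustrative stand-in for a full power-law fit; tail_frac
    (fraction of top eigenvalues treated as the tail) is an assumption.
    """
    # Eigenvalues of X = W^T W are the squared singular values of W.
    svals = np.linalg.svd(W, compute_uv=False)
    evals = np.sort(svals ** 2)[::-1]          # descending eigenvalues
    k = max(2, int(tail_frac * len(evals)))    # size of the assumed tail
    tail = evals[:k]
    # Hill estimator: alpha = 1 + k / sum(log(lambda_i / lambda_min))
    return 1.0 + k / np.sum(np.log(tail / tail[-1]))

# Demo on a random Gaussian matrix (a real analysis would iterate over
# every weight matrix of the model and average the per-layer alphas).
rng = np.random.default_rng(0)
alpha = layer_alpha(rng.standard_normal((512, 512)))
print(f"alpha = {alpha:.2f}")
```

In heavy-tailed self-regularization analyses of this kind, layers with alpha in roughly [2, 6] are considered well-trained, while alpha below 2 is typically flagged as overfit, which is the sense of "overfit layers" used above.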


OLMo Models Included

OLMo Model Set Plots