Mistral Models


Mistral-7B Instruct Models
Mistral-7B is an open-weight language model developed by Mistral AI and optimized for efficiency and performance. With 7 billion parameters, it delivers robust language understanding and generation capabilities. For more information, see the Mistral AI official announcement.
Key Insights:
- Model Evolution and Improvements:
  - The progression from v0.1 to v0.2 and v0.3 shows systematic improvements in the average layer alpha of the instruct components.
  - v0.2 and v0.3 outperform v0.1, with smaller average alpha values and improved Dks scores, indicating better layer conditioning and less over- or underfitting.
  - These gains highlight the effectiveness of the iterative fine-tuning process, making the later versions more suitable for instruction-following tasks.
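To make the layer-alpha metric above concrete, here is a minimal sketch of how a power-law exponent (alpha) can be estimated from a layer's eigenvalue spectral density. It uses a simple Hill-style maximum-likelihood fit on a random stand-in matrix; tools such as WeightWatcher use a more careful fit with an optimized xmin, so treat this as illustrative only.

```python
import numpy as np

def layer_alpha(weight, xmin_quantile=0.5):
    """Estimate the power-law exponent (alpha) of a layer's
    eigenvalue spectral density via a Hill-style MLE.
    Illustrative sketch, not the exact WeightWatcher procedure."""
    # Eigenvalues of the correlation matrix W^T W, via singular values of W
    sv = np.linalg.svd(weight, compute_uv=False)
    eigs = np.sort(sv ** 2)
    # Fit only the tail above a chosen xmin (here: a fixed quantile)
    xmin = np.quantile(eigs, xmin_quantile)
    tail = eigs[eigs >= xmin]
    # Maximum-likelihood estimator for a continuous power law
    return 1.0 + len(tail) / np.sum(np.log(tail / xmin))

# Random Gaussian matrix as a stand-in for a real instruct-layer weight
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)) / np.sqrt(512)
print(f"alpha ~ {layer_alpha(W):.2f}")
```

Smaller alpha values for the tail of a trained layer's spectrum are what the analysis above reads as better conditioning.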
What did they do?
From v0.1 to v0.2:
- Increased Context Window: The context window was expanded from 8,000 tokens in v0.1 to 32,000 tokens in v0.2, allowing the model to handle longer sequences of text.
- Adjusted RoPE Theta: The rotary positional embedding (RoPE) theta parameter was set to 1e6 in v0.2, modifying how positional information is encoded.
- Removed Sliding-Window Attention: The sliding-window attention mechanism present in v0.1 was removed in v0.2, altering the model's attention strategy.
These changes are documented in the model card for Mistral-7B-Instruct-v0.2 on Hugging Face.
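The v0.1 to v0.2 changes can be summarized as a config diff. The sketch below expresses the documented differences as Hugging Face-style `config.json` dictionaries; the v0.1 `rope_theta` default of 10,000 is an assumption, and other fields are omitted for brevity.

```python
# Documented v0.1 -> v0.2 differences as Hugging Face-style config dicts.
cfg_v01 = {
    "max_position_embeddings": 8_000,    # 8k context window
    "rope_theta": 10_000.0,              # standard RoPE base (assumption)
    "sliding_window": 4096,              # sliding-window attention enabled
}
cfg_v02 = {
    "max_position_embeddings": 32_000,   # expanded 32k context window
    "rope_theta": 1e6,                   # adjusted RoPE theta
    "sliding_window": None,              # sliding-window attention removed
}

def config_diff(old, new):
    """Return {key: (old_value, new_value)} for every changed key."""
    return {k: (old.get(k), new.get(k)) for k in new if old.get(k) != new.get(k)}

for key, (before, after) in config_diff(cfg_v01, cfg_v02).items():
    print(f"{key}: {before} -> {after}")
```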
From v0.2 to v0.3:
- Extended Vocabulary: The vocabulary size was increased to 32,768 tokens in v0.3, enhancing the model's language comprehension and generation capabilities.
- Updated Tokenizer: Support for the v3 tokenizer was introduced in v0.3, improving tokenization efficiency and accuracy.
- Function Calling Support: v0.3 added support for function calling, enabling the model to interact with external functions and APIs, thereby expanding its applicability.
These updates are detailed in the model card for Mistral-7B-v0.3 on Hugging Face.
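To illustrate the function-calling support added in v0.3, here is a hedged sketch of the typical tool-calling flow: the model receives a list of JSON-schema tool definitions and may respond with a structured call. The schema layout follows the common JSON-schema tool format, and the `get_weather` tool is a hypothetical example, not part of the Mistral release.

```python
import json

# Hypothetical tool definition, in the common JSON-schema tool format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# A tool call as the model might emit it, parsed from its JSON output.
raw_call = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
call = json.loads(raw_call)
print(call["name"], call["arguments"])
```

The application then executes the named function with the parsed arguments and feeds the result back to the model, which is what lets v0.3 interact with external functions and APIs.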
In conclusion, Mistral-7B v0.3 stands out as the best-performing version in this series, demonstrating excellent layer conditioning, stability, and scalability for instruction fine-tuning.


Mistral Models Included

Mistral Model Set Plots