The Bielik models are instruction fine-tuned, Polish-language variants of the Mistral-7B model, scaled up to 11 billion parameters. Designed for Polish instruction-following tasks, they aim to improve generalization and stability using the same training procedure as the SOLAR-10.7B model.
Here, we examine the Instruction component of the Bielik models and compare the models both with SOLAR-10.7B and with one another.
Analysis of WeightWatcher Indicators
1. Alpha Values:
Compared to SOLAR-10.7B, whose layer alphas all fall within the HTSR safe range (2 to 6), the Bielik models show consistently larger alphas, with many layers falling above that range. Alphas this large indicate a tendency toward underfitting rather than overfitting and suggest less robust layer conditioning. High alpha values generally reflect poor layer adaptability and reduced complexity, which may limit the Bielik models’ ability to generalize effectively across inputs.
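To make the range check above concrete, the following is a minimal sketch that sorts per-layer alphas into the 2 to 6 range quoted in the text and reports the fraction in each bucket. The function names and the example alphas are hypothetical and invented for illustration; they are not measurements from Bielik or SOLAR-10.7B.

```python
from collections import Counter

# HTSR "safe" range for the power-law exponent alpha, as quoted in the text.
ALPHA_LOW, ALPHA_HIGH = 2.0, 6.0
LABELS = ("below_range", "in_range", "above_range")


def classify_alpha(alpha):
    """Label one layer's alpha relative to the 2-6 range."""
    if alpha < ALPHA_LOW:
        return "below_range"  # conventionally read as a sign of overfitting
    if alpha > ALPHA_HIGH:
        return "above_range"  # conventionally read as a sign of underfitting
    return "in_range"


def summarize_alphas(alphas):
    """Return the fraction of layers that fall in each bucket."""
    if not alphas:
        raise ValueError("need at least one alpha value")
    counts = Counter(classify_alpha(a) for a in alphas)
    return {label: counts.get(label, 0) / len(alphas) for label in LABELS}


# Made-up alphas for illustration only (hypothetical, not real model data).
example_alphas = [2.8, 3.5, 4.1, 6.7, 7.9, 5.2, 9.3, 3.3]
summary = summarize_alphas(example_alphas)
print(summary)
```

In practice the alphas would come from a per-layer analysis of the model rather than a hand-written list; the bucketing logic stays the same.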
2. Dks Metric:
The Dks values in the Bielik models are generally higher than in the SOLAR-10.7B model. In WeightWatcher analysis, a larger Dks typically signifies a poorer alpha fit, meaning that the spectral distribution of the layer does not align as well with the expected heavy-tailed power law. This suggests that the Bielik models may not achieve the same degree of effective regularization as SOLAR-10.7B and that their layers may lack the same stability and conditioning, which reinforces the idea that they deviate from optimal layer conditioning.
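As a rough illustration of what a fit-quality distance of this kind measures, the sketch below fits a continuous power law to the tail of a sample by maximum likelihood and then computes the Kolmogorov-Smirnov distance between the empirical and fitted cumulative distributions. It assumes that Dks is a KS distance of this form and that the tail cutoff `xmin` is given; a full analysis would also search for the best `xmin`. The function name and both test samples are hypothetical and constructed for illustration.

```python
import math


def fit_power_law_tail(values, xmin):
    """Fit p(x) ~ x^(-alpha) on the tail x >= xmin; return (alpha, ks_distance)."""
    tail = sorted(v for v in values if v >= xmin)
    n = len(tail)
    log_sum = sum(math.log(v / xmin) for v in tail)
    if n < 2 or log_sum == 0.0:
        raise ValueError("need at least two tail values not all equal to xmin")

    # Maximum-likelihood estimate of the exponent for a continuous power law.
    alpha = 1.0 + n / log_sum

    # KS distance: largest gap between the fitted CDF and the empirical CDF,
    # checked just before and just after each step of the empirical CDF.
    dks = 0.0
    for i, v in enumerate(tail):
        model_cdf = 1.0 - (v / xmin) ** (1.0 - alpha)
        dks = max(dks, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
    return alpha, dks


n = 200
# Synthetic sample placed on the quantiles of a power law with alpha = 3.
power_law_like = [(1.0 - (i + 0.5) / n) ** (-1.0 / 2.0) for i in range(n)]
# Synthetic sample spread evenly over [1, 2], which has no heavy tail.
uniform_like = [1.0 + (i + 0.5) / n for i in range(n)]

alpha_good, dks_good = fit_power_law_tail(power_law_like, xmin=1.0)
alpha_bad, dks_bad = fit_power_law_tail(uniform_like, xmin=1.0)
print(f"power-law-like: alpha={alpha_good:.3f}, Dks={dks_good:.4f}")
print(f"uniform-like:   alpha={alpha_bad:.3f}, Dks={dks_bad:.4f}")
```

The evenly spread sample still yields an alpha estimate, but its distance is much larger, which is the sense in which a high Dks signals that the fitted alpha should be trusted less.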
3. Scale Metric:
The scale values in the Bielik models are larger on average than in SOLAR-10.7B. In WeightWatcher, a larger scale is generally seen as a positive indicator: it suggests that layers have a greater capacity for capturing complexity and typically points to a stronger model with potentially better generalization. Here, however, the scale is at odds with the other indicators, because the Bielik models also exhibit larger alphas and Dks values, which typically imply underfitting and poor conditioning.
Contradiction in WeightWatcher Indicators
The Bielik models show a combination of large alphas, high Dks, and high scale values, an unusual set of indicators that contradicts the typical behavior observed in WeightWatcher analyses. Normally, a larger scale would be expected to accompany smaller alphas and smaller Dks. SOLAR-10.7B, by comparison, has a smaller average scale together with smaller alphas and smaller Dks values. This contradiction suggests that, while the Bielik models may have a greater theoretical capacity for complexity (as indicated by scale), their high alpha and Dks values imply that this capacity is not fully realized, potentially because of poor layer conditioning or stability. Taken together, the metrics suggest that the Bielik models differ substantially in behavior and structure from SOLAR-10.7B, with significant implications for their generalization and stability in real-world applications.
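The comparison described above can be sketched as a small check that averages the three indicators per model and flags the case where one model is larger on all of them at once. This is only an illustrative sketch: the helper names, the dictionary keys, and every number below are hypothetical and are not values measured on Bielik or SOLAR-10.7B.

```python
from statistics import mean

METRICS = ("alpha", "dks", "scale")


def mean_metrics(layers):
    """Average each indicator over a list of per-layer records."""
    return {k: mean(layer[k] for layer in layers) for k in METRICS}


def all_larger(candidate, baseline):
    """True when the candidate's mean alpha, Dks, and scale all exceed the baseline's."""
    return all(candidate[k] > baseline[k] for k in METRICS)


# Hypothetical per-layer values, invented for illustration only.
baseline_layers = [
    {"alpha": 3.1, "dks": 0.04, "scale": 0.8},
    {"alpha": 4.0, "dks": 0.05, "scale": 1.0},
]
candidate_layers = [
    {"alpha": 6.5, "dks": 0.09, "scale": 1.4},
    {"alpha": 7.5, "dks": 0.11, "scale": 1.6},
]

baseline = mean_metrics(baseline_layers)
candidate = mean_metrics(candidate_layers)
print("candidate larger on all three indicators:", all_larger(candidate, baseline))
```

A `True` result here corresponds to the unusual pattern discussed in this section, where a larger scale appears alongside larger alphas and larger Dks.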