CLIP Models

CLIP stands for Contrastive Language–Image Pre-training. CLIP is a model developed by OpenAI and released in January 2021. It was designed to show how new visual concepts can be learned by pre-training on natural-language supervision, and it notably exhibits the same kind of "zero-shot" capabilities as GPT. Here, we examine the single CLIP model available on HuggingFace, openai/clip-vit-base-patch32.
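CLIP's zero-shot classification works by embedding an image and a set of candidate captions into a shared space, then picking the caption with the highest (temperature-scaled) cosine similarity. A minimal numpy sketch of that scoring step, using random stand-in embeddings in place of the real encoder outputs (the 512-dim size and the ~100 logit scale match the ViT-B/32 setup, but everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for CLIP's encoder outputs: one image embedding and one
# text embedding per candidate caption (512 dims, as in ViT-B/32).
image_emb = rng.standard_normal(512)
text_embs = rng.standard_normal((3, 512))   # e.g. "a cat", "a dog", "a car"

def normalize(x):
    # Project embeddings onto the unit sphere so the dot product
    # below is a cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarities in the shared space, scaled by a temperature
# (CLIP learns this logit scale; it ends up near 100 after training).
logits = 100.0 * normalize(text_embs) @ normalize(image_emb)

# Numerically stable softmax over the candidate captions.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
best = int(np.argmax(probs))   # index of the best-matching caption
```

In the real pipeline, `image_emb` and `text_embs` would come from CLIP's image and text encoders; the scoring logic is the same.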

Notice that this model has many alphas larger than 6, across a wide range of layers. Generally speaking, since this is a Transformer model (specifically a Vision Transformer, ViT), the alphas will be larger than in Convolutional (Conv2D) models like ResNet, but we still would not expect so many large alphas. Moreover, if we look at the rand_distance metric, we see many layers with values less than 0.1, which is also unusual. While an interesting research model, we suspect that this specific CLIP model would benefit from training with more data.
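For intuition about the two metrics discussed above: alpha is the power-law exponent fit to the tail of a layer's weight-matrix eigenvalue spectrum, and rand_distance measures how far that spectrum is from the spectrum of a randomized copy of the same weights. The sketch below computes rough stand-ins for both on a random Gaussian matrix, using a simple Hill estimator for alpha and a Jensen-Shannon distance between spectral histograms for rand_distance; the matrix shape, the estimator choices, and the bin count are all illustrative assumptions, not the actual fitting procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def esd(W):
    """Empirical spectral density: eigenvalues of W^T W / N."""
    return np.linalg.svd(W, compute_uv=False) ** 2 / W.shape[0]

def hill_alpha(evals, k=50):
    """Hill estimator of the power-law tail exponent over the k
    largest eigenvalues (a rough stand-in for a full MLE fit)."""
    tail = np.sort(evals)[-k:]
    return 1.0 + k / np.log(tail / tail[0]).sum()

def js_rand_distance(W, bins=100):
    """Jensen-Shannon distance between the spectrum of W and that of
    an element-wise shuffled copy (a rough stand-in for rand_distance;
    a spectrum close to its randomized version gives a value near 0)."""
    ev_w = esd(W)
    W_rand = rng.permutation(W.ravel()).reshape(W.shape)
    ev_r = esd(W_rand)
    hi = max(ev_w.max(), ev_r.max())
    p, _ = np.histogram(ev_w, bins=bins, range=(0.0, hi))
    q, _ = np.histogram(ev_r, bins=bins, range=(0.0, hi))
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

W = rng.standard_normal((256, 1024))   # stand-in for a layer weight matrix
alpha = hill_alpha(esd(W))
dist = js_rand_distance(W)
```

Since this `W` is purely random, its spectrum has no heavy power-law tail (alpha comes out large) and it sits close to its shuffled copy (dist comes out small); a well-trained layer would instead show a small alpha and a large rand_distance.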

Paper: Learning Transferable Visual Models From Natural Language Supervision

CLIP Models Included

CLIP Model Set Plots

CLIP % Randomness Metric Plots