CLIP Models

CLIP stands for Contrastive Language–Image Pre-training. CLIP is a model developed by OpenAI and released in January 2021. It was designed to show how new visual concepts can be learned by pre-training on natural-language supervision, and it notably exhibits the same kind of "zero-shot" capabilities as GPT. Here, we examine the single CLIP model available on HuggingFace, openai/clip-vit-base-patch32.
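CLIP's zero-shot classification works by embedding an image and a set of candidate captions into a shared space, then picking the caption with the highest (temperature-scaled) cosine similarity. A minimal numpy sketch of that scoring step, using random stand-in embeddings in place of the real encoder outputs (the 512-dim size and the ~100 logit scale match the ViT-B/32 setup, but everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for CLIP's encoder outputs: one image embedding and one
# text embedding per candidate caption (512 dims, as in ViT-B/32).
image_emb = rng.standard_normal(512)
text_embs = rng.standard_normal((3, 512))   # e.g. "a cat", "a dog", "a car"

def normalize(x):
    # Project embeddings onto the unit sphere so the dot product
    # below is a cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarities in the shared space, scaled by a temperature
# (CLIP learns this logit scale; it ends up near 100 after training).
logits = 100.0 * normalize(text_embs) @ normalize(image_emb)

# Numerically stable softmax over the candidate captions.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
best = int(np.argmax(probs))   # index of the best-matching caption
```

In the real pipeline, `image_emb` and `text_embs` would come from CLIP's image and text encoders; the scoring logic is the same.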

Notice that this model has many alphas larger than 6, across a wide range of layers. Generally speaking, since this is a Transformer model (specifically a Vision Transformer, ViT), the alphas will be larger than in Convolutional (Conv2D) models like ResNet, but we still would not expect so many large alphas. Moreover, if we look at the rand_distance metric, we see many layers with values less than 0.1, which is also unusual. While an interesting research model, we suspect that this specific CLIP model would benefit from training with more data.
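For intuition about the two metrics discussed above: alpha is the power-law exponent fit to the tail of a layer's weight-matrix eigenvalue spectrum, and rand_distance measures how far that spectrum is from the spectrum of a randomized copy of the same weights. The sketch below computes rough stand-ins for both on a random Gaussian matrix, using a simple Hill estimator for alpha and a Jensen-Shannon distance between spectral histograms for rand_distance; the matrix shape, the estimator choices, and the bin count are all illustrative assumptions, not the actual fitting procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def esd(W):
    """Empirical spectral density: eigenvalues of W^T W / N."""
    return np.linalg.svd(W, compute_uv=False) ** 2 / W.shape[0]

def hill_alpha(evals, k=50):
    """Hill estimator of the power-law tail exponent over the k
    largest eigenvalues (a rough stand-in for a full MLE fit)."""
    tail = np.sort(evals)[-k:]
    return 1.0 + k / np.log(tail / tail[0]).sum()

def js_rand_distance(W, bins=100):
    """Jensen-Shannon distance between the spectrum of W and that of
    an element-wise shuffled copy (a rough stand-in for rand_distance;
    a spectrum close to its randomized version gives a value near 0)."""
    ev_w = esd(W)
    W_rand = rng.permutation(W.ravel()).reshape(W.shape)
    ev_r = esd(W_rand)
    hi = max(ev_w.max(), ev_r.max())
    p, _ = np.histogram(ev_w, bins=bins, range=(0.0, hi))
    q, _ = np.histogram(ev_r, bins=bins, range=(0.0, hi))
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

W = rng.standard_normal((256, 1024))   # stand-in for a layer weight matrix
alpha = hill_alpha(esd(W))
dist = js_rand_distance(W)
```

Since this `W` is purely random, its spectrum has no heavy power-law tail (alpha comes out large) and it sits close to its shuffled copy (dist comes out small); a well-trained layer would instead show a small alpha and a large rand_distance.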

Paper: Learning Transferable Visual Models From Natural Language Supervision

CLIP Models Included

CLIP Model Set Plots

CLIP % Randomness Metric Plots