CLIP stands for Contrastive Language–Image Pre-training. It is a model developed by OpenAI and released in January 2021, designed to show how a model can learn new visual concepts by pre-training with natural language supervision. Notably, it exhibits the same kind of "zero-shot" capabilities as GPT. Here, we examine the single CLIP model available on HuggingFace, openai-clip-vit-base-patch32.
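These per-layer metrics can be computed with the open-source weightwatcher tool; the sketch below is an assumed workflow (loading the HuggingFace checkpoint and running the analysis with randomize=True, which is what produces the rand_distance column), not the exact script used here.

```python
# Minimal sketch (assumed workflow): analyze the HuggingFace CLIP checkpoint
# with weightwatcher to get per-layer alpha and rand_distance values.
import weightwatcher as ww
from transformers import CLIPModel

# Load the CLIP ViT-B/32 weights from the HuggingFace hub
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# randomize=True compares each layer's ESD to that of a randomized
# version of the same weight matrix, adding the rand_distance column
watcher = ww.WeightWatcher(model=model)
details = watcher.analyze(randomize=True)

print(details[["alpha", "rand_distance"]])
```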
Notice that this model has many alphas larger than 6, across a wide range of layers. Generally speaking, since this is a Transformer model (i.e., a Vision Transformer, or ViT), the alphas will be larger than in Convolutional (Conv2D) models like ResNet, but we still would not expect so many large alphas. Moreover, if we look at the rand_distance metric, we see many layers with values less than 0.1, which is also unusual. While CLIP is an interesting research model, we suspect that this specific model would perform better if trained with more data.
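For readers who want to check these observations themselves, the flagged layers can be pulled out of the details DataFrame from the sketch above (again assuming the weightwatcher column names):

```python
# Count the layers flagged as unusual, continuing from the details
# DataFrame produced in the sketch above (assumed workflow).
large_alpha = details[details.alpha > 6]
low_rand_distance = details[details.rand_distance < 0.1]

print(f"layers with alpha > 6:           {len(large_alpha)}")
print(f"layers with rand_distance < 0.1: {len(low_rand_distance)}")
```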