CLIP and its variants

As mentioned in the last post, scaling model size and data size is key to improving the performance of transformer-based models. However, collecting large-scale data is not always easy. In this post, I introduce several multimodal learning methods, which extend the training data to other modalities.

Table of contents
Multimodal Learning
CLIP
SAM
BLIP and BLIP-2
ImageBind
LanguageBind
LLaVA
Conclusion
References

Multimodal Learning

image source: Multimodality Helps Unimodality

Owing to dataset bias, most unimodal models struggle to generalize to new tasks or domains. For instance, image classification models trained on ImageNet have difficulty classifying blurred images. However, if we also have the audio that corresponds to a blurred image, its class (a car, for example) may become much easier to predict. As the figure above shows, additional text or audio information can improve a model's discriminative ability.

image source: Multimodality Helps Unimodality

Contrastive learning and the transformer architecture are common choices for exploiting multimodal information. With contrastive learning, we can align multiple modalities and learn a representation that is invariant to the modality. Transformers can map multimodal inputs into a shared semantic representation.
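
To make the alignment idea concrete, here is a minimal sketch of a symmetric contrastive loss over a batch of paired embeddings from two modalities. It is an illustration under stated assumptions, not the implementation of any method covered in this post: the function names, the toy 2-D embeddings, and the temperature value of 0.07 are all chosen for the example, and real systems compute this on encoder outputs with a tensor library.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Contrastive loss between two modalities.

    emb_a[i] and emb_b[i] are assumed to describe the same sample (a positive
    pair); every other combination in the batch is treated as a negative.
    The temperature is an assumed hyperparameter for this sketch.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    # Cosine similarity of every a-vector with every b-vector, scaled.
    logits = [
        [sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy embeddings (hypothetical values): row i of each list is one pair.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[0.9, 0.1], [0.2, 0.8]]

aligned_loss = symmetric_contrastive_loss(image_emb, text_emb)
shuffled_loss = symmetric_contrastive_loss(image_emb, text_emb[::-1])
print(aligned_loss, shuffled_loss)
```

When the pairs are correctly matched, the loss is lower than when the second modality is shuffled, which is the signal that pulls paired samples together and pushes mismatched ones apart in the shared space.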

CLIP

image source: CLIP

SAM

image source: segment-anything

BLIP and BLIP-2

image source: LAVIS

ImageBind

image source: ImageBind

LanguageBind

image source: LanguageBind

LLaVA

image source: LLaVA

Conclusion

Compared with simply collecting more unimodal data, multimodal learning is a promising direction for improving performance. However, fusing unpaired data from different modalities remains a challenging task.

References

[ICML’21] Learning Transferable Visual Models From Natural Language Supervision
[ICCV’23] Segment Anything
[ICML’22] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
[ICML’23] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
[CVPR’23] ImageBind: One Embedding Space To Bind Them All
[ICLR’24] LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
[NeurIPS’23] Visual Instruction Tuning
[CVPR’24] Improved Baselines with Visual Instruction Tuning
[CVPR’23] Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models
