Advanced Vision and Language Models
This course offers a comprehensive overview of vision and language models, focusing on their technical foundations, core architectures, and diverse applications. We explore the key components of state-of-the-art pre-trained and large-scale vision and language models, highlighting the transformer-based architectures that enable training on multimodal datasets. The course surveys the tasks and datasets commonly studied in the research community, providing a practical understanding of the field's benchmarks. It also discusses multimodal representations, fine-tuning strategies, prompt engineering, and in-context learning paradigms for adapting these models to various use cases. We introduce both open-source and proprietary vision and language models, analyzing their generalization and reasoning capabilities while addressing their current limitations. Through this, students will gain a nuanced understanding of both the potential and the challenges of these transformative technologies.
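As a minimal sketch of the adaptation paradigms mentioned above, the snippet below queries a pre-trained vision and language model for visual question answering through the Hugging Face pipeline API; the checkpoint name and image path are illustrative assumptions, not part of the course materials.

```python
# Illustrative sketch: adapting a pre-trained vision-language model to visual
# question answering without task-specific training in this script.
# The checkpoint and image path below are assumptions for demonstration only.
from transformers import pipeline

vqa = pipeline(
    "visual-question-answering",
    model="dandelin/vilt-b32-finetuned-vqa",  # assumed public VQA checkpoint
)

# The natural-language question steers the pre-trained model toward the
# desired output, illustrating prompt-style adaptation to a downstream task.
result = vqa(image="example.jpg", question="How many people are in the picture?")
print(result)  # list of {"answer": ..., "score": ...} candidates
```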
Multimodal LLMs for Data Analytics and Visualization - P1
Multimodal LLMs for Data Analytics and Visualization - P2
Discussion & Demos
This lab provides a practical introduction to working with vision and language models (VLMs). The structure is as follows: We begin by exploring tools for creating initial representations of vision and language data and preparing inputs for VLMs. Next, we work with widely used pre-trained models, such as LXMERT and ViLBERT, to understand their architectures and functionality. We then introduce the CLIP family of models, which are trained with a contrastive loss, and practice fine-tuning and applying them. We also examine large-scale models such as the GPT family and LLaVA, focusing on their capabilities and adaptations for multimodal tasks. Students will engage in showcase tasks such as object and relationship detection, image captioning, visual question answering, and text-to-image generation. The experiments are conducted using PyTorch and the Hugging Face libraries, providing hands-on experience with state-of-the-art tools and techniques in multimodal learning.
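As a minimal sketch of the kind of experiment run in the lab, the snippet below performs zero-shot image-text matching with a CLIP-style model via the Hugging Face Transformers API; the checkpoint name, image path, and candidate captions are illustrative assumptions rather than prescribed lab content.

```python
# Illustrative sketch: zero-shot image-text matching with a contrastively
# trained CLIP-style model. Checkpoint, image path, and captions are assumed
# placeholders for demonstration purposes.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"  # assumed public checkpoint
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

image = Image.open("example.jpg")  # placeholder image path
captions = ["a dog playing in a park", "a plate of food", "a city skyline"]

# The processor tokenizes the captions and preprocesses the image into tensors.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into a probability distribution over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```

The same processor-plus-model pattern extends to fine-tuning: the contrastive similarity scores can be used as logits in a task-specific loss while the backbone weights are updated.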
Round Table (Multimodal LLMs)