SDAIA Winter School 2024

Welcome to SDAIA Winter School 2024

8 - 19 December 2024

Introduction

SDAIA Winter School (SWS 2024) will focus on Multimodal Large Language Models (MM-LLMs) . This winter school is organized by the Saudi Data and AI Authority (SDAIA). The event will be held as a two-week residential research school at the National Center for Artificial Intelligence (NCAI) in Riyadh, KSA, from December 8-19, 2024 .

Teaching Track

From Dec 8 to Dec 12

The teaching track will provide foundational knowledge in AI through expert-led lectures, hands-on labs, and collaborative group discussions. Lectures will focus on key topics such as Multimodal LLMs, Speech Understanding, Speech Generation, and Visual Perception, delivered by global experts. Interactive labs will offer students the chance to apply the concepts learned during lectures.

Research Track

From Dec 8 to Dec 19

The research track will feature four collaborative projects with teams of five to eight researchers:

1. Exploring Direct LLM Document Comprehension: A Comparative Study of Direct Composition and Cascading with OCR

This project investigates the effectiveness of using large language models (LLMs) for direct comprehension of document images, aiming to bypass traditional OCR-to-text preprocessing. By allowing the LLM to analyze raw document images, including text, tables, graphs, and visual elements, we explore whether this approach (termed "direct composition") yields better understanding and insights compared to the conventional cascading approach, where OCR is first applied to extract text before LLM processing. We aim to determine whether direct image-to-LLM comprehension improves interpretative accuracy and information retrieval, especially for visually rich documents, and to assess whether this method serves as a complement or alternative to OCR-based processing.

2. Quran Pronunciation Learning

This project aims to develop a pronunciation assessment pipeline for Quranic Arabic, focusing on the accurate evaluation of letters (consonants) and diacritics (vowels). The pipeline includes designing and preparing datasets for training and evaluation, utilizing recordings in Modern Standard Arabic (MSA) style, including Quranic Arabic readings without Tajweed rules. Both orthographic and phonetic transcriptions will be employed to enable character- and phoneme-level evaluation and feedback.
The evaluation dataset will emphasize Quranic Arabic readings and include a subset with deliberate mispronunciations to assess the system's capability for error detection. Preprocessing will involve transcription, vowelization and phonetic annotation.
Model training will focus on fine-tuning existing speech representation models, such as MMS, Hubert, and wav2vec, with considerations for incorporating synthetic and mispronounced data to improve robustness. Evaluation metrics will include Word Error Rate (WER), Character/Phoneme Error Rate (CER/PER), and specific measures for mispronunciation detection, such as False Rejection Rate (FRR), False Acceptance Rate (FAR), and Diagnosis Error Rate (DER).
The initial phase of the project will concentrate on developing a Quranic pronunciation system without Tajweed rules. Subsequent phases will address the challenges of assessing Tajweed recitation pronunciation.

3. Sign Language Recognition

In this project we will extract features relevant for SLT in a gloss-free system and prepare the trainslation system based on the T5 model and / or SignLLaVa. It has been shown that visually pretrained models do not convey the relevant information for the SignLLaVa system and they need to be aligned with the natural (textual) representation of the language. Hence, we want to design a system that will align these SL features with the language features. The systems will be trained on American SL and Arabic SL. Minor effort will be also put into pre-processing of data with multiple signers present.

4. Zero shot multilingual code-switching speech recognition

How do you estimate performance of a multilingual model in novel multilingual scenarios? We are aiming in the long term to develop synthetic code-switching benchmarks as a proxy-evaluation task, ideally capable of both ranking the performance of various multilingual models and estimating their performance in novel unseen domains. Project outcomes include methods for generating synthetic data for codeswitching that could work for arabic dialects as well as benchmarking different ASR approaches on real and synthetic benchmarks.

Teaching Track

From Dec 8 to Dec 12

Day 1 - Sunday, Dec 8, 2024 (Intro & Speech Recognition/Understanding)

08:00 - 09:00

Registration and breakfast

09:00 - 09:15

Introduction & Welcome

09:15 - 09:45

Goals and Objectives of the Winter School

09:45 - 10:15

Overview on Frederick Jelinek memorial summer workshop on speech and language technology overview

10:15 - 10:45

Historical perspective on Speech Technology

10:45 - 11:00

Break

11:00 - 12:30

Classical and Neural Approaches to Automatic Speech Recognition and Beyond

This lecture will review some basic history of the speech recognition task, introduce the noisy channel model for the ASR task and classical HMM based models and algorithms used in ASR. We will then show how these classical algorithms can be modified and used with modern deep learning techniques to arrive at CTC and transducer based modeling approaches for ASR. We will discuss self-supervised learning on unlabeled data and present a unified view of CTC, other sequence-based objective functions, and most self-supervised learning techniques as instances of mutual information estimation. We will then discuss emerging applications of ASR as a supporting tool for speech translation, summarization, or voice conversion and more.

12:30 - 2:00

Lunch Break

2:00 - 4:00

Speech recognition and understating lab

4:00 - 5:00

Round Table (what does it take to build LLM)

Day 2 - Monday, Dec 9, 2024 (Speech Synthesis and Voice generation)

08:00 - 09:00

Breakfast

09:00 - 09:15

What Lays Forward

09:15 - 10:00

Introduction to Speech Synthesis

10:00 - 10:15

Break

10:15 - 11:00

Advanced voice generation

11:00 - 12:00

Discussion & Demos

1:00 - 2:00

Lunch Break

2:00 - 4:00

Speech synthesis and voice generation lab Lamyaa & Vasista

4:00 - 5:00

Round Table (Deep Fake in Speech Processing)

Day 3 - Tuesday, Dec 10, 2024 (Image & Video Processing)

08:00 - 09:00

Breakfast

09:00 - 09:15

What Lays Forward

09:15 - 10:00

Image processing - Part 1

10:00 - 10:15

Break

10:15 - 11:00

Image processing - Part 2

11:00 - 12:00

Discussion & Demos

12:00 - 1:00

Lunch Break

1:00 - 1:45

Introduction to Vision and Language Models

This course offers a comprehensive overview of vision and language models, focusing on their technical foundations, core architectures, and diverse applications. We explore the key components of state-of-the-art pre-trained and large-scale vision and language models, highlighting the transformer-based architectures that enable training on multimodal datasets. The course overviews the commonly studied tasks and datasets in the research community, providing a practical understanding of the field's benchmarks. It also discusses the multimodal representations, fine-tuning strategies, prompt engineering, and in-context learning paradigms for adapting these models to various use cases. We introduce both open-source and proprietary vision and language models, analyzing their generalization and reasoning capabilities while addressing their current limitations. Through this, students will gain a nuanced understanding of the potential and challenges of these transformative technologies.

2:00 - 4:00

Image processing lab

4:00 - 5:00

Round Table (Vision and Image Processing)

Day 4 - Wednesday, Dec 11, 2024 (MultiModal LLM)

08:00 - 09:00

Breakfast

09:00 - 09:15

What Lays Forward

09:15 - 10:00

Advanced Vision and Language Models

10:45 - 11:15

Break

10:15 - 11:00

Multimodal LLMs for Data Analytics and visualization - P1

11:00 - 11:15

Break

11:15 - 12:00

Multimodal LLMs for Data Analytics and visualization - P2

12:00 - 1:00

Discussion & Demos

This lab provides a practical introduction to working with vision and language models (VLMs). The structure is as follows: We begin by exploring tools for creating initial representations of vision and language data, preparing inputs for VLMs. Next, we work with widely used pre-trained models, such as LXMERT and ViLBERT, to understand their architectures and functionalities. We then introduce the CLIP-based family of models, which leverage contrastive loss for training, and practice their fine-tuning and application. We examine large-scale models like GPT-family and LLAVA, focusing on their capabilities and adaptations for multimodal tasks. Students will engage in showcase tasks such as object and relationship detection, image captioning, visual question answering, and text-to-image generation. The experiments are conducted using PyTorch and the Hugging Face library, providing hands-on experience with state-of-the-art tools and techniques in multimodal learning.

1:00 - 2:00

Lunch Break

3:00 - 4:00

Round Table (MultiModal LLM)

6:00 - 9:00

Social event

Day 5 - Thursday, Dec 12, 2024 (MultiModal LLM)

08:00 - 09:00

Breakfast

09:00 - 09:15

What Lays Forward

09:15 - 10:00

Challenges in Developing Spoken Language Models - P1

This tutorial discusses Spoken Language Models (Spoken LMs), with a particular focus on the unique challenges in developing them, which set them apart from traditional text-based LMs. Key topics include speech representation, pre-training challenges, alignment, and achieving natural interaction. Special emphasis is placed on the importance of discrete tokens for effective speech generation and the role of text data in enhancing the learning process. The talk also explores the potential for generalization in Spoken LMs and strategies to address the issue of catastrophic forgetting. Finally, the tutorial covers methods for evaluating these models, highlighting the need for new benchmarks tailored to spoken LMs.

10:00 - 10:15