Large Generative Models Meet Multimodal Applications (LGM3A)
Workshop at ACM Multimedia 2024
Scope and Topics
This workshop aims to explore the potential of large generative models to revolutionize the way we interact with multimodal information.
A Large Language Model (LLM) is a sophisticated form of artificial intelligence engineered to comprehend and produce natural language text, exemplified by models such as GPT, LLaMA, Flan-T5, ChatGLM, and Qwen.
These models are trained on extensive text datasets and exhibit strong capabilities, including fluent language generation, zero-shot transfer, and In-Context Learning (ICL).
With the recent surge in multimodal content, including images, videos, audio, and 3D models, Large MultiModal Models (LMMs) have advanced significantly.
These models extend conventional LLMs to accept multimodal inputs or produce multimodal outputs, as seen in BLIP, Flamingo, KOSMOS, LLaVA, Gemini, and GPT-4.
Concurrently, several research efforts have explored generating specific modalities, with Kosmos2 and MiniGPT-5 focusing on image generation, and SpeechGPT on speech production.
There are also efforts to integrate LLMs with external tools to achieve near 'any-to-any' multimodal comprehension and generation, as illustrated by projects such as Visual-ChatGPT, ViperGPT, MMREACT, HuggingGPT, and AudioGPT.
Collectively, these models, spanning not only text and image generation but also other modalities, are referred to as large generative models.
This workshop will provide an opportunity for researchers, practitioners, and industry professionals to explore the latest trends and best practices in the field of multimodal applications of large generative models.
We remark that submissions are not limited to the use of such models: the workshop will also explore the challenges and opportunities of integrating large language models with other AI technologies, such as computer vision and speech recognition.
Additionally, the workshop will provide a platform for participants to present their research, share their experiences, and discuss potential collaborations.