News

  • 28 March 2025 - Call for Papers is released.
  • 28 March 2025 - Workshop homepage is now available.

Call for Papers

This workshop intends to 1) provide a platform for researchers to present their latest work and receive feedback from experts in the field, 2) foster discussions on current challenges in multimodal analysis and applications, 3) identify emerging trends and opportunities in the field, and 4) explore their potential impact on future research and development. Potential topics include, but are not limited to:
  • Multimodal data augmentation
  • Multimodal data analysis and understanding
  • Multimodal question answering
  • Multimodal generation
  • Multimodal retrieval augmentation
  • Multimodal recommendation
  • Multimodal summarization and text generation
  • Multimodal agents
  • Multimodal prompting
  • Multimodal continual learning
  • Multimodal fusion and integration of information
  • Multimodal applications/pipelines
  • Multimodal systems management and indexing
  • Multimodal mobile/lightweight deployment
Important dates:
  • Workshop Papers Submission: 11 July 2025
  • Workshop Papers Notification: 01 August 2025
  • Camera-ready Submission: 11 August 2025
  • Conference dates: 27 October 2025 - 31 October 2025
Please note: the submission deadline is 11:59 p.m. Anywhere on Earth (AoE) on the stated deadline date.

Submission

  • Submission Guidelines: Submitted papers (.pdf format) must use the same format and template as the main conference. The manuscript's length is limited to one of two options: a) 4 pages plus a 1-page reference section; or b) 8 pages plus up to 2 pages of references. All papers will be peer-reviewed by experts in the field. Acceptance will be based on relevance to the workshop, scientific novelty, and technical quality.
  • Submission Site: Submission Link

Organizers

  • Zheng Wang (Huawei Singapore Research Center, Singapore)
  • Qianqian Chen (Huawei Singapore Research Center, Singapore)
  • Yiyang Luo (Huawei Singapore Research Center, Singapore)
  • Zhiqiu Ye (Huawei Singapore Research Center, Singapore)
  • Wei Shi (Huawei Singapore Research Center, Singapore)
  • Hanwang Zhang (Nanyang Technological University, Singapore)
  • Tat-Seng Chua (National University of Singapore, Singapore)

Speakers

Keynote 1

Speaker: Ziwei Liu (Nanyang Technological University)

Talk Title: From Multimodal Generative Models to Dynamic World Modeling

Abstract: Beyond the confines of flat screens, multimodal generative models are crucial for creating immersive experiences in virtual reality, not only for human users but also for robotics. Virtual environments or real-world simulators, often composed of complex 3D/4D assets, benefit significantly from the accelerated creation enabled by generative AI. In this talk, we will introduce our latest research progress on multimodal generative models for objects, avatars, scenes, motions, and ultimately dynamic world models.

Short Bio: Ziwei Liu is currently an Associate Professor at Nanyang Technological University, Singapore. His research revolves around computer vision, machine learning, and computer graphics. He has published extensively in top-tier conferences and journals in relevant fields, including CVPR, ICCV, ECCV, NeurIPS, ICLR, SIGGRAPH, TPAMI, TOG, and Nature Machine Intelligence. He is the recipient of the PAMI Mark Everingham Prize, a CVPR Best Paper Award Candidate, the Asian Young Scientist Fellowship, the International Congress of Basic Science Frontiers of Science Award, and MIT Technology Review Innovators Under 35 Asia Pacific. He serves as an Area Chair of CVPR, ICCV, ECCV, NeurIPS, and ICLR, as well as an Associate Editor of IJCV.


Keynote 2

Speaker: Mike Zheng Shou (National University of Singapore)

Talk Title: Video Intelligence in the Era of Multimodal

Bio: Mike Shou is an Assistant Professor under the Presidential Young Professorship at the National University of Singapore. He was previously a Research Scientist at Facebook AI in the Bay Area. He obtained his Ph.D. degree at Columbia University with Prof. Shih-Fu Chang. His research mainly focuses on video understanding and multimodal learning. He was a Best Paper Finalist at CVPR 2022, received a Best Student Paper Nomination at CVPR 2017, and won the EgoVis Distinguished Paper Award 2022/23. His team won 1st place in international challenges including ActivityNet, EPIC-Kitchens, and Ego4D. He is an ST Engineering Distinguished Professor and a Fellow of the National Research Foundation Singapore. He is on the Forbes 30 Under 30 Asia list.


Keynote 3

Speaker: Guosheng Lin (Nanyang Technological University)

Talk Title: Recent Advances in 3D Generation: From 3D Assets to CAD Models

Abstract: In this talk, I will present our latest progress in 3D generative learning guided by text or image input. I will begin with our method for high-quality 3D asset generation from images, where we propose an efficient coarse-to-fine framework that combines compact coarse-level representations and part-aware voxel refinement. I will then introduce our approach for generating parametric CAD models directly from real-world images, eliminating the need for expensive 3D scanning. Together, these works push the frontier of visual generation and bring us closer to practical, scalable 3D modelling for real applications.

Bio: Guosheng Lin is an Associate Professor at the College of Computing and Data Science, Nanyang Technological University. His research interests are in computer vision, with a focus on data-efficient learning and generative learning. He has published over 100 research articles in prestigious venues. He serves as an Associate Editor for the IEEE journals TMM and TCSVT. He also serves as an Area Chair or Senior PC member for flagship conferences such as CVPR, ACM MM, IJCAI, and AAAI.


Keynote 4

Speaker: Hao Fei (National University of Singapore)

Talk Title: On Path to Multimodal Generalist: General-Level and General-Bench

Abstract: AI systems are increasingly capable of handling diverse types of data, such as text, images, and audio. However, many of these multimodal systems excel only at specific tasks or data modalities, lacking the broad adaptability seen in human intelligence. Moreover, the existing evaluation paradigm, which simply assumes that higher performance across tasks indicates stronger MLLM capability, can be problematic. This research introduces two tools/resources: General-Level, a framework that assesses an AI model's ability to integrate and apply knowledge across different tasks and data types; and General-Bench, a comprehensive dataset comprising over 700 tasks and 325,000 examples designed to evaluate this integrative capability. By applying these tools to over 100 existing AI models, we discovered that while some models perform well on individual tasks, they often struggle to transfer knowledge between different types of tasks or data. This indicates a gap in achieving truly general-purpose multimodal AGI. Our work aims to guide the development of more versatile multimodal AI systems that can seamlessly understand and generate multiple forms of data, moving us closer to AI that mirrors human-like general intelligence.

Bio: Hao Fei is a Senior Research Fellow at the National University of Singapore. His research focuses on vision-language understanding and generation, and multimodal large foundation models. He has published over 60 papers in top-tier venues such as IEEE TPAMI, IEEE TKDE, ACM TOIS, AI, ICML, NeurIPS, CVPR, ACL, ICLR, and AAAI, with over 7k citations. He has received many accolades, including the 2022 Outstanding CIPSC PhD Dissertation Award, the 2023 WAIC Rising Star Award, the 2024 WAIC Excellent Young Paper Award, and Stanford's World's Top 2% Scientists list. His representative contributions include NExT-GPT, Vitron, and the Generalist Benchmark. He has organized over 20 international workshops, tutorials, and shared tasks, and frequently serves as an Area Chair/Senior PC member for top conferences including CVPR, ICML, NeurIPS, AAAI, ACL, IJCAI, and ACM MM, as well as an Associate Editor of ACM TALLIP and Neurocomputing.