Speakers
Keynote 1
Prof. Ziwei Liu is a Nanyang Assistant Professor (2020-) at the College of Computing and Data Science, Nanyang Technological University, with MMLab@NTU. Previously, he was a research fellow (2018-2020) at CUHK with Prof. Dahua Lin and a postdoctoral researcher (2017-2018) at UC Berkeley with Prof. Stella Yu. His research interests include computer vision, machine learning and computer graphics. Ziwei received his Ph.D. (2013-2017) from the Multimedia Lab at CUHK, advised by Prof. Xiaoou Tang and Prof. Xiaogang Wang. He has held internships at Microsoft Research and Google Research. Ziwei is the recipient of the MIT Technology Review Innovators Under 35 Asia Pacific, the ICBS Frontiers of Science Award, a CVPR Best Paper Award Candidate and the WAIC Yunfan Award. His work has been transferred to products including Microsoft Pix, SenseGo and Google Clips.
Talk Title: Multi-Modal Generative AI with Foundation Models
Abstract: Generating photorealistic and controllable visual content has been a long-pursued goal of artificial intelligence (AI), with extensive real-world applications. It is also at the core of embodied intelligence. In this talk, I will discuss our work on AI-driven visual content generation of humans, objects and scenes, with an emphasis on combining the power of neural rendering with large multimodal foundation models. Our generative AI framework has shown its effectiveness and generalizability on a wide range of tasks.
Keynote 2
Prof. Mike Zheng Shou is a tenure-track Assistant Professor at the National University of Singapore and a former Research Scientist at Facebook AI in the Bay Area. He holds a PhD from Columbia University in the City of New York, where he worked with Prof. Shih-Fu Chang. He was awarded the Wei Family Private Foundation Fellowship. He was a Best Paper finalist at CVPR'22 and received a Best Student Paper nomination at CVPR'17. His team won first place in multiple international challenges, including ActivityNet 2017, EPIC-Kitchens 2022, and Ego4D 2022 & 2023. He is a Fellow of the National Research Foundation (NRF) Singapore and has been named on the Forbes 30 Under 30 Asia list.
Talk Title: Multimodal Video Understanding and Generation
Abstract: Exciting progress has been made in multimodal video intelligence, spanning both understanding and generation, the two pillars of video. Despite this promise, several key challenges remain. In this talk, I will introduce our attempts to address some of them. (1) For understanding, I will share All-in-one, which employs a single unified network for efficient video-language modeling, and EgoVLP, the first video-language pre-trained model for egocentric video. (2) For generation, I will introduce our study of efficient video diffusion models (i.e., Tune-A-Video, 4K GitHub stars). (3) Finally, I will discuss our recent exploration, Show-o, a single LLM that unifies multimodal understanding and generation.
Workshop Schedule
On-site venue: TBD
Date & Time: October 28th Afternoon (Local Melbourne Time, GMT+11)
Zoom: Meeting ID 839 6623 9115, Passcode 20241028
| Time (GMT+11) | Title |
| --- | --- |
| 13:00 - 13:10 | Welcome Message from the Chairs |
| 13:10 - 13:50 | Keynote 1: Multi-Modal Generative AI with Foundation Models |
| 13:50 - 14:30 | Keynote 2: Multimodal Video Understanding and Generation |
| 14:30 - 14:40 | Break |
| 14:40 - 15:00 | Presentation 1: Geo-LLaVA: A Large Multi-Modal Model for Solving Geometry Math Problems with Meta In-Context Learning |
| 15:00 - 15:20 | Presentation 2: Leveraging the Syntactic Structure of the Text Prompt to Enhance Object-Attribute Binding in Image Generation |
| 15:20 - 15:40 | Presentation 3: SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding |
| 15:40 - 16:00 | Presentation 4: Multimodal Understanding: Investigating the Capabilities of Large Multimodal Models for Object Detection in XR Applications |
| 16:00 - 16:20 | Presentation 5: A Method for Efficient Structured Data Generation with Large Language Models |
| 16:20 - 16:30 | Workshop Closing |