
Peiyu Wang

Tech Lead of Multimodal Large Models at Kunlun Wanwei

Peiyu Wang is the Multimodal Tech Lead at Kunlun Wanwei. His main research areas include multimodal understanding and generation, video generation, and world models. He has participated in releasing a series of open-source models, including Skywork-R1V (multimodal reasoning), UniPic (unified understanding and generation), Matrix-Game (world model), and SkyReels (video generation). At the time of release, these models were state-of-the-art open-source models in their respective fields; they have accumulated over one million downloads on Hugging Face and have received widespread recognition from the open-source community. He has deep theoretical knowledge and extensive practical experience in multimodal understanding and generation, video generation, and world models.

Topic

From Video Generation to World Models: The Evolution and Practice of Multimodal Generative Technologies

With the rapid development of generative AI, video generation models are evolving from simple "content generation tools" into world models capable of understanding environments and predicting future states. This talk will draw on several open-source projects I have worked on at Kunlun Wanwei, including video generation models and the Matrix-Game series of world models, to systematically introduce the key technological pathway from video generation to interactive world models. The discussion will focus on long-sequence consistency in video generation and explore how action-conditioned modeling and autoregressive diffusion architectures can extend video generation into world models that respond to interactions in real time and predict environmental changes. Through real engineering cases, the talk will share practical experience in multimodal data construction, model architecture design, and training strategies, and will also discuss future applications of generative AI in virtual worlds, agent training, and robotic simulation.

Over the past few years, video generation models have undergone a significant paradigm shift: from offline content generation toward models that simulate environments as world models. At the video generation stage, we released the SkyReels series, a unified foundation model for video and audio generation. It adopts a Multimodal Diffusion Transformer (MMDiT) architecture that performs dual-stream modeling of video diffusion and audio diffusion, while a unified multimodal encoder jointly understands text, image, and video conditions. This structure enables the model not only to perform text-to-video generation but also tasks such as video editing, video extension, and audio synchronization. For example, given an image of a person playing the guitar, the model can generate continuous video frames along with music audio synchronized with the hand movements, producing a complete audiovisual output.

However, video generation is still fundamentally offline content generation. If we want AI to truly understand and predict environments, we need to introduce the variable of action. To address this, we further proposed Matrix-Game. In this model, video generation is no longer just about predicting the next frame, but about predicting the future state of the environment given specific actions. We constructed a training dataset containing thousands of hours of interaction data, in which each video frame is aligned with keyboard and mouse actions. The model adopts a few-step autoregressive diffusion architecture that injects action embeddings into the generation network during generation, so the model can produce new visual states in real time based on input actions. In practical systems, the model continuously generates video streams at real-time speed and responds instantly to user interactions.
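To make the dual-stream modeling idea concrete, here is a minimal PyTorch sketch of a denoiser whose video and audio streams each cross-attend to a shared set of condition tokens. It is a toy illustration under assumed shapes and hypothetical class names (DualStreamBlock, DualStreamDenoiser), not the SkyReels implementation; a full MMDiT would additionally use joint attention between the streams, timestep conditioning, and real video/audio tokenizers.

```python
# Toy sketch of dual-stream video/audio denoising with a shared multimodal
# condition, loosely following the MMDiT-style description above.
# Class names and dimensions are hypothetical, not the SkyReels code.
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    """One block with separate video and audio streams that both
    cross-attend to a shared text/image/video condition sequence."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.video_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.audio_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video, audio, cond):
        # Each stream attends to the shared condition tokens, then refines
        # itself with its own MLP; residual connections throughout.
        video = video + self.video_attn(video, cond, cond)[0]
        audio = audio + self.audio_attn(audio, cond, cond)[0]
        return video + self.video_mlp(video), audio + self.audio_mlp(audio)

class DualStreamDenoiser(nn.Module):
    """Stack of dual-stream blocks predicting video and audio noise jointly."""
    def __init__(self, dim: int = 512, depth: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList([DualStreamBlock(dim) for _ in range(depth)])

    def forward(self, noisy_video, noisy_audio, cond):
        v, a = noisy_video, noisy_audio
        for blk in self.blocks:
            v, a = blk(v, a, cond)
        return v, a  # predicted noise for each modality

if __name__ == "__main__":
    model = DualStreamDenoiser()
    video = torch.randn(2, 64, 512)   # (batch, video tokens, dim)
    audio = torch.randn(2, 32, 512)   # (batch, audio tokens, dim)
    cond  = torch.randn(2, 16, 512)   # encoded text/image/video condition tokens
    v_out, a_out = model(video, audio, cond)
    print(v_out.shape, a_out.shape)   # torch.Size([2, 64, 512]) torch.Size([2, 32, 512])
```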
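Likewise, the action-conditioned rollout described for Matrix-Game can be sketched as a loop that embeds each keyboard/mouse action, injects it into a few-step generator, and autoregressively predicts the next latent frame from the previous one. All names (ActionEncoder, FewStepFrameGenerator, rollout) and dimensions below are illustrative assumptions rather than the released code.

```python
# Hypothetical sketch of an action-conditioned autoregressive rollout,
# in the spirit of the Matrix-Game description above.
import torch
import torch.nn as nn

class ActionEncoder(nn.Module):
    """Embeds a discrete key press and a continuous mouse delta into one vector."""
    def __init__(self, num_keys: int = 8, dim: int = 256):
        super().__init__()
        self.key_embed = nn.Embedding(num_keys, dim)
        self.mouse_proj = nn.Linear(2, dim)  # (dx, dy) mouse motion

    def forward(self, key: torch.Tensor, mouse: torch.Tensor) -> torch.Tensor:
        return self.key_embed(key) + self.mouse_proj(mouse)

class FewStepFrameGenerator(nn.Module):
    """Stand-in for a few-step diffusion denoiser conditioned on the
    previous latent frame and the current action embedding."""
    def __init__(self, frame_dim: int = 1024, act_dim: int = 256, steps: int = 4):
        super().__init__()
        self.steps = steps
        self.denoise = nn.Sequential(
            nn.Linear(frame_dim + act_dim, frame_dim), nn.GELU(),
            nn.Linear(frame_dim, frame_dim),
        )

    def forward(self, context: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        x = torch.randn_like(context)  # start from noise
        for _ in range(self.steps):    # a handful of refinement passes, not a long schedule
            x = self.denoise(torch.cat([x + context, action], dim=-1))
        return x

@torch.no_grad()
def rollout(generator, encoder, first_frame, actions):
    """Autoregressively predict future latent frames, one per user action."""
    frames = [first_frame]
    for key, mouse in actions:
        act = encoder(key, mouse)
        frames.append(generator(frames[-1], act))  # condition on latest frame + action
    return torch.stack(frames)

if __name__ == "__main__":
    enc, gen = ActionEncoder(), FewStepFrameGenerator()
    frame0 = torch.randn(1, 1024)  # latent of the initial observation
    acts = [(torch.tensor([3]), torch.randn(1, 2)) for _ in range(5)]
    print(rollout(gen, enc, frame0, acts).shape)  # torch.Size([6, 1, 1024])
```

The small number of refinement steps per frame is what makes interaction-driven generation plausible in real time: per-frame latency scales with the step count, so reducing a many-step diffusion process to a few steps trades some fidelity for responsiveness.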
