
Yang Ke

Core Contributor to Mooncake, Technical Expert at Approaching AI

Yang Ke is a Technical Expert at Approaching AI and a core contributor to the open-source project Mooncake. He earned his Ph.D. from the Institute of High-Performance Computing, Department of Computer Science, Tsinghua University, and his bachelor's degree from Beijing University of Posts and Telecommunications. He was a finalist in the 2013 ACM-ICPC World Finals and has published first-author papers in top systems conferences such as SOSP and ASPLOS. His research interests include distributed systems, parallel computing, and AI infrastructure.

Topic

From Monolithic to Disaggregated: How Mooncake Supports Next-Generation Large Model Inference

As large models grow in context length and token consumption explodes, traditional monolithic inference architectures struggle to meet the demands of large-scale services. Mooncake is an open-source, KVCache-centric distributed inference architecture designed for disaggregated deployments. It tackles the compute-utilization bottlenecks caused by bandwidth, latency, and fault-tolerance issues when storing and transferring KVCache, model weights, and other data, enabling large model services to evolve from monolithic deployments toward more heterogeneous, disaggregated, and efficient system architectures. With this goal, Mooncake has grown into a communication and storage infrastructure for large model services, focusing on efficiency limits and system stability at large scale. It provides five core capabilities:

1. Efficient KVCache transfer under parameter/data separation
2. Global KVCache reuse across distributed inference clusters
3. Elastic expert-parallel computation with fault recovery
4. A highly fault-tolerant PyTorch distributed backend
5. Fast model weight updates via tensor-native and zero-copy APIs

Mooncake integrates deeply with mainstream inference engines such as SGLang, vLLM, xLLM, and TensorRT-LLM, and has joined the native PyTorch ecosystem. It has been deployed at multiple enterprises and institutions, continually advancing large model services toward greater scalability, efficiency, and industrial applicability.

Outline:

1. Background: challenges of long-context large model inference, the evolution of inference architectures, and an overview of the Mooncake project
2. In-depth exploration of Mooncake's system architecture, core features, optimizations, and latest developments
3. Applications of Mooncake in the large model open-source ecosystem and in industry deployments

Audience Takeaways:

This session aims to provide practical system design insights and optimization strategies for building next-generation large model inference systems. By attending, participants will:

* Understand the key challenges facing large-scale model inference today and the broader evolution trends in inference architectures.
* Gain a deep understanding of the critical issues in efficient and reliable data transmission and storage under disaggregated architectures.
* Learn how the open-source project Mooncake addresses these challenges through architectural design and system-level optimization.
* Explore Mooncake's latest developments, along with its integration practices and real-world applications within the open-source large model ecosystem.
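
For readers who want a concrete picture of the disaggregated flow described in the abstract, the minimal Python sketch below illustrates the general idea: a prefill worker computes the KVCache once, publishes it to a shared store keyed by a prefix hash, and a decode worker fetches and reuses it instead of recomputing the prefill. All names here (KVCacheStore, prefill, decode, prefix_key) are hypothetical stand-ins for illustration only; they are not Mooncake's actual API, and the "cache" contents are placeholders rather than real attention states.

```python
# Illustrative sketch only: class and function names are hypothetical and do
# not correspond to Mooncake's real interfaces. It models prefill/decode
# disaggregation with a shared, content-addressed KVCache pool.

import hashlib
from typing import Dict, List, Optional


class KVCacheStore:
    """Stand-in for a distributed KVCache pool (DRAM/SSD/RDMA-backed in practice)."""

    def __init__(self) -> None:
        self._blocks: Dict[str, List[float]] = {}

    def put(self, key: str, kv_block: List[float]) -> None:
        self._blocks[key] = kv_block

    def get(self, key: str) -> Optional[List[float]]:
        return self._blocks.get(key)


def prefix_key(prompt_tokens: List[int]) -> str:
    """Content-addressed key so identical prefixes map to the same cache entry."""
    return hashlib.sha256(str(prompt_tokens).encode("utf-8")).hexdigest()


def prefill(prompt_tokens: List[int], store: KVCacheStore) -> str:
    """Prefill worker: compute the KVCache once and publish it to the shared store."""
    key = prefix_key(prompt_tokens)
    if store.get(key) is None:  # global reuse: a cache hit skips recomputation
        kv_block = [float(t) * 0.5 for t in prompt_tokens]  # placeholder "attention KV"
        store.put(key, kv_block)
    return key


def decode(key: str, store: KVCacheStore, steps: int) -> List[int]:
    """Decode worker: pull the prefilled KVCache and generate tokens from it."""
    kv_block = store.get(key)
    assert kv_block is not None, "KVCache must be transferred before decoding"
    return [int(sum(kv_block)) + i for i in range(steps)]  # placeholder decoding


if __name__ == "__main__":
    store = KVCacheStore()
    prompt = [101, 2023, 2003, 1037, 4937]
    handle = prefill(prompt, store)        # runs on the prefill pool
    print(decode(handle, store, steps=4))  # runs on the decode pool, no re-prefill
```

In a real deployment the store would span nodes and the transfer path (bandwidth, latency, fault tolerance) becomes the bottleneck the talk focuses on; the in-process dictionary here only conveys the separation of roles.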
