
Yang Ke

Core Contributor to Mooncake, Technical Expert at Approaching AI

Yang Ke is a Technical Expert at Approaching AI and a core contributor to the open-source project Mooncake. He earned his Ph.D. from the Institute of High-Performance Computing, Department of Computer Science, Tsinghua University, and his bachelor's degree from Beijing University of Posts and Telecommunications. He was a finalist in the 2013 ACM-ICPC World Finals and has published first-author papers in top systems conferences such as SOSP and ASPLOS. His research interests include distributed systems, parallel computing, and AI infrastructure.

Topic

From Monolithic to Disaggregated: How Mooncake Supports Next-Generation Large Model Inference

As large models grow in context length and token consumption explodes, traditional monolithic inference architectures struggle to meet the demands of large-scale services. Mooncake is an open-source, KVCache-centric distributed inference architecture designed for disaggregated deployments. It tackles the compute-utilization bottlenecks caused by bandwidth, latency, and fault-tolerance issues when storing and transferring KVCache, model weights, and other data, enabling large model services to evolve from monolithic deployments toward more heterogeneous, disaggregated, and efficient system architectures. With this goal, Mooncake has grown into a communication and storage infrastructure for large model services, focusing on efficiency limits and system stability at large scale. It provides five core capabilities:

1. Efficient KVCache transfer under parameter/data separation
2. Global KVCache reuse across distributed inference clusters
3. Elastic expert-parallel computation with fault recovery
4. A highly fault-tolerant PyTorch distributed backend
5. Fast model weight updates via tensor-native and zero-copy APIs

Mooncake integrates deeply with mainstream inference engines such as SGLang, vLLM, xLLM, and TensorRT-LLM, and has joined the native PyTorch ecosystem. It has been deployed at multiple enterprises and institutions, continually advancing large model services toward greater scalability, efficiency, and industrial applicability.

Outline:

1. Background: challenges of long-context large model inference, the evolution of inference architectures, and an overview of the Mooncake project
2. In-depth exploration of Mooncake's system architecture, core features, optimizations, and latest developments
3. Applications of Mooncake in the large model open-source ecosystem and in industry deployments

Audience Takeaways:

This session aims to provide practical system design insights and optimization strategies for building next-generation large model inference systems. By attending, participants will:

* Understand the key challenges facing large-scale model inference today and the broader evolution trends in inference architectures.
* Gain a deep understanding of the critical issues in efficient and reliable data transmission and storage under disaggregated architectures.
* Learn how the open-source project Mooncake addresses these challenges through architectural design and system-level optimization.
* Explore Mooncake's latest developments, along with its integration practices and real-world applications within the open-source large model ecosystem.
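
For readers who want a concrete picture of the disaggregated flow described in the abstract, the minimal Python sketch below illustrates the general idea: a prefill worker computes the KVCache once, publishes it to a shared store keyed by a prefix hash, and a decode worker fetches and reuses it instead of recomputing the prefill. All names here (KVCacheStore, prefill, decode, prefix_key) are hypothetical stand-ins for illustration only; they are not Mooncake's actual API, and the "cache" contents are placeholders rather than real attention states.

```python
# Illustrative sketch only: class and function names are hypothetical and do
# not correspond to Mooncake's real interfaces. It models prefill/decode
# disaggregation with a shared, content-addressed KVCache pool.

import hashlib
from typing import Dict, List, Optional


class KVCacheStore:
    """Stand-in for a distributed KVCache pool (DRAM/SSD/RDMA-backed in practice)."""

    def __init__(self) -> None:
        self._blocks: Dict[str, List[float]] = {}

    def put(self, key: str, kv_block: List[float]) -> None:
        self._blocks[key] = kv_block

    def get(self, key: str) -> Optional[List[float]]:
        return self._blocks.get(key)


def prefix_key(prompt_tokens: List[int]) -> str:
    """Content-addressed key so identical prefixes map to the same cache entry."""
    return hashlib.sha256(str(prompt_tokens).encode("utf-8")).hexdigest()


def prefill(prompt_tokens: List[int], store: KVCacheStore) -> str:
    """Prefill worker: compute the KVCache once and publish it to the shared store."""
    key = prefix_key(prompt_tokens)
    if store.get(key) is None:  # global reuse: a cache hit skips recomputation
        kv_block = [float(t) * 0.5 for t in prompt_tokens]  # placeholder "attention KV"
        store.put(key, kv_block)
    return key


def decode(key: str, store: KVCacheStore, steps: int) -> List[int]:
    """Decode worker: pull the prefilled KVCache and generate tokens from it."""
    kv_block = store.get(key)
    assert kv_block is not None, "KVCache must be transferred before decoding"
    return [int(sum(kv_block)) + i for i in range(steps)]  # placeholder decoding


if __name__ == "__main__":
    store = KVCacheStore()
    prompt = [101, 2023, 2003, 1037, 4937]
    handle = prefill(prompt, store)        # runs on the prefill pool
    print(decode(handle, store, steps=4))  # runs on the decode pool, no re-prefill
```

In a real deployment the store would span nodes and the transfer path (bandwidth, latency, fault tolerance) becomes the bottleneck the talk focuses on; the in-process dictionary here only conveys the separation of roles.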
