
Wu Baodong

Vice President of Technology at Infinigence AI

Wu Baodong is Vice President of Technology at Infinigence AI. He received his Ph.D. from the Institute of Computing Technology, Chinese Academy of Sciences, and completed his postdoctoral research at Tsinghua University. He is a recipient of the ACM SIGHPC China Outstanding Doctoral Dissertation Award. His long-term research focuses on high-performance computing, parallel computing, cluster scheduling, and large-model training systems. He has published more than ten papers in leading international conferences and journals such as SC, TPDS, IPDPS, and ICDCS, and received a Best Paper Nomination at ICDCS 2020. Previously, at SenseTime, he led the end-to-end development of an AI computing platform from the ground up, achieving unified management and scheduling of over 20,000 GPUs. He currently leads the core technology R&D for Infinigence AI’s one-stop AI platform, where he has built China’s first unified scheduling platform for heterogeneous, geographically distributed, and cross-domain computing resources. The platform manages more than ten types of chips and over 25,000 PFLOPS of compute capacity, and has enabled large-scale deployment of key technologies including fault-tolerant large-model training systems, distributed inference services, and task failure prediction.

Topic

Agentic Infra–Based AIOps Agent System: Breakthroughs and Practice in Automated Operations for Multi-GPU Clusters

With the explosive growth in compute demand for large models, GPU infrastructure has evolved from single clusters into multi-cluster, cross-region, multi-architecture environments, and operational complexity has grown exponentially. Traditional operations and maintenance approaches hit significant bottlenecks on “black-box” problems such as heterogeneous hardware failures, differences in RDMA network topology, and tightly coupled high-performance storage. The result is often alarm storms, difficult cross-cluster fault localization, and slow response.

To address these challenges, we first restructured thousands of real operational records into a structured format, building an industry-leading GPU operations benchmark and an expert knowledge base for multi-scenario evaluation. On top of our self-developed **Agentic Infra** framework, we then built an **AIOps agent system** tailored to multi-GPU cluster operations. The system orchestrates tasks globally through intelligent agents and includes four specialized agents, for query, alert handling, deployment delivery, and automated inspection, which together provide cross-cluster state awareness and governance. Experimental results show that when multiple clusters fail concurrently, the system reduces MTTR (Mean Time to Repair) by over 90%, moving operations from “manual passive response” to “AI-driven proactive cross-cluster governance.”

This forum presentation focuses on the **Agentic Infra–based AIOps agent system** and analyzes in depth how to overcome the bottlenecks of traditional operations and maintenance in multi-GPU, multi-cluster, cross-region computing environments and achieve intelligent, automated governance.
Attendees will gain a systematic understanding of:

* How to build a GPU operations benchmark and expert knowledge base tailored for real production scenarios;
* How to design a control agent system with global orchestration capabilities based on the Agentic Infra framework;
* How specialized agents for query, alert handling, deployment delivery, and automated inspection collaborate to achieve cross-cluster state awareness and closed-loop fault management.

Through real-world case studies and measured data, the presentation will share the architectural design principles and practical experience behind a **90% reduction in MTTR**, helping participants grasp the key steps for upgrading from “manual passive response” to “AI-driven proactive governance,” and providing replicable methodology and practical references for building the next-generation intelligent operations system for multi-GPU clusters.
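
To make the described division of labor concrete, the following is a minimal illustrative sketch of a control agent that routes tasks to four specialized agents. It is not the Agentic Infra framework or its API: every name here (`Task`, `ControlAgent`, the four handler functions, the cluster names, and the payload strings) is a hypothetical placeholder, and each "agent" is reduced to a plain function so that only the routing idea remains.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """Hypothetical unit of work routed by the control agent."""
    kind: str      # one of: "query", "alert", "deploy", "inspect"
    cluster: str   # target cluster name (illustrative)
    payload: str   # free-form description of the request


# Each specialized agent is modeled as a plain function (an assumption made
# for brevity; real agents would call models, tools, and cluster APIs).
def query_agent(task: Task) -> str:
    return f"[query] {task.cluster}: {task.payload}"


def alert_agent(task: Task) -> str:
    return f"[alert] {task.cluster}: {task.payload}"


def deploy_agent(task: Task) -> str:
    return f"[deploy] {task.cluster}: {task.payload}"


def inspect_agent(task: Task) -> str:
    return f"[inspect] {task.cluster}: {task.payload}"


class ControlAgent:
    """Routes each task to the specialized agent registered for its kind."""

    def __init__(self) -> None:
        self._agents: Dict[str, Callable[[Task], str]] = {
            "query": query_agent,
            "alert": alert_agent,
            "deploy": deploy_agent,
            "inspect": inspect_agent,
        }

    def dispatch(self, task: Task) -> str:
        handler = self._agents.get(task.kind)
        if handler is None:
            raise ValueError(f"no agent registered for task kind {task.kind!r}")
        return handler(task)

    def run(self, tasks: List[Task]) -> List[str]:
        # Tasks may target different clusters; the control agent is the
        # single entry point that sees all of them.
        return [self.dispatch(t) for t in tasks]


if __name__ == "__main__":
    control = ControlAgent()
    for line in control.run([
        Task("alert", "cluster-a", "GPU fault reported"),
        Task("query", "cluster-b", "node status"),
    ]):
        print(line)
```

Keeping a single dispatch point is one plausible way to obtain the cross-cluster view the abstract describes, since every task passes through the same component regardless of which cluster it targets.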

© boolan.com Boolan. All rights reserved.