Chen Zhang

Senior Algorithm Engineer at Moore Threads, Former Senior Algorithm Researcher at Tencent

Responsible for Moore Threads' distributed training research and development. More than 10 years of experience in NLP, focusing on NLP algorithms, distributed training, and large-scale optimization. Participated in Tencent Search business optimization and led the team in the CLUE large model benchmark evaluation, achieving a Top-10 ranking with a small model under 1B parameters. Deep learning veteran and MXNet.cpp committer.

Topic

Exploring Distributed Training Performance Optimization for Large Language Models on Moore Threads Full-Featured GPUs

Introduction: In the wave of large model training, the distributed training capability of domestic full-featured GPUs is achieving unprecedented breakthroughs. The Moore Threads AI Infra team has worked on large language model training technology for nearly three years: it has ranked among the top 10 in the CLUE evaluation, adapted nearly all mainstream model training frameworks, and built a large-scale domestic GPU cluster that achieves industry-leading MFU with the help of FP8 acceleration. It was also the first to complete an efficient adaptation of the DeepSeek model, achieving excellent training performance. In this talk, we will analyze the compatibility advantages of domestic full-featured GPUs in large-scale model training, share core optimization practices from Dense models to MoE models, and discuss breakthrough directions for domestic AI computing hardware in future large-scale training, providing developers with real-world experience and in-depth thinking.

Outline:
1. Domestic GPU AI computing architecture: MUSA's high compatibility and achievements with MT-Megatron and other frameworks
2. Dense model optimization exploration: challenges and optimization strategies for distributed training of dense models
3. MoE model acceleration practice: efficient adaptation and performance optimization of DeepSeek-like MoE models
4. Future outlook: how domestic AI computing hardware can continue to make breakthroughs in large-scale model training
