Chaojun Xiao
Postdoctoral researcher at the Natural Language Processing Lab of Tsinghua University, focusing on efficient large model architectures. He has published more than ten papers as first author or co-first author at top-tier international AI conferences and is a primary contributor to MiniCPM-4, an efficient large model for edge devices. His work has been cited over 3,000 times on Google Scholar. His honors include the First Prize of the Qian Weichang Award for Science and Technology in Chinese Information Processing, the Postdoctoral Innovative Talent Support Program, the Tsinghua Shuimu Scholar title, and the Outstanding Scholarship of the Tencent Rhino-Bird Elite Talent Program.
Topic
MiniCPM: Efficient Large Model for Edge Devices
With the rapid development of artificial intelligence, the demand for deploying large models on edge devices has become increasingly urgent. Traditional large models, however, consume heavy computational resources and decode slowly, making them hard to run in resource-constrained edge environments. MiniCPM, a highly efficient large model optimized specifically for edge devices, addresses this problem through innovations along four dimensions:

- Model architecture: InfLLM v2, a trainable sparse attention mechanism, significantly accelerates both the prefill and decoding stages of long-context processing (see the sketch below).
- Training data: the UltraClean data filtering strategy greatly improves the effectiveness and efficiency of data validation.
- Training algorithms: the Model Wind Tunnel enables efficient search over pretraining strategies, segmented exploration sampling achieves load-balanced reinforcement learning, and BitCPM implements extreme parameter compression through post-training ternary quantization (see the sketch below).
- Inference system: the CPM.cu inference engine integrates sparse attention, model quantization, and speculative sampling, greatly improving inference efficiency (see the sketch below).

On typical edge chips, MiniCPM achieves more than a 5× inference speedup over dense models of the same size.
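To make the sparse-attention idea concrete, here is a minimal sketch of block-sparse attention in the spirit of InfLLM v2: the query scores a coarse summary of every key block, keeps only the highest-scoring blocks, and attends densely within them, so cost scales with the blocks kept rather than the full context. The block size, top-k setting, mean-pooled block summaries, and all function names are illustrative assumptions, not MiniCPM's actual implementation.

```python
import numpy as np

def block_sparse_attention(q, K, V, block_size=64, top_k=4):
    """One decode step of block-sparse attention (illustrative sketch).

    Instead of scoring all past keys, the query first scores a coarse
    summary (mean) of each key block, keeps only the top_k blocks, and
    runs dense attention inside those blocks. Shapes: q is (d,), K and V
    are (seq_len, d). block_size and top_k are hypothetical settings.
    """
    seq_len, d = K.shape
    n_blocks = int(np.ceil(seq_len / block_size))

    # Coarse relevance: score each block by its mean key vector.
    block_scores = np.array([
        q @ K[b * block_size:(b + 1) * block_size].mean(axis=0)
        for b in range(n_blocks)
    ])
    keep = np.argsort(block_scores)[-top_k:]  # indices of the top-k blocks

    # Gather the selected positions and attend densely over them only.
    idx = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, seq_len))
        for b in sorted(keep)
    ])
    scores = (K[idx] @ q) / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V[idx]

# Toy usage: 1024-token context, attend to at most 4 * 64 = 256 positions.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=(64,)), rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
out = block_sparse_attention(q, K, V)
```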
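The ternary quantization behind BitCPM can be illustrated with the common "absmean" recipe: scale each weight matrix by its mean absolute value, then round and clip to {-1, 0, +1}. This is a sketch of the general technique, not BitCPM's actual procedure.

```python
import numpy as np

def ternarize(W, eps=1e-8):
    """Quantize a weight matrix to {-1, 0, +1} with one scale per matrix.

    Follows the common "absmean" recipe: divide by the mean absolute
    weight, round, and clip to the ternary set. An illustrative stand-in
    for BitCPM's procedure, not the actual algorithm.
    """
    scale = np.abs(W).mean() + eps
    W_t = np.clip(np.round(W / scale), -1, 1)
    return W_t.astype(np.int8), scale

def dequantize(W_t, scale):
    # At inference time the ternary matrix multiplies activations using
    # only additions/subtractions; the scale is applied once afterwards.
    return W_t.astype(np.float32) * scale

W = np.random.default_rng(1).normal(size=(4, 4)).astype(np.float32)
W_t, s = ternarize(W)
print(np.unique(W_t))                         # subset of [-1, 0, 1]
print(np.abs(W - dequantize(W_t, s)).mean())  # reconstruction error
```

Each weight then needs under two bits of storage, which is what makes the compression "extreme" relative to 16-bit dense weights.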
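Speculative sampling, one of the techniques integrated in CPM.cu, pairs a cheap draft model with the expensive target model: the draft proposes several tokens, the target verifies them, and every accepted token saves a full target decoding step. The greedy variant below is a self-contained sketch; the callables `draft` and `target` are stand-ins rather than a real CPM.cu API, and production engines verify all proposed positions in a single batched forward pass instead of one call per position.

```python
def speculative_decode_greedy(target, draft, prompt, k=4, max_new=32):
    """Greedy speculative decoding (illustrative sketch).

    `draft` and `target` map a token sequence to its argmax next token.
    Each round: draft k tokens cheaply, verify them with the target, keep
    the longest matching prefix, and append one corrected token on a miss.
    """
    tokens = list(prompt)
    goal = len(prompt) + max_new
    while len(tokens) < goal:
        # 1) Draft k tokens autoregressively with the cheap model.
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Verify with the target model (sequential here for clarity).
        accepted = []
        for t in proposal:
            expected = target(tokens + accepted)
            if expected == t:
                accepted.append(t)
            else:
                accepted.append(expected)  # one free corrected token
                break
        tokens.extend(accepted)
    return tokens[:goal]

# Toy usage: the target counts up by 1; the draft is right 3 times in 4,
# so most rounds accept several tokens per expensive "target" pass.
target_fn = lambda seq: seq[-1] + 1
draft_fn = lambda seq: seq[-1] + (2 if len(seq) % 4 == 0 else 1)
print(speculative_decode_greedy(target_fn, draft_fn, [0], k=4, max_new=8))
```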