Liu Xiao
Senior Researcher at Microsoft Research Asia (MSRA).
Senior Researcher at Microsoft Research Asia (MSRA) in the AI Inference Group, working on natural language processing, large language models, and inference technologies, with the goal of carrying research systematically from fundamental algorithms to real-world deployment. He received both his bachelor's and Ph.D. degrees from Beijing Institute of Technology, and his doctoral dissertation received the 2023 Excellent Ph.D. Dissertation Award from the Chinese Information Processing Society (CIPS). He has published over 40 papers at top international conferences and journals in natural language processing and machine learning. His research was recognized as the Best Paper Award runner-up at NeurIPS 2024 and has been applied in core products such as Microsoft Bing Search. He also serves as an Area Chair for leading international conferences including ACL, ICML, NeurIPS, and EMNLP.
Topic
Rethinking Data in Large Language Model Pretraining — Data Selection, Data Mixing, and Efficient Training
The capabilities of current large language models largely rely on large-scale training with massive datasets, but not all data contribute equally to model learning. This talk revisits the pretraining of large language models from a **data-centric** perspective, presenting our work on data selection, data mixing, and efficient model training. These studies show that more effective data filtering, adaptive data balancing, and improved training procedures can enhance model performance while simultaneously improving training efficiency; illustrative sketches of the data-selection and data-mixing ideas follow the outline below.

Outline:
a) Introduction
   i. The trend and challenges of large-scale pretraining for large language models
   ii. From data scale to data quality: a data-centric perspective
b) Data Selection
   i. Token importance estimation (Rho-1)
   ii. Long-range information in extended contexts
   iii. Diversity-aware data selection
c) Data Mixing
   i. Balancing multi-domain training data
   ii. Data Mixing Agent: learning-based optimization for data allocation
d) Efficient Training
   i. Sigma-MoE-Tiny
e) Summary and Outlook

Through this talk, the audience will be able to:
1. Understand the data-centric perspective in large language model pretraining and its significance
2. Learn how to identify more valuable training data through data selection
3. Grasp the key issues in multi-domain data mixing and learning-based optimization methods
4. Recognize the role of data-model co-design in improving training efficiency
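To make the token-level data-selection idea in outline item b.i concrete, here is a minimal sketch of a selective language-modeling loss in the spirit of Rho-1: tokens whose loss under the model being trained most exceeds their loss under a frozen reference model are kept, and the remaining tokens are masked out of the objective. The function name, tensor shapes, and the keep ratio are illustrative assumptions, not the released Rho-1 implementation.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(logits, ref_logits, labels, keep_ratio=0.6):
    """Token-selective language-modeling loss (illustrative sketch only).

    logits:     (batch, seq, vocab) from the model being trained
    ref_logits: (batch, seq, vocab) from a frozen reference model
    labels:     (batch, seq) next-token targets
    keep_ratio: fraction of tokens retained in the loss (assumed value)
    """
    vocab = logits.size(-1)

    # Per-token cross-entropy under the training model and the reference model.
    loss_train = F.cross_entropy(logits.view(-1, vocab), labels.view(-1), reduction="none")
    with torch.no_grad():
        loss_ref = F.cross_entropy(ref_logits.view(-1, vocab), labels.view(-1), reduction="none")

    # Excess loss: how much harder each token is for the current model than for the reference.
    excess = loss_train.detach() - loss_ref

    # Keep only the top keep_ratio fraction of tokens by excess loss.
    k = max(1, int(keep_ratio * excess.numel()))
    threshold = torch.topk(excess, k).values.min()
    mask = (excess >= threshold).float()

    # Average the training loss over the selected tokens only.
    return (loss_train * mask).sum() / mask.sum()
```

In a training loop, one would compute `logits` from the model being trained and `ref_logits` from the frozen reference model on the same batch, then back-propagate the returned loss; only the selected tokens contribute gradients.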
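For the data-mixing part (outline item c.ii), the sketch below shows the general flavor of learning-based domain-weight adjustment using a toy multiplicative-weights update driven by per-domain validation progress. It is a hedged stand-in for the idea, not the Data Mixing Agent algorithm itself; the update rule, learning rate, and domain statistics are all assumptions made for illustration.

```python
import numpy as np

def update_domain_weights(weights, val_losses, prev_losses, lr=0.5):
    """Toy learning-based data-mixing step (illustrative only).

    Domains whose validation loss improved the least this round receive more
    sampling weight, via an exponentiated update followed by renormalization.
    """
    improvement = prev_losses - val_losses          # per-domain loss reduction
    new_w = weights * np.exp(-lr * improvement)     # slow-improving domains grow
    return new_w / new_w.sum()                      # project back onto the simplex

# Example: three domains (web, code, math) with equal starting weights.
w = np.array([1/3, 1/3, 1/3])
prev = np.array([2.10, 1.80, 2.50])   # validation loss before this round
curr = np.array([2.00, 1.78, 2.48])   # validation loss after this round
w = update_domain_weights(w, curr, prev)
print(w)  # slowly improving domains receive slightly more weight
```

A learned agent would replace this hand-written rule with a policy that maps training-state features to the next round's domain allocation, but the interface (current weights in, rebalanced weights out) is the same.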