Kai Qiu

Senior Researcher, Microsoft Research Asia, Logic-RL Contributor

Kai Qiu is a senior researcher at Microsoft Research Asia and a graduate of the University of Chinese Academy of Sciences. His research interests include image and video generation, post-training of large multimodal models, and reinforcement learning for large language models. He has published in conferences and journals including CVPR, ICCV, AAAI, ACM Multimedia, and Pattern Recognition, and his research has been applied in several Microsoft products, including Bing Ads and Windows Copilot, with related technologies granted Chinese and US patents. He serves as a reviewer for CVPR, ICCV, ECCV, ACM MM, AAAI, IJCV, and other conferences and journals.

Topic

Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

We explore the potential of rule-based reinforcement learning (RL) for large reasoning models. Inspired by the success of DeepSeek-R1, we analyze reasoning dynamics using synthesized logic puzzles as training data; these puzzles are ideal because their complexity is controllable and their answers are simple to verify. The study presents several key technical contributions: a system prompt that emphasizes the thinking and answering process, a strict-format reward function that penalizes shortcut outputs, and a straightforward training recipe that achieves stable convergence. After training on only 5,000 logic problems, our 7B model generalizes to the challenging math benchmarks AIME and AMC.

Outline:
1. Introduction: the contribution of DeepSeek-R1 and the inspiration it provides for rule-based reinforcement learning.
2. Background and Motivation: limitations of existing mathematical datasets for reasoning training; why Knights and Knaves (K&K) logic puzzles, with their controllable difficulty, were chosen as training data.
3. Technical Contributions: the design of the system prompt, the format reward function, and the training scheme; the REINFORCE++ algorithm and its application and improvements in model training (illustrative sketches of these components follow this outline).
4. Experiments and Results: the natural scaling of the model's reasoning steps during training; performance improvements on the AIME and AMC math benchmarks.
5. Findings and Insights: the relationship between response length and reasoning quality; the correlation between the frequency of "think"-related words and performance; the differences between SFT and RL in memorization versus generalization; the impact of cold starts on training dynamics.
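
To make the strict-format reward concrete, below is a minimal Python sketch of a rule-based reward of the kind the talk describes: the completion earns a positive format reward only when the reasoning is enclosed in think tags and the final answer in answer tags, and the answer is only scored when it can be parsed from that template. The tag names, scores, and penalty values here are illustrative assumptions, not the authors' exact settings.

```python
import re

# Assumed output template: <think>reasoning</think><answer>final answer</answer>
FORMAT_PATTERN = re.compile(
    r"^<think>.+?</think>\s*<answer>.+?</answer>\s*$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """Reward well-formed completions; penalize shortcut or malformed outputs."""
    if FORMAT_PATTERN.match(completion.strip()):
        return 1.0   # reasoning inside <think>, final answer inside <answer>
    return -1.0      # skipped the thinking step or broke the template

def answer_reward(completion: str, gold: str) -> float:
    """Score the answer only if it can be extracted from the template (illustrative values)."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m is None:
        return -2.0  # no parsable answer at all
    return 2.0 if m.group(1).strip() == gold.strip() else -1.5
```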
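The outline also mentions REINFORCE++. As a rough illustration of how such a critic-free REINFORCE variant can compute advantages, the sketch below adds a per-token KL penalty to the scalar sequence reward, takes the return-to-go, and normalizes it across the batch in place of a learned value function. The function name, discounting, KL estimator, and normalization scope are all assumptions, not the exact training recipe used in Logic-RL.

```python
import torch

def reinforce_pp_advantages(
    rewards: torch.Tensor,        # (batch,) scalar rewards (e.g. format + answer)
    kl_per_token: torch.Tensor,   # (batch, seq) per-token KL(policy || reference)
    response_mask: torch.Tensor,  # (batch, seq) 1.0 on response tokens, 0.0 elsewhere
    kl_coef: float = 0.01,
    eps: float = 1e-8,
) -> torch.Tensor:
    # Token-level reward: a KL penalty on every response token, plus the
    # sequence-level reward added at the last response token.
    token_rewards = -kl_coef * kl_per_token * response_mask
    positions = torch.arange(response_mask.size(-1), device=response_mask.device)
    last_idx = (positions * response_mask.long()).argmax(dim=-1)
    token_rewards[torch.arange(rewards.size(0)), last_idx] += rewards

    # Return-to-go with discount 1.0 (reverse cumulative sum over the sequence).
    returns = torch.flip(
        torch.cumsum(torch.flip(token_rewards, dims=[-1]), dim=-1), dims=[-1]
    )

    # Critic-free baseline: normalize returns over all response tokens in the batch.
    flat = returns[response_mask.bool()]
    advantages = (returns - flat.mean()) / (flat.std() + eps)
    return advantages * response_mask
```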
