Tech Lead of Multimodal Large Models at Kunlun Wanwei
Responsible for multimodal reasoning, multimodal reward models, and unified understanding-and-generation tasks. The **Skywork-R1V** series of multimodal reasoning models has accumulated nearly 100K downloads on Hugging Face within a single month.
Topic
Multimodal Reasoning and Unified Models
**Skywork-R1V**: The world's first industrial multimodal chain-of-thought reasoning model, designed to transfer textual reasoning capabilities to visual tasks. Its architecture connects a vision encoder to a text reasoning model through a lightweight visual projector (see the projector sketch below), while a hybrid optimization framework (iterative SFT + GRPO) strengthens cross-modal alignment and adaptive chain-of-thought distillation improves efficiency. With 38 billion parameters, it achieves 69.0 on MMMU and 67.5 on MathVista, retains leading textual-reasoning performance, and lays the foundation for unified multimodal reasoning.

**Skywork-R1V2**: Improves hybrid reinforcement learning: the SSB (Selective Sample Buffer) mechanism mitigates "advantage vanishing" in GRPO (see the advantage sketch below), the MPO strategy integrates reward-model signals with rule-based constraints, and calibrated reward thresholds reduce hallucinations. Overall performance rises to 73.6 on MMMU and 74.0 on MathVista, narrowing the gap with closed-source models while balancing specialized and general capabilities.

**Skywork-R1V3**: Upgrades cross-modal fusion and reinforcement learning with cold-start RL, key-reasoning entropy discrimination (see the entropy sketch below), an optimized visual connector, and cross-modal causal modeling. Trained on 25,000+ samples, it speeds up inference sixfold and compresses reasoning chains to one sixth of their original length. It achieves 76.0 on MMMU, surpassing some closed-source models, approaching the level of junior human experts, and ranking first among open-source models on multiple metrics.
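To make the projector idea concrete, below is a minimal sketch of a lightweight visual projector as a small MLP adapter between a frozen vision encoder and a frozen LLM. The dimensions, class name, and two-layer design are illustrative assumptions, not details taken from Skywork-R1V.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Two-layer MLP mapping vision-encoder features into the LLM's
    embedding space. Dimensions here are assumptions for illustration."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from a frozen encoder
        # returns:      (batch, num_patches, llm_dim) visual tokens that are
        #               concatenated with text embeddings before the LLM
        return self.proj(vision_feats)

# Only the projector is trained; the encoder and LLM stay frozen,
# which is what makes the adapter "lightweight".
projector = VisualProjector()
fake_feats = torch.randn(2, 256, 1024)   # stand-in for encoder output
visual_tokens = projector(fake_feats)    # shape: (2, 256, 4096)
```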
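The GRPO "advantage vanishing" issue can be shown in a few lines: GRPO normalizes each reward against its sampled group, so when every response in a group receives the same reward (all correct or all wrong), the advantages collapse to zero and the prompt contributes no gradient. The buffer class below is a hypothetical illustration of the SSB idea (retaining groups that still carry signal for replay), not the paper's implementation.

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantage: normalize each reward against its group.
    rewards: (G,) rewards for G responses sampled from one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# "Advantage vanishing": identical rewards give all-zero advantages.
uniform = group_advantages(torch.ones(8))                               # all ~0
mixed = group_advantages(torch.tensor([1., 1., 0., 0., 0., 0., 0., 0.]))  # informative

# A minimal Selective Sample Buffer sketch: keep groups whose advantages
# are non-zero and replay them when fresh batches carry little signal.
# (Selection and replay policies here are assumptions.)
class SelectiveSampleBuffer:
    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.items: list = []

    def maybe_add(self, prompt, responses, advantages: torch.Tensor):
        if advantages.abs().max() > 0:          # group is still informative
            self.items.append((prompt, responses, advantages))
            self.items = self.items[-self.capacity:]

    def replay(self, k: int):
        return self.items[-k:]                  # re-inject high-signal groups
```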
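For the entropy signal in Skywork-R1V3, here is a sketch of per-token predictive entropy computed from language-model logits; how "key reasoning" tokens are identified and how the resulting score is used are assumptions made purely for illustration.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token predictive entropy from LM logits: (seq, vocab) -> (seq,)."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

# Hypothetical usage: average entropy over tokens flagged as belonging to
# critical reasoning steps (the flagging heuristic is an assumption), and
# use the scalar, e.g., to compare candidate checkpoints.
logits = torch.randn(32, 50_000)          # stand-in for model output
is_key_step = torch.zeros(32, dtype=torch.bool)
is_key_step[10:20] = True                 # pretend these tokens are "key"
key_entropy = token_entropy(logits)[is_key_step].mean()
```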