Aug 12, 2025 What Makes a Reward Model a Good Teacher? An Optimization Perspective / The Accuracy Paradox in RLHF: When Better Reward Models Don’t Yield Better Language Models
Jul 15, 2025 Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Jun 03, 2025 Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
Mar 25, 2025 ReFT: Reasoning with Reinforced Fine-Tuning
Jan 02, 2025 d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
Sep 23, 2024 Training Language Models to Self-Correct via Reinforcement Learning
Jul 23, 2024 Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
Jul 23, 2024 Step-DPO: Step-wise Preference Optimization for Long-Chain Reasoning of LLMs
Apr 02, 2024 Preference-free Alignment Learning with Regularized Relevance Reward