rlhf

an archive of posts with this tag

Aug 19, 2025 On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
Aug 12, 2025 What Makes a Reward Model a Good Teacher? An Optimization Perspective / The Accuracy Paradox in RLHF: When Better Reward Models Don’t Yield Better Language Models
Apr 15, 2025 Universal and Transferable Adversarial Attacks on Aligned Language Models
Apr 08, 2025 Reasoning Models Don’t Always Say What They Think
Mar 25, 2025 ReFT: Reasoning with Reinforced Fine-Tuning
Feb 18, 2025 DeepSeek v3
Oct 17, 2024 Rule Based Rewards for Language Model Safety
Jul 23, 2024 Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
Jul 23, 2024 Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
Jun 11, 2024 Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
May 28, 2024 SimPO: Simple Preference Optimization with a Reference-Free Reward
May 27, 2024 Understanding the performance gap between online and offline alignment algorithms
Apr 23, 2024 ORPO: Monolithic Preference Optimization without Reference Model
Apr 02, 2024 Preference-free Alignment Learning with Regularized Relevance Reward
Mar 05, 2024 Beyond Memorization: Violating Privacy Via Inferencing With LLMs
Feb 27, 2024 Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Feb 06, 2024 Self-Rewarding Language Models
Sep 19, 2023 The CRINGE Loss: Learning what language not to model
Jun 29, 2023 QLoRA: Efficient Finetuning of Quantized LLMs
Jun 22, 2023 The False Promise of Imitating Proprietary LLMs
Jan 26, 2023 Task-aware Retrieval with Instructions