alignment

an archive of posts with this tag

Aug 19, 2025 ON THE GENERALIZATION OF SFT: A REINFORCEMENT LEARNING PERSPECTIVE WITH REWARD RECTIFICATION
Aug 12, 2025 The Accuracy Paradox in RLHF: When Better Reward Models Don’t Yield Better Language Models / What Makes a Reward Model a Good Teacher? An Optimization Perspective
Jun 10, 2025 Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction
Apr 15, 2025 Universal and Transferable Adversarial Attacks on Aligned Language Models
Apr 08, 2025 Reasoning Models Don’t Always Say What They Think
Mar 25, 2025 ReFT: Reasoning with Reinforced Fine-Tuning
Mar 04, 2025 Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
Sep 23, 2024 Training Language Models to Self-Correct via Reinforcement Learning
Sep 09, 2024 Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
Sep 02, 2024 Many-shot jailbreaking
Aug 13, 2024 Knowledge conflict survey
Jul 23, 2024 Step-DPO: Step-wise preference optimization for long-chain reasoning of LLMs
Jul 02, 2024 RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs
Jun 11, 2024 Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
May 28, 2024 SimPO: Simple Preference Optimization with a Reference-Free Reward
May 27, 2024 Understanding the performance gap between online and offline alignment algorithms
May 21, 2024 LLAMA PRO: Progressive LLaMA with Block Expansion
Apr 30, 2024 Training diffusion models with reinforcement learning
Apr 23, 2024 ORPO: Monolithic Preference Optimization without Reference Model
Apr 02, 2024 Preference-free Alignment Learning with Regularized Relevance Reward
Mar 05, 2024 Beyond Memorization: Violating Privacy Via Inferencing With LLMs
Feb 06, 2024 Self-Rewarding Language Models
Jan 30, 2024 Lion: Adversarial Distillation of Proprietary Large Language Models
Jan 16, 2024 BENCHMARKING COGNITIVE BIASES IN LARGE LANGUAGE MODELS AS EVALUATORS
Oct 31, 2023 In-Context Learning Learns Label Relationships but Is Not Conventional Learning
Oct 31, 2023 A Survey on Large Language Model based Autonomous Agents
Sep 12, 2023 A Systematic Study of Knowledge Distillation for Natural Language Generation with Pseudo-Target Training