RLHF Reading Notes 1
Glossed Overview: RLHF for LLMs
Reinforcement Learning from Human Feedback (RLHF) is a technique for integrating human preferences into AI systems, particularly for objectives that are difficult to specify explicitly.
The core RLHF process involves three steps:
- training a capable language model
- collecting human preference data to train a reward model
- optimizing the language model using reinforcement learning guided by the reward model.
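As a mental model, these three steps compose into one pipeline. The sketch below is only a schematic of that flow; the three stage functions are hypothetical callables supplied by the caller, not any concrete library API.

```python
from typing import Any, Callable

def rlhf_pipeline(
    base_model: Any,
    collect_preferences: Callable[[Any], Any],  # humans compare the policy's outputs
    fit_reward_model: Callable[[Any], Any],     # learn a scorer from those comparisons
    rl_optimize: Callable[[Any, Any], Any],     # e.g. PPO against the learned scorer
) -> Any:
    # Step 1: start from a capable (pre-trained) language model.
    policy = base_model
    # Step 2: collect human preference data on the policy's outputs and fit a reward model.
    comparisons = collect_preferences(policy)
    reward_model = fit_reward_model(comparisons)
    # Step 3: optimize the policy with RL, guided by the learned reward model.
    return rl_optimize(policy, reward_model)
```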
RLHF is a crucial part of “post-training” for LLMs, a set of techniques to enhance model usability, including:
- Supervised Instruction Finetuning for learning the features of language that underpin the desired output format and basic instruction-following ability,
- Preference Finetuning for learning output style and subtler alignment with human preferences, and
- Reinforcement Finetuning for further performance boosts in verifiable domains.
Further Background Readings
Deep reinforcement learning from human preferences
Challenge: Difficulty in Specifying Reward Functions
Manually designing reward functions for complex tasks is incredibly difficult and often leads to unintended or suboptimal agent behavior.
Proposal: Learning from Human Preferences Instead of Explicit Rewards
Instead of trying to define a reward function directly, learn a reward function from human judgments about which behavior is better.
Details:
- Generate Trajectory Pairs: The agent performs the task and generates pairs of trajectories (sequences of actions and states).
- Human Preference Judgments: Humans are presented with these pairs of trajectories and asked to choose which one they prefer (which trajectory is “better” according to some criteria). Crucially, humans don’t need to explicitly define why one is better, just to indicate their preference.
- Reward Model Training: These human preference judgments are used to train a reward model. This reward model learns to predict which trajectory a human would prefer. Essentially, it learns to approximate the underlying, implicit reward function based on human feedback (a minimal version of the training loss is sketched after this list).
- Reinforcement Learning with Learned Reward Model: The trained reward model is then used as the reward signal for a standard deep reinforcement learning algorithm (like policy gradients or Q-learning). The agent is trained to maximize the reward predicted by the reward model, which in turn is aligned with human preferences.
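The reward-model step above is usually implemented with a pairwise (Bradley-Terry-style) loss: the model assigns each trajectory a scalar score, and the probability that one is preferred is a sigmoid of the score difference. A minimal PyTorch sketch, assuming a `reward_model` callable that maps a batch of trajectory segments to scalar scores (the callable and tensor shapes are illustrative, not the paper's code):

```python
import torch.nn.functional as F

def preference_loss(reward_model, traj_a, traj_b, prefers_a):
    """Pairwise preference loss for reward-model training.

    traj_a, traj_b: batches of trajectory segments (shape depends on the task)
    prefers_a: float tensor of shape (batch,), 1.0 where humans preferred traj_a
    """
    r_a = reward_model(traj_a)  # predicted scalar score per segment, shape (batch,)
    r_b = reward_model(traj_b)
    # P(traj_a preferred) = sigmoid(r_a - r_b); train with cross-entropy
    # against the human judgment.
    return F.binary_cross_entropy_with_logits(r_a - r_b, prefers_a)
```

In the deep-RL setting, the score of a segment is, as I understand the paper, the sum of per-step predicted rewards, so the same learned reward function can then drive a standard RL algorithm.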
Learning to summarize from human feedback
Challenge: Limitations of Traditional Summarization Metrics and Methods.
Traditional automatic summarization methods, often optimized using metrics like ROUGE, don’t always align well with human preferences for good summaries. ROUGE primarily measures n-gram overlap with reference summaries, which can be a crude proxy for summary quality. Furthermore, directly optimizing for metrics like ROUGE can lead to models that generate summaries that are grammatically correct but lack coherence and focus, or fail to capture the essence of the original text the way a human would.
Proposal: Training Summarization Models with Human Preference Feedback.
Similar to the Christiano et al. (2017) paper, this work proposes to move away from solely relying on automatic metrics and instead train summarization models using direct human feedback on the quality of generated summaries. The idea is to teach the model to generate summaries that humans prefer, rather than just those that score well on automatic metrics.
Details:
- Pre-training a Summarization Model: Pre-train a sequence-to-sequence model for summarization.
- Collecting Human Preference Data (Comparison Data):
Collect human judgments by presenting human annotators with pairs of summaries generated by different models (or different versions of the same model). The annotators are asked to choose which summary is better based on criteria like:
- Helpfulness: Is the summary informative and useful?
- Relevance: Does the summary accurately reflect the content of the original document?
- Readability: Is the summary well-written and easy to understand?
- Non-redundancy: Does the summary avoid unnecessary repetition?
- Training a Reward Model: The collected human preference data (pairs of summaries and the preferred one) is used to train a reward model. This reward model learns to predict which summary a human would prefer given an input document. The reward model is trained to assign higher scores to summaries that humans tend to prefer.
- Fine-tuning the Summarization Model with Reinforcement Learning: The pre-trained summarization model is then fine-tuned using reinforcement learning. The reward signal for RL is provided by the trained reward model. The RL objective is to generate summaries that maximize the score given by the reward model, effectively guiding the summarization model towards generating summaries that humans prefer. They used the Proximal Policy Optimization (PPO) algorithm for this RL fine-tuning stage (the reward it optimizes is sketched below).
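One detail worth keeping in mind about this PPO stage: the quantity being maximized is not the raw reward-model score alone. As I understand the paper, each sampled summary y for a post x is scored by the reward model and penalized by a KL term against the supervised (pre-RL) policy, which discourages the policy from drifting into outputs that merely exploit the reward model:

$$
R(x, y) = r_\theta(x, y) \;-\; \beta \, \log \frac{\pi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}
$$

Here β trades off reward-model score against staying close to the supervised policy.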
WebGPT: Browser-assisted question-answering with human feedback
Challenge: Limitations of Traditional Question Answering and the Need for Browser Assistance:
Traditional question-answering (QA) models rely solely on their internal knowledge or pre-indexed datasets. Many real-world questions require accessing and processing information from the open web to provide comprehensive and up-to-date answers. Furthermore, simply retrieving documents isn’t enough; the model needs to effectively browse, extract relevant information, and synthesize it into a coherent answer.
Proposal: WebGPT - A Browser-Assisted QA Model Trained with Human Feedback
WebGPT is a model trained to use a web browser to answer questions. It’s not just a language model; it’s an agent that can interact with the web in a controlled manner, including searching, clicking links, scrolling, and reading web pages. Crucially, WebGPT is trained using Reinforcement Learning from Human Feedback (RLHF) to generate answers that are helpful, truthful, and harmless.
Details: Browser-in-the-Loop Question Answering with RLHF:
- Browser Environment: They created a simulated browser environment that WebGPT can interact with. This environment provides actions like searching, clicking links, scrolling, and observing the rendered web page content.
- WebGPT Agent: WebGPT is a Transformer-based language model trained to act as an agent within this browser environment. Given a question, it decides on a sequence of browser actions to gather information and ultimately generate an answer (a toy version of this loop is sketched after this list).
- Human Feedback Collection:
Human evaluators are crucial. They are asked to compare pairs of answers generated by different models (including WebGPT and baseline models) and indicate which answer is better based on criteria like:
- Helpfulness: Is the answer useful and informative?
- Truthfulness/Accuracy: Is the answer factually correct and supported by evidence?
- Harmlessness: Is the answer safe, and does it avoid harmful or biased content?
- Browser Usage Quality: Was the browsing process efficient and effective in finding relevant information?
- Reward Model Training: The human preference data is used to train a reward model. This reward model learns to predict which answer a human would prefer, based on the quality criteria. It also learns to reward efficient and effective browser usage.
- Reinforcement Learning Fine-tuning: WebGPT’s policy (how it decides to act in the browser and generate answers) is then fine-tuned using reinforcement learning (Proximal Policy Optimization). The reward signal comes from the trained reward model. The RL objective is to train WebGPT to perform browser actions and generate answers that maximize the reward predicted by the reward model, thus aligning with human preferences for helpful, truthful, and harmless answers.
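To make the agent loop above concrete, here is a toy episode. The `Action` type, the `policy` and `env` interfaces, and the command vocabulary are hypothetical stand-ins for illustration, not WebGPT’s actual interface.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str        # e.g. "search", "click", "scroll", "quote", "answer"
    text: str = ""   # query, link id, or quoted passage, depending on kind

def answer_question(question, policy, env, max_steps=20):
    """Toy browser-in-the-loop QA episode; `policy` and `env` are hypothetical."""
    observation = env.reset(question)    # rendered text of the current page
    quotes = []                          # evidence collected while browsing
    for _ in range(max_steps):
        action = policy.act(question, observation, quotes)
        if action.kind == "answer":
            break
        if action.kind == "quote":
            quotes.append(action.text)   # save a passage to cite later
        observation = env.step(action)   # execute the command in the browser
    # Compose the final answer from the question and the collected evidence.
    return policy.compose_answer(question, quotes)
```

During training, episodes like this are rolled out, the resulting answers are compared by humans, and the reward model trained on those comparisons provides the signal for the PPO fine-tuning described above.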
Training language models to follow instructions with human feedback
Challenge: Mismatch between Language Model Objectives and User Intent
A key problem with standard language models trained for next-token prediction is that they are good at generating text that is statistically likely but not necessarily helpful, truthful, or harmless (the “alignment problem”). These models often generate outputs that are:
- Unhelpful: Not actually answering the user’s question or fulfilling the user’s request.
- Untruthful: Generating factually incorrect or misleading information.
- Harmful: Producing biased, toxic, or unsafe content.
The core issue is that optimizing for next-token prediction alone doesn’t incentivize models to align with human intent and values.
Proposal: InstructGPT - Training Language Models to Follow Instructions via RLHF
The central solution proposed is InstructGPT, a language model specifically trained to follow instructions using Reinforcement Learning from Human Feedback (RLHF). The goal is to directly train the model to be helpful, truthful, and harmless, aligning its behavior with what humans actually want.
Details: A Three-Step RLHF Pipeline for Instruction Following:
- Supervised Fine-tuning (SFT) on Instruction Data: First, fine-tune a pre-trained language model (in this case, a GPT-3 model) on a dataset of human-written demonstrations of instruction following. This dataset consists of prompts (instructions) and desired responses. This step teaches the model to initially understand and attempt to follow instructions.
- Reward Model Training from Human Preference Data: Next, collect human preference data. Humans are presented with multiple responses generated by the SFT model for a given instruction. They are asked to rank these responses based on which one is better, considering factors like helpfulness, truthfulness, and harmlessness. This preference data is used to train a reward model. The reward model learns to predict which response a human would prefer for a given instruction. It essentially learns to score responses based on alignment with human values.
- Reinforcement Learning Fine-tuning with the Reward Model: Finally, the SFT model is further fine-tuned using reinforcement learning (Proximal Policy Optimization). The reward signal for RL is provided by the trained reward model. The RL objective is to train the model to generate responses that maximize the reward predicted by the reward model. This step directly optimizes the language model for alignment with human preferences as captured by the reward model.
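For reference, the combined objective maximized in this third step, as I recall it from the paper (β weights the KL penalty toward the SFT policy, and γ weights an optional pretraining-data term included to mitigate regressions on standard NLP benchmarks):

$$
\mathrm{objective}(\phi) \;=\; \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}\!\left[\, r_\theta(x, y) - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \,\right] \;+\; \gamma \, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}\!\left[\, \log \pi_\phi^{\mathrm{RL}}(x) \,\right]
$$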
Training a helpful and harmless assistant with reinforcement learning from human feedback
Challenge: Ensuring Harmlessness in AI Assistants Trained with RLHF
While previous RLHF work (like InstructGPT) focused on helpfulness and truthfulness, this paper specifically tackles the challenge of ensuring harmlessness in AI assistants. They argue that directly relying on human feedback for all aspects of harmlessness can be problematic and potentially lead to inconsistent or biased judgments. It’s difficult for humans to consistently and comprehensively define “harmlessness” in all situations.
Proposal: Constitutional AI (CAI) - Using a Constitution to Guide Harmlessness Learning
Instead of directly asking humans to rate harmlessness in every instance, they propose to use a set of principles, or a “constitution,” to define and guide what constitutes harmless behavior. This constitution is used to:
- Self-Critique: The AI assistant itself uses the constitution to critique its own responses and identify potentially harmful outputs.
- Guide Reward Model Training: The constitution informs the training of the reward model, so the model learns to penalize responses that violate the constitutional principles.
Details: Two-Phase RLHF with Constitutional Guidance
- Constitutional Reinforcement Learning (Constitutional RL):
- Agent Generates Responses: The AI assistant generates responses to prompts.
- Constitutional Critique: The assistant then uses the pre-defined constitution to critique its own generated responses. This critique identifies potential violations of the constitutional principles.
- Self-Correction: Based on the critique, the assistant refines or regenerates its response to better align with the constitution (a toy version of this critique-and-revision loop is sketched after this list).
- Reward based on Constitutional Alignment: A reward signal is generated based on how well the response aligns with the constitution (i.e., how few constitutional violations it has). This phase trains the assistant to be constitutionally aligned.
- Human Preference Reinforcement Learning (Preference RL):
- Agent Generates Pairs of Responses: The constitutionally trained assistant generates pairs of responses (often one from the constitutional RL phase and one from a baseline model, or variations of constitutionally aligned responses).
- Human Preference Judgments (Helpfulness): Humans are then asked to compare these pairs of responses and choose which one is more helpful (ignoring harmlessness at this stage, as harmlessness is already addressed in phase 1).
- Reward Model Training (Helpfulness Reward): Human preference data is used to train a reward model that specifically focuses on predicting human preferences for helpfulness.
- RL Fine-tuning with Helpfulness Reward: The constitutionally aligned assistant is further fine-tuned using reinforcement learning, but now with the reward signal from the helpfulness reward model. This phase trains the assistant to be helpful, while retaining the harmlessness learned in phase 1.
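A highly simplified sketch of the critique-and-revision idea referenced in phase 1 above. `generate` is any prompt-to-text callable, and both the constitution entries and the prompt templates are my own illustrative wording, not the paper’s:

```python
# Illustrative self-critique / revision loop. The principles and prompt
# templates below are illustrative stand-ins, not the paper's actual wording.

CONSTITUTION = [
    "Choose the response that is least likely to be harmful or offensive.",
    "Choose the response that avoids giving dangerous or illegal advice.",
]

def constitutional_revision(generate, user_prompt):
    """Draft a response, then critique and revise it against each principle."""
    response = generate(f"User: {user_prompt}\nAssistant:")
    for principle in CONSTITUTION:
        critique = generate(
            f"Critique the following response according to this principle: {principle}\n"
            f"Response: {response}\nCritique:"
        )
        response = generate(
            f"Rewrite the response so that it addresses the critique.\n"
            f"Critique: {critique}\nOriginal response: {response}\nRevision:"
        )
    return response
```

Responses refined this way (or comparisons among them) can then feed the reward-model and preference-RL stages described in phase 2.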