RLHF Complete Pipeline
This guide demonstrates the complete Reinforcement Learning from Human Feedback (RLHF) pipeline. The Bios Cookbook provides a script implementing the standard three-stage RLHF process.
Three-Stage RLHF Pipeline
Run the Complete Pipeline
Execute all three stages with a single command:
```
python -m bios_cookbook.recipes.preference.rlhf.rlhf_pipeline
```
This script orchestrates all three training stages sequentially, managing checkpoints and data flow between stages automatically.
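The sketch below conveys the shape of that orchestration: three stages run in order, with each stage handing its output checkpoint to the next. The function and path names are placeholders chosen for illustration, not the actual `bios_cookbook` API.

```python
# Minimal sketch of the three-stage flow. The stage functions are
# placeholders standing in for the real training code; only the
# sequencing and checkpoint hand-off are the point here.

def train_sft(log_dir: str) -> str:
    # Stage 1: supervised fine-tuning on instruction data (placeholder).
    return f"{log_dir}/sft_checkpoint"

def train_preference_model(log_dir: str) -> str:
    # Stage 2: preference model trained on pairwise comparisons (placeholder).
    return f"{log_dir}/preference_model"

def train_rl(policy_checkpoint: str, preference_model: str, log_dir: str) -> str:
    # Stage 3: RL with self-play, rewarded by the preference model (placeholder).
    return f"{log_dir}/final_policy"

def run_pipeline(log_dir: str = "/tmp/rlhf-pipeline") -> str:
    sft_ckpt = train_sft(f"{log_dir}/stage1-sft")
    pref_model = train_preference_model(f"{log_dir}/stage2-preference")
    return train_rl(sft_ckpt, pref_model, f"{log_dir}/stage3-rl")

if __name__ == "__main__":
    print(run_pipeline())
```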
Stage 1: Training the Initial Policy (SFT)
First, train the policy via supervised learning on high-quality instruction-following data. This establishes a strong foundation for subsequent RL training.
Configuration
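As a hedged sketch of the kind of settings an SFT stage typically exposes (every field name and default below is an assumption, not the cookbook's actual configuration schema):

```python
# Hypothetical SFT configuration sketch; field names and defaults are
# illustrative assumptions, not the Bios Cookbook's actual schema.
from dataclasses import dataclass

@dataclass
class SFTConfig:
    base_model: str = "meta-llama/Llama-3.1-8B"     # assumed example base model
    dataset: str = "instruction_following"          # assumed dataset identifier
    learning_rate: float = 1e-5
    batch_size: int = 64
    num_epochs: int = 1
    log_dir: str = "/tmp/rlhf-pipeline/stage1-sft"  # where the SFT checkpoint is saved
```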
Why Start with SFT?
Supervised fine-tuning gives the model basic instruction-following ability and domain knowledge. Starting RL from this checkpoint significantly accelerates convergence and reduces the amount of exploration spent on undesirable or degenerate behaviors.
Stage 2: Training the Preference Model
Train a preference model (reward model) on pairwise comparison data. The model learns to assign scores so that, for each pair, the completion humans preferred receives the higher score.
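One common way to train such a model, shown here as a sketch rather than the cookbook's actual implementation, is a Bradley-Terry-style pairwise loss that pushes the score of the preferred completion above the score of the rejected one.

```python
# Sketch of a Bradley-Terry pairwise loss for preference-model training.
# `score_chosen` and `score_rejected` are assumed scalar scores the model
# assigns to the preferred and dispreferred completions of each prompt.
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_chosen: torch.Tensor,
                             score_rejected: torch.Tensor) -> torch.Tensor:
    # Minimize -log P(chosen beats rejected) = -log sigmoid(s_chosen - s_rejected).
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Example with a batch of four comparisons.
loss = pairwise_preference_loss(torch.tensor([1.2, 0.3, 2.0, -0.5]),
                                torch.tensor([0.7, 0.9, 1.1, -1.0]))
```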
Configuration
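As with the SFT stage, the configuration sketch below is illustrative; the fields are assumptions about typical preference-model training knobs, not the actual schema.

```python
# Hypothetical preference-model configuration sketch; all fields are assumptions.
from dataclasses import dataclass

@dataclass
class PreferenceModelConfig:
    # Often initialized from the SFT checkpoint so the scorer shares the
    # policy's representation (assumption, not confirmed by the cookbook).
    base_model: str = "/tmp/rlhf-pipeline/stage1-sft/sft_checkpoint"
    comparison_dataset: str = "pairwise_preferences"   # assumed dataset identifier
    learning_rate: float = 1e-5
    batch_size: int = 128
    num_epochs: int = 1
    log_dir: str = "/tmp/rlhf-pipeline/stage2-preference"
```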
Preference Model Purpose
The trained preference model acts as a reward function in Stage 3. It scores how well completions align with human preferences, enabling RL optimization without requiring human evaluation during training.
Stage 3: Training the Policy via RL
Use the preference model as a reward function to optimize the policy through reinforcement learning. This stage employs self-play with pairwise comparisons.
RL Configuration
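The sketch below lists the kind of parameters an RL stage like this usually exposes, such as the starting policy checkpoint, the preference model, the number of completions sampled per prompt, and a KL penalty toward the SFT model; all names and values are assumptions.

```python
# Hypothetical RL configuration sketch; every field is an illustrative assumption.
from dataclasses import dataclass

@dataclass
class RLConfig:
    policy_checkpoint: str = "/tmp/rlhf-pipeline/stage1-sft/sft_checkpoint"
    preference_model: str = "/tmp/rlhf-pipeline/stage2-preference/preference_model"
    completions_per_prompt: int = 8   # how many samples enter each tournament
    kl_penalty_coef: float = 0.1      # keep the policy close to the SFT model (assumed knob)
    learning_rate: float = 1e-6
    num_iterations: int = 500
    log_dir: str = "/tmp/rlhf-pipeline/stage3-rl"
```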
Self-Play Mechanism
For each prompt, the policy generates multiple completions. The preference model grades all pairs, creating a tournament-style evaluation. The policy receives reward based on its win fraction, incentivizing generation of responses that the preference model ranks highly.
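Here is a sketch of that tournament-style reward. The `prefers` function stands in for a call to the trained preference model; in this self-contained example it is a toy stand-in (it prefers the longer completion) purely so the code runs end to end.

```python
# Sketch of tournament-style self-play rewards: each completion's reward is
# the fraction of pairwise matchups it wins under the preference model.
from itertools import combinations

def prefers(prompt: str, a: str, b: str) -> bool:
    # Placeholder grader: the real pipeline would query the preference model.
    return len(a) >= len(b)

def win_fraction_rewards(prompt: str, completions: list[str]) -> list[float]:
    """Reward each completion by the fraction of pairwise matchups it wins."""
    wins = [0] * len(completions)
    for i, j in combinations(range(len(completions)), 2):
        if prefers(prompt, completions[i], completions[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    matchups = len(completions) - 1  # each completion faces every other one
    return [w / matchups for w in wins]

# Example: four sampled completions for one prompt.
print(win_fraction_rewards("Explain RLHF briefly.",
                           ["ok", "a detailed answer", "short", "a very thorough answer"]))
```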
Pipeline Flow Diagram
1. Supervised Fine-Tuning: train on instruction data → save SFT checkpoint
2. Preference Model Training: train on pairwise preferences → save preference model
3. RL Policy Optimization: load SFT checkpoint + preference model → RL training with self-play → final aligned model
Next Steps
Dive deeper into each stage of the RLHF pipeline.