Reinforcement Learning
Reinforcement learning (RL) means learning from trial and error. Unlike supervised learning, where we provide input-output pairs, RL gives the model inputs (prompts) and reward functions, which score candidate outputs. The RL algorithm then discovers high-reward outputs through iterative refinement.
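The core mechanic can be shown with a toy example: a softmax policy over a few fixed candidate answers, nudged toward higher reward with a REINFORCE-style update. This sketch is purely pedagogical and is not how the cookbook trains language models:

```python
import math
import random

# Toy RL loop: a softmax policy over three fixed candidate answers,
# updated with REINFORCE. Illustrative only.
candidates = ["7", "12", "42"]
logits = [0.0, 0.0, 0.0]

def reward_fn(completion: str) -> float:
    # Reward function: score a candidate output (1 if correct, else 0).
    return 1.0 if completion == "42" else 0.0

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
learning_rate = 0.5
for _ in range(200):
    probs = softmax(logits)
    i = random.choices(range(len(candidates)), weights=probs)[0]
    r = reward_fn(candidates[i])
    # REINFORCE: d log p_i / d logit_j = 1[j == i] - p_j
    for j in range(len(logits)):
        logits[j] += learning_rate * r * ((1.0 if j == i else 0.0) - probs[j])

best = max(range(len(candidates)), key=lambda j: logits[j])
print("Learned answer:", candidates[best])  # converges to "42"
```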
Your First RL Run
The Bios Cookbook provides a minimal script for running RL on mathematical reasoning tasks:
```bash
python -m bios_cookbook.recipes.rl_basic
```

This fine-tunes an UltraSafe model with a combined reward: correctness plus format compliance. Training reaches ~63% accuracy after 15 iterations (~1 min per iteration). Key metrics logged during training:
- env/all/correct: correctness fraction
- env/all/format: format compliance
- ac_tokens_per_turn: tokens per completion
- kl_sample_train: KL divergence
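The recipe's reward combines the first two signals above. A hedged sketch of such a combined correctness-plus-format reward (the \boxed{} answer convention and the 0.1 weighting are illustrative assumptions, not the recipe's actual code):

```python
import re

def combined_reward(completion: str, reference_answer: str) -> float:
    # Format compliance: the answer must appear in a \boxed{...} wrapper
    # (an assumed convention for illustration).
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    format_ok = match is not None
    # Correctness: the extracted answer must match the reference.
    correct = format_ok and match.group(1).strip() == reference_answer
    # Illustrative weighting: full credit for correctness, small format bonus.
    return float(correct) + 0.1 * float(format_ok)

print(combined_reward(r"The answer is \boxed{42}.", "42"))  # 1.1
```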
Types of RL Training
RL with Verifiable Rewards (RLVR)
Train against programmatic reward functions, such as unit tests, reference-answer verification, and mathematical-correctness checks.
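For instance, a unit-test reward can score candidate code by the fraction of tests it passes. A minimal sketch (illustrative only; a real verifier would sandbox execution, since this runs untrusted model output):

```python
def unit_test_reward(candidate_code: str, tests: list[str]) -> float:
    # Fraction of unit tests the candidate code passes.
    if not tests:
        return 0.0
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the candidate function(s)
    except Exception:
        return 0.0
    passed = 0
    for test in tests:
        try:
            exec(test, namespace)  # e.g. "assert add(2, 3) == 5"
            passed += 1
        except Exception:
            pass
    return passed / len(tests)

# Reward 1.0 if the model's code passes both tests.
print(unit_test_reward("def add(a, b): return a + b",
                       ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]))
```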
RL on Human Feedback (RLHF)
Train a preference model on human rankings, then use it as the reward for policy optimization.
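The preference model is typically fit with a pairwise loss; the Bradley-Terry formulation below is a common choice (a general sketch, not necessarily the cookbook's exact objective):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    # Bradley-Terry loss for one human preference pair: minimized when the
    # model scores the chosen response well above the rejected one.
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The learned scorer then serves as the reward for policy optimization.
print(preference_loss(2.0, 0.5))  # ~0.20: chosen already ranked higher
```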
Creating Custom RL Environments
Implement the Env interface to create custom training environments. The relevant classes are defined in bios_cookbook.rl.types.
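As a rough illustration, a single-turn environment might look like the sketch below. The method names (initial_observation, step) are assumptions for illustration; consult bios_cookbook.rl.types for the actual Env interface:

```python
class ArithmeticEnv:
    # Hypothetical single-turn environment: pose an addition problem and
    # reward exact answers. Method names are illustrative assumptions,
    # not the real bios_cookbook.rl.types interface.

    def __init__(self, a: int, b: int):
        self.a, self.b = a, b

    def initial_observation(self) -> str:
        # Prompt shown to the policy at the start of the episode.
        return f"What is {self.a} + {self.b}? Answer with a number only."

    def step(self, action: str) -> tuple[float, bool]:
        # Score the model's reply; single turn, so the episode always ends.
        reward = 1.0 if action.strip() == str(self.a + self.b) else 0.0
        return reward, True
```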
RL Training Loop
The Bios Cookbook provides a simple, self-contained RL training loop in rl_loop.py. This implementation avoids environment classes for a more direct, educational approach.
```bash
python -m bios_cookbook.recipes.rl_loop
```

rl_loop.py
Self-contained training loop for learning. Inline data loading and rollout generation.
Best for: Understanding RL mechanics, writing custom loops, algorithm research

rl/train.py
Production-optimized with async execution, periodic evals, and advanced features.
Best for: Production pipelines, large-scale training, maximum performance
Training Progress
The default configuration completes after 57 steps. Results are written to /tmp/bios-examples/rl-loop.
Visualize Reward Curve
Plot the reward progression to monitor RL training:
```python
import pandas as pd
import matplotlib.pyplot as plt

# Load RL metrics
metrics_path = "/tmp/bios-examples/rl-loop/metrics.jsonl"
df = pd.read_json(metrics_path, lines=True)

# Plot reward progression
plt.figure(figsize=(10, 6))
plt.plot(df["reward/mean"], label="Mean Reward", linewidth=2)
plt.xlabel("Training Steps")
plt.ylabel("Reward")
plt.title("RL Training Progress: Reward vs Steps")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```

Expected Reward Curve
You should see an upward trend showing the model learning to maximize reward through policy optimization.
[Figure: illustrative reward curve]
For a multi-step example, run:

```bash
python -m bios_cookbook.recipes.twenty_questions.train
```

This multi-step environment trains a question-asking agent to guess hidden words through strategic questioning.
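Multi-step environments generalize the single-turn case: the policy and environment alternate turns until the episode ends. A toy rollout loop (hypothetical names, not the recipe's actual code):

```python
def rollout(env, policy, max_turns: int = 20) -> float:
    # Toy multi-turn episode: the policy asks questions, the environment
    # answers, and the loop stops when the hidden word is guessed (done)
    # or the turn budget runs out. All names are illustrative assumptions.
    observation = env.initial_observation()
    total_reward = 0.0
    for _ in range(max_turns):
        action = policy(observation)                # e.g. the next question
        reward, done, observation = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```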