Reinforcement Learning
Reinforcement learning (RL) means learning from trial and error. Unlike supervised learning, where we provide input-output pairs, RL gives the model inputs (prompts) and reward functions, which score candidate outputs. The RL algorithm then discovers high-reward outputs through iterative refinement.
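The core mechanic can be shown with a toy example: a softmax policy over a few fixed candidate answers, nudged toward higher reward with a REINFORCE-style update. This sketch is purely pedagogical and is not how the cookbook trains language models:

```python
import math
import random

# Toy RL loop: a softmax policy over three fixed candidate answers,
# updated with REINFORCE. Illustrative only.
candidates = ["7", "12", "42"]
logits = [0.0, 0.0, 0.0]

def reward_fn(completion: str) -> float:
    # Reward function: score a candidate output (1 if correct, else 0).
    return 1.0 if completion == "42" else 0.0

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
learning_rate = 0.5
for _ in range(200):
    probs = softmax(logits)
    i = random.choices(range(len(candidates)), weights=probs)[0]
    r = reward_fn(candidates[i])
    # REINFORCE: d log p_i / d logit_j = 1[j == i] - p_j
    for j in range(len(logits)):
        logits[j] += learning_rate * r * ((1.0 if j == i else 0.0) - probs[j])

best = max(range(len(candidates)), key=lambda j: logits[j])
print("Learned answer:", candidates[best])  # converges to "42"
```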
Your First RL Run
The Bios Cookbook provides a minimal script for running RL on mathematical reasoning tasks:
```bash
python -m bios_cookbook.recipes.rl_basic
```

This fine-tunes an UltraSafe model with a combined reward: correctness plus format compliance. Training reaches ~63% accuracy after 15 iterations (~1 min per iteration). Key metrics logged during training:
- env/all/correct: correctness fraction
- env/all/format: format compliance
- ac_tokens_per_turn: tokens per completion
- kl_sample_train: KL divergence
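The recipe's reward combines the first two signals above. A hedged sketch of such a combined correctness-plus-format reward (the \boxed{} answer convention and the 0.1 weighting are illustrative assumptions, not the recipe's actual code):

```python
import re

def combined_reward(completion: str, reference_answer: str) -> float:
    # Format compliance: the answer must appear in a \boxed{...} wrapper
    # (an assumed convention for illustration).
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    format_ok = match is not None
    # Correctness: the extracted answer must match the reference.
    correct = format_ok and match.group(1).strip() == reference_answer
    # Illustrative weighting: full credit for correctness, small format bonus.
    return float(correct) + 0.1 * float(format_ok)

print(combined_reward(r"The answer is \boxed{42}.", "42"))  # 1.1
```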
Types of RL Training
RL with Verifiable Rewards (RLVR)
Train against programmatic reward functions, such as unit tests, reference-answer verification, and mathematical-correctness checks.
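For instance, a unit-test reward can score candidate code by the fraction of tests it passes. A minimal sketch (illustrative only; a real verifier would sandbox execution, since this runs untrusted model output):

```python
def unit_test_reward(candidate_code: str, tests: list[str]) -> float:
    # Fraction of unit tests the candidate code passes.
    if not tests:
        return 0.0
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the candidate function(s)
    except Exception:
        return 0.0
    passed = 0
    for test in tests:
        try:
            exec(test, namespace)  # e.g. "assert add(2, 3) == 5"
            passed += 1
        except Exception:
            pass
    return passed / len(tests)

# Reward 1.0 if the model's code passes both tests.
print(unit_test_reward("def add(a, b): return a + b",
                       ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]))
```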
RL on Human Feedback (RLHF)
Train a preference model on human rankings, then use it as the reward for policy optimization.
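The preference model is typically fit with a pairwise loss; the Bradley-Terry formulation below is a common choice (a general sketch, not necessarily the cookbook's exact objective):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    # Bradley-Terry loss for one human preference pair: minimized when the
    # model scores the chosen response well above the rejected one.
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The learned scorer then serves as the reward for policy optimization.
print(preference_loss(2.0, 0.5))  # ~0.20: chosen already ranked higher
```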
Creating Custom RL Environments
Implement the Env interface to create custom training environments. The relevant classes are defined in bios_cookbook.rl.types.
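As a rough illustration, a single-turn environment might look like the sketch below. The method names (initial_observation, step) are assumptions for illustration; consult bios_cookbook.rl.types for the actual Env interface:

```python
class ArithmeticEnv:
    # Hypothetical single-turn environment: pose an addition problem and
    # reward exact answers. Method names are illustrative assumptions,
    # not the real bios_cookbook.rl.types interface.

    def __init__(self, a: int, b: int):
        self.a, self.b = a, b

    def initial_observation(self) -> str:
        # Prompt shown to the policy at the start of the episode.
        return f"What is {self.a} + {self.b}? Answer with a number only."

    def step(self, action: str) -> tuple[float, bool]:
        # Score the model's reply; single turn, so the episode always ends.
        reward = 1.0 if action.strip() == str(self.a + self.b) else 0.0
        return reward, True
```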
RL Training Loop
The Bios Cookbook provides a simple, self-contained RL training loop in rl_loop.py. This implementation avoids environment classes for a more direct, educational approach.
```bash
python -m bios_cookbook.recipes.rl_loop
```

rl_loop.py
Self-contained training loop for learning. Inline data loading and rollout generation.
Best for: Understanding RL mechanics, writing custom loops, algorithm research

rl/train.py
Production-optimized with async execution, periodic evals, and advanced features.
Best for: Production pipelines, large-scale training, maximum performance
Training Progress
The default configuration completes after 57 steps. Results are written to /tmp/bios-examples/rl-loop.
Visualize Reward Curve
Plot the reward progression to monitor RL training:
```python
import pandas as pd
import matplotlib.pyplot as plt

# Load RL metrics
metrics_path = "/tmp/bios-examples/rl-loop/metrics.jsonl"
df = pd.read_json(metrics_path, lines=True)

# Plot reward progression
plt.figure(figsize=(10, 6))
plt.plot(df["reward/mean"], label="Mean Reward", linewidth=2)
plt.xlabel("Training Steps")
plt.ylabel("Reward")
plt.title("RL Training Progress: Reward vs Steps")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```

Expected Reward Curve
You should see an upward trend showing the model learning to maximize reward through policy optimization.
[Figure: illustrative reward curve]
For a multi-step example, run:

```bash
python -m bios_cookbook.recipes.twenty_questions.train
```

This multi-step environment trains a question-asking agent to guess hidden words through strategic questioning.
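Multi-step environments generalize the single-turn case: the policy and environment alternate turns until the episode ends. A toy rollout loop (hypothetical names, not the recipe's actual code):

```python
def rollout(env, policy, max_turns: int = 20) -> float:
    # Toy multi-turn episode: the policy asks questions, the environment
    # answers, and the loop stops when the hidden word is guessed (done)
    # or the turn budget runs out. All names are illustrative assumptions.
    observation = env.initial_observation()
    total_reward = 0.0
    for _ in range(max_turns):
        action = policy(observation)                # e.g. the next question
        reward, done, observation = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```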