Hyperparameter Sweep Case Study
While default hyperparameters provide excellent starting points, optimal values are often task-specific. A hyperparameter sweep—systematically testing values across a range—is the most reliable way to identify the best settings for your specific use case.
Learning from Sweeps
This guide demonstrates how to sweep over learning rates to find an optimal value. While our default recommendations typically achieve <0.5% regret, task-specific sweeps can provide marginal improvements and greater confidence in your hyperparameter choices.
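Here "regret" is the relative gap between the loss obtained with the default LR and the best loss a sweep can find. A minimal sketch of the calculation, using illustrative (not measured) loss values:

```python
# "Regret" compares the loss you get with the default LR against the best
# loss attainable in the sweep. Loss values below are illustrative only.
loss_default = 1.6593
loss_best = 1.6542

regret = (loss_default - loss_best) / loss_default
print(f"Relative regret: {regret:.2%}")  # Relative regret: 0.31%
```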
Why Sweep the Learning Rate?
The learning rate is typically the most impactful hyperparameter. While our default recommendations perform well, sweeping helps you:
Find Task-Specific Optimum
Discover the precise LR that minimizes loss for your specific dataset and task
Understand Sensitivity
See how performance varies with LR to gauge hyperparameter robustness
Build Confidence
Validate that your chosen LR is near-optimal before scaling to production
Sweep Setup
We'll use the simple supervised learning training loop in sl_loop.py, training an UltraSafe model.
Get Default Learning Rate
```python
from bios_cookbook.hyperparam_utils import get_lr

# Get default LR for UltraSafe model
model_name = "ultrasafe/usf-mini"
default_lr = get_lr(model_name)

print(f"Default LR for {model_name}: {default_lr}")
# Output: 0.0002856415043086949  (≈ 2.8e-4)
```

Define Sweep Range
A common best practice is to sweep one order of magnitude above and below the default. For LR ≈ 2.8e-4, we test:
Sweep Values:
LR ∈ [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3]
This range spans roughly 2.5 orders of magnitude, making it likely that we bracket the optimal value
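The ladder above can be generated programmatically rather than typed by hand. A small sketch, assuming the ≈2.8e-4 default from get_lr shown earlier:

```python
import numpy as np

default_lr = 2.8e-4  # default from get_lr(); hard-coded here for illustration

# Snap the default to the nearest power of ten, then build the usual
# 1x/3x ladder one order of magnitude above and below it.
center_exp = int(round(np.log10(default_lr)))      # -> -4
exponents = range(center_exp - 1, center_exp + 2)  # -5, -4, -3
lrs = sorted(m * 10.0 ** e for e in exponents for m in (1, 3))
print(lrs)
```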
Running the Sweep
Launch experiments in parallel using separate terminal windows for each LR value. This maximizes GPU utilization and minimizes total sweep time.
```bash
# Launch 6 experiments in parallel (separate terminals)
python -m bios_cookbook.recipes.sl_loop learning_rate=0.003 log_path=/tmp/sft-lr-sweep/lr-0.003
python -m bios_cookbook.recipes.sl_loop learning_rate=0.001 log_path=/tmp/sft-lr-sweep/lr-0.001
python -m bios_cookbook.recipes.sl_loop learning_rate=0.0003 log_path=/tmp/sft-lr-sweep/lr-0.0003
python -m bios_cookbook.recipes.sl_loop learning_rate=0.0001 log_path=/tmp/sft-lr-sweep/lr-0.0001
python -m bios_cookbook.recipes.sl_loop learning_rate=0.00003 log_path=/tmp/sft-lr-sweep/lr-0.00003
python -m bios_cookbook.recipes.sl_loop learning_rate=0.00001 log_path=/tmp/sft-lr-sweep/lr-0.00001
```

Automation Tip
Automate this process by writing a script that spawns multiple tmux windows and launches experiments programmatically. This is especially useful for larger sweeps or repeated experiments.
```python
import subprocess
import time

# Define LR sweep values
learning_rates = [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3]

# Launch experiments
processes = []
for lr in learning_rates:
    cmd = [
        "python", "-m", "bios_cookbook.recipes.sl_loop",
        f"learning_rate={lr}",
        f"log_path=/tmp/sft-lr-sweep/lr-{lr}"
    ]
    proc = subprocess.Popen(cmd)
    processes.append(proc)
    print(f"Launched experiment with LR={lr}")
    time.sleep(2)  # Stagger launches

# Wait for all to complete
for proc in processes:
    proc.wait()

print("All experiments complete!")
```

Collecting Results
After experiments complete, collect and analyze the metrics from each run:
```python
from glob import glob
import pandas as pd
import os
import json

# Collect results from all experiments
data = []
for fname in sorted(glob(os.path.expanduser("/tmp/sft-lr-sweep/*/metrics.jsonl"))):
    df = pd.read_json(fname, lines=True)

    # Ensure experiment completed (progress >= 98%)
    if len(df) == 0 or df["progress"].iloc[-1] < 0.98:
        continue

    # Load experiment configuration
    config_fname = fname.replace("metrics.jsonl", "config.json")
    with open(config_fname, "r") as f:
        metadata = json.load(f)

    # Extract final metrics
    data.append({
        "fname": fname,
        "learning_rate": metadata["learning_rate"],
        "final_loss": df["train_mean_nll"].iloc[-1].item()
    })

print(f"Read metrics for {len(data)} experiments")
# Output: Read metrics for 6 experiments
```

Visualizing the Sweep
Plot final loss as a function of learning rate to identify the optimal value:
```python
import matplotlib.pyplot as plt
import pandas as pd

# Create DataFrame from results, sorted by LR so the line plot connects
# points in order (glob returns paths in lexicographic, not numeric, order)
df = pd.DataFrame(data).sort_values("learning_rate")

# Plot final loss vs learning rate
plt.figure(figsize=(10, 6))
plt.plot(df["learning_rate"], df["final_loss"], marker='o', linewidth=2, markersize=8)

# Add horizontal line at minimum loss
plt.axhline(y=df["final_loss"].min(), color="green", linestyle="--", label="Best Loss")

# Configure plot
plt.ylim(1.65, 1.8)  # adjust to your own loss range
plt.xscale("log")
plt.xlabel("Learning Rate (log scale)", fontsize=12)
plt.ylabel("Final Loss", fontsize=12)
plt.title("Final Loss vs Learning Rate", fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```

Expected Visualization
You should see a U-shaped curve showing:
- Left side (low LR): Under-training, high loss due to insufficient learning
- Middle (optimal LR): Minimum loss, best performance
- Right side (high LR): Training instability, divergence, high loss
Example U-Curve (illustrative)
If the full U-curve isn't visible, expand your sweep range.
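A quick diagnostic for sweep-range adequacy: if the minimum sits at either endpoint of the tested range, the true optimum may lie outside it. A sketch, assuming results have been collected into a DataFrame with learning_rate and final_loss columns (the loss values below are illustrative):

```python
import pandas as pd

# Hypothetical sweep results; replace with your collected DataFrame
df = pd.DataFrame({
    "learning_rate": [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3],
    "final_loss":    [1.78, 1.74, 1.69, 1.654, 1.67, 1.75],
})
df = df.sort_values("learning_rate").reset_index(drop=True)

best = df["final_loss"].idxmin()

# If the minimum sits at an edge of the range, the true optimum may lie
# outside the values tested; extend the sweep in that direction.
if best == 0:
    print("Best LR is the smallest tested; extend the sweep downward.")
elif best == len(df) - 1:
    print("Best LR is the largest tested; extend the sweep upward.")
else:
    print(f"Interior minimum at LR={df.loc[best, 'learning_rate']:.0e}; range looks wide enough.")
```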
Determining the Optimal LR
Find the learning rate that minimizes final loss:
```python
from bios_cookbook.hyperparam_utils import get_lr

# Find LR with minimum final loss
optimal_lr = df["learning_rate"][df["final_loss"].idxmin()]
optimal_loss = df["final_loss"].min()

print(f"Optimal LR: {optimal_lr:.2e}")
print(f"Best final loss: {optimal_loss:.4f}")

# Compare to the default LR. Note: this exact-match lookup assumes a run at
# the default LR exists in your results; if it doesn't, run one separately or
# compare against the loss at the closest swept value.
default_lr = get_lr("ultrasafe/usf-mini")
default_loss = df[df["learning_rate"] == default_lr]["final_loss"].values[0]

improvement = (default_loss - optimal_loss) / default_loss * 100
print(f"Improvement over default: {improvement:.2f}%")

# Output:
# Optimal LR: 3.00e-04
# Best final loss: 1.6542
# Improvement over default: 0.31%
```

Interpretation
In this example, the optimal LR (3e-4) is very close to the default (2.8e-4), confirming our default recommendations are well-calibrated. The marginal 0.31% improvement validates using defaults for most cases, while showing that task-specific tuning can still provide small gains.
Complete Sweep Pipeline
End-to-end script for running a learning rate sweep and analyzing results:
```python
#!/usr/bin/env python3
"""Complete hyperparameter sweep pipeline."""
import json
import subprocess
import time
from glob import glob

import matplotlib.pyplot as plt
import pandas as pd


def run_sweep(learning_rates, base_log_path):
    """Launch parallel training experiments."""
    processes = []
    for lr in learning_rates:
        cmd = [
            "python", "-m", "bios_cookbook.recipes.sl_loop",
            f"learning_rate={lr}",
            f"log_path={base_log_path}/lr-{lr}"
        ]
        proc = subprocess.Popen(cmd)
        processes.append(proc)
        print(f"✓ Launched experiment: LR={lr}")
        time.sleep(2)  # Stagger launches

    # Wait for completion
    for proc in processes:
        proc.wait()
    print("All experiments complete!")


def collect_results(base_log_path):
    """Collect metrics from all completed experiments."""
    data = []
    for fname in sorted(glob(f"{base_log_path}/*/metrics.jsonl")):
        df = pd.read_json(fname, lines=True)
        if len(df) == 0 or df["progress"].iloc[-1] < 0.98:
            continue  # skip incomplete runs

        config_fname = fname.replace("metrics.jsonl", "config.json")
        with open(config_fname, "r") as f:
            metadata = json.load(f)

        data.append({
            "learning_rate": metadata["learning_rate"],
            "final_loss": df["train_mean_nll"].iloc[-1].item()
        })
    # Sort by LR so the line plot connects points in order
    return pd.DataFrame(data).sort_values("learning_rate").reset_index(drop=True)


def visualize_sweep(df, base_log_path):
    """Create visualization of sweep results."""
    plt.figure(figsize=(10, 6))
    plt.plot(df["learning_rate"], df["final_loss"], marker='o', linewidth=2)
    plt.axhline(y=df["final_loss"].min(), color="green", linestyle="--")
    plt.xscale("log")
    plt.xlabel("Learning Rate (log scale)")
    plt.ylabel("Final Loss")
    plt.title("Hyperparameter Sweep: Final Loss vs Learning Rate")
    plt.grid(True, alpha=0.3)
    plt.savefig(f"{base_log_path}/sweep_results.png", dpi=150)
    plt.show()

    # Print optimal LR
    optimal_idx = df["final_loss"].idxmin()
    print(f"\nOptimal LR: {df.loc[optimal_idx, 'learning_rate']:.2e}")
    print(f"Best Loss: {df.loc[optimal_idx, 'final_loss']:.4f}")


if __name__ == "__main__":
    LRs = [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3]
    base_log_path = "/tmp/sft-lr-sweep"

    # Run sweep
    run_sweep(LRs, base_log_path)

    # Analyze results
    results_df = collect_results(base_log_path)
    visualize_sweep(results_df, base_log_path)
```

Next Steps After Finding Optimal LR
Once you've identified the optimal learning rate:
1️⃣ Production Training Run
Retrain with the optimal LR for your final production model. Use the full training duration and complete dataset.
2️⃣ Sweep Other Hyperparameters
Consider sweeping batch size, warmup steps, weight decay, or LoRA rank for further optimization.
3️⃣ Establish Baseline
Use the optimal LR as a baseline for future experiments on similar tasks or datasets.
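The LR launch pattern generalizes to other hyperparameters. The sketch below only builds the command lines (it launches nothing), and the batch_size override name is an assumption; check which config fields sl_loop actually accepts:

```python
# Reuse the sweep launcher for a different hyperparameter. The flag name
# (batch_size) is hypothetical -- verify sl_loop's accepted overrides.
def build_sweep_cmds(param, values, base_log_path, best_lr):
    cmds = []
    for v in values:
        cmds.append([
            "python", "-m", "bios_cookbook.recipes.sl_loop",
            f"learning_rate={best_lr}",   # fix LR at the sweep optimum
            f"{param}={v}",               # vary one hyperparameter at a time
            f"log_path={base_log_path}/{param}-{v}",
        ])
    return cmds

cmds = build_sweep_cmds("batch_size", [64, 128, 256], "/tmp/sft-bs-sweep", 3e-4)
for cmd in cmds:
    print(" ".join(cmd))
```

Fixing the LR at the sweep optimum while varying a single other hyperparameter keeps each comparison unconfounded.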
Sweep Best Practices
✓ Do
- Start with the default LR from get_lr()
- Sweep at least ±1 order of magnitude around the default
- Run experiments in parallel to save time
- Space LR values logarithmically (e.g. 1x/3x steps per decade)
- Ensure experiments complete (>98% progress)
- Visualize results before selecting the optimal value
✗ Don't
- Don't use too narrow a sweep range (you may miss the optimum)
- Don't forget to use a log scale for the LR axis
- Don't draw conclusions from incomplete experiments
- Don't ignore the U-curve shape (it indicates sweep quality)
- Don't sweep multiple hyperparameters simultaneously (it confounds results)
- Don't over-optimize (diminishing returns beyond 1-2% improvement)
Next Steps
Apply this sweep methodology to the remaining hyperparameters of your own training runs.