Hyperparameter Sweep Case Study

While default hyperparameters provide excellent starting points, optimal values are often task-specific. A hyperparameter sweep—systematically testing values across a range—is the most reliable way to identify the best settings for your specific use case.

Learning from Sweeps

This guide demonstrates how to sweep over learning rates to find an optimal value. While our default recommendations typically achieve <0.5% regret, task-specific sweeps can provide marginal improvements and greater confidence in your hyperparameter choices.

Why Sweep the Learning Rate?

The learning rate is typically the most impactful hyperparameter. While our default recommendations perform well, sweeping helps you:

  • 🎯 Find the task-specific optimum: discover the precise LR that minimizes loss for your specific dataset and task
  • 📊 Understand sensitivity: see how performance varies with LR to gauge hyperparameter robustness
  • Build confidence: validate that your chosen LR is near-optimal before scaling to production

Sweep Setup

We'll use the simple supervised learning training loop in sl_loop.py, training an UltraSafe model.

Get Default Learning Rate

Retrieve Default LR
from bios_cookbook.hyperparam_utils import get_lr

# Get default LR for UltraSafe model
model_name = "ultrasafe/usf-mini"
default_lr = get_lr(model_name)

print(f"Default LR for {model_name}: {default_lr}")
# Output: 0.0002856415043086949  (≈ 2.8e-4)

Define Sweep Range

A common best practice is to sweep roughly an order of magnitude above and below the default, with values spaced at 1x/3x intervals (about half a decade apart). For a default LR ≈ 2.8e-4, we test:

Sweep Values:

LR ∈ {1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3}

This range spans roughly 2.5 orders of magnitude, making it likely that the optimal value falls inside it.
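
If you'd rather derive the grid programmatically, here is a minimal sketch. Note that lr_grid is a hypothetical helper (not part of bios_cookbook) that rebuilds the same 1x/3x spacing around a given default:

Generate Sweep Grid
import math

def lr_grid(default_lr, decades_below=1, decades_above=1):
    """Hypothetical helper: 1x/3x (half-decade) grid around default_lr."""
    # Decade containing the default, e.g. 2.8e-4 -> 1e-4
    base_exp = math.floor(math.log10(default_lr))
    return sorted(
        m * 10.0 ** (base_exp + e)
        for e in range(-decades_below, decades_above + 1)
        for m in (1, 3)
    )

print(lr_grid(2.8e-4))
# Values equivalent to [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3]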

Running the Sweep

Launch experiments in parallel, one terminal window per LR value, provided you have the GPU capacity to run them concurrently. This maximizes utilization and minimizes total sweep time.

Parallel Experiment Execution
# Launch 6 experiments in parallel (separate terminals)
python -m bios_cookbook.recipes.sl_loop learning_rate=0.003 log_path=/tmp/sft-lr-sweep/lr-0.003
python -m bios_cookbook.recipes.sl_loop learning_rate=0.001 log_path=/tmp/sft-lr-sweep/lr-0.001
python -m bios_cookbook.recipes.sl_loop learning_rate=0.0003 log_path=/tmp/sft-lr-sweep/lr-0.0003
python -m bios_cookbook.recipes.sl_loop learning_rate=0.0001 log_path=/tmp/sft-lr-sweep/lr-0.0001
python -m bios_cookbook.recipes.sl_loop learning_rate=0.00003 log_path=/tmp/sft-lr-sweep/lr-0.00003
python -m bios_cookbook.recipes.sl_loop learning_rate=0.00001 log_path=/tmp/sft-lr-sweep/lr-0.00001

Automation Tip

Automate this process with a script that launches the experiments programmatically, for example by spawning subprocesses (as below) or tmux windows. This is especially useful for larger sweeps or repeated experiments.

Automated Sweep Script
import subprocess
import time

# Define LR sweep values
learning_rates = [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3]

# Launch experiments
processes = []
for lr in learning_rates:
    cmd = [
        "python", "-m", "bios_cookbook.recipes.sl_loop",
        f"learning_rate={lr}",
        f"log_path=/tmp/sft-lr-sweep/lr-{lr}"
    ]
    proc = subprocess.Popen(cmd)
    processes.append(proc)
    print(f"Launched experiment with LR={lr}")
    time.sleep(2)  # Stagger launches

# Wait for all to complete
for proc in processes:
    proc.wait()

print("All experiments complete!")

Collecting Results

After experiments complete, collect and analyze the metrics from each run:

Load Sweep Results
from glob import glob
import pandas as pd
import json

# Collect results from all experiments
data = []
for fname in sorted(glob("/tmp/sft-lr-sweep/*/metrics.jsonl")):
    df = pd.read_json(fname, lines=True)

    # Ensure the experiment completed (progress >= 98%)
    if len(df) == 0 or df["progress"].iloc[-1] < 0.98:
        continue

    # Load the experiment configuration
    config_fname = fname.replace("metrics.jsonl", "config.json")
    with open(config_fname) as f:
        metadata = json.load(f)

    # Extract final metrics
    data.append({
        "fname": fname,
        "learning_rate": metadata["learning_rate"],
        "final_loss": df["train_mean_nll"].iloc[-1].item()
    })

print(f"Read metrics for {len(data)} experiments")
# Output: Read metrics for 6 experiments
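
Runs at aggressive learning rates sometimes diverge or crash before reaching 98% progress, and the filter above drops them silently. A small follow-up sketch (reusing the data list and the same assumed metrics.jsonl layout) surfaces which runs were skipped:

List Skipped Runs
from glob import glob
import pandas as pd

# Runs that passed the completion filter above
kept = {d["fname"] for d in data}

for fname in sorted(glob("/tmp/sft-lr-sweep/*/metrics.jsonl")):
    if fname in kept:
        continue
    df_run = pd.read_json(fname, lines=True)
    progress = df_run["progress"].iloc[-1] if len(df_run) else 0.0
    print(f"Skipped {fname}: progress={progress:.0%}")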

Visualizing the Sweep

Plot final loss as a function of learning rate to identify the optimal value:

Plot Sweep Results
import matplotlib.pyplot as plt
import pandas as pd

# Create DataFrame from results, sorted so the line plot follows the x-axis
df = pd.DataFrame(data).sort_values("learning_rate")

# Plot final loss vs learning rate
plt.figure(figsize=(10, 6))
plt.plot(df["learning_rate"], df["final_loss"], marker='o', linewidth=2, markersize=8)

# Add a horizontal line at the minimum loss
plt.axhline(y=df["final_loss"].min(), color="green", linestyle="--", label="Best Loss")

# Configure the plot (adjust ylim to your loss range)
plt.ylim(1.65, 1.8)
plt.xscale("log")
plt.xlabel("Learning Rate (log scale)", fontsize=12)
plt.ylabel("Final Loss", fontsize=12)
plt.title("Final Loss vs Learning Rate", fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Expected Visualization

You should see a U-shaped curve showing:

  • Left side (low LR): Under-training, high loss due to insufficient learning
  • Middle (optimal LR): Minimum loss, best performance
  • Right side (high LR): Training instability, divergence, high loss

[Illustrative figure: U-shaped curve of final loss vs learning rate, with the minimum marked "optimal".]

If the full U-curve isn't visible, expand your sweep range. A quick programmatic check is sketched below.
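
One way to sketch that check, reusing the sorted df built in the plotting step: warn when the best LR sits at an edge of the tested range, since the true optimum may then lie outside it.

Check Sweep Coverage
# Flag sweeps whose minimum sits at an endpoint of the tested range
df_sorted = df.sort_values("learning_rate").reset_index(drop=True)
best_pos = df_sorted["final_loss"].idxmin()

if best_pos in (0, len(df_sorted) - 1):
    edge = "low" if best_pos == 0 else "high"
    print(f"Best LR is at the {edge} end of the range; extend the sweep.")
else:
    print("Minimum is interior to the swept range: U-curve captured.")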

Determining the Optimal LR

Find the learning rate that minimizes final loss:

Calculate Optimal LR
from bios_cookbook.hyperparam_utils import get_lr

# Find the LR with minimum final loss
optimal_idx = df["final_loss"].idxmin()
optimal_lr = df.loc[optimal_idx, "learning_rate"]
optimal_loss = df.loc[optimal_idx, "final_loss"]

print(f"Optimal LR: {optimal_lr:.2e}")
print(f"Best final loss: {optimal_loss:.4f}")

# Compare to the default LR. This lookup assumes you also ran one experiment
# at the exact default value (add default_lr to your sweep list); otherwise
# compare against the nearest swept point instead.
default_lr = get_lr("ultrasafe/usf-mini")
default_loss = df[df["learning_rate"] == default_lr]["final_loss"].values[0]

improvement = (default_loss - optimal_loss) / default_loss * 100
print(f"Improvement over default: {improvement:.2f}%")

# Example output:
# Optimal LR: 3.00e-04
# Best final loss: 1.6542
# Improvement over default: 0.31%

Interpretation

In this example, the optimal swept LR (3e-4) is very close to the default (≈2.8e-4), confirming that the default recommendations are well-calibrated. The marginal 0.31% improvement supports using defaults in most cases, while showing that task-specific tuning can still provide small gains.

Complete Sweep Pipeline

End-to-end script for running a learning rate sweep and analyzing results:

sweep_and_analyze.py
#!/usr/bin/env python3
"""
Complete hyperparameter sweep pipeline
"""
import subprocess
import time
import json
from glob import glob

import pandas as pd
import matplotlib.pyplot as plt

def run_sweep(learning_rates, base_log_path):
    """Launch parallel training experiments"""
    processes = []
    for lr in learning_rates:
        cmd = [
            "python", "-m", "bios_cookbook.recipes.sl_loop",
            f"learning_rate={lr}",
            f"log_path={base_log_path}/lr-{lr}"
        ]
        proc = subprocess.Popen(cmd)
        processes.append(proc)
        print(f"✓ Launched experiment: LR={lr}")
        time.sleep(2)  # Stagger launches

    # Wait for completion
    for proc in processes:
        proc.wait()
    print("All experiments complete!")

def collect_results(base_log_path):
    """Collect metrics from all completed experiments"""
    data = []
    for fname in sorted(glob(f"{base_log_path}/*/metrics.jsonl")):
        df = pd.read_json(fname, lines=True)
        if len(df) == 0 or df["progress"].iloc[-1] < 0.98:
            continue

        config_fname = fname.replace("metrics.jsonl", "config.json")
        with open(config_fname) as f:
            metadata = json.load(f)

        data.append({
            "learning_rate": metadata["learning_rate"],
            "final_loss": df["train_mean_nll"].iloc[-1].item()
        })
    # Sort so the line plot follows the x-axis
    return pd.DataFrame(data).sort_values("learning_rate").reset_index(drop=True)

def visualize_sweep(df):
    """Create a visualization of sweep results"""
    plt.figure(figsize=(10, 6))
    plt.plot(df["learning_rate"], df["final_loss"], marker='o', linewidth=2)
    plt.axhline(y=df["final_loss"].min(), color="green", linestyle="--")
    plt.xscale("log")
    plt.xlabel("Learning Rate (log scale)")
    plt.ylabel("Final Loss")
    plt.title("Hyperparameter Sweep: Final Loss vs Learning Rate")
    plt.grid(True, alpha=0.3)
    plt.savefig("/tmp/sft-lr-sweep/sweep_results.png", dpi=150)
    plt.show()

    # Print the optimal LR
    optimal_idx = df["final_loss"].idxmin()
    print(f"\nOptimal LR: {df.loc[optimal_idx, 'learning_rate']:.2e}")
    print(f"Best Loss: {df.loc[optimal_idx, 'final_loss']:.4f}")

if __name__ == "__main__":
    LRs = [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3]

    # Run sweep
    run_sweep(LRs, "/tmp/sft-lr-sweep")

    # Analyze results
    results_df = collect_results("/tmp/sft-lr-sweep")
    visualize_sweep(results_df)

Next Steps After Finding Optimal LR

Once you've identified the optimal learning rate:

1️⃣ Production Training Run

Retrain with the optimal LR for your final production model. Use the full training duration and complete dataset.

2️⃣ Sweep Other Hyperparameters

Consider sweeping batch size, warmup steps, weight decay, or LoRA rank for further optimization, one at a time (see the sketch after this list).

3️⃣ Establish a Baseline

Use the optimal LR as a baseline for future experiments on similar tasks or datasets.
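
The launch-and-collect pattern above carries over to those other hyperparameters unchanged: hold the tuned LR fixed and vary one setting at a time. A hedged sketch follows; "lora_rank" is an assumed override key, so check sl_loop's config for the names your recipe actually accepts.

Sweep Another Hyperparameter
import subprocess

# Hold the tuned LR fixed and vary one hyperparameter at a time.
# NOTE: "lora_rank" is an assumed override key; confirm against sl_loop's config.
optimal_lr = 3e-4
for rank in [8, 16, 32, 64]:
    subprocess.Popen([
        "python", "-m", "bios_cookbook.recipes.sl_loop",
        f"learning_rate={optimal_lr}",
        f"lora_rank={rank}",
        f"log_path=/tmp/sft-rank-sweep/rank-{rank}",
    ])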

Sweep Best Practices

✓ Do

  • Start with the default LR from get_lr()
  • Sweep ±1 order of magnitude around the default
  • Run experiments in parallel to save time
  • Use log scale for LR values (1x/3x or 10x spacing)
  • Ensure experiments complete (>98% progress)
  • Visualize results before selecting the optimal value

✗ Don't

  • Don't use too narrow a sweep range (you may miss the optimum)
  • Don't forget to use a log scale for the LR axis
  • Don't draw conclusions from incomplete experiments
  • Don't ignore the U-curve shape (it indicates sweep quality)
  • Don't sweep multiple hyperparameters simultaneously (this confounds results)
  • Don't over-optimize (diminishing returns beyond 1-2% improvement)