Best Practices
Production-ready patterns, optimization techniques, and common pitfalls to avoid when training UltraSafe expert models at scale with Bios.
Goal: Reliable, Efficient Training
These practices emerged from training thousands of models in production. Following them will help you avoid common mistakes and achieve better results faster.
Data Quality Over Quantity
High-quality training data is the most important factor for successful fine-tuning.
✓ High-Quality Data
- • 100-1000 carefully curated examples
- • Diverse coverage of target distribution
- • Consistent formatting and style
- • Accurate, verified outputs
- • Representative of production use cases
- • Manually reviewed for quality
✗ Low-Quality Data
- • 10,000+ examples scraped without review
- • Narrow coverage (overfitting risk)
- • Inconsistent formatting
- • Errors or hallucinations in outputs
- • Synthetic data without validation
- • Duplicates or near-duplicates
Data Quality Checklist
- ✓ Manual review of random sample (at least 10%)
- ✓ Diversity metrics (unique inputs, output variety)
- ✓ Format validation (consistent structure)
- ✓ Deduplication (remove exact/near matches)
- ✓ Train/validation split (never evaluate on training data)
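Parts of this checklist can be automated. The sketch below is a minimal example, assuming a JSONL dataset where each example has prompt and completion fields (the file name and field names are illustrative); adapt it to your own data format.

```python
import hashlib
import json
import random

# Assumed format: JSONL with "prompt" and "completion" fields (illustrative)
with open("train_data.jsonl") as f:
    examples = [json.loads(line) for line in f]

# Deduplication: drop exact matches on normalized prompt + completion
seen, unique = set(), []
for ex in examples:
    key = hashlib.sha256(
        (ex["prompt"].strip().lower() + "\n" + ex["completion"].strip().lower()).encode()
    ).hexdigest()
    if key not in seen:
        seen.add(key)
        unique.append(ex)
print(f"Removed {len(examples) - len(unique)} duplicates")

# Train/validation split: never evaluate on training data
random.seed(0)
random.shuffle(unique)
split = int(0.9 * len(unique))
train, val = unique[:split], unique[split:]

# Manual review: sample at least 10% of the training set
review_sample = random.sample(train, max(1, len(train) // 10))
print(f"{len(train)} train / {len(val)} val, {len(review_sample)} flagged for manual review")
```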
LoRA Configuration Guidelines
Proper LoRA configuration is crucial for training effectiveness and efficiency.
Choosing LoRA Rank
The LoRA rank determines the capacity of your adapter. Follow this rule of thumb:
LoRA Parameters ≥ Total Completion Tokens in Dataset
Typical values are rank = 8, rank = 16, or rank = 32-64; higher ranks give the adapter more capacity and suit larger datasets.
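To apply this rule of thumb, estimate the adapter's parameter count from the model dimensions. The sketch below uses illustrative numbers (hidden_size, intermediate_size, num_layers, and total_completion_tokens are assumptions, and k/v projections are treated as full hidden size, ignoring grouped-query attention); substitute the real values for your base model and dataset.

```python
# Illustrative model dimensions -- substitute the real values for your base model
hidden_size = 4096
intermediate_size = 11008
num_layers = 32
rank = 16

# LoRA adds rank * (d_in + d_out) parameters per adapted weight matrix.
# Attention: q/k/v/o_proj treated as hidden x hidden (ignores grouped-query attention).
attn_params_per_layer = 4 * rank * (hidden_size + hidden_size)
# MLP: gate/up/down_proj each connect hidden_size and intermediate_size.
mlp_params_per_layer = 3 * rank * (hidden_size + intermediate_size)

total_lora_params = num_layers * (attn_params_per_layer + mlp_params_per_layer)

# Rule of thumb: LoRA parameters >= total completion tokens in the dataset
total_completion_tokens = 2_000_000  # illustrative; measure this from your dataset
print(f"LoRA parameters:   {total_lora_params:,}")
print(f"Completion tokens: {total_completion_tokens:,}")
if total_lora_params < total_completion_tokens:
    print("Consider a higher rank")
```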
Target Modules
Apply LoRA to both attention and MLP layers for best results:
```python
# Recommended: all attention + MLP layers
training_client = service_client.create_lora_training_client(
    base_model="ultrasafe/usf-finance",
    rank=16,
    alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj"      # MLP
    ]
)
```

⚠️ Common Mistake
Applying LoRA only to attention layers (q_proj, v_proj) limits capacity and can lead to underfitting. Always include MLP layers for production training.
Learning Rate Configuration
LoRA requires different learning rates than full fine-tuning due to the low-rank parameterization.
LoRA Learning Rate Scaling
Use 20-100x higher learning rates for LoRA compared to full fine-tuning:
✗ Too Low (Underperforms)
lr = 1e-5: using the full fine-tune LR for LoRA leads to slow convergence.
✓ Optimal Range
lr = 1e-4 to 5e-4: 50-100x higher than the full fine-tune LR gives fast convergence.
```python
from bios_cookbook.hyperparam_utils import get_lora_lr_over_full_finetune_lr

# Get recommended LR scaling for your model
base_model = "ultrasafe/usf-finance"
full_ft_lr = 1e-5  # Typical full fine-tune LR

# Calculate optimal LoRA LR
lr_scale = get_lora_lr_over_full_finetune_lr(base_model)
lora_lr = full_ft_lr * lr_scale

print(f"Recommended LoRA LR: {lora_lr:.2e}")
# Output: Recommended LoRA LR: 5.00e-04 (50x scaling)
```

Checkpointing Strategy
Strategic checkpointing balances fault tolerance with training efficiency.
Checkpoint Frequency
Balance between fault tolerance and overhead:
- • Short runs (<1 hour): Every 100-200 steps
- • Medium runs (1-4 hours): Every 500 steps
- • Long runs (>4 hours): Every 1000 steps + hourly (see the sketch below)
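For long runs, a minimal sketch of combining the step-based and hourly triggers; the interval values are illustrative, and save_state is the same call used in the full example further below.

```python
import time

checkpoint_every_steps = 1000      # step-based trigger (long runs)
checkpoint_every_seconds = 3600    # additional hourly time-based trigger
last_checkpoint = time.time()

for step, batch in enumerate(dataloader):
    training_client.forward_backward(batch, "cross_entropy")
    training_client.optim_step(adam_params)

    # Checkpoint when either trigger fires
    if step % checkpoint_every_steps == 0 or time.time() - last_checkpoint > checkpoint_every_seconds:
        training_client.save_state(name=f"training_step_{step}", metadata={"step": step})
        last_checkpoint = time.time()
```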
Checkpoint Types
Use different checkpoints for different purposes:
- • Training checkpoints: Full optimizer state (resume training)
- • Sampling checkpoints: Weights only (faster loading)
- • Best model: Based on validation metrics
The loop below combines periodic training checkpoints with validation-based best-model tracking:

```python
best_val_loss = float('inf')
checkpoint_interval = 500
eval_interval = 500

for step, batch in enumerate(dataloader):
    # Training step
    fwd_future = training_client.forward_backward(batch, "cross_entropy")
    opt_future = training_client.optim_step(adam_params)

    fwd_result = fwd_future.result()

    # Periodic checkpoint
    if step % checkpoint_interval == 0:
        training_client.save_state(
            name=f"training_step_{step}",
            metadata={"step": step, "loss": fwd_result.loss}
        )

    # Validation + best model tracking
    if step % eval_interval == 0:
        val_loss = evaluate(training_client, val_data)

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            training_client.save_state(
                name="best_model",
                metadata={"step": step, "val_loss": val_loss}
            )
            print(f"✓ New best model: val_loss={val_loss:.4f}")
```

Monitoring & Debugging
Track the right metrics to catch issues early and optimize training.
Essential Metrics to Monitor
- • Training Loss: Should decrease steadily; a plateau may indicate learning rate issues.
- • Validation Loss: If it diverges from training loss → overfitting (reduce epochs or add data).
- • Gradient Norm: Very large (>10) or very small (<0.01) values indicate problems.
- • Learning Rate: Track LR decay over training; warmup + cosine decay is common (a reference implementation follows this list).
- • Throughput: Examples per second; should be stable (a drop indicates a bottleneck).
- • GPU Utilization: Reported by Bios; target >80% for efficient training.
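For reference, here is a minimal implementation of the warmup + cosine decay schedule mentioned above. This is plain Python rather than a Bios API call, and the example values are illustrative.

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup followed by cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: peak LR 2e-4, 100 warmup steps, 2000 total steps
print(lr_at_step(0, 2e-4, 100, 2000))     # near zero during warmup
print(lr_at_step(100, 2e-4, 100, 2000))   # peak LR after warmup
print(lr_at_step(2000, 2e-4, 100, 2000))  # decayed to min_lr
```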
The loop below tracks these metrics in a rolling window and appends them to a JSONL log:

```python
import json
from collections import deque

# Track metrics in a rolling window
metrics_window = deque(maxlen=100)

for step, batch in enumerate(dataloader):
    fwd_future = training_client.forward_backward(batch, "cross_entropy")
    opt_future = training_client.optim_step(adam_params)

    fwd_result = fwd_future.result()
    opt_result = opt_future.result()

    # Collect metrics
    metrics = {
        "step": step,
        "loss": fwd_result.loss,
        "grad_norm": fwd_result.grad_norm,
        "learning_rate": opt_result.learning_rate
    }
    metrics_window.append(metrics)

    # Print summary every 10 steps
    if step % 10 == 0:
        recent_loss = sum(m["loss"] for m in metrics_window) / len(metrics_window)
        print(f"Step {step}: avg_loss={recent_loss:.4f} | "
              f"grad_norm={fwd_result.grad_norm:.4f} | "
              f"lr={opt_result.learning_rate:.2e}")

    # Log to file
    with open("training_log.jsonl", "a") as f:
        f.write(json.dumps(metrics) + "\n")
```

Common Issues & Solutions
Loss Not Decreasing
Symptoms: Loss stays flat or fluctuates randomly
Solutions to try:
- 1. Increase learning rate (try 2-5x current value)
- 2. Verify data weighting (check that completion tokens have weight=1)
- 3. Increase LoRA rank (may need more capacity)
- 4. Check for data quality issues (duplicates, errors)
- 5. Reduce batch size if using very large batches with LoRA
Overfitting
Symptoms: Training loss decreases but validation loss increases
Solutions:
- 1. Add more diverse training data
- 2. Reduce number of training epochs
- 3. Increase weight decay (L2 regularization)
- 4. Use LoRA dropout (0.05-0.1)
- 5. Reduce LoRA rank if dataset is small
- 6. Implement early stopping based on validation loss (see the sketch below)
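A minimal early-stopping sketch, using the evaluate() helper defined later in this guide; the patience and eval_interval values are illustrative.

```python
best_val_loss = float("inf")
patience, bad_evals = 3, 0
eval_interval = 200  # illustrative

for step, batch in enumerate(dataloader):
    training_client.forward_backward(batch, "cross_entropy")
    training_client.optim_step(adam_params)

    if step > 0 and step % eval_interval == 0:
        val_loss = evaluate(training_client, val_data)
        if val_loss < best_val_loss:
            # Improvement: save best model and reset the patience counter
            best_val_loss, bad_evals = val_loss, 0
            training_client.save_state(name="best_model", metadata={"step": step, "val_loss": val_loss})
        else:
            bad_evals += 1
            if bad_evals >= patience:
                print(f"Early stopping at step {step}: no improvement for {patience} evaluations")
                break
```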
Gradient Explosion
Symptoms: Loss becomes NaN, gradient norm >100
Solutions:
- 1. Enable gradient clipping (max_norm=1.0)
- 2. Reduce learning rate
- 3. Check for corrupted data (extreme values)
- 4. Use gradient accumulation to stabilize
- 5. Verify LoRA alpha is appropriate (typically 2x rank)
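A minimal detection sketch for catching an unstable step before it corrupts the run, based on the loss and grad_norm fields used in the monitoring example above; the threshold is illustrative.

```python
import math

max_grad_norm_alert = 100.0  # illustrative threshold

fwd_result = training_client.forward_backward(batch, "cross_entropy").result()

# Flag a NaN loss or an exploding gradient norm before continuing
if math.isnan(fwd_result.loss) or fwd_result.grad_norm > max_grad_norm_alert:
    raise RuntimeError(
        f"Unstable step: loss={fwd_result.loss}, grad_norm={fwd_result.grad_norm}. "
        "Reduce the learning rate, enable clipping, and resume from the last good checkpoint."
    )

training_client.optim_step(adam_params)
```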
Production Training Patterns
1. Evaluation During Training
Run periodic evaluations to track generalization and prevent overfitting:
```python
def evaluate(training_client, val_data):
    """Run validation without updating weights"""
    total_loss = 0
    for batch in val_data:
        # forward_backward computes the loss; no optim_step is issued,
        # so the model weights are not updated during validation
        result = training_client.forward_backward(
            batch, "cross_entropy"
        ).result()
        total_loss += result.loss
    return total_loss / len(val_data)

# In the training loop
for epoch in range(num_epochs):
    # Train
    for batch in train_data:
        training_client.forward_backward(batch, "cross_entropy")
        training_client.optim_step(adam_params)

    # Evaluate
    val_loss = evaluate(training_client, val_data)
    print(f"Epoch {epoch}: val_loss={val_loss:.4f}")
```

2. Graceful Error Handling
Handle transient failures and resume training:
```python
import time

import bios

def train_with_retry(training_client, dataloader, max_retries=3):
    """Training loop with automatic retry on transient failures"""
    for step, batch in enumerate(dataloader):
        for attempt in range(max_retries):
            try:
                fwd_future = training_client.forward_backward(batch, "cross_entropy")
                opt_future = training_client.optim_step(adam_params)

                fwd_result = fwd_future.result()
                opt_result = opt_future.result()

                # Success - break out of the retry loop
                break

            except bios.TransientError as e:
                if attempt < max_retries - 1:
                    print(f"Transient error at step {step}, retrying...")
                    time.sleep(2 ** attempt)  # Exponential backoff
                else:
                    print(f"Failed after {max_retries} attempts")
                    raise
```

3. Experiment Tracking
Log hyperparameters and results for reproducibility:
```python
import json
from datetime import datetime

# Save experiment config
experiment_config = {
    "timestamp": datetime.now().isoformat(),
    "base_model": "ultrasafe/usf-finance",
    "lora_rank": 16,
    "lora_alpha": 32,
    "learning_rate": 2e-4,
    "batch_size": 32,
    "num_epochs": 3,
    "dataset": "financial_qa_v2",
    "dataset_size": 1247
}

with open("experiment_config.json", "w") as f:
    json.dump(experiment_config, f, indent=2)

# After training, save results
results = {
    "final_train_loss": final_train_loss,
    "final_val_loss": final_val_loss,
    "best_val_loss": best_val_loss,
    "total_steps": total_steps,
    "training_time_hours": training_time / 3600,
    "checkpoint_path": checkpoint_path
}

with open("experiment_results.json", "w") as f:
    json.dump(results, f, indent=2)
```

Performance Optimization Tips
| Technique | Speedup | How to Apply | 
|---|---|---|
| Operation Pipelining | ~20% | Submit forward_backward + optim_step before .result() | 
| Async Functions | ~30% | Use forward_backward_async() for better concurrency | 
| Larger Batches | ~40% | Use batch_size=32-128 (GPU memory permitting) | 
| Gradient Accumulation | Varies | Accumulate over 2-8 batches for effective larger batches | 
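The sketch below combines two rows of this table: pipelining (submitting optim_step before resolving the forward_backward future, as in the earlier examples) and gradient accumulation. It assumes that Bios accumulates gradients across forward_backward calls until optim_step applies them; confirm that behavior for your setup before relying on it.

```python
accumulation_steps = 4  # effective batch = accumulation_steps * batch_size

for step, batch in enumerate(dataloader):
    # Assumption: gradients accumulate across forward_backward calls until optim_step
    fwd_future = training_client.forward_backward(batch, "cross_entropy")

    if (step + 1) % accumulation_steps == 0:
        # Pipelining: submit optim_step before resolving the forward pass future
        opt_future = training_client.optim_step(adam_params)
        fwd_result = fwd_future.result()
        opt_result = opt_future.result()
    else:
        fwd_result = fwd_future.result()
```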
Combined Impact
Applying all optimizations together can result in 2-3x throughput improvement compared to naive implementations. This translates directly to reduced training time and cost.
Security & Compliance
Data Privacy
- • Training data never leaves Bios infrastructure
- • Data is encrypted in transit and at rest
- • Your data is never used to improve base models
- • Automatic data deletion after training completion
- • GDPR and HIPAA compliant infrastructure
API Key Security
- • Never commit keys to version control
- • Use environment variables or secret managers
- • Rotate keys every 90 days
- • Separate keys for dev/prod environments
- • Revoke compromised keys immediately
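A minimal sketch of loading the key from an environment variable instead of hard-coding it; the variable name BIOS_API_KEY is illustrative, and how the key is passed to the client depends on your Bios client setup.

```python
import os

# Variable name is illustrative -- use whatever your secret manager exposes
api_key = os.environ.get("BIOS_API_KEY")
if not api_key:
    raise RuntimeError("BIOS_API_KEY is not set; load it from your secret manager, not from source code")

# Pass api_key to the Bios client at initialization; never commit it to version control.
```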
Ready for Production?
Apply these best practices to your training workflows: