Best Practices
Production-ready patterns, optimization techniques, and common pitfalls to avoid when training UltraSafe expert models at scale with Bios.
Goal: Reliable, Efficient Training
These practices emerged from training thousands of models in production. Following them will help you avoid common mistakes and achieve better results faster.
Data Quality Over Quantity
High-quality training data is the most important factor for successful fine-tuning.
✓ High-Quality Data
- • 100-1000 carefully curated examples
- • Diverse coverage of target distribution
- • Consistent formatting and style
- • Accurate, verified outputs
- • Representative of production use cases
- • Manually reviewed for quality
✗ Low-Quality Data
- • 10,000+ examples scraped without review
- • Narrow coverage (overfitting risk)
- • Inconsistent formatting
- • Errors or hallucinations in outputs
- • Synthetic data without validation
- • Duplicates or near-duplicates
Data Quality Checklist
- ✓ Manual review of random sample (at least 10%)
- ✓ Diversity metrics (unique inputs, output variety)
- ✓ Format validation (consistent structure)
- ✓ Deduplication (remove exact/near matches)
- ✓ Train/validation split (never evaluate on training data)
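Parts of this checklist can be automated. The sketch below is a minimal example, assuming a JSONL dataset where each example has prompt and completion fields (the file name and field names are illustrative); adapt it to your own data format.

```python
import hashlib
import json
import random

# Assumed format: JSONL with "prompt" and "completion" fields (illustrative)
with open("train_data.jsonl") as f:
    examples = [json.loads(line) for line in f]

# Deduplication: drop exact matches on normalized prompt + completion
seen, unique = set(), []
for ex in examples:
    key = hashlib.sha256(
        (ex["prompt"].strip().lower() + "\n" + ex["completion"].strip().lower()).encode()
    ).hexdigest()
    if key not in seen:
        seen.add(key)
        unique.append(ex)
print(f"Removed {len(examples) - len(unique)} duplicates")

# Train/validation split: never evaluate on training data
random.seed(0)
random.shuffle(unique)
split = int(0.9 * len(unique))
train, val = unique[:split], unique[split:]

# Manual review: sample at least 10% of the training set
review_sample = random.sample(train, max(1, len(train) // 10))
print(f"{len(train)} train / {len(val)} val, {len(review_sample)} flagged for manual review")
```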
LoRA Configuration Guidelines
Proper LoRA configuration is crucial for training effectiveness and efficiency.
Choosing LoRA Rank
The LoRA rank determines the capacity of your adapter. Follow this rule of thumb:
LoRA Parameters ≥ Total Completion Tokens in Dataset
Typical values are rank = 8, rank = 16, or rank = 32-64; higher ranks give the adapter more capacity and suit larger datasets.
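To apply this rule of thumb, estimate the adapter's parameter count from the model dimensions. The sketch below uses illustrative numbers (hidden_size, intermediate_size, num_layers, and total_completion_tokens are assumptions, and k/v projections are treated as full hidden size, ignoring grouped-query attention); substitute the real values for your base model and dataset.

```python
# Illustrative model dimensions -- substitute the real values for your base model
hidden_size = 4096
intermediate_size = 11008
num_layers = 32
rank = 16

# LoRA adds rank * (d_in + d_out) parameters per adapted weight matrix.
# Attention: q/k/v/o_proj treated as hidden x hidden (ignores grouped-query attention).
attn_params_per_layer = 4 * rank * (hidden_size + hidden_size)
# MLP: gate/up/down_proj each connect hidden_size and intermediate_size.
mlp_params_per_layer = 3 * rank * (hidden_size + intermediate_size)

total_lora_params = num_layers * (attn_params_per_layer + mlp_params_per_layer)

# Rule of thumb: LoRA parameters >= total completion tokens in the dataset
total_completion_tokens = 2_000_000  # illustrative; measure this from your dataset
print(f"LoRA parameters:   {total_lora_params:,}")
print(f"Completion tokens: {total_completion_tokens:,}")
if total_lora_params < total_completion_tokens:
    print("Consider a higher rank")
```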
Target Modules
Apply LoRA to both attention and MLP layers for best results:
```python
# Recommended: all attention + MLP layers
training_client = service_client.create_lora_training_client(
    base_model="ultrasafe/usf-finance",
    rank=16,
    alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj"      # MLP
    ]
)
```

⚠️ Common Mistake
Applying LoRA only to attention layers (q_proj, v_proj) limits capacity and can lead to underfitting. Always include MLP layers for production training.
Learning Rate Configuration
LoRA requires different learning rates than full fine-tuning due to the low-rank parameterization.
LoRA Learning Rate Scaling
Use 20-100x higher learning rates for LoRA compared to full fine-tuning:
✗ Too Low (Underperforms)
lr = 1e-5: using the full fine-tune LR for LoRA leads to slow convergence.
✓ Optimal Range
lr = 1e-4 to 5e-4: 50-100x higher than the full fine-tune LR gives fast convergence.
```python
from bios_cookbook.hyperparam_utils import get_lora_lr_over_full_finetune_lr

# Get recommended LR scaling for your model
base_model = "ultrasafe/usf-finance"
full_ft_lr = 1e-5  # Typical full fine-tune LR

# Calculate optimal LoRA LR
lr_scale = get_lora_lr_over_full_finetune_lr(base_model)
lora_lr = full_ft_lr * lr_scale

print(f"Recommended LoRA LR: {lora_lr:.2e}")
# Output: Recommended LoRA LR: 5.00e-04 (50x scaling)
```

Checkpointing Strategy
Strategic checkpointing balances fault tolerance with training efficiency.
Checkpoint Frequency
Balance between fault tolerance and overhead:
- • Short runs (<1 hour): Every 100-200 steps
- • Medium runs (1-4 hours): Every 500 steps
- • Long runs (>4 hours): Every 1000 steps + hourly (see the sketch below)
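For long runs, a minimal sketch of combining the step-based and hourly triggers; the interval values are illustrative, and save_state is the same call used in the full example further below.

```python
import time

checkpoint_every_steps = 1000      # step-based trigger (long runs)
checkpoint_every_seconds = 3600    # additional hourly time-based trigger
last_checkpoint = time.time()

for step, batch in enumerate(dataloader):
    training_client.forward_backward(batch, "cross_entropy")
    training_client.optim_step(adam_params)

    # Checkpoint when either trigger fires
    if step % checkpoint_every_steps == 0 or time.time() - last_checkpoint > checkpoint_every_seconds:
        training_client.save_state(name=f"training_step_{step}", metadata={"step": step})
        last_checkpoint = time.time()
```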
Checkpoint Types
Use different checkpoints for different purposes:
- • Training checkpoints: Full optimizer state (resume training)
- • Sampling checkpoints: Weights only (faster loading)
- • Best model: Based on validation metrics
The loop below combines periodic training checkpoints with validation-based best-model tracking:

```python
best_val_loss = float('inf')
checkpoint_interval = 500
eval_interval = 500

for step, batch in enumerate(dataloader):
    # Training step
    fwd_future = training_client.forward_backward(batch, "cross_entropy")
    opt_future = training_client.optim_step(adam_params)

    fwd_result = fwd_future.result()

    # Periodic checkpoint
    if step % checkpoint_interval == 0:
        training_client.save_state(
            name=f"training_step_{step}",
            metadata={"step": step, "loss": fwd_result.loss}
        )

    # Validation + best model tracking
    if step % eval_interval == 0:
        val_loss = evaluate(training_client, val_data)

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            training_client.save_state(
                name="best_model",
                metadata={"step": step, "val_loss": val_loss}
            )
            print(f"✓ New best model: val_loss={val_loss:.4f}")
```

Monitoring & Debugging
Track the right metrics to catch issues early and optimize training.
Essential Metrics to Monitor
- • Training Loss: Should decrease steadily; a plateau may indicate learning rate issues.
- • Validation Loss: If it diverges from training loss → overfitting (reduce epochs or add data).
- • Gradient Norm: Very large (>10) or very small (<0.01) values indicate problems.
- • Learning Rate: Track LR decay over training; warmup + cosine decay is common (a reference implementation follows this list).
- • Throughput: Examples per second; should be stable (a drop indicates a bottleneck).
- • GPU Utilization: Reported by Bios; target >80% for efficient training.
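For reference, here is a minimal implementation of the warmup + cosine decay schedule mentioned above. This is plain Python rather than a Bios API call, and the example values are illustrative.

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup followed by cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: peak LR 2e-4, 100 warmup steps, 2000 total steps
print(lr_at_step(0, 2e-4, 100, 2000))     # near zero during warmup
print(lr_at_step(100, 2e-4, 100, 2000))   # peak LR after warmup
print(lr_at_step(2000, 2e-4, 100, 2000))  # decayed to min_lr
```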
The loop below tracks these metrics in a rolling window and appends them to a JSONL log:

```python
import json
from collections import deque

# Track metrics in a rolling window
metrics_window = deque(maxlen=100)

for step, batch in enumerate(dataloader):
    fwd_future = training_client.forward_backward(batch, "cross_entropy")
    opt_future = training_client.optim_step(adam_params)

    fwd_result = fwd_future.result()
    opt_result = opt_future.result()

    # Collect metrics
    metrics = {
        "step": step,
        "loss": fwd_result.loss,
        "grad_norm": fwd_result.grad_norm,
        "learning_rate": opt_result.learning_rate
    }
    metrics_window.append(metrics)

    # Print summary every 10 steps
    if step % 10 == 0:
        recent_loss = sum(m["loss"] for m in metrics_window) / len(metrics_window)
        print(f"Step {step}: avg_loss={recent_loss:.4f} | "
              f"grad_norm={fwd_result.grad_norm:.4f} | "
              f"lr={opt_result.learning_rate:.2e}")

    # Log to file
    with open("training_log.jsonl", "a") as f:
        f.write(json.dumps(metrics) + "\n")
```

Common Issues & Solutions
Loss Not Decreasing
Symptoms: Loss stays flat or fluctuates randomly
Solutions to try:
- 1. Increase learning rate (try 2-5x current value)
- 2. Verify data weighting (check that completion tokens have weight=1)
- 3. Increase LoRA rank (may need more capacity)
- 4. Check for data quality issues (duplicates, errors)
- 5. Reduce batch size if using very large batches with LoRA
Overfitting
Symptoms: Training loss decreases but validation loss increases
Solutions:
- 1. Add more diverse training data
- 2. Reduce number of training epochs
- 3. Increase weight decay (L2 regularization)
- 4. Use LoRA dropout (0.05-0.1)
- 5. Reduce LoRA rank if dataset is small
- 6. Implement early stopping based on validation loss (see the sketch below)
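A minimal early-stopping sketch, using the evaluate() helper defined later in this guide; the patience and eval_interval values are illustrative.

```python
best_val_loss = float("inf")
patience, bad_evals = 3, 0
eval_interval = 200  # illustrative

for step, batch in enumerate(dataloader):
    training_client.forward_backward(batch, "cross_entropy")
    training_client.optim_step(adam_params)

    if step > 0 and step % eval_interval == 0:
        val_loss = evaluate(training_client, val_data)
        if val_loss < best_val_loss:
            # Improvement: save best model and reset the patience counter
            best_val_loss, bad_evals = val_loss, 0
            training_client.save_state(name="best_model", metadata={"step": step, "val_loss": val_loss})
        else:
            bad_evals += 1
            if bad_evals >= patience:
                print(f"Early stopping at step {step}: no improvement for {patience} evaluations")
                break
```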
Gradient Explosion
Symptoms: Loss becomes NaN, gradient norm >100
Solutions:
- 1. Enable gradient clipping (max_norm=1.0)
- 2. Reduce learning rate
- 3. Check for corrupted data (extreme values)
- 4. Use gradient accumulation to stabilize
- 5. Verify LoRA alpha is appropriate (typically 2x rank)
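A minimal detection sketch for catching an unstable step before it corrupts the run, based on the loss and grad_norm fields used in the monitoring example above; the threshold is illustrative.

```python
import math

max_grad_norm_alert = 100.0  # illustrative threshold

fwd_result = training_client.forward_backward(batch, "cross_entropy").result()

# Flag a NaN loss or an exploding gradient norm before continuing
if math.isnan(fwd_result.loss) or fwd_result.grad_norm > max_grad_norm_alert:
    raise RuntimeError(
        f"Unstable step: loss={fwd_result.loss}, grad_norm={fwd_result.grad_norm}. "
        "Reduce the learning rate, enable clipping, and resume from the last good checkpoint."
    )

training_client.optim_step(adam_params)
```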
Production Training Patterns
1. Evaluation During Training
Run periodic evaluations to track generalization and prevent overfitting:
```python
def evaluate(training_client, val_data):
    """Run validation without updating weights"""
    total_loss = 0
    for batch in val_data:
        # forward_backward computes the loss; no optim_step is issued,
        # so the model weights are not updated during validation
        result = training_client.forward_backward(
            batch, "cross_entropy"
        ).result()
        total_loss += result.loss
    return total_loss / len(val_data)

# In the training loop
for epoch in range(num_epochs):
    # Train
    for batch in train_data:
        training_client.forward_backward(batch, "cross_entropy")
        training_client.optim_step(adam_params)

    # Evaluate
    val_loss = evaluate(training_client, val_data)
    print(f"Epoch {epoch}: val_loss={val_loss:.4f}")
```

2. Graceful Error Handling
Handle transient failures and resume training:
```python
import time

import bios

def train_with_retry(training_client, dataloader, max_retries=3):
    """Training loop with automatic retry on transient failures"""
    for step, batch in enumerate(dataloader):
        for attempt in range(max_retries):
            try:
                fwd_future = training_client.forward_backward(batch, "cross_entropy")
                opt_future = training_client.optim_step(adam_params)

                fwd_result = fwd_future.result()
                opt_result = opt_future.result()

                # Success - break out of the retry loop
                break

            except bios.TransientError as e:
                if attempt < max_retries - 1:
                    print(f"Transient error at step {step}, retrying...")
                    time.sleep(2 ** attempt)  # Exponential backoff
                else:
                    print(f"Failed after {max_retries} attempts")
                    raise
```

3. Experiment Tracking
Log hyperparameters and results for reproducibility:
```python
import json
from datetime import datetime

# Save experiment config
experiment_config = {
    "timestamp": datetime.now().isoformat(),
    "base_model": "ultrasafe/usf-finance",
    "lora_rank": 16,
    "lora_alpha": 32,
    "learning_rate": 2e-4,
    "batch_size": 32,
    "num_epochs": 3,
    "dataset": "financial_qa_v2",
    "dataset_size": 1247
}

with open("experiment_config.json", "w") as f:
    json.dump(experiment_config, f, indent=2)

# After training, save results
results = {
    "final_train_loss": final_train_loss,
    "final_val_loss": final_val_loss,
    "best_val_loss": best_val_loss,
    "total_steps": total_steps,
    "training_time_hours": training_time / 3600,
    "checkpoint_path": checkpoint_path
}

with open("experiment_results.json", "w") as f:
    json.dump(results, f, indent=2)
```

Performance Optimization Tips
| Technique | Speedup | How to Apply | 
|---|---|---|
| Operation Pipelining | ~20% | Submit forward_backward + optim_step before .result() | 
| Async Functions | ~30% | Use forward_backward_async() for better concurrency | 
| Larger Batches | ~40% | Use batch_size=32-128 (GPU memory permitting) | 
| Gradient Accumulation | Varies | Accumulate over 2-8 batches for effective larger batches | 
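The sketch below combines two rows of this table: pipelining (submitting optim_step before resolving the forward_backward future, as in the earlier examples) and gradient accumulation. It assumes that Bios accumulates gradients across forward_backward calls until optim_step applies them; confirm that behavior for your setup before relying on it.

```python
accumulation_steps = 4  # effective batch = accumulation_steps * batch_size

for step, batch in enumerate(dataloader):
    # Assumption: gradients accumulate across forward_backward calls until optim_step
    fwd_future = training_client.forward_backward(batch, "cross_entropy")

    if (step + 1) % accumulation_steps == 0:
        # Pipelining: submit optim_step before resolving the forward pass future
        opt_future = training_client.optim_step(adam_params)
        fwd_result = fwd_future.result()
        opt_result = opt_future.result()
    else:
        fwd_result = fwd_future.result()
```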
Combined Impact
Applying all optimizations together can result in 2-3x throughput improvement compared to naive implementations. This translates directly to reduced training time and cost.
Security & Compliance
Data Privacy
- • Training data never leaves Bios infrastructure
- • Data is encrypted in transit and at rest
- • Your data is never used to improve base models
- • Automatic data deletion after training completion
- • GDPR and HIPAA compliant infrastructure
API Key Security
- • Never commit keys to version control
- • Use environment variables or secret managers
- • Rotate keys every 90 days
- • Separate keys for dev/prod environments
- • Revoke compromised keys immediately
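A minimal sketch of loading the key from an environment variable instead of hard-coding it; the variable name BIOS_API_KEY is illustrative, and how the key is passed to the client depends on your Bios client setup.

```python
import os

# Variable name is illustrative -- use whatever your secret manager exposes
api_key = os.environ.get("BIOS_API_KEY")
if not api_key:
    raise RuntimeError("BIOS_API_KEY is not set; load it from your secret manager, not from source code")

# Pass api_key to the Bios client at initialization; never commit it to version control.
```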
Ready for Production?
Apply these best practices to your training workflows: