Distributed Training
Bios abstracts away the complexity of distributed training: you write simple Python code, and Bios executes it exclusively on UltraSafe's secure GPU infrastructure while automatically orchestrating the distributed execution. No distributed systems expertise is required.
Key Insight
You write training loops as if running on a single GPU. Bios automatically handles data distribution, gradient accumulation, model parallelism, and fault recovery across the cluster.
How Distributed Training Works in Bios
Bios uses a client-server architecture that separates your training logic from execution infrastructure:
Your Development Interface
Write training code and define loss functions through our API. All execution happens on UltraSafe's GPUs.
Bios API Server
Receives operations, schedules work, manages state, handles failures.
GPU Cluster
Executes forward/backward passes, accumulates gradients, applies updates.
Network Transparency
Communication between your code and Bios happens over HTTPS. Your training code calls API functions that return immediately (as futures), allowing efficient pipelining of operations.
Your Code Stays Simple
This training code runs entirely on UltraSafe's distributed GPU cluster through our API:
import bios
from bios import types

# Initialize (connects to the Bios API)
service_client = bios.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="ultrasafe/usf-finance",
    rank=16
)

# Standard training loop - looks like single-GPU code
# (num_epochs and dataloader come from your own data pipeline)
for epoch in range(num_epochs):
    for batch in dataloader:
        # Queue forward/backward (executes on GPUs)
        fwd_future = training_client.forward_backward(batch, "cross_entropy")

        # Queue optimizer step (executes on GPUs)
        opt_future = training_client.optim_step(
            types.AdamParams(learning_rate=1e-4)
        )

        # Wait for results
        fwd_result = fwd_future.result()
        opt_result = opt_future.result()

        print(f"Loss: {fwd_result.loss:.4f}")

# Bios handled: GPU allocation, data distribution, gradient sync, parallelism
What Bios Did Behind the Scenes
- ✓ Allocated GPUs from the cluster
- ✓ Distributed batch across available GPUs
- ✓ Synchronized gradients across workers
- ✓ Applied gradient clipping and accumulation
- ✓ Managed model parallelism (if needed)
- ✓ Handled any worker failures transparently
Distributed Training Features
Automatic GPU Orchestration
Bios automatically allocates GPUs based on model size and batch size. You don't specify the number of GPUs; the system determines the optimal allocation.
Gradient Synchronization
After each forward/backward pass, Bios synchronizes gradients across all workers using efficient AllReduce operations.
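As a mental model (illustrative NumPy, not Bios internals), AllReduce leaves every worker holding the same element-wise average of all workers' gradients:

import numpy as np

# Hypothetical per-worker gradients for one parameter tensor
# (in Bios these live on separate GPUs; shown here as plain arrays)
worker_grads = [
    np.array([0.10, -0.20, 0.05]),  # GPU 0
    np.array([0.12, -0.18, 0.01]),  # GPU 1
    np.array([0.08, -0.22, 0.03]),  # GPU 2
    np.array([0.10, -0.20, 0.07]),  # GPU 3
]

# After AllReduce, every worker holds the same averaged gradient
synced_grad = np.mean(worker_grads, axis=0)
print(synced_grad)  # [ 0.1  -0.2   0.04]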
Gradient Accumulation
Train with large effective batch sizes by accumulating gradients over multiple forward/backward calls before applying optimizer updates.
# Accumulate gradients over 4 micro-batches
# (micro_batches is a list of training batches from your data pipeline)
for i in range(4):
    training_client.forward_backward(micro_batches[i], "cross_entropy")
# Apply the accumulated gradients in a single optimizer step
training_client.optim_step()
Automatic Fault Tolerance
If a GPU fails mid-training, Bios detects the failure, reallocates work to healthy GPUs, and restores from the last checkpoint automatically.
Data Parallelism
Bios uses data parallelism by default—each GPU processes a subset of the batch:
How Data Parallelism Works
Batch Splitting
Your batch of 128 examples is split across 4 GPUs (32 examples each)
Parallel Forward/Backward
Each GPU independently computes forward pass and gradients on its subset
Gradient Averaging
Gradients are averaged across all GPUs using AllReduce
Parameter Update
Each GPU applies the same averaged gradient to update model parameters
Result: Linear Speedup
With 4 GPUs, you get approximately 4x throughput (examples processed per second). The overhead of gradient synchronization is typically <10% for reasonably sized batches.
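To see why data parallelism matches single-GPU training, here is an illustrative NumPy sketch (our own example, not Bios code) showing that averaging per-shard gradients of a mean loss reproduces the full-batch gradient when the batch is split evenly:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 8))   # full batch: 128 examples, 8 features
y = rng.normal(size=128)
w = rng.normal(size=8)          # parameters of a toy linear model

def grad_mse(X, y, w):
    # Gradient of the mean squared error loss with respect to w
    return 2 * X.T @ (X @ w - y) / len(y)

# Single-GPU view: gradient over the full batch of 128 examples
full_grad = grad_mse(X, y, w)

# Data-parallel view: 4 "GPUs" with 32 examples each, then AllReduce (average)
shards = np.split(np.arange(128), 4)
shard_grads = [grad_mse(X[idx], y[idx], w) for idx in shards]
avg_grad = np.mean(shard_grads, axis=0)

print(np.allclose(full_grad, avg_grad))  # True: identical parameter update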
Model Parallelism (Automatic)
For very large models, Bios automatically applies model parallelism—splitting the model itself across GPUs:
When Model Parallelism Activates
Bios automatically uses model parallelism when:
- Model + optimizer state > single GPU memory
- Training very large models (70B+ parameters)
- Using high LoRA ranks that increase memory usage
No code changes needed - Bios detects memory constraints and applies tensor parallelism automatically.
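A back-of-the-envelope check (our own estimate, not Bios's internal formula) shows why a 70B-parameter model triggers model parallelism: its 16-bit weights alone exceed a single 80 GB accelerator before optimizer state or activations are counted:

# Rough memory estimate (illustrative only)
params = 70e9                  # 70B-parameter base model
bytes_per_param = 2            # 16-bit (bf16/fp16) weights

weights_gb = params * bytes_per_param / 1e9
print(f"Base weights alone: ~{weights_gb:.0f} GB")   # ~140 GB

single_gpu_gb = 80             # e.g., one 80 GB accelerator
print(weights_gb > single_gpu_gb)  # True -> the model must be split across GPUs

# With LoRA, optimizer state covers only the adapter parameters,
# but the frozen base weights still need to fit in aggregate GPU memory.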
Tensor Parallelism Strategy
Bios splits attention and MLP layers across GPUs, so each device stores and computes only a slice of every layer.
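As an illustration of the general technique (not the exact sharding scheme Bios uses), a column-parallel split of a linear layer lets each GPU hold a quarter of the weights and compute a slice of the output, which is then concatenated:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 512))       # 4 activations, hidden size 512
W = rng.normal(size=(512, 2048))    # MLP up-projection weight

# Column-parallel split: each of 4 GPUs owns 512 of the 2048 output columns
W_shards = np.split(W, 4, axis=1)
partial_outputs = [x @ W_k for W_k in W_shards]   # computed on separate GPUs
y_parallel = np.concatenate(partial_outputs, axis=1)

y_single = x @ W                    # reference: unsplit computation
print(np.allclose(y_single, y_parallel))  # True: same output, 1/4 the weights per GPU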
Performance Optimization
✓ Best Practices
- Use batch sizes that fully utilize GPU memory
- Pipeline operations by submitting forward_backward + optim_step together (see the sketch after this list)
- Use gradient accumulation for larger effective batches
- Let Bios auto-select the GPU count (don't specify it)
- Use async functions for maximum throughput
- Monitor gradient norms to detect issues early
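A minimal sketch of the pipelining recommendation above, contrasted with the serialized anti-pattern; it reuses training_client, dataloader, and adam_params from the earlier examples, and the exact speedup depends on how the Bios API server schedules queued work:

# Pipelined (recommended): queue both operations, then wait on the futures
for batch in dataloader:
    fwd_future = training_client.forward_backward(batch, "cross_entropy")
    opt_future = training_client.optim_step(adam_params)
    fwd_result = fwd_future.result()   # block only after both are submitted
    opt_result = opt_future.result()

# Serialized (anti-pattern): each .result() forces a full round trip
# before the next operation is submitted, leaving GPUs idle in between
for batch in dataloader:
    fwd_result = training_client.forward_backward(batch, "cross_entropy").result()
    training_client.optim_step(adam_params).result()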
✗ Common Pitfalls
- Very small batches (<8 examples) waste GPU cycles
- Waiting for each operation individually (no pipelining)
- Excessive checkpointing (save only when needed)
- Not monitoring GPU utilization
- Ignoring gradient accumulation opportunities
- Using synchronous APIs when async would work
Monitoring Distributed Training
Track key metrics to ensure efficient distributed execution:
# Track distributed training metrics
import time

from bios import types

# Same optimizer settings as the earlier example
adam_params = types.AdamParams(learning_rate=1e-4)

start_time = time.time()
examples_processed = 0

for epoch in range(num_epochs):
    for batch in dataloader:
        fwd_future = training_client.forward_backward(batch, "cross_entropy")
        opt_future = training_client.optim_step(adam_params)

        fwd_result = fwd_future.result()
        opt_result = opt_future.result()

        examples_processed += len(batch)

        # Calculate throughput
        elapsed = time.time() - start_time
        throughput = examples_processed / elapsed

        print(f"Loss: {fwd_result.loss:.4f} | "
              f"Grad Norm: {fwd_result.grad_norm:.4f} | "
              f"Throughput: {throughput:.1f} examples/sec | "
              f"LR: {opt_result.learning_rate:.2e}")
Next Steps
Explore related topics and advanced training techniques: