Distributed Training

Bios abstracts the complexity of distributed training: you write simple Python code, and Bios executes it exclusively on UltraSafe's secure GPU infrastructure while automatically orchestrating distributed execution. No distributed systems expertise is required.

Key Insight

You write training loops as if running on a single GPU. Bios automatically handles data distribution, gradient accumulation, model parallelism, and fault recovery across the cluster.

How Distributed Training Works in Bios

Bios uses a client-server architecture that separates your training logic from execution infrastructure:

Your Development Interface

Write training code and define loss functions through our API. All execution happens on UltraSafe's GPUs.

Bios API Server

Receives operations, schedules work, manages state, handles failures.

GPU Cluster

Executes forward/backward passes, accumulates gradients, applies updates.

Network Transparency

Communication between your code and Bios happens over HTTPS. Your training code calls API functions that return immediately (as futures), allowing efficient pipelining of operations.

Your Code Stays Simple

This training code runs entirely on UltraSafe's distributed GPU cluster through our API:

Simple Training Loop (Runs Anywhere)
import bios
from bios import types

# Initialize (connects to Bios API)
service_client = bios.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="ultrasafe/usf-finance",
    rank=16
)

# Standard training loop - looks like single-GPU code
# (num_epochs and dataloader come from your own data pipeline)
for epoch in range(num_epochs):
    for batch in dataloader:
        # Queue forward/backward (executes on GPUs)
        fwd_future = training_client.forward_backward(batch, "cross_entropy")

        # Queue optimizer step (executes on GPUs)
        opt_future = training_client.optim_step(
            types.AdamParams(learning_rate=1e-4)
        )

        # Wait for results
        fwd_result = fwd_future.result()
        opt_result = opt_future.result()

        print(f"Loss: {fwd_result.loss:.4f}")

# Bios handled: GPU allocation, data distribution, gradient sync, parallelism

What Bios Did Behind the Scenes

  • ✓ Allocated GPUs from the cluster
  • ✓ Distributed batch across available GPUs
  • ✓ Synchronized gradients across workers
  • ✓ Applied gradient clipping and accumulation
  • ✓ Managed model parallelism (if needed)
  • ✓ Handled any worker failures transparently

Distributed Training Features

Automatic GPU Orchestration

Bios automatically allocates GPUs based on model size and batch size. You don't specify the number of GPUs; the system determines the optimal allocation.

No configuration needed - just call training functions

Gradient Synchronization

After each forward/backward pass, Bios synchronizes gradients across all workers using efficient AllReduce operations.

Technique: Ring-AllReduce for gradient averaging
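
The snippet below is a minimal, purely illustrative simulation of ring-style gradient averaging in plain Python; it is not Bios internals, and the four "workers" are just lists. It shows how, after a reduce-scatter phase and an all-gather phase, every worker ends up with the same averaged gradient.

Conceptual Ring-AllReduce (Illustrative Only)
import copy

# Four simulated workers; worker w's local gradient is [w+1, w+1, w+1, w+1],
# split into four chunks so the average (2.5 per element) is easy to verify.
num_workers = 4
grads = [[float(w + 1)] * num_workers for w in range(num_workers)]

# Phase 1 - reduce-scatter: each step, worker w forwards one chunk to its
# ring neighbor, which adds it. After num_workers - 1 steps, worker w holds
# the fully summed chunk (w + 1) % num_workers.
for step in range(num_workers - 1):
    snapshot = copy.deepcopy(grads)  # values held at the start of the step
    for w in range(num_workers):
        chunk = (w - step) % num_workers
        dst = (w + 1) % num_workers
        grads[dst][chunk] += snapshot[w][chunk]

# Phase 2 - all-gather: rotate the fully summed chunks around the ring so
# every worker ends up with all of them.
for step in range(num_workers - 1):
    snapshot = copy.deepcopy(grads)
    for w in range(num_workers):
        chunk = (w + 1 - step) % num_workers
        dst = (w + 1) % num_workers
        grads[dst][chunk] = snapshot[w][chunk]

# Divide by the worker count to turn the sums into averages.
averaged = [[v / num_workers for v in g] for g in grads]
print(averaged)  # every worker ends with [2.5, 2.5, 2.5, 2.5]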

Gradient Accumulation

Train with large effective batch sizes by accumulating gradients over multiple forward/backward calls before applying an optimizer update. The effective batch size is the per-call batch size multiplied by the number of accumulated calls.

# Accumulate gradients over 4 batches (batches is a list of 4 batches)
for i in range(4):
    training_client.forward_backward(batches[i], "cross_entropy")
# Apply the accumulated gradients in a single optimizer step
training_client.optim_step()

Automatic Fault Tolerance

If a GPU fails mid-training, Bios detects the failure, reallocates work to healthy GPUs, and restores from the last checkpoint automatically.

Recovery: Automatic checkpoint + restore (no manual intervention)

Data Parallelism

Bios uses data parallelism by default—each GPU processes a subset of the batch:

How Data Parallelism Works

1. Batch Splitting: your batch of 128 examples is split across 4 GPUs (32 examples each).

2. Parallel Forward/Backward: each GPU independently computes the forward pass and gradients on its subset.

3. Gradient Averaging: gradients are averaged across all GPUs using AllReduce.

4. Parameter Update: each GPU applies the same averaged gradient to update the model parameters (see the sketch below).
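
To connect steps 1 to 4, here is a small NumPy sketch (a toy linear model, not Bios code) showing that averaging the per-shard gradients of equal-sized shards reproduces the full-batch gradient, which is why every replica can safely apply the same update.

Data Parallelism Sanity Check (Toy Example)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 16))   # full batch: 128 examples, 16 features
y = rng.normal(size=128)
w = np.zeros(16)                 # toy linear model

def grad(X_shard, y_shard, w):
    # Gradient of mean squared error for a linear model on one shard.
    err = X_shard @ w - y_shard
    return X_shard.T @ err / len(y_shard)

# Steps 1-2: split the batch across 4 "GPUs" and compute per-shard gradients.
shard_grads = [grad(Xs, ys, w) for Xs, ys in zip(np.split(X, 4), np.split(y, 4))]

# Step 3: AllReduce amounts to averaging the shard gradients.
averaged = np.mean(shard_grads, axis=0)

# Step 4: every replica applies the same update, so replicas stay identical.
w_new = w - 1e-2 * averaged

# With equal shard sizes, the averaged gradient equals the full-batch gradient.
assert np.allclose(averaged, grad(X, y, w))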

Result: Near-Linear Speedup

With 4 GPUs, you get approximately 4x throughput (examples processed per second). The overhead of gradient synchronization is typically <10% for reasonably sized batches.

1 GPU: 100 examples/sec
4 GPUs: ~380 examples/sec (3.8x speedup)

Model Parallelism (Automatic)

For very large models, Bios automatically applies model parallelism—splitting the model itself across GPUs:

When Model Parallelism Activates

Bios automatically uses model parallelism when:

  • Model + optimizer state > single GPU memory (see the rough estimate below)
  • Training very large models (70B+ parameters)
  • Using high LoRA ranks that increase memory usage

No code changes needed - Bios detects memory constraints and applies tensor parallelism automatically.
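
As a rough illustration of the first condition above, here is a back-of-the-envelope estimate. The byte counts, the 80 GiB GPU size, and the ~50M-parameter LoRA adapter are assumptions for the example, not Bios's actual accounting.

Back-of-the-Envelope Memory Estimate (Illustrative)
# Rough estimate of "model + optimizer state > single GPU memory".
# Assumes bf16 weights (2 bytes/parameter) and fp32 Adam moments
# (8 bytes/trainable parameter); activations and fragmentation are ignored.
GIB = 1024 ** 3

def needs_model_parallelism(num_params, trainable_params, gpu_mem_gib=80):
    weights = num_params * 2           # bf16 copy of the base model
    optimizer = trainable_params * 8   # Adam m and v in fp32
    required_gib = (weights + optimizer) / GIB
    return required_gib, required_gib > gpu_mem_gib

# Example: a 70B-parameter base model with a ~50M-parameter LoRA adapter
# on a hypothetical 80 GiB GPU.
print(needs_model_parallelism(70e9, 50e6))   # ~130 GiB -> exceeds a single GPU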

Tensor Parallelism Strategy

Bios splits attention and MLP layers across GPUs (D is the model's hidden dimension):

GPU 0: Q/K/V projections [0:D/4]
GPU 1: Q/K/V projections [D/4:D/2]
GPU 2: Q/K/V projections [D/2:3D/4]
GPU 3: Q/K/V projections [3D/4:D]
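
The split above is column parallelism over the projection's output dimension. The NumPy sketch below (illustrative only, with a tiny D; not Bios internals) shows why each GPU can compute its slice independently and why concatenating the slices matches the unsharded projection.

Column-Parallel Projection (Toy Example)
import numpy as np

rng = np.random.default_rng(0)
D = 8                          # hidden dimension (tiny for illustration)
x = rng.normal(size=(3, D))    # 3 tokens of hidden states
W = rng.normal(size=(D, D))    # a full Q (or K/V) projection weight

# Column-parallel split: GPU g holds columns [g*D/4 : (g+1)*D/4] of W.
shards = np.split(W, 4, axis=1)

# Each GPU computes its slice of the projection independently...
partial_outputs = [x @ W_shard for W_shard in shards]

# ...and concatenating the slices recovers the unsharded projection.
assert np.allclose(np.concatenate(partial_outputs, axis=1), x @ W)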

Performance Optimization

✓ Best Practices

  • Use batch sizes that fully utilize GPU memory
  • Pipeline operations by submitting forward_backward + optim_step together (see the sketch after these lists)
  • Use gradient accumulation for larger effective batch sizes
  • Let Bios auto-select GPU count (don't specify)
  • Use async functions for maximum throughput
  • Monitor gradient norms to detect issues early

✗ Common Pitfalls

  • Very small batches (<8 examples) waste GPU cycles
  • Waiting for each operation individually (no pipelining)
  • Excessive checkpointing (save only when needed)
  • Not monitoring GPU utilization
  • Ignoring gradient accumulation opportunities
  • Using synchronous APIs when async would work
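
As a concrete version of the pipelining advice above, the sketch below reuses the client calls from the earlier examples and keeps a small window of in-flight operations instead of blocking on every call. The window size of 2 and the dataloader variable are assumptions for illustration.

Pipelined Submission (Sketch)
# Queue work for several steps before resolving any futures so the cluster
# stays busy; assumes training_client, types, and dataloader from the
# earlier examples.
pending = []
for batch in dataloader:
    fwd_future = training_client.forward_backward(batch, "cross_entropy")
    opt_future = training_client.optim_step(types.AdamParams(learning_rate=1e-4))
    pending.append((fwd_future, opt_future))

    # Resolve an older step only once newer work is already queued.
    if len(pending) >= 2:
        old_fwd, old_opt = pending.pop(0)
        print(f"Loss: {old_fwd.result().loss:.4f}")
        old_opt.result()

# Drain whatever is still in flight at the end.
for fwd_future, opt_future in pending:
    fwd_future.result()
    opt_future.result()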

Monitoring Distributed Training

Track key metrics to ensure efficient distributed execution:

Monitoring Script
# Track distributed training metrics
import time

from bios import types

# Same optimizer settings as the training loop above
adam_params = types.AdamParams(learning_rate=1e-4)

start_time = time.time()
examples_processed = 0

for epoch in range(num_epochs):
    for batch in dataloader:
        fwd_future = training_client.forward_backward(batch, "cross_entropy")
        opt_future = training_client.optim_step(adam_params)

        fwd_result = fwd_future.result()
        opt_result = opt_future.result()

        examples_processed += len(batch)

        # Calculate throughput
        elapsed = time.time() - start_time
        throughput = examples_processed / elapsed

        print(f"Loss: {fwd_result.loss:.4f} | "
              f"Grad Norm: {fwd_result.grad_norm:.4f} | "
              f"Throughput: {throughput:.1f} examples/sec | "
              f"LR: {opt_result.learning_rate:.2e}")
Next Steps

Explore related topics and advanced training techniques.