Model Evaluations

Bios training scripts automatically print training and test loss. Beyond basic loss metrics, this guide covers two common evaluation workflows: inline evaluations during training and offline evaluations on saved checkpoints.

Inline Evaluations During Training

Add inline evaluations to your training runs by configuring evaluator builders. These run periodically during training to monitor model quality on held-out tasks.

Supervised Fine-Tuning Evals

Configure evaluation frequency in your SFT config:

SFT Inline Evaluations
# Add to supervised training config
config = {
    # Regular evaluations
    "evaluator_builders": [eval_builder_1, eval_builder_2],
    "eval_every": 100,  # Run every 100 steps

    # Infrequent (expensive) evaluations
    "infrequent_evaluator_builders": [expensive_eval],
    "infrequent_eval_every": 500,  # Run every 500 steps
}

RL Training Evals

Configure periodic evaluations for RL training:

RL Inline Evaluations
# Add to RL training config
config = {
    "evaluator_builders": [sampling_eval_1, sampling_eval_2],
    "eval_every": 50,  # Run every 50 RL iterations
}

Evaluator Types

  • EvaluatorBuilder for SFT
  • SamplingClientEvaluator for RL
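
For a concrete picture of how these fit together, the sketch below assumes an evaluator builder is simply a zero-argument callable that constructs an evaluator; the exact EvaluatorBuilder interface may differ in your version of the Bios Cookbook. The builder name, dataset, and grader here are illustrative, and CustomEvaluator refers to the SamplingClientEvaluator subclass defined later in this guide.

Evaluator Builder Sketch

# Sketch only: assumes an evaluator builder is a zero-argument callable that
# returns an evaluator instance. Check your Bios Cookbook version for the
# exact EvaluatorBuilder interface. CustomEvaluator is the
# SamplingClientEvaluator subclass defined later in this guide.
def capital_qa_eval_builder():
    return CustomEvaluator(
        dataset=[{"input": "What is the capital of France?", "output": "Paris"}],
        grader_fn=lambda response, target: target.lower() in response.lower(),
        model_name="ultrasafe/usf-mini",
        renderer_name="ultrasafe",
    )

# Pass the builder into the training config shown above
config = {
    "evaluator_builders": [capital_qa_eval_builder],
    "eval_every": 100,
}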

Offline Evaluations with Inspect AI

Run standard benchmarks on your trained models using the Inspect AI library. Bios provides a script that runs these benchmarks through its internal sampling functionality.

Run Inspect Evaluations
MODEL_PATH=bios://YOUR_MODEL_PATH_HERE
python -m bios_cookbook.eval.run_inspect_evals \
    model_path=$MODEL_PATH \
    model_name=ultrasafe/usf-mini \
    tasks=inspect_evals/ifeval,inspect_evals/mmlu_0_shot \
    renderer_name=ultrasafe

Supported Evaluation Tasks

Standard benchmarks available through Inspect AI:

  • ifeval: Instruction following
  • mmlu_0_shot: Multitask language understanding
  • gsm8k: Math reasoning

View the full list of supported evaluation tasks in the Bios Cookbook documentation.

Creating Custom Evaluations

Bios recommends two approaches for custom evaluations:

1. Inspect AI Tasks

Create custom tasks using the Inspect AI framework, then run them with the evaluation script above

2. SamplingClientEvaluator

Lower-level abstraction with fine-grained control over datasets and metrics

LLM-as-Judge with Inspect AI

The following example shows how to create an evaluation in which an LLM grades the model's responses:

LLM-as-Judge Evaluation
import bios
from inspect_ai import Task, task
from inspect_ai.dataset import MemoryDataset, Sample
from inspect_ai.scorer import model_graded_qa
from inspect_ai.solver import generate
from bios_cookbook.eval.inspect_utils import InspectAPIFromBiosSampling

# Define QA dataset
QA_DATASET = MemoryDataset(
    name="qa_dataset",
    samples=[
        Sample(input="What is the capital of France?", target="Paris"),
        Sample(input="What is the capital of Italy?", target="Rome"),
    ],
)

# Create Bios sampling client
service_client = bios.ServiceClient()
sampling_client = service_client.create_sampling_client(
    base_model="ultrasafe/usf-mini"
)

# Wrap for Inspect AI
api = InspectAPIFromBiosSampling(
    renderer_name="ultrasafe",
    model_name="ultrasafe/usf-mini",
    sampling_client=sampling_client,
    verbose=False,
)

@task
def example_lm_as_judge() -> Task:
    return Task(
        name="llm_as_judge",
        dataset=QA_DATASET,
        solver=generate(),
        scorer=model_graded_qa(
            instructions="Grade strictly. Respond 'GRADE: C' if correct or 'GRADE: I' otherwise.",
            partial_credit=False,
        ),
    )
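
To run this task directly (rather than through the run_inspect_evals script), the wrapped API can be handed to Inspect AI's eval() entry point. This is a minimal sketch assuming the standard Inspect AI Model and GenerateConfig wrapping; the max_tokens value is an arbitrary illustration, and the script shown earlier performs equivalent wiring for you.

Run the LLM-as-Judge Task

from inspect_ai import eval
from inspect_ai.model import GenerateConfig, Model

# Wrap the Bios-backed API in an Inspect AI Model and run the task.
# Sketch only: the run_inspect_evals script performs equivalent wiring.
model = Model(api, GenerateConfig(max_tokens=512))
logs = eval(example_lm_as_judge(), model=model)
print(logs[0].results)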

Custom SamplingClientEvaluator

For fine-grained control, implement your own evaluator class:

Custom Evaluator Implementation
from typing import Callable

import bios
from bios import types
from bios_cookbook import renderers
from bios_cookbook.evaluators import SamplingClientEvaluator
from bios_cookbook.tokenizer_utils import get_tokenizer

class CustomEvaluator(SamplingClientEvaluator):
    def __init__(
        self,
        dataset: list[dict[str, str]],
        grader_fn: Callable[[str, str], bool],
        model_name: str,
        renderer_name: str,
    ):
        self.dataset = dataset
        self.grader_fn = grader_fn
        tokenizer = get_tokenizer(model_name)
        self.renderer = renderers.get_renderer(name=renderer_name, tokenizer=tokenizer)

    async def __call__(self, sampling_client) -> dict[str, float]:
        num_correct = 0
        sampling_params = types.SamplingParams(
            max_tokens=100,
            temperature=0.7,
            stop=self.renderer.get_stop_sequences(),
        )

        for datum in self.dataset:
            model_input = self.renderer.build_generation_prompt(
                [{"role": "user", "content": datum["input"]}]
            )
            result = await sampling_client.sample_async(
                prompt=model_input,
                num_samples=1,
                sampling_params=sampling_params,
            )
            tokens = result.sequences[0].tokens
            response = self.renderer.parse_response(tokens)[0]

            if self.grader_fn(response["content"], datum["output"]):
                num_correct += 1

        return {"accuracy": num_correct / len(self.dataset)}

Using Custom Evaluators

Example of running a custom evaluator on a model checkpoint:

Run Custom Evaluation
import asyncio
import bios

# Define test dataset
QA_DATASET = [
    {"input": "What is the capital of France?", "output": "Paris"},
    {"input": "What is the capital of Germany?", "output": "Berlin"},
    {"input": "What is the capital of Italy?", "output": "Rome"},
]

# Simple grader function: count a response as correct if it contains the target
def grader_fn(response: str, target: str) -> bool:
    return target.lower() in response.lower()

# Create evaluator (CustomEvaluator is defined in the previous example)
evaluator = CustomEvaluator(
    dataset=QA_DATASET,
    grader_fn=grader_fn,
    model_name="ultrasafe/usf-mini",
    renderer_name="ultrasafe",
)

# Run evaluation
service_client = bios.ServiceClient()
sampling_client = service_client.create_sampling_client(
    base_model="ultrasafe/usf-mini"
)

async def main():
    result = await evaluator(sampling_client)
    print(f"Evaluation Results: {result}")
    # Example output: {'accuracy': 1.0}

asyncio.run(main())
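
The example above samples from the base model. To evaluate a trained checkpoint instead, create the sampling client from the checkpoint path rather than the base model. The model_path argument below is an assumption that mirrors the model_path= argument of the run_inspect_evals script and may be named differently in your SDK.

# Assumption: create_sampling_client accepts a checkpoint path, mirroring the
# model_path= argument of run_inspect_evals. Verify the exact argument name
# in your Bios SDK.
async def eval_checkpoint():
    checkpoint_client = service_client.create_sampling_client(
        model_path="bios://YOUR_MODEL_PATH_HERE"  # assumed argument name
    )
    return await evaluator(checkpoint_client)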

Evaluation Best Practices

✓ Do

  • Run inline evals to catch issues early
  • Use multiple diverse evaluation tasks
  • Save evaluation results with checkpoints
  • Test on held-out data, not the training set
  • Use Inspect AI for standard benchmarks
  • Create domain-specific custom evaluators

✗ Don't

  • Don't run expensive evals too frequently (slows training)
  • Don't skip evaluation on important checkpoints
  • Don't use only training loss as a quality metric
  • Don't evaluate on training data (overfitting blind spot)
  • Don't ignore eval failures or anomalies