Quick Start

This guide walks you through downloading benchmark data, running your first evaluation, and viewing the results.

Prerequisites

  • OpenEvals installed (see Installation)

  • HuggingFace token configured

  • GPU with at least 4GB VRAM (or CPU)
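
Before downloading anything, you can sanity-check these prerequisites from Python. The sketch below is not part of OpenEvals; it assumes PyTorch and huggingface_hub are available in your environment, so adjust it to match your actual setup.

# Hypothetical environment check -- not part of OpenEvals itself.
import torch
from huggingface_hub import whoami

if torch.cuda.is_available():
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU: {torch.cuda.get_device_name(0)} ({vram_gb:.1f} GB VRAM)")
else:
    print("No GPU detected; evaluation will run on CPU.")

# Raises an error if no HuggingFace token is configured
# (e.g. via `huggingface-cli login` or the HF_TOKEN variable).
print("HuggingFace user:", whoami()["name"])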

Your First Benchmark

Download Benchmark Data

python -m openevals.scripts.download_data --all
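
The --all flag fetches data for every supported benchmark. If you want to inspect a single dataset first, you can pull it straight from the HuggingFace Hub. Below is a minimal sketch using the datasets library; cais/mmlu is the standard MMLU mirror on the Hub, though OpenEvals' own download script may use a different source.

# Hypothetical manual download, for inspection only; the supported
# path is the download_data script above.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")
print(mmlu[0])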

Run a Simple Evaluation

python -m openevals.scripts.run_benchmark \
  --config configs/benchmark_config.yaml \
  --models gemma-2b \
  --tasks mmlu \
  --visualize

View Results

Results are saved to results/ with the following structure:

results/
└── 20250128_143022/
    ├── results.yaml       # Raw results
    ├── summary.json       # Aggregated metrics
    ├── visualizations/    # Charts and graphs
    └── report.md          # Human-readable report
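
Each run gets its own timestamped directory. To read the aggregated metrics from the latest run programmatically, here is a minimal sketch; it assumes only that summary.json is valid JSON, not any particular schema:

import json
from pathlib import Path

# Pick the most recent timestamped run directory under results/.
runs = [p for p in Path("results").iterdir() if p.is_dir()]
latest = max(runs, key=lambda p: p.name)

with open(latest / "summary.json") as f:
    summary = json.load(f)

print(json.dumps(summary, indent=2))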

Using the Python API

from openevals.core.benchmark import Benchmark

# Initialize the benchmark from a configuration file
benchmark = Benchmark("config.yaml")
benchmark.load_models(["gemma-2b"])
benchmark.load_tasks(["mmlu"])

# Run the evaluation and write the raw results to disk
results = benchmark.run_benchmarks()
benchmark.save_results("results.yaml")
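
Since the saved file is plain YAML, it can be inspected without the Benchmark object. A short sketch assuming only that PyYAML is installed (the schema of the results isn't documented here, so it just pretty-prints the contents):

import yaml

with open("results.yaml") as f:
    results = yaml.safe_load(f)

print(yaml.safe_dump(results, sort_keys=False))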

Configuration Example

Create a configuration file config.yaml:

models:
  gemma-2b:
    type: gemma
    size: 2b
    variant: it
    quantization: true

tasks:
  mmlu:
    type: mmlu
    subset: mathematics
    shot_count: 5

evaluation:
  runs: 1
  batch_size: auto

output:
  path: ./results
  visualize: true
  export_formats: [json, yaml]

hardware:
  device: auto
  precision: bfloat16
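
Before launching a long run, it can help to confirm the file parses and names the models and tasks you expect. A minimal sketch using PyYAML; it checks only for the five top-level sections shown above:

import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# The five top-level sections used in the example above.
for section in ("models", "tasks", "evaluation", "output", "hardware"):
    assert section in config, f"missing section: {section}"

print("models:", ", ".join(config["models"]))
print("tasks: ", ", ".join(config["tasks"]))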

Using the Web Interface

For a browser-based experience:

# Start backend
cd web/backend && uvicorn app.main:app --port 8000

# Start frontend (new terminal)
cd web/frontend && npm install && npm run dev

Access at http://localhost:3000.
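
If the backend is FastAPI (suggested by the uvicorn app.main:app entry point), interactive API docs are typically served at http://localhost:8000/docs, but verify this against your deployment. To check that the backend is reachable at all, here is a standard-library sketch that assumes nothing about the API's routes:

import urllib.error
import urllib.request

# Any HTTP response, even a 404, means the backend is up.
try:
    with urllib.request.urlopen("http://localhost:8000", timeout=5) as resp:
        print("Backend responded with status", resp.status)
except urllib.error.HTTPError as e:
    print("Backend responded with status", e.code)
except OSError:
    print("Backend unreachable; is uvicorn running?")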

Next Steps