# API Reference
This section provides detailed documentation for all OpenEvals classes and methods.
## Core Modules

### Benchmark

Main orchestration class for running benchmarks.

```python
from openevals.core.benchmark import Benchmark

benchmark = Benchmark("config.yaml")
```

#### Constructor

```python
Benchmark(config_path: str)
```
| Parameter | Type | Description |
|---|---|---|
| `config_path` | `str` | Path to YAML configuration file |
Raises:

- `FileNotFoundError`: Config file does not exist
- `yaml.YAMLError`: Invalid YAML syntax
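For example, a minimal sketch of guarding construction against these errors (the config path is illustrative):

```python
import yaml

from openevals.core.benchmark import Benchmark

try:
    benchmark = Benchmark("config.yaml")
except FileNotFoundError:
    print("Config file not found; check the path passed to Benchmark().")
except yaml.YAMLError as exc:
    print(f"Config file contains invalid YAML: {exc}")
```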
### ModelWrapper

Unified interface for language models.

```python
from openevals.core.model_loader import ModelWrapper

wrapper = ModelWrapper("model-name", model, tokenizer)
```
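As a sketch, assuming the `model` and `tokenizer` arguments are ordinary Hugging Face objects (the checkpoint name is only an example, and how OpenEvals expects them to be loaded is not specified here):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from openevals.core.model_loader import ModelWrapper

# Example checkpoint; substitute any causal LM you have access to.
checkpoint = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Positional arguments follow the constructor shown above: name, model, tokenizer.
wrapper = ModelWrapper(checkpoint, model, tokenizer)
```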
## Main Classes

### Benchmark Methods

#### load_models

```python
load_models(model_names: Optional[List[str]] = None) -> None
```
Load specified models or all models in config.
| Parameter | Type | Description |
|---|---|---|
| `model_names` | `Optional[List[str]]` | Models to load. If `None`, loads all. |
#### load_tasks

```python
load_tasks(task_names: Optional[List[str]] = None) -> None
```
Load specified tasks or all tasks in config.
| Parameter | Type | Description |
|---|---|---|
| `task_names` | `Optional[List[str]]` | Tasks to load. If `None`, loads all. |
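For instance, to restrict a run to a subset of what the config declares (the names below must match entries in the YAML config; they mirror the examples used elsewhere in this document):

```python
benchmark = Benchmark("config.yaml")

# Load everything declared in the config...
benchmark.load_models()
benchmark.load_tasks()

# ...or only specific entries.
benchmark.load_models(model_names=["gemma-2b"])
benchmark.load_tasks(task_names=["mmlu", "gsm8k"])
```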
#### run_benchmarks

```python
run_benchmarks() -> Dict[str, Dict[str, Any]]
```
Run all loaded benchmarks for all loaded models.
Returns: Nested dictionary with results per model per task.
```python
{
    "gemma-2b": {
        "mmlu": {
            "overall": {"accuracy": 0.65, "total": 1000},
            "subjects": {"mathematics": {"accuracy": 0.58}}
        }
    }
}
```
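A sketch of consuming this structure, assuming models and tasks were loaded as above (key names follow the example return value):

```python
results = benchmark.run_benchmarks()

for model_name, task_results in results.items():
    for task_name, task_result in task_results.items():
        overall = task_result.get("overall", {})
        print(f"{model_name} / {task_name}: accuracy={overall.get('accuracy')}")
```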
#### save_results

```python
save_results(output_path: Optional[str] = None) -> str
```
Save results to disk.
| Parameter | Type | Description |
|---|---|---|
| `output_path` | `Optional[str]` | Path to save to. If `None`, a timestamp-based path is generated. |
Returns: Path where results were saved.
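For example (the explicit filename is hypothetical; the on-disk format is whatever OpenEvals writes):

```python
# Explicit destination...
path = benchmark.save_results("results/run_001.json")
print(f"Results written to {path}")

# ...or let the library pick a timestamp-based path.
default_path = benchmark.save_results()
```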
## Evaluation Components

### BenchmarkTask Base Interface
All tasks implement this interface:
```python
class BenchmarkTask:
    def __init__(self, config: Dict[str, Any]):
        """Initialize with configuration."""
        pass

    def load_data(self) -> Any:
        """Load benchmark dataset."""
        pass

    def evaluate(self, model: ModelWrapper) -> Dict[str, Any]:
        """Evaluate model on this task."""
        pass
```
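As a sketch, a minimal custom task might look like the following; the class, dataset, and scoring are invented for illustration, the import path for `BenchmarkTask` is a guess, and the `generate` call assumes `ModelWrapper` exposes some text-generation method:

```python
from typing import Any, Dict, List

from openevals.core.model_loader import ModelWrapper
# NOTE: the import path for BenchmarkTask is not documented above; adjust as needed.
from openevals.tasks.base import BenchmarkTask


class EchoTask(BenchmarkTask):
    """Toy task: checks whether the model repeats a prompt verbatim."""

    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.examples: List[str] = []

    def load_data(self) -> Any:
        # A real task would load a dataset here; this toy task hard-codes two prompts.
        self.examples = ["hello world", "open evals"]
        return self.examples

    def evaluate(self, model: ModelWrapper) -> Dict[str, Any]:
        # `generate` is a hypothetical method name; use the real ModelWrapper API.
        correct = sum(int(prompt in model.generate(prompt)) for prompt in self.examples)
        total = len(self.examples)
        accuracy = correct / total if total else 0.0
        return {"overall": {"correct": correct, "total": total, "accuracy": accuracy}}
```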
### MMLUBenchmark

```python
from openevals.tasks.mmlu import MMLUBenchmark

config = {"subset": "mathematics", "shot_count": 5}
benchmark = MMLUBenchmark(config)
results = benchmark.evaluate(model_wrapper)
```
Configuration:
| Option | Type | Description |
|---|---|---|
| `subset` | `str` | Subject subset: "all", "mathematics", etc. |
| `shot_count` | `int` | Few-shot examples (0-10) |
Returns:
```python
{
    "overall": {"correct": 650, "total": 1000, "accuracy": 0.65},
    "subjects": {
        "algebra": {"correct": 45, "total": 50, "accuracy": 0.90}
    }
}
```
### Gsm8kBenchmark

```python
from openevals.tasks.gsm8k import Gsm8kBenchmark

config = {"shot_count": 5, "use_chain_of_thought": True}
benchmark = Gsm8kBenchmark(config)
```
Configuration:
| Option | Type | Description |
|---|---|---|
| `shot_count` | `int` | Few-shot examples |
| `use_chain_of_thought` | `bool` | Enable CoT prompting |
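Evaluation follows the same pattern as MMLU above, assuming `model_wrapper` is a loaded `ModelWrapper`:

```python
benchmark = Gsm8kBenchmark({"shot_count": 5, "use_chain_of_thought": True})
results = benchmark.evaluate(model_wrapper)
```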
### HumanevalBenchmark

```python
from openevals.tasks.humaneval import HumanevalBenchmark

config = {"timeout": 10, "temperature": 0.2}
benchmark = HumanevalBenchmark(config)
```
Configuration:
| Option | Type | Description |
|---|---|---|
| `timeout` | `int` | Execution timeout (seconds) |
| `temperature` | `float` | Sampling temperature |
| `max_new_tokens` | `int` | Maximum number of new tokens to generate |
## Utilities

### Metrics Functions

```python
from openevals.utils.metrics import (
    calculate_accuracy,
    calculate_pass_at_k,
    calculate_confidence_interval,
    aggregate_results
)
```

#### calculate_accuracy

```python
calculate_accuracy(correct: int, total: int) -> float
```
#### calculate_pass_at_k

```python
calculate_pass_at_k(n_samples: int, n_correct: int, k: int) -> float
```
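The signature matches the standard unbiased pass@k estimator, pass@k = 1 - C(n - c, k) / C(n, k), for n samples with c correct; whether OpenEvals implements exactly this formula is an assumption, but a reference sketch looks like:

```python
from math import comb

def pass_at_k_reference(n_samples: int, n_correct: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k)."""
    if n_samples - n_correct < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n_samples - n_correct, k) / comb(n_samples, k)
```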
#### calculate_confidence_interval

```python
calculate_confidence_interval(
    accuracy: float,
    n_samples: int,
    confidence: float = 0.95
) -> Tuple[float, float]
```
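A small usage sketch combining these helpers (the numbers are made up):

```python
acc = calculate_accuracy(correct=650, total=1000)            # 0.65
low, high = calculate_confidence_interval(acc, n_samples=1000)
print(f"accuracy={acc:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```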
## Visualization

### ChartGenerator

```python
from openevals.visualization.charts import ChartGenerator

generator = ChartGenerator("output/charts")
```

#### create_performance_heatmap

```python
create_performance_heatmap(results: Dict) -> str
```
Generate heatmap of model x task performance.
#### create_model_comparison_chart

```python
create_model_comparison_chart(results: Dict, task_name: str) -> str
```
Generate bar chart comparing models on a task.
#### create_efficiency_comparison_chart

```python
create_efficiency_comparison_chart(results: Dict) -> Dict[str, str]
```
Generate efficiency charts (latency, memory, throughput).
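Putting the chart helpers together, assuming `results` is the dictionary returned by `run_benchmarks()`; each method returns the path(s) of the files it writes under the directory given to the constructor:

```python
generator = ChartGenerator("output/charts")

heatmap_path = generator.create_performance_heatmap(results)
mmlu_chart_path = generator.create_model_comparison_chart(results, task_name="mmlu")
efficiency_paths = generator.create_efficiency_comparison_chart(results)
```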
## Exceptions

| Exception | Description |
|---|---|
| `ModelLoadingError` | Model loading failed |
| `TaskInitializationError` | Task initialization failed |
| `BenchmarkExecutionError` | Benchmark execution failed |
| `ConfigurationError` | Invalid configuration |
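A sketch of wrapping a full run in these exceptions; the import path below is a guess about the package layout, so adjust it to wherever OpenEvals actually defines these classes:

```python
# Hypothetical import path; adjust to the real module that defines these exceptions.
from openevals.exceptions import (
    BenchmarkExecutionError,
    ConfigurationError,
    ModelLoadingError,
    TaskInitializationError,
)
from openevals.core.benchmark import Benchmark

try:
    benchmark = Benchmark("config.yaml")
    benchmark.load_models()
    benchmark.load_tasks()
    results = benchmark.run_benchmarks()
except ConfigurationError as exc:
    print(f"Invalid configuration: {exc}")
except (ModelLoadingError, TaskInitializationError) as exc:
    print(f"Setup failed: {exc}")
except BenchmarkExecutionError as exc:
    print(f"Benchmark run failed: {exc}")
```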