API Reference

This section provides detailed documentation for all OpenEvals classes and methods.

Core Modules

Benchmark

Main orchestration class for running benchmarks.

from openevals.core.benchmark import Benchmark

benchmark = Benchmark("config.yaml")

Constructor

Benchmark(config_path: str)

Parameter     Type   Description
config_path   str    Path to YAML configuration file

Raises:

  • FileNotFoundError - Config file does not exist

  • yaml.YAMLError - Invalid YAML syntax
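
A minimal construction sketch with the documented exceptions handled explicitly (nothing beyond the two exceptions listed above is assumed):

import yaml

from openevals.core.benchmark import Benchmark

try:
    benchmark = Benchmark("config.yaml")
except FileNotFoundError:
    # The configuration file does not exist at the given path.
    print("config.yaml not found; check the path.")
except yaml.YAMLError as exc:
    # The file exists but contains invalid YAML.
    print(f"Invalid YAML syntax: {exc}")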

ModelWrapper

Unified interface for language models.

from openevals.core.model_loader import ModelWrapper

wrapper = ModelWrapper("model-name", model, tokenizer)
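
A construction sketch, assuming the model and tokenizer are Hugging Face transformers objects (this reference does not specify the expected types, and the model id is illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

from openevals.core.model_loader import ModelWrapper

model_id = "google/gemma-2b"  # illustrative Hugging Face model id
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Wrap the model and tokenizer behind the unified interface.
wrapper = ModelWrapper(model_id, model, tokenizer)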

Main Classes

Benchmark Methods

load_models

load_models(model_names: Optional[List[str]] = None) -> None

Load the specified models, or all models defined in the config.

Parameter     Type                  Description
model_names   Optional[List[str]]   Models to load. If None, loads all.
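
Usage sketch (the model name is a placeholder and must match a key in your configuration):

# Load a single configured model by name...
benchmark.load_models(["gemma-2b"])

# ...or omit the argument to load every model in the config.
benchmark.load_models()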

load_tasks

load_tasks(task_names: Optional[List[str]] = None) -> None

Load the specified tasks, or all tasks defined in the config.

Parameter    Type                  Description
task_names   Optional[List[str]]   Tasks to load. If None, loads all.
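
Usage sketch (the task names are placeholders and must match tasks defined in your configuration):

# Load specific tasks by name...
benchmark.load_tasks(["mmlu", "gsm8k"])

# ...or omit the argument to load every task in the config.
benchmark.load_tasks()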

run_benchmarks

run_benchmarks() -> Dict[str, Dict[str, Any]]

Run all loaded benchmarks for all loaded models.

Returns: A nested dictionary of results keyed by model name, then by task name. For example:

{
    "gemma-2b": {
        "mmlu": {
            "overall": {"accuracy": 0.65, "total": 1000},
            "subjects": {"mathematics": {"accuracy": 0.58}}
        }
    }
}
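
The nested result can be traversed directly; a minimal sketch that assumes only the shape shown above:

results = benchmark.run_benchmarks()

for model_name, tasks in results.items():
    for task_name, task_result in tasks.items():
        overall = task_result.get("overall", {})
        print(f"{model_name} / {task_name}: accuracy={overall.get('accuracy')}")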

save_results

save_results(output_path: Optional[str] = None) -> str

Save results to disk.

Parameter     Type            Description
output_path   Optional[str]   Path to save to. If None, generates a timestamp-based path.

Returns: Path where results were saved.
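
Usage sketch (the explicit output path, including its extension, is an assumption; the serialization format is not specified here):

# Save to an explicit path...
path = benchmark.save_results("results/run_001.json")  # path and extension are illustrative

# ...or let the library pick a timestamp-based path.
path = benchmark.save_results()
print(f"Results written to {path}")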

Evaluation Components

BenchmarkTask Base Interface

All tasks implement this interface:

class BenchmarkTask:
    def __init__(self, config: Dict[str, Any]):
        """Initialize with configuration."""
        pass

    def load_data(self) -> Any:
        """Load benchmark dataset."""
        pass

    def evaluate(self, model: ModelWrapper) -> Dict[str, Any]:
        """Evaluate model on this task."""
        pass
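
A custom task only has to implement these three methods. The sketch below is illustrative: the dataset and scoring are placeholders, and it assumes BenchmarkTask is importable from wherever your installation defines it (the path is not given in this reference):

from typing import Any, Dict

from openevals.core.model_loader import ModelWrapper

class EchoBenchmark(BenchmarkTask):  # BenchmarkTask as defined by the interface above
    """Toy task illustrating the interface; not a real benchmark."""

    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.data: Any = None

    def load_data(self) -> Any:
        # Placeholder dataset; a real task would load its benchmark here.
        self.data = ["hello", "world"]
        return self.data

    def evaluate(self, model: ModelWrapper) -> Dict[str, Any]:
        # Placeholder scoring; a real task would prompt the model and grade its outputs.
        data = self.data if self.data is not None else self.load_data()
        return {"overall": {"correct": 0, "total": len(data), "accuracy": 0.0}}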

MMLUBenchmark

from openevals.tasks.mmlu import MMLUBenchmark

config = {"subset": "mathematics", "shot_count": 5}
benchmark = MMLUBenchmark(config)
results = benchmark.evaluate(model_wrapper)

Configuration:

Option       Type   Description
subset       str    Subject subset: "all", "mathematics", etc.
shot_count   int    Few-shot examples (0-10)

Returns:

{
    "overall": {"correct": 650, "total": 1000, "accuracy": 0.65},
    "subjects": {
        "algebra": {"correct": 45, "total": 50, "accuracy": 0.90}
    }
}
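
Subject-level results can be read straight off that structure; a minimal sketch:

results = benchmark.evaluate(model_wrapper)

print(f"Overall accuracy: {results['overall']['accuracy']:.2%}")
for subject, stats in results["subjects"].items():
    print(f"  {subject}: {stats['accuracy']:.2%} ({stats['correct']}/{stats['total']})")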

Gsm8kBenchmark

from openevals.tasks.gsm8k import Gsm8kBenchmark

config = {"shot_count": 5, "use_chain_of_thought": True}
benchmark = Gsm8kBenchmark(config)

Configuration:

Option                 Type   Description
shot_count             int    Few-shot examples
use_chain_of_thought   bool   Enable CoT prompting
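
A sketch comparing chain-of-thought prompting against direct answering; it assumes the returned dictionary has an "overall" entry like the other tasks, which this reference does not guarantee:

from openevals.tasks.gsm8k import Gsm8kBenchmark

# Compare CoT prompting with direct answering on the same model.
for use_cot in (True, False):
    task = Gsm8kBenchmark({"shot_count": 5, "use_chain_of_thought": use_cot})
    results = task.evaluate(model_wrapper)
    print(f"use_chain_of_thought={use_cot}: {results.get('overall')}")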

HumanevalBenchmark

from openevals.tasks.humaneval import HumanevalBenchmark

config = {"timeout": 10, "temperature": 0.2}
benchmark = HumanevalBenchmark(config)

Configuration:

Option

Type

Description

timeout

int

Execution timeout (seconds)

temperature

float

Sampling temperature

max_new_tokens

int

Maximum tokens
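
A fuller configuration sketch; the max_new_tokens value is illustrative, and evaluate follows the base interface shown earlier:

from openevals.tasks.humaneval import HumanevalBenchmark

config = {
    "timeout": 10,          # seconds allowed per generated solution
    "temperature": 0.2,     # sampling temperature for code generation
    "max_new_tokens": 512,  # illustrative cap on newly generated tokens
}
benchmark = HumanevalBenchmark(config)
results = benchmark.evaluate(model_wrapper)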

Utilities

Metrics Functions

from openevals.utils.metrics import (
    calculate_accuracy,
    calculate_pass_at_k,
    calculate_confidence_interval,
    aggregate_results
)

calculate_accuracy

calculate_accuracy(correct: int, total: int) -> float

calculate_pass_at_k

calculate_pass_at_k(n_samples: int, n_correct: int, k: int) -> float

calculate_confidence_interval

calculate_confidence_interval(
    accuracy: float,
    n_samples: int,
    confidence: float = 0.95
) -> Tuple[float, float]
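
Usage sketch, reusing the imports shown above. The pass@k helper's inputs suggest the standard estimator pass@k = 1 - C(n - c, k) / C(n, k), but its exact implementation is not documented here; the numbers are illustrative:

acc = calculate_accuracy(correct=650, total=1000)                # 0.65
low, high = calculate_confidence_interval(acc, n_samples=1000)   # 95% interval by default

# Chance that at least one of k sampled completions is correct,
# given n_correct correct completions out of n_samples.
p_at_10 = calculate_pass_at_k(n_samples=100, n_correct=25, k=10)

print(acc, (low, high), p_at_10)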

Visualization

ChartGenerator

from openevals.visualization.charts import ChartGenerator

generator = ChartGenerator("output/charts")

create_performance_heatmap

create_performance_heatmap(results: Dict) -> str

Generate a heatmap of model × task performance.

create_model_comparison_chart

create_model_comparison_chart(results: Dict, task_name: str) -> str

Generate bar chart comparing models on a task.

create_efficiency_comparison_chart

create_efficiency_comparison_chart(results: Dict) -> Dict[str, str]

Generate efficiency charts (latency, memory, throughput).
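
Usage sketch; results is assumed to be the dictionary returned by Benchmark.run_benchmarks, and each method returns the path(s) of the generated chart files as documented above:

from openevals.visualization.charts import ChartGenerator

generator = ChartGenerator("output/charts")

heatmap_path = generator.create_performance_heatmap(results)
comparison_path = generator.create_model_comparison_chart(results, task_name="mmlu")
efficiency_paths = generator.create_efficiency_comparison_chart(results)

print(heatmap_path, comparison_path, efficiency_paths)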

Exceptions

Exception                 Description
ModelLoadingError         Model loading failed
TaskInitializationError   Task initialization failed
BenchmarkExecutionError   Benchmark execution failed
ConfigurationError        Invalid configuration
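
A handling sketch, reusing the benchmark object from the Core Modules example; the import path is hypothetical, since this reference does not state where the exception classes are defined:

# Hypothetical import path; adjust to wherever these exceptions live in your installation.
from openevals.exceptions import BenchmarkExecutionError, ModelLoadingError

try:
    benchmark.load_models()
    results = benchmark.run_benchmarks()
except ModelLoadingError as exc:
    print(f"Model failed to load: {exc}")
except BenchmarkExecutionError as exc:
    print(f"Benchmark run failed: {exc}")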