API Reference
=============

This section provides detailed documentation for all OpenEvals classes and methods.

Core Modules
------------

Benchmark
^^^^^^^^^

Main orchestration class for running benchmarks.

.. code-block:: python

   from openevals.core.benchmark import Benchmark

   benchmark = Benchmark("config.yaml")

**Constructor**

.. code-block:: python

   Benchmark(config_path: str)

.. list-table::
   :header-rows: 1
   :widths: 25 25 50

   * - Parameter
     - Type
     - Description
   * - config_path
     - str
     - Path to YAML configuration file

**Raises:**

- ``FileNotFoundError`` - Config file does not exist
- ``yaml.YAMLError`` - Invalid YAML syntax

ModelWrapper
^^^^^^^^^^^^

Unified interface for language models.

.. code-block:: python

   from openevals.core.model_loader import ModelWrapper

   wrapper = ModelWrapper("model-name", model, tokenizer)

Main Classes
------------

Benchmark Methods
^^^^^^^^^^^^^^^^^

load_models
"""""""""""

.. code-block:: python

   load_models(model_names: Optional[List[str]] = None) -> None

Load specified models or all models in config.

.. list-table::
   :header-rows: 1
   :widths: 25 25 50

   * - Parameter
     - Type
     - Description
   * - model_names
     - Optional[List[str]]
     - Models to load. If None, loads all.

load_tasks
""""""""""

.. code-block:: python

   load_tasks(task_names: Optional[List[str]] = None) -> None

Load specified tasks or all tasks in config.

.. list-table::
   :header-rows: 1
   :widths: 25 25 50

   * - Parameter
     - Type
     - Description
   * - task_names
     - Optional[List[str]]
     - Tasks to load. If None, loads all.

run_benchmarks
""""""""""""""

.. code-block:: python

   run_benchmarks() -> Dict[str, Dict[str, Any]]

Run all loaded benchmarks for all loaded models.

**Returns:** Nested dictionary with results per model per task.

.. code-block:: python

   {
       "gemma-2b": {
           "mmlu": {
               "overall": {"accuracy": 0.65, "total": 1000},
               "subjects": {"mathematics": {"accuracy": 0.58}}
           }
       }
   }

save_results
""""""""""""

.. code-block:: python

   save_results(output_path: Optional[str] = None) -> str

Save results to disk.

.. list-table::
   :header-rows: 1
   :widths: 25 25 50

   * - Parameter
     - Type
     - Description
   * - output_path
     - Optional[str]
     - Path to save. If None, generates a timestamp-based path.

**Returns:** Path where results were saved.

Evaluation Components
---------------------

BenchmarkTask Base Interface
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All tasks implement this interface:

.. code-block:: python

   class BenchmarkTask:
       def __init__(self, config: Dict[str, Any]):
           """Initialize with configuration."""
           pass

       def load_data(self) -> Any:
           """Load benchmark dataset."""
           pass

       def evaluate(self, model: ModelWrapper) -> Dict[str, Any]:
           """Evaluate model on this task."""
           pass

MMLUBenchmark
^^^^^^^^^^^^^

.. code-block:: python

   from openevals.tasks.mmlu import MMLUBenchmark

   config = {"subset": "mathematics", "shot_count": 5}
   benchmark = MMLUBenchmark(config)
   results = benchmark.evaluate(model_wrapper)

**Configuration:**

.. list-table::
   :header-rows: 1
   :widths: 25 25 50

   * - Option
     - Type
     - Description
   * - subset
     - str
     - Subject subset: "all", "mathematics", etc.
   * - shot_count
     - int
     - Few-shot examples (0-10)

**Returns:**

.. code-block:: python

   {
       "overall": {"correct": 650, "total": 1000, "accuracy": 0.65},
       "subjects": {
           "algebra": {"correct": 45, "total": 50, "accuracy": 0.90}
       }
   }

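To make the returned structure concrete, here is a small usage sketch. It operates on the example values shown above (illustrative numbers only) and uses ``calculate_confidence_interval`` from the metrics utilities documented later in this section:

.. code-block:: python

   from openevals.utils.metrics import calculate_confidence_interval

   # Example results in the shape returned by MMLUBenchmark.evaluate()
   # (values copied from the illustration above).
   results = {
       "overall": {"correct": 650, "total": 1000, "accuracy": 0.65},
       "subjects": {
           "algebra": {"correct": 45, "total": 50, "accuracy": 0.90},
       },
   }

   # Attach a 95% confidence interval to each per-subject accuracy.
   for subject, stats in results["subjects"].items():
       low, high = calculate_confidence_interval(
           accuracy=stats["accuracy"],
           n_samples=stats["total"],
       )
       print(f"{subject}: {stats['accuracy']:.2%} (95% CI: {low:.2%}-{high:.2%})")
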
Gsm8kBenchmark
^^^^^^^^^^^^^^

.. code-block:: python

   from openevals.tasks.gsm8k import Gsm8kBenchmark

   config = {"shot_count": 5, "use_chain_of_thought": True}
   benchmark = Gsm8kBenchmark(config)

**Configuration:**

.. list-table::
   :header-rows: 1
   :widths: 25 25 50

   * - Option
     - Type
     - Description
   * - shot_count
     - int
     - Few-shot examples
   * - use_chain_of_thought
     - bool
     - Enable CoT prompting

HumanevalBenchmark
^^^^^^^^^^^^^^^^^^

.. code-block:: python

   from openevals.tasks.humaneval import HumanevalBenchmark

   config = {"timeout": 10, "temperature": 0.2}
   benchmark = HumanevalBenchmark(config)

**Configuration:**

.. list-table::
   :header-rows: 1
   :widths: 25 25 50

   * - Option
     - Type
     - Description
   * - timeout
     - int
     - Execution timeout (seconds)
   * - temperature
     - float
     - Sampling temperature
   * - max_new_tokens
     - int
     - Maximum tokens

Utilities
---------

Metrics Functions
^^^^^^^^^^^^^^^^^

.. code-block:: python

   from openevals.utils.metrics import (
       calculate_accuracy,
       calculate_pass_at_k,
       calculate_confidence_interval,
       aggregate_results
   )

calculate_accuracy
""""""""""""""""""

.. code-block:: python

   calculate_accuracy(correct: int, total: int) -> float

calculate_pass_at_k
"""""""""""""""""""

.. code-block:: python

   calculate_pass_at_k(n_samples: int, n_correct: int, k: int) -> float

calculate_confidence_interval
"""""""""""""""""""""""""""""

.. code-block:: python

   calculate_confidence_interval(
       accuracy: float,
       n_samples: int,
       confidence: float = 0.95
   ) -> Tuple[float, float]

Visualization
-------------

ChartGenerator
^^^^^^^^^^^^^^

.. code-block:: python

   from openevals.visualization.charts import ChartGenerator

   generator = ChartGenerator("output/charts")

create_performance_heatmap
""""""""""""""""""""""""""

.. code-block:: python

   create_performance_heatmap(results: Dict) -> str

Generate heatmap of model x task performance.

create_model_comparison_chart
"""""""""""""""""""""""""""""

.. code-block:: python

   create_model_comparison_chart(results: Dict, task_name: str) -> str

Generate bar chart comparing models on a task.

create_efficiency_comparison_chart
""""""""""""""""""""""""""""""""""

.. code-block:: python

   create_efficiency_comparison_chart(results: Dict) -> Dict[str, str]

Generate efficiency charts (latency, memory, throughput).

Exceptions
----------

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Exception
     - Description
   * - ModelLoadingError
     - Model loading failed
   * - TaskInitializationError
     - Task initialization failed
   * - BenchmarkExecutionError
     - Benchmark execution failed
   * - ConfigurationError
     - Invalid configuration

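These exceptions can be caught around a full run. The sketch below ties together the ``Benchmark`` methods documented above; the import path ``openevals.exceptions`` is an assumption (the table above does not state where the exception classes live), so adjust it to match your installation.

.. code-block:: python

   from openevals.core.benchmark import Benchmark

   # Assumed import path for the exception classes listed above; adjust as needed.
   from openevals.exceptions import (
       BenchmarkExecutionError,
       ConfigurationError,
       ModelLoadingError,
   )

   try:
       benchmark = Benchmark("config.yaml")
       benchmark.load_models()          # None -> load every model in the config
       benchmark.load_tasks(["mmlu"])   # illustrative task name; None loads all tasks
       results = benchmark.run_benchmarks()
       output_path = benchmark.save_results()
       print(f"Results written to {output_path}")
   except (ConfigurationError, ModelLoadingError) as exc:
       print(f"Setup failed: {exc}")
   except BenchmarkExecutionError as exc:
       print(f"Benchmark run failed: {exc}")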