OpenEvals

Getting Started

  • Installation
    • Requirements
      • Hardware
      • Software
    • Basic Installation
      • Clone the Repository
      • Create Virtual Environment
      • Install Dependencies
    • Development Installation
    • Optional Dependencies
    • Configuration
      • Authentication
    • Docker Installation
    • Troubleshooting
      • Common Issues
  • Quick Start
    • Prerequisites
    • Your First Benchmark
      • Download Benchmark Data
      • Run a Simple Evaluation
      • View Results
    • Using the Python API
    • Configuration Example
    • Using the Web Interface
    • Next Steps
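
  As a preview of the Quick Start pages above, the sketch below chains the
  Benchmark methods listed under the API Reference (load_models, load_tasks,
  run_benchmarks, save_results). The import path, the model and task
  identifiers, and all argument shapes are illustrative assumptions, not the
  documented signatures.

      # Hypothetical end-to-end run; method names match the API Reference,
      # but the import path, identifiers, and arguments are assumptions.
      from openevals import Benchmark

      bench = Benchmark()
      bench.load_models(["gpt2"])         # model identifier assumed
      bench.load_tasks(["mmlu"])          # task name assumed
      results = bench.run_benchmarks()
      bench.save_results("results.json")  # output path assumed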

Core Functionality

  • Data Loading and Processing
    • Supported Datasets
    • Downloading Data
    • Data Validation
    • Dataset Statistics
    • Custom Data Loading
    • Data Preprocessing
      • Tokenization
      • Prompt Formatting
    • Caching
  • Evaluation Metrics
    • Evaluation Framework
    • Standard Metrics
      • Accuracy
      • Exact Match
      • Pass@k
    • Safety Metrics
      • TruthfulQA Scoring
    • Custom Scoring Components
    • Statistical Analysis
      • Confidence Intervals
      • Significance Testing
    • Aggregating Results
    • Efficiency Metrics
      • Latency
      • Throughput
      • Memory Utilization
    • Per-Task Metrics
  • Public Leaderboard
    • Overview
    • Using the CLI
      • View Current Leaderboard
      • Submit Results
      • Filter by Model Family
    • Input Formats
      • YAML Format
      • JSON Format
    • Customization
      • Custom Ranking
      • Filtering Options
    • Deployment
      • Web Interface
      • API Endpoints
    • Export Options
      • Export for Publications
      • Example LaTeX Output
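
  The Pass@k entries above refer to a standard code-generation metric. Below
  is a minimal sketch of the usual unbiased estimator (Chen et al., 2021);
  whether calculate_pass_at_k implements exactly this formulation is an
  assumption, so consult the Evaluation Metrics page for the authoritative
  definition.

      # Unbiased pass@k: the chance that at least one of k sampled
      # completions is correct, given c correct out of n generated.
      from math import comb

      def pass_at_k(n: int, c: int, k: int) -> float:
          if n - c < k:  # every size-k sample must contain a pass
              return 1.0
          return 1.0 - comb(n - c, k) / comb(n, k)

      print(pass_at_k(200, 30, 10))  # e.g. 200 samples, 30 correct, k=10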

API Reference

  • API Reference
    • Core Modules
      • Benchmark
      • ModelWrapper
    • Main Classes
      • Benchmark Methods
        • load_models
        • load_tasks
        • run_benchmarks
        • save_results
    • Evaluation Components
      • BenchmarkTask Base Interface
      • MMLUBenchmark
      • Gsm8kBenchmark
      • HumanevalBenchmark
    • Utilities
      • Metrics Functions
        • calculate_accuracy
        • calculate_pass_at_k
        • calculate_confidence_interval
    • Visualization
      • ChartGenerator
        • create_performance_heatmap
        • create_model_comparison_chart
        • create_efficiency_comparison_chart
    • Exceptions
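
  For the Confidence Intervals and calculate_confidence_interval entries
  above, the sketch below shows one common construction, a
  normal-approximation interval around an accuracy score. The library may use
  a different method (for example, bootstrap resampling), so treat this
  purely as illustrative background.

      # Normal-approximation 95% CI for an accuracy of correct/total;
      # illustrative only, not necessarily what the library implements.
      from math import sqrt

      def accuracy_ci(correct: int, total: int, z: float = 1.96):
          p = correct / total
          half = z * sqrt(p * (1.0 - p) / total)
          return max(0.0, p - half), min(1.0, p + half)

      print(accuracy_ci(412, 500))  # -> roughly (0.79, 0.86)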