OpenEvals
Getting Started
Installation
Requirements
Hardware
Software
Basic Installation
Clone the Repository
Create Virtual Environment
Install Dependencies
Development Installation
Optional Dependencies
Configuration
Authentication
Docker Installation
Troubleshooting
Common Issues
Quick Start
Prerequisites
Your First Benchmark
Download Benchmark Data
Run a Simple Evaluation
View Results
Using the Python API
Configuration Example
Using the Web Interface
Next Steps
Core Functionality
Data Loading and Processing
Supported Datasets
Downloading Data
Data Validation
Dataset Statistics
Custom Data Loading
Data Preprocessing
Tokenization
Prompt Formatting
Caching
Evaluation Metrics
Evaluation Framework
Standard Metrics
Accuracy
Exact Match
Pass@k
Safety Metrics
TruthfulQA Scoring
Custom Scoring Components
Statistical Analysis
Confidence Intervals
Significance Testing
Aggregating Results
Efficiency Metrics
Latency
Throughput
Memory Utilization
Per-Task Metrics
Public Leaderboard
Overview
Using the CLI
View Current Leaderboard
Submit Results
Filter by Model Family
Input Formats
YAML Format
JSON Format
Customization
Custom Ranking
Filtering Options
Deployment
Web Interface
API Endpoints
Export Options
Export for Publications
Example LaTeX Output
API Reference
Core Modules
Benchmark
ModelWrapper
Main Classes
Benchmark Methods
load_models
load_tasks
run_benchmarks
save_results
Evaluation Components
BenchmarkTask Base Interface
MMLUBenchmark
Gsm8kBenchmark
HumanevalBenchmark
Utilities
Metrics Functions
calculate_accuracy
calculate_pass_at_k
calculate_confidence_interval
Visualization
ChartGenerator
create_performance_heatmap
create_model_comparison_chart
create_efficiency_comparison_chart
Exceptions
Index