# Public Leaderboard
MedExplain-Evals includes a comprehensive leaderboard system that generates static HTML pages displaying model performance across different audiences and complexity levels.
## Overview
The leaderboard system processes evaluation results from multiple models and creates an interactive HTML dashboard featuring:

- **Overall Model Rankings**: Comprehensive performance comparison
- **Audience-Specific Performance**: Breakdown by physician, nurse, patient, and caregiver
- **Complexity-Level Analysis**: Performance across basic, intermediate, and advanced content
- **Interactive Visualizations**: Charts and graphs for performance analysis
## Quick Start
Generate a leaderboard from evaluation results:
```bash
# Basic usage
python -m src.leaderboard --input results/ --output docs/index.html

# With custom options
python -m src.leaderboard \
    --input evaluation_results/ \
    --output leaderboard.html \
    --title "Custom MedExplain-Evals Results" \
    --verbose
```
## Command Line Interface
| Option | Description |
|---|---|
| `--input` | Directory containing JSON evaluation result files (required) |
| `--output` | Output path for HTML leaderboard (default: `docs/index.html`) |
| `--title` | Custom title for the leaderboard page |
| `--verbose` | Enable verbose logging during generation |
## Input Data Format
The leaderboard expects JSON files containing evaluation results in the following format:
```json
{
  "model_name": "GPT-4",
  "total_items": 1000,
  "audience_scores": {
    "physician": [0.85, 0.90, 0.88, ...],
    "nurse": [0.82, 0.85, 0.83, ...],
    "patient": [0.75, 0.80, 0.78, ...],
    "caregiver": [0.80, 0.82, 0.81, ...]
  },
  "complexity_scores": {
    "basic": [0.85, 0.90, ...],
    "intermediate": [0.80, 0.85, ...],
    "advanced": [0.75, 0.80, ...]
  },
  "detailed_results": [...],
  "summary": {
    "overall_mean": 0.82,
    "physician_mean": 0.88,
    "nurse_mean": 0.83,
    "patient_mean": 0.78,
    "caregiver_mean": 0.81
  }
}
```
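For reference, here is a minimal sketch that writes a file in this shape. The model name, scores, and paths are placeholders chosen for illustration; only the field layout follows the schema above.

```python
import json
from pathlib import Path

# Hypothetical result for two evaluated items; all values are placeholders.
result = {
    "model_name": "my-model",
    "total_items": 2,
    "audience_scores": {
        "physician": [0.85, 0.90],
        "nurse": [0.82, 0.85],
        "patient": [0.75, 0.80],
        "caregiver": [0.80, 0.82],
    },
    "complexity_scores": {
        "basic": [0.85, 0.90],
        "intermediate": [0.80, 0.85],
        "advanced": [0.75, 0.80],
    },
    "detailed_results": [],
    "summary": {
        "overall_mean": 0.824,  # mean of the eight audience scores above
        "physician_mean": 0.875,
        "nurse_mean": 0.835,
        "patient_mean": 0.775,
        "caregiver_mean": 0.81,
    },
}

# Write the file where the leaderboard generator will look for it.
Path("results").mkdir(exist_ok=True)
with open("results/my-model_evaluation.json", "w") as f:
    json.dump(result, f, indent=2)
```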
## Programmatic Usage
### Basic Leaderboard Generation
```python
from src.leaderboard import LeaderboardGenerator
from pathlib import Path

# Initialize generator
generator = LeaderboardGenerator()

# Load results from directory
results_dir = Path("evaluation_results/")
generator.load_results(results_dir)

# Generate HTML leaderboard
output_path = Path("docs/leaderboard.html")
generator.generate_html(output_path)
```
### Advanced Usage
```python
# Get leaderboard statistics
stats = generator.calculate_leaderboard_stats()
print(f"Total models: {stats['total_models']}")
print(f"Total evaluations: {stats['total_evaluations']}")
print(f"Best score: {stats['best_score']:.3f}")

# Get model rankings
ranked_models = generator.rank_models()
for model in ranked_models[:3]:  # Top 3
    print(f"{model['rank']}. {model['model_name']}: {model['overall_score']:.3f}")

# Get audience-specific breakdowns
audience_breakdown = generator.generate_audience_breakdown(ranked_models)
for audience, models in audience_breakdown.items():
    print(f"\n{audience.title()} Rankings:")
    for model in models[:3]:
        print(f"  {model['rank']}. {model['model_name']}: {model['score']:.3f}")
```
## Leaderboard Features
### Overall Rankings
The main leaderboard table displays:

- Model rankings by overall performance
- Total items evaluated per model
- Audience-specific average scores
- Interactive sorting and filtering
**Ranking Highlights:**

- 🥇 **1st Place**: Gold highlighting with special styling
- 🥈 **2nd Place**: Silver highlighting
- 🥉 **3rd Place**: Bronze highlighting
### Audience-Specific Analysis
Dedicated sections for each audience show:

- Rankings specific to that audience type
- Performance differences across audiences
- Top performers for each audience group
**Audience Categories:**

- **Physician**: Technical medical explanations
- **Nurse**: Clinical care and monitoring focus
- **Patient**: Simple, empathetic communication
- **Caregiver**: Practical instructions and warnings
### Complexity-Level Breakdown
Analysis by content difficulty:

- **Basic**: Elementary/middle school reading level
- **Intermediate**: High school reading level
- **Advanced**: College/professional reading level
This helps identify models that excel at different complexity levels.
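As a quick sketch of this kind of analysis (assuming the result-file schema shown under "Input Data Format"), per-level means can be computed directly from `complexity_scores`:

```python
import json
from pathlib import Path

# Average each complexity level for every result file in results/.
for path in Path("results").glob("*.json"):
    data = json.loads(path.read_text())
    means = {
        level: sum(scores) / len(scores)
        for level, scores in data["complexity_scores"].items()
        if scores  # skip empty score lists
    }
    print(data["model_name"], means)
```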
### Interactive Visualizations
The leaderboard includes Chart.js-powered visualizations:

- **Performance Comparison Chart**: Bar chart showing overall scores for top models
- **Audience Performance Radar**: Radar chart displaying average performance across all audiences
```javascript
// Example chart configuration (automatically generated)
{
  type: 'bar',
  data: {
    labels: ['GPT-4', 'Claude-3', 'PaLM-2', ...],
    datasets: [{
      label: 'Overall Score',
      data: [0.85, 0.82, 0.79, ...],
      backgroundColor: 'rgba(59, 130, 246, 0.8)'
    }]
  }
}
```
### Responsive Design
The leaderboard is fully responsive and works on:

- **Desktop**: Full feature set with side-by-side comparisons
- **Tablet**: Optimized layout with collapsible sections
- **Mobile**: Touch-friendly interface with stacked content
CSS Grid and Flexbox ensure optimal viewing across all devices.
## Customization
### Styling
Customize the leaderboard appearance by subclassing the generator and overriding its CSS:
```python
class CustomLeaderboardGenerator(LeaderboardGenerator):
    def _get_css_styles(self):
        # Override with custom styles; return a complete CSS string
        return "body { background: #f8fafc; font-family: 'Inter', sans-serif; }"
```
### Color Schemes
The default color scheme uses:

- **Primary**: Blue (`#3b82f6`) for highlights and buttons
- **Success**: Green (`#059669`) for scores and positive indicators
- **Warning**: Gold (`#ffd700`) for first-place highlighting
- **Neutral**: Gray scale for general content
### Branding
Customize titles, logos, and contact information:
```python
# Modify the HTML template generation
def _generate_html_template(self, ...):
    return f"""
    <header>
      <h1>🏆 {custom_title}</h1>
      <img src="your_logo.png" alt="Logo">
    </header>
    ...
    """
```
## Performance Optimization
### Large Dataset Handling
For leaderboards with many models:
```python
# Pagination for large leaderboards
def generate_paginated_leaderboard(models, page_size=50):
    pages = [models[i:i + page_size] for i in range(0, len(models), page_size)]
    return pages

# Top-N filtering
top_models = ranked_models[:20]  # Show only top 20
```
### Caching
Cache expensive calculations:
```python
import functools

class CachedLeaderboardGenerator(LeaderboardGenerator):
    @functools.lru_cache(maxsize=100)
    def cached_statistics_calculation(self, data_hash):
        # data_hash serves only as the cache key; calls with an
        # identical hash return the previously computed statistics.
        return self.calculate_leaderboard_stats()
```
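The example above keys the cache on a `data_hash` argument, but how that hash is produced is up to the caller. One possible sketch (not part of the library API) fingerprints the loaded results via a stable JSON serialization:

```python
import hashlib
import json

def results_hash(results):
    """Stable fingerprint of loaded results, suitable as a cache key."""
    canonical = json.dumps(results, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()
```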
### CDN Assets
For better performance, load external assets from CDN:
```html
<!-- Chart.js from CDN -->
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>

<!-- Custom fonts -->
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap" rel="stylesheet">
```
## Deployment
### Static Hosting
The generated HTML is self-contained (aside from the optional CDN assets above) and can be hosted on:

- **GitHub Pages**: Perfect for open source projects
- **Netlify**: Easy deployment with automatic builds
- **AWS S3**: Scalable static hosting
- **Apache/Nginx**: Traditional web servers
### GitHub Pages Example
```yaml
# .github/workflows/leaderboard.yml
name: Update Leaderboard

on:
  schedule:
    - cron: '0 0 * * *'  # Daily updates
  workflow_dispatch:

jobs:
  update-leaderboard:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Setup Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Generate leaderboard
        run: python -m src.leaderboard --input results/ --output docs/index.html
      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./docs
```
### Automated Updates
Set up automated leaderboard updates:
```bash
#!/bin/bash
# update_leaderboard.sh

# Download latest results
rsync -av results_server:/path/to/results/ ./results/

# Regenerate leaderboard
python -m src.leaderboard \
    --input results/ \
    --output docs/index.html \
    --verbose

# Deploy to hosting
aws s3 sync docs/ s3://your-bucket/ --delete
```
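To run this script on a schedule, a crontab entry can invoke it nightly; the script and log paths below are illustrative:

```bash
# Run nightly at 02:00; adjust paths to your installation.
0 2 * * * /opt/medexplain/update_leaderboard.sh >> /var/log/leaderboard.log 2>&1
```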
## SEO and Analytics
### Search Engine Optimization
The generated HTML includes SEO-friendly features:
```html
<head>
  <title>MedExplain-Evals Leaderboard - Medical LLM Evaluation Results</title>
  <meta name="description" content="Comprehensive evaluation results for medical language models on audience-adaptive explanation quality.">
  <meta name="keywords" content="medical AI, language models, evaluation, leaderboard">
  <meta property="og:title" content="MedExplain-Evals Leaderboard">
  <meta property="og:description" content="Medical LLM evaluation results">
</head>
```
### Analytics Integration
Add analytics tracking:
```python
def add_analytics_tracking(self, html_content, tracking_id):
    analytics_code = f"""
    <!-- Google Analytics -->
    <script async src="https://www.googletagmanager.com/gtag/js?id={tracking_id}"></script>
    <script>
      window.dataLayer = window.dataLayer || [];
      function gtag(){{dataLayer.push(arguments);}}
      gtag('js', new Date());
      gtag('config', '{tracking_id}');
    </script>
    """
    return html_content.replace('</head>', analytics_code + '</head>')
```
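Assuming the method is added to a `LeaderboardGenerator` subclass, wiring it in might look like the sketch below; the tracking ID is a placeholder.

```python
from pathlib import Path

html_path = Path("docs/index.html")
html = html_path.read_text()
# "G-XXXXXXXXXX" stands in for a real Google Analytics measurement ID.
html_path.write_text(generator.add_analytics_tracking(html, "G-XXXXXXXXXX"))
```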
## Examples
### Complete Evaluation to Leaderboard Pipeline
```python
from src.benchmark import MedExplain
from src.leaderboard import LeaderboardGenerator
from pathlib import Path

# 1. Run evaluations for multiple models
models = ['gpt-4', 'claude-3-opus', 'llama-2-70b']
bench = MedExplain()

for model_name in models:
    model_func = get_model_function(model_name)  # Your model interface
    results = bench.evaluate_model(model_func, max_items=1000)

    # Save individual results
    output_path = f"results/{model_name}_evaluation.json"
    bench.save_results(results, output_path)

# 2. Generate leaderboard from all results
generator = LeaderboardGenerator()
generator.load_results(Path("results/"))
generator.generate_html(Path("docs/index.html"))

print("✅ Leaderboard generated successfully!")
```
### Multi-Language Support
```python
class MultiLanguageLeaderboard(LeaderboardGenerator):
    def __init__(self, language='en'):
        super().__init__()
        self.language = language
        self.translations = self._load_translations()

    def _load_translations(self):
        # Load language-specific strings
        return {
            'en': {'title': 'MedExplain-Evals Leaderboard'},  # ... other strings
            'es': {'title': 'Tabla de Clasificación MedExplain-Evals'},  # ... other strings
            # ... other languages
        }
```
## Best Practices
- **Regular Updates**: Regenerate leaderboards as new results become available
- **Data Validation**: Validate result files before generating leaderboards:

  ```python
  import json

  def validate_results_directory(results_dir):
      required_fields = ['model_name', 'total_items', 'audience_scores', 'summary']
      for file_path in results_dir.glob("*.json"):
          with open(file_path) as f:
              data = json.load(f)
          assert all(field in data for field in required_fields)
  ```

- **Version Control**: Track leaderboard versions and source data
- **Accessibility**: Ensure leaderboards are accessible to users with disabilities
- **Mobile Testing**: Test leaderboard display across different screen sizes
- **Performance Monitoring**: Monitor page load times and optimize as needed