Public Leaderboard
==================
MedExplain-Evals includes a leaderboard system that generates static HTML pages showing model performance across audiences and complexity levels.
Overview
--------
The leaderboard system processes evaluation results from multiple models and creates an interactive HTML dashboard featuring:
* **Overall Model Rankings**: Comprehensive performance comparison
* **Audience-Specific Performance**: Breakdown by physician, nurse, patient, and caregiver
* **Complexity-Level Analysis**: Performance across basic, intermediate, and advanced content
* **Interactive Visualizations**: Charts and graphs for performance analysis
Quick Start
-----------
Generate a leaderboard from evaluation results:
.. code-block:: bash

   # Basic usage
   python -m src.leaderboard --input results/ --output docs/index.html

   # With custom options
   python -m src.leaderboard \
       --input evaluation_results/ \
       --output leaderboard.html \
       --title "Custom MedExplain-Evals Results" \
       --verbose
Command Line Interface
----------------------
.. list-table:: Leaderboard CLI Options
   :header-rows: 1
   :widths: 30 70

   * - Option
     - Description
   * - ``--input PATH``
     - Directory containing JSON evaluation result files (required)
   * - ``--output PATH``
     - Output path for HTML leaderboard (default: ``docs/index.html``)
   * - ``--title TEXT``
     - Custom title for the leaderboard page
   * - ``--verbose``
     - Enable verbose logging during generation
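
The options in the table above could be declared with ``argparse`` roughly as follows. This is a minimal sketch, not the actual ``src.leaderboard`` parser: the function name ``build_parser`` and the ``None`` default for ``--title`` are assumptions made for illustration.

.. code-block:: python

   import argparse
   from pathlib import Path


   def build_parser() -> argparse.ArgumentParser:
       """Hypothetical parser mirroring the documented CLI options."""
       parser = argparse.ArgumentParser(prog="leaderboard")
       parser.add_argument("--input", type=Path, required=True,
                           help="Directory containing JSON evaluation result files")
       parser.add_argument("--output", type=Path, default=Path("docs/index.html"),
                           help="Output path for the HTML leaderboard")
       # Assumption: no title override unless the user passes one.
       parser.add_argument("--title", default=None,
                           help="Custom title for the leaderboard page")
       parser.add_argument("--verbose", action="store_true",
                           help="Enable verbose logging during generation")
       return parser


   # Parse an explicit argument list instead of sys.argv for demonstration.
   args = build_parser().parse_args(["--input", "results/"])
   print(args.input, args.output, args.verbose)

Only ``--input`` is required here; the other options fall back to the defaults listed in the table.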
Input Data Format
-----------------
The leaderboard expects JSON files containing evaluation results in the following format:
.. code-block:: json

   {
     "model_name": "GPT-4",
     "total_items": 1000,
     "audience_scores": {
       "physician": [0.85, 0.90, 0.88, ...],
       "nurse": [0.82, 0.85, 0.83, ...],
       "patient": [0.75, 0.80, 0.78, ...],
       "caregiver": [0.80, 0.82, 0.81, ...]
     },
     "complexity_scores": {
       "basic": [0.85, 0.90, ...],
       "intermediate": [0.80, 0.85, ...],
       "advanced": [0.75, 0.80, ...]
     },
     "detailed_results": [...],
     "summary": {
       "overall_mean": 0.82,
       "physician_mean": 0.88,
       "nurse_mean": 0.83,
       "patient_mean": 0.78,
       "caregiver_mean": 0.81
     }
   }
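
Before generating a leaderboard, it can be useful to check that a result file has the expected shape. The sketch below recomputes the ``summary`` fields from ``audience_scores``. The helper name ``summarize`` is hypothetical, and treating ``overall_mean`` as the unweighted mean of the four audience means is an assumption, not necessarily how MedExplain-Evals computes it.

.. code-block:: python

   from statistics import mean

   AUDIENCES = ("physician", "nurse", "patient", "caregiver")


   def summarize(result: dict) -> dict:
       """Recompute per-audience means and an overall mean from raw scores."""
       scores = result["audience_scores"]
       missing = [a for a in AUDIENCES if a not in scores]
       if missing:
           raise ValueError(f"missing audience scores: {missing}")
       summary = {f"{a}_mean": mean(scores[a]) for a in AUDIENCES}
       # Assumption: overall mean = unweighted mean of the audience means.
       summary["overall_mean"] = mean(summary.values())
       return summary


   # Hypothetical, shortened result record for illustration only.
   example = {
       "model_name": "example-model",
       "audience_scores": {
           "physician": [0.8, 1.0],
           "nurse": [0.7, 0.9],
           "patient": [0.6, 0.8],
           "caregiver": [0.7, 0.9],
       },
   }

   summary = summarize(example)
   print(summary)

A record that lacks one of the four audiences raises ``ValueError`` rather than silently producing a partial summary.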
Programmatic Usage
------------------
Basic Leaderboard Generation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: python

   from pathlib import Path

   from src.leaderboard import LeaderboardGenerator

   # Initialize generator
   generator = LeaderboardGenerator()

   # Load results from directory
   results_dir = Path("evaluation_results/")
   generator.load_results(results_dir)

   # Generate HTML leaderboard
   output_path = Path("docs/leaderboard.html")
   generator.generate_html(output_path)
Advanced Usage
~~~~~~~~~~~~~~
.. code-block:: python

   # Continues from the basic example: `generator` has already loaded results

   # Get leaderboard statistics
   stats = generator.calculate_leaderboard_stats()
   print(f"Total models: {stats['total_models']}")
   print(f"Total evaluations: {stats['total_evaluations']}")
   print(f"Best score: {stats['best_score']:.3f}")

   # Get model rankings
   ranked_models = generator.rank_models()
   for model in ranked_models[:3]:  # Top 3
       print(f"{model['rank']}. {model['model_name']}: {model['overall_score']:.3f}")

   # Get audience-specific breakdowns
   audience_breakdown = generator.generate_audience_breakdown(ranked_models)
   for audience, models in audience_breakdown.items():
       print(f"\n{audience.title()} Rankings:")
       for model in models[:3]:
           print(f"  {model['rank']}. {model['model_name']}: {model['score']:.3f}")
Leaderboard Features
--------------------
Overall Rankings
~~~~~~~~~~~~~~~~
The main leaderboard table displays:
* Model rankings by overall performance
* Total items evaluated per model
* Audience-specific average scores
* Interactive sorting and filtering
.. image:: _static/leaderboard_overall.png
   :alt: Overall Rankings Table
   :width: 800px
**Ranking Highlights:**
* 🥇 **1st Place**: Gold highlighting with special styling
* 🥈 **2nd Place**: Silver highlighting
* 🥉 **3rd Place**: Bronze highlighting
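
The ranking and medal assignment described above can be sketched as a sort on the overall score. This is an illustration only: ``rank_by_overall`` is a hypothetical helper rather than the library's ``rank_models`` method, and the model names and scores are made up.

.. code-block:: python

   MEDALS = {1: "🥇", 2: "🥈", 3: "🥉"}


   def rank_by_overall(results: list[dict]) -> list[dict]:
       """Sort by summary['overall_mean'] (descending) and attach ranks."""
       ordered = sorted(
           results,
           key=lambda r: r["summary"]["overall_mean"],
           reverse=True,
       )
       return [
           {
               "rank": i,
               "model_name": r["model_name"],
               "overall_score": r["summary"]["overall_mean"],
               # Only the top three positions receive a medal.
               "medal": MEDALS.get(i, ""),
           }
           for i, r in enumerate(ordered, start=1)
       ]


   # Hypothetical results for illustration only.
   results = [
       {"model_name": "model-a", "summary": {"overall_mean": 0.79}},
       {"model_name": "model-b", "summary": {"overall_mean": 0.85}},
       {"model_name": "model-c", "summary": {"overall_mean": 0.82}},
       {"model_name": "model-d", "summary": {"overall_mean": 0.70}},
   ]

   ranked = rank_by_overall(results)
   for row in ranked:
       print(f"{row['medal']} {row['rank']}. {row['model_name']}: {row['overall_score']:.3f}")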
Audience-Specific Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~
Dedicated sections for each audience showing:
* Rankings specific to that audience type
* Performance differences across audiences
* Top performers for each audience group
**Audience Categories:**
* **Physician**: Technical medical explanations
* **Nurse**: Clinical care and monitoring focus
* **Patient**: Simple, empathetic communication
* **Caregiver**: Practical instructions and warnings
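
A per-audience ranking can be derived directly from the ``audience_scores`` lists in each result file. The sketch below is one possible approach, assuming each model is scored by the plain mean of its per-audience scores; ``audience_rankings`` is a hypothetical name and the data is invented.

.. code-block:: python

   from statistics import mean


   def audience_rankings(results: list[dict]) -> dict:
       """Rank models separately for every audience found in the results."""
       audiences = sorted({a for r in results for a in r["audience_scores"]})
       breakdown = {}
       for audience in audiences:
           scored = [
               (r["model_name"], mean(r["audience_scores"][audience]))
               for r in results
               if audience in r["audience_scores"]
           ]
           scored.sort(key=lambda pair: pair[1], reverse=True)
           breakdown[audience] = [
               {"rank": i, "model_name": name, "score": score}
               for i, (name, score) in enumerate(scored, start=1)
           ]
       return breakdown


   # Hypothetical results for illustration only.
   results = [
       {"model_name": "model-a",
        "audience_scores": {"physician": [0.9, 0.9], "patient": [0.6, 0.8]}},
       {"model_name": "model-b",
        "audience_scores": {"physician": [0.8, 0.8], "patient": [0.8, 0.9]}},
   ]

   breakdown = audience_rankings(results)
   for audience, rows in breakdown.items():
       print(audience, [row["model_name"] for row in rows])

In this made-up data the two models swap places between the physician and patient rankings, which is the kind of difference the audience sections are meant to surface.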
Complexity-Level Breakdown
~~~~~~~~~~~~~~~~~~~~~~~~~~
Analysis by content difficulty:
* **Basic**: Elementary/middle school reading level
* **Intermediate**: High school reading level
* **Advanced**: College/professional reading level
This helps identify models that excel at different complexity levels.
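
As a rough illustration of that idea, the following sketch picks the best-scoring model at each complexity level from the ``complexity_scores`` lists. The helper ``best_per_level`` and the sample data are hypothetical, and averaging with a plain mean is an assumption.

.. code-block:: python

   from statistics import mean

   LEVELS = ("basic", "intermediate", "advanced")


   def best_per_level(results: list[dict]) -> dict:
       """Return {level: (model_name, mean_score)} for the top model per level."""
       best = {}
       for level in LEVELS:
           candidates = [
               (mean(r["complexity_scores"][level]), r["model_name"])
               for r in results
               # Skip models with no (or empty) scores for this level.
               if r.get("complexity_scores", {}).get(level)
           ]
           if candidates:
               score, name = max(candidates)
               best[level] = (name, score)
       return best


   # Hypothetical results for illustration only.
   results = [
       {"model_name": "model-a",
        "complexity_scores": {"basic": [0.9, 0.9],
                              "intermediate": [0.8, 0.8],
                              "advanced": [0.6, 0.7]}},
       {"model_name": "model-b",
        "complexity_scores": {"basic": [0.8, 0.8],
                              "intermediate": [0.8, 0.9],
                              "advanced": [0.7, 0.8]}},
   ]

   best = best_per_level(results)
   for level, (name, score) in best.items():
       print(f"{level}: {name} ({score:.3f})")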
Interactive Visualizations
~~~~~~~~~~~~~~~~~~~~~~~~~~
The leaderboard includes Chart.js-powered visualizations:
**Performance Comparison Chart**
   Bar chart showing overall scores for top models

**Audience Performance Radar**
   Radar chart displaying average performance across all audiences
.. code-block:: javascript

   // Example chart configuration (automatically generated)
   {
     type: 'bar',
     data: {
       labels: ['GPT-4', 'Claude-3', 'PaLM-2', ...],
       datasets: [{
         label: 'Overall Score',
         data: [0.85, 0.82, 0.79, ...],
         backgroundColor: 'rgba(59, 130, 246, 0.8)'
       }]
     }
   }
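
On the Python side, a configuration of this shape can be assembled as a dictionary and serialized with ``json.dumps`` before being embedded in the page. This is a sketch of one way to do it, not the generator's actual code; ``bar_chart_config`` and the ``top_n`` parameter are hypothetical, and the input rows follow the ``model_name`` / ``overall_score`` keys used in the ranking examples.

.. code-block:: python

   import json


   def bar_chart_config(ranked: list[dict], top_n: int = 10) -> dict:
       """Build a Chart.js bar-chart config dict for the top-ranked models."""
       top = ranked[:top_n]
       return {
           "type": "bar",
           "data": {
               "labels": [m["model_name"] for m in top],
               "datasets": [{
                   "label": "Overall Score",
                   "data": [m["overall_score"] for m in top],
                   "backgroundColor": "rgba(59, 130, 246, 0.8)",
               }],
           },
       }


   # Hypothetical ranked models for illustration only.
   ranked = [
       {"model_name": "model-b", "overall_score": 0.85},
       {"model_name": "model-c", "overall_score": 0.82},
       {"model_name": "model-a", "overall_score": 0.79},
   ]

   config_json = json.dumps(bar_chart_config(ranked, top_n=2), indent=2)
   print(config_json)

The resulting JSON string is valid JavaScript object syntax, so it can be passed as the second argument to ``new Chart(ctx, ...)`` in the generated page.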
Responsive Design
-----------------
The leaderboard is fully responsive and works on:
* **Desktop**: Full feature set with side-by-side comparisons
* **Tablet**: Optimized layout with collapsible sections
* **Mobile**: Touch-friendly interface with stacked content
The layout is built with CSS Grid and Flexbox so that it adapts to each of these screen sizes.
Customization
-------------
Styling
~~~~~~~
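Customize the leaderboard appearance by subclassing the generator and overriding its CSS:

.. code-block:: python

   from src.leaderboard import LeaderboardGenerator

   # Placeholder stylesheet; replace with your own CSS
   CUSTOM_CSS = "body { font-family: sans-serif; }"

   class CustomLeaderboardGenerator(LeaderboardGenerator):
       def _get_css_styles(self):
           # Override the default styles (assumed to be returned as a string)
           return CUSTOM_CSS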
Color Schemes
~~~~~~~~~~~~~
The default color scheme uses:
* **Primary**: Blue (#3b82f6) for highlights and buttons
* **Success**: Green (#059669) for scores and positive indicators
* **Warning**: Gold (#ffd700) for first place highlighting
* **Neutral**: Gray scale for general content
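
One way to keep a palette like this in a single place is to emit it as CSS custom properties. The sketch below is an assumption about how that might be done, not how the generator's stylesheet is actually organized: the ``--color-*`` variable names and the ``css_variables`` helper are hypothetical, and only the three hex values listed above are taken from this page.

.. code-block:: python

   # Hex values from the default color scheme; the keys are illustrative.
   PALETTE = {
       "primary": "#3b82f6",
       "success": "#059669",
       "warning": "#ffd700",
   }


   def css_variables(palette: dict) -> str:
       """Render a palette as a ``:root`` block of CSS custom properties."""
       lines = [f"  --color-{name}: {value};" for name, value in palette.items()]
       return ":root {\n" + "\n".join(lines) + "\n}"


   css = css_variables(PALETTE)
   print(css)

A string like this could be prepended to the CSS returned by an overridden ``_get_css_styles`` method, so that the rest of the stylesheet refers to ``var(--color-primary)`` instead of repeating hex codes.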
Branding
~~~~~~~~
Customize titles, logos, and contact information:
.. code-block:: python
# Modify the HTML template generation
def _generate_html_template(self, ...):
return f"""
🏆 {custom_title}