# Ensemble Judge
The ensemble judge module implements a multi-model LLM-as-Judge framework for robust, calibrated evaluation of medical explanations.
## Overview
MedExplain-Evals uses an ensemble of frontier LLMs to evaluate medical explanations, providing more reliable and calibrated scores than single-model evaluation.
**Default Ensemble Composition:**

**Primary Judges (85% weight):**
- GPT-5.2 (30%)
- Claude Opus 4.5 (30%)
- Gemini 3 Pro (25%)

**Secondary Judges (15% weight):**
- DeepSeek-V3.2 (10%)
- Qwen3-Max (5%)
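For reference, this default composition corresponds to a list of `JudgeConfig` entries (documented below). A sketch of that mapping, where the exact model and provider identifier strings are assumptions for illustration:

```python
from src import JudgeConfig

# Sketch of the default ensemble weights listed above. The identifier
# strings (model/provider) are illustrative assumptions, not the
# library's canonical values.
DEFAULT_JUDGES = [
    JudgeConfig(model="gpt-5.2", provider="openai", weight=0.30),
    JudgeConfig(model="claude-opus-4-5", provider="anthropic", weight=0.30),
    JudgeConfig(model="gemini-3-pro", provider="google", weight=0.25),
    JudgeConfig(model="deepseek-v3.2", provider="deepseek", weight=0.10),
    JudgeConfig(model="qwen3-max", provider="qwen", weight=0.05),
]
```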
## Quick Start
```python
from src import EnsembleLLMJudge, UnifiedModelClient

# Initialize
client = UnifiedModelClient()
judge = EnsembleLLMJudge(client)

# Evaluate an explanation
score = judge.evaluate(
    original_content="Patient has Type 2 DM with HbA1c 8.5%...",
    explanation="You have high blood sugar. This means...",
    audience="patient",
)

print(f"Overall: {score.overall:.2f}/5.0")
print(f"Agreement: {score.agreement_score:.2f}")
print(f"Dimensions: {score.dimensions}")
```
## Core Classes
### EnsembleLLMJudge
The main evaluation class that orchestrates multiple judge models.
```python
class EnsembleLLMJudge:
    """Multi-model LLM judge ensemble for robust evaluation.

    Uses a weighted ensemble of top reasoning models for evaluation:
    - Primary: GPT-5.2, Claude Opus 4.5, Gemini 3 Pro (85%)
    - Secondary: DeepSeek-V3.2, Qwen3-Max (15%)
    """

    def __init__(
        self,
        client: Optional[UnifiedModelClient] = None,
        judges: Optional[List[JudgeConfig]] = None,
        parallel: bool = True,
        min_judges: int = 2,
    ):
        """Initialize the ensemble judge.

        Args:
            client: Model client for API calls
            judges: Custom judge configurations
            parallel: Execute judges in parallel
            min_judges: Minimum number of judges required for a valid evaluation
        """

    def evaluate(
        self,
        original_content: str,
        explanation: str,
        audience: str,
        persona: Optional[AudiencePersona] = None,
    ) -> EnsembleScore:
        """Evaluate an explanation with the ensemble.

        Args:
            original_content: Original medical content
            explanation: Generated explanation to evaluate
            audience: Target audience type
            persona: Optional detailed persona configuration

        Returns:
            EnsembleScore with overall and dimension scores
        """

    def evaluate_batch(
        self,
        items: List[Dict[str, Any]],
        max_workers: int = 4,
    ) -> List[EnsembleScore]:
        """Evaluate multiple explanations in parallel."""
```
**Usage Examples:**
```python
from src import EnsembleLLMJudge, PersonaFactory

judge = EnsembleLLMJudge()

# Basic evaluation
score = judge.evaluate(
    original_content="Acute myocardial infarction...",
    explanation="You had a heart attack...",
    audience="patient",
)

# With detailed persona
persona = PersonaFactory.get_predefined_persona("patient_low_literacy")
score = judge.evaluate(
    original_content="Acute myocardial infarction...",
    explanation="Your heart muscle was hurt...",
    audience="patient",
    persona=persona,
)

# Analyze disagreement
analysis = judge.get_disagreement_analysis(score)
print(analysis)
```
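For multiple explanations, `evaluate_batch` accepts a list of item dicts. A sketch assuming each dict mirrors `evaluate()`'s keyword arguments (the exact key schema is an assumption; check the source):

```python
from src import EnsembleLLMJudge

judge = EnsembleLLMJudge()

# Assumed item schema: same keys as evaluate()'s keyword arguments.
items = [
    {
        "original_content": "Acute myocardial infarction...",
        "explanation": "You had a heart attack...",
        "audience": "patient",
    },
    {
        "original_content": "Stage 3 chronic kidney disease...",
        "explanation": "Your kidneys are not filtering as well as they should...",
        "audience": "caregiver",
    },
]

scores = judge.evaluate_batch(items, max_workers=4)
for s in scores:
    print(f"{s.audience}: {s.overall:.2f} (agreement {s.agreement_score:.2f})")
```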
### JudgeConfig
Configuration for individual judges in the ensemble.
```python
@dataclass
class JudgeConfig:
    """Configuration for an individual judge in the ensemble."""

    model: str                # Model identifier (e.g., "gpt-5.2")
    provider: str             # Provider name (e.g., "openai")
    weight: float             # Weight in ensemble (0.0-1.0)
    enabled: bool = True
    temperature: float = 0.1
    max_retries: int = 3
```
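Beyond weights, these fields let you tune or sideline individual judges; an illustrative example (the identifier strings are assumptions):

```python
from src import JudgeConfig

# Keep a secondary judge configured but excluded until re-enabled.
# Model/provider strings are illustrative assumptions.
secondary = JudgeConfig(
    model="deepseek-v3.2",
    provider="deepseek",
    weight=0.10,
    enabled=False,     # skipped by the ensemble while disabled
    temperature=0.0,   # deterministic judging
    max_retries=5,     # extra tolerance for transient API errors
)
```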
### EnsembleScore
The comprehensive evaluation result.
```python
@dataclass
class EnsembleScore:
    """Final ensemble evaluation score."""

    overall: float                                # Weighted overall score (1-5)
    dimensions: Dict[str, float]                  # Per-dimension scores
    dimension_details: Dict[str, Dict[str, Any]]  # Detailed breakdown
    judge_results: List[JudgeResult]              # Individual judge outputs
    agreement_score: float                        # Inter-judge agreement (0-1)
    confidence: float                             # Ensemble confidence (0-1)
    audience: str                                 # Target audience
    metadata: Dict[str, Any]                      # Additional metadata
```
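For intuition, here is a minimal sketch of how `overall` and `agreement_score` might be derived from per-judge outputs. This is illustrative rather than the module's actual implementation; `(score, weight)` tuples stand in for the real `JudgeResult` objects:

```python
import statistics

def aggregate(judge_scores: list[tuple[float, float]]) -> tuple[float, float]:
    """Illustrative aggregation over (score, weight) pairs.

    Returns (weighted overall on the 1-5 scale, agreement in 0-1).
    Not the module's actual implementation.
    """
    total_weight = sum(w for _, w in judge_scores)
    overall = sum(s * w for s, w in judge_scores) / total_weight

    # Simple agreement proxy: 1 minus the population stdev, normalized by
    # the maximum possible spread on a 1-5 scale (pstdev of {1, 5} is 2.0).
    spread = statistics.pstdev(s for s, _ in judge_scores)
    agreement = max(0.0, 1.0 - spread / 2.0)
    return overall, agreement
```

With the default weights, `aggregate([(4.5, 0.30), (4.0, 0.30), (4.5, 0.25), (3.5, 0.10), (4.0, 0.05)])` returns roughly `(4.2, 0.81)`.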
### DimensionScore
Score for a single evaluation dimension.
```python
@dataclass
class DimensionScore:
    """Score for a single evaluation dimension."""

    dimension: str          # Dimension name
    score: float            # Score (1-5)
    reasoning: str          # Justification text
    confidence: float = 1.0
```
## Evaluation Dimensions
The ensemble evaluates explanations across six weighted dimensions:
| Dimension | Weight | Description |
|---|---|---|
| Factual Accuracy | 25% | Clinical correctness and evidence alignment |
| Terminological Appropriateness | 15% | Language complexity matching audience needs |
| Explanatory Completeness | 20% | Comprehensive yet accessible coverage |
| Actionability | 15% | Clear, practical guidance |
| Safety | 15% | Appropriate warnings and harm avoidance |
| Empathy & Tone | 10% | Audience-appropriate communication style |
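The natural reading of these weights is a weighted mean over per-dimension scores. A minimal sketch (weights from the table; the dimension key names and the roll-up formula are assumptions):

```python
# Weights from the table above. Key names are illustrative assumptions;
# the module may use different dimension identifiers.
DIMENSION_WEIGHTS = {
    "factual_accuracy": 0.25,
    "terminological_appropriateness": 0.15,
    "explanatory_completeness": 0.20,
    "actionability": 0.15,
    "safety": 0.15,
    "empathy_tone": 0.10,
}

def overall_from_dimensions(dimensions: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores on the 1-5 scale."""
    return sum(DIMENSION_WEIGHTS[name] * score for name, score in dimensions.items())
```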
## Factory Functions
Convenience functions for constructing preconfigured judges:
```python
from src import create_single_judge, create_fast_ensemble, create_full_ensemble

# Single-model judge (for testing)
judge = create_single_judge(model="gpt-5.2")

# Fast 2-model ensemble
judge = create_fast_ensemble()

# Full 5-model ensemble (default)
judge = create_full_ensemble()
```
## EvaluationRubricBuilder
Builds G-Eval style evaluation prompts with audience-specific rubrics.
```python
from src.ensemble_judge import EvaluationRubricBuilder

builder = EvaluationRubricBuilder()

# Build evaluation prompt
messages = builder.build_evaluation_prompt(
    original_content="Medical content...",
    explanation="Explanation...",
    audience="patient",
    persona=persona,  # Optional
)
```
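Continuing the snippet above: the returned `messages` are presumably in the common chat format (a list of role/content dicts), ready to pass to a judge model. That shape is an assumption, not a documented guarantee:

```python
# Assumed shape: [{"role": ..., "content": ...}, ...]
for message in messages:
    print(f"--- {message['role']} ---")
    print(message["content"][:300])  # preview the rubric text
```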
## Custom Judge Configuration
Create a custom ensemble configuration:
```python
from src import EnsembleLLMJudge, JudgeConfig

custom_judges = [
    JudgeConfig(model="gpt-5.2", provider="openai", weight=0.5),
    JudgeConfig(model="claude-opus-4-5", provider="anthropic", weight=0.5),
]

judge = EnsembleLLMJudge(judges=custom_judges)
```