MedExplain-Evals Documentation

Welcome to MedExplain-Evals, a resource-efficient benchmark for evaluating audience-adaptive explanation quality in medical Large Language Models.

This project was developed as part of Google Summer of Code 2025 with mentorship from Google DeepMind.

Overview

MedExplain-Evals addresses a critical gap in medical AI evaluation by providing the first benchmark specifically designed to assess an LLM’s ability to generate audience-adaptive medical explanations for four key stakeholders (see the sketch after this list):

  • Physicians - Technical, evidence-based explanations

  • Nurses - Practical care implications and monitoring

  • Patients - Simple, empathetic, jargon-free language

  • Caregivers - Concrete tasks and warning signs
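As a rough illustration of how these audience profiles might be represented, here is a minimal Python sketch; the Audience enum and AudienceProfile fields are hypothetical and not part of the MedExplain-Evals API.

from dataclasses import dataclass
from enum import Enum

class Audience(Enum):
    # Hypothetical enum naming the four target audiences
    PHYSICIAN = "physician"
    NURSE = "nurse"
    PATIENT = "patient"
    CAREGIVER = "caregiver"

@dataclass(frozen=True)
class AudienceProfile:
    # Illustrative description of what each explanation should emphasize
    audience: Audience
    emphasis: str

PROFILES = {
    Audience.PHYSICIAN: AudienceProfile(Audience.PHYSICIAN, "technical, evidence-based detail"),
    Audience.NURSE: AudienceProfile(Audience.NURSE, "practical care implications and monitoring"),
    Audience.PATIENT: AudienceProfile(Audience.PATIENT, "simple, empathetic, jargon-free language"),
    Audience.CAREGIVER: AudienceProfile(Audience.CAREGIVER, "concrete tasks and warning signs"),
}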

Key Features

  • Novel evaluation framework for audience-adaptive medical explanations

  • Support for MedQA-USMLE, iCliniq, and Cochrane Reviews datasets

  • Advanced safety metrics including contradiction and hallucination detection

  • Automated complexity stratification using Flesch-Kincaid Grade Level (see the readability sketch after this list)

  • Interactive HTML leaderboards for result visualization

  • Multi-dimensional scoring using an LLM-as-a-judge paradigm

  • Optimized for open-weight models on consumer hardware
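To make the complexity stratification feature concrete: the Flesch-Kincaid Grade Level is the standard readability formula FKGL = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59. Below is a minimal Python sketch assuming a rough vowel-group syllable heuristic; it is not the project's implementation.

import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels (real tools use dictionaries or better rules)
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # FKGL = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59

# A patient-facing explanation should typically score several grade levels lower
# than a physician-facing explanation of the same content.
print(flesch_kincaid_grade("The patient exhibited tachycardia secondary to thyrotoxicosis."))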

Quick Start

Install the dependencies:

pip install -r requirements.txt

Then run a minimal evaluation in Python:

from src.benchmark import MedExplain
from src.evaluator import MedExplainEvaluator

# Initialize the benchmark
bench = MedExplain()

# Generate audience-adaptive explanations
# (`medical_content` and `model` are placeholders for your source text and chosen model)
explanations = bench.generate_explanations(medical_content, model)

# Evaluate the explanations for every target audience
evaluator = MedExplainEvaluator()
scores = evaluator.evaluate_all_audiences(explanations)

Architecture

MedExplain-Evals is built with SOLID principles:

  • Strategy Pattern for audience-specific scoring (see the sketch after this list)

  • Dependency Injection for flexible component management

  • Configuration-driven design via YAML files

  • Comprehensive logging for debugging and monitoring
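To illustrate the Strategy Pattern and dependency injection mentioned above, here is a minimal Python sketch; the class names and scoring heuristics are illustrative assumptions, not the project's actual interfaces.

from abc import ABC, abstractmethod

class ScoringStrategy(ABC):
    # Hypothetical interface for audience-specific scoring
    @abstractmethod
    def score(self, explanation: str) -> float:
        ...

class PatientScoring(ScoringStrategy):
    def score(self, explanation: str) -> float:
        # Placeholder heuristic: penalize medical jargon in patient-facing text
        jargon = {"idiopathic", "etiology", "prophylaxis"}
        penalty = sum(word.strip(".,") in jargon for word in explanation.lower().split())
        return max(0.0, 1.0 - 0.1 * penalty)

class PhysicianScoring(ScoringStrategy):
    def score(self, explanation: str) -> float:
        # Placeholder heuristic: reward evidence markers in physician-facing text
        markers = ("randomized", "meta-analysis", "guideline")
        hits = sum(marker in explanation.lower() for marker in markers)
        return min(1.0, 0.5 + 0.25 * hits)

def evaluate(explanation: str, strategy: ScoringStrategy) -> float:
    # The caller injects the strategy, so the evaluator depends only on the abstraction
    return strategy.score(explanation)

# The same evaluation call works for any audience
evaluate("Take one tablet daily and call your doctor if you feel dizzy.", PatientScoring())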

Getting Help

Documentation

  • Primary documentation: This comprehensive guide covers installation, usage, and advanced topics

  • API Reference: Detailed function and class documentation with examples

  • Quick Start Guide

  • Installation Guide

Support Channels

Troubleshooting

# Verify installation
python -c "import src; print('MedExplain-Evals is working')"

# Run a basic smoke test (dummy model, two items)
python run_benchmark.py --model_name dummy --max_items 2

Contributing

We welcome contributions:

  • Code contributions via Pull Requests

  • Bug reports and feature requests via Issues

  • Documentation improvements

  • Research collaborations

See our Contributing Guidelines.

Citation

@software{medexplain-evals-2025,
  title={MedExplain-Evals: A Resource-Efficient Benchmark for Evaluating Audience-Adaptive Explanation Quality in Medical Large Language Models},
  author={Cheng Hei Lam},
  year={2025},
  url={https://github.com/heilcheng/medexplain-evals}
}
