Enter your information for instant access to the TrialGPT Benchmarking Report
At Health Universe, we tested our agentic oncology pipeline head-to-head against TrialGPT, the NIH-developed framework published in Nature Communications. Download the full evaluation report to see what we found.
Full methodology — how we designed the two-arm evaluation using the original TrialGPT datasets.
Head-to-head accuracy data — ROC AUC, precision, recall, and PR AUC across 176 labeled patient-trial pairs.
LLM-as-judge results — independent GPT-5 evaluation of clinical reasoning quality in divergent cases.
Four real case vignettes — where our system caught critical eligibility errors the original missed.
Discussion of tradeoffs — why higher recall isn't always better, and what precision means for your workflow.
Honest limitations — what this study does and doesn't prove, and what we're doing next.
TrialGPT flagged a 75-year-old male as strongly eligible for a pediatric trial. Our system correctly identified ineligibility.
TrialGPT misread "women ≥ 40" as an exclusion criterion, wrongly ruling out an eligible patient. Our system got it right.
TrialGPT scored a patient as strongly eligible even though the patient lacked a required VTE diagnosis. Our system caught the gap.
For a patient with NHL, SLE, diabetes, and hepatitis C, our system treated missing critical data with appropriate caution.
This evaluation benchmarks against TrialGPT (Jin et al., Nature Communications 2024) — the most rigorous publicly available framework for AI-assisted patient-to-trial matching, developed at NIH and evaluated on 75,000+ trial annotations.
Jin et al. "Matching patients to clinical trials with large language models." Nature Communications, 2024. DOI: 10.1038/s41467-024-53081-z