We Benchmarked Health Universe’s AI Against the Gold Standard in Clinical Trial Matching. Here’s What We Found.
AI clinical trial matching is one of the most promising — and most overhyped — applications of large language models in healthcare. Vendors make bold claims. Pilots get announced. But rigorous, head-to-head evaluations against independent benchmarks? Those are rare.
Health Universe decided to change that. We took our agentic oncology pipeline and tested it directly against TrialGPT — the patient-to-trial matching framework published in Nature Communications by NIH-affiliated researchers in 2024, evaluated on more than 75,000 trial annotations. It’s the closest thing the field has to a gold standard.
Same patients. Same trials. Independent evaluation. Here’s what we found.
Why Benchmarking AI Clinical Trial Matching Actually Matters
Clinical trial matching is broken in ways that are easy to underestimate. The average cancer patient is eligible for trials they never hear about. Research sites spend thousands of hours screening patients who won’t qualify. And the cost of a missed match — or a wrong one — compounds quickly at scale.
AI has real potential to fix this. But “AI-powered” means nothing without evidence. At Health Universe, we believe that deploying AI in clinical workflows requires the same standard of proof we’d expect from any other clinical tool: transparent methodology, honest metrics, and comparison against a known baseline.
That’s why we built this evaluation.
The Headline Numbers
Health Universe’s agentic pipeline agreed with TrialGPT’s output 89.9% of the time on the binary match/no-match question. Both systems also showed the same ability to discriminate eligible from ineligible patients (ROC AUC = 0.736).
But concordance alone doesn’t tell you which system is right. When the two systems disagreed, here’s what happened:
Health Universe’s agentic approach was correct 69% of the time in divergent cases. TrialGPT was correct 31% of the time.
Across the 29 head-to-head divergent cases with verified expert ground truth, our system correctly identified patient eligibility more than twice as often as TrialGPT. An independent evaluation using GPT-5 as a clinical reasoning judge preferred Health Universe’s outputs in 60% of the 15 most divergent cases it reviewed.
And on precision, the metric that matters most for clinical workflows (the share of flagged trials for which the patient actually qualifies), Health Universe achieved 40% versus TrialGPT’s 33%. In a real workflow processing 400+ trials per patient, that translates to roughly 15 fewer irrelevant trial reviews per patient.
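For readers who want to run the same kind of comparison on their own data, here is a minimal Python sketch of the two calculations behind these headline numbers: overall concordance, and who was right when the systems disagreed. The function names and the ten toy labels are hypothetical and are not taken from the evaluation.

```python
from typing import Sequence, Tuple


def concordance(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Fraction of patient-trial pairs where both systems make the same call."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def disagreement_wins(
    a: Sequence[bool], b: Sequence[bool], truth: Sequence[bool]
) -> Tuple[int, int]:
    """Among pairs where the systems disagree, count how often each is correct.

    With binary labels, exactly one system matches the ground truth
    whenever the two disagree.
    """
    wins_a = wins_b = 0
    for x, y, t in zip(a, b, truth):
        if x == y:
            continue  # agreement: not a divergent case
        if x == t:
            wins_a += 1
        else:
            wins_b += 1
    return wins_a, wins_b


# Toy labels (hypothetical, for illustration only): True = "match".
system_a = [True, False, True, False, False, True, False, False, True, False]
system_b = [True, True, True, False, True, True, False, False, False, False]
truth = [True, False, True, False, False, True, False, False, False, False]

agreement = concordance(system_a, system_b)
wins = disagreement_wins(system_a, system_b, truth)
print(f"concordance: {agreement:.1%}, divergent-case wins (A, B): {wins}")
```

Reporting the two numbers separately matters: high concordance says the systems usually agree, while the divergent-case breakdown says which one to trust when they do not.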
Why “More Matches” Isn’t Always Better in Trial Recruitment AI
Here’s the counterintuitive finding: TrialGPT produced 58% more “strong match” designations than Health Universe’s pipeline (158 vs. 100). On the surface, that sounds like it’s catching more opportunities.
It isn’t.
What TrialGPT’s larger count of strong matches actually reflects is a tendency to over-score patients as eligible, including for trials they’re explicitly excluded from. Health Universe’s pipeline is more conservative with strong-match designations, and the ground-truth data show that conservatism pays off. At the standard matching threshold, TrialGPT generated 33 false positives among the 176 patient-trial pairs with ground-truth labels (19%). Health Universe generated 18 (10%).
A false positive in AI clinical trial matching isn’t just a metric. It’s a coordinator spending an hour on a chart review that was never going to result in enrollment. It’s a patient getting their hopes up. It’s institutional trust eroding, one bad referral at a time.
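The percentages in this section follow directly from the reported counts. The short Python sketch below reproduces that arithmetic; the variable names are ours, and the only inputs are the figures quoted above (158 vs. 100 strong matches, 33 vs. 18 false positives out of 176).

```python
# Counts as reported in this article.
LABELED_PAIRS = 176
FALSE_POSITIVES = {"TrialGPT": 33, "Health Universe": 18}
STRONG_MATCHES = {"TrialGPT": 158, "Health Universe": 100}

# Share of labeled pairs that each system flagged incorrectly.
fp_share = {name: fp / LABELED_PAIRS for name, fp in FALSE_POSITIVES.items()}

# How many more strong matches TrialGPT produced, relative to Health Universe.
relative_increase = (
    STRONG_MATCHES["TrialGPT"] - STRONG_MATCHES["Health Universe"]
) / STRONG_MATCHES["Health Universe"]

# Absolute gap in false positives across the labeled pairs.
fp_gap = FALSE_POSITIVES["TrialGPT"] - FALSE_POSITIVES["Health Universe"]

for name, share in fp_share.items():
    print(f"{name}: {share:.0%} false positives")
print(f"TrialGPT strong matches: {relative_increase:.0%} more")
print(f"False-positive gap: {fp_gap}")
```

Note that the gap of 15 false positives is measured over the 176 labeled pairs.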
What Makes Health Universe’s Approach Different: The Patient Fact Sheet
The key architectural difference between Health Universe’s pipeline and standard TrialGPT is a step that might seem simple: before any trial matching occurs, Health Universe generates a structured Patient Fact Sheet.
Instead of feeding raw clinical notes directly into a matching algorithm, the Health Universe agentic pipeline first organizes patient information into standardized clinical categories — producing an interpretable, auditable summary before a single trial is evaluated. This upstream step is what drives the accuracy advantage.
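To make the idea concrete, here is a minimal Python sketch of what a structured fact sheet could look like. It is an illustration only: the class name, the field names, and the choice of categories are our assumptions for this example, not the actual Health Universe schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PatientFactSheet:
    """Hypothetical structured summary built before any trial is evaluated."""

    age: Optional[int] = None  # None means the note does not state it
    sex: Optional[str] = None
    primary_diagnosis: Optional[str] = None
    comorbidities: List[str] = field(default_factory=list)
    # Facts a matcher would need but the note never mentions.
    unknowns: List[str] = field(default_factory=list)

    def summary(self) -> str:
        """Render a short, human-readable version for audit and review."""

        def show(value: Optional[object]) -> str:
            return "unknown" if value is None else str(value)

        lines = [
            f"age: {show(self.age)}",
            f"sex: {show(self.sex)}",
            f"primary diagnosis: {show(self.primary_diagnosis)}",
            f"comorbidities: {', '.join(self.comorbidities) or 'none recorded'}",
            f"not documented: {', '.join(self.unknowns) or 'nothing flagged'}",
        ]
        return "\n".join(lines)


# Example based on a case described in this article.
sheet = PatientFactSheet(
    age=75,
    sex="male",
    primary_diagnosis="metastatic thyroid cancer",
)
print(sheet.summary())
```

Keeping "unknown" distinct from "absent" is the design choice that matters here: a reviewer can see at a glance what the record states and what it leaves out.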
The evaluation caught real errors that this approach prevents:
Age restriction missed (pediatric trial, elderly patient)
TrialGPT scored a 75-year-old male with metastatic thyroid cancer as strongly eligible for a pediatric radioiodine study. Health Universe correctly identified the age mismatch and scored him ineligible.
Logical inversion of an eligibility criterion
TrialGPT misread a “women ≥ 40” inclusion criterion as an exclusion criterion, incorrectly ruling out a clearly eligible 54-year-old patient. Health Universe’s pipeline interpreted the criterion correctly.
Missing required diagnosis
TrialGPT flagged a patient as strongly eligible for a trial requiring a recent VTE diagnosis, a condition the patient didn’t have. Health Universe’s system identified the missing criterion and correctly scored the patient ineligible.
Complex comorbidity handling
For a patient with non-Hodgkin lymphoma, SLE, diabetes, and hepatitis C, TrialGPT assumed missing data implied eligibility. Health Universe’s pipeline treated the absence of critical confirmatory information as a reason for caution — the clinically appropriate response.
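The four error types above share a pattern: each criterion can be met, not met, or simply unknown, and inclusion and exclusion criteria must be read in opposite directions. The Python sketch below shows one way to encode that logic. It is a simplified illustration under our own assumptions (the names, the age cutoffs, and the "needs review" outcome are hypothetical), not the pipeline’s actual implementation.

```python
from enum import Enum
from typing import List, Optional, Set


class Verdict(Enum):
    MET = "met"
    NOT_MET = "not met"
    UNKNOWN = "unknown"  # the record does not say either way


def check_age(
    age: Optional[int], min_age: Optional[int] = None, max_age: Optional[int] = None
) -> Verdict:
    """Check an age criterion; a missing age is UNKNOWN, never MET."""
    if age is None:
        return Verdict.UNKNOWN
    if min_age is not None and age < min_age:
        return Verdict.NOT_MET
    if max_age is not None and age > max_age:
        return Verdict.NOT_MET
    return Verdict.MET


def check_diagnosis(diagnoses: Optional[Set[str]], required: str) -> Verdict:
    """Check whether a diagnosis is documented for the patient."""
    if diagnoses is None:
        return Verdict.UNKNOWN
    return Verdict.MET if required in diagnoses else Verdict.NOT_MET


def overall(inclusion: List[Verdict], exclusion: List[Verdict]) -> str:
    """Inclusion criteria must be MET; exclusion criteria must be NOT_MET."""
    if Verdict.NOT_MET in inclusion or Verdict.MET in exclusion:
        return "ineligible"
    if Verdict.UNKNOWN in inclusion or Verdict.UNKNOWN in exclusion:
        return "needs review"  # missing data is not treated as eligibility
    return "eligible"


# 75-year-old vs. a pediatric trial (max_age=17 is an assumed cutoff).
pediatric = overall([check_age(75, max_age=17)], [])

# 54-year-old woman vs. a "women >= 40" criterion, read as inclusion.
age_40_plus = overall([check_age(54, min_age=40)], [])

# Trial requires a recent VTE diagnosis that the patient does not have.
vte = overall([check_diagnosis({"non-Hodgkin lymphoma"}, "VTE")], [])

# Critical confirmatory information is absent from the record.
missing = overall([Verdict.MET], [Verdict.UNKNOWN])

print(pediatric, age_40_plus, vte, missing)
```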
How We Ran the Evaluation: Methodology and Transparency
Health Universe used the same datasets published in the original TrialGPT study, focusing on 17 oncology patients and 417 cancer-related clinical trials. Both pipelines used GPT-4o as the underlying model, ensuring the comparison reflects architectural differences rather than model version differences.
Ground-truth labels from the TREC and SIGIR shared tasks provided an external benchmark for 176 of approximately 7,078 patient-trial pairs, or about 2.5%. We’re transparent about what that means: with so few labeled pairs, the ground-truth analyses are informative but statistically limited. A GPT-5-based LLM-as-judge evaluation provided an independent assessment of clinical reasoning quality in the 15 most divergent cases.
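One way to see what "statistically limited" means in practice is to put a confidence interval around a small-sample proportion. The Python sketch below computes a 95% Wilson score interval. As an illustration it uses 20 correct out of 29 divergent cases, which is our assumption about the count behind the 69% figure; the article reports only the percentage.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Assumed count: 69% of 29 divergent cases rounds to 20.
low, high = wilson_interval(20, 29)
print(f"95% interval for 20/29: {low:.1%} to {high:.1%}")
```

With only 29 cases the interval spans roughly thirty percentage points, which is why results like these are best read as directional rather than definitive.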
We’re sharing the full methodology, all metrics, and the limitations alongside the results — because we think that’s how AI in clinical settings should be evaluated.
If you’re evaluating AI clinical trial matching tools — or building the case internally for adopting one — this is the kind of evidence that matters.










