AI Clinical Trial Matching: Health Universe vs. TrialGPT Benchmark

We Benchmarked Health Universe’s AI Against the Gold Standard in Clinical Trial Matching. Here’s What We Found.

AI clinical trial matching is one of the most promising — and most overhyped — applications of large language models in healthcare. Vendors make bold claims. Pilots get announced. But rigorous, head-to-head evaluations against independent benchmarks? Those are rare.

Health Universe decided to change that. We took our agentic oncology pipeline and tested it directly against TrialGPT — the patient-to-trial matching framework published in Nature Communications by NIH-affiliated researchers in 2024, evaluated on more than 75,000 trial annotations. It’s the closest thing the field has to a gold standard.

Same patients. Same trials. Independent evaluation. Here’s what we found.

Why Benchmarking AI Clinical Trial Matching Actually Matters

Clinical trial matching is broken in ways that are easy to underestimate. The average cancer patient is eligible for trials they never hear about. Research sites spend thousands of hours screening patients who won’t qualify. And the cost of a missed match — or a wrong one — compounds quickly at scale.

AI has real potential to fix this. But “AI-powered” means nothing without evidence. At Health Universe, we believe that deploying AI in clinical workflows requires the same standard of proof we’d expect from any other clinical tool: transparent methodology, honest metrics, and comparison against a known baseline.

That’s why we built this evaluation.

The Headline Numbers

Health Universe’s agentic pipeline agreed with TrialGPT’s output 89.9% of the time on the binary match/no-match question, a high level of concordance. The two systems also showed identical discriminative ability for eligible versus not eligible (ROC AUC = 0.736).
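For readers who want to reproduce a concordance figure on their own data, here is a minimal sketch of how such an agreement rate can be computed. The function name and the toy labels are our own illustration, not the evaluation code or data from the report.

```python
from typing import Sequence


def concordance(labels_a: Sequence[bool], labels_b: Sequence[bool]) -> float:
    """Fraction of patient-trial pairs on which two systems give the same binary label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both systems must score the same pairs")
    if not labels_a:
        raise ValueError("need at least one pair")
    agreements = sum(a == b for a, b in zip(labels_a, labels_b))
    return agreements / len(labels_a)


# Toy labels (hypothetical, not from the study): True = match, False = no match.
system_a = [True, False, False, True, False]
system_b = [True, False, True, True, False]
agreement = concordance(system_a, system_b)  # the systems agree on 4 of 5 pairs
```

Concordance only says how often two systems agree. It says nothing about which one is right when they differ, which is why the disagreement analysis matters.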

But concordance alone doesn’t tell you which system is right. When the two systems disagreed, here’s what happened:

Health Universe’s agentic approach was correct 69% of the time in divergent cases. TrialGPT was correct 31% of the time.

In 29 head-to-head divergent cases with verified expert ground truth, our system correctly identified patient eligibility more than twice as often. A separate evaluation using GPT-5 as a clinical reasoning judge reviewed the 15 most divergent cases and preferred Health Universe’s outputs in 60% of them.
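The disagreement analysis can be sketched as a simple tally: for every pair where the two systems differ and a ground-truth label exists, count which system matched the label. This is an illustrative sketch with hypothetical names and toy data, not the report's actual code.

```python
from typing import Sequence, Tuple


def score_disagreements(
    labels_a: Sequence[bool],
    labels_b: Sequence[bool],
    truth: Sequence[bool],
) -> Tuple[int, int]:
    """Return (times A was right, times B was right) over pairs where A and B differ."""
    a_right = 0
    b_right = 0
    for a, b, t in zip(labels_a, labels_b, truth):
        if a == b:
            continue  # agreement cases carry no information about who is right
        # With binary labels, exactly one system matches the truth when they differ.
        if a == t:
            a_right += 1
        else:
            b_right += 1
    return a_right, b_right


# Toy data (hypothetical): the systems differ on three of four pairs.
a_wins, b_wins = score_disagreements(
    [True, False, True, False],
    [False, False, False, True],
    [True, False, False, False],
)
```

The article reports percentages rather than raw counts. As a sanity check on the arithmetic, a 20-to-9 split over 29 cases would round to 69% and 31%.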

And on the metric that matters most for clinical workflows, precision (the share of recommended matches that are truly eligible), Health Universe achieved 40% versus TrialGPT’s 33%. In a real workflow processing 400+ trials per patient, that translates to roughly 15 fewer irrelevant trial reviews per patient.

Why “More Matches” Isn’t Always Better in Trial Recruitment AI

Here’s the counterintuitive finding: TrialGPT produced 58% more “strong match” designations than Health Universe’s pipeline (158 vs. 100). On the surface, that sounds like it’s catching more opportunities.

It isn’t.

What TrialGPT’s higher volume of strong matches actually reflects is a tendency to over-score patients as eligible, including for trials they’re explicitly excluded from. Health Universe’s pipeline is more conservative with strong-match designations, and the ground-truth data shows that conservatism pays off. At the standard matching threshold, TrialGPT generated 33 false positives across the 176 patient-trial pairs with ground-truth labels (19%). Health Universe generated 18 (10%).
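The percentages in this section follow directly from the counts quoted above. The short sketch below re-derives them and shows how precision is computed from true and false positives. The helper name is ours, and the counts passed to `precision` at the end are hypothetical, chosen only to show the formula.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of predicted matches that are actually eligible: TP / (TP + FP)."""
    predicted = true_positives + false_positives
    if predicted == 0:
        raise ValueError("no predicted matches, precision is undefined")
    return true_positives / predicted


LABELED_PAIRS = 176  # patient-trial pairs with ground-truth labels

# False positives as a share of all labeled pairs, from the counts in the text.
trialgpt_fp_share = 33 / LABELED_PAIRS       # about 0.19
health_universe_fp_share = 18 / LABELED_PAIRS  # about 0.10

# Relative difference in strong-match designations (158 vs. 100).
extra_strong_matches = (158 - 100) / 100     # 0.58, i.e. 58% more

# Hypothetical counts, for illustration only: 4 correct out of 10 recommendations.
example_precision = precision(true_positives=4, false_positives=6)
```

Note that the false-positive shares above use all labeled pairs as the denominator, while precision uses only the pairs a system flagged as matches, so the two figures answer different questions.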

A false positive in AI clinical trial matching isn’t just a metric. It’s a coordinator spending an hour on a chart review that was never going to result in enrollment. It’s a patient getting their hopes up. It’s institutional trust eroding, one bad referral at a time.

What Makes Health Universe’s Approach Different: The Patient Fact Sheet

The key architectural difference between Health Universe’s pipeline and standard TrialGPT is a step that might seem simple: before any trial matching occurs, Health Universe generates a structured Patient Fact Sheet.

Instead of feeding raw clinical notes directly into a matching algorithm, the Health Universe agentic pipeline first organizes patient information into standardized clinical categories — producing an interpretable, auditable summary before a single trial is evaluated. This upstream step is what drives the accuracy advantage.
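To make the idea concrete, here is a rough sketch of what a structured fact sheet could look like as a data structure. The field names and categories below are our own assumptions for illustration; the article does not list the actual categories the pipeline uses.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class PatientFactSheet:
    """Hypothetical structured summary built from raw notes before any matching."""

    # Scalar facts default to None, meaning "not documented in the notes".
    age_years: Optional[int] = None
    sex: Optional[str] = None
    primary_diagnosis: Optional[str] = None
    stage: Optional[str] = None
    # List-valued facts default to empty lists.
    comorbidities: List[str] = field(default_factory=list)
    prior_therapies: List[str] = field(default_factory=list)

    def undocumented(self) -> List[str]:
        """Names of scalar fields that the notes did not document."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Example built from details mentioned in this article; the stage is left unknown.
sheet = PatientFactSheet(
    age_years=75,
    sex="male",
    primary_diagnosis="metastatic thyroid cancer",
)
gaps = sheet.undocumented()
```

Keeping "not documented" explicit, rather than silently treating it as a negative finding, is what makes the summary auditable: a reviewer can see exactly which facts the matcher had and which it lacked.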

The evaluation caught real errors that this approach prevents.

Age restriction missed (pediatric trial, elderly patient)

TrialGPT scored a 75-year-old male with metastatic thyroid cancer as strongly eligible for a pediatric radioiodine study. Health Universe correctly identified the age mismatch and scored him ineligible.

Logical inversion of an eligibility criterion

TrialGPT misread a “women ≥ 40” inclusion criterion as an exclusion criterion, incorrectly ruling out a clearly eligible 54-year-old patient. Health Universe’s pipeline interpreted the criterion correctly.

Missing required diagnosis

TrialGPT flagged a patient as strongly eligible for a trial requiring a recent VTE diagnosis, a condition the patient didn’t have. Health Universe’s system recognized that the required diagnosis was absent and correctly scored the patient ineligible.

Complex comorbidity handling

For a patient with non-Hodgkin lymphoma, SLE, diabetes, and hepatitis C, TrialGPT assumed missing data implied eligibility. Health Universe’s pipeline treated the absence of critical confirmatory information as a reason for caution — the clinically appropriate response.
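The four cases above share a common thread: each criterion needs a three-way answer (met, not met, or unknown), and inclusion and exclusion criteria must be read with the right polarity. The sketch below illustrates that logic under our own assumptions. The names, the pediatric age cutoff of 18, and the "needs review" outcome are hypothetical and are not taken from either system.

```python
from enum import Enum
from typing import List, Optional


class Verdict(Enum):
    MET = "met"
    NOT_MET = "not met"
    UNKNOWN = "unknown"


def check_age(age: Optional[int], min_age: int = 0, max_age: int = 200) -> Verdict:
    """Evaluate an age-range criterion; a missing age is UNKNOWN, never assumed."""
    if age is None:
        return Verdict.UNKNOWN
    return Verdict.MET if min_age <= age <= max_age else Verdict.NOT_MET


def overall(inclusion: List[Verdict], exclusion: List[Verdict]) -> str:
    """Inclusion criteria must be MET; exclusion criteria must be NOT_MET."""
    if any(v is Verdict.NOT_MET for v in inclusion):
        return "ineligible"
    if any(v is Verdict.MET for v in exclusion):
        return "ineligible"
    if any(v is Verdict.UNKNOWN for v in inclusion + exclusion):
        return "needs review"  # missing data is a reason for caution
    return "eligible"


# 75-year-old vs. a pediatric trial (cutoff of 18 is an assumed value).
pediatric = overall([check_age(75, max_age=18)], [])
# 54-year-old woman vs. an inclusion criterion of age >= 40.
women_40_plus = overall([check_age(54, min_age=40)], [])
# A required fact that the record does not document.
missing_data = overall([Verdict.MET, Verdict.UNKNOWN], [])
```

The design choice worth noting is the last branch: an unknown never silently becomes "eligible". It is surfaced for human review instead.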

How We Ran the Evaluation: Methodology and Transparency

Health Universe used the same datasets published in the original TrialGPT study, focusing on 17 oncology patients and 417 cancer-related clinical trials. Both pipelines used GPT-4o as the underlying model, ensuring the comparison reflects architectural differences rather than model version differences.

Ground-truth labels from the TREC and SIGIR shared tasks provided an external benchmark for 176 of approximately 7,078 patient-trial pairs, or about 2.5% of the total. We’re transparent about what that means: the ground-truth analyses are informative but statistically limited. A GPT-5-based LLM-as-judge evaluation provided an independent assessment of clinical reasoning quality in the 15 most divergent cases.
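To give a feel for what "statistically limited" means at this sample size, the sketch below computes the ground-truth coverage and a Wilson score interval for one of the reported proportions. The Wilson interval is our choice for illustration; the article does not say which uncertainty method, if any, the full report uses.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Share of patient-trial pairs that have ground-truth labels.
coverage = 176 / 7078

# Uncertainty around Health Universe's 18 false positives in 176 labeled pairs.
low, high = wilson_interval(18, 176)
```

With only 176 labeled pairs, intervals like this are wide, which is why we treat the ground-truth results as directional evidence rather than a final verdict.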

We’re sharing the full methodology, all metrics, and the limitations alongside the results — because we think that’s how AI in clinical settings should be evaluated.

If you’re evaluating AI clinical trial matching tools — or building the case internally for adopting one — this is the kind of evidence that matters.


