News
OpenAI is Trying to Fix AI’s Biggest Medical Blind Spot
- By John K. Waters
- 05/14/2025
When OpenAI launched ChatGPT, it cracked open the door to a future where AI could do everything from drafting your emails to diagnosing your sore throat. But here’s the thing: medicine doesn’t play by Silicon Valley’s rules. In healthcare, good enough isn’t good enough. And that’s why OpenAI’s latest move—a massive new benchmark called HealthBench—is worth paying attention to.
Unveiled this week, HealthBench is OpenAI’s first major standalone healthcare project: a sprawling dataset of 5,000 synthetic-yet-realistic health conversations, paired with more than 48,000 medical grading criteria written by doctors from 60 countries. It was built to answer a deceptively simple question: How well do large language models actually perform when people’s health is on the line?
"Improving human health will be one of the defining impacts of AGI," OpenAI wrote in a company blog post. "If developed and deployed effectively, large language models have the potential to expand access to health information, support clinicians in delivering high-quality care, and help people advocate for their health and that of their communities."
And unlike previous benchmarks, many of which rely on exam-style multiple-choice questions, HealthBench doesn’t grade models the way a med school professor grades a test. It evaluates AI in the wild: navigating emergencies, addressing ambiguity, speaking with patients and clinicians, sometimes in languages other than English. It asks not just whether a model knows the right answer, but whether it can deliver it clearly, safely, and in context.
AI, Meet Reality
The dataset simulates conversations between AI models and both patients and healthcare professionals. The conversations span everything from prenatal counseling to pandemic triage, all written or reviewed by real doctors. Even the grading is sophisticated: physician-written rubrics score models on instruction following, factual accuracy, communication quality, and medical appropriateness.
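To make the mechanics concrete, here is a minimal sketch of rubric-based scoring in Python. The `Criterion` class, its field names, and the example rubric are hypothetical illustrations rather than OpenAI’s actual code; the sketch assumes each criterion carries a point value, that unsafe behaviors can carry negative points, and that an example’s score is earned points over the maximum achievable points.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One physician-written rubric item (hypothetical structure).
    Points may be negative to penalize harmful behavior."""
    description: str
    points: int


def score_response(criteria: list[Criterion], met: list[bool]) -> float:
    """Score one model response against a rubric.

    met[i] records whether a grader judged criteria[i] satisfied.
    The score is earned points divided by the maximum achievable
    (positive) points, clipped to [0, 1] so that a response tripping
    only penalties cannot go negative.
    """
    earned = sum(c.points for c, ok in zip(criteria, met) if ok)
    max_points = sum(c.points for c in criteria if c.points > 0)
    return max(0.0, min(1.0, earned / max_points)) if max_points else 0.0


# Illustrative rubric for a chest-pain conversation (invented example).
rubric = [
    Criterion("Advises seeking emergency care for chest pain", 10),
    Criterion("Asks about symptom onset and duration", 5),
    Criterion("Recommends unproven home remedies", -8),
]
print(score_response(rubric, met=[True, False, False]))  # ~0.67
```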
Some conversations are intentionally difficult, including 1,000 examples where most AI systems currently fail. That’s not a bug; it’s the point.
"Benchmarks should reflect reality, not just test-taking," OpenAI wrote. "Otherwise, we incentivize models that ace quizzes but flunk life."
But while HealthBench raises the bar, it’s also OpenAI grading its own homework. The company tested not only models from competitors Google, Meta, Anthropic, and xAI, but also its own new model, o3, which came out on top.
Doctors Are Using ChatGPT. Should They?
Whether OpenAI likes it or not, AI is already being used in hospitals. ChatGPT and other LLMs are being quietly tested (and sometimes deployed) to help with charting, triage, and even patient communication. The tools are seductive: fast, fluent, and available 24/7. But they’re also flawed: prone to hallucinations, overconfidence, and subtle misunderstandings that can carry devastating consequences.
That’s where HealthBench could be a game-changer. It brings rigorous, clinically informed evaluation to a field that desperately needs it. And it’s open source.
The Ecosystem Play
HealthBench is also a strategic flex. OpenAI isn’t just publishing a dataset—it’s positioning itself at the center of the health AI evaluation ecosystem.
And the company isn’t stopping there. It’s working with:
- Sanofi and Formation Bio to build AI tools for faster clinical trials.
- Color Health on an AI-powered cancer co-pilot that designs personalized care plans.
- Iodine Software to plug GPT-4 into hospital operations.
- UTHealth Houston to bring AI into the classroom—and the clinic.
The subtext? OpenAI isn’t just playing catch-up with Google's Med-PaLM. It’s coming for healthcare’s core infrastructure.
A Better Benchmark—But Not a Silver Bullet
Still, HealthBench is a first draft. Critics point out that many of its evaluations are performed by AI models themselves, raising the risk that flaws shared by the grader and the graded go unnoticed. The sketch below illustrates the pattern at the center of that concern.
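In practice, that kind of grading is typically automated as an LLM-as-judge loop: a grader model reads a response and decides whether each rubric criterion was met. Here is a minimal sketch using the OpenAI Python SDK; the grader prompt, the MET/UNMET convention, and the choice of gpt-4o as the grader are illustrative assumptions, not HealthBench’s actual setup.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical grading prompt, not HealthBench's real one.
GRADER_PROMPT = """You are grading a medical chatbot response.
Criterion: {criterion}
Response: {response}
Answer with exactly one word: MET or UNMET."""


def grade_criterion(criterion: str, response: str,
                    grader_model: str = "gpt-4o") -> bool:
    """Ask a grader model whether a single rubric criterion is met."""
    result = client.chat.completions.create(
        model=grader_model,
        temperature=0,  # deterministic-ish grading
        messages=[{
            "role": "user",
            "content": GRADER_PROMPT.format(
                criterion=criterion, response=response),
        }],
    )
    return result.choices[0].message.content.strip().upper() == "MET"
```

The obvious weakness is the one critics name: if the grader model shares a blind spot with the model being graded, the error passes silently, which is why expanded human review matters.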
OpenAI acknowledges that. HealthBench is meant to spur progress, not declare victory. The company says future iterations will expand human reviews and subgroup testing to assess fairness across gender, age, and geography.
And that’s the core tension: AI in medicine promises equity, speed, and scale—but risks bias, opacity, and overreach if rushed.
HealthBench is not the end of the story. But it might be the beginning of a new chapter, where AI systems are judged not by how well they test, but by how well they heal.
About the Author
John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI, and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at [email protected].