Microsoft AI Outperforms Doctors in Complex Diagnoses, Study Shows

A new diagnostic system developed by Microsoft AI researchers has outperformed practicing physicians in a series of complex clinical cases, offering a glimpse into how generative artificial intelligence could transform high-stakes medical decision-making.

The system, called MAI-DxO (Microsoft AI Diagnostic Orchestrator), was evaluated using detailed patient case records from the New England Journal of Medicine (NEJM), which are widely regarded as among the most intellectually demanding diagnostic challenges in medicine. In benchmark tests, MAI-DxO achieved 85.5% diagnostic accuracy, far higher than the 20% average accuracy of a cohort of 21 experienced physicians from the U.S. and the U.K.

The system also demonstrated a key economic advantage, delivering correct diagnoses at lower estimated testing costs than both human doctors and individual large language models tested independently, according to Microsoft.

"We believe that orchestrating multiple AI models will be critical to managing complex clinical workflows," the company said in a statement accompanying the release of its research findings. "MAI-DxO enables safer, more cost-effective decision-making in medical diagnostics."

Moving Beyond Multiple Choice
While previous AI evaluations often relied on multiple-choice tests such as the U.S. Medical Licensing Examination (USMLE), Microsoft's new benchmark—known as the Sequential Diagnosis Benchmark (SD Bench)—mirrors real-world clinical reasoning. Each case involves a stepwise approach, allowing the AI to ask questions and order tests sequentially, adjusting its hypotheses as new information becomes available.

This iterative diagnostic process more accurately reflects the work of clinicians, who must synthesize evolving data over time. In contrast to past benchmarks that emphasize memorization, SD Bench tests reasoning and resource allocation—factors critical to real-world clinical practice.
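The stepwise process described above can be illustrated with a minimal Python sketch. It is not the SD Bench implementation: the `Encounter` class, the `toy_policy` function, and the sample findings are all hypothetical, chosen only to show the pattern of requesting information one step at a time and committing to a diagnosis once enough has been revealed.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Encounter:
    """A toy case: findings stay hidden until the agent requests them."""
    hidden_findings: dict[str, str]
    revealed: dict[str, str] = field(default_factory=dict)

    def request(self, name: str) -> str:
        # Reveal one finding (a question answered or a test result returned).
        result = self.hidden_findings.get(name, "not available")
        self.revealed[name] = result
        return result


# A policy looks at what has been revealed so far and returns either
# ("request", <question or test name>) or ("diagnose", <diagnosis>).
Policy = Callable[[dict[str, str]], tuple[str, str]]


def run_sequential_diagnosis(
    encounter: Encounter, policy: Policy, max_steps: int = 5
) -> Optional[str]:
    """Ask or order one thing at a time until the policy commits to a diagnosis."""
    for _ in range(max_steps):
        action, value = policy(encounter.revealed)
        if action == "diagnose":
            return value
        encounter.request(value)
    return None  # step limit reached without a diagnosis


def toy_policy(revealed: dict[str, str]) -> tuple[str, str]:
    """Hypothetical rule-based stand-in for a language model's reasoning."""
    if "travel history" not in revealed:
        return ("request", "travel history")
    if "blood smear" not in revealed:
        return ("request", "blood smear")
    if "parasites" in revealed["blood smear"]:
        return ("diagnose", "malaria")
    return ("diagnose", "undetermined")


encounter = Encounter(
    hidden_findings={
        "travel history": "recent trip to a malaria-endemic region",
        "blood smear": "parasites seen",
    }
)
final_diagnosis = run_sequential_diagnosis(encounter, toy_policy)
```

In a real evaluation the policy would be a language model and the findings would come from the case record. The loop only shows why this format rewards reasoning about what to ask next rather than recall of a fixed answer.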

A Virtual Panel of AI Doctors
MAI-DxO acts as an "orchestrator" by coordinating the diagnostic reasoning of multiple large language models, including OpenAI's GPT-4o, Meta's Llama, Google's Gemini, and others. Microsoft researchers found that this ensemble-style approach not only improved diagnostic accuracy but also allowed the system to simulate a virtual panel of specialists, enabling it to blend the breadth of generalist care with the depth of subspecialty insight.
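The article does not say how MAI-DxO combines the opinions it collects, so the following Python sketch uses plain majority voting as a stand-in. It illustrates the general idea of a panel, not Microsoft's method. The three panel members are hypothetical functions that return fixed answers; in practice each would be a call to a different model or a differently prompted role.

```python
from collections import Counter
from typing import Callable

# Each panel member maps a case summary to a proposed diagnosis.
PanelMember = Callable[[str], str]


def panel_diagnosis(case_summary: str, panel: list[PanelMember]) -> tuple[str, float]:
    """Return the most common diagnosis and the share of the panel that chose it."""
    votes = Counter(member(case_summary) for member in panel)
    diagnosis, count = votes.most_common(1)[0]
    return diagnosis, count / len(panel)


# Hypothetical stand-ins for model calls; the answers are fixed for illustration.
def generalist(summary: str) -> str:
    return "pneumonia"


def pulmonologist(summary: str) -> str:
    return "sarcoidosis"


def rheumatologist(summary: str) -> str:
    return "sarcoidosis"


panel_choice, agreement = panel_diagnosis(
    "hypothetical case summary",
    [generalist, pulmonologist, rheumatologist],
)
```

The agreement share gives a rough signal of how contested a case is, which an orchestrator could use to decide whether to gather more information before committing.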

Notably, the best-performing configuration paired MAI-DxO with OpenAI's o3 model. Even under simulated cost constraints that penalized unnecessary tests, the system continued to outperform human physicians, suggesting that AI could reach accurate diagnoses while ordering fewer tests.
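Weighing accuracy against testing cost can be sketched in a few lines of Python. The prices, the `Run` record, and the sample results below are invented for illustration and do not come from the study. The sketch simply reports accuracy alongside the average cost of the tests ordered per case.

```python
from dataclasses import dataclass

# Hypothetical price list in U.S. dollars; not taken from the study.
TEST_PRICES_USD = {"blood smear": 40.0, "chest x-ray": 120.0, "ct scan": 900.0}


@dataclass
class Run:
    """One diagnostic attempt: the answer given, the truth, and the tests ordered."""
    predicted: str
    truth: str
    tests_ordered: list[str]


def score(runs: list[Run], prices: dict[str, float]) -> tuple[float, float]:
    """Return (accuracy, mean testing cost per case) over a set of runs."""
    correct = sum(r.predicted == r.truth for r in runs)
    total_cost = sum(prices[t] for r in runs for t in r.tests_ordered)
    return correct / len(runs), total_cost / len(runs)


runs = [
    Run("malaria", "malaria", ["blood smear"]),
    Run("pneumonia", "sarcoidosis", ["chest x-ray", "ct scan"]),
]
accuracy, mean_cost = score(runs, TEST_PRICES_USD)
```

Reporting the two numbers side by side makes the trade-off visible: a system that orders every available test may be accurate but expensive, while one that orders too few may be cheap but wrong.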

Not Yet Ready for Prime Time
Despite its strong performance, Microsoft emphasized that MAI-DxO remains a research demonstration. It is not currently available for public use and has not yet been validated in real-world clinical settings. The researchers acknowledged that their findings are limited to the kinds of rare, diagnostically challenging cases featured in NEJM case studies, and that further testing is needed for more common clinical presentations.

The study's human physicians also worked without access to standard diagnostic tools or peer consultations, conditions that likely put them at a disadvantage relative to the AI system.

The Long Road Ahead
The unveiling of MAI-DxO follows Microsoft's broader push into healthcare AI. The company's portfolio already includes tools like RAD-DINO, which supports radiology workflows, and Dragon Copilot, a voice-first assistant for clinicians.

While generative AI has made rapid strides in medical applications, Microsoft said its ultimate goal is to build trust in AI through transparency, auditability, and rigorous benchmarking. It is now working with external partners to validate MAI-DxO's results and assess its clinical safety.

"Important challenges remain before AI can be safely and responsibly deployed across healthcare," the company said. "But we believe that the future of medicine will be defined by the collaboration between human expertise and machine intelligence."

About the Author

John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at [email protected].
