Appendix E: Notable AI Failures in Medicine

Introduction

Learning from failures is as important as celebrating successes. This appendix catalogs notable medical AI failures—what went wrong, why, and lessons for future development. Understanding these failures helps physicians recognize warning signs and demand better AI.


1. IBM Watson for Oncology (2013-2021)

What it promised: AI “cognitive computing” analyzing millions of journal articles to recommend personalized cancer treatments.

What went wrong:
- Recommendations contradicted evidence-based guidelines
- Suggested unsafe treatments, such as a therapy contraindicated in a patient with severe bleeding
- Trained on synthetic hypothetical cases rather than real patient outcomes
- Failed to generalize internationally (trained on U.S. data, deployed globally)
- Expensive (millions in licensing fees) with poor usability

Outcome: Multiple hospitals canceled contracts (2017-2019). IBM sold its Watson Health division in 2021, effectively acknowledging the failure to deliver value.

Why it failed:
- Hype exceeded capability (Jeopardy! success didn’t translate to medicine)
- Lack of deep clinical expertise in development
- Insufficient validation before widespread marketing
- Synthetic training data captured idealized scenarios, not clinical complexity
- Business model prioritized revenue over patient benefit

Lesson: Marketing ≠ medical reality. Demand rigorous validation, clinical grounding, and transparency.


2. Epic Sepsis Model (2015-present)

What it promised: EHR-embedded AI predicting sepsis 6-12 hours before onset, enabling early intervention.

What went wrong:
- External validation at the University of Michigan (published 2021) found the model failed to alert on 67% of sepsis cases before onset
- It identified only 7% of sepsis cases that clinicians had not already recognized
- High false positive rate → alert fatigue and ignored warnings (see the sketch below)
- Performance fell far below vendor claims
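
To make the alert-fatigue arithmetic concrete, the sketch below computes the positive predictive value of a sepsis alert at low prevalence. The sensitivity, specificity, and prevalence values are illustrative assumptions, not figures reported for the Epic model.

```python
# Illustrative arithmetic only: assumed sensitivity/specificity/prevalence,
# not figures reported for any specific vendor's sepsis model.
def positive_predictive_value(sensitivity: float,
                              specificity: float,
                              prevalence: float) -> float:
    """PPV = true positives / all positive alerts, via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Assume 60% sensitivity, 90% specificity, 2% sepsis prevalence among monitored patients.
ppv = positive_predictive_value(0.60, 0.90, 0.02)
print(f"PPV: {ppv:.1%}")  # ~10.9%: roughly 9 of every 10 alerts are false alarms
```

Under these assumed numbers, most alerts are false alarms even though the headline accuracy looks respectable, which is how alert fatigue takes hold.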

Why it failed:
- Trained on retrospectively identified sepsis cases labeled from billing codes (noisy labels)
- Look-ahead bias: the model likely learned from data collected after sepsis onset, such as tests ordered because sepsis was already suspected (see the sketch below)
- Sepsis definitions were inconsistent across institutions
- No prospective external validation before widespread deployment
- Lack of transparency (Epic did not initially disclose algorithm details)
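
One practical guard against look-ahead bias is to enforce a hard cutoff so that no value charted at or after the intended prediction time reaches the model. A minimal sketch in pandas, with hypothetical column names (patient_id, event_time, feature, value, prediction_time):

```python
import pandas as pd

def features_at_prediction_time(events: pd.DataFrame,
                                prediction_times: pd.DataFrame) -> pd.DataFrame:
    """Build a feature table using only data charted before each prediction time.

    events:            columns [patient_id, event_time, feature, value]
    prediction_times:  columns [patient_id, prediction_time]
    """
    merged = events.merge(prediction_times, on="patient_id")
    # Drop anything recorded at or after the moment the model would fire,
    # e.g. a lactate drawn *because* sepsis was already suspected.
    known = merged[merged["event_time"] < merged["prediction_time"]]
    # Keep the most recent value of each feature known at prediction time.
    return (known.sort_values("event_time")
                 .groupby(["patient_id", "feature"])["value"]
                 .last()
                 .unstack())
```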

Outcome: Still deployed in many Epic-using hospitals (embedded in EHR, difficult to remove). Eroded clinician trust in AI alerts.

Lesson: External validation before deployment is non-negotiable. Retrospective performance ≠ prospective utility. Transparent reporting essential.


3. COVID-19 Chest X-Ray AI (2020-2021)

What it promised: Rapid COVID-19 diagnosis from chest X-rays, supplementing scarce PCR testing.

What went wrong:
- A 2021 systematic review found that of the 2,000+ models published, none were suitable for clinical use
- All had critical methodological flaws
- Few were deployed clinically; those that were performed poorly and were withdrawn

Why they failed:
- Training data biases: COVID-positive images came from different sources and equipment than controls, so models learned the image source rather than the pathology
- Confounding: COVID patients were often imaged supine with portable X-rays while controls stood for PA/lateral views, so models detected positioning
- Data leakage: duplicate images appeared in both training and test sets (see the split sketch below)
- Lack of external validation
- Biological implausibility: chest X-ray findings in COVID-19 are non-specific and resemble other viral pneumonias
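
The duplicate-image leakage above is typically prevented by splitting at the patient level rather than the image level, so no patient contributes images to both training and test sets. A minimal sketch using scikit-learn's GroupShuffleSplit, with placeholder data:

```python
from sklearn.model_selection import GroupShuffleSplit

# Placeholder data: in practice, one entry per image with its source patient.
image_paths = ["img_001.png", "img_002.png", "img_003.png", "img_004.png"]
labels      = [1, 1, 0, 0]                       # 1 = COVID-positive, 0 = control
patient_ids = ["pt_A", "pt_A", "pt_B", "pt_C"]   # pt_A contributed two images

# Splitting by patient keeps all of a patient's images on one side of the split,
# removing one common source of train/test leakage in imaging studies.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(image_paths, labels, groups=patient_ids))

train_patients = {patient_ids[i] for i in train_idx}
test_patients = {patient_ids[i] for i in test_idx}
assert train_patients.isdisjoint(test_patients)
```

Grouped splitting does not fix confounding by image source or positioning, but it removes the cheapest way for a model to look better than it is.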

Outcome: Wasted research resources, media misinformation (“AI can diagnose COVID!”), damaged AI credibility in radiology.

Lesson: Speed ≠ rigor. Beware confounders. Biological plausibility matters—if radiologists can’t reliably distinguish, AI likely can’t either. Pre-registration prevents cherry-picking.


4. Babylon Health Symptom Checker (2018-present)

What it promised: An AI chatbot providing medical advice that the company claimed performed “better than doctors.”

What went wrong:
- Independent evaluations showed poor diagnostic accuracy
- Missed serious conditions and recommended inappropriate self-care
- Safety concerns raised by physicians and researchers
- Regulatory scrutiny in the UK

Why it failed (partially):
- Overconfident claims not supported by evidence
- Black-box algorithm with limited transparency
- Low-quality validation studies (retrospective, cherry-picked cases)
- Conflict of interest: company-funded studies and independent evaluations reached different results

Outcome: Still operational but reputation damaged. Highlights risks of unregulated AI symptom checkers.

Lesson: Independent validation essential. Conflict of interest in AI evaluation is real. Symptom checkers should augment, not replace, clinical assessment.


5. Optum/UnitedHealth Algorithmic Bias (2019)

What it promised: AI predicting which patients need extra care coordination, targeting high-risk individuals for interventions.

What went wrong:
- A 2019 Science paper revealed racial bias: for the same level of illness, Black patients were scored as lower risk than white patients
- As a result, Black patients were less likely to receive needed care coordination
- The bias affected millions of patients

Why it failed:
- The algorithm used healthcare costs as a proxy for health needs
- Black patients historically receive less care (structural racism in healthcare), and therefore generate lower costs
- The AI learned that lower costs meant healthier, when lower costs often meant under-treatment
- Developers did not test for racial disparities before deployment (a basic audit is sketched below)
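
A simple pre-deployment audit is to compare a direct measure of illness burden across groups at each level of predicted risk; if one group is sicker than another at the same score, the score is under-ranking that group's need. A minimal sketch in pandas, with hypothetical column names (risk_score, active_chronic_conditions, race):

```python
import pandas as pd

def illness_by_score_decile(df: pd.DataFrame,
                            score_col: str = "risk_score",
                            need_col: str = "active_chronic_conditions",
                            group_col: str = "race") -> pd.DataFrame:
    """Mean illness burden within each risk-score decile, broken out by group.

    If one group carries more illness than another at the same score,
    the score is under-ranking that group's actual need.
    """
    out = df.copy()
    out["score_decile"] = pd.qcut(out[score_col], 10, labels=False, duplicates="drop")
    return (out.groupby(["score_decile", group_col])[need_col]
               .mean()
               .unstack(group_col))
```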

Outcome: Optum revised algorithm after exposure. Sparked national conversation on algorithmic bias in healthcare.

Lesson: Historical data reflects historical biases. Proxies (cost) may not align with goals (health need). Test AI performance across demographic groups before deployment. Equity audits essential.


6. Amazon Rekognition (Medical Imaging Application Attempt, 2019)

What it promised: General-purpose image recognition AI adapted for medical imaging.

What went wrong:
- Radiologists tested Amazon Rekognition on medical images, and it performed terribly
- It missed obvious pathology and flagged normal findings
- It was never intended for medical use, yet was marketed as applicable

Why it failed:
- General computer vision ≠ medical imaging expertise
- No domain-specific training on medical images
- Misapplication of technology to an inappropriate domain

Outcome: Attempt abandoned. Illustrated dangers of applying consumer AI to medicine without validation.

Lesson: Medical AI requires medical training data and clinical expertise. General-purpose AI doesn’t automatically transfer to healthcare.


7. Theranos Blood Testing (2003-2018)

Not AI-specific, but instructive:

What it promised: Revolutionary blood testing from finger-prick samples, hundreds of tests from tiny volumes.

What went wrong:
- The technology didn’t work as claimed
- The company misled investors, regulators, and patients
- Founder Elizabeth Holmes was convicted of fraud (2022)

Why it is relevant to AI:
- The pattern of hype, secrecy, and insufficient validation parallels some AI ventures
- “Black box” technology claims without transparent evidence
- Regulatory failure to demand proof before widespread use

Lesson: Demand transparency. Secrecy is a red flag. Insist on independent validation before extending trust. Regulatory oversight is necessary.


8. DeepMind Streams (Renal Deterioration Alert, 2016-2019)

What it promised: AI-powered app alerting clinicians to acute kidney injury (AKI), integrated with UK NHS.

What went wrong (partially):
- Data privacy violations: DeepMind/Google accessed 1.6 million patient records without proper consent
- The UK Information Commissioner ruled that the data-sharing arrangement breached data protection law (2017)
- Clinical utility was unclear: alerts did not clearly improve outcomes
- The project was discontinued (2019)

Why it failed:
- Prioritized technology development over patient privacy and consent
- Inadequate regulatory compliance
- Clinical benefit not demonstrated despite hype

Lesson: Privacy and consent are foundational, not afterthoughts. Technology must prove clinical benefit, not just technical feasibility.


9. Sepsis Prediction Models (Multiple Vendors, 2015-present)

Beyond Epic, this is a general problem:

What they promised: Early sepsis detection from EHR data (vital signs, labs).

What went wrong:
- Multiple models show poor prospective performance despite strong retrospective results
- High false positive rates → alert fatigue
- Uncertain clinical benefit (early antibiotics for “pre-sepsis” may not improve outcomes)

Why they keep failing:
- The sepsis definition is ambiguous and varies across institutions
- Look-ahead bias is common: models learn from data collected after clinicians already suspected sepsis
- Outcome labels are noisy (billing codes, retrospective chart review)
- Sepsis is a heterogeneous syndrome, and AI struggles with that clinical complexity
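
Short of a prospective trial, a temporally held-out evaluation (train on earlier admissions, test on later ones) at least exposes some of the drift that a random retrospective split hides. A minimal sketch, assuming a DataFrame with an admission_time column (all names are placeholders):

```python
import pandas as pd

def temporal_split(df: pd.DataFrame,
                   time_col: str = "admission_time",
                   cutoff: str = "2022-01-01"):
    """Train on admissions before the cutoff date, evaluate on those after.

    A time-based split surfaces problems that a random retrospective split
    hides: documentation drift, changing case mix, and new order sets.
    """
    cutoff_ts = pd.Timestamp(cutoff)
    train = df[df[time_col] < cutoff_ts]
    test = df[df[time_col] >= cutoff_ts]
    return train, test
```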

Outcome: Ongoing controversy. Some models withdrawn, others revised. Evidence for clinical benefit remains limited.

Lesson: Complex, heterogeneous syndromes are hard for AI. Label quality critical. Prospective validation with clinical outcomes (not just algorithmic accuracy) essential.


Common Themes in AI Failures

1. Validation Failures:
- Retrospective studies only, no prospective evaluation
- Single-site development, no external validation
- Cherry-picked datasets
- Insufficient sample sizes

2. Data Problems:
- Biased, non-representative training data
- Unrecognized confounders
- Noisy labels (billing codes, retrospective annotation)
- Data leakage (overlap between training and test sets; see the sketch after this list)

3. Overfitting and Poor Generalization:
- Models memorize specifics of the training data
- They fail when applied to different populations, institutions, or timepoints

4. Lack of Transparency:
- Black-box algorithms
- Proprietary secrecy prevents independent evaluation
- Vendor claims not backed by peer-reviewed evidence

5. Hype and Conflicts of Interest:
- Marketing exceeds evidence
- Company-funded studies diverge from independent evaluations
- Media amplifies claims without scrutiny

6. Inadequate Clinical Grounding:
- Developers lack domain expertise
- AI solutions to non-existent problems
- Technology-first rather than problem-first approach

7. Regulatory and Oversight Gaps:
- Insufficient FDA scrutiny (especially for CDSS)
- Privacy violations
- Deployment before adequate safety evidence
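
As a cheap safeguard against the data-leakage theme above (item 2), a split can be audited explicitly before any metric is reported. A minimal sketch, assuming each record carries a patient identifier (names are placeholders):

```python
def assert_no_overlap(train_ids, test_ids, label: str = "patient_id") -> None:
    """Fail loudly if any identifier appears on both sides of a split."""
    shared = set(train_ids) & set(test_ids)
    if shared:
        raise ValueError(
            f"{len(shared)} {label} values appear in both training and test sets, "
            f"e.g. {sorted(shared)[:5]}")

# Typical use, before reporting any metric:
# assert_no_overlap(train_df["patient_id"], test_df["patient_id"])
# assert_no_overlap(train_df["image_id"], test_df["image_id"], label="image_id")
```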


How to Spot Warning Signs

Red flags suggesting potential failure:

  1. Extraordinary claims without extraordinary evidence (“Better than doctors,” “Revolutionary”)
  2. Lack of peer-reviewed publications (only vendor white papers, press releases)
  3. Retrospective validation only (no prospective studies)
  4. Single-site development and testing (no external validation)
  5. Black-box with no transparency (“Proprietary algorithm,” refuses to disclose methods)
  6. Conflict of interest (only company-funded studies, no independent evaluation)
  7. Deployment before validation (widespread use without proof of benefit)
  8. No discussion of limitations (every AI has failure modes—if none mentioned, red flag)
  9. Privacy concerns (inadequate consent, data governance)
  10. Regulatory evasion (marketed as “wellness,” “not a medical device” to avoid FDA oversight)

Conclusion

Failures teach as much as successes. Common threads: inadequate validation, biased data, overfitting, lack of transparency, hype exceeding evidence. Physicians must recognize these patterns and demand better.

Key principles to avoid failures:
- Evidence before adoption: prospective, external validation required
- Transparency: understand how the AI works, what data trained it, and where it fails
- Equity: test performance across demographics and address biases
- Clinical grounding: involve clinicians from conception through deployment
- Continuous monitoring: detect performance drift and safety issues
- Healthy skepticism: extraordinary claims require extraordinary proof

The future of medical AI depends on learning from past failures and building systems that prioritize patient welfare over technological novelty or commercial gain.