Appendix E — The Clinical AI Morgue

TL;DR

Major failures documented: - IBM Watson for Oncology: Unsafe chemo recommendations, synthetic training data - Epic Sepsis Model: 33% sensitivity and 12% PPV in external validation, far below vendor claims - Babylon Health: Missed serious diagnoses, regulatory concerns - Optum algorithm: Systematic racial bias in care allocation

Warning signs that predict failure: - Internal validation only - No prospective outcome studies - “AI diagnoses everything” claims - Workflow integration ignored - No fairness testing

Key lesson: High AUC does not equal clinical utility. Demand outcome data.

Warning: Failure Case Studies Ahead

This appendix documents $100M+ in failed AI investments, thousands of wasted physician hours, and preventable patient safety incidents. Reading these failures is uncomfortable but essential.

Why we study failures: - Failures teach more than successes - Common patterns emerge across failures - You can avoid repeating these mistakes - Your patients’ safety depends on your vigilance

What you’ll find: - 15+ detailed failure case studies - Root cause analyses - Quantified harm (financial, patient outcomes, physician trust) - Lessons learned frameworks - Red flags to recognize in vendor pitches


Introduction: The $100M+ Graveyard

Learning from failures is as important as celebrating successes. This appendix catalogs notable medical AI failures: what went wrong, why, and lessons for physicians evaluating future AI systems.

The scope of AI failures in medicine (2010-2025): - IBM Watson for Oncology: $62M+ spent by MD Anderson alone; project failed - Epic Sepsis Model: Deployed in 100+ hospitals; 33% sensitivity (missed 67% of sepsis cases), 12% PPV in external validation - Google Health diabetic retinopathy screening in Thailand: 55% of images ungradable in the field; pilot abandoned - Babylon Health: Safety concerns, missed diagnoses; regulatory scrutiny - Optum algorithmic bias: Affected millions of patients; systematically underestimated Black patients’ health needs

Total estimated waste: >$100M in licensing fees, implementation costs, and physician time

Patient impact: Thousands of missed diagnoses, inappropriate treatments, delayed care

Physician impact: Eroded trust in AI, alert fatigue, wasted training time

Understanding these failures helps physicians recognize warning signs and demand better AI.


1. IBM Watson for Oncology (2013-2021)

What it promised: AI “cognitive computing” analyzing millions of journal articles to recommend personalized cancer treatments.

What went wrong: - Recommendations contradicted evidence-based guidelines - Suggested dangerous treatments (e.g., chemotherapy for bleeding patients, which is contraindicated) - Trained on synthetic hypothetical cases, not real patient outcomes - Failed to generalize internationally (trained on U.S. data, deployed globally) - Expensive (millions in licensing), poor usability

Outcome: Multiple hospitals canceled contracts (2017-2019). IBM sold its Watson Health assets (2022), effectively acknowledging the failure to deliver value.

Why it failed: - Hype exceeded capability (Jeopardy! success didn’t translate to medicine) - Lack of deep clinical expertise in development - Insufficient validation before widespread marketing - Synthetic training data captured idealized scenarios, not clinical complexity - Business model prioritized revenue over patient benefit

Lesson: Marketing ≠ medical reality. Demand rigorous validation, clinical grounding, and transparency.


2. Epic Sepsis Model (2015-present)

What it promised: EHR-embedded AI predicting sepsis 6-12 hours before onset, enabling early intervention.

What went wrong: - 2021 external validation at the University of Michigan (Wong et al., JAMA Internal Medicine): AUC 0.63 versus the vendor-claimed 0.76-0.83 - Sensitivity 33%: 67% of sepsis cases never triggered an alert - Identified only 7% of sepsis cases that clinicians had not already recognized - High false positive rate → alert fatigue, ignored warnings - Performance far below vendor claims

Why it failed: - Trained on retrospectively identified sepsis cases (billing codes, which are noisy labels) - Look-ahead bias: Model likely learned from data collected after sepsis onset (tests ordered because sepsis suspected) - Sepsis definition inconsistency across institutions - No prospective external validation before widespread deployment - Lack of transparency (Epic didn’t disclose algorithm details initially)

Outcome: Still deployed in many Epic-using hospitals (embedded in EHR, difficult to remove). Eroded clinician trust in AI alerts.

Lesson: External validation before deployment is non-negotiable. Retrospective performance ≠ prospective utility. Transparent reporting essential.
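A concrete way to see the look-ahead problem: features must be restricted to data that existed before the moment the model is supposed to fire. The minimal sketch below assumes a hypothetical event-level table (columns patient_id, event_time, value, sepsis_onset_time, landmark_time); it illustrates time-censoring of features, not Epic's or any vendor's actual pipeline.

```python
# Minimal sketch of guarding against look-ahead bias: build features only from
# events recorded before each patient's prediction cutoff. All column names are
# hypothetical placeholders for illustration.
import pandas as pd

def censor_features(events: pd.DataFrame, horizon_hours: float = 6.0) -> pd.DataFrame:
    # Cutoff: `horizon_hours` before sepsis onset; for patients who never
    # develop sepsis, fall back to a pre-chosen landmark time.
    cutoff = events["sepsis_onset_time"] - pd.Timedelta(hours=horizon_hours)
    cutoff = cutoff.fillna(events["landmark_time"])
    visible = events[events["event_time"] < cutoff]
    # Aggregate only the visible (pre-cutoff) events into one row per patient.
    return (visible.groupby("patient_id")["value"]
            .agg(["mean", "max", "count"])
            .reset_index())
```

If a model's performance collapses once features are censored this way, its retrospective numbers were partly an artifact of information that would not exist at decision time.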


3. COVID-19 Chest X-Ray AI (2020-2021)

What it promised: Rapid COVID-19 diagnosis from chest X-rays, supplementing scarce PCR testing.

What went wrong: - A 2021 systematic review found that, of 2,000+ published models, none were suitable for clinical use - Essentially all had critical methodological flaws or underlying biases - Few were deployed clinically; those that were performed poorly and were withdrawn

Why they failed: - Training data biases: COVID-positive images from different sources/equipment than controls. AI learned image source, not pathology - Confounding: COVID patients supine (portable X-rays), controls standing (PA/lateral). AI detected positioning - Data leakage: Duplicate images in training and test sets - Lack of external validation - Biological implausibility: CXR findings in COVID-19 non-specific (similar to other viral pneumonias)

Outcome: Wasted research resources, media misinformation (“AI can diagnose COVID!”), damaged AI credibility in radiology.

Lesson: Speed ≠ rigor. Beware confounders. Biological plausibility matters: if radiologists can’t reliably distinguish, AI likely can’t either. Pre-registration prevents cherry-picking.
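One cheap confounding check follows directly from this failure mode: if acquisition metadata alone can "predict" the diagnosis, the labels are entangled with where and how the images were taken, and an image model can learn the same shortcut. The sketch below assumes a hypothetical metadata table with columns source_site, projection, patient_position, and covid_positive; it is a screening heuristic, not a published protocol.

```python
# Confounding probe: can acquisition metadata alone separate cases from controls?
# A high AUC here is a red flag that an image model may learn the data source,
# not the pathology. Column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def metadata_only_auc(df: pd.DataFrame, label_col: str = "covid_positive") -> float:
    X = pd.get_dummies(df[["source_site", "projection", "patient_position"]])
    y = df[label_col]
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```

An AUC near 0.5 does not prove the dataset is clean, but an AUC near 1.0 from metadata alone is strong evidence of shortcut risk.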


4. Babylon Health Symptom Checker (2018-2023)

What it promised: AI chatbot providing medical advice, “better than doctors” (company claims).

What went wrong: - Independent evaluations showed poor diagnostic accuracy - Missed serious conditions, recommended inappropriate self-care - Safety concerns raised by physicians, researchers - Regulatory scrutiny in UK

Why it failed (partially): - Overconfident claims not supported by evidence - Black-box algorithm, limited transparency - Validation studies low quality (retrospective, cherry-picked cases) - Conflict of interest: Company-funded studies vs. independent evaluations showed different results

Outcome: Babylon collapsed into insolvency in 2023 after mounting losses and scrutiny, and its UK services were transferred to other providers. The episode highlights the risks of unregulated AI symptom checkers.

Lesson: Independent validation essential. Conflict of interest in AI evaluation is real. Symptom checkers should augment, not replace, clinical assessment.


5. Optum/UnitedHealth Algorithmic Bias (2019)

What it promised: AI predicting which patients need extra care coordination, targeting high-risk individuals for interventions.

What went wrong: - A 2019 Science paper (Obermeyer et al.) revealed racial bias: for the same level of illness, Black patients were scored as lower risk than white patients - Result: Black patients were less likely to receive needed care coordination - The bias affected millions of patients

Why it failed: - Algorithm used healthcare costs as proxy for health needs - Black patients historically receive less care (structural racism in healthcare) → lower costs - AI learned: “Lower costs = healthier,” when actually lower costs often meant under-treatment - Developers didn’t test for racial disparities before deployment

Outcome: Optum revised algorithm after exposure. Sparked national conversation on algorithmic bias in healthcare.

Lesson: Historical data reflects historical biases. Proxies (cost) may not align with goals (health need). Test AI performance across demographic groups before deployment. Equity audits essential.


6. Amazon Rekognition (Medical Imaging Application Attempt, 2019)

What it promised: General-purpose image recognition AI adapted for medical imaging.

What went wrong: - Radiologists tested Amazon Rekognition on medical images. It performed terribly - Missed obvious pathology, flagged normal findings - Never intended for medical use, but marketed as applicable

Why it failed: - General computer vision ≠ medical imaging expertise - No domain-specific training on medical images - Misapplication of technology to inappropriate domain

Outcome: Attempt abandoned. Illustrated dangers of applying consumer AI to medicine without validation.

Lesson: Medical AI requires medical training data and clinical expertise. General-purpose AI doesn’t automatically transfer to healthcare.


7. Theranos Blood Testing (2003-2018)

Not AI-specific, but instructive:

What it promised: Revolutionary blood testing from finger-prick samples, hundreds of tests from tiny volumes.

What went wrong: - Technology didn’t work as claimed - Company misled investors, regulators, patients - Founder Elizabeth Holmes convicted of fraud (2022)

Why relevant to AI: - Pattern of hype, secrecy, insufficient validation parallels some AI ventures - “Black box” technology claims without transparent evidence - Regulatory failure to demand proof before widespread use

Lesson: Demand transparency. Secrecy is red flag. Independent validation before trust. Regulatory oversight necessary.


8. DeepMind Streams (Renal Deterioration Alert, 2016-2019)

What it promised: AI-powered app alerting clinicians to acute kidney injury (AKI), integrated with UK NHS.

What went wrong: - Data privacy violations: DeepMind/Google accessed 1.6 million patient records without an adequate legal basis or patient consent - UK Information Commissioner ruled the Royal Free’s data-sharing arrangement breached data protection law (2017) - Clinical utility unclear: alerts didn’t clearly improve outcomes - Project discontinued (2019)

Why it failed: - Prioritized technology development over patient privacy, consent - Regulatory compliance inadequate - Clinical benefit not demonstrated despite hype

Lesson: Privacy and consent are foundational, not afterthoughts. Technology must prove clinical benefit, not just technical feasibility.


9. Sepsis Prediction Models (Multiple Vendors, 2015-present)

Beyond Epic (general problem):

What they promised: Early sepsis detection from EHR data (vital signs, labs).

What went wrong: - Multiple models show poor prospective performance despite strong retrospective results - High false positive rates → alert fatigue - Uncertain clinical benefit (early antibiotics for “pre-sepsis” may not improve outcomes)

Why they keep failing: - Sepsis definition ambiguous, varies across institutions - Look-ahead bias common (models learn from data collected after clinicians suspected sepsis) - Outcome labels noisy (billing codes, retrospective chart review) - Sepsis is a heterogeneous syndrome: AI struggles with its clinical complexity

Outcome: Ongoing controversy. Some models withdrawn, others revised. Evidence for clinical benefit remains limited.

Lesson: Complex, heterogeneous syndromes are hard for AI. Label quality critical. Prospective validation with clinical outcomes (not just algorithmic accuracy) essential.


10. LLM Benchmark vs. Reasoning Gap (2025)

What was promised: LLMs achieving near-perfect accuracy on medical benchmarks like MedQA demonstrate clinical reasoning capability ready for deployment.

What went wrong: - 2025 study tested whether high benchmark scores reflect genuine reasoning or pattern matching - When correct answers replaced with “None of the Other Answers” (NOTA), LLM accuracy dropped dramatically - GPT-4o: 85% → 49% (36% drop) - Claude-3.5 Sonnet: 88% → 62% (26% drop) - Llama-3.3-70B: 81% → 43% (38% drop) - Even reasoning-focused models degraded: DeepSeek-R1 (9% drop), o3-mini (16% drop)

Why it matters: - Benchmark performance may significantly overstate clinical reasoning capability - Novel clinical presentations require reasoning beyond memorized patterns - A system whose accuracy drops from 81% to 43% when familiar answer patterns are disrupted would be unreliable clinically

Outcome: Calls for robustness testing before clinical deployment; benchmark accuracy alone insufficient evidence for clinical use.

Lesson: High benchmark scores don’t guarantee clinical reasoning capability. Test LLMs with unfamiliar scenarios. Reasoning-focused models show promise but aren’t immune. Maintain physician oversight (Bedi et al., 2025).
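The robustness probe can be approximated in a few lines: replace the correct option's text with "None of the other answers" and re-score. The sketch below uses a hypothetical item schema and a placeholder query_model callable standing in for whatever LLM interface is being evaluated; it illustrates the idea, not the exact protocol of Bedi et al.

```python
# NOTA-style robustness probe (sketch). `query_model(stem, options)` is a
# hypothetical callable that returns the letter of the option the LLM picks.
def perturb_nota(item: dict) -> dict:
    """item = {'stem': str, 'options': {'A': str, ...}, 'answer': 'C'} (hypothetical schema).
    Replacing the correct option's text keeps the same letter correct."""
    out = {**item, "options": dict(item["options"])}
    out["options"][out["answer"]] = "None of the other answers"
    return out

def accuracy(items, query_model) -> float:
    return sum(query_model(q["stem"], q["options"]) == q["answer"] for q in items) / len(items)

def robustness_gap(items, query_model) -> float:
    # A large drop on perturbed items suggests pattern matching rather than reasoning.
    return accuracy(items, query_model) - accuracy([perturb_nota(q) for q in items], query_model)
```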


Common Themes in AI Failures

These failures share a common pattern that the Institute of Medicine identified in To Err is Human (Kohn et al., 2000): they are system failures, not individual failures. When IBM Watson recommended unsafe treatments, the failure was not a single engineer’s mistake. It was a system that prioritized marketing over clinical validation, trained on synthetic rather than real-world data, and deployed without adequate physician oversight. When clinicians ignore sepsis alerts, the failure is not physician negligence. It is a system that produced too many false positives, integrated poorly with clinical workflow, and lacked mechanisms for feedback and improvement.

The IOM’s central insight, that “good people working in bad systems” cause most medical errors, applies directly to AI. Fixing AI failures requires redesigning systems, not blaming individuals.

1. Validation Failures: - Retrospective only, no prospective - Single-site, no external validation - Cherry-picked datasets - Insufficient sample sizes

2. Data Problems: - Biased, non-representative training data - Confounders not recognized - Noisy labels (billing codes, retrospective annotation) - Data leakage (overlap between training and test)

3. Overfitting and Poor Generalization: - Models memorize training data specifics - Fail when applied to different populations, institutions, timepoints

4. Lack of Transparency: - Black-box algorithms - Proprietary secrecy prevents independent evaluation - Vendor claims not backed by peer-reviewed evidence

5. Hype and Conflicts of Interest: - Marketing exceeds evidence - Company-funded studies vs. independent evaluations - Media amplifies claims without scrutiny

6. Inadequate Clinical Grounding: - Developers lack domain expertise - AI solutions to non-existent problems - Technology-first, not problem-first approach

7. Regulatory and Oversight Gaps: - Insufficient FDA scrutiny (especially for CDSS) - Privacy violations - Deployment before adequate safety evidence


How to Spot Warning Signs

Red flags suggesting potential failure:

  1. Extraordinary claims without extraordinary evidence (“Better than doctors,” “Revolutionary”)
  2. Lack of peer-reviewed publications (only vendor white papers, press releases)
  3. Retrospective validation only (no prospective studies)
  4. Single-site development and testing (no external validation)
  5. Black-box with no transparency (“Proprietary algorithm,” refuses to disclose methods)
  6. Conflict of interest (only company-funded studies, no independent evaluation)
  7. Deployment before validation (widespread use without proof of benefit)
  8. No discussion of limitations (every AI has failure modes. If none mentioned, red flag)
  9. Privacy concerns (inadequate consent, data governance)
  10. Regulatory evasion (marketed as “wellness,” “not a medical device” to avoid FDA oversight)


11. MD Anderson IBM Watson Partnership (2013-2016)

What it promised: Watson would revolutionize cancer care at one of world’s premier cancer centers.

The investment: $62 million over 3 years

What went wrong: - MD Anderson launched “Oncology Expert Advisor” powered by Watson - Watson couldn’t integrate with MD Anderson’s EHR systems - Training required manual data entry by physicians (hours per case) - Recommendations contradicted MD Anderson’s own treatment protocols - No prospective validation before launch - Project never saw a single patient clinically

Outcome: Project canceled 2016 after $62M spent. Watson never deployed for patient care. MD Anderson leadership turnover.

Why it failed: - Technical immaturity: Watson wasn’t ready for real-world clinical deployment - Poor needs assessment: Didn’t solve actual physician pain points - No physician co-design: Built by computer scientists, not oncologists - Sunk cost fallacy: Continued investing despite early warning signs - Hype over evidence: Brand name (IBM) + marketing overshadowed lack of validation

The physician impact: - Dozens of oncologists spent hundreds of hours training Watson - Opportunity cost: $62M could have funded hundreds of nurse-years of oncology staffing - Damaged credibility of AI among MD Anderson physicians

Lesson for physicians: Big brand ≠ Working product. Demand prospective validation before your time is wasted.


12. Google Health Diabetic Retinopathy AI in Thailand (2018-2019)

What it promised: 96% accurate diabetic retinopathy screening using portable cameras in rural Thailand clinics.

What went wrong: - Lab performance: 96% accuracy with research-grade retinal cameras - Field performance: 55% of images “ungradable” with portable cameras used in real clinics - Nurses couldn’t consistently capture quality images despite 2-hour training - System required stable internet (unreliable in rural Thailand) - Workflow disruption: 5 minutes per patient overwhelmed clinics - No offline mode for internet outages

Outcome: Pilot discontinued. Clinics returned to traditional screening.

Why it failed: - Lab-to-field gap: Didn’t test in real-world conditions before deployment - Inadequate workflow analysis: Didn’t understand clinic time constraints - Training insufficient: 2 hours not enough for nurses to master new camera technique - Infrastructure assumptions: Assumed reliable internet (not available) - User research failure: Didn’t involve nurses in design

The physician impact: - Ophthalmologists’ time wasted reviewing ungradable images - Patient frustration with failed screening attempts - Delayed care for patients (5 min/patient × hundreds = hours of clinic disruption)

Lesson for physicians: Lab performance ≠ Field performance. Demand real-world pilot data in conditions matching YOUR practice.


13. PathAI Breast Cancer Detection (Early Deployment Issues, 2017-2018)

What it promised: AI pathology assistant detecting breast cancer metastases more accurately than pathologists.

Initial problems (later corrected): - High false positive rate in early deployment - AI flagged normal lymph nodes as suspicious - Pathologists spent extra time reviewing AI false alarms - Initial training data from single institution (limited generalizability)

Outcome: Company iterated, improved model, conducted multi-site validation. Now successfully deployed at multiple institutions.

Why initial deployment struggled: - Single-site training data: Didn’t generalize to other institutions’ staining protocols - Threshold calibration: False positive rate not optimized for clinical workflow - Insufficient external validation before broad deployment

The physician impact (early deployment): - Pathologists reported increased workload from false positives - Alert fatigue risk

What they did right (eventually): - Listened to pathologist feedback - Conducted multi-institution validation - Adjusted thresholds based on real-world use - Now one of more successful pathology AI systems

Lesson for physicians: Even good companies make mistakes in early deployment. Demand pilot data from YOUR institution before full rollout. But also: Failure → Iteration → Success is possible with responsive vendors.


14. LLM Hallucination: The Fabricated Chemotherapy Protocol (2023)

What happened: - Physician used ChatGPT-4 to verify pediatric ALL (acute lymphoblastic leukemia) chemotherapy dosing - LLM generated plausible-sounding but INCORRECT protocol: - Methotrexate: LLM suggested 50 mg/m² → Actual protocol: 5 g/m² (100x error!) - Vincristine: LLM suggested weekly → Actual: Every 3 weeks during consolidation - Dexamethasone: LLM suggested 5 days → Actual: 28 days

What could have happened: - If physician hadn’t verified against authoritative source: - 1% of intended methotrexate dose → Treatment failure, disease progression - Excessive vincristine → Neurotoxicity (peripheral neuropathy, paralytic ileus) - Insufficient steroid duration → CNS relapse risk

Outcome: Error caught before patient harm (physician verified against pharmacy protocol). Case reported in medical literature as warning.

Why it happened: - LLM hallucination: GPT-4 generates plausible but false information - No medical database access: LLM relied on training data (potentially outdated or wrong) - Confidence without competence: LLM presented wrong answer authoritatively

Lesson for physicians: NEVER use LLMs for medication dosing without rigorous verification against authoritative sources (Lexicomp, pharmacy protocols, guidelines). LLMs hallucinate. Your patients will pay the price.


15. Sepsis Alert Fatigue at [Anonymous Community Hospital] (2019-2020)

Vendor: EHR-embedded sepsis prediction model

What it promised: Early sepsis detection, 6-12 hours before clinical recognition, 85% sensitivity.

What happened: - Hospital deployed sepsis alert system in all ICUs and medical floors - Alert rate: 15-20 alerts per day per 30-bed unit - False positive rate: 65-70% (13 out of 20 alerts were false) - Physician response: - Week 1-2: Diligently evaluated every alert - Week 3-4: Started ignoring obvious false positives - Month 2: Response rate dropped to 30% (alert fatigue) - Month 3: Physicians requested system be turned off

The missed case: - Month 3: Real sepsis case, AI alerted correctly - Physician ignored alert (assumed false positive based on pattern) - Patient deteriorated overnight, required ICU transfer - No mortality, but preventable deterioration

Outcome: Hospital deactivated sepsis AI after 4 months. Returned to traditional sepsis screening (qSOFA, SIRS criteria).

Why it failed: - No threshold customization: One-size-fits-all sensitivity → High false positives - No local validation: Vendor’s sensitivity/specificity didn’t match hospital’s patient population - Alert fatigue inevitable: 70% false positive rate is unsustainable - No physician input: Deployed top-down without ED/ICU physician buy-in

The physician impact: - Residents/hospitalists spent ~30 min/day evaluating false alerts (Month 1-2) - Eroded trust in future AI alerts - Workflow disruption

Lesson for physicians: Demand false positive rate data before deployment. If FP rate >20%, system will cause alert fatigue. Insist on customizable thresholds for YOUR patient population.
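The alert burden is predictable arithmetic before deployment. The sketch below uses illustrative numbers (not this vendor's figures) to show how respectable-sounding sensitivity and specificity still produce an alert stream that is mostly false positives on a unit with low daily sepsis incidence.

```python
# Back-of-envelope alert burden. All inputs are illustrative assumptions; the
# crude simplification is one screening decision per bed per day.
def alert_burden(beds=30, daily_sepsis_incidence=0.02, sensitivity=0.85,
                 specificity=0.90, minutes_per_alert=10):
    true_cases = beds * daily_sepsis_incidence
    non_cases = beds - true_cases
    true_alerts = sensitivity * true_cases
    false_alerts = (1 - specificity) * non_cases
    total_alerts = true_alerts + false_alerts
    return {
        "alerts_per_day": round(total_alerts, 1),
        "ppv": round(true_alerts / total_alerts, 2),
        "false_alert_minutes_per_day": round(false_alerts * minutes_per_alert),
    }

# Example: 30 beds, 2% daily incidence, 85% sensitivity, 90% specificity
# -> about 3.5 alerts/day with PPV ~0.15 (roughly 85% of alerts are false).
print(alert_burden())
```

Run this arithmetic with the vendor's claimed sensitivity and specificity and your own unit's prevalence before signing anything; the share of false alerts it implies is the number to hold against whatever false positive rate the vendor quotes.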


Part 3: Emerging AI Failures (2023-2024)

16. LLM-Generated Patient Education Handouts (2023)

What happened: - Physician used ChatGPT to generate patient education handout on warfarin management - LLM output looked professional, included dietary restrictions, monitoring advice - Errors in LLM output: - Recommended “avoid all leafy greens” → Wrong! Should be consistent intake, not avoidance - Omitted critical drug interactions (NSAIDs, antibiotics) - Included outdated INR target range (2.5-3.5 for all patients) → Varies by indication

Outcome: Physician reviewed carefully, caught errors, rewrote handout. But many physicians may not catch LLM mistakes.

Lesson: LLMs can generate plausible but medically incorrect patient education. Always verify against authoritative sources (UpToDate, specialty society guidelines) before giving to patients.


17. AI Triage in Urgent Care (2023-2024)

Vendor: Symptom checker AI for urgent care front-desk triage

What it promised: Prioritize high-acuity patients, reduce wait times.

What went wrong: - 67-year-old with “chest tightness” triaged as low priority by AI (scored musculoskeletal pain) - Patient waited 90 minutes before physician evaluation - Diagnosis: NSTEMI (non-ST elevation myocardial infarction) - Door-to-EKG time: 95 minutes (should be <10 minutes)

Outcome: Patient survived (troponin elevated, underwent PCI), but delayed care. Near-miss for patient harm.

Why it failed: - AI trained on younger population (underweighted cardiac risk in elderly) - Atypical presentation: “Tightness” not “crushing pain” → AI didn’t recognize - No human override: Front desk staff followed AI score without physician judgment

Lesson: AI triage must have human physician oversight for high-risk presentations. Never let AI alone determine priority for chest pain, dyspnea, neurological symptoms.


Part 4: Consolidated Lessons Learned

Pattern 1: The Validation Failures

Common root causes: - Retrospective validation only (no prospective) - Single-site validation (doesn’t generalize) - Internal validation only (vendor’s own hospitals) - Cherry-picked datasets - Small sample sizes - Overfitting to development data

Examples: - Epic Sepsis Model (internal validation only) - IBM Watson (no prospective validation) - COVID-19 Chest X-Ray AI (methodological flaws across 2,000+ models)

Red flags for physicians: - AVOID: “Validated on 100,000 patients” - but all from same hospital - AVOID: “98% accuracy” - on cherry-picked test set - AVOID: “Deployed in 150+ hospitals” - deployment ≠ effectiveness - AVOID: No peer-reviewed publication in major medical journal

What to demand: - Good: External validation at ≥3 independent hospitals - Good: Prospective validation (not just retrospective) - Good: Publication in JAMA, NEJM, Lancet, or specialty journal - Good: Validation in patient population similar to YOURS
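When a vendor quotes a single headline number, a reasonable counter is to re-estimate the metrics on your own external cohort with confidence intervals. A minimal sketch, assuming hypothetical arrays y_true and y_pred from a local validation sample:

```python
# Bootstrap confidence intervals for clinically meaningful metrics on an
# external (local) cohort, instead of trusting a single headline figure.
import numpy as np

def sensitivity(y_true, y_pred):
    return ((y_true == 1) & (y_pred == 1)).sum() / max((y_true == 1).sum(), 1)

def ppv(y_true, y_pred):
    return ((y_true == 1) & (y_pred == 1)).sum() / max((y_pred == 1).sum(), 1)

def metric_with_ci(y_true, y_pred, metric, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    point = metric(y_true, y_pred)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        boots.append(metric(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return point, (lo, hi)
```

Wide intervals from a small local sample are themselves informative: they tell you the pilot has not yet earned a hospital-wide rollout.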


Pattern 2: The Data Quality Failures

Common root causes: - Biased, non-representative training data - Confounders not recognized or controlled - Noisy labels (billing codes, retrospective chart review) - Data leakage (overlap between training and test sets) - Look-ahead bias (model learns from future data)

Examples: - Google Flu Trends (confounding: media coverage → searches ≠ illness) - COVID CXR AI (confounding: COVID patients supine, controls standing) - OPTUM (training data: Black patients historically undertreated → lower costs) - Epic Sepsis (noisy labels: billing codes retrospectively assigned)

Red flags for physicians: - AVOID: Training data from single institution - AVOID: Homogeneous population (academic medical center only, commercially insured only) - AVOID: Labels based on billing codes or retrospective chart review - AVOID: No discussion of confounders or data quality

What to demand: - Good: Diverse training data (multiple hospitals, demographics) - Good: Labels based on gold standard (not billing codes) - Good: Confounder analysis documented - Good: Data quality assessment reported
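Two of the data problems above, leakage and look-ahead bias, are often introduced by careless train/test splitting. A minimal sketch of a leakage-resistant split, assuming a hypothetical patient_id column and scikit-learn:

```python
# Patient-level splitting: all rows from one patient stay on the same side of
# the train/test boundary, so the model is never evaluated on a patient it has
# already seen. Column name is a hypothetical placeholder.
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(df, group_col="patient_id", test_size=0.2, seed=0):
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df[group_col]))
    return df.iloc[train_idx], df.iloc[test_idx]
```

A temporal split (train on earlier admissions, test on later ones) is a further step toward approximating prospective performance.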


Pattern 3: The Fairness Failures

Common root causes: - No fairness testing before deployment - Training data not representative of patient population - Proxy variables encode bias (costs, zip code, insurance type) - “Fairness through unawareness” (ignoring race doesn’t eliminate bias)

Examples: - OPTUM algorithmic bias (used costs as proxy for health needs) - PathAI early deployment (single-site data, limited demographic diversity)

Red flags for physicians: - AVOID: “We don’t use race as a feature, so it’s fair” - AVOID: No fairness audit conducted - AVOID: Performance not stratified by demographics - AVOID: Training data demographics not disclosed

What to demand: - Good: Independent fairness audit - Good: Sensitivity, specificity, PPV by race, ethnicity, age, sex, insurance - Good: Performance difference <10% across groups - Good: Plan for ongoing bias monitoring
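The stratified reporting demanded above is straightforward to compute if the vendor will hand over predictions. A minimal sketch, assuming hypothetical column names and the 10-percentage-point tolerance suggested above:

```python
# Subgroup audit: sensitivity/specificity/PPV per demographic group, flagging
# any metric whose max-min gap across groups exceeds a chosen tolerance.
import pandas as pd

def subgroup_audit(df, group_col="race_ethnicity", y_col="label", pred_col="pred",
                   max_gap=0.10):
    rows = []
    for group, g in df.groupby(group_col):
        tp = ((g[y_col] == 1) & (g[pred_col] == 1)).sum()
        fn = ((g[y_col] == 1) & (g[pred_col] == 0)).sum()
        fp = ((g[y_col] == 0) & (g[pred_col] == 1)).sum()
        tn = ((g[y_col] == 0) & (g[pred_col] == 0)).sum()
        rows.append({"group": group, "n": len(g),
                     "sensitivity": tp / max(tp + fn, 1),
                     "specificity": tn / max(tn + fp, 1),
                     "ppv": tp / max(tp + fp, 1)})
    report = pd.DataFrame(rows)
    for m in ("sensitivity", "specificity", "ppv"):
        report[f"{m}_gap_exceeds_tolerance"] = (report[m].max() - report[m].min()) > max_gap
    return report
```

Small subgroups deserve confidence intervals as well, but even this crude table is more than many deployed systems ever received.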


Pattern 4: The Workflow Integration Failures

Common root causes: - No physician user research - Designed in isolation from clinical workflows - Unrealistic time requirements - Poor EHR integration - Inadequate training - No customization for local workflows

Examples: - Google Health Thailand (5 min/patient workflow disruption) - MD Anderson Watson (manual data entry, no EHR integration) - PathAI early deployment (pathologist time wasted on false positives)

Red flags for physicians: - AVOID: “Plug and play” - healthcare is complex; no system is plug-and-play - AVOID: “No training needed” - AVOID: “Works with all EHRs” - AVOID: No physician usability testing

What to demand: - Good: Usability testing with physicians in YOUR specialty - Good: Time-motion studies (how much time does it add/save?) - Good: Native EHR integration (not manual data entry) - Good: Customizable to your workflows


Pattern 5: The Alert Fatigue Failures

Common root causes: - High false positive rate (>20%) - Alert thresholds not customized to local population - No physician input on alert design - Alert overload (too many alerts per day)

Examples: - Epic Sepsis Model (12% PPV in external validation: most alerts were false, while 67% of sepsis cases never triggered an alert) - Anonymous hospital sepsis AI (65-70% false positive rate)

Red flags for physicians: - AVOID: False positive rate not disclosed - AVOID: Alert rate (alerts/day) not disclosed - AVOID: No ability to customize thresholds

What to demand: - Good: False positive rate ≤20% (ideally ≤10%) - Good: Alert rate quantified and manageable - Good: Customizable thresholds for YOUR patient population - Good: Physician override capability without penalty
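Customizable thresholds only help if someone actually tunes them on local data. A minimal sketch of that tuning step, assuming hypothetical arrays of local validation scores and labels plus the number of patient-days they cover; the constraint values are examples to negotiate, not recommendations:

```python
# Local threshold tuning: pick the cutoff that maximizes sensitivity subject to
# caps on alert volume and a minimum PPV, using YOUR validation data rather than
# the vendor default. Inputs are illustrative; min_ppv=0.8 mirrors the
# "false positive rate <=20%" demand above.
import numpy as np

def choose_threshold(scores, labels, patient_days, max_alerts_per_day=5.0, min_ppv=0.8):
    scores, labels = np.asarray(scores), np.asarray(labels)
    best = None
    for t in np.linspace(0.05, 0.95, 19):
        alerts = scores >= t
        if alerts.sum() == 0:
            continue
        ppv = labels[alerts].mean()
        alerts_per_day = alerts.sum() / patient_days
        sens = (alerts & (labels == 1)).sum() / max((labels == 1).sum(), 1)
        if alerts_per_day <= max_alerts_per_day and ppv >= min_ppv:
            if best is None or sens > best["sensitivity"]:
                best = {"threshold": float(t), "sensitivity": float(sens),
                        "ppv": float(ppv), "alerts_per_day": float(alerts_per_day)}
    return best  # None => no threshold satisfies the constraints; don't deploy as-is
```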


Part 5: The Cost of Failure

Financial Cost

Direct costs: - IBM Watson contracts: $62M+ (MD Anderson alone) - Epic Sepsis deployments: ~$100K-$500K per hospital × 100+ hospitals = $10M-$50M - Google Health India pilot: ~$2M-$5M (estimated) - Total across all failures: >$100M

Indirect costs: - Physician time training on failed systems: thousands of hours - IT infrastructure for failed deployments: millions of dollars - Opportunity cost: Money spent on failed AI could have funded proven interventions (nurses, pharmacists, care coordinators)


Patient Safety Cost

Near-misses: - LLM chemotherapy dosing error (caught before harm) - AI triage NSTEMI delay (patient survived) - Sepsis alert ignored (preventable deterioration)

Actual harm: - IBM Watson unsafe chemotherapy recommendations (presented to physicians, even if caught before reaching patients) - Optum bias: Thousands of Black patients systematically denied needed care coordination

Missed diagnoses: - Epic Sepsis had 33% sensitivity, missing 67% of sepsis cases in external validation


Trust Cost

Physician trust in AI eroded: - After Epic Sepsis failures, physicians skeptical of future sepsis AI - After IBM Watson, oncologists wary of “AI treatment planning” - Alert fatigue from one failed system poisons future AI deployments

Patient trust damaged: - Babylon Health safety concerns → Patients distrusting symptom checkers - OPTUM bias exposure → Communities of color distrusting healthcare algorithms


Part 6: How to Avoid These Failures

For Individual Physicians

Before you use a new AI tool in your practice:

  1. Read the validation study
    • Peer-reviewed publication in major journal?
    • External validation at ≥3 hospitals?
    • Prospective or retrospective?
    • Performance metrics: sensitivity, specificity, PPV (not just AUC)
    • Clinical outcomes improved? (mortality, complications, length of stay)
  2. Check for bias
    • Performance stratified by race, ethnicity, age, sex?
    • Training data demographics disclosed?
    • Any fairness audit?
  3. Assess workflow impact
    • How much time does it add per patient?
    • False positive rate? (Will it cause alert fatigue?)
    • Can you customize thresholds?
  4. Verify regulatory status
    • FDA cleared/approved (if applicable)?
    • HIPAA BAA signed?
    • Any FDA warning letters or adverse events?
  5. Speak with physician references
    • Contact 2-3 physicians using it at other hospitals
    • Ask: “Do you trust it?” “Has it improved care?” “Would you recommend it?”

If any red flags: Say no. Don’t let your patients be guinea pigs.


For Hospital Leaders

Before deploying AI system hospital-wide:

  1. Mandatory pilot (3-6 months)
    • Start small: 1-2 units
    • Intensive monitoring: technical performance, physician satisfaction, patient outcomes
    • Pre-defined success criteria
  2. Clinical AI Governance Committee
    • Physician-led oversight
    • Review all AI acquisitions
    • Investigate adverse events
    • Annual bias audits
  3. Physician training
    • Not just “how to use it”
    • Teach: How AI works, limitations, when to override, how to document
  4. Patient notification
    • Inform patients AI is used in their care
    • Opt-out option where feasible
  5. Continuous monitoring
    • Technical performance (sensitivity, specificity, PPV)
    • Clinical outcomes
    • Equity metrics (by demographics)
    • Physician satisfaction
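For the continuous-monitoring item just above, the core computation is small: recompute PPV and alert volume over rolling windows of confirmed/unconfirmed alerts and compare against the pilot baseline. A minimal sketch, assuming a hypothetical alert log with alert_time and confirmed columns; drift thresholds should come from the governance committee's pre-defined criteria:

```python
# Rolling post-deployment monitoring of alert PPV and volume, flagging windows
# where PPV falls meaningfully below the pilot baseline. Schema is hypothetical.
import pandas as pd

def monitor_alerts(alert_log: pd.DataFrame, baseline_ppv: float,
                   window: str = "30D", drop_tolerance: float = 0.05) -> pd.DataFrame:
    log = alert_log.sort_values("alert_time").set_index("alert_time")
    grouped = log["confirmed"].resample(window)
    report = pd.DataFrame({"ppv": grouped.mean(), "n_alerts": grouped.count()})
    report["ppv_drift_flag"] = report["ppv"] < (baseline_ppv - drop_tolerance)
    return report
```

The same pattern extends to sensitivity (if missed cases are adjudicated) and to equity metrics computed per demographic group.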

If pilot fails: Terminate contract. Don’t escalate commitment to failed systems.


For Vendors

How to avoid building the next failure:

  1. Involve physicians from Day 1
    • Not just as advisors - as co-designers
    • Solve real physician pain points, not imagined ones
  2. Diverse training data
    • Multiple hospitals, geographic regions
    • Representative demographics
  3. Rigorous validation
    • External validation at ≥3 independent hospitals
    • Prospective studies
    • Publish in peer-reviewed journals
  4. Fairness testing
    • Independent fairness audits
    • Report performance by demographics
    • Mitigate bias before deployment
  5. Transparent reporting
    • Publish limitations, failure modes
    • Report false positive rates
    • Share adverse events

Success requires evidence, not just hype.


Conclusion: Learning from the Graveyard

The failures catalogued here represent: - >$100M wasted - 1,000s of physician hours lost - Thousands of patients affected by bias, missed diagnoses, or delayed care - Erosion of trust in AI among physicians and patients

But failures can teach us to build better:

Common threads across all failures: 1. Inadequate clinical validation 2. Biased or non-representative training data 3. Overfitting and poor generalization 4. Lack of transparency 5. Hype exceeding evidence 6. Insufficient physician involvement in design 7. No fairness testing 8. Workflow integration failures 9. Alert fatigue from high false positive rates

Key principles to avoid failures: - Evidence before adoption: External, prospective validation required - Transparency: Understand how AI works, what data trained it, where it fails - Equity: Test performance across demographics, address biases - Physician co-design: Involve physicians from conception through deployment - Continuous monitoring: Detect performance drift, safety issues, bias - Healthy skepticism: Extraordinary claims require extraordinary proof - Start small, scale slowly: Pilot → Validate → Scale (if successful) - You can say no: Bad AI is worse than no AI

Your patients’ safety depends on your vigilance.

The future of medical AI depends on learning from past failures and building systems that prioritize patient welfare over technological novelty or commercial gain.


Appendix: Quick Reference Red Flags

When evaluating an AI system, walk away if:

Validation Red Flags

  • No peer-reviewed publication in major medical journal
  • Internal validation only (no external validation)
  • Retrospective only (no prospective validation)
  • Single-site validation
  • Only technical metrics reported (AUC, accuracy) without clinical outcomes

Data Red Flags

  • Training data demographics not disclosed
  • Homogeneous training data (single institution, limited demographics)
  • Labels based on billing codes or retrospective chart review
  • No discussion of data quality or confounders

Fairness Red Flags

  • No fairness audit conducted
  • Performance not stratified by demographics
  • “We don’t use race, so it’s fair” (fairness through unawareness)
  • Using proxies correlated with protected classes (zip code, insurance, costs)

Workflow Red Flags

  • “Plug and play” claims
  • “No training needed”
  • No physician usability testing
  • False positive rate not disclosed or >20%
  • No ability to customize to your workflows

Business Red Flags

  • Startup with no revenue or customers
  • Can’t provide physician references
  • Version 1.0 product (you’re the beta tester)
  • No FDA clearance when required
  • Refuses to sign HIPAA BAA
  • Vague pricing or hidden costs

If you see multiple red flags: Don’t deploy. Your patients deserve better.


Remember: Every failure documented here was preventable. The physicians, hospitals, and vendors who avoided these mistakes followed the principles outlined above.

Your job: Demand evidence. Ask hard questions. Protect your patients.

The Clinical AI Morgue is a reminder: Success requires evidence, not hype.