Appendix C: Case Studies in Medical AI
Introduction
This appendix presents detailed case studies of medical AI implementations—both successes and failures. Each follows a similar structure: background, implementation, outcomes (or what went wrong), and lessons learned. These real-world examples illustrate principles discussed throughout the handbook.
Case Study 1: IDx-DR (Diabetic Retinopathy) - Success
Background
Diabetic retinopathy (DR) is a leading cause of preventable blindness. Early detection and treatment prevent vision loss, but screening requires ophthalmologists or trained specialists, who are scarce in many regions. Millions of people with diabetes lack regular eye exams.
Solution: IDx-DR, an autonomous AI system analyzing retinal images to detect referable DR (moderate or worse retinopathy requiring ophthalmologist referral).
Implementation
Development: IDx Technologies trained a deep learning model on 1.3 million retinal images and validated it in a prospective clinical trial (900 patients across 10 primary care sites).
FDA Submission: First autonomous AI diagnostic system (no physician review required before result) submitted to FDA.
FDA Clearance: April 2018—De Novo pathway (novel device). Required: Sensitivity >85%, specificity >82.5%.
Clinical Performance: Sensitivity 87.2%, specificity 90.7% in pivotal trial.
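For readers who want the arithmetic behind these figures, the following minimal sketch shows how sensitivity and specificity are computed from confusion-matrix counts and checked against prespecified endpoints. The counts are hypothetical illustrations, not the IDx-DR trial data.

```python
# Minimal sketch: sensitivity/specificity from a confusion matrix, checked
# against prespecified endpoints. Counts are hypothetical, NOT trial data.
tp, fn = 88, 12     # patients with referable DR: correctly flagged vs. missed
tn, fp = 720, 80    # patients without referable DR: correctly cleared vs. false referrals

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate

print(f"sensitivity = {sensitivity:.1%}, specificity = {specificity:.1%}")
print("meets prespecified endpoints:", sensitivity > 0.85 and specificity > 0.825)
```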
Deployment: Primary care clinics, pharmacies, endocrinology offices. Non-ophthalmologists capture retinal photos, IDx-DR interprets immediately, result provided within minutes.
Reimbursement: CPT code 92229 established (2020), Medicare coverage approved.
Outcomes
Positive:
- Thousands screened annually in settings without ophthalmologists
- Reduced time-to-referral for patients with referable DR
- Increased screening rates in underserved populations
- First autonomous AI to achieve FDA approval, a regulatory pathway, and reimbursement
Challenges:
- Image quality requirements (dilated pupils, specific camera) limit accessibility
- Not all healthcare systems adopted it (cost, workflow integration)
- Follow-up care coordination remains a challenge (diagnosis without a treatment pathway is insufficient)
Lessons Learned
1. Prospective validation essential: IDx-DR’s success partly due to rigorous prospective trial before FDA submission—not just retrospective studies.
2. Addressing real clinical need: DR screening gap was clear, well-documented problem. AI solved actual bottleneck.
3. Autonomous AI requires higher bar: FDA required higher performance thresholds for autonomous system (no physician review) than for decision-support tools.
4. Reimbursement matters: Without CPT code and Medicare coverage, adoption would have been minimal despite FDA clearance.
5. Implementation ≠ Impact: Clearance doesn’t guarantee adoption. Workflow integration, training, follow-up pathways all critical for real-world benefit.
Case Study 2: Viz.ai LVO (Stroke Detection) - Success
Background
Large vessel occlusion (LVO) strokes require emergent thrombectomy—mechanical clot removal. Time-critical: “time is brain.” But identifying LVO on CT angiography (CTA) requires expert interpretation, often delayed in emergency settings.
Solution: Viz.ai LVO, an AI system that analyzes head CTA to detect LVO and automatically alerts the stroke team (neurointerventionalists, neurologists) via a mobile app.
Implementation
Development: Deep learning model trained on thousands of CTA scans, validated externally across multiple centers.
FDA Clearance: February 2018 (De Novo pathway, establishing the computer-aided triage and notification category; cleared for triage/notification, not autonomous diagnosis).
Deployment: Integrated into hospital PACS (picture archiving and communication systems). When a CT is performed, the AI analyzes it automatically, flags suspected LVO, and sends mobile alerts to the stroke team with images.
Clinical Workflow: ED physician orders CTA, AI processes in <5 minutes, stroke team notified simultaneously with radiologist review, team mobilizes while formal read pending.
Outcomes
Positive:
- Multiple studies show reduced time-to-treatment (door-to-groin puncture time) by 30-50 minutes
- Improved functional outcomes (mRS scores) at 90 days
- Widely adopted (500+ hospitals globally by 2024)
- Published RCT data supporting clinical benefit
Challenges:
- False positives generate unnecessary alerts (stroke team fatigue)
- Requires robust IT infrastructure and integration with PACS
- Cost varies by institution (subscription model)
Lessons Learned
1. Triage/notification niche: AI doesn’t replace radiologist—flags urgent cases for expedited review. Less regulatory burden than autonomous diagnosis.
2. Time-to-treatment matters: In time-critical conditions (stroke, trauma), even modest time savings translate to meaningful outcome improvements.
3. Mobile integration key: Sending alerts to clinicians’ phones (not just radiologist workstation) enables rapid mobilization.
4. Multi-site validation builds trust: External validation across diverse hospitals demonstrated generalizability, facilitated adoption.
5. Measure outcomes, not just accuracy: Viz.ai tracked clinical outcomes (time-to-treatment, functional recovery), not just sensitivity/specificity—compelling for hospitals and payers.
Case Study 3: Epic Sepsis Model - Failure
Background
Sepsis is a leading cause of hospital deaths. Early recognition and treatment (antibiotics, fluids) save lives. Hospitals sought AI to predict sepsis risk, enabling proactive intervention.
Solution: Epic (major EHR vendor) developed sepsis prediction model (Epic Sepsis Model, ESM) integrated into EHR, alerting clinicians to high-risk patients.
Implementation
Development: Trained on Epic’s multi-institutional dataset. Deployed to hundreds of hospitals using Epic EHR.
Algorithm: Predicted sepsis risk based on vital signs, labs, clinical notes. Generated alerts when risk exceeded threshold.
Intended Use: Alert clinicians 6-12 hours before sepsis onset, prompt early intervention.
What Went Wrong
2021 External Validation Study (University of Michigan):
- Retrospective external validation on roughly 38,000 hospitalizations
- ESM failed to generate an alert for 67% of patients who developed sepsis
- It identified only 7% of sepsis cases that clinicians had not already recognized
- Sensitivity far lower than advertised
Reasons for Failure:
1. Training data issues: Model trained on sepsis cases identified retrospectively (via billing codes), not in real time. Billing codes are imperfect (they miss cases and misclassify others).
2. Look-ahead bias: Model may have learned from data collected after sepsis onset (e.g., labs ordered because sepsis was suspected), inflating apparent performance; see the sketch after this list.
3. Definition inconsistency: “Sepsis” is defined differently across institutions. A model trained on one definition performed poorly when applied to others.
4. Alert fatigue: High false positive rate meant clinicians ignored alerts.
5. Lack of external validation before deployment: Epic deployed widely before independent, prospective validation.
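The look-ahead problem has a simple guardrail: when assembling features for a prediction made at time t, use only observations recorded before t. The sketch below illustrates the idea; the dataframe columns and values are assumptions for illustration, not the schema of any real EHR or of the Epic Sepsis Model.

```python
import pandas as pd

# Guard against look-ahead bias: keep only observations recorded strictly
# before the prediction time. Column names and values are illustrative
# assumptions, not a real EHR schema.
def features_at(observations: pd.DataFrame, prediction_time: pd.Timestamp) -> pd.Series:
    """Return the latest value of each feature observed before prediction_time."""
    past = observations[observations["obs_time"] < prediction_time]
    return (past.sort_values("obs_time")
                .groupby("feature")["value"]
                .last())

# A lactate drawn at 14:30 because sepsis was already suspected must not
# leak into a prediction made at 12:00.
obs = pd.DataFrame({
    "obs_time": pd.to_datetime(["2024-01-01 08:00", "2024-01-01 14:30"]),
    "feature":  ["heart_rate", "lactate"],
    "value":    [118.0, 4.2],
})
print(features_at(obs, pd.Timestamp("2024-01-01 12:00")))  # heart_rate only
```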
Outcomes
Negative:
- Hospitals continued using it despite poor performance (embedded in the EHR, difficult to remove)
- Clinicians lost trust in AI-based alerts generally
- Patients potentially harmed (missed sepsis cases, delayed treatment)
- Legal and ethical scrutiny of Epic
Positive (inadvertently):
- Highlighted the need for external validation and transparent reporting
- Spurred regulatory discussion (FDA scrutiny of clinical decision support)
- Motivated independent research on sepsis prediction
Lessons Learned
1. External validation before widespread deployment: Models performing well internally may fail externally. Don’t deploy nationally without multi-site prospective validation.
2. Training data quality critical: Garbage in, garbage out. Billing codes, retrospective labels are noisy. Gold standard: prospective, clinician-adjudicated diagnoses.
3. Beware look-ahead bias: Ensure model uses only data available at prediction time, not future data.
4. Transparency matters: Epic initially didn’t disclose algorithm details, validation data, making independent evaluation difficult. Secrecy erodes trust.
5. Alert fatigue is real: High false positive rate → ignored alerts → missed true positives. Better to have no alert than unreliable alert.
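A back-of-the-envelope calculation shows why alert fatigue sets in so quickly at low disease prevalence. All numbers below are hypothetical assumptions, not figures from the Epic study.

```python
# Hypothetical positive predictive value (PPV) of a sepsis alert.
# All inputs are assumed for illustration, not figures from any real system.
prevalence = 0.06      # assume ~6% of inpatients develop sepsis
sensitivity = 0.60     # assume the alert catches 60% of true cases
specificity = 0.80     # assume 20% of non-septic patients still trigger alerts

true_alerts = prevalence * sensitivity
false_alerts = (1 - prevalence) * (1 - specificity)
ppv = true_alerts / (true_alerts + false_alerts)

print(f"Share of alerts that represent real sepsis: {ppv:.1%}")
# ~16%: roughly five of every six alerts are false alarms, which is how
# clinicians learn to tune them out.
```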
Case Study 4: IBM Watson for Oncology - Failure
Background
IBM Watson, famous for winning Jeopardy! (2011), was positioned as an AI revolution in healthcare. Watson for Oncology (WFO) promised to analyze medical literature and patient data and to provide evidence-based treatment recommendations for cancer.
Marketing: “Cognitive computing” could “read” millions of journal articles and guidelines and suggest personalized cancer treatments.
Implementation
Development: Trained on Memorial Sloan Kettering Cancer Center (MSKCC) cases and expert oncologist input. Deployed internationally (India, Thailand, Korea, and others) via partnerships with hospitals.
Intended Use: Oncologists input patient data, WFO recommends treatment options with supporting evidence.
What Went Wrong
Performance Issues:
1. Recommendations didn’t match evidence: Often suggested treatments contradicting guidelines or expert consensus
2. Unsafe recommendations: Internal documents revealed WFO suggested dangerous treatments (e.g., chemotherapy for patients with severe bleeding—contraindicated)
3. Trained on hypothetical cases, not real patients: MSKCC used synthetic cases for training, not actual patient outcomes
4. Limited generalizability: Trained primarily on U.S. patients, performed poorly in populations with different disease presentations and resources
Adoption Challenges:
- Oncologists found recommendations unhelpful and contradictory to their judgment
- Clunky interface that disrupted workflow
- Expensive (millions in licensing fees)
- Several high-profile hospitals canceled contracts (2017-2019)
IBM’s Response:
- 2018: Scaled back oncology ambitions
- 2022: Sold off the Watson Health division
- Acknowledged WFO didn’t deliver the promised value
Outcomes
Negative:
- Wasted resources (hospitals invested millions)
- Damaged credibility of medical AI generally (if Watson failed, can AI work?)
- Patients potentially harmed by unsafe recommendations (extent unclear)
Positive (inadvertently):
- Sobered the hype around AI—a reminder that marketing ≠ clinical reality
- Reinforced the need for evidence, validation, and transparency
Lessons Learned
1. Hype ≠ substance: Watson’s Jeopardy! success didn’t translate to medicine. Natural language processing of trivia questions is fundamentally different from clinical reasoning.
2. Training on real outcomes essential: Synthetic cases don’t capture clinical complexity. Models must learn from actual patient data, outcomes.
3. Domain expertise required: IBM lacked deep oncology expertise. Effective medical AI requires collaboration between AI engineers and clinicians.
4. Validate in deployment populations: U.S.-trained model failed internationally. Can’t assume generalizability.
5. Usability matters: Even accurate AI is useless if clinicians won’t use it. Workflow integration, user interface critical.
Case Study 5: COVID-19 Chest X-Ray AI - Failure
Background
Early in the COVID-19 pandemic (March-April 2020), testing was limited and results were slow. Researchers rushed to develop AI to detect COVID-19 from chest X-rays (CXRs), a rapid and widely available imaging test.
Dozens of models were published within months: papers claimed high accuracy (90-99%) in distinguishing COVID-19 from other pneumonias and normal CXRs.
Implementation
Development: Trained on publicly available datasets (CXR images labeled COVID-positive or negative).
Claims: “AI can diagnose COVID-19 from CXR, supplement scarce PCR testing.”
What Went Wrong
2021 Systematic Review (Nature Machine Intelligence):
- Reviewed 2,000+ COVID-19 imaging AI models
- Found ZERO suitable for clinical use
- All had critical flaws
Common Failures:
1. Training data biases: COVID-positive images often came from different sources (hospitals, equipment) than COVID-negative images. The AI learned to detect the image source, not the disease.
2. Confounding: COVID patients were often supine (portable X-rays) while healthy controls were standing (PA/lateral views). The AI detected positioning, not pathology.
3. Data leakage: Duplicate images appeared in both training and test sets, inflating performance; a simple duplicate check is sketched after this list.
4. Lack of external validation: Models were tested on the same dataset used for development.
5. Clinical implausibility: CXR findings in COVID-19 are non-specific (similar to other viral pneumonias). It was unrealistic to expect perfect discrimination.
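The leakage failure (item 3) is the easiest to catch mechanically. Below is a minimal sketch using content hashing to find byte-identical images shared between splits; the directory layout is an assumption, and re-encoded or cropped copies would need perceptual hashing instead.

```python
import hashlib
from pathlib import Path

# Detect exact duplicate images shared between train and test splits by
# hashing file contents. Catches byte-identical copies only; resized or
# re-encoded duplicates require perceptual hashing. The "data/train" and
# "data/test" directories are illustrative assumptions.
def file_hashes(folder: str) -> dict[str, Path]:
    hashes = {}
    for path in Path(folder).rglob("*.png"):
        hashes[hashlib.sha256(path.read_bytes()).hexdigest()] = path
    return hashes

train = file_hashes("data/train")
test = file_hashes("data/test")

leaked = set(train) & set(test)
print(f"{len(leaked)} images appear in both splits")
for digest in leaked:
    print("  duplicate:", train[digest], "<->", test[digest])
```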
Few models were ever deployed clinically; those that were showed poor real-world performance and were quietly withdrawn.
Outcomes
Negative:
- Research resources wasted (thousands of papers, little clinical impact)
- Misinformation spread (media reported “AI can diagnose COVID” based on flawed studies)
- Damage to AI credibility in radiology
Positive:
- Highlighted methodological failures in AI research
- Motivated better standards (CONSORT-AI, TRIPOD-AI reporting guidelines)
- Taught important lessons quickly (the entire arc—hype to failure—in ~18 months)
Lessons Learned
1. Speed vs. rigor: Pandemic urgency led to corner-cutting. Fast publication doesn’t mean good science. External validation, rigorous methods still essential.
2. Beware confounders: AI learns shortcuts. Always consider: “What else could explain this association?”
3. Biological plausibility matters: If radiologists can’t reliably distinguish diseases, AI likely can’t either. Don’t expect AI miracles.
4. Pre-registration and transparency: Many COVID AI studies retrospectively cherry-picked datasets, methods. Pre-registration (specify methods before seeing test set) prevents p-hacking.
5. Independent review critical: Peer review failed to catch obvious flaws. Need AI-literate reviewers, reproducibility checks (code, data sharing).
Case Study 6: Google Flu Trends - Failure (Historical Context)
Background
Though not a medical imaging AI, Google Flu Trends (GFT) is a cautionary tale relevant to medical AI generally.
Concept (2008): Google analyzed search queries (“flu symptoms,” “fever treatment”) to predict flu outbreaks in real time—faster than the CDC’s traditional surveillance (2-week lag).
Initial Success: GFT predictions matched CDC data closely (2009-2011), and the system was lauded as a “big data” revolution in public health.
What Went Wrong
2012-2013: GFT overestimated flu prevalence by 50-100%. Predictions increasingly inaccurate.
Reasons:
1. Overfitting: The model was trained on correlations in historical data (certain searches correlated with flu), but those correlations changed over time as search behavior evolved.
2. Algorithm updates: Google changed its search algorithms, altering the data-generating process without updating the GFT model.
3. Media coverage: Articles about flu outbreaks prompted searches unrelated to actual illness (“cyberchondria”), creating false signals.
4. Lack of transparency: Google didn’t disclose algorithm details, making independent evaluation impossible.
2015: Google discontinued GFT.
Lessons Learned
1. Performance drift: AI models degrade when underlying data distributions change. Continuous monitoring is essential (a minimal monitoring sketch follows this list).
2. Confounding and spurious correlations: Increased searches don’t necessarily mean increased illness. Alternative explanations (media attention, anxiety) must be considered.
3. Feedback loops: AI predictions can influence behavior (searches), which influences data, which influences predictions—unstable system.
4. Transparency enables improvement: Closed, proprietary models can’t be independently evaluated or corrected.
5. Complement, don’t replace: GFT attempted to replace CDC surveillance. Better approach: use AI to supplement traditional methods, not substitute.
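As a concrete illustration of the continuous monitoring called for in lesson 1, the sketch below compares weekly predicted activity against an independent reference (e.g., traditional surveillance) and flags sustained divergence. The tolerance and the numbers are arbitrary illustrative choices, not GFT's actual figures.

```python
# Minimal drift monitor: flag weeks where predictions diverge from an
# independent reference by more than a tolerance. All numbers are
# hypothetical illustrations, not GFT or CDC data.
def flag_drift(predicted: list[float], observed: list[float], tolerance: float = 0.5):
    flagged = []
    for week, (p, o) in enumerate(zip(predicted, observed), start=1):
        relative_error = abs(p - o) / o
        if relative_error > tolerance:
            flagged.append((week, p, o, relative_error))
    return flagged

predicted = [2.1, 2.4, 3.9, 5.6, 7.2]   # model's weekly flu-activity estimates
observed  = [2.0, 2.3, 2.8, 3.0, 3.1]   # reference surveillance values
for week, p, o, err in flag_drift(predicted, observed):
    print(f"week {week}: predicted {p}, observed {o} ({err:.0%} off) -> investigate")
```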
Conclusion
These case studies reveal common themes:
Successes share:
- Rigorous prospective validation
- Addressing clear clinical needs
- Appropriate deployment (triage/support vs. autonomous)
- Transparency and external evaluation
- Attention to workflow, usability, and reimbursement
Failures share:
- Inadequate validation (retrospective only, single-site, no external validation)
- Confounders and biases in training data
- Overfitting and poor generalization
- Lack of transparency
- Hype exceeding evidence
The difference between success and failure often isn’t algorithm sophistication—it’s methodological rigor, clinical grounding, and humility about limitations. These lessons apply to current and future medical AI development.