Clinical AI Safety and Risk Management
A widely deployed sepsis prediction algorithm missed 67% of sepsis cases, and 88% of its alerts were false positives. It had been validated retrospectively. It failed prospectively. Traditional medical devices break in obvious ways. AI systems fail silently, producing plausible-looking but dangerously wrong outputs. This chapter teaches you to recognize these failure modes before they harm patients.
After reading this chapter, you will be able to:
- Apply FDA regulatory frameworks for medical AI devices
- Conduct systematic failure mode and effects analysis (FMEA) for AI systems
- Recognize common AI failure patterns and sentinel events
- Implement safety monitoring and adverse event reporting
- Build a safety culture for AI deployment
- Understand human factors, automation bias, and cognitive anchoring risks
- Maintain independent clinical skills to prevent de-skilling
- Design fail-safe mechanisms for clinical AI
Introduction
In 2021, a study examining Epic’s widely deployed sepsis prediction model revealed a sobering finding: the algorithm had a sensitivity of only 33% in real-world clinical use, meaning it missed 67% of sepsis cases (Wong et al., 2021). The model also had a positive predictive value (PPV) of just 12%, meaning 88% of alerts were false positives. This was the same algorithm that had shown promising retrospective performance and had been implemented across hundreds of hospitals.
The Epic sepsis case illustrates a fundamental truth about medical AI safety: retrospective accuracy does not guarantee real-world safety. Unlike traditional medical devices (which fail predictably through mechanical breakdown or electrical malfunction), AI systems fail in subtle, context-dependent ways that may not be apparent until deployment.
This chapter examines the unique safety challenges of clinical AI, regulatory frameworks, systematic approaches to risk assessment, documented failure modes, and strategies for building a culture of AI safety.
AI Safety Within the Patient Safety Tradition
AI safety is not a new problem requiring new frameworks. It is the latest chapter in a patient safety tradition crystallized by the Institute of Medicine’s landmark 1999 report, To Err is Human: Building a Safer Health System (Kohn et al., 2000).
That report documented an uncomfortable truth: medical errors cause an estimated 44,000-98,000 deaths annually in U.S. hospitals, more than motor vehicle accidents, breast cancer, or AIDS. The IOM’s central insight was that these errors were not primarily caused by bad people making careless mistakes. They were good people working in bad systems. The solution was not punishment or retraining. It was redesigning systems to make errors less likely and less harmful when they occurred.
This framing applies directly to AI safety:
AI failures are system failures, not individual failures. When a physician over-relies on incorrect AI output, the failure is not the physician’s poor judgment. It is a system that presented AI recommendations without appropriate uncertainty indicators, training on automation bias, or workflow designs that preserve independent clinical reasoning.
Blame-free reporting enables learning. The IOM advocated for non-punitive error reporting so organizations could learn from failures. AI safety requires the same culture: physicians must feel safe reporting when AI recommendations were wrong or when they overrode AI incorrectly.
Design for safety, not just performance. The IOM emphasized building safety into systems through redundancy, forcing functions, and fail-safes. AI systems need similar design: not just high accuracy, but graceful degradation, uncertainty quantification, and human oversight at critical decision points.
The physicians who successfully navigate clinical AI will be those who understand that AI safety is patient safety, requiring the same systems thinking, just culture, and continuous improvement that the IOM articulated 25 years ago.
Why AI Safety is Different
Traditional Medical Device Safety
Medical devices have well-established safety paradigms: - Predictable failure modes: Pacemakers have battery depletion, monitors have sensor failures - Testable before deployment: Devices can be bench-tested, stress-tested, validated in controlled conditions - Static performance: Once validated, device performance doesn’t change (until hardware degrades) - Visible failures: Most failures are obvious (device stops working, alarm sounds)
AI System Safety Challenges
Medical AI introduces fundamentally different risks:
1. Silent Failures: - AI can produce plausible-looking but incorrect outputs - Errors may not be immediately apparent to clinicians - Example: AI misses subtle fracture on X-ray, radiologist trusts AI and also misses it
2. Context-Dependent Performance: - AI performs differently across populations, hospitals, workflows - What works at academic center may fail at community hospital - Performance varies with disease prevalence, patient demographics, image acquisition protocols
3. Performance Drift Over Time: - Clinical practice evolves (new treatments, changing patient populations) - AI trained on historical data becomes outdated - Performance degrades silently unless monitored (Finlayson et al., 2021)
4. Unpredictable Edge Cases: - AI may fail catastrophically on inputs unlike training data - Impossible to test all possible scenarios - Example: Chest X-ray AI trained pre-pandemic fails on COVID-19 pneumonia patterns - LLMs show 26-38% accuracy drops when faced with unfamiliar answer patterns, suggesting pattern matching over robust reasoning (Bedi et al., 2025)
5. Inscrutability: - Deep learning models often can’t explain their predictions - Makes root cause analysis difficult when failures occur - Clinicians can’t validate AI reasoning process
6. Cascading Failures: - AI errors propagate through clinical workflows - Wrong AI prediction → wrong clinical decision → patient harm - Multiple systems may compound errors
These differences demand new approaches to safety assessment and monitoring (Kelly et al., 2019).
FDA Regulatory Framework for Medical AI
The FDA regulates medical AI as Software as a Medical Device (SaMD) under existing device regulations, but is developing AI-specific frameworks.
SaMD Classification
Medical AI is classified based on risk level:
Class I: Low Risk - Definition: Minimal risk to patients if device malfunctions - Examples: Dental caries detection, skin condition photo analysis - Regulation: Exempt from premarket notification (510(k)) - Requirements: General controls (labeling, adverse event reporting)
Class II: Moderate Risk - Definition: Could cause temporary or minor harm if device malfunctions - Examples: Computer-aided detection (CAD) for mammography, diabetic retinopathy screening - Regulation: 510(k) clearance required (demonstrate substantial equivalence to predicate device) - Requirements: General controls plus special controls (performance standards, post-market surveillance)
Class III: High Risk - Definition: Could cause serious injury or death if device malfunctions - Examples: Autonomous diagnostic systems, treatment decision algorithms - Regulation: Premarket Approval (PMA) required (rigorous clinical trial evidence) - Requirements: General controls plus special controls plus premarket approval
Most Medical AI Currently Approved: - Majority are Class II devices (CAD, triage systems) - Few Class III AI devices (FDA cautious about fully autonomous systems) - Trend toward Class II with real-world performance monitoring
FDA’s AI/ML Action Plan
In 2021, the FDA released an action plan specifically for AI/ML-based medical devices:
1. Pre-Determined Change Control Plans (PCCP):
- Allows manufacturers to update AI models without new FDA submissions
- Must specify:
  - Types of changes anticipated (new training data, architecture modifications)
  - Methodology for updates (retraining protocols, validation procedures)
  - Impact assessment (when changes require a new submission)
- Balances innovation (rapid updates) with safety (FDA oversight)

2. Good Machine Learning Practice (GMLP):
- Quality and safety standards for AI development
- Covers:
  - Data quality and representativeness
  - Feature engineering and selection
  - Model training and testing
  - Performance monitoring
  - Documentation and transparency

3. Algorithm Change Protocol:
- Document describing how the algorithm will be modified post-market
- Safety and performance guardrails
- Triggers for re-validation

4. Real-World Performance Monitoring:
- Manufacturers must monitor deployed AI performance
- Report performance drift or safety signals
- Update or withdraw the device if performance degrades

5. Transparency and Explainability:
- FDA encourages (but doesn’t mandate) transparency about:
  - How the algorithm works
  - Training data characteristics
  - Intended use and limitations
  - Known failure modes
- Trend toward requiring more transparency for high-risk devices
Implications for Healthcare Organizations: - Can’t assume FDA clearance = proven clinical benefit - 510(k) clearance means “similar to existing device,” not “clinically validated” - PMA devices have more rigorous evidence - Organizations must conduct own validation even for FDA-cleared AI
Failure Mode and Effects Analysis (FMEA) for AI
FMEA is a systematic approach to identifying potential failures before they cause harm. Applied to AI:
FMEA Process
Step 1: Map Clinical Workflow - Document end-to-end process where AI will be used - Identify inputs, outputs, decision points, handoffs
Example: AI for Pulmonary Embolism (PE) Detection on CT
- Input: CT pulmonary angiography scan
- AI processing: Algorithm analyzes images, outputs PE probability
- Notification: Alerts radiologist if high probability
- Review: Radiologist reviews images and AI output
- Reporting: Radiologist issues final report
- Action: Clinical team acts on report
Step 2: Identify Potential Failure Modes
For each step, brainstorm what could go wrong:
| Workflow Step | Potential Failure Modes |
|---|---|
| Image acquisition | Poor image quality (motion, contrast timing), incompatible scanner |
| AI processing | Software crash, wrong patient, dataset shift, spurious correlation |
| Notification | Alert doesn’t fire, alert sent to wrong person, alert buried in inbox |
| Radiologist review | Automation bias (misses error), cognitive anchoring (AI biases judgment), alert fatigue (ignores AI), misinterprets AI output, loss of independent skills |
| Reporting | Report unclear, doesn’t reach ordering provider |
| Clinical action | Provider doesn’t see report, misinterprets recommendation, delays treatment |
Step 3: Assess Severity, Likelihood, and Detectability
For each failure mode, rate:
- Severity: How bad is it if it happens? (1 = negligible, 10 = catastrophic)
- Likelihood: How often will it happen? (1 = rare, 10 = frequent)
- Detectability: Will the failure be caught before harm? (1 = always detected, 10 = never detected)

Then compute the Risk Priority Number (RPN) = Severity × Likelihood × Detectability; a worked sketch follows the example table below.
Example:
| Failure Mode | Severity | Likelihood | Detectability | RPN | Priority |
|---|---|---|---|---|---|
| AI misses PE | 10 | 3 | 7 | 210 | HIGH |
| Cognitive anchoring to incorrect AI | 8 | 7 | 6 | 336 | CRITICAL |
| False positive PE | 4 | 6 | 3 | 72 | MEDIUM |
| Alert not delivered | 9 | 2 | 8 | 144 | HIGH |
| Radiologist ignores alert | 8 | 4 | 6 | 192 | HIGH |
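The RPN arithmetic above is simple enough to automate once an FMEA covers more than a handful of failure modes. The sketch below is illustrative only; the failure modes and 1-10 scores are the example values from the table, not validated ratings.

```python
# Minimal sketch: compute Risk Priority Numbers (RPN = severity x likelihood x detectability)
# for the example failure modes above and rank them. Scores are the illustrative 1-10
# ratings from the table, not validated values.
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int       # 1 = negligible, 10 = catastrophic
    likelihood: int     # 1 = rare, 10 = frequent
    detectability: int  # 1 = always detected, 10 = never detected

    @property
    def rpn(self) -> int:
        return self.severity * self.likelihood * self.detectability


failure_modes = [
    FailureMode("AI misses PE", 10, 3, 7),
    FailureMode("Cognitive anchoring to incorrect AI", 8, 7, 6),
    FailureMode("False positive PE", 4, 6, 3),
    FailureMode("Alert not delivered", 9, 2, 8),
    FailureMode("Radiologist ignores alert", 8, 4, 6),
]

# Rank highest-risk failure modes first so mitigation effort goes where the RPN is largest.
for fm in sorted(failure_modes, key=lambda f: f.rpn, reverse=True):
    print(f"{fm.name}: RPN = {fm.rpn}")
```

Re-running the same calculation with post-mitigation scores (as in the mitigations below) shows whether a safeguard actually moves a failure mode out of the critical range.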
Step 4: Implement Risk Mitigations
For high-priority failure modes, design safeguards:
Cognitive Anchoring to Incorrect AI (RPN=336): - Mitigation 1: Require radiologist to document preliminary impression before viewing AI output (for complex cases) - Mitigation 2: Training on anchoring bias recognition with case examples - Mitigation 3: Audit AI-concordant errors (cases where both AI and radiologist were wrong) - Mitigation 4: Monitor override rates (low override rate may indicate over-reliance) - Impact: Reduces likelihood from 7 to 4 and detectability from 6 to 4 (RPN drops to 128)
AI Misses PE (RPN=210): - Mitigation 1: Radiologist reviews all cases (not just AI-flagged ones) - Mitigation 2: Quality assurance sampling (re-review AI-negative cases) - Mitigation 3: Performance monitoring (track missed PE rate) - Impact: Reduces detectability from 7 to 3 (RPN drops to 90)
Alert Not Delivered (RPN=144): - Mitigation 1: Redundant notification (EHR inbox + page for critical findings) - Mitigation 2: Require acknowledgment within 1 hour - Mitigation 3: Escalation if not acknowledged - Impact: Reduces likelihood from 2 to 1 (RPN drops to 72)
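Several of these mitigations are fail-safe mechanisms that can be enforced in software rather than left to memory or habit. Below is a minimal sketch of the acknowledgment-and-escalation logic, assuming a one-hour acknowledgment window and a hypothetical escalation chain; the roles and function names are illustrative, not a vendor API.

```python
# Minimal sketch of an acknowledgment/escalation fail-safe for critical AI alerts.
# The 60-minute window and the escalation chain are illustrative assumptions.
from datetime import datetime, timedelta

ACK_WINDOW = timedelta(minutes=60)
ESCALATION_CHAIN = ["covering radiologist", "charge nurse", "attending on call"]  # hypothetical roles


def check_alert(sent_at: datetime, acknowledged: bool, escalation_level: int, now: datetime) -> tuple[int, str]:
    """Return (new_escalation_level, action) for a single critical alert."""
    if acknowledged:
        return escalation_level, "no action: alert acknowledged"
    # Each escalation step gets one acknowledgment window before moving up the chain.
    overdue = now - sent_at - ACK_WINDOW * (escalation_level + 1)
    if overdue < timedelta(0):
        return escalation_level, "no action: still within acknowledgment window"
    if escalation_level < len(ESCALATION_CHAIN):
        return escalation_level + 1, f"escalate: notify {ESCALATION_CHAIN[escalation_level]}"
    return escalation_level, "escalation chain exhausted: trigger manual safety review"


# Example: an unacknowledged alert sent 90 minutes ago escalates to the first backup contact.
level, action = check_alert(datetime(2025, 1, 1, 8, 0), False, 0, datetime(2025, 1, 1, 9, 30))
print(level, action)  # 1 escalate: notify covering radiologist
```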
Step 5: Document and Monitor - Document FMEA findings and mitigations - Revisit FMEA periodically (workflows and AI change) - Track actual failures and update risk assessments
AI-Specific FMEA Considerations
1. Data Quality Failures: - Incorrect patient matched to AI input - Missing or corrupted data elements - Data format incompatible with AI expectations
2. Model Performance Failures: - Dataset shift (population differs from training) - Adversarial inputs (deliberately fooling AI) - Edge cases not in training data
3. Integration Failures: - AI output misinterpreted by clinicians - Timing issues (AI result arrives too late) - AI recommendations conflict with other clinical data
4. Human Factors Failures: - Automation bias (over-reliance on AI) - Cognitive anchoring (AI biases clinician judgment) - Alert fatigue (too many false positives) - Loss of clinical skills from AI dependence (de-skilling) - Inability to diagnose independently when AI unavailable or unreliable
Common AI Failure Patterns
Understanding how AI systems fail helps prevent and detect errors.
1. Dataset Shift and Generalization Failure
What It Is: AI trained on one population/setting performs poorly when deployed in different context.
Why It Happens: - Training data not representative of deployment population - Clinical workflows differ between development and deployment sites - Disease prevalence, patient demographics, or comorbidities differ
Examples:
COVID-19 Chest X-ray AI (DeGrave et al., 2021): - AI trained on pre-pandemic chest X-rays to detect pneumonia - When deployed during pandemic, many AI systems failed on COVID-19 pneumonia - Reason: COVID-19 patterns not in training data, AI learned non-generalizable features - Some AI learned to detect portable X-rays (used for sicker patients) rather than actual pneumonia
Pneumonia Detection Dataset Shift (Zech et al., 2018): - AI trained at one hospital achieved 90%+ accuracy - Same AI deployed at different hospital: accuracy dropped to ~60% - Reason: AI learned hospital-specific artifacts (patient positioning, X-ray machine markers) instead of pneumonia
Mitigation: - Train on diverse data from multiple institutions - External validation before deployment - Monitor real-world performance continuously - Retrain when performance drifts
2. Spurious Correlations (Clever Hans Effect)
What It Is: AI learns irrelevant patterns that happen to correlate with outcome in training data but don’t reflect true causal relationships.
Why It Happens: - Training data contains confounding variables - AI optimizes for accuracy, not clinical reasoning - Limited data causes AI to latch onto any predictive signal
Examples:
Skin Cancer Detection and Rulers (Esteva et al., 2017): - Dermatology AI appeared highly accurate - Later discovered AI partially relied on rulers/color calibration markers in images - Malignant lesions more likely to be photographed with rulers (clinical documentation practice) - AI learned “ruler = cancer” instead of visual features of cancer
ICU Mortality Prediction and Time of Admission: - AI predicted ICU mortality based on admission time - Patients admitted at night had higher mortality - AI learned “night admission = high risk” rather than disease severity - Spurious correlation: sicker patients tend to arrive at night
Mitigation: - Careful feature engineering (include only clinically relevant variables) - Interpretability analysis (understand what AI is using) - Adversarial testing (remove expected signals, see if performance drops) - Clinical review of AI features/logic
3. Automation Bias and Over-Reliance
What It Is: Clinicians uncritically accept AI recommendations, even when wrong or when contradicted by other clinical information.
Why It Happens: - Cognitive bias toward trusting automated systems - AI presented as authoritative (“algorithm says…”) - Time pressure and cognitive load - Deskilling from prolonged AI use (loss of independent judgment)
Evidence:
Radiology Studies (Beam & Kohane, 2018): - Radiologists reviewing AI-flagged images made more errors when the AI was wrong than radiologists reading without AI - Effect stronger for less experienced radiologists - Automation bias overcame clinical judgment
Pathology AI (Campanella et al., 2019): - Pathologists reviewing AI-assisted slides sometimes missed obvious errors - Trust in AI reduced vigilance
Cognitive Anchoring as a Critical Safety Failure Mode
Cognitive anchoring represents one of the most insidious AI safety risks: clinicians systematically bias their assessment toward AI recommendations, losing the capacity for independent clinical judgment that serves as the essential safety net for catching AI errors.
The Anchoring Effect:
When clinicians see an AI recommendation before forming their own clinical impression, the AI output becomes a cognitive anchor that disproportionately influences their final judgment. This differs from simple automation bias (accepting AI uncritically) in that clinicians believe they are exercising independent judgment when, in fact, their reasoning has been subtly channeled by the AI’s suggestion (Khullar, 2025).
Why Anchoring is Particularly Dangerous:
- Invisible influence: Clinicians don’t recognize they’ve been anchored, assuming their judgment remains independent
- Affects experienced physicians: Unlike simple automation bias (stronger in novices), anchoring can influence even seasoned clinicians
- Undermines error detection: The AI’s primary safety check (physician oversight) becomes compromised
- Compounds over time: Repeated exposure to AI recommendations gradually erodes independent diagnostic capability
FMEA Risk Assessment for Cognitive Anchoring:
| Parameter | Rating | Rationale |
|---|---|---|
| Severity | High (8/10) | Missed diagnoses, inappropriate treatments, patient harm |
| Likelihood | High (7/10) | Well-documented cognitive bias, occurs across specialties |
| Detectability | Medium (6/10) | Difficult to distinguish from appropriate AI-concordant decisions |
| Risk Priority Number | 336 | CRITICAL PRIORITY |
Evidence of Anchoring in Clinical AI:
Diagnostic Decision-Making: - Physicians shown AI diagnostic suggestions anchored toward those diagnoses, even when clinical findings pointed elsewhere - Effect persisted when AI was explicitly labeled as “low confidence” - Clinicians underweighted contradictory clinical information
Radiology Interpretation: - Radiologists shown AI-flagged regions focused disproportionately on those areas - Missed abnormalities in non-flagged regions they would have caught without AI - False sense of completeness: “AI didn’t flag anything else, so I’m done”
Mitigation Strategies for Anchoring:
1. Workflow Design: - Independent assessment first: Require clinicians to document preliminary impression before viewing AI output (for high-stakes diagnoses) - Delayed AI display: AI recommendations appear only after clinician forms initial judgment - Blind review sampling: Periodic audits where clinicians interpret cases without AI access - Two-stage review: Independent clinician review, then AI-assisted review, then reconciliation
2. Training and Awareness: - Education on anchoring bias and its mechanisms - Case-based training showing examples of AI-induced anchoring - Regular feedback on cases where clinician anchored to incorrect AI - Competency assessment: Can physicians diagnose accurately without AI?
3. Performance Monitoring (a monitoring sketch follows this list):
- Audit AI-concordant errors: review cases where the clinician agreed with incorrect AI
- Monitor override rates: paradoxically, a very low override rate may indicate over-reliance rather than AI accuracy
  - Healthy override rate (10-20%) suggests independent clinical judgment
  - Very low override rate (<5%) raises concern for anchoring
  - Very high override rate (>50%) suggests the AI is not clinically useful
- Track diagnostic accuracy with vs. without AI: compare performance when AI is available vs. unavailable
4. AI Interface Design: - Present AI as “additional data point,” not “recommendation” - Require clinicians to justify agreement or disagreement with AI - Show confidence intervals/uncertainty (discourage anchoring to low-confidence outputs) - Avoid authoritative framing (“AI diagnosis:”) in favor of neutral language (“AI analysis suggests consideration of:”)
5. Organizational Safeguards: - Regular case conferences reviewing AI-discordant cases (AI wrong, clinician caught it) and AI-concordant errors (AI wrong, clinician missed it) - Celebrate appropriate AI overrides (reinforce that disagreeing with AI is professionally acceptable) - Continuous learning culture: every anchoring-related near-miss triggers workflow review
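Override-rate screening (referenced in the performance monitoring item above) is easy to automate. The sketch below encodes the chapter's heuristic bands; the thresholds are illustrative rules of thumb, not validated standards.

```python
# Minimal sketch: flag override-rate patterns using the heuristic bands described above.
# Thresholds (<5%, 10-20%, >50%) are the chapter's illustrative values, not validated standards.
def assess_override_rate(n_ai_recommendations: int, n_overrides: int) -> str:
    if n_ai_recommendations == 0:
        return "no data"
    rate = n_overrides / n_ai_recommendations
    if rate < 0.05:
        return f"{rate:.1%} override rate: possible over-reliance/anchoring -- audit AI-concordant cases"
    if rate > 0.50:
        return f"{rate:.1%} override rate: AI may not be clinically useful -- review threshold and use case"
    if 0.10 <= rate <= 0.20:
        return f"{rate:.1%} override rate: consistent with independent clinical judgment"
    return f"{rate:.1%} override rate: within a broadly acceptable range -- continue monitoring"


print(assess_override_rate(500, 12))  # 2.4% -> possible over-reliance
print(assess_override_rate(500, 75))  # 15.0% -> consistent with independent judgment
```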
De-skilling as Long-Term Safety Risk
Prolonged reliance on AI diagnostic support creates a secondary safety risk: physicians lose the independent diagnostic capability necessary to catch AI errors. This represents an insidious erosion of the very safety mechanism (physician oversight) that AI deployment assumes will prevent harm.
The De-skilling Phenomenon:
What Happens: - Physicians who routinely use AI assistance gradually lose fluency in independent diagnosis - Pattern recognition skills atrophy from disuse - Diagnostic reasoning becomes dependent on AI prompts - When AI fails or is unavailable, physicians struggle to diagnose independently
Why It Matters for Safety: - AI safety model assumes physician can catch AI errors through independent judgment - If physicians can no longer diagnose without AI, this safety net disappears - Creates dangerous dependency: AI errors go uncaught because physicians lack capability to recognize them - Particularly concerning for rare diseases or atypical presentations (where AI may be unreliable and physician pattern recognition critical)
Evidence:
Radiology De-skilling: - Residents trained with AI assistance showed reduced independent interpretation skills - When AI removed, performance dropped below baseline - Particular deficits in subtle findings and atypical presentations
Clinical Reasoning: - Medical students using AI diagnostic assistants during training showed weaker differential diagnosis generation - Dependence on AI suggestions rather than systematic clinical reasoning - Difficulty articulating reasoning independent of AI prompts
Certification and Competency Questions:
The rise of AI assistance forces difficult questions about physician competency:
- Can physicians diagnose without AI? If not, what happens during system downtime or novel scenarios where AI is unreliable?
- Should board certification exams include AI-free assessments? Ensuring baseline independent diagnostic capability
- How do we maintain skills in AI era? Deliberate practice without AI assistance to preserve independent judgment
- What is minimum acceptable independent performance? Standards for physicians working with AI
Mitigation Strategies for De-skilling:
1. Training Programs: - AI-free training periods: Medical students and residents must demonstrate independent diagnostic competency before AI assistance - Continued AI-free practice: Regular cases interpreted without AI to maintain skills - Competency assessments: Periodic testing of independent diagnostic capability (without AI) - Cross-training: Rotate between AI-assisted and AI-free workflows
2. Workflow Integration: - Mandatory independent assessment: For complex or high-stakes cases, require documented independent impression before AI consultation - Scheduled AI-free days: Periodic practice without AI assistance to maintain skills - Case variety: Ensure physicians see sufficient case volume and diversity to maintain pattern recognition
3. Institutional Policies: - Competency standards: Define minimum independent diagnostic capability regardless of AI availability - Skills maintenance requirements: Continuing education focused on independent clinical reasoning - Backup protocols: Procedures for AI system downtime that don’t compromise patient safety - Hiring and credentialing: Assess independent diagnostic capability, not just AI-assisted performance
4. System Design: - Degradable AI: Systems designed to provide varying levels of assistance, allowing skills practice - Deliberate difficulty: Periodic cases where AI intentionally withheld to maintain physician skills - Educational mode: AI provides feedback after independent assessment rather than during
Cross-Reference: See Chapter 20 (AI-Human Collaboration Patterns) for workflow designs that preserve independent judgment while leveraging AI capabilities. The goal is not to avoid AI assistance, but to design collaboration patterns that maintain rather than erode physician diagnostic expertise.
Mitigation (General): - Present AI as “second opinion,” not ground truth - Require independent clinical assessment before viewing AI output (for high-stakes decisions) - Training on automation bias and anchoring recognition - Audit cases where clinician agreed with incorrect AI - Calibrate trust: highlight when AI is uncertain or in novel scenario - Monitor override rates to detect over-reliance - Ensure physicians maintain diagnostic competency independent of AI
4. Alert Fatigue and Integration Failures
What It Is: AI produces too many alerts (often false positives), causing clinicians to ignore all alerts, including true positives.
Why It Happens: - AI optimized for high sensitivity, accepting low specificity - Poor integration with clinical workflow (alerts at wrong time, wrong place) - No prioritization (all alerts treated equally)
Examples:
Epic Sepsis Model (Wong et al., 2021): - Low sensitivity (missed most sepsis) but still produced many false positives - Clinicians became desensitized to alerts - True positives ignored alongside false positives
General EHR Alert Fatigue: - Studies show clinicians override 49-96% of drug interaction alerts - Adding AI alerts without workflow consideration worsens problem
Mitigation: - Tune AI threshold based on acceptable false positive rate (not just maximizing sensitivity) - Smart alerting: right information, right person, right time, right format - Tiered alerts: critical vs. informational - Require acknowledgment for critical alerts with escalation - Monitor alert override rates and reasons
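Threshold tuning can be made explicit: sweep candidate thresholds over a locally collected validation set and choose the one that meets a stated PPV or alert-burden target rather than simply maximizing sensitivity. A minimal sketch, assuming you have logged predicted probabilities and adjudicated labels from your own validation cohort (the 20% PPV floor is an illustrative target):

```python
# Minimal sketch: choose an alert threshold that meets a minimum PPV target on a local
# validation set, rather than maximizing sensitivity alone. `probs` and `labels` are
# assumed to come from your own validation cohort; the PPV floor is illustrative.
def pick_threshold(probs: list[float], labels: list[int],
                   min_ppv: float = 0.20) -> tuple[float, float, float] | None:
    best = None
    for i in range(1, 100):
        t = i / 100
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        if tp + fp == 0:
            continue
        ppv = tp / (tp + fp)
        sens = tp / (tp + fn) if (tp + fn) else 0.0
        # Among thresholds meeting the PPV floor, keep the one with the highest sensitivity.
        if ppv >= min_ppv and (best is None or sens > best[1]):
            best = (t, sens, ppv)
    return best  # (threshold, sensitivity, ppv), or None if no threshold meets the target
```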
5. Performance Drift Over Time
What It Is: AI performance degrades after deployment as clinical practice, patient populations, or data characteristics evolve.
Why It Happens: - Clinical practice changes (new treatments, diagnostic criteria, guidelines) - Patient demographics shift - Changes in data collection or EHR systems - AI becomes outdated but continues to be used
Example: Cardiovascular Risk Prediction (Finlayson et al., 2021): - Risk models trained on historical data - Performance degrades over time as treatment improves (statins, blood pressure management) - Historical risk factors less predictive in modern era
Mitigation: - Continuous performance monitoring - Set performance thresholds triggering retraining - Regular scheduled revalidation (e.g., annually) - Version control and change management - Willingness to decommission outdated AI
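One way to operationalize "performance thresholds triggering retraining" is a simple drift check: compare a recent monitoring window against the validation baseline and flag when the drop exceeds a pre-specified tolerance. A minimal sketch with illustrative numbers:

```python
# Minimal sketch: flag performance drift when a recent window's metric falls more than a
# pre-specified tolerance below the validation baseline. All values are illustrative.
def drift_alert(baseline_auc: float, recent_auc: float, tolerance: float = 0.05) -> bool:
    """Return True if the recent AUC has dropped more than `tolerance` below baseline."""
    return (baseline_auc - recent_auc) > tolerance


if drift_alert(baseline_auc=0.86, recent_auc=0.78):
    print("Drift detected: notify the AI governance committee and schedule revalidation")
```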
Safety Monitoring and Adverse Event Reporting
Ongoing monitoring is essential for catching AI failures before widespread harm.
Real-World Performance Monitoring
What to Monitor:
1. Discrimination Metrics: - Sensitivity, specificity, AUC-ROC - Track overall and by patient subgroups - Set threshold for acceptable performance
2. Calibration: - Do predicted probabilities match observed outcomes? - Example: Of the patients the AI assigns a 30% mortality risk, do roughly 30% actually die? - Miscalibration suggests model drift (see the sketch after this list)
3. Alert Metrics: - Alert rate (alerts per day) - Override rate (% of alerts ignored) - False positive and false negative rates - Positive predictive value in clinical practice
4. Clinical Outcomes: - Patient outcomes when AI used vs. not used (if feasible) - Time to treatment, missed diagnoses, unnecessary testing - Ideally compare against pre-AI baseline
5. Subgroup Performance: - Performance across race, ethnicity, age, sex, insurance status - Detect disparate impact or bias - Ensure equity
6. User Metrics: - Physician trust and satisfaction - Workflow disruption reports - Time spent reviewing AI outputs
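Several of these metrics, calibration and subgroup sensitivity in particular, can be computed directly from logged predictions and adjudicated outcomes. A minimal sketch, assuming each logged case carries a predicted probability, an observed outcome, and a subgroup label (the data structures are illustrative):

```python
# Minimal sketch: decile calibration check and subgroup sensitivity from logged cases.
# Each case is (predicted_probability, observed_outcome, subgroup_label); the data are
# assumed to come from your own monitoring log.
from collections import defaultdict


def calibration_by_decile(cases: list[tuple[float, int, str]]) -> list[tuple[float, float, int]]:
    """Return (mean predicted risk, observed event rate, n) for each populated risk decile."""
    bins = defaultdict(list)
    for prob, outcome, _ in cases:
        bins[min(int(prob * 10), 9)].append((prob, outcome))
    summary = []
    for d in sorted(bins):
        probs = [p for p, _ in bins[d]]
        outcomes = [o for _, o in bins[d]]
        summary.append((sum(probs) / len(probs), sum(outcomes) / len(outcomes), len(outcomes)))
    return summary


def sensitivity_by_subgroup(cases: list[tuple[float, int, str]], threshold: float = 0.5) -> dict[str, float]:
    """Sensitivity (true positive rate) within each subgroup at a given alert threshold."""
    tp, fn = defaultdict(int), defaultdict(int)
    for prob, outcome, group in cases:
        if outcome == 1:
            if prob >= threshold:
                tp[group] += 1
            else:
                fn[group] += 1
    return {g: tp[g] / (tp[g] + fn[g]) for g in set(tp) | set(fn) if (tp[g] + fn[g]) > 0}
```

Large, unexplained gaps between predicted and observed rates within a decile, or a sensitivity gap between subgroups, are the kinds of signals that should feed the governance committee review described later in the chapter.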
How to Monitor:
Automated Dashboards: - Real-time or daily updates on key metrics - Alert when metrics fall below thresholds - Drill-down capability for root cause analysis
Periodic Audits: - Sample cases for detailed review - Compare AI output to ground truth - Identify systematic errors
Prospective Studies: - Randomized trials or cohort studies evaluating AI impact - Gold standard but resource-intensive
Adverse Event Reporting
What Counts as AI Adverse Event: - Incorrect AI output leading to patient harm (delayed diagnosis, wrong treatment) - AI system failure preventing timely care - Alert fatigue causing true positive to be ignored - Workflow disruption from AI integration - Privacy breach from AI system
Reporting Mechanisms:
Internal Reporting: - Easy-to-use reporting system for clinicians - Non-punitive culture (just culture, not blame culture) - Rapid response to reports - Feedback to reporters on outcomes
FDA Reporting (Medical Device Reporting, MDR):
- Required for manufacturers and user facilities
- Report if the AI device:
  - Caused or contributed to a death or serious injury
  - Malfunctioned and would likely cause harm if the malfunction recurred
- Timelines: deaths (manufacturer: 30 days; user facility: 10 work days); serious injuries (manufacturer: 30 days; user facility: 10 work days to the manufacturer, plus an annual summary to FDA)
Institutional Quality/Safety Reporting: - Incorporate AI into existing safety event reporting - Root cause analysis (RCA) for serious AI-related events - Failure mode analysis to prevent recurrence
Learning from Events: - Share lessons across institutions (de-identified case reports) - National registries for AI adverse events (emerging) - Vendor accountability (require vendors to address identified failures)
Building a Safety Culture for AI
Technology alone doesn’t ensure safety. Organizational culture matters.
Core Principles
1. Physician Oversight is Non-Negotiable: - AI assists, humans decide (especially for high-stakes decisions) - Physicians retain ultimate authority and accountability - Can’t delegate responsibility to algorithms
2. Transparency About Limitations: - Honest communication about what AI can and can’t do - Don’t oversell AI capabilities to staff or patients - Acknowledge uncertainty
3. Maintain Independent Clinical Skills: - Physicians must retain diagnostic capability independent of AI - Regular practice without AI assistance to prevent de-skilling - Competency assessment should include AI-free performance - Skills maintenance is a safety requirement, not optional
4. Just Culture: - Encourage error reporting without blame - Focus on system improvement, not individual fault - Psychological safety for raising concerns about AI - Applies the IOM principle: most errors result from faulty systems, not faulty people (Kohn et al., 2000)
5. Continuous Learning: - Every failure is a learning opportunity - Regular review of AI performance and incidents - Update protocols based on lessons learned
6. Patient-Centered: - Safety trumps efficiency or cost - Patient welfare always first priority - Equitable AI performance across patient populations
Organizational Safeguards
AI Governance Committee: - Multidisciplinary: clinicians, informatics, quality/safety, ethics, legal - Reviews AI before deployment (safety assessment, FMEA) - Monitors AI performance and adverse events - Authority to pause or decommission AI if safety concerns
Training and Education: - Educate clinicians about AI capabilities and limitations - Training on automation bias, anchoring effects, and appropriate AI use - Competency assessment before independent use (including AI-free diagnostic capability) - Regular skills maintenance: periodic practice without AI assistance - Case-based training showing examples of AI-induced anchoring and appropriate AI overrides
Standard Operating Procedures: - Document clinical protocols for AI use - Escalation procedures for AI failures or uncertain cases - Criteria for overriding AI recommendations
Audit and Feedback: - Regular audits of AI-assisted cases - Feedback to clinicians on performance - Identify and address misuse or over-reliance
Case Studies: Learning from AI Safety Failures
Case Study 1: Epic Sepsis Model
Background: - Sepsis prediction model widely deployed across U.S. hospitals - Promised early sepsis detection to improve outcomes - Retrospective studies showed reasonable accuracy
What Went Wrong (Wong et al., 2021): - External validation study (University of Michigan) found: - Sensitivity only 33% (missed 67% of sepsis cases) - Positive predictive value only 12% (88% of alerts were false positives) - Performance far worse than retrospective claims
Root Causes: - Dataset shift: training data from different patient populations - Retrospective validation overestimated performance (selection bias) - Integration issues: alert timing often too late - Lack of prospective validation before wide deployment
Lessons: - External validation essential (don’t trust vendor claims alone) - Retrospective accuracy does not equal prospective clinical utility - Test AI in your specific population before relying on it - Monitor real-world performance continuously
Outcome: - Many hospitals paused or discontinued use - Epic modified algorithm and validation approach - Highlighted need for transparency in AI performance claims
Case Study 2: IBM Watson for Oncology
Background: - IBM marketed Watson as AI for personalized cancer treatment - Promised evidence-based treatment recommendations - Adopted by hospitals worldwide
What Went Wrong (Ross & Swetlitz, 2018): - STAT News investigation revealed: - Unsafe and incorrect treatment recommendations - Recommendations based on limited training data (synthetic cases from single cancer center) - Never validated in prospective clinical trials - Doctors trained to use Watson in 2-day sessions (insufficient)
Examples of Unsafe Recommendations: - Recommended chemotherapy for patient with severe bleeding (contraindication) - Suggested drugs in combinations not proven safe - Treatment plans contradicting evidence-based guidelines
Root Causes: - Marketing hype exceeded actual capabilities - Insufficient clinical validation - Training data not representative (synthetic, not real patients) - Lack of physician oversight in recommendation generation
Lessons: - Demand rigorous clinical trial evidence, not just demonstrations - Marketing claims do not equal clinical validation - AI for high-stakes decisions (cancer treatment) requires highest evidence standard - Physician expertise cannot be replaced by insufficiently validated AI
Outcome: - IBM scaled back Watson Health initiatives - Many hospitals discontinued use - Cautionary tale about AI hype vs. reality
Case Study 3: Chest X-ray AI and COVID-19
Background: - Multiple AI systems developed for pneumonia detection from chest X-rays - Appeared highly accurate in retrospective studies - Deployed during COVID-19 pandemic
What Went Wrong (DeGrave et al., 2021): - Many AI systems failed on COVID-19 pneumonia: - Trained on pre-pandemic data (no COVID-19 patterns) - Learned spurious correlations (patient positioning, laterality markers, portable X-rays) - Poor generalization to novel disease
Documented Issues: - AI detected “pneumonia” based on portable vs. fixed X-ray equipment - Picked up hospital-specific artifacts, text overlays, positioning - Failed to detect actual COVID-19 pneumonia features
Root Causes: - Training data biases (sicker patients → portable X-rays) - Lack of causal reasoning (correlations mistaken for disease features) - Insufficient stress testing on out-of-distribution cases - Rapid deployment without adequate validation
Lessons: - AI doesn’t truly “understand” disease, it learns statistical patterns - Training data biases lead to spurious correlations - Test AI on out-of-distribution data before deployment - Pandemic highlighted need for robust, generalizable AI
Recommendations for Safe AI Implementation
Pre-Deployment
1. Rigorous Validation: - Prospective validation in your target population - External validation if possible - Subgroup analysis (race, age, sex, insurance, disease severity)
2. Failure Mode Analysis: - Conduct FMEA before deployment - Identify high-risk failure modes - Design mitigations and safeguards
3. Human Factors Evaluation: - Test AI in realistic clinical workflow - Assess usability, alert design, integration - Identify automation bias risks
4. Transparent Communication: - Educate clinicians about AI capabilities and limitations - Set realistic expectations - Training on appropriate use
5. Safety Protocols: - Standard operating procedures for AI use - Escalation procedures for failures or uncertain cases - Oversight and accountability structure
During Use
6. Real-World Performance Monitoring: - Continuous tracking of key metrics - Dashboards with automated alerts for performance drops - Regular reporting to governance committee
7. Adverse Event Reporting: - Easy, non-punitive reporting system - Rapid investigation and response - Sharing lessons learned
8. Physician Oversight: - AI recommendations reviewed by qualified clinicians - Physicians retain final decision authority - Can’t delegate responsibility to algorithms
9. Patient Communication: - Inform patients about AI use (tiered consent approach) - Transparency about limitations - Respect patient preferences
Ongoing
10. Regular Safety Audits: - Periodic review of AI performance and incidents - Update risk assessments and mitigations - Assess for performance drift
11. Revalidation: - Scheduled revalidation (e.g., annually) - After major clinical practice changes - When patient population characteristics shift
12. Continuous Improvement: - Learn from failures and near-misses - Update AI, protocols, or training based on lessons - Stay current with evolving best practices
13. Decommissioning: - Willingness to pause or stop AI if safety concerns - Clear criteria for decommissioning - Patient safety > sunk costs
Conclusion
Medical AI safety is not an afterthought. It’s a fundamental requirement. The promise of AI to improve diagnosis, personalize treatment, and reduce errors can only be realized if AI systems are rigorously validated, thoughtfully integrated, continuously monitored, and honestly communicated (Kelly et al., 2019; Topol, 2019).
The history of medical AI includes both successes (IDx-DR improving diabetic retinopathy screening access) and failures (Epic sepsis model, IBM Watson). The difference lies not in the sophistication of the algorithms, but in the rigor of validation, honesty about limitations, and commitment to ongoing safety monitoring.
Core Safety Principles:
- Retrospective accuracy does not equal real-world safety: demand prospective validation
- External validation is essential: don’t trust vendor claims alone
- Monitor continuously: performance drifts over time
- Report failures transparently: learning requires honesty
- Physician oversight is non-negotiable: AI assists, humans decide
- Build a safety culture: just culture, transparency, continuous improvement
- Put patients first: safety trumps efficiency or profit
AI has the potential to improve patient care dramatically. But that potential can only be realized if safety is treated as seriously as innovation. First, do no harm, for algorithms as for all medical interventions.