21 Clinical AI Safety and Risk Management
Patient safety must be central to AI deployment. This chapter examines safety frameworks, failure mode analysis, and risk mitigation strategies for clinical AI. You will learn to:
- Apply FDA regulatory frameworks for medical AI devices
 - Conduct systematic failure mode and effects analysis (FMEA) for AI systems
 - Recognize common AI failure patterns and sentinel events
 - Implement safety monitoring and adverse event reporting
 - Build a safety culture for AI deployment
 - Understand human factors and automation bias risks
 - Design fail-safe mechanisms for clinical AI
 
Essential for all physicians, healthcare administrators, and AI implementation teams.
21.1 Introduction
In 2021, a study examining Epic’s widely deployed sepsis prediction model revealed a sobering finding: the algorithm had a sensitivity of only 7% in real-world clinical use—meaning it missed 93% of sepsis cases (Wong et al. 2021). This was the same algorithm that had shown promising retrospective performance and had been implemented across hundreds of hospitals.
The Epic sepsis case illustrates a fundamental truth about medical AI safety: retrospective accuracy does not guarantee real-world safety. Unlike traditional medical devices (which fail predictably through mechanical breakdown or electrical malfunction), AI systems fail in subtle, context-dependent ways that may not be apparent until deployment.
This chapter examines the unique safety challenges of clinical AI, regulatory frameworks, systematic approaches to risk assessment, documented failure modes, and strategies for building a culture of AI safety.
21.2 Why AI Safety is Different
21.2.1 Traditional Medical Device Safety
Medical devices have well-established safety paradigms:
- Predictable failure modes: Pacemakers have battery depletion, monitors have sensor failures
- Testable before deployment: Devices can be bench-tested, stress-tested, validated in controlled conditions
- Static performance: Once validated, device performance doesn’t change (until hardware degrades)
- Visible failures: Most failures are obvious (device stops working, alarm sounds)
21.2.2 AI System Safety Challenges
Medical AI introduces fundamentally different risks:
1. Silent Failures:
- AI can produce plausible-looking but incorrect outputs
- Errors may not be immediately apparent to clinicians
- Example: AI misses a subtle fracture on an X-ray; the radiologist trusts the AI and also misses it

2. Context-Dependent Performance:
- AI performs differently across populations, hospitals, and workflows
- What works at an academic center may fail at a community hospital
- Performance varies with disease prevalence, patient demographics, and image acquisition protocols

3. Performance Drift Over Time:
- Clinical practice evolves (new treatments, changing patient populations)
- AI trained on historical data becomes outdated
- Performance degrades silently unless monitored (Finlayson et al. 2021)

4. Unpredictable Edge Cases:
- AI may fail catastrophically on inputs unlike its training data
- Impossible to test all possible scenarios
- Example: Chest X-ray AI trained pre-pandemic fails on COVID-19 pneumonia patterns

5. Inscrutability:
- Deep learning models often can’t explain their predictions
- Makes root cause analysis difficult when failures occur
- Clinicians can’t validate the AI’s reasoning process

6. Cascading Failures:
- AI errors propagate through clinical workflows
- Wrong AI prediction → wrong clinical decision → patient harm
- Multiple systems may compound errors
These differences demand new approaches to safety assessment and monitoring (Kelly et al. 2019).
21.3 FDA Regulatory Framework for Medical AI
The FDA regulates medical AI as Software as a Medical Device (SaMD) under existing device regulations, but is developing AI-specific frameworks.
21.3.1 SaMD Classification
Medical AI is classified based on risk level:
Class I: Low Risk
- Definition: Minimal risk to patients if device malfunctions
- Examples: Dental caries detection, skin condition photo analysis
- Regulation: Exempt from premarket notification (510(k))
- Requirements: General controls (labeling, adverse event reporting)

Class II: Moderate Risk
- Definition: Could cause temporary or minor harm if device malfunctions
- Examples: Computer-aided detection (CAD) for mammography, diabetic retinopathy screening
- Regulation: 510(k) clearance required (demonstrate substantial equivalence to a predicate device)
- Requirements: General + special controls (performance standards, post-market surveillance)

Class III: High Risk
- Definition: Could cause serious injury or death if device malfunctions
- Examples: Autonomous diagnostic systems, treatment decision algorithms
- Regulation: Premarket Approval (PMA) required (rigorous clinical trial evidence)
- Requirements: General + special controls + premarket approval

Most Medical AI Currently Approved:
- Majority are Class II devices (CAD, triage systems)
- Few Class III AI devices (FDA cautious about fully autonomous systems)
- Trend toward Class II with real-world performance monitoring
21.3.2 FDA’s AI/ML Action Plan
In 2021, the FDA released a framework specifically for AI/ML-based medical devices:
1. Pre-Determined Change Control Plans (PCCP):
- Allows manufacturers to update AI models without new FDA submissions
- Must specify:
  - Types of changes anticipated (new training data, architecture modifications)
  - Methodology for updates (retraining protocols, validation procedures)
  - Impact assessment (when changes require a new submission)
- Balances innovation (rapid updates) with safety (FDA oversight)

2. Good Machine Learning Practice (GMLP):
- Quality and safety standards for AI development
- Covers:
  - Data quality and representativeness
  - Feature engineering and selection
  - Model training and testing
  - Performance monitoring
  - Documentation and transparency

3. Algorithm Change Protocol:
- Document describing how the algorithm will be modified post-market
- Safety and performance guardrails
- Triggers for re-validation

4. Real-World Performance Monitoring:
- Manufacturers must monitor deployed AI performance
- Report performance drift or safety signals
- Update or withdraw the device if performance degrades

5. Transparency and Explainability:
- FDA encourages (but doesn’t mandate) transparency about:
  - How the algorithm works
  - Training data characteristics
  - Intended use and limitations
  - Known failure modes
- Trend toward requiring more transparency for high-risk devices

Implications for Healthcare Organizations:
- Can’t assume FDA clearance = proven clinical benefit
- 510(k) clearance means “similar to existing device,” not “clinically validated”
- PMA devices have more rigorous evidence
- Organizations must conduct their own validation even for FDA-cleared AI
21.4 Failure Mode and Effects Analysis (FMEA) for AI
FMEA is a systematic approach to identifying potential failures before they cause harm. Applied to AI, the process proceeds as follows.
21.4.1 FMEA Process
Step 1: Map Clinical Workflow
- Document the end-to-end process where AI will be used
- Identify inputs, outputs, decision points, handoffs

Example: AI for Pulmonary Embolism (PE) Detection on CT
- Input: CT pulmonary angiography scan
- AI processing: Algorithm analyzes images, outputs PE probability
- Notification: Alerts radiologist if high probability
- Review: Radiologist reviews images and AI output
- Reporting: Radiologist issues final report
- Action: Clinical team acts on report
Step 2: Identify Potential Failure Modes
For each step, brainstorm what could go wrong:
| Workflow Step | Potential Failure Modes | 
|---|---|
| Image acquisition | Poor image quality (motion, contrast timing), incompatible scanner | 
| AI processing | Software crash, wrong patient, dataset shift, spurious correlation | 
| Notification | Alert doesn’t fire, alert sent to wrong person, alert buried in inbox | 
| Radiologist review | Automation bias (misses error), alert fatigue (ignores AI), misinterprets AI output | 
| Reporting | Report unclear, doesn’t reach ordering provider | 
| Clinical action | Provider doesn’t see report, misinterprets recommendation, delays treatment | 
Step 3: Assess Severity, Likelihood, and Detectability
For each failure mode:
- Severity: How bad if it happens? (1=negligible, 10=catastrophic)
- Likelihood: How often will it happen? (1=rare, 10=frequent)
- Detectability: Will failure be caught before harm? (1=always detected, 10=never detected)
- Risk Priority Number (RPN) = Severity × Likelihood × Detectability
Example:
| Failure Mode | Severity | Likelihood | Detectability | RPN | Priority | 
|---|---|---|---|---|---|
| AI misses PE | 10 | 3 | 7 | 210 | HIGH | 
| False positive PE | 4 | 6 | 3 | 72 | MEDIUM | 
| Alert not delivered | 9 | 2 | 8 | 144 | HIGH | 
| Radiologist ignores alert | 8 | 4 | 6 | 192 | HIGH | 
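Because the arithmetic is simple, RPN scoring is easy to automate once the FMEA worksheet grows beyond a handful of rows. The Python sketch below reproduces the table above; the HIGH/MEDIUM cutoffs are illustrative assumptions that a governance committee would set locally, not part of any FMEA standard.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    """One row of an FMEA worksheet; field names are illustrative."""
    name: str
    severity: int       # 1 = negligible ... 10 = catastrophic
    likelihood: int     # 1 = rare ... 10 = frequent
    detectability: int  # 1 = always detected ... 10 = never detected

    @property
    def rpn(self) -> int:
        # Risk Priority Number = Severity x Likelihood x Detectability
        return self.severity * self.likelihood * self.detectability

modes = [
    FailureMode("AI misses PE", severity=10, likelihood=3, detectability=7),
    FailureMode("False positive PE", severity=4, likelihood=6, detectability=3),
    FailureMode("Alert not delivered", severity=9, likelihood=2, detectability=8),
    FailureMode("Radiologist ignores alert", severity=8, likelihood=4, detectability=6),
]

# Rank failure modes so mitigation effort targets the highest RPNs first.
# The HIGH/MEDIUM cutoffs below are illustrative, not part of any standard.
for fm in sorted(modes, key=lambda m: m.rpn, reverse=True):
    priority = "HIGH" if fm.rpn >= 120 else ("MEDIUM" if fm.rpn >= 60 else "LOW")
    print(f"{fm.name:<26} RPN={fm.rpn:<4} {priority}")
```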
Step 4: Implement Risk Mitigations
For high-priority failure modes, design safeguards:
AI Misses PE (RPN=210):
- Mitigation 1: Radiologist reviews all cases (not just AI-flagged ones)
- Mitigation 2: Quality assurance sampling (re-review AI-negative cases)
- Mitigation 3: Performance monitoring (track missed PE rate)
- Impact: Reduces detectability from 7 to 3 (RPN drops to 90)

Alert Not Delivered (RPN=144):
- Mitigation 1: Redundant notification (EHR inbox + page for critical findings)
- Mitigation 2: Require acknowledgment within 1 hour
- Mitigation 3: Escalation if not acknowledged
- Impact: Reduces likelihood from 2 to 1 (RPN drops to 72)
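The acknowledgment-and-escalation mitigation can be expressed as a small piece of logic. The sketch below assumes a one-hour deadline per tier and a hypothetical three-person escalation chain; it is illustrative only, not an interface to any real paging system.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical escalation chain and deadline; real values come from local policy.
ESCALATION_CHAIN = ["interpreting radiologist", "attending radiologist", "department safety lead"]
ACK_DEADLINE = timedelta(hours=1)

def escalation_target(sent_at: datetime, acknowledged: bool, now: datetime) -> Optional[str]:
    """Who should be notified next, or None if acknowledged or chain exhausted.

    Each tier gets one ACK_DEADLINE before the alert escalates to the next
    person; exhausting the chain should itself be logged as a safety event.
    """
    if acknowledged:
        return None
    tiers_overdue = int((now - sent_at) / ACK_DEADLINE)  # 0 while within the first deadline
    if 1 <= tiers_overdue < len(ESCALATION_CHAIN):
        return ESCALATION_CHAIN[tiers_overdue]
    return None

# Example: an unacknowledged critical alert sent 90 minutes ago escalates to the attending.
sent = datetime.now(timezone.utc) - timedelta(minutes=90)
print(escalation_target(sent, acknowledged=False, now=datetime.now(timezone.utc)))
```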
Step 5: Document and Monitor
- Document FMEA findings and mitigations
- Revisit FMEA periodically (workflows and AI change)
- Track actual failures and update risk assessments
21.4.2 AI-Specific FMEA Considerations
1. Data Quality Failures:
- Incorrect patient matched to AI input
- Missing or corrupted data elements
- Data format incompatible with AI expectations

2. Model Performance Failures:
- Dataset shift (population differs from training)
- Adversarial inputs (deliberately fooling AI)
- Edge cases not in training data

3. Integration Failures:
- AI output misinterpreted by clinicians
- Timing issues (AI result arrives too late)
- AI recommendations conflict with other clinical data

4. Human Factors Failures:
- Automation bias (over-reliance on AI)
- Alert fatigue (too many false positives)
- Loss of clinical skills from AI dependence
21.5 Common AI Failure Patterns
Understanding how AI systems fail helps prevent and detect errors.
21.5.1 1. Dataset Shift and Generalization Failure
What It Is: AI trained on one population or setting performs poorly when deployed in a different context.
Why It Happens:
- Training data not representative of the deployment population
- Clinical workflows differ between development and deployment sites
- Disease prevalence, patient demographics, or comorbidities differ
Examples:
COVID-19 Chest X-ray AI (DeGrave, Janizek, and Lee 2021):
- AI trained on pre-pandemic chest X-rays to detect pneumonia
- When deployed during the pandemic, many AI systems failed on COVID-19 pneumonia
- Reason: COVID-19 patterns not in training data, so AI learned non-generalizable features
- Some AI learned to detect portable X-rays (used for sicker patients) rather than actual pneumonia

Pneumonia Detection Dataset Shift (Zech et al. 2018):
- AI trained at one hospital achieved 90%+ accuracy
- Same AI deployed at a different hospital: accuracy dropped to ~60%
- Reason: AI learned hospital-specific artifacts (patient positioning, X-ray machine markers) instead of pneumonia

Mitigation:
- Train on diverse data from multiple institutions
- External validation before deployment
- Monitor real-world performance continuously
- Retrain when performance drifts
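As a concrete starting point for the external-validation and monitoring mitigations, discrimination can be compared across sites (or scanners, protocols, demographic strata) on a local held-out sample. A minimal sketch, assuming scikit-learn is available and that labels, model scores, and a site identifier exist for each case:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_by_group(y_true, y_score, groups):
    """Discrimination (AUC) within each group: site, scanner, protocol, or demographic stratum.

    A substantial drop relative to the development data is a dataset-shift
    signal and should pause deployment pending review.
    """
    y_true, y_score, groups = map(np.asarray, (y_true, y_score, groups))
    results = {}
    for g in np.unique(groups):
        mask = groups == g
        if len(np.unique(y_true[mask])) < 2:
            results[g] = float("nan")  # AUC undefined with only one class present
        else:
            results[g] = roc_auc_score(y_true[mask], y_score[mask])
    return results

# Usage (hypothetical data): compare the vendor's reported development AUC with, e.g.,
# auc_by_group(local_labels, local_model_scores, local_site_ids)["community_hospital"].
```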
21.5.2 2. Spurious Correlations (Clever Hans Effect)
What It Is: AI learns irrelevant patterns that happen to correlate with outcome in training data but don’t reflect true causal relationships.
Why It Happens:
- Training data contains confounding variables
- AI optimizes for accuracy, not clinical reasoning
- Limited data causes AI to latch onto any predictive signal
Examples:
Skin Cancer Detection and Rulers (Esteva et al. 2017):
- Dermatology AI appeared highly accurate
- Later discovered AI partially relied on rulers/color calibration markers in images
- Malignant lesions more likely to be photographed with rulers (clinical documentation practice)
- AI learned “ruler = cancer” instead of visual features of cancer

ICU Mortality Prediction and Time of Admission:
- AI predicted ICU mortality based on admission time
- Patients admitted at night had higher mortality
- AI learned “night admission = high risk” rather than disease severity
- Spurious correlation: sicker patients tend to arrive at night

Mitigation:
- Careful feature engineering (include only clinically relevant variables)
- Interpretability analysis (understand what AI is using)
- Adversarial testing (remove expected signals, see if performance drops)
- Clinical review of AI features/logic
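The “remove expected signals” idea in adversarial testing can be approximated with a simple ablation check: blank out the clinically relevant inputs and verify that performance collapses. The sketch below assumes a tabular, scikit-learn-style classifier; for imaging models the analogous test masks the anatomy of interest. The function name and the retained-AUC cutoff are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ablation_check(model, X, y, clinical_cols, max_retained_auc=0.65):
    """Zero out the clinically relevant columns and recompute discrimination.

    If the ablated model still discriminates well (above max_retained_auc,
    an arbitrary illustrative cutoff), it is likely exploiting confounders
    such as site, device, or documentation artifacts, and needs
    interpretability review before deployment.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    baseline_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
    X_ablated = X.copy()
    X_ablated[:, clinical_cols] = 0.0   # remove the expected clinical signal
    ablated_auc = roc_auc_score(y, model.predict_proba(X_ablated)[:, 1])
    return {
        "baseline_auc": baseline_auc,
        "ablated_auc": ablated_auc,
        "suspicious": ablated_auc > max_retained_auc,
    }
```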
21.5.3 3. Automation Bias and Over-Reliance
What It Is: Clinicians uncritically accept AI recommendations, even when wrong or when contradicted by other clinical information.
Why It Happens:
- Cognitive bias toward trusting automated systems
- AI presented as authoritative (“algorithm says…”)
- Time pressure and cognitive load
- Deskilling from prolonged AI use (loss of independent judgment)
Evidence:
Radiology Studies (Beam, Manrai, and Ghassemi 2020):
- Radiologists shown AI-flagged images made more errors than without AI when AI was wrong
- Effect stronger for less experienced radiologists
- Automation bias overcame clinical judgment

Pathology AI (Campanella et al. 2019):
- Pathologists reviewing AI-assisted slides sometimes missed obvious errors
- Trust in AI reduced vigilance

Mitigation:
- Present AI as “second opinion,” not ground truth
- Require independent clinical assessment before viewing AI output (for high-stakes decisions)
- Training on automation bias recognition
- Audit cases where clinician agreed with incorrect AI
- Calibrate trust: highlight when AI is uncertain or in novel scenario
21.5.4 4. Alert Fatigue and Integration Failures
What It Is: AI produces too many alerts (often false positives), causing clinicians to ignore all alerts, including true positives.
Why It Happens:
- AI optimized for high sensitivity, accepting low specificity
- Poor integration with clinical workflow (alerts at wrong time, wrong place)
- No prioritization (all alerts treated equally)
Examples:
Epic Sepsis Model (Wong et al. 2021):
- Low sensitivity (missed most sepsis) but still produced many false positives
- Clinicians became desensitized to alerts
- True positives ignored alongside false positives

General EHR Alert Fatigue:
- Studies show clinicians override 49-96% of drug interaction alerts
- Adding AI alerts without workflow consideration worsens the problem

Mitigation:
- Tune AI threshold based on an acceptable false positive rate (not just maximizing sensitivity)
- Smart alerting: right information, right person, right time, right format
- Tiered alerts: critical vs. informational
- Require acknowledgment for critical alerts with escalation
- Monitor alert override rates and reasons
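Tuning the alert threshold against an explicit false-positive budget, rather than simply maximizing sensitivity, is straightforward once the budget has been agreed. A minimal sketch (the 10% budget is purely illustrative; the real number is a governance decision):

```python
import numpy as np

def threshold_for_fpr_budget(y_true, y_score, max_fpr=0.10):
    """Loosest decision threshold whose false positive rate stays within budget.

    Looser thresholds fire more alerts; this picks the most sensitive
    operating point that still respects the agreed false-positive budget.
    Returns None if even the strictest threshold exceeds the budget.
    """
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    n_negative = max(int((y_true == 0).sum()), 1)
    best = None
    for t in np.unique(y_score)[::-1]:  # strictest threshold first
        alerts = y_score >= t
        fpr = int((alerts & (y_true == 0)).sum()) / n_negative
        if fpr <= max_fpr:
            best = t        # still within budget; keep loosening
        else:
            break           # FPR only grows as the threshold loosens
    return best
```

The chosen threshold should then be re-checked for the sensitivity it actually delivers and for expected alert volume per day, not adopted blindly.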
21.5.5 5. Performance Drift Over Time
What It Is: AI performance degrades after deployment as clinical practice, patient populations, or data characteristics evolve.
Why It Happens:
- Clinical practice changes (new treatments, diagnostic criteria, guidelines)
- Patient demographics shift
- Changes in data collection or EHR systems
- AI becomes outdated but continues to be used

Example: Cardiovascular Risk Prediction (Finlayson et al. 2021):
- Risk models trained on historical data
- Performance degrades over time as treatment improves (statins, blood pressure management)
- Historical risk factors less predictive in modern era

Mitigation:
- Continuous performance monitoring
- Set performance thresholds triggering retraining
- Regular scheduled revalidation (e.g., annually)
- Version control and change management
- Willingness to decommission outdated AI
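Continuous monitoring with a retraining trigger can be as simple as a rolling-window metric compared against a locally agreed floor. A sketch, with the window size and floor as placeholders to be set during validation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def rolling_auc_alerts(y_true, y_score, window=500, auc_floor=0.75):
    """Yield (window_start, auc, breached) over consecutive blocks of cases.

    window and auc_floor are placeholders; real values should come from the
    validation study and the governance committee. A breach should trigger
    the review/retraining pathway rather than being silently logged.
    """
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    for start in range(0, len(y_true) - window + 1, window):
        yt = y_true[start:start + window]
        ys = y_score[start:start + window]
        if len(np.unique(yt)) < 2:
            continue  # AUC is undefined when a window contains a single class
        auc = roc_auc_score(yt, ys)
        yield start, auc, auc < auc_floor
```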
21.6 Safety Monitoring and Adverse Event Reporting
Ongoing monitoring is essential for catching AI failures before widespread harm.
21.6.1 Real-World Performance Monitoring
What to Monitor:
1. Discrimination Metrics:
- Sensitivity, specificity, AUC-ROC
- Track overall and by patient subgroups
- Set thresholds for acceptable performance

2. Calibration:
- Do predicted probabilities match observed outcomes?
- Example: Of patients the AI predicts to have 30% mortality risk, do ~30% actually die?
- Miscalibration suggests model drift (a minimal calibration check is sketched after this list)

3. Alert Metrics:
- Alert rate (alerts per day)
- Override rate (% of alerts ignored)
- False positive and false negative rates
- Positive predictive value in clinical practice

4. Clinical Outcomes:
- Patient outcomes when AI used vs. not used (if feasible)
- Time to treatment, missed diagnoses, unnecessary testing
- Ideally compare against pre-AI baseline

5. Subgroup Performance:
- Performance across race, ethnicity, age, sex, insurance status
- Detect disparate impact or bias
- Ensure equity

6. User Metrics:
- Physician trust and satisfaction
- Workflow disruption reports
- Time spent reviewing AI outputs
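A minimal calibration check, mentioned under item 2 above: bin cases by predicted probability and compare the mean prediction with the observed event rate in each bin. Running the same loop within demographic subgroups also covers item 5. The bin count is arbitrary.

```python
import numpy as np

def calibration_table(y_true, y_prob, n_bins=10):
    """Mean predicted probability vs. observed event rate per probability bin.

    Large gaps (e.g., predicted ~0.30 but observed ~0.10) suggest
    miscalibration and possible model drift. Rerun within subgroups
    (race, sex, age band) to check for disparate calibration.
    """
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # Final bin is closed on the right so probabilities of exactly 1.0 are counted.
        mask = (y_prob >= lo) & ((y_prob <= hi) if i == n_bins - 1 else (y_prob < hi))
        if not mask.any():
            continue
        rows.append({
            "bin": f"{lo:.1f}-{hi:.1f}",
            "n": int(mask.sum()),
            "mean_predicted": round(float(y_prob[mask].mean()), 3),
            "observed_rate": round(float(y_true[mask].mean()), 3),
        })
    return rows
```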
How to Monitor:
Automated Dashboards:
- Real-time or daily updates on key metrics
- Alert when metrics fall below thresholds
- Drill-down capability for root cause analysis

Periodic Audits:
- Sample cases for detailed review
- Compare AI output to ground truth
- Identify systematic errors

Prospective Studies:
- Randomized trials or cohort studies evaluating AI impact
- Gold standard but resource-intensive
21.6.2 Adverse Event Reporting
What Counts as an AI Adverse Event:
- Incorrect AI output leading to patient harm (delayed diagnosis, wrong treatment)
- AI system failure preventing timely care
- Alert fatigue causing a true positive to be ignored
- Workflow disruption from AI integration
- Privacy breach from AI system
Reporting Mechanisms:
Internal Reporting:
- Easy-to-use reporting system for clinicians
- Non-punitive culture (just culture, not blame culture)
- Rapid response to reports
- Feedback to reporters on outcomes

FDA Reporting (Medical Device Reporting, MDR):
- Required for manufacturers and user facilities
- Report if the AI device:
  - Caused or contributed to death or serious injury
  - Malfunctioned and would likely cause harm if the malfunction recurred
- Timelines: Death (manufacturer: 30 days; user facility: 10 days), injury (manufacturer: 30 days; user facility: annually)

Institutional Quality/Safety Reporting:
- Incorporate AI into existing safety event reporting
- Root cause analysis (RCA) for serious AI-related events
- Failure mode analysis to prevent recurrence

Learning from Events:
- Share lessons across institutions (de-identified case reports)
- National registries for AI adverse events (emerging)
- Vendor accountability (require vendors to address identified failures)
21.7 Building a Safety Culture for AI
Technology alone doesn’t ensure safety—organizational culture matters.
21.7.1 Core Principles
1. Physician Oversight is Non-Negotiable:
- AI assists, humans decide (especially for high-stakes decisions)
- Physicians retain ultimate authority and accountability
- Can’t delegate responsibility to algorithms

2. Transparency About Limitations:
- Honest communication about what AI can and can’t do
- Don’t oversell AI capabilities to staff or patients
- Acknowledge uncertainty

3. Just Culture:
- Encourage error reporting without blame
- Focus on system improvement, not individual fault
- Psychological safety for raising concerns about AI

4. Continuous Learning:
- Every failure is a learning opportunity
- Regular review of AI performance and incidents
- Update protocols based on lessons learned

5. Patient-Centered:
- Safety trumps efficiency or cost
- Patient welfare is always the first priority
- Equitable AI performance across patient populations
21.7.2 Organizational Safeguards
AI Governance Committee:
- Multidisciplinary: clinicians, informatics, quality/safety, ethics, legal
- Reviews AI before deployment (safety assessment, FMEA)
- Monitors AI performance and adverse events
- Authority to pause or decommission AI if safety concerns arise

Training and Education:
- Educate clinicians about AI capabilities and limitations
- Training on automation bias and appropriate AI use
- Competency assessment before independent use

Standard Operating Procedures:
- Document clinical protocols for AI use
- Escalation procedures for AI failures or uncertain cases
- Criteria for overriding AI recommendations

Audit and Feedback:
- Regular audits of AI-assisted cases
- Feedback to clinicians on performance
- Identify and address misuse or over-reliance
21.8 Case Studies: Learning from AI Safety Failures
21.8.1 Case Study 1: Epic Sepsis Model
Background:
- Sepsis prediction model widely deployed across U.S. hospitals
- Promised early sepsis detection to improve outcomes
- Retrospective studies showed reasonable accuracy

What Went Wrong (Wong et al. 2021):
- External validation study (University of Michigan) found:
  - Sensitivity only 7% (missed 93% of sepsis cases)
  - Positive predictive value 18% (82% false positives)
  - Performance far worse than retrospective claims

Root Causes:
- Dataset shift: training data from different patient populations
- Retrospective validation overestimated performance (selection bias)
- Integration issues: alert timing often too late
- Lack of prospective validation before wide deployment

Lessons:
- External validation essential (don’t trust vendor claims alone)
- Retrospective accuracy ≠ prospective clinical utility
- Test AI in your specific population before relying on it
- Monitor real-world performance continuously

Outcome:
- Many hospitals paused or discontinued use
- Epic modified algorithm and validation approach
- Highlighted need for transparency in AI performance claims
21.8.2 Case Study 2: IBM Watson for Oncology
Background:
- IBM marketed Watson as AI for personalized cancer treatment
- Promised evidence-based treatment recommendations
- Adopted by hospitals worldwide

What Went Wrong (Ross and Swetlitz 2018):
- STAT News investigation revealed:
  - Unsafe and incorrect treatment recommendations
  - Recommendations based on limited training data (synthetic cases from a single cancer center)
  - Never validated in prospective clinical trials
  - Doctors trained to use Watson in 2-day sessions (insufficient)

Examples of Unsafe Recommendations:
- Recommended chemotherapy for a patient with severe bleeding (contraindication)
- Suggested drugs in combinations not proven safe
- Treatment plans contradicting evidence-based guidelines

Root Causes:
- Marketing hype exceeded actual capabilities
- Insufficient clinical validation
- Training data not representative (synthetic, not real patients)
- Lack of physician oversight in recommendation generation

Lessons:
- Demand rigorous clinical trial evidence, not just demonstrations
- Marketing claims ≠ clinical validation
- AI for high-stakes decisions (cancer treatment) requires the highest evidence standard
- Physician expertise cannot be replaced by insufficiently validated AI

Outcome:
- IBM scaled back Watson Health initiatives
- Many hospitals discontinued use
- Cautionary tale about AI hype vs. reality
21.8.3 Case Study 3: Chest X-ray AI and COVID-19
Background:
- Multiple AI systems developed for pneumonia detection from chest X-rays
- Appeared highly accurate in retrospective studies
- Deployed during the COVID-19 pandemic

What Went Wrong (DeGrave, Janizek, and Lee 2021):
- Many AI systems failed on COVID-19 pneumonia:
  - Trained on pre-pandemic data (no COVID-19 patterns)
  - Learned spurious correlations (lateral decubitus positioning, portable X-rays)
  - Poor generalization to novel disease

Documented Issues:
- AI detected “pneumonia” based on portable vs. fixed X-ray equipment
- Picked up hospital-specific artifacts, text overlays, positioning
- Failed to detect actual COVID-19 pneumonia features

Root Causes:
- Training data biases (sicker patients → portable X-rays)
- Lack of causal reasoning (correlations mistaken for disease features)
- Insufficient stress testing on out-of-distribution cases
- Rapid deployment without adequate validation

Lessons:
- AI doesn’t truly “understand” disease—it learns statistical patterns
- Training data biases lead to spurious correlations
- Test AI on out-of-distribution data before deployment
- Pandemic highlighted need for robust, generalizable AI
21.9 Recommendations for Safe AI Implementation
21.9.1 Pre-Deployment
✅ 1. Rigorous Validation:
- Prospective validation in your target population
- External validation if possible
- Subgroup analysis (race, age, sex, insurance, disease severity)

✅ 2. Failure Mode Analysis:
- Conduct FMEA before deployment
- Identify high-risk failure modes
- Design mitigations and safeguards

✅ 3. Human Factors Evaluation:
- Test AI in realistic clinical workflow
- Assess usability, alert design, integration
- Identify automation bias risks

✅ 4. Transparent Communication:
- Educate clinicians about AI capabilities and limitations
- Set realistic expectations
- Training on appropriate use

✅ 5. Safety Protocols:
- Standard operating procedures for AI use
- Escalation procedures for failures or uncertain cases
- Oversight and accountability structure
21.9.2 During Use
✅ 6. Real-World Performance Monitoring:
- Continuous tracking of key metrics
- Dashboards with automated alerts for performance drops
- Regular reporting to governance committee

✅ 7. Adverse Event Reporting:
- Easy, non-punitive reporting system
- Rapid investigation and response
- Sharing lessons learned

✅ 8. Physician Oversight:
- AI recommendations reviewed by qualified clinicians
- Physicians retain final decision authority
- Can’t delegate responsibility to algorithms

✅ 9. Patient Communication:
- Inform patients about AI use (tiered consent approach)
- Transparency about limitations
- Respect patient preferences
21.9.3 Ongoing
✅ 10. Regular Safety Audits:
- Periodic review of AI performance and incidents
- Update risk assessments and mitigations
- Assess for performance drift

✅ 11. Revalidation:
- Scheduled revalidation (e.g., annually)
- After major clinical practice changes
- When patient population characteristics shift

✅ 12. Continuous Improvement:
- Learn from failures and near-misses
- Update AI, protocols, or training based on lessons
- Stay current with evolving best practices

✅ 13. Decommissioning:
- Willingness to pause or stop AI if safety concerns arise
- Clear criteria for decommissioning
- Patient safety > sunk costs
21.10 Conclusion
Medical AI safety is not an afterthought—it’s a fundamental requirement. The promise of AI to improve diagnosis, personalize treatment, and reduce errors can only be realized if AI systems are rigorously validated, thoughtfully integrated, continuously monitored, and honestly communicated (Kelly et al. 2019; Topol 2019).
The history of medical AI includes both successes (IDx-DR improving diabetic retinopathy screening access) and failures (Epic sepsis model, IBM Watson). The difference lies not in the sophistication of the algorithms, but in the rigor of validation, honesty about limitations, and commitment to ongoing safety monitoring.
Core Safety Principles:
- Retrospective accuracy ≠ real-world safety—demand prospective validation
 - External validation is essential—don’t trust vendor claims alone
 - Monitor continuously—performance drifts over time
 - Report failures transparently—learning requires honesty
 - Physician oversight is non-negotiable—AI assists, humans decide
 - Build a safety culture—just culture, transparency, continuous improvement
 - Put patients first—safety trumps efficiency or profit
 
AI has the potential to improve patient care dramatically. But that potential can only be realized if safety is treated as seriously as innovation. First, do no harm—for algorithms as for all medical interventions.