Clinical AI Safety and Risk Management
A widely deployed sepsis prediction algorithm missed 67% of sepsis cases, and 88% of its alerts were false positives. It had been validated retrospectively. It failed prospectively. Traditional medical devices break in obvious ways. AI systems fail silently, producing plausible-looking but dangerously wrong outputs. This chapter teaches you to recognize these failure modes before they harm patients.
After reading this chapter, you will be able to:
- Apply FDA regulatory frameworks for medical AI devices
- Conduct systematic failure mode and effects analysis (FMEA) for AI systems
- Recognize common AI failure patterns and sentinel events
- Implement safety monitoring and adverse event reporting
- Build a safety culture for AI deployment
- Understand human factors, automation bias, and cognitive anchoring risks
- Maintain independent clinical skills to prevent de-skilling
- Design fail-safe mechanisms for clinical AI
Introduction
In 2021, a study examining Epic’s widely deployed sepsis prediction model revealed a sobering finding: the algorithm had a sensitivity of only 33% in real-world clinical use, meaning it missed 67% of sepsis cases (Wong et al., 2021). The model also had a positive predictive value (PPV) of just 12%, meaning 88% of alerts were false positives. This was the same algorithm that had shown promising retrospective performance and had been implemented across hundreds of hospitals.
The Epic sepsis case illustrates a fundamental truth about medical AI safety: retrospective accuracy does not guarantee real-world safety. Unlike traditional medical devices (which fail predictably through mechanical breakdown or electrical malfunction), AI systems fail in subtle, context-dependent ways that may not be apparent until deployment.
This chapter examines the unique safety challenges of clinical AI, regulatory frameworks, systematic approaches to risk assessment, documented failure modes, and strategies for building a culture of AI safety.
AI Safety Within the Patient Safety Tradition
AI safety is not a new problem requiring new frameworks. It is the latest chapter in a patient safety tradition crystallized by the Institute of Medicine’s landmark 1999 report, To Err is Human: Building a Safer Health System (Kohn et al., 2000).
That report documented an uncomfortable truth: medical errors cause an estimated 44,000-98,000 deaths annually in U.S. hospitals, more than motor vehicle accidents, breast cancer, or AIDS. The IOM’s central insight was that these errors were not primarily caused by bad people making careless mistakes. They were good people working in bad systems. The solution was not punishment or retraining. It was redesigning systems to make errors less likely and less harmful when they occurred.
This framing applies directly to AI safety:
AI failures are system failures, not individual failures. When a physician over-relies on incorrect AI output, the failure is not the physician’s poor judgment. It is a system that presented AI recommendations without appropriate uncertainty indicators, training on automation bias, or workflow designs that preserve independent clinical reasoning.
Blame-free reporting enables learning. The IOM advocated for non-punitive error reporting so organizations could learn from failures. AI safety requires the same culture: physicians must feel safe reporting when AI recommendations were wrong or when they overrode AI incorrectly.
Design for safety, not just performance. The IOM emphasized building safety into systems through redundancy, forcing functions, and fail-safes. AI systems need similar design: not just high accuracy, but graceful degradation, uncertainty quantification, and human oversight at critical decision points.
The physicians who successfully navigate clinical AI will be those who understand that AI safety is patient safety, requiring the same systems thinking, just culture, and continuous improvement that the IOM articulated 25 years ago.
Why AI Safety is Different
Traditional Medical Device Safety
Medical devices have well-established safety paradigms: - Predictable failure modes: Pacemakers have battery depletion, monitors have sensor failures - Testable before deployment: Devices can be bench-tested, stress-tested, validated in controlled conditions - Static performance: Once validated, device performance doesn’t change (until hardware degrades) - Visible failures: Most failures are obvious (device stops working, alarm sounds)
AI System Safety Challenges
Medical AI introduces fundamentally different risks:
1. Silent Failures: - AI can produce plausible-looking but incorrect outputs - Errors may not be immediately apparent to clinicians - Example: AI misses subtle fracture on X-ray, radiologist trusts AI and also misses it
2. Context-Dependent Performance: - AI performs differently across populations, hospitals, workflows - What works at academic center may fail at community hospital - Performance varies with disease prevalence, patient demographics, image acquisition protocols
3. Performance Drift Over Time: - Clinical practice evolves (new treatments, changing patient populations) - AI trained on historical data becomes outdated - Performance degrades silently unless monitored (Finlayson et al., 2021)
4. Unpredictable Edge Cases: - AI may fail catastrophically on inputs unlike training data - Impossible to test all possible scenarios - Example: Chest X-ray AI trained pre-pandemic fails on COVID-19 pneumonia patterns - LLMs show 26-38% accuracy drops when faced with unfamiliar answer patterns, suggesting pattern matching over robust reasoning (Bedi et al., 2025)
5. Inscrutability: - Deep learning models often can’t explain their predictions - Makes root cause analysis difficult when failures occur - Clinicians can’t validate AI reasoning process
6. Cascading Failures: - AI errors propagate through clinical workflows - Wrong AI prediction → wrong clinical decision → patient harm - Multiple systems may compound errors
These differences demand new approaches to safety assessment and monitoring (Kelly et al., 2019).
FDA Regulatory Framework for Medical AI
The FDA regulates medical AI as Software as a Medical Device (SaMD) under existing device regulations, but is developing AI-specific frameworks.
SaMD Classification
Medical AI is classified based on risk level:
Class I: Low Risk - Definition: Minimal risk to patients if device malfunctions - Examples: Dental caries detection, skin condition photo analysis - Regulation: Exempt from premarket notification (510(k)) - Requirements: General controls (labeling, adverse event reporting)
Class II: Moderate Risk - Definition: Could cause temporary or minor harm if device malfunctions - Examples: Computer-aided detection (CAD) for mammography, diabetic retinopathy screening - Regulation: 510(k) clearance required (demonstrate substantial equivalence to predicate device) - Requirements: General controls plus special controls (performance standards, post-market surveillance)
Class III: High Risk - Definition: Could cause serious injury or death if device malfunctions - Examples: Autonomous diagnostic systems, treatment decision algorithms - Regulation: Premarket Approval (PMA) required (rigorous clinical trial evidence) - Requirements: General controls plus special controls plus premarket approval
Most Medical AI Currently Approved: - Majority are Class II devices (CAD, triage systems) - Few Class III AI devices (FDA cautious about fully autonomous systems) - Trend toward Class II with real-world performance monitoring
FDA’s AI/ML Action Plan
In 2021, the FDA released an action plan specifically for AI/ML-based medical devices:
1. Pre-Determined Change Control Plans (PCCP):
- Allows manufacturers to update AI models without new FDA submissions
- Must specify:
  - Types of changes anticipated (new training data, architecture modifications)
  - Methodology for updates (retraining protocols, validation procedures)
  - Impact assessment (when changes require a new submission)
- Balances innovation (rapid updates) with safety (FDA oversight)

2. Good Machine Learning Practice (GMLP):
- Quality and safety standards for AI development
- Covers:
  - Data quality and representativeness
  - Feature engineering and selection
  - Model training and testing
  - Performance monitoring
  - Documentation and transparency

3. Algorithm Change Protocol:
- Document describing how the algorithm will be modified post-market
- Safety and performance guardrails
- Triggers for re-validation

4. Real-World Performance Monitoring:
- Manufacturers must monitor deployed AI performance
- Report performance drift or safety signals
- Update or withdraw the device if performance degrades

5. Transparency and Explainability:
- FDA encourages (but doesn’t mandate) transparency about:
  - How the algorithm works
  - Training data characteristics
  - Intended use and limitations
  - Known failure modes
- Trend toward requiring more transparency for high-risk devices
Implications for Healthcare Organizations: - Can’t assume FDA clearance = proven clinical benefit - 510(k) clearance means “similar to existing device,” not “clinically validated” - PMA devices have more rigorous evidence - Organizations must conduct own validation even for FDA-cleared AI
Failure Mode and Effects Analysis (FMEA) for AI
FMEA is a systematic approach to identifying potential failures before they cause harm. Applied to AI:
FMEA Process
Step 1: Map Clinical Workflow - Document end-to-end process where AI will be used - Identify inputs, outputs, decision points, handoffs
Example: AI for Pulmonary Embolism (PE) Detection on CT
- Input: CT pulmonary angiography scan
- AI processing: Algorithm analyzes images, outputs PE probability
- Notification: Alerts radiologist if high probability
- Review: Radiologist reviews images and AI output
- Reporting: Radiologist issues final report
- Action: Clinical team acts on report
Step 2: Identify Potential Failure Modes
For each step, brainstorm what could go wrong:
| Workflow Step | Potential Failure Modes |
|---|---|
| Image acquisition | Poor image quality (motion, contrast timing), incompatible scanner |
| AI processing | Software crash, wrong patient, dataset shift, spurious correlation |
| Notification | Alert doesn’t fire, alert sent to wrong person, alert buried in inbox |
| Radiologist review | Automation bias (misses error), cognitive anchoring (AI biases judgment), alert fatigue (ignores AI), misinterprets AI output, loss of independent skills |
| Reporting | Report unclear, doesn’t reach ordering provider |
| Clinical action | Provider doesn’t see report, misinterprets recommendation, delays treatment |
Step 3: Assess Severity, Likelihood, and Detectability
For each failure mode, rate:
- Severity: How bad is it if it happens? (1 = negligible, 10 = catastrophic)
- Likelihood: How often will it happen? (1 = rare, 10 = frequent)
- Detectability: Will the failure be caught before harm? (1 = always detected, 10 = never detected)

Then compute the Risk Priority Number (RPN) = Severity × Likelihood × Detectability; a worked sketch follows the example table below.
Example:
| Failure Mode | Severity | Likelihood | Detectability | RPN | Priority |
|---|---|---|---|---|---|
| AI misses PE | 10 | 3 | 7 | 210 | HIGH |
| Cognitive anchoring to incorrect AI | 8 | 7 | 6 | 336 | CRITICAL |
| False positive PE | 4 | 6 | 3 | 72 | MEDIUM |
| Alert not delivered | 9 | 2 | 8 | 144 | HIGH |
| Radiologist ignores alert | 8 | 4 | 6 | 192 | HIGH |
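The RPN arithmetic above is simple enough to automate once an FMEA covers more than a handful of failure modes. The sketch below is illustrative only; the failure modes and 1-10 scores are the example values from the table, not validated ratings.

```python
# Minimal sketch: compute Risk Priority Numbers (RPN = severity x likelihood x detectability)
# for the example failure modes above and rank them. Scores are the illustrative 1-10
# ratings from the table, not validated values.
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int       # 1 = negligible, 10 = catastrophic
    likelihood: int     # 1 = rare, 10 = frequent
    detectability: int  # 1 = always detected, 10 = never detected

    @property
    def rpn(self) -> int:
        return self.severity * self.likelihood * self.detectability


failure_modes = [
    FailureMode("AI misses PE", 10, 3, 7),
    FailureMode("Cognitive anchoring to incorrect AI", 8, 7, 6),
    FailureMode("False positive PE", 4, 6, 3),
    FailureMode("Alert not delivered", 9, 2, 8),
    FailureMode("Radiologist ignores alert", 8, 4, 6),
]

# Rank highest-risk failure modes first so mitigation effort goes where the RPN is largest.
for fm in sorted(failure_modes, key=lambda f: f.rpn, reverse=True):
    print(f"{fm.name}: RPN = {fm.rpn}")
```

Re-running the same calculation with post-mitigation scores (as in the mitigations below) shows whether a safeguard actually moves a failure mode out of the critical range.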
Step 4: Implement Risk Mitigations
For high-priority failure modes, design safeguards:
Cognitive Anchoring to Incorrect AI (RPN=336): - Mitigation 1: Require radiologist to document preliminary impression before viewing AI output (for complex cases) - Mitigation 2: Training on anchoring bias recognition with case examples - Mitigation 3: Audit AI-concordant errors (cases where both AI and radiologist were wrong) - Mitigation 4: Monitor override rates (low override rate may indicate over-reliance) - Impact: Reduces likelihood from 7 to 4 and detectability from 6 to 4 (RPN drops to 128)
AI Misses PE (RPN=210): - Mitigation 1: Radiologist reviews all cases (not just AI-flagged ones) - Mitigation 2: Quality assurance sampling (re-review AI-negative cases) - Mitigation 3: Performance monitoring (track missed PE rate) - Impact: Reduces detectability from 7 to 3 (RPN drops to 90)
Alert Not Delivered (RPN=144): - Mitigation 1: Redundant notification (EHR inbox + page for critical findings) - Mitigation 2: Require acknowledgment within 1 hour - Mitigation 3: Escalation if not acknowledged - Impact: Reduces likelihood from 2 to 1 (RPN drops to 72)
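Several of these mitigations are fail-safe mechanisms that can be enforced in software rather than left to memory or habit. Below is a minimal sketch of the acknowledgment-and-escalation logic, assuming a one-hour acknowledgment window and a hypothetical escalation chain; the roles and function names are illustrative, not a vendor API.

```python
# Minimal sketch of an acknowledgment/escalation fail-safe for critical AI alerts.
# The 60-minute window and the escalation chain are illustrative assumptions.
from datetime import datetime, timedelta

ACK_WINDOW = timedelta(minutes=60)
ESCALATION_CHAIN = ["covering radiologist", "charge nurse", "attending on call"]  # hypothetical roles


def check_alert(sent_at: datetime, acknowledged: bool, escalation_level: int, now: datetime) -> tuple[int, str]:
    """Return (new_escalation_level, action) for a single critical alert."""
    if acknowledged:
        return escalation_level, "no action: alert acknowledged"
    # Each escalation step gets one acknowledgment window before moving up the chain.
    overdue = now - sent_at - ACK_WINDOW * (escalation_level + 1)
    if overdue < timedelta(0):
        return escalation_level, "no action: still within acknowledgment window"
    if escalation_level < len(ESCALATION_CHAIN):
        return escalation_level + 1, f"escalate: notify {ESCALATION_CHAIN[escalation_level]}"
    return escalation_level, "escalation chain exhausted: trigger manual safety review"


# Example: an unacknowledged alert sent 90 minutes ago escalates to the first backup contact.
level, action = check_alert(datetime(2025, 1, 1, 8, 0), False, 0, datetime(2025, 1, 1, 9, 30))
print(level, action)  # 1 escalate: notify covering radiologist
```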
Step 5: Document and Monitor - Document FMEA findings and mitigations - Revisit FMEA periodically (workflows and AI change) - Track actual failures and update risk assessments
AI-Specific FMEA Considerations
1. Data Quality Failures: - Incorrect patient matched to AI input - Missing or corrupted data elements - Data format incompatible with AI expectations
2. Model Performance Failures: - Dataset shift (population differs from training) - Adversarial inputs (deliberately fooling AI) - Edge cases not in training data
3. Integration Failures: - AI output misinterpreted by clinicians - Timing issues (AI result arrives too late) - AI recommendations conflict with other clinical data
4. Human Factors Failures: - Automation bias (over-reliance on AI) - Cognitive anchoring (AI biases clinician judgment) - Alert fatigue (too many false positives) - Loss of clinical skills from AI dependence (de-skilling) - Inability to diagnose independently when AI unavailable or unreliable
Common AI Failure Patterns
Understanding how AI systems fail helps prevent and detect errors.
1. Dataset Shift and Generalization Failure
What It Is: AI trained on one population/setting performs poorly when deployed in different context.
Why It Happens: - Training data not representative of deployment population - Clinical workflows differ between development and deployment sites - Disease prevalence, patient demographics, or comorbidities differ
Examples:
COVID-19 Chest X-ray AI (DeGrave et al., 2021): - AI trained on pre-pandemic chest X-rays to detect pneumonia - When deployed during pandemic, many AI systems failed on COVID-19 pneumonia - Reason: COVID-19 patterns not in training data, AI learned non-generalizable features - Some AI learned to detect portable X-rays (used for sicker patients) rather than actual pneumonia
Pneumonia Detection Dataset Shift (Zech et al., 2018): - AI trained at one hospital achieved 90%+ accuracy - Same AI deployed at different hospital: accuracy dropped to ~60% - Reason: AI learned hospital-specific artifacts (patient positioning, X-ray machine markers) instead of pneumonia
Mitigation: - Train on diverse data from multiple institutions - External validation before deployment - Monitor real-world performance continuously - Retrain when performance drifts
2. Spurious Correlations (Clever Hans Effect)
What It Is: AI learns irrelevant patterns that happen to correlate with outcome in training data but don’t reflect true causal relationships.
Why It Happens: - Training data contains confounding variables - AI optimizes for accuracy, not clinical reasoning - Limited data causes AI to latch onto any predictive signal
Examples:
Skin Cancer Detection and Rulers (Esteva et al., 2017): - Dermatology AI appeared highly accurate - Later discovered AI partially relied on rulers/color calibration markers in images - Malignant lesions more likely to be photographed with rulers (clinical documentation practice) - AI learned “ruler = cancer” instead of visual features of cancer
ICU Mortality Prediction and Time of Admission: - AI predicted ICU mortality based on admission time - Patients admitted at night had higher mortality - AI learned “night admission = high risk” rather than disease severity - Spurious correlation: sicker patients tend to arrive at night
Mitigation: - Careful feature engineering (include only clinically relevant variables) - Interpretability analysis (understand what AI is using) - Adversarial testing (remove expected signals, see if performance drops) - Clinical review of AI features/logic
3. Automation Bias and Over-Reliance
What It Is: Clinicians uncritically accept AI recommendations, even when wrong or when contradicted by other clinical information.
Why It Happens: - Cognitive bias toward trusting automated systems - AI presented as authoritative (“algorithm says…”) - Time pressure and cognitive load - Deskilling from prolonged AI use (loss of independent judgment)
Evidence:
Radiology Studies (Beam & Kohane, 2018): - Radiologists reviewing AI-flagged images made more errors when the AI was wrong than radiologists reading without AI - Effect stronger for less experienced radiologists - Automation bias overcame clinical judgment
Pathology AI (Campanella et al., 2019): - Pathologists reviewing AI-assisted slides sometimes missed obvious errors - Trust in AI reduced vigilance
Cognitive Anchoring as a Critical Safety Failure Mode
Cognitive anchoring represents one of the most insidious AI safety risks: clinicians systematically bias their assessment toward AI recommendations, losing the capacity for independent clinical judgment that serves as the essential safety net for catching AI errors.
The Anchoring Effect:
When clinicians see an AI recommendation before forming their own clinical impression, the AI output becomes a cognitive anchor that disproportionately influences their final judgment. This differs from simple automation bias (accepting AI uncritically) in that clinicians believe they are exercising independent judgment when, in fact, their reasoning has been subtly channeled by the AI’s suggestion (Khullar, 2025).
Why Anchoring is Particularly Dangerous:
- Invisible influence: Clinicians don’t recognize they’ve been anchored, assuming their judgment remains independent
- Affects experienced physicians: Unlike simple automation bias (stronger in novices), anchoring can influence even seasoned clinicians
- Undermines error detection: The AI’s primary safety check (physician oversight) becomes compromised
- Compounds over time: Repeated exposure to AI recommendations gradually erodes independent diagnostic capability
FMEA Risk Assessment for Cognitive Anchoring:
| Parameter | Rating | Rationale |
|---|---|---|
| Severity | High (8/10) | Missed diagnoses, inappropriate treatments, patient harm |
| Likelihood | High (7/10) | Well-documented cognitive bias, occurs across specialties |
| Detectability | Medium (6/10) | Difficult to distinguish from appropriate AI-concordant decisions |
| Risk Priority Number | 336 | CRITICAL PRIORITY |
Evidence of Anchoring in Clinical AI:
Diagnostic Decision-Making: - Physicians shown AI diagnostic suggestions anchored toward those diagnoses, even when clinical findings pointed elsewhere - Effect persisted when AI was explicitly labeled as “low confidence” - Clinicians underweighted contradictory clinical information
Radiology Interpretation: - Radiologists shown AI-flagged regions focused disproportionately on those areas - Missed abnormalities in non-flagged regions they would have caught without AI - False sense of completeness: “AI didn’t flag anything else, so I’m done”
Mitigation Strategies for Anchoring:
1. Workflow Design: - Independent assessment first: Require clinicians to document preliminary impression before viewing AI output (for high-stakes diagnoses) - Delayed AI display: AI recommendations appear only after clinician forms initial judgment - Blind review sampling: Periodic audits where clinicians interpret cases without AI access - Two-stage review: Independent clinician review, then AI-assisted review, then reconciliation
2. Training and Awareness: - Education on anchoring bias and its mechanisms - Case-based training showing examples of AI-induced anchoring - Regular feedback on cases where clinician anchored to incorrect AI - Competency assessment: Can physicians diagnose accurately without AI?
3. Performance Monitoring (a monitoring sketch follows this list):
- Audit AI-concordant errors: review cases where the clinician agreed with incorrect AI
- Monitor override rates: paradoxically, a very low override rate may indicate over-reliance rather than AI accuracy
  - Healthy override rate (10-20%) suggests independent clinical judgment
  - Very low override rate (<5%) raises concern for anchoring
  - Very high override rate (>50%) suggests the AI is not clinically useful
- Track diagnostic accuracy with vs. without AI: compare performance when AI is available vs. unavailable
4. AI Interface Design: - Present AI as “additional data point,” not “recommendation” - Require clinicians to justify agreement or disagreement with AI - Show confidence intervals/uncertainty (discourage anchoring to low-confidence outputs) - Avoid authoritative framing (“AI diagnosis:”) in favor of neutral language (“AI analysis suggests consideration of:”)
5. Organizational Safeguards: - Regular case conferences reviewing AI-discordant cases (AI wrong, clinician caught it) and AI-concordant errors (AI wrong, clinician missed it) - Celebrate appropriate AI overrides (reinforce that disagreeing with AI is professionally acceptable) - Continuous learning culture: every anchoring-related near-miss triggers workflow review
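Override-rate screening (referenced in the performance monitoring item above) is easy to automate. The sketch below encodes the chapter's heuristic bands; the thresholds are illustrative rules of thumb, not validated standards.

```python
# Minimal sketch: flag override-rate patterns using the heuristic bands described above.
# Thresholds (<5%, 10-20%, >50%) are the chapter's illustrative values, not validated standards.
def assess_override_rate(n_ai_recommendations: int, n_overrides: int) -> str:
    if n_ai_recommendations == 0:
        return "no data"
    rate = n_overrides / n_ai_recommendations
    if rate < 0.05:
        return f"{rate:.1%} override rate: possible over-reliance/anchoring -- audit AI-concordant cases"
    if rate > 0.50:
        return f"{rate:.1%} override rate: AI may not be clinically useful -- review threshold and use case"
    if 0.10 <= rate <= 0.20:
        return f"{rate:.1%} override rate: consistent with independent clinical judgment"
    return f"{rate:.1%} override rate: within a broadly acceptable range -- continue monitoring"


print(assess_override_rate(500, 12))  # 2.4% -> possible over-reliance
print(assess_override_rate(500, 75))  # 15.0% -> consistent with independent judgment
```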
De-skilling as Long-Term Safety Risk
Prolonged reliance on AI diagnostic support creates a secondary safety risk: physicians lose the independent diagnostic capability necessary to catch AI errors. This represents an insidious erosion of the very safety mechanism (physician oversight) that AI deployment assumes will prevent harm.
The De-skilling Phenomenon:
What Happens: - Physicians who routinely use AI assistance gradually lose fluency in independent diagnosis - Pattern recognition skills atrophy from disuse - Diagnostic reasoning becomes dependent on AI prompts - When AI fails or is unavailable, physicians struggle to diagnose independently
Why It Matters for Safety: - AI safety model assumes physician can catch AI errors through independent judgment - If physicians can no longer diagnose without AI, this safety net disappears - Creates dangerous dependency: AI errors go uncaught because physicians lack capability to recognize them - Particularly concerning for rare diseases or atypical presentations (where AI may be unreliable and physician pattern recognition critical)
Evidence:
Radiology De-skilling: - Residents trained with AI assistance showed reduced independent interpretation skills - When AI removed, performance dropped below baseline - Particular deficits in subtle findings and atypical presentations
Clinical Reasoning: - Medical students using AI diagnostic assistants during training showed weaker differential diagnosis generation - Dependence on AI suggestions rather than systematic clinical reasoning - Difficulty articulating reasoning independent of AI prompts
Certification and Competency Questions:
The rise of AI assistance forces difficult questions about physician competency:
- Can physicians diagnose without AI? If not, what happens during system downtime or novel scenarios where AI is unreliable?
- Should board certification exams include AI-free assessments? Ensuring baseline independent diagnostic capability
- How do we maintain skills in AI era? Deliberate practice without AI assistance to preserve independent judgment
- What is minimum acceptable independent performance? Standards for physicians working with AI
Mitigation Strategies for De-skilling:
1. Training Programs: - AI-free training periods: Medical students and residents must demonstrate independent diagnostic competency before AI assistance - Continued AI-free practice: Regular cases interpreted without AI to maintain skills - Competency assessments: Periodic testing of independent diagnostic capability (without AI) - Cross-training: Rotate between AI-assisted and AI-free workflows
2. Workflow Integration: - Mandatory independent assessment: For complex or high-stakes cases, require documented independent impression before AI consultation - Scheduled AI-free days: Periodic practice without AI assistance to maintain skills - Case variety: Ensure physicians see sufficient case volume and diversity to maintain pattern recognition
3. Institutional Policies: - Competency standards: Define minimum independent diagnostic capability regardless of AI availability - Skills maintenance requirements: Continuing education focused on independent clinical reasoning - Backup protocols: Procedures for AI system downtime that don’t compromise patient safety - Hiring and credentialing: Assess independent diagnostic capability, not just AI-assisted performance
4. System Design: - Degradable AI: Systems designed to provide varying levels of assistance, allowing skills practice - Deliberate difficulty: Periodic cases where AI intentionally withheld to maintain physician skills - Educational mode: AI provides feedback after independent assessment rather than during
Cross-Reference: See Chapter 20 (AI-Human Collaboration Patterns) for workflow designs that preserve independent judgment while leveraging AI capabilities. The goal is not to avoid AI assistance, but to design collaboration patterns that maintain rather than erode physician diagnostic expertise.
Mitigation (General): - Present AI as “second opinion,” not ground truth - Require independent clinical assessment before viewing AI output (for high-stakes decisions) - Training on automation bias and anchoring recognition - Audit cases where clinician agreed with incorrect AI - Calibrate trust: highlight when AI is uncertain or in novel scenario - Monitor override rates to detect over-reliance - Ensure physicians maintain diagnostic competency independent of AI
4. Alert Fatigue and Integration Failures
What It Is: AI produces too many alerts (often false positives), causing clinicians to ignore all alerts, including true positives.
Why It Happens: - AI optimized for high sensitivity, accepting low specificity - Poor integration with clinical workflow (alerts at wrong time, wrong place) - No prioritization (all alerts treated equally)
Examples:
Epic Sepsis Model (Wong et al., 2021): - Low sensitivity (missed most sepsis) but still produced many false positives - Clinicians became desensitized to alerts - True positives ignored alongside false positives
General EHR Alert Fatigue: - Studies show clinicians override 49-96% of drug interaction alerts - Adding AI alerts without workflow consideration worsens problem
Mitigation: - Tune AI threshold based on acceptable false positive rate (not just maximizing sensitivity) - Smart alerting: right information, right person, right time, right format - Tiered alerts: critical vs. informational - Require acknowledgment for critical alerts with escalation - Monitor alert override rates and reasons
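Threshold tuning can be made explicit: sweep candidate thresholds over a locally collected validation set and choose the one that meets a stated PPV or alert-burden target rather than simply maximizing sensitivity. A minimal sketch, assuming you have logged predicted probabilities and adjudicated labels from your own validation cohort (the 20% PPV floor is an illustrative target):

```python
# Minimal sketch: choose an alert threshold that meets a minimum PPV target on a local
# validation set, rather than maximizing sensitivity alone. `probs` and `labels` are
# assumed to come from your own validation cohort; the PPV floor is illustrative.
def pick_threshold(probs: list[float], labels: list[int],
                   min_ppv: float = 0.20) -> tuple[float, float, float] | None:
    best = None
    for i in range(1, 100):
        t = i / 100
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        if tp + fp == 0:
            continue
        ppv = tp / (tp + fp)
        sens = tp / (tp + fn) if (tp + fn) else 0.0
        # Among thresholds meeting the PPV floor, keep the one with the highest sensitivity.
        if ppv >= min_ppv and (best is None or sens > best[1]):
            best = (t, sens, ppv)
    return best  # (threshold, sensitivity, ppv), or None if no threshold meets the target
```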
5. Performance Drift Over Time
What It Is: AI performance degrades after deployment as clinical practice, patient populations, or data characteristics evolve.
Why It Happens: - Clinical practice changes (new treatments, diagnostic criteria, guidelines) - Patient demographics shift - Changes in data collection or EHR systems - AI becomes outdated but continues to be used
Example: Cardiovascular Risk Prediction (Finlayson et al., 2021): - Risk models trained on historical data - Performance degrades over time as treatment improves (statins, blood pressure management) - Historical risk factors less predictive in modern era
Mitigation: - Continuous performance monitoring - Set performance thresholds triggering retraining - Regular scheduled revalidation (e.g., annually) - Version control and change management - Willingness to decommission outdated AI
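One way to operationalize "performance thresholds triggering retraining" is a simple drift check: compare a recent monitoring window against the validation baseline and flag when the drop exceeds a pre-specified tolerance. A minimal sketch with illustrative numbers:

```python
# Minimal sketch: flag performance drift when a recent window's metric falls more than a
# pre-specified tolerance below the validation baseline. All values are illustrative.
def drift_alert(baseline_auc: float, recent_auc: float, tolerance: float = 0.05) -> bool:
    """Return True if the recent AUC has dropped more than `tolerance` below baseline."""
    return (baseline_auc - recent_auc) > tolerance


if drift_alert(baseline_auc=0.86, recent_auc=0.78):
    print("Drift detected: notify the AI governance committee and schedule revalidation")
```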
Safety Monitoring and Adverse Event Reporting
Ongoing monitoring is essential for catching AI failures before widespread harm.
Real-World Performance Monitoring
What to Monitor:
1. Discrimination Metrics: - Sensitivity, specificity, AUC-ROC - Track overall and by patient subgroups - Set threshold for acceptable performance
2. Calibration: - Do predicted probabilities match observed outcomes? - Example: Of the patients the AI assigns a 30% mortality risk, do roughly 30% actually die? - Miscalibration suggests model drift (see the sketch after this list)
3. Alert Metrics: - Alert rate (alerts per day) - Override rate (% of alerts ignored) - False positive and false negative rates - Positive predictive value in clinical practice
4. Clinical Outcomes: - Patient outcomes when AI used vs. not used (if feasible) - Time to treatment, missed diagnoses, unnecessary testing - Ideally compare against pre-AI baseline
5. Subgroup Performance: - Performance across race, ethnicity, age, sex, insurance status - Detect disparate impact or bias - Ensure equity
6. User Metrics: - Physician trust and satisfaction - Workflow disruption reports - Time spent reviewing AI outputs
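Several of these metrics, calibration and subgroup sensitivity in particular, can be computed directly from logged predictions and adjudicated outcomes. A minimal sketch, assuming each logged case carries a predicted probability, an observed outcome, and a subgroup label (the data structures are illustrative):

```python
# Minimal sketch: decile calibration check and subgroup sensitivity from logged cases.
# Each case is (predicted_probability, observed_outcome, subgroup_label); the data are
# assumed to come from your own monitoring log.
from collections import defaultdict


def calibration_by_decile(cases: list[tuple[float, int, str]]) -> list[tuple[float, float, int]]:
    """Return (mean predicted risk, observed event rate, n) for each populated risk decile."""
    bins = defaultdict(list)
    for prob, outcome, _ in cases:
        bins[min(int(prob * 10), 9)].append((prob, outcome))
    summary = []
    for d in sorted(bins):
        probs = [p for p, _ in bins[d]]
        outcomes = [o for _, o in bins[d]]
        summary.append((sum(probs) / len(probs), sum(outcomes) / len(outcomes), len(outcomes)))
    return summary


def sensitivity_by_subgroup(cases: list[tuple[float, int, str]], threshold: float = 0.5) -> dict[str, float]:
    """Sensitivity (true positive rate) within each subgroup at a given alert threshold."""
    tp, fn = defaultdict(int), defaultdict(int)
    for prob, outcome, group in cases:
        if outcome == 1:
            if prob >= threshold:
                tp[group] += 1
            else:
                fn[group] += 1
    return {g: tp[g] / (tp[g] + fn[g]) for g in set(tp) | set(fn) if (tp[g] + fn[g]) > 0}
```

Large, unexplained gaps between predicted and observed rates within a decile, or a sensitivity gap between subgroups, are the kinds of signals that should feed the governance committee review described later in the chapter.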
How to Monitor:
Automated Dashboards: - Real-time or daily updates on key metrics - Alert when metrics fall below thresholds - Drill-down capability for root cause analysis
Periodic Audits: - Sample cases for detailed review - Compare AI output to ground truth - Identify systematic errors
Prospective Studies: - Randomized trials or cohort studies evaluating AI impact - Gold standard but resource-intensive
Adverse Event Reporting
What Counts as AI Adverse Event: - Incorrect AI output leading to patient harm (delayed diagnosis, wrong treatment) - AI system failure preventing timely care - Alert fatigue causing true positive to be ignored - Workflow disruption from AI integration - Privacy breach from AI system
Reporting Mechanisms:
Internal Reporting: - Easy-to-use reporting system for clinicians - Non-punitive culture (just culture, not blame culture) - Rapid response to reports - Feedback to reporters on outcomes
FDA Reporting (Medical Device Reporting, MDR):
- Required for manufacturers and user facilities
- Report if the AI device:
  - Caused or contributed to a death or serious injury
  - Malfunctioned and would likely cause harm if the malfunction recurred
- Timelines: deaths (manufacturer: 30 days; user facility: 10 work days); serious injuries (manufacturer: 30 days; user facility: 10 work days to the manufacturer, plus an annual summary to FDA)
Institutional Quality/Safety Reporting: - Incorporate AI into existing safety event reporting - Root cause analysis (RCA) for serious AI-related events - Failure mode analysis to prevent recurrence
Learning from Events: - Share lessons across institutions (de-identified case reports) - National registries for AI adverse events (emerging) - Vendor accountability (require vendors to address identified failures)
Building a Safety Culture for AI
Technology alone doesn’t ensure safety. Organizational culture matters.
Core Principles
1. Physician Oversight is Non-Negotiable: - AI assists, humans decide (especially for high-stakes decisions) - Physicians retain ultimate authority and accountability - Can’t delegate responsibility to algorithms
2. Transparency About Limitations: - Honest communication about what AI can and can’t do - Don’t oversell AI capabilities to staff or patients - Acknowledge uncertainty
3. Maintain Independent Clinical Skills: - Physicians must retain diagnostic capability independent of AI - Regular practice without AI assistance to prevent de-skilling - Competency assessment should include AI-free performance - Skills maintenance is a safety requirement, not optional
4. Just Culture: - Encourage error reporting without blame - Focus on system improvement, not individual fault - Psychological safety for raising concerns about AI - Applies the IOM principle: most errors result from faulty systems, not faulty people (Kohn et al., 2000)
5. Continuous Learning: - Every failure is a learning opportunity - Regular review of AI performance and incidents - Update protocols based on lessons learned
6. Patient-Centered: - Safety trumps efficiency or cost - Patient welfare always first priority - Equitable AI performance across patient populations
Organizational Safeguards
AI Governance Committee: - Multidisciplinary: clinicians, informatics, quality/safety, ethics, legal - Reviews AI before deployment (safety assessment, FMEA) - Monitors AI performance and adverse events - Authority to pause or decommission AI if safety concerns
Training and Education: - Educate clinicians about AI capabilities and limitations - Training on automation bias, anchoring effects, and appropriate AI use - Competency assessment before independent use (including AI-free diagnostic capability) - Regular skills maintenance: periodic practice without AI assistance - Case-based training showing examples of AI-induced anchoring and appropriate AI overrides
Standard Operating Procedures: - Document clinical protocols for AI use - Escalation procedures for AI failures or uncertain cases - Criteria for overriding AI recommendations
Audit and Feedback: - Regular audits of AI-assisted cases - Feedback to clinicians on performance - Identify and address misuse or over-reliance
Case Studies: Learning from AI Safety Failures
Case Study 1: Epic Sepsis Model
Background: - Sepsis prediction model widely deployed across U.S. hospitals - Promised early sepsis detection to improve outcomes - Retrospective studies showed reasonable accuracy
What Went Wrong (Wong et al., 2021): - External validation study (University of Michigan) found: - Sensitivity only 33% (missed 67% of sepsis cases) - Positive predictive value only 12% (88% of alerts were false positives) - Performance far worse than retrospective claims
Root Causes: - Dataset shift: training data from different patient populations - Retrospective validation overestimated performance (selection bias) - Integration issues: alert timing often too late - Lack of prospective validation before wide deployment
Lessons: - External validation essential (don’t trust vendor claims alone) - Retrospective accuracy does not equal prospective clinical utility - Test AI in your specific population before relying on it - Monitor real-world performance continuously
Outcome: - Many hospitals paused or discontinued use - Epic modified algorithm and validation approach - Highlighted need for transparency in AI performance claims
Case Study 2: IBM Watson for Oncology
Background: - IBM marketed Watson as AI for personalized cancer treatment - Promised evidence-based treatment recommendations - Adopted by hospitals worldwide
What Went Wrong (Ross & Swetlitz, 2018): - STAT News investigation revealed: - Unsafe and incorrect treatment recommendations - Recommendations based on limited training data (synthetic cases from single cancer center) - Never validated in prospective clinical trials - Doctors trained to use Watson in 2-day sessions (insufficient)
Examples of Unsafe Recommendations: - Recommended chemotherapy for patient with severe bleeding (contraindication) - Suggested drugs in combinations not proven safe - Treatment plans contradicting evidence-based guidelines
Root Causes: - Marketing hype exceeded actual capabilities - Insufficient clinical validation - Training data not representative (synthetic, not real patients) - Lack of physician oversight in recommendation generation
Lessons: - Demand rigorous clinical trial evidence, not just demonstrations - Marketing claims do not equal clinical validation - AI for high-stakes decisions (cancer treatment) requires highest evidence standard - Physician expertise cannot be replaced by insufficiently validated AI
Outcome: - IBM scaled back Watson Health initiatives - Many hospitals discontinued use - Cautionary tale about AI hype vs. reality
Case Study 3: Chest X-ray AI and COVID-19
Background: - Multiple AI systems developed for pneumonia detection from chest X-rays - Appeared highly accurate in retrospective studies - Deployed during COVID-19 pandemic
What Went Wrong (DeGrave et al., 2021): - Many AI systems failed on COVID-19 pneumonia: - Trained on pre-pandemic data (no COVID-19 patterns) - Learned spurious correlations (patient positioning, laterality markers, portable X-rays) - Poor generalization to novel disease
Documented Issues: - AI detected “pneumonia” based on portable vs. fixed X-ray equipment - Picked up hospital-specific artifacts, text overlays, positioning - Failed to detect actual COVID-19 pneumonia features
Root Causes: - Training data biases (sicker patients → portable X-rays) - Lack of causal reasoning (correlations mistaken for disease features) - Insufficient stress testing on out-of-distribution cases - Rapid deployment without adequate validation
Lessons: - AI doesn’t truly “understand” disease, it learns statistical patterns - Training data biases lead to spurious correlations - Test AI on out-of-distribution data before deployment - Pandemic highlighted need for robust, generalizable AI
Recommendations for Safe AI Implementation
Pre-Deployment
1. Rigorous Validation: - Prospective validation in your target population - External validation if possible - Subgroup analysis (race, age, sex, insurance, disease severity)
2. Failure Mode Analysis: - Conduct FMEA before deployment - Identify high-risk failure modes - Design mitigations and safeguards
3. Human Factors Evaluation: - Test AI in realistic clinical workflow - Assess usability, alert design, integration - Identify automation bias risks
4. Transparent Communication: - Educate clinicians about AI capabilities and limitations - Set realistic expectations - Training on appropriate use
5. Safety Protocols: - Standard operating procedures for AI use - Escalation procedures for failures or uncertain cases - Oversight and accountability structure
During Use
6. Real-World Performance Monitoring: - Continuous tracking of key metrics - Dashboards with automated alerts for performance drops - Regular reporting to governance committee
7. Adverse Event Reporting: - Easy, non-punitive reporting system - Rapid investigation and response - Sharing lessons learned
8. Physician Oversight: - AI recommendations reviewed by qualified clinicians - Physicians retain final decision authority - Can’t delegate responsibility to algorithms
9. Patient Communication: - Inform patients about AI use (tiered consent approach) - Transparency about limitations - Respect patient preferences
Ongoing
10. Regular Safety Audits: - Periodic review of AI performance and incidents - Update risk assessments and mitigations - Assess for performance drift
11. Revalidation: - Scheduled revalidation (e.g., annually) - After major clinical practice changes - When patient population characteristics shift
12. Continuous Improvement: - Learn from failures and near-misses - Update AI, protocols, or training based on lessons - Stay current with evolving best practices
13. Decommissioning: - Willingness to pause or stop AI if safety concerns - Clear criteria for decommissioning - Patient safety > sunk costs
Conclusion
Medical AI safety is not an afterthought. It’s a fundamental requirement. The promise of AI to improve diagnosis, personalize treatment, and reduce errors can only be realized if AI systems are rigorously validated, thoughtfully integrated, continuously monitored, and honestly communicated (Kelly et al., 2019; Topol, 2019).
The history of medical AI includes both successes (IDx-DR improving diabetic retinopathy screening access) and failures (Epic sepsis model, IBM Watson). The difference lies not in the sophistication of the algorithms, but in the rigor of validation, honesty about limitations, and commitment to ongoing safety monitoring.
Core Safety Principles:
- Retrospective accuracy does not equal real-world safety: demand prospective validation
- External validation is essential: don’t trust vendor claims alone
- Monitor continuously: performance drifts over time
- Report failures transparently: learning requires honesty
- Physician oversight is non-negotiable: AI assists, humans decide
- Build a safety culture: just culture, transparency, continuous improvement
- Put patients first: safety trumps efficiency or profit
AI has the potential to improve patient care dramatically. But that potential can only be realized if safety is treated as seriously as innovation. First, do no harm, for algorithms as for all medical interventions.