4  AI Fundamentals for Clinicians

Tip: Learning Objectives

This chapter provides physician-friendly explanations of AI concepts without assuming technical background. You will learn to:

  • Understand the difference between AI, machine learning, and deep learning
  • Recognize the major types of machine learning relevant to medicine
  • Grasp how neural networks learn from medical data
  • Evaluate claims about AI performance using clinical metrics
  • Identify when AI is appropriate (and when it isn’t) for clinical problems
  • Understand the limitations and failure modes of medical AI systems

No mathematics, programming, or computer science background required.

The Clinical Context: Physicians encounter AI terminology constantly—machine learning, deep learning, neural networks, supervised learning, natural language processing—without clear explanations of what these terms mean or why they matter for clinical practice. This chapter translates AI jargon into clinical concepts.

Key Definitions (Physician-Friendly):

  • Artificial Intelligence (AI): Computer systems performing tasks typically requiring human intelligence (diagnosis, pattern recognition, decision-making, language understanding)

  • Machine Learning (ML): AI systems that learn from data rather than following explicit rules. Think: Algorithm learns from 10,000 chest X-rays labeled “pneumonia” or “normal” instead of being programmed with rules about infiltrates

  • Deep Learning (DL): Machine learning using artificial neural networks with many layers. Particularly good at analyzing images, text, and complex patterns. Most modern medical imaging AI uses deep learning

  • Supervised Learning: Algorithm learns from labeled examples (X-rays labeled by radiologists, pathology slides labeled by pathologists). Most medical AI is supervised learning

  • Unsupervised Learning: Algorithm finds patterns in unlabeled data (clustering similar patient types, identifying disease subtypes). Less common in clinical applications

The Key Insight for Physicians:

Modern medical AI doesn’t follow rules written by programmers. Instead, it learns patterns from large datasets. This brings both power (can find patterns humans miss) and problems (can learn biases, can’t explain reasoning, fails on cases different from training data).

What Medical AI Can Do Well:

  ✅ Pattern recognition in images: Detecting diabetic retinopathy, identifying lung nodules, classifying skin lesions
  ✅ Structured prediction: Predicting sepsis risk, estimating mortality, forecasting disease progression
  ✅ Information extraction: Pulling structured data from clinical notes, identifying adverse events from EHRs
  ✅ Language tasks: Summarizing literature, translating medical text, generating patient education materials

What Medical AI Cannot Do (Yet):

  ❌ General medical reasoning: Can’t replicate broad clinical judgment across diverse scenarios
  ❌ Handling truly novel cases: Struggles with presentations very different from training data
  ❌ Explaining its reasoning: Black-box models can’t articulate why they made a prediction
  ❌ Incorporating patient preferences: Doesn’t understand values, goals, cultural contexts
  ❌ Taking responsibility: Algorithms don’t face medical boards or malpractice suits

Critical AI Performance Metrics (Clinical Translation):

  • Sensitivity (Recall): % of actual positives correctly identified. High sensitivity = few false negatives. Matters when missing a case is dangerous (e.g., cancer screening)

  • Specificity: % of actual negatives correctly identified. High specificity = few false positives. Matters when false alarms cause harm or unnecessary workups

  • Positive Predictive Value (PPV): If AI says “positive,” what’s the probability it’s actually positive? Depends on disease prevalence. A test with 95% sensitivity and 95% specificity has only 16% PPV if disease prevalence is 1%

  • AUC-ROC: Overall discrimination ability (range 0.5-1.0). Useful for comparing algorithms but doesn’t tell you clinical utility at specific thresholds

  • Calibration: Do predicted probabilities match observed frequencies? An AI saying “70% probability of sepsis” should be right 70% of the time

⚠️ Warning: High accuracy/AUC in retrospective studies often doesn’t translate to real-world clinical benefit. Demand prospective validation.

Common AI Failure Modes:

Distribution Shift: Algorithm trained on Hospital A’s data fails at Hospital B due to different patient demographics, imaging equipment, clinical documentation practices (Beam, Manrai, and Ghassemi 2020)

Overfitting: Algorithm memorizes training data instead of learning generalizable patterns. Performs brilliantly on training set, poorly on new patients

Confounding: Algorithm learns spurious correlations. Example: COVID-19 chest X-ray AI that actually detected the word “portable” (sicker patients get portable X-rays) instead of lung findings (DeGrave, Janizek, and Lee 2021)

Adversarial Examples: Tiny, imperceptible changes to inputs fool AI completely—a patient safety concern (Finlayson et al. 2019)

Bias Amplification: If training data under-represents certain populations, AI performance will be worse for those groups (Obermeyer et al. 2019)

The Clinical Bottom Line:

AI is powerful pattern recognition, not artificial general intelligence. It augments physician capabilities but doesn’t replicate clinical judgment. Always maintain human oversight. Understand its limitations. Question vendor claims. Demand prospective validation in YOUR clinical context, not just impressive metrics from somewhere else.

Think of AI as a very sophisticated, very fast, but inflexible medical student: excellent at tasks it’s been trained on, completely lost when encountering something new.

4.1 Introduction

Every medical AI system—from diabetic retinopathy screening to sepsis prediction to radiology CAD—operates using machine learning algorithms that learn patterns from data. Understanding how these systems work, what they can and cannot do, and how to interpret their outputs is increasingly essential for clinical practice.

This chapter explains AI fundamentals specifically for physicians. No programming background is needed, and the mathematics is limited to a few simple formulas and worked examples. The goal is the conceptual foundations you need to evaluate AI tools critically and integrate them safely into clinical workflows.

The Central Concept:

Traditional medical software follows explicit rules programmed by humans:

IF temperature > 38°C AND WBC > 12,000 AND systolic BP < 90
THEN flag for sepsis evaluation

Machine learning systems instead learn patterns from data:

Show algorithm 50,000 patient cases labeled "developed sepsis" or "no sepsis"
Algorithm identifies patterns (subtle vital sign trends, lab trajectories, timing relationships)
Apply learned patterns to predict sepsis risk in new patients

This fundamental difference explains both AI’s power (finds patterns humans miss) and its problems (learns biases, can’t explain, fails on novel cases).
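
To make the contrast concrete, here is a minimal, optional sketch in Python (using scikit-learn and entirely synthetic data; the feature names and thresholds are hypothetical, not a validated sepsis model). The rule-based function encodes explicit criteria; the learned model is shown only labeled examples.

# Illustrative only: contrasting a hand-coded rule with a learned model.
# All feature names, thresholds, and data below are hypothetical/synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

def rule_based_sepsis_flag(temp_c, wbc, sbp):
    """Traditional software: explicit rules written by humans."""
    return temp_c > 38.0 and wbc > 12_000 and sbp < 90

# Machine learning: the model is never given rules, only labeled examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 3))                       # stand-ins for temperature, WBC, SBP
y = (X @ np.array([1.5, 1.0, -2.0]) + rng.normal(size=5_000)) > 1.0   # synthetic labels
model = LogisticRegression().fit(X, y)                # "training": weights fitted to the labels
new_patient = rng.normal(size=(1, 3))
print(model.predict_proba(new_patient)[:, 1])         # learned risk estimate for an unseen case

The point is not the specific library but the workflow: labeled examples in, learned weights out, predictions on new cases.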

4.2 AI, Machine Learning, and Deep Learning: A Hierarchy

These terms are often used interchangeably but have distinct meanings:

flowchart TD
    A[Artificial Intelligence<br/>Broadest category: machines performing intelligent tasks] --> B[Machine Learning<br/>Systems that learn from data]
    B --> C[Deep Learning<br/>Neural networks with many layers]

    style A fill:#fee2e2,stroke:#dc2626,stroke-width:2px
    style B fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
    style C fill:#dbeafe,stroke:#2563eb,stroke-width:2px
Figure 4.1: The relationship between Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). AI is the broadest category encompassing all computer systems performing intelligent tasks. Machine Learning is a subset using algorithms that learn from data. Deep Learning is a further subset using multi-layer neural networks, particularly effective for medical imaging and language tasks.

4.2.1 Artificial Intelligence (AI)

The broadest category: Any computer system performing tasks that typically require human intelligence.

Medical examples:

  • Expert systems like MYCIN (rule-based, 1970s)
  • Machine learning algorithms (data-driven, 1990s-present)
  • Natural language processing for clinical notes
  • Robotic surgery systems
  • Clinical decision support systems

Key point: Not all AI is machine learning. MYCIN used hand-coded rules, not learning from data.

4.2.2 Machine Learning (ML)

Narrower category: Algorithms that learn patterns from data rather than following explicit rules.

Medical examples:

  • Predicting which patients will develop sepsis based on EHR data
  • Classifying skin lesions as benign or malignant from images
  • Extracting structured information from clinical notes
  • Forecasting disease progression from longitudinal data

Why ML matters for medicine: Medical data is too complex and nuanced for rule-based systems. ML can find subtle patterns across thousands of variables that no human could code explicitly.

4.2.3 Deep Learning (DL)

Narrowest category: Machine learning using artificial neural networks with many layers (hence “deep”).

Medical examples:

  • Detecting diabetic retinopathy from retinal fundus photographs (Gulshan et al. 2016)
  • Identifying pneumonia on chest X-rays (Rajpurkar et al. 2017)
  • Analyzing pathology slides for cancer detection (Nagpal et al. 2019)
  • Generating radiology reports from images

Why DL revolutionized medical imaging: It can learn complex features directly from raw pixels without manual feature engineering, which took computer vision from mediocre to expert-level performance on many medical imaging tasks.

The clinical reality: Most modern medical AI uses deep learning, particularly for imaging tasks. Understanding “deep learning” means understanding most clinical AI systems you’ll encounter.

4.3 How Machine Learning Works: A Clinical Analogy

Think about how you learned to diagnose pneumonia:

  1. Training: You saw hundreds of chest X-rays during residency, with attending physicians pointing out infiltrates, consolidations, effusions
  2. Pattern recognition: Your brain learned to recognize visual patterns associated with pneumonia
  3. Generalization: You can now diagnose pneumonia in new patients you’ve never seen
  4. Refinement: You get better with experience, especially on edge cases

Machine learning works similarly:

  1. Training Data: Algorithm sees thousands of chest X-rays labeled “pneumonia” or “normal” by radiologists
  2. Pattern Learning: Algorithm learns visual features associated with pneumonia (infiltrates, consolidation patterns, location preferences)
  3. Prediction: Algorithm can classify new chest X-rays it’s never seen
  4. Optimization: Algorithm improves by adjusting how much weight it gives different features

The key difference: Physicians can explain their reasoning (“I see right lower lobe consolidation with air bronchograms”). Deep learning algorithms can’t—they’re black boxes (Rudin 2019).

4.4 Types of Machine Learning Relevant to Medicine

4.4.1 Supervised Learning (Most Common in Medicine)

What it is: Algorithm learns from labeled examples.

Medical applications:

  • Classification: Diagnosis tasks with categorical outputs (benign vs. malignant, pneumonia vs. normal, sepsis vs. no sepsis)
  • Regression: Predicting continuous outcomes (estimated survival time, predicted blood pressure, risk scores 0-100%)

Requirements:

  • Large dataset of labeled examples (thousands to millions)
  • High-quality labels (accurate diagnoses from expert physicians)
  • Labeled examples representing the diversity of real clinical cases

Limitations:

  • Label quality matters enormously: If training labels are inaccurate or biased, algorithm learns inaccurate/biased patterns
  • Can’t generalize beyond training distribution: If algorithm never saw pediatric cases during training, will perform poorly on children
  • Expensive: Labeling requires physician time

Example—Diabetic Retinopathy AI:

  • Training data: 128,000 retinal images labeled by ophthalmologists as “no DR,” “mild DR,” “moderate DR,” “severe DR,” “proliferative DR” (Gulshan et al. 2016)
  • Learning: Algorithm identifies patterns (microaneurysms, hemorrhages, exudates, neovascularization) associated with each severity level
  • Deployment: Can classify new retinal images into appropriate categories
  • Performance: Sensitivity 97.5%, specificity 93.4% for detecting referable diabetic retinopathy

4.4.2 Unsupervised Learning (Less Common Clinically)

What it is: Algorithm finds patterns in unlabeled data.

Medical applications:

  • Clustering: Identifying patient subgroups with similar characteristics (disease phenotypes)
  • Dimensionality reduction: Simplifying complex multi-omic data for interpretation
  • Anomaly detection: Flagging unusual patients who don’t fit typical patterns

Limitations:

  • Hard to validate (no ground truth labels)
  • Clinical utility often unclear
  • Requires careful interpretation

Example—Disease Phenotyping:

Unsupervised clustering of EHR data might identify distinct COPD subtypes based on symptoms, comorbidities, medication responses—potentially more clinically meaningful than traditional classifications.
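
As an optional illustration of the idea, the sketch below (assuming Python with scikit-learn, and purely synthetic stand-in features) clusters "patients" into groups without ever seeing an outcome label.

# Illustrative only: unsupervised clustering of synthetic "patient" features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
patients = rng.normal(size=(1_000, 4))   # hypothetical stand-ins (e.g., FEV1, exacerbations/yr, eosinophils, BMI)
features = StandardScaler().fit_transform(patients)              # put features on a comparable scale
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(clusters))             # patients per cluster, assigned without any outcome labels

Whether such clusters correspond to clinically meaningful phenotypes still has to be established separately, which is exactly the validation difficulty noted above.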

4.4.3 Reinforcement Learning (Emerging in Medicine)

What it is: Algorithm learns by trial-and-error, receiving rewards for good actions and penalties for bad ones.

Medical applications (mostly research):

  • Optimizing treatment strategies for chronic diseases
  • Personalizing chemotherapy dosing
  • Controlling mechanical ventilation

Limitations:

  • Can’t learn on real patients (too dangerous)
  • Requires high-quality simulations
  • Validation challenges

Why it matters: Promising for personalized treatment optimization, but currently mostly theoretical.

4.5 Neural Networks: The Engine Behind Deep Learning

4.5.1 What is a Neural Network?

A neural network is a machine learning model loosely inspired by biological neurons, consisting of layers of interconnected nodes that transform inputs into outputs.

Clinical analogy: Think of diagnostic reasoning as information flow:

Patient symptoms → Your brain processes information → Differential diagnosis

Neural networks work similarly:

Input data → Hidden layers process information → Output prediction

The “deep” in deep learning: Multiple hidden layers allow learning complex, hierarchical patterns.

flowchart LR
    subgraph Input["Input Layer"]
        I1[Symptom 1]
        I2[Lab Value 1]
        I3[Imaging Feature 1]
        I4[Vital Sign 1]
    end

    subgraph Hidden["Hidden Layers<br/>(Process Information)"]
        H1[Node]
        H2[Node]
        H3[Node]
        H4[Node]
        H5[Node]
        H6[Node]
    end

    subgraph Output["Output Layer"]
        O1[Disease A Probability]
        O2[Disease B Probability]
        O3[No Disease Probability]
    end

    I1 --> H1 & H2 & H3
    I2 --> H1 & H2 & H3 & H4
    I3 --> H3 & H4 & H5 & H6
    I4 --> H4 & H5 & H6

    H1 & H2 & H3 --> O1 & O2 & O3
    H4 & H5 & H6 --> O1 & O2 & O3

    style Input fill:#dbeafe,stroke:#2563eb
    style Hidden fill:#fef3c7,stroke:#f59e0b
    style Output fill:#d1fae5,stroke:#10b981
Figure 4.2: Simplified representation of a neural network for medical diagnosis. Input layer receives patient data (symptoms, labs, imaging features). Hidden layers process information through mathematical transformations. Output layer produces predictions (disease probability, risk score, diagnosis). Each connection has a weight learned during training.

4.5.2 How Neural Networks Learn

Training process:

  1. Initialize: Start with random weights (connections between nodes)
  2. Forward pass: Input data flows through network, producing prediction
  3. Calculate error: Compare prediction to actual label (ground truth)
  4. Backward pass: Adjust weights to reduce error (backpropagation)
  5. Repeat: Iterate through thousands/millions of examples until error minimizes

Clinical translation:

Like a radiology resident reviewing cases with an attending:

  • Resident makes diagnosis (forward pass)
  • Attending provides correct answer (ground truth label)
  • Resident learns from mistakes (backpropagation)
  • Resident improves with practice (iterative learning)

Key difference: Neural networks can process millions of examples far faster than humans, finding subtle patterns across vast datasets that individual physicians couldn’t detect.
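
For readers who want to see the loop itself, here is a minimal, optional sketch (assuming PyTorch, with random stand-in data) of the four training steps: forward pass, error calculation, backpropagation, and weight update.

# Illustrative only: a tiny training loop over synthetic data (assumes PyTorch).
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(2_000, 10)                       # stand-in for 10 input features per patient
y = (X[:, 0] + 0.5 * X[:, 1] > 0).float()        # synthetic "ground truth" labels
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for epoch in range(100):
    logits = model(X).squeeze(1)      # 1. forward pass: produce predictions
    loss = loss_fn(logits, y)         # 2. calculate error against the labels
    optimizer.zero_grad()
    loss.backward()                   # 3. backward pass: compute weight adjustments
    optimizer.step()                  # 4. update weights, then repeat
print(float(loss))                    # the error shrinks as the weights are tuned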

4.5.3 Convolutional Neural Networks (CNNs): For Medical Imaging

Special architecture for images:

  • Convolutional layers: Detect local patterns (edges, textures, shapes) before combining them into complex features
  • Hierarchical learning: Early layers learn simple features (edges), deep layers learn complex features (anatomical structures)
  • Translation invariance: Can detect findings regardless of location in image

Why CNNs revolutionized medical imaging:

Traditional computer vision required manually designing features (“look for round objects, 2-5 mm in size, with density above a threshold”). CNNs learn features automatically from raw pixels, discovering patterns human programmers wouldn’t have designed.

Medical applications:

  • Chest X-ray interpretation
  • CT/MRI analysis
  • Pathology slide review
  • Retinal imaging
  • Dermatology photo classification
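
For the curious, a minimal, optional sketch (assuming PyTorch) of a tiny CNN illustrates the hierarchy described above: early convolutional layers detect local patterns, pooling aggregates them, and a final layer produces a single prediction. Real medical imaging models are far larger but follow the same structure.

# Illustrative only: a tiny convolutional network for one grayscale image (assumes PyTorch).
import torch
import torch.nn as nn

tiny_cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),    # early layer: local edges and textures
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),   # deeper layer: combinations of simple features
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 1),                             # output: a single logit ("finding present?")
)
fake_xray = torch.randn(1, 1, 224, 224)           # stand-in for one 224x224 grayscale image
print(tiny_cnn(fake_xray).shape)                  # torch.Size([1, 1])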

4.5.4 Recurrent Neural Networks (RNNs): For Sequential Data

Special architecture for time-series data:

  • Memory: Maintains information about previous inputs
  • Sequential processing: Processes data in order (like reading a clinical note sentence by sentence)

Medical applications:

  • Analyzing EHR data over time (predicting deterioration from vital sign trends)
  • Processing clinical notes (understanding context and relationships)
  • Forecasting disease progression

Modern variant—Transformers:

Transformers power large language models like GPT-4 and Med-PaLM. They capture long-range dependencies better than RNNs and can be trained in parallel, which is why they have largely replaced RNNs for language tasks.

4.6 Evaluating AI Performance: Clinical Metrics

Vendors tout impressive performance metrics. How do you evaluate them critically?

4.6.1 Confusion Matrix: The Foundation

Every binary classification task (disease vs. no disease) can be summarized:

                     Predicted Positive       Predicted Negative
Actually Positive    True Positive (TP)       False Negative (FN)
Actually Negative    False Positive (FP)      True Negative (TN)

From this come all common metrics:
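
As an optional illustration, the sketch below (plain Python, with made-up counts for a low-prevalence screening scenario) derives each metric directly from the four cells above.

# Illustrative only: deriving the metrics from hypothetical confusion-matrix counts.
tp, fn, fp, tn = 90, 10, 495, 9405    # made-up counts: 10,000 patients, 1% prevalence
sensitivity = tp / (tp + fn)                      # of true cases, how many were caught
specificity = tn / (tn + fp)                      # of true negatives, how many were cleared
ppv = tp / (tp + fp)                              # if flagged positive, how often correct
npv = tn / (tn + fn)                              # if flagged negative, how often correct
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"sens={sensitivity:.2f} spec={specificity:.2f} ppv={ppv:.2f} "
      f"npv={npv:.2f} acc={accuracy:.2f}")

Even with roughly 90% sensitivity and 95% specificity in these made-up counts, only about 15% of positive flags are true positives at 1% prevalence.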

4.6.2 Sensitivity (Recall, True Positive Rate)

\[\text{Sensitivity} = \frac{TP}{TP + FN}\]

Clinical meaning: Of all actual cases of disease, what % did AI detect?

When it matters most: Screening (cancer detection), rule-out tests, situations where missing a case is dangerous

Example: Sepsis prediction with 85% sensitivity misses 15% of patients who develop sepsis

4.6.3 Specificity (True Negative Rate)

\[\text{Specificity} = \frac{TN}{TN + FP}\]

Clinical meaning: Of all patients without disease, what % did AI correctly identify as negative?

When it matters most: Avoiding false alarms, situations where false positives cause harm (unnecessary biopsies, psychological distress, treatment side effects)

Example: Sepsis prediction with 70% specificity incorrectly flags 30% of patients who don’t have sepsis

4.6.4 Positive Predictive Value (PPV, Precision)

\[\text{PPV} = \frac{TP}{TP + FP}\]

Clinical meaning: If AI says “positive,” what’s the probability it’s actually correct?

Why it matters enormously: PPV depends on disease prevalence. A test with excellent sensitivity/specificity can have poor PPV if disease is rare.

Example:

  • Disease prevalence: 1%
  • Sensitivity: 95%
  • Specificity: 95%
  • PPV: Only 16%! (84% of “positive” predictions are false alarms)
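
The 16% figure follows directly from Bayes’ rule; plugging the same numbers into the prevalence-adjusted form of PPV:

\[\text{PPV} = \frac{\text{sens} \times \text{prev}}{\text{sens} \times \text{prev} + (1-\text{spec}) \times (1-\text{prev})} = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99} = \frac{0.0095}{0.0590} \approx 0.16\]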

⚠️ Critical for physicians: Always ask “What’s the PPV in MY patient population?” not just “What’s the accuracy?”

4.6.5 Negative Predictive Value (NPV)

\[\text{NPV} = \frac{TN}{TN + FN}\]

Clinical meaning: If AI says “negative,” what’s the probability it’s actually correct?

4.6.6 Accuracy

\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]

Clinical meaning: Overall, what % of predictions are correct?

⚠️ Warning: Misleading for imbalanced datasets. An AI that always predicts “no cancer” achieves 99% accuracy if cancer prevalence is 1%, but is clinically useless.
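
A quick worked example makes the trap explicit. For 1,000 patients with 1% cancer prevalence and a model that always predicts “no cancer”:

\[\text{Accuracy} = \frac{0 + 990}{0 + 990 + 0 + 10} = 0.99 \qquad \text{but} \qquad \text{Sensitivity} = \frac{0}{0 + 10} = 0\]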

4.6.7 AUC-ROC (Area Under Receiver Operating Characteristic Curve)

What it measures: Overall discrimination ability across all possible thresholds

Range: 0.5 (no better than chance) to 1.0 (perfect discrimination)

Interpretation:

  • 0.9-1.0: Excellent
  • 0.8-0.9: Good
  • 0.7-0.8: Fair
  • 0.6-0.7: Poor
  • 0.5-0.6: Fail

Limitations:

  • Doesn’t tell you performance at specific clinical thresholds
  • Can be high even when PPV is poor at relevant prevalence
  • Doesn’t capture calibration (are probabilities accurate?)

4.6.8 Calibration

What it measures: Do predicted probabilities match observed frequencies?

Example: If AI predicts “30% risk of mortality” for 1000 patients, do ~300 actually die?

Why it matters: Poorly calibrated models produce misleading probabilities, making clinical decision-making difficult.

How to assess: Calibration plots comparing predicted vs. observed outcomes
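
As an optional sketch of what such an assessment looks like in code (plain Python with NumPy, using synthetic predictions and outcomes), bin patients by predicted probability and compare each bin’s observed event rate:

# Illustrative only: a simple calibration check on synthetic predictions and outcomes.
import numpy as np

rng = np.random.default_rng(1)
predicted = rng.uniform(0, 1, size=10_000)                # the model's predicted probabilities
observed = rng.uniform(0, 1, size=10_000) < predicted     # outcomes generated to be well calibrated
bins = np.linspace(0, 1, 11)                              # ten bins: 0-10%, 10-20%, ...
for lo, hi in zip(bins[:-1], bins[1:]):
    in_bin = (predicted >= lo) & (predicted < hi)
    if in_bin.any():
        print(f"predicted {lo:.0%}-{hi:.0%}: observed rate {observed[in_bin].mean():.0%}")
# A well-calibrated model shows observed rates close to the predicted range in every bin.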

Important: What Physicians Should Demand from AI Vendors
  1. Prospective validation in settings similar to yours (not just retrospective)
  2. Performance metrics relevant to your population (PPV at your disease prevalence)
  3. Subgroup analysis: Does performance differ by age, sex, race, insurance status?
  4. Calibration assessment: Are probabilities accurate?
  5. Failure mode documentation: When and how does the system fail?
  6. Independent validation: Published in peer-reviewed journals, not just vendor whitepapers

4.7 When to Use AI (and When Not To)

4.7.1 AI Works Well When:

Well-defined task with clear output: Binary classification, risk score, structured prediction

Large, high-quality labeled dataset available: Thousands to millions of examples with expert labels

Pattern recognition from complex data: Images, longitudinal data, multi-dimensional inputs

Objective ground truth exists: Pathology biopsy confirms diagnosis, outcomes can be verified

Task is repetitive and time-consuming: Screening, triage, routine interpretation

Human performance has known limitations: Fatigue, variability, rare findings

Examples:

  • Diabetic retinopathy screening from retinal photos
  • Detecting pneumothorax on chest X-rays
  • Identifying metastases in pathology slides
  • Predicting no-show appointments
  • Extracting structured data from clinical notes

4.7.2 AI Struggles When:

Task requires general medical reasoning: Synthesizing information across domains, considering patient preferences, navigating uncertainty

Training data is limited or biased: Rare diseases, underrepresented populations, novel clinical scenarios

Ground truth is subjective or uncertain: Ambiguous diagnoses, prognosis depends on unmeasured factors

Explanation is essential: Medical-legal situations, teaching, situations requiring physician buy-in

Stakes are too high for any errors: Life-or-death decisions without human oversight

Examples:

  • Diagnosing complex multi-system diseases
  • Navigating goals-of-care discussions
  • Handling completely novel presentations
  • Replacing physician judgment entirely

4.8 Common Failure Modes and Limitations

4.8.1 Distribution Shift (External Validity)

The problem: AI trained on Hospital A’s data fails at Hospital B

Why: Different patient demographics, imaging equipment, clinical documentation, disease prevalence, treatment protocols

Example: Chest X-ray AI trained on data from academic medical centers performed poorly at community hospitals due to different patient populations and equipment (Zech et al. 2018)

What physicians should do: Demand validation in YOUR clinical context, not just impressive numbers from elsewhere

4.8.2 Overfitting (Memorization vs. Learning)

The problem: Algorithm memorizes training data instead of learning generalizable patterns

Why: Too complex model + too little data = memorization

How to detect: Excellent performance on training data, poor performance on new data

What physicians should do: Ask about validation strategy (hold-out test sets, cross-validation, external validation)
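
A minimal, optional sketch of that validation strategy (Python with scikit-learn, deliberately constructed so the labels are pure noise) shows what overfitting looks like: near-perfect training performance and chance-level performance on held-out data.

# Illustrative only: the train/held-out gap that signals overfitting (assumes scikit-learn).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 50))                  # few patients, many features: easy to memorize
y = rng.integers(0, 2, size=300)                # labels here are pure noise
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)        # unconstrained tree memorizes the noise
print("training accuracy:", model.score(X_train, y_train))    # ~1.0 (memorized)
print("held-out accuracy:", model.score(X_test, y_test))      # ~0.5 (no real signal was learned)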

4.8.3 Confounding and Spurious Correlations

The problem: Algorithm learns correlations that don’t represent causal relationships

Famous example: COVID-19 chest X-ray AI that learned to detect the word “portable” in image metadata (sicker patients get portable X-rays) rather than actual lung findings (DeGrave, Janizek, and Lee 2021)

What physicians should do: Question HOW the AI makes predictions, not just WHETHER it’s accurate. Ask about confounding analyses.

4.8.4 Adversarial Attacks

The problem: Tiny, imperceptible changes to inputs can completely fool AI

Example: Adding noise invisible to human eyes can make AI misclassify malignant lesions as benign (Finlayson et al. 2019)

Clinical implications: Potential safety and security risks, especially for critical diagnoses

4.8.5 Algorithmic Bias

The problem: If training data under-represents certain populations, AI performs worse for those groups

Famous example: Commercial algorithm for predicting healthcare needs systematically under-estimated risk for Black patients, giving them lower priority for care coordination programs (Obermeyer et al. 2019)

Why it happens: Historical inequities → biased data → biased algorithms → perpetuate inequities

What physicians should do: Demand subgroup analyses by race, sex, age, insurance status. Question fairness metrics.
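
An optional sketch of what a basic subgroup analysis looks like (plain Python with NumPy, synthetic data constructed so the model is deliberately less sensitive in the under-represented group):

# Illustrative only: checking sensitivity separately in each subgroup (synthetic data).
import numpy as np

rng = np.random.default_rng(3)
group = rng.choice(["A", "B"], size=5_000, p=[0.9, 0.1])   # group B is under-represented
truth = rng.integers(0, 2, size=5_000).astype(bool)
# Hypothetical predictions, constructed to be less accurate in the under-represented group:
correct = rng.uniform(size=5_000) < np.where(group == "A", 0.90, 0.70)
pred = np.where(correct, truth, ~truth)
for g in ["A", "B"]:
    positives = (group == g) & truth
    print(f"group {g}: sensitivity {pred[positives].mean():.2f} (n positives = {positives.sum()})")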

4.9 The Black-Box Problem

Most modern deep learning systems cannot explain their reasoning in clinically meaningful ways.

Why it matters:

  • Medical-legal: How do you defend a decision you can’t explain?
  • Trust: Physicians resist recommendations they don’t understand
  • Safety: Can’t identify failure modes if you can’t see reasoning
  • Learning: AI can’t teach the next generation if it can’t articulate reasoning

Approaches to explainability:

  • Saliency maps: Highlight image regions influencing prediction (but often don’t match clinical reasoning)
  • Attention mechanisms: Show which words/features the model focused on
  • LIME/SHAP: Explain individual predictions (but computationally expensive, not always accurate)

Current reality: Deep learning explainability remains an active research area. Clinical AI is largely black-box.

What physicians should do: Maintain human oversight. Don’t blindly follow recommendations you can’t understand or verify.

4.10 Large Language Models: A Special Case

What makes LLMs different:

  • Trained on massive text corpora (internet, books, journals)
  • Can perform diverse tasks without task-specific training (few-shot learning)
  • Generate human-like text (including medical documentation, patient education, literature summaries)

Medical applications:

  • Clinical documentation assistance
  • Literature synthesis
  • Patient question answering
  • Medical education
  • Clinical reasoning support (with caveats)

Critical limitations:

  • Hallucinations: Confidently generate plausible but incorrect information
  • No access to real-time data: Can’t check current patient status, recent lab results
  • No responsibility: Can’t be held accountable for errors
  • Privacy concerns: Sending patient data to external APIs

Detailed coverage: See Chapter 23: Large Language Models in Clinical Practice

4.11 Key Takeaways for Physicians

Important: Essential Concepts
  1. AI learns from data, doesn’t follow explicit rules: Power (finds hidden patterns) + Problems (learns biases, can’t explain)

  2. Most medical AI is supervised deep learning: Requires large labeled datasets, works best for pattern recognition tasks

  3. Performance metrics are nuanced: Accuracy/AUC alone don’t tell you clinical utility. Ask about PPV in YOUR population, calibration, subgroup performance

  4. Black boxes require trust but limit understanding: Maintain human oversight, don’t blindly follow unexplainable recommendations

  5. Distribution shift is universal: AI trained elsewhere often fails in your context. Demand local validation

  6. AI augments, doesn’t replace: Think “AI-assisted physician” not “physician-less AI”

  7. Bias is pervasive: If training data reflects healthcare inequities, AI perpetuates them

  8. Prospective validation is essential: Retrospective accuracy doesn’t guarantee prospective utility

The bottom line:

You don’t need to build neural networks to evaluate medical AI critically. You need to understand:

  • What AI can and cannot do
  • How to interpret performance metrics
  • What questions to ask vendors
  • What failure modes to watch for
  • When human oversight is essential

With these foundations, you’re prepared to evaluate AI tools for your specialty—covered in Part II.


4.12 References