AI Fundamentals for Clinicians

Vendors claim their AI has 95% accuracy. Your hospital administrator wants to deploy sepsis prediction algorithms. Radiology groups promote AI-assisted interpretation. To evaluate these claims, you need to understand what machine learning actually does, how neural networks learn from data, and why impressive metrics in research papers often fail in clinical practice.

Learning Objectives

After reading this chapter, you will be able to:

  • Understand the difference between AI, machine learning, and deep learning
  • Recognize the major types of machine learning relevant to medicine
  • Grasp how neural networks learn from medical data
  • Evaluate claims about AI performance using clinical metrics
  • Identify when AI is appropriate (and when it isn’t) for clinical problems
  • Understand the limitations and failure modes of medical AI systems

The Clinical Context: Physicians encounter AI terminology constantly (machine learning, deep learning, neural networks, supervised learning, natural language processing) without clear explanations of what these terms mean or why they matter for clinical practice. This chapter translates AI jargon into clinical concepts you already understand from evidence-based medicine training.

Key Definitions (Physician-Friendly):

  • Artificial Intelligence (AI): Computer systems performing tasks typically requiring human intelligence (diagnosis, pattern recognition, decision-making, language understanding)

  • Machine Learning (ML): AI systems that learn from data rather than following explicit rules. Think: Algorithm learns from 10,000 chest X-rays labeled “pneumonia” or “normal” instead of being programmed with rules about infiltrates

  • Deep Learning (DL): Machine learning using artificial neural networks with many layers. Particularly good at analyzing images, text, and complex patterns. Most modern medical imaging AI uses deep learning

  • Supervised Learning: Algorithm learns from labeled examples (X-rays labeled by radiologists, pathology slides labeled by pathologists). Most medical AI is supervised learning

  • Unsupervised Learning: Algorithm finds patterns in unlabeled data (clustering similar patient types, identifying disease subtypes). Less common in clinical applications

The Key Insight for Physicians:

Modern medical AI doesn’t follow rules written by programmers. Instead, it learns patterns from large datasets. This brings both power (can find patterns humans miss) and problems (can learn biases, can’t explain reasoning, fails on cases different from training data).

What Medical AI Can Do Well:

  • Pattern recognition in images: Detecting diabetic retinopathy, identifying lung nodules, classifying skin lesions
  • Structured prediction: Predicting sepsis risk, estimating mortality, forecasting disease progression
  • Information extraction: Pulling structured data from clinical notes, identifying adverse events from EHRs
  • Language tasks: Summarizing literature, translating medical text, generating patient education materials

What Medical AI Cannot Do (Yet):

  • General medical reasoning: Can’t replicate broad clinical judgment across diverse scenarios
  • Handling truly novel cases: Struggles with presentations very different from training data
  • Explaining its reasoning: Black-box models can’t articulate why they made a prediction
  • Incorporating patient preferences: Doesn’t understand values, goals, cultural contexts
  • Taking responsibility: Algorithms don’t face medical boards or malpractice suits

Critical AI Performance Metrics (Clinical Translation):

  • Sensitivity (Recall): % of actual positives correctly identified. High sensitivity = few false negatives. Matters when missing a case is dangerous (e.g., cancer screening)

  • Specificity: % of actual negatives correctly identified. High specificity = few false positives. Matters when false alarms cause harm or unnecessary workups

  • Positive Predictive Value (PPV): If AI says “positive,” what’s the probability it’s actually positive? Depends on disease prevalence. A test with 95% sensitivity and 95% specificity has only 16% PPV if disease prevalence is 1%

  • AUC-ROC: Overall discrimination ability (range 0.5-1.0). Useful for comparing algorithms but doesn’t tell you clinical utility at specific thresholds

  • Calibration: Do predicted probabilities match observed frequencies? An AI saying “70% probability of sepsis” should be right 70% of the time

Warning: High accuracy/AUC in retrospective studies often doesn’t translate to real-world clinical benefit. Demand prospective validation.

Common AI Failure Modes:

Distribution Shift: Algorithm trained on Hospital A’s data fails at Hospital B due to different patient demographics, imaging equipment, clinical documentation practices (Beam and Kohane, 2020)

Overfitting: Algorithm memorizes training data instead of learning generalizable patterns. Performs brilliantly on training set, poorly on new patients

Confounding: Algorithm learns spurious correlations. Example: COVID-19 chest X-ray AI that actually detected the word “portable” (sicker patients get portable X-rays) instead of lung findings (DeGrave et al., 2021)

Adversarial Examples: Tiny, imperceptible changes to inputs fool AI completely. This is a patient safety concern (Finlayson et al., 2019)

Bias Amplification: If training data under-represents certain populations, AI performance will be worse for those groups (Obermeyer et al., 2019)

Label Leakage: Model uses clinician responses (antibiotic orders, consults) as inputs, detecting that someone already suspected the condition rather than predicting it

Automation Bias: Clinicians over-rely on AI recommendations, accepting incorrect AI suggestions more readily than incorrect human suggestions

Bridging EBM and AI: If you trained in evidence-based medicine, you already understand AI concepts under different names (external validity = external validation, confounding = spurious correlations, effect modification = subgroup performance variation). See terminology translation table in main text.

Prediction ≠ Inference: ML finds correlations, not causes. A readmission model identifies correlated factors, but changing those factors may not reduce readmissions. For causal questions, you still need RCTs or causal inference methods.

The Clinical Bottom Line:

AI is powerful pattern recognition, not artificial general intelligence. It augments physician capabilities but doesn’t replicate clinical judgment. Always maintain human oversight. Understand its limitations. Question vendor claims. Demand prospective validation in YOUR clinical context, not just impressive metrics from somewhere else.

Think of AI as a very sophisticated, very fast, but inflexible medical student: excellent at tasks it’s been trained on, completely lost when encountering something new.

Introduction

Every medical AI system, from diabetic retinopathy screening to sepsis prediction to radiology CAD, operates using machine learning algorithms that learn patterns from data. Understanding how these systems work, what they can and cannot do, and how to interpret their outputs is increasingly essential for clinical practice.

Current Adoption Landscape: Healthcare AI adoption lags behind other sectors but is accelerating. Approximately 8.3% of healthcare firms use AI in producing goods or services, compared to 11.6% in finance, 15.1% in education, and 23.2% in information services (Nguyen et al., 2025). Within healthcare, ambulatory care settings (8.7%) are adopting faster than nursing facilities (4.5%). These firm-level figures differ from the 60% physician personal adoption rate (2024 AMA survey) because they measure organizational deployment rather than individual use.

The Central Concept:

Traditional medical software follows explicit rules programmed by humans:

IF temperature > 38°C AND WBC > 12,000 AND systolic BP < 90
THEN flag for sepsis evaluation

Machine learning systems instead learn patterns from data:

Show algorithm 50,000 patient cases labeled "developed sepsis" or "no sepsis"
Algorithm identifies patterns (subtle vital sign trends, lab trajectories, timing relationships)
Apply learned patterns to predict sepsis risk in new patients

This fundamental difference explains both AI’s power (finds patterns humans miss) and its problems (learns biases, can’t explain, fails on novel cases).
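
To make the contrast concrete, here is a minimal sketch of the two approaches, assuming Python with NumPy and scikit-learn and entirely synthetic data; the thresholds and feature stand-ins are illustrative, not taken from any real sepsis model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rule_based_flag(temp_c: float, wbc: float, sbp: float) -> bool:
    """Traditional software: every threshold written by a human."""
    return temp_c > 38.0 and wbc > 12_000 and sbp < 90

# Machine learning: the decision boundary is learned from labeled cases.
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 3))                    # stand-ins for vital sign / lab trends
y = (X @ np.array([1.2, 0.8, -1.5]) + rng.normal(size=5_000)) > 1.0  # "developed sepsis" labels

model = LogisticRegression().fit(X, y)             # weights are estimated from data, not hand-coded
print(model.predict_proba(X[:1])[0, 1])            # learned risk estimate for one case
```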

Clinical and AI Terminology

If you trained in evidence-based medicine, you already understand most AI concepts under different names. This translation table maps familiar clinical research terminology to machine learning equivalents:

| Clinical Research / EBM | Machine Learning | Notes |
|---|---|---|
| Predictor, covariate, independent variable | Feature | Identical concept |
| Outcome, dependent variable | Label, target | What the model predicts |
| Model fitting, estimation | Training | Same process, different name |
| Internal validity | Training performance | How well the model fits its own data |
| External validity | External validation | Performance on new populations |
| Confounding | Spurious correlations | Both cause misleading associations |
| Selection bias | Training data bias | Who’s included shapes what’s learned |
| Effect modification | Subgroup performance variation | Different results in different populations |
| Prospective study | Prospective deployment | Real-world performance assessment |
| Sample size | Training dataset size | More data generally improves models |
| Overfitting a regression | Overfitting | Model captures noise, not signal |

The key insight: You already have the mental models for critical AI evaluation. The vocabulary is new; the concepts are not.

AI, Machine Learning, and Deep Learning: A Hierarchy

These terms are often used interchangeably but have distinct meanings:

flowchart TD
    A[Artificial Intelligence<br/>Broadest category: machines performing intelligent tasks] --> B[Machine Learning<br/>Systems that learn from data]
    B --> C[Deep Learning<br/>Neural networks with many layers]

    style A fill:#fee2e2,stroke:#dc2626,stroke-width:2px
    style B fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
    style C fill:#dbeafe,stroke:#2563eb,stroke-width:2px
Figure 5.1: The relationship between Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). AI is the broadest category encompassing all computer systems performing intelligent tasks. Machine Learning is a subset using algorithms that learn from data. Deep Learning is a further subset using multi-layer neural networks, particularly effective for medical imaging and language tasks.

Artificial Intelligence (AI)

The broadest category: Any computer system performing tasks that typically require human intelligence.

Medical examples:
  • Expert systems like MYCIN (rule-based, 1970s)
  • Machine learning algorithms (data-driven, 1990s-present)
  • Natural language processing for clinical notes
  • Robotic surgery systems
  • Clinical decision support systems

Key point: Not all AI is machine learning. MYCIN used hand-coded rules, not learning from data.

Machine Learning (ML)

Narrower category: Algorithms that learn patterns from data rather than following explicit rules.

Medical examples:
  • Predicting which patients will develop sepsis based on EHR data
  • Classifying skin lesions as benign or malignant from images
  • Extracting structured information from clinical notes
  • Forecasting disease progression from longitudinal data

Why ML matters for medicine: Medical data is too complex and nuanced for rule-based systems. ML can find subtle patterns across thousands of variables that no human could code explicitly.

Deep Learning (DL)

Narrowest category: Machine learning using artificial neural networks with many layers (hence “deep”).

Medical examples:
  • Detecting diabetic retinopathy from retinal fundus photographs (Gulshan et al., 2016)
  • Identifying pneumonia on chest X-rays (Rajpurkar et al., 2017)
  • Analyzing pathology slides for cancer detection (Nagpal et al., 2019)
  • Generating radiology reports from images

Why DL revolutionized medical imaging: Can learn complex features directly from raw pixels without manual feature engineering. Turned computer vision from mediocre to expert-level performance.

The clinical reality: Most modern medical AI uses deep learning, particularly for imaging tasks. Understanding “deep learning” means understanding most clinical AI systems you’ll encounter.

Machine Learning Fundamentals: A Clinical Analogy

Think about how you learned to diagnose pneumonia:

  1. Training: You saw hundreds of chest X-rays during residency, with attending physicians pointing out infiltrates, consolidations, effusions
  2. Pattern recognition: Your brain learned to recognize visual patterns associated with pneumonia
  3. Generalization: You can now diagnose pneumonia in new patients you’ve never seen
  4. Refinement: You get better with experience, especially on edge cases

Machine learning works similarly:

  1. Training Data: Algorithm sees thousands of chest X-rays labeled “pneumonia” or “normal” by radiologists
  2. Pattern Learning: Algorithm learns visual features associated with pneumonia (infiltrates, consolidation patterns, location preferences)
  3. Prediction: Algorithm can classify new chest X-rays it’s never seen
  4. Optimization: Algorithm improves by adjusting how much weight it gives different features

The key difference: Physicians can explain their reasoning (“I see right lower lobe consolidation with air bronchograms”). Deep learning algorithms can’t. They’re black boxes (Rudin, 2019).

Types of Machine Learning Relevant to Medicine

Supervised Learning (Most Common in Medicine)

What it is: Algorithm learns from labeled examples.

Medical applications:
  • Classification: Diagnosis tasks with categorical outputs (benign vs. malignant, pneumonia vs. normal, sepsis vs. no sepsis)
  • Regression: Predicting continuous outcomes (estimated survival time, predicted blood pressure, risk scores 0-100%)

Requirements:
  • Large dataset of labeled examples (thousands to millions)
  • High-quality labels (accurate diagnoses from expert physicians)
  • Labeled examples representing the diversity of real clinical cases

Limitations:
  • Label quality matters enormously: If training labels are inaccurate or biased, algorithm learns inaccurate/biased patterns
  • Can’t generalize beyond training distribution: If algorithm never saw pediatric cases during training, will perform poorly on children
  • Expensive: Labeling requires physician time

Example: Diabetic Retinopathy AI:

  • Training data: 128,000 retinal images labeled by ophthalmologists as “no DR,” “mild DR,” “moderate DR,” “severe DR,” “proliferative DR” (Gulshan et al., 2016)
  • Learning: Algorithm identifies patterns (microaneurysms, hemorrhages, exudates, neovascularization) associated with each severity level
  • Deployment: Can classify new retinal images into appropriate categories
  • Performance: Sensitivity 97.5%, specificity 93.4% for detecting referable diabetic retinopathy

Prediction vs. Inference: A Critical Distinction

In clinical research, you interpret regression coefficients as potential causal effects: “Each unit increase in X is associated with Y% higher odds, controlling for confounders.”

In machine learning, model weights are prediction tools, not causal estimates. A sepsis prediction model might weight “time since last antibiotic order” heavily because it correlates with clinical suspicion, not because delaying antibiotics causes sepsis. Multicollinearity, which invalidates causal inference, doesn’t hurt prediction.

Key implications for physicians:

  • ML finds correlations, not causes. A model predicting readmission identifies correlated factors, but changing those factors may not reduce readmissions
  • Confounding is not automatically addressed. ML may use confounders as predictive features rather than controlling for them
  • For causal questions (“Does X cause Y?”), you still need RCTs, propensity scores, or instrumental variables

If you need to explain why something happens, use epidemiological methods. If you need to predict what will happen, ML often excels.

Unsupervised Learning (Less Common Clinically)

What it is: Algorithm finds patterns in unlabeled data.

Medical applications:
  • Clustering: Identifying patient subgroups with similar characteristics (disease phenotypes)
  • Dimensionality reduction: Simplifying complex multi-omic data for interpretation
  • Anomaly detection: Flagging unusual patients who don’t fit typical patterns

Limitations:
  • Hard to validate (no ground truth labels)
  • Clinical utility often unclear
  • Requires careful interpretation

Example: Disease Phenotyping:

Unsupervised clustering of EHR data might identify distinct COPD subtypes based on symptoms, comorbidities, medication responses. These subtypes could be more clinically meaningful than traditional classifications.

Reinforcement Learning (Emerging in Medicine)

What it is: Algorithm learns by trial-and-error, receiving rewards for good actions and penalties for bad ones.

Medical applications (mostly research):
  • Optimizing treatment strategies for chronic diseases
  • Personalizing chemotherapy dosing
  • Controlling mechanical ventilation

Limitations:
  • Can’t learn on real patients (too dangerous)
  • Requires high-quality simulations
  • Validation challenges

Why it matters: Promising for personalized treatment optimization, but currently mostly theoretical.

Neural Networks: The Engine Behind Deep Learning

What is a Neural Network?

A neural network is a machine learning model loosely inspired by biological neurons, consisting of layers of interconnected nodes that transform inputs into outputs.

Clinical analogy: Think of diagnostic reasoning as information flow:

Patient symptoms → Your brain processes information → Differential diagnosis

Neural networks work similarly:

Input data → Hidden layers process information → Output prediction

The “deep” in deep learning: Multiple hidden layers allow learning complex, hierarchical patterns.

flowchart LR
    subgraph Input["Input Layer"]
        I1[Symptom 1]
        I2[Lab Value 1]
        I3[Imaging Feature 1]
        I4[Vital Sign 1]
    end

    subgraph Hidden["Hidden Layers<br/>(Process Information)"]
        H1[Node]
        H2[Node]
        H3[Node]
        H4[Node]
        H5[Node]
        H6[Node]
    end

    subgraph Output["Output Layer"]
        O1[Disease A Probability]
        O2[Disease B Probability]
        O3[No Disease Probability]
    end

    I1 --> H1 & H2 & H3
    I2 --> H1 & H2 & H3 & H4
    I3 --> H3 & H4 & H5 & H6
    I4 --> H4 & H5 & H6

    H1 & H2 & H3 --> O1 & O2 & O3
    H4 & H5 & H6 --> O1 & O2 & O3

    style Input fill:#dbeafe,stroke:#2563eb
    style Hidden fill:#fef3c7,stroke:#f59e0b
    style Output fill:#d1fae5,stroke:#10b981
Figure 5.2: Simplified representation of a neural network for medical diagnosis. Input layer receives patient data (symptoms, labs, imaging features). Hidden layers process information through mathematical transformations. Output layer produces predictions (disease probability, risk score, diagnosis). Each connection has a weight learned during training.

How Neural Networks Learn

Training process:

  1. Initialize: Start with random weights (connections between nodes)
  2. Forward pass: Input data flows through network, producing prediction
  3. Calculate error: Compare prediction to actual label (ground truth)
  4. Backward pass: Adjust weights to reduce error (backpropagation)
  5. Repeat: Iterate through thousands/millions of examples until error minimizes

Clinical translation:

Like a radiology resident reviewing cases with an attending:
  • Resident makes diagnosis (forward pass)
  • Attending provides correct answer (ground truth label)
  • Resident learns from mistakes (backpropagation)
  • Resident improves with practice (iterative learning)

Key difference: Neural networks can process millions of examples far faster than humans, finding subtle patterns across vast datasets that individual physicians couldn’t detect.
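
The loop below is a toy illustration of the five training steps above, assuming Python with NumPy and synthetic data. It trains a single logistic unit rather than a deep network, but the mechanics (forward pass, error, weight adjustment, iteration) are the same in spirit:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1_000, 4))                    # four input features per patient
true_w = np.array([1.0, -2.0, 0.5, 0.0])
y = ((X @ true_w) > 0).astype(float)               # synthetic ground-truth labels

w = np.zeros(4)                                    # 1. initialize weights
for step in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))                 # 2. forward pass: produce predictions
    error = p - y                                  # 3. calculate error vs. ground truth
    gradient = X.T @ error / len(y)                # 4. backward pass: how to adjust weights
    w -= 0.5 * gradient                            #    nudge weights to reduce error
                                                   # 5. repeat over many examples
loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
print(f"learned weights: {np.round(w, 2)}, final loss: {loss:.3f}")
```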

Convolutional Neural Networks (CNNs): For Medical Imaging

Special architecture for images:

  • Convolutional layers: Detect local patterns (edges, textures, shapes) before combining them into complex features
  • Hierarchical learning: Early layers learn simple features (edges), deep layers learn complex features (anatomical structures)
  • Translation invariance: Can detect findings regardless of location in image

Why CNNs revolutionized medical imaging:

Traditional computer vision required manually designing features (“look for round objects with size 2-5mm with density >X”). CNNs learn features automatically from raw pixels, discovering patterns human programmers wouldn’t have designed.

Medical applications:
  • Chest X-ray interpretation
  • CT/MRI analysis
  • Pathology slide review
  • Retinal imaging
  • Dermatology photo classification
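
For orientation, here is a minimal sketch of what a CNN definition looks like in code, assuming Python with PyTorch; the 224×224 grayscale input and layer sizes are illustrative, not taken from any particular product:

```python
import torch.nn as nn

# Input assumed to be a 1-channel (grayscale) 224x224 image; output is a
# single probability (e.g., pneumonia vs. normal). Layer sizes are arbitrary.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # early layer: edges, textures
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 224 -> 112
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layer: larger structures
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 112 -> 56
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 1),                   # combine learned features into one score
    nn.Sigmoid(),                                 # score -> probability
)
print(model)
```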

Recurrent Neural Networks (RNNs): For Sequential Data

Special architecture for time-series data:

  • Memory: Maintains information about previous inputs
  • Sequential processing: Processes data in order (like reading a clinical note sentence by sentence)

Medical applications:
  • Analyzing EHR data over time (predicting deterioration from vital sign trends)
  • Processing clinical notes (understanding context and relationships)
  • Forecasting disease progression

Modern variant: Transformers:

Power large language models like GPT-4, Med-PaLM. Better than RNNs at capturing long-range dependencies and parallel processing.

Identifying AI Types: A Quick Reference

When vendors describe their AI system, use this guide to understand what architecture they’re likely using and what questions to ask:

flowchart TD
    A[What type of data does the AI analyze?] --> B[Medical Images<br/>X-rays, CT, MRI, pathology]
    A --> C[Sequential/Time-Series<br/>Vital sign trends, EHR trajectories]
    A --> D[Clinical Text<br/>Notes, reports, literature]
    A --> E[Tabular Data<br/>Demographics, labs, structured EHR]

    B --> F[CNNs / Deep Learning<br/>Ask about: training data diversity, edge cases]
    C --> G[RNNs / Transformers<br/>Ask about: temporal validation, data leakage]
    D --> H[LLMs / Transformers<br/>Ask about: hallucination rates, grounding]
    E --> I[Gradient Boosting / Logistic Regression<br/>Ask about: feature importance, interpretability]

    style A fill:#f3f4f6,stroke:#6b7280,stroke-width:2px
    style F fill:#dbeafe,stroke:#2563eb
    style G fill:#fef3c7,stroke:#f59e0b
    style H fill:#d1fae5,stroke:#10b981
    style I fill:#fee2e2,stroke:#dc2626
Figure 5.3: Decision guide for identifying AI system types based on input data. Different data types call for different architectures, and mismatches between data type and algorithm suggest potential problems.

Why this matters: If a vendor claims to use “deep learning” for tabular risk prediction, ask why. Gradient boosting often outperforms neural networks on structured EHR data. This pattern holds even when comparing modern LLMs to traditional approaches: in a 2025 study of antimicrobial resistance prediction in sepsis, deep learning on structured EHR data (AUROC 0.85) significantly outperformed LLMs analyzing clinical notes (AUROC 0.74), with no benefit from combining the two (Hixon et al., 2025). The choice of algorithm should match the problem type.

Evaluating AI Performance: Clinical Metrics

Vendors tout impressive performance metrics. How do you evaluate them critically?

Confusion Matrix: The Foundation

Every binary classification task (disease vs. no disease) can be summarized:

|                   | Predicted Positive  | Predicted Negative  |
|-------------------|---------------------|---------------------|
| Actually Positive | True Positive (TP)  | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN)  |

From this come all common metrics:

Sensitivity (Recall, True Positive Rate)

\[\text{Sensitivity} = \frac{TP}{TP + FN}\]

Clinical meaning: Of all actual cases of disease, what % did AI detect?

When it matters most: Screening (cancer detection), rule-out tests, situations where missing a case is dangerous

Example: Sepsis prediction with 85% sensitivity misses 15% of patients who develop sepsis

Specificity (True Negative Rate)

\[\text{Specificity} = \frac{TN}{TN + FP}\]

Clinical meaning: Of all patients without disease, what % did AI correctly identify as negative?

When it matters most: Avoiding false alarms, situations where false positives cause harm (unnecessary biopsies, psychological distress, treatment side effects)

Example: Sepsis prediction with 70% specificity incorrectly flags 30% of patients who don’t have sepsis

Positive Predictive Value (PPV, Precision)

\[\text{PPV} = \frac{TP}{TP + FP}\]

Clinical meaning: If AI says “positive,” what’s the probability it’s actually correct?

Why it matters enormously: PPV depends on disease prevalence. A test with excellent sensitivity/specificity can have poor PPV if disease is rare.

Example:
  • Disease prevalence: 1%
  • Sensitivity: 95%
  • Specificity: 95%
  • PPV: Only 16%! (84% of “positive” predictions are false alarms)
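
The arithmetic behind that example, as a small sketch in plain Python (no dependencies), so you can plug in the prevalence of your own patient population:

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    true_pos = sensitivity * prevalence                 # expected true positives per patient
    false_pos = (1 - specificity) * (1 - prevalence)    # expected false positives per patient
    return true_pos / (true_pos + false_pos)

print(f"{ppv(0.95, 0.95, 0.01):.0%}")   # ~16% at 1% prevalence
print(f"{ppv(0.95, 0.95, 0.20):.0%}")   # ~83% at 20% prevalence
```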

Critical for physicians: Always ask “What’s the PPV in MY patient population?” not just “What’s the accuracy?”

Negative Predictive Value (NPV)

\[\text{NPV} = \frac{TN}{TN + FN}\]

Clinical meaning: If AI says “negative,” what’s the probability it’s actually correct?

Accuracy

\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]

Clinical meaning: Overall, what % of predictions are correct?

Warning: Misleading for imbalanced datasets. An AI that always predicts “no cancer” achieves 99% accuracy if cancer prevalence is 1%, but is clinically useless.
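
A short sketch (plain Python) that computes the count-based metrics above from a confusion matrix and reproduces this accuracy trap on a hypothetical 1%-prevalence screening population:

```python
def metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),  # undefined if no positive calls
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# 10,000 screened patients, 1% prevalence; the model calls everyone negative.
print(metrics(tp=0, fn=100, fp=0, tn=9_900))
# accuracy = 0.99, but sensitivity = 0.0 -- every cancer is missed
```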

AUC-ROC (Area Under Receiver Operating Characteristic Curve)

What it measures: Overall discrimination ability across all possible thresholds

Range: 0.5 (no better than chance) to 1.0 (perfect discrimination)

Interpretation:
  • 0.9-1.0: Excellent
  • 0.8-0.9: Good
  • 0.7-0.8: Fair
  • 0.6-0.7: Poor
  • 0.5-0.6: Fail

Limitations:
  • Doesn’t tell you performance at specific clinical threshold
  • Can be high even when PPV is poor at relevant prevalence
  • Doesn’t capture calibration (are probabilities accurate?)

Calibration

What it measures: Do predicted probabilities match observed frequencies?

Example: If AI predicts “30% risk of mortality” for 1000 patients, do ~300 actually die?

Why it matters: Poorly calibrated models produce misleading probabilities, making clinical decision-making difficult.

How to assess: Calibration plots comparing predicted vs. observed outcomes
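
One simple way to perform that check, sketched in Python with NumPy on synthetic data: bin the predicted probabilities and compare the mean predicted risk with the observed event rate in each bin.

```python
import numpy as np

def calibration_table(y_true, y_prob, n_bins=10):
    """Mean predicted risk vs. observed event rate per probability bin."""
    bins = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    return [(b, y_prob[bins == b].mean(), y_true[bins == b].mean(), int((bins == b).sum()))
            for b in range(n_bins) if (bins == b).any()]

rng = np.random.default_rng(1)
y_prob = rng.uniform(size=5_000)                           # a model's predicted risks
y_true = (rng.uniform(size=5_000) < y_prob).astype(float)  # toy outcomes, perfectly calibrated
for b, predicted, observed, n in calibration_table(y_true, y_prob):
    print(f"bin {b}: predicted {predicted:.2f} vs. observed {observed:.2f} (n={n})")
```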

Vendor Requirements and Red Flags
  1. Prospective validation in settings similar to yours (not just retrospective)
  2. Performance metrics relevant to your population (PPV at your disease prevalence)
  3. Subgroup analysis: Does performance differ by age, sex, race, insurance status?
  4. Calibration assessment: Are probabilities accurate?
  5. Failure mode documentation: When and how does the system fail?
  6. Independent validation: Published in peer-reviewed journals, not just vendor whitepapers

Translating Vendor Claims

When evaluating AI products, vendor marketing language often obscures rather than clarifies. This translation table helps interpret common claims:

| Vendor Says | What It Actually Means |
|---|---|
| “AI-powered” | Uses some form of pattern recognition (could be simple or sophisticated) |
| “Clinically validated” | Tested somewhere, but unclear if relevant to your population |
| “FDA-cleared” | Met minimum safety threshold for marketing; not an efficacy guarantee |
| “99% accuracy” | On their test set, with their prevalence, under their conditions |
| “Proven in 50+ hospitals” | Deployed, but performance at each site may vary dramatically |
| “Real-time predictions” | Fast output, but speed says nothing about accuracy |
| “Explainable AI” | Provides some rationale, but explanations may not match actual reasoning |
| “Trained on millions of cases” | Large dataset, but representativeness matters more than size |
| “Outperforms physicians” | In one narrow task, under controlled conditions, often retrospectively |
| “Continuously learning” | Model updates over time, but can drift or degrade without oversight |

When a claim sounds too good to be true, it usually is. Always ask for peer-reviewed publications, external validation data, and subgroup analyses.

When to Use AI (and When Not To)

AI Works Well When:

Well-defined task with clear output: Binary classification, risk score, structured prediction

Large, high-quality labeled dataset available: Thousands to millions of examples with expert labels

Pattern recognition from complex data: Images, longitudinal data, multi-dimensional inputs

Objective ground truth exists: Pathology biopsy confirms diagnosis, outcomes can be verified

Task is repetitive and time-consuming: Screening, triage, routine interpretation

Human performance has known limitations: Fatigue, variability, rare findings

Examples:
  • Diabetic retinopathy screening from retinal photos
  • Detecting pneumothorax on chest X-rays
  • Identifying metastases in pathology slides
  • Predicting no-show appointments
  • Extracting structured data from clinical notes

AI Struggles When:

Task requires general medical reasoning: Synthesizing information across domains, considering patient preferences, navigating uncertainty

Training data is limited or biased: Rare diseases, underrepresented populations, novel clinical scenarios

Ground truth is subjective or uncertain: Ambiguous diagnoses, prognosis depends on unmeasured factors

Explanation is essential: Medical-legal situations, teaching, situations requiring physician buy-in

Stakes are too high for any errors: Life-or-death decisions without human oversight

Examples:
  • Diagnosing complex multi-system diseases
  • Navigating goals-of-care discussions
  • Handling completely novel presentations
  • Replacing physician judgment entirely

Common Failure Modes and Limitations

Distribution Shift (External Validity)

The problem: AI trained on Hospital A’s data fails at Hospital B

Why: Different patient demographics, imaging equipment, clinical documentation, disease prevalence, treatment protocols

Example: Chest X-ray AI trained on data from academic medical centers performed poorly at community hospitals due to different patient populations and equipment (Zech et al., 2018)

What physicians should do: Demand validation in YOUR clinical context, not just impressive numbers from elsewhere

Overfitting (Memorization vs. Learning)

The problem: Algorithm memorizes training data instead of learning generalizable patterns

Why: Too complex model + too little data = memorization

How to detect: Excellent performance on training data, poor performance on new data

What physicians should do: Ask about validation strategy (hold-out test sets, cross-validation, external validation)

Confounding and Spurious Correlations

The problem: Algorithm learns correlations that don’t represent causal relationships

Famous example: COVID-19 chest X-ray AI that learned to detect the word “portable” in image metadata (sicker patients get portable X-rays) rather than actual lung findings (DeGrave et al., 2021)

What physicians should do: Question HOW the AI makes predictions, not just WHETHER it’s accurate. Ask about confounding analyses.

Adversarial Attacks

The problem: Tiny, imperceptible changes to inputs can completely fool AI

Example: Adding noise invisible to human eyes can make AI misclassify malignant lesions as benign (Finlayson et al., 2019)

Clinical implications: Potential safety and security risks, especially for critical diagnoses

Algorithmic Bias

The problem: If training data under-represents certain populations, AI performs worse for those groups

Famous example: Commercial algorithm for predicting healthcare needs systematically under-estimated risk for Black patients, giving them lower priority for care coordination programs (Obermeyer et al., 2019)

Why it happens: Historical inequities → biased data → biased algorithms → perpetuate inequities

What physicians should do: Demand subgroup analyses by race, sex, age, insurance status. Question fairness metrics.

Label Leakage

The problem: Algorithm uses information that wouldn’t be available at prediction time, or learns from clinician responses rather than patient physiology.

Example: A sepsis prediction model trained on EHR data might use antibiotic orders as input features. But antibiotics are ordered because clinicians suspect sepsis. The model isn’t predicting sepsis, it’s detecting that someone already suspected sepsis.

Why it’s dangerous: Model appears to predict early, but actually requires information that only exists after the clinical decision has been made.

What physicians should do: Ask vendors: “Walk me through exactly when each input feature becomes available relative to when the prediction is made.” If features include treatment orders, diagnostic tests ordered, or specialist consults, be suspicious.

Temporal Leakage

The problem: For time-series predictions (deterioration, readmission, mortality), using future information to predict past events.

Example: A 30-day readmission model trained on patients who were eventually readmitted, using data from their entire initial hospitalization, including the final discharge summary that mentions “patient at high risk for readmission.”

Why it’s dangerous: Retrospective performance looks excellent, but the model can’t access future data when deployed prospectively.

What physicians should do: Ask about temporal validation. The training/test split should respect time: train on earlier data, test on later data. Random shuffling of time-series data is a red flag.
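
A small sketch (Python with NumPy, synthetic admission dates) contrasting a random split with a time-respecting split; in a real evaluation the temporal version is the one to insist on:

```python
import numpy as np

rng = np.random.default_rng(7)
admit_day = np.sort(rng.integers(0, 1_460, size=10_000))   # admissions over ~4 years

# Random shuffle (a red flag for time-series data): future patients leak into training.
idx = rng.permutation(10_000)
train_random, test_random = idx[:8_000], idx[8_000:]

# Temporal split: train only on earlier admissions, test only on later ones.
cutoff = np.quantile(admit_day, 0.8)
train_temporal = np.where(admit_day <= cutoff)[0]
test_temporal = np.where(admit_day > cutoff)[0]
print(f"temporal split: training ends day {admit_day[train_temporal].max()}, "
      f"testing starts day {admit_day[test_temporal].min()}")
```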

Automation Bias

The problem: Clinicians over-rely on AI recommendations, even when their own judgment or other evidence contradicts it.

Evidence: Systematic review found automation bias increased risk of commission errors by 26% when using incorrect decision support compared to working without decision support (Goddard et al., 2012).

Why it happens: AI systems are perceived as objective and tireless. Cognitive load makes it easier to accept recommendations than to critically evaluate them.

What physicians should do: Treat AI outputs as one input among many, not as authoritative answers. Develop workflows that require active engagement with AI recommendations, not passive acceptance.

The Black-Box Problem

Most modern deep learning systems cannot explain their reasoning in clinically meaningful ways.

Why it matters:
  • Medical-legal: How do you defend a decision you can’t explain?
  • Trust: Physicians resist recommendations they don’t understand
  • Safety: Can’t identify failure modes if you can’t see reasoning
  • Learning: AI can’t teach the next generation if it can’t articulate reasoning

Approaches to explainability:
  • Saliency maps: Highlight image regions influencing prediction (but often don’t match clinical reasoning)
  • Attention mechanisms: Show which words/features model focused on
  • LIME/SHAP: Explain individual predictions (but computationally expensive, not always accurate)

Current reality: Deep learning explainability remains an active research area. Clinical AI is largely black-box.

What physicians should do: Maintain human oversight. Don’t blindly follow recommendations you can’t understand or verify.

Large Language Models: A Special Case

Terminology: LLM vs. LMM

Large Language Models (LLMs) accept text input and produce text output (ChatGPT, Claude). Large Multi-Modal Models (LMMs) accept multiple input types, including text, images, and audio, and can generate diverse outputs (GPT-4V, Gemini, Med-PaLM M). As medicine increasingly uses image + text models for radiology, pathology, and dermatology, the LMM distinction matters. WHO’s 2024 guidance uses “LMM” as the preferred term for these general-purpose foundation models (WHO, 2025).

What makes LLMs different:
  • Trained on massive text corpora (internet, books, journals)
  • Can perform diverse tasks without task-specific training (few-shot learning)
  • Generate human-like text (including medical documentation, patient education, literature summaries)

Medical applications:
  • Clinical documentation assistance
  • Literature synthesis
  • Patient question answering
  • Medical education
  • Clinical reasoning support (with caveats)

Critical limitations:
  • Hallucinations: Confidently generate plausible but incorrect information. LLMs fundamentally predict the next most likely token based on training patterns, rather than reasoning about content. This “next-word prediction” mechanism explains why hallucinations occur: the model has no conception of what it produces, only statistical patterns from training data. A response that is grammatically fluent and contextually plausible may be factually fabricated because the model optimizes for pattern completion, not truth (WHO, 2025).
  • Pattern matching vs. reasoning: High benchmark scores may reflect pattern recognition rather than genuine clinical reasoning. When answer patterns are disrupted, LLM accuracy drops 26-38% (Bedi et al., 2025)
  • No access to real-time data: Can’t check current patient status, recent lab results
  • No responsibility: Can’t be held accountable for errors
  • Privacy concerns: Sending patient data to external APIs

Detailed coverage: See Large Language Models in Clinical Practice

AI Agents: From Chatbots to Autonomous Systems

Traditional AI tools respond to single queries: you ask a question, the system provides an answer. AI agents go further. They combine large language models with the ability to use tools, execute code, search databases, call APIs, and chain multiple actions together to accomplish complex goals with varying degrees of autonomy.

What distinguishes agents from chatbots:

| Characteristic | Chatbot/LLM | AI Agent |
|---|---|---|
| Input/Output | Single question → single answer | Goal → multi-step plan → execution |
| Tool use | None | Web search, code execution, API calls, database queries |
| Memory | Limited context window | Can maintain persistent memory across sessions |
| Autonomy | Responds to prompts | Can act independently on goals |
| Iteration | One-shot response | Plans, acts, observes results, adjusts approach |

How agents work in clinical contexts:

  1. Goal specification: Clinician defines a task (“Find recent trials for this patient’s cancer type and stage”)
  2. Planning: Agent breaks the task into steps (search clinical trial databases, filter by eligibility, rank by relevance)
  3. Tool use: Agent executes searches, reads results, processes information
  4. Iteration: Agent refines approach based on what it finds
  5. Output: Structured recommendation with sources
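
The loop below sketches those five steps in Python. Every name in it is a stand-in (a toy planner and a toy trial-search tool), not a real LLM API or clinical-trial database:

```python
def toy_planner(history):
    """Stand-in for the LLM's planning step: query the tool once, then stop."""
    return ("trial_search", "stage III NSCLC") if len(history) == 1 else None

def toy_trial_search(query):
    """Stand-in for a real database or API call."""
    return f"3 matching trials found for '{query}'"

def run_agent(goal, tools, planner, max_steps=5):
    history = [f"GOAL: {goal}"]                          # 1. goal specification
    for _ in range(max_steps):
        action = planner(history)                        # 2. planning: choose the next step
        if action is None:
            break                                        #    planner judges the goal met
        tool_name, query = action
        result = tools[tool_name](query)                 # 3. tool use: execute the step
        history.append(f"{tool_name}({query!r}) -> {result}")  # 4. iterate on observations
    return "\n".join(history)                            # 5. output, with a trace of steps

print(run_agent("Find trials for this patient", {"trial_search": toy_trial_search}, toy_planner))
```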

Current clinical applications:

  • Oncology: Multi-agent systems for drug discovery, treatment planning, and clinical trial matching. Early evidence suggests 70-85% concordance with oncologist decisions, though no prospective RCT validation exists (Truhn et al., 2026)
  • Telehealth: The Utah prescribing pilot (Doctronic) uses approximately 100 specialized agents for autonomous prescription renewal in stable chronic conditions
  • Prior authorization: Agents navigating payer requirements and compiling documentation
  • EHR tasks: Stanford’s MedAgentBench found Claude 3.5 Sonnet achieved 69.67% success on 300 clinical EHR tasks (Fleming et al., 2025)

Critical safety considerations:

Agents amplify both capabilities and risks. When an LLM hallucinates, it generates incorrect text. When an agent hallucinates, it may take incorrect actions: submitting wrong orders, searching incorrect databases, or iterating on flawed assumptions.

  • No FDA clearance exists for autonomous clinical decision-making
  • Hallucination risks compound across multi-step workflows
  • Liability frameworks remain undefined for agent-caused errors
  • Human oversight is essential for any patient-affecting decisions

The autonomy spectrum:

Not all agents are fully autonomous. Clinical agents range from:

  • Supervised assistants: Generate drafts for human review (documentation, summaries)
  • Bounded autonomy: Act independently within narrow constraints (scheduling, information retrieval)
  • Full autonomy: Make and execute clinical decisions without human review (rare, regulatory barriers remain high)

Most clinical applications should remain at supervised or bounded autonomy levels. Fully autonomous clinical decision-making raises unresolved liability, safety, and ethical concerns.

Rapid democratization of agent capabilities:

Open-source agent frameworks are accelerating access to autonomous AI capabilities. OpenClaw (formerly Clawdbot, then Moltbot) has emerged as the most prominent example: an open-source autonomous AI assistant with over 100,000 GitHub stars that can browse the web, manage schedules, send emails, and execute multi-step tasks across messaging platforms. The ecosystem now includes Moltbook, an AI agent-exclusive social network with over 1.5 million registered agents where AI assistants can interact with each other on behalf of their users. This democratization means clinicians will increasingly encounter patients using personal AI agents for health-related tasks (symptom research, medication reminders, appointment scheduling) without clinical oversight. Understanding this landscape helps physicians anticipate patient questions and recognize when AI-assisted patient decisions may need clinical review.

Detailed coverage: See AI Agents in Oncology and Multi-Agent AI Systems for specialty-specific applications and emerging developments.

Key Takeaways for Physicians

Essential Concepts
  1. AI learns from data, doesn’t follow explicit rules: Power (finds hidden patterns) + Problems (learns biases, can’t explain)

  2. Most medical AI is supervised deep learning: Requires large labeled datasets, works best for pattern recognition tasks

  3. Performance metrics are nuanced: Accuracy/AUC alone don’t tell you clinical utility. Ask about PPV in YOUR population, calibration, subgroup performance

  4. Black boxes require trust but limit understanding: Maintain human oversight, don’t blindly follow unexplainable recommendations

  5. Distribution shift is universal: AI trained elsewhere often fails in your context. Demand local validation

  6. AI augments, doesn’t replace: Think “AI-assisted physician” not “physician-less AI”

  7. Bias is pervasive: If training data reflects healthcare inequities, AI perpetuates them

  8. Prospective validation is essential: Retrospective accuracy doesn’t guarantee prospective utility

The bottom line:

You don’t need to build neural networks to evaluate medical AI critically. You need to understand:
  • What AI can and cannot do
  • How to interpret performance metrics
  • What questions to ask vendors
  • What failure modes to watch for
  • When human oversight is essential

With these foundations, you’re prepared to evaluate AI tools for your specialty, covered in Part II.