Appendix B: Glossary of Medical AI Terms
Introduction
This glossary defines key terms used throughout the Physician AI Handbook. Terms are organized alphabetically and explained in practical, clinically relevant language.
A
AI (Artificial Intelligence): Computer systems designed to perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and pattern detection. In medicine, AI encompasses diagnostic algorithms, clinical decision support systems, and predictive models.
Algorithm: A step-by-step procedure or formula for solving a problem. In medical AI, algorithms process patient data (images, labs, clinical notes) to generate predictions, diagnoses, or treatment recommendations.
Algorithmic Bias: Systematic errors in AI predictions that disproportionately affect certain groups (e.g., racial/ethnic minorities, women, elderly). Bias typically stems from training data that underrepresents or misrepresents specific populations.
AUROC (Area Under the Receiver Operating Characteristic Curve): A performance metric measuring an AI model’s ability to distinguish between classes (e.g., disease present vs. absent). AUROC ranges from 0.5 (no better than chance) to 1.0 (perfect discrimination). Values above 0.8 are generally considered good and values above 0.9 excellent.
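For readers who want to see the metric in practice, here is a minimal sketch assuming Python with scikit-learn installed; the labels and scores are invented for illustration only:

```python
# Minimal AUROC sketch with scikit-learn (toy labels and scores, not real patients).
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                     # 1 = disease present, 0 = absent
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7]   # model's predicted probabilities

auroc = roc_auc_score(y_true, y_score)
print(f"AUROC = {auroc:.2f}")   # 0.5 = chance, 1.0 = perfect discrimination
```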
Autonomous AI: AI systems that make decisions and take actions without human oversight. In medicine, rare and controversial—e.g., IDx-DR (diabetic retinopathy screening AI) operates without physician interpretation. Contrast with decision support AI requiring human review.
B
Batch Learning: Training an AI model on a fixed dataset all at once, then deploying the “locked” model. Most current medical AI uses batch learning—models don’t update after deployment. Contrast with continuous/online learning.
Bias (Statistical): Systematic deviation of AI predictions from true values. Sources include non-representative training data, flawed data collection, or algorithmic design choices. See also Algorithmic Bias.
Black Box: AI models whose internal logic is opaque—input goes in, prediction comes out, but how the model reached its decision is unclear. Deep neural networks often criticized as black boxes. Contrast with interpretable/explainable models.
C
CAD (Computer-Aided Detection): AI systems that flag suspicious findings for physician review, e.g., highlighting lung nodules on chest X-rays or calcifications on mammograms. CAD augments but doesn’t replace human interpretation.
Calibration: The degree to which an AI model’s predicted probabilities match observed frequencies. A well-calibrated model that predicts “70% probability of disease” should be correct 70% of the time. Poorly calibrated models may be overconfident or underconfident.
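A hedged illustration of how calibration can be checked, assuming scikit-learn and synthetic data generated for this example:

```python
# Reliability-curve sketch: compare predicted probabilities with observed frequencies.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, 1000)        # hypothetical predicted risks
y_true = rng.binomial(1, y_prob)        # outcomes simulated to match those risks

frac_positive, mean_predicted = calibration_curve(y_true, y_prob, n_bins=10)
for pred, obs in zip(mean_predicted, frac_positive):
    print(f"predicted ~{pred:.2f}  observed {obs:.2f}")   # similar values = well calibrated
```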
CDSS (Clinical Decision Support System): Software providing clinicians with patient-specific assessments or recommendations to aid decision-making. Includes simple rule-based systems (alert if drug interaction detected) and complex AI-powered predictions (sepsis risk scores).
Confounding: When an observed association between AI prediction and outcome is actually due to a third variable. Example: COVID-19 CXR AI learned to detect “portable X-ray” text in image metadata (sicker patients) rather than actual lung pathology.
Continuous Learning: AI that updates its model as new data becomes available, adapting over time. Promises improved performance but raises regulatory challenges—how to approve a constantly changing system? Few medical AI systems currently use true continuous learning.
D
Data Augmentation: Techniques to artificially increase training dataset size by creating modified versions of existing data (e.g., rotating medical images, adding noise). Helps prevent overfitting and improves generalization.
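A minimal sketch of two common augmentations, assuming NumPy and SciPy; the "image" here is a random array standing in for a real scan:

```python
# Augmentation sketch: create modified copies of one image (rotation, noise).
import numpy as np
from scipy.ndimage import rotate

image = np.random.rand(256, 256)                         # stand-in for a grayscale X-ray
rotated = rotate(image, angle=10, reshape=False)          # small rotation
noisy = image + np.random.normal(0, 0.02, image.shape)    # mild Gaussian noise

augmented_batch = [image, rotated, noisy]                 # three training examples from one original
```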
Deep Learning: A subset of machine learning using artificial neural networks with multiple layers (hence “deep”). Excels at pattern recognition in images, text, and other complex data. Most medical AI breakthroughs (radiology, pathology, dermatology) use deep learning.
Deployment Drift: When an AI model’s performance degrades after deployment due to changes in patient population, data collection practices, equipment, or disease prevalence. Requires continuous monitoring to detect and address.
Differential Privacy: Techniques that allow AI training on sensitive patient data while mathematically guaranteeing individual patients cannot be re-identified. Adds “noise” to data to protect privacy while preserving statistical properties.
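The core idea can be sketched with the Laplace mechanism for a simple count query; this is illustrative only (toy numbers, NumPy assumed), and real deployments rely on vetted differential-privacy libraries:

```python
# Laplace-mechanism sketch: release a noisy count so no individual patient is identifiable.
import numpy as np

true_count = 127      # e.g., patients with a given diagnosis (toy value)
sensitivity = 1       # one patient changes the count by at most 1
epsilon = 0.5         # privacy budget: smaller = more noise = stronger privacy

noisy_count = true_count + np.random.laplace(loc=0, scale=sensitivity / epsilon)
print(round(noisy_count))   # released instead of the exact count
```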
E
Edge AI: AI models running locally on devices (smartphones, medical equipment, wearables) rather than cloud servers. Benefits include faster processing, offline capability, and enhanced privacy. Examples: smartphone-based ECG interpretation, on-device ultrasound AI.
Ensemble Model: Combining multiple AI models to improve prediction accuracy. For example, averaging predictions from 5 different lung nodule detection algorithms may outperform any single algorithm.
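A minimal sketch of the averaging idea, with hypothetical probabilities invented for illustration:

```python
# Ensemble sketch: average probability outputs from several models for one case.
import numpy as np

# Hypothetical nodule probabilities from five different detection models
model_predictions = [0.62, 0.71, 0.58, 0.80, 0.66]

ensemble_probability = np.mean(model_predictions)
print(f"Ensemble prediction: {ensemble_probability:.2f}")   # often more stable than any single model
```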
Explainable AI (XAI): AI systems designed to provide human-understandable explanations for their predictions. Techniques include highlighting image regions influencing a diagnosis or listing key factors driving a risk score. Critical for clinical trust and regulatory acceptance.
External Validation: Testing an AI model on data from a different institution, population, or time period than the training data. Essential to assess generalizability. Many models perform well internally but poorly when externally validated.
F
False Negative: When AI incorrectly predicts “no disease” for a patient who actually has the disease. Clinically dangerous—missed diagnoses, delayed treatment. The false negative rate equals 1 − sensitivity (see Sensitivity).
False Positive: When AI incorrectly predicts “disease present” for a healthy patient. Leads to unnecessary anxiety, testing, procedures, costs. The false positive rate equals 1 − specificity (see Specificity).
FDA (Food and Drug Administration): U.S. regulatory agency overseeing medical devices, including AI/ML-based software. Most medical AI requires FDA clearance (510(k)) or approval (PMA) before clinical use.
Feature: An individual measurable property or characteristic used as input for an AI model. In medical AI, features may include patient age, lab values, pixel intensities in images, or words in clinical notes.
Federated Learning: Training AI across multiple institutions without sharing raw patient data. Each site trains locally, only model updates (not data) are shared centrally and aggregated. Enables large-scale AI development while preserving privacy.
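A toy sketch of the aggregation step (federated averaging), assuming NumPy; real systems use dedicated frameworks and never exchange patient-level data:

```python
# Federated-averaging sketch: sites share model weights, never raw patient data.
import numpy as np

# Hypothetical locally trained weight vectors from three hospitals
site_weights = [np.array([0.9, 1.2]), np.array([1.1, 0.8]), np.array([1.0, 1.0])]
site_sizes = [500, 300, 200]    # number of local training patients at each site

# Weighted average of the updates, proportional to each site's data size
global_weights = np.average(site_weights, axis=0, weights=site_sizes)
print(global_weights)           # aggregated model sent back to every site
```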
Fine-Tuning: Adapting a pre-trained AI model to a specific task or dataset. For example, taking a general image recognition model and fine-tuning it to detect lung nodules using chest X-rays.
Foundation Model: Large AI models (e.g., GPT-4, Med-PaLM) trained on vast, diverse datasets and then fine-tuned for specific tasks. In medicine, foundation models may answer clinical questions, generate documentation, or assist diagnosis across multiple specialties.
G
Generalization: An AI model’s ability to perform well on new, previously unseen data. Poor generalization (overfitting) occurs when a model memorizes training data but fails on real-world cases.
Generative AI: AI that creates new content—text, images, audio, video. In medicine, applications include generating synthetic medical images for training, drafting clinical notes, or creating patient education materials. Examples: ChatGPT, DALL-E.
GMLP (Good Machine Learning Practice): Guidelines developed by FDA, Health Canada, and UK MHRA outlining best practices for developing and maintaining medical AI. Covers data quality, model validation, monitoring, and transparency.
Ground Truth: The true, correct answer against which AI predictions are compared. In medical AI, ground truth is typically expert annotations (e.g., radiologist labels identifying tumors) or definitive outcomes (biopsy results, clinical diagnoses).
H
Hallucination: When AI generates plausible-sounding but incorrect or fabricated information. Common problem with large language models—e.g., citing non-existent research papers, recommending unproven treatments. Dangerous in clinical contexts.
Hyperparameter: Settings that control how an AI model learns, such as learning rate, number of layers in a neural network, or regularization strength. Tuned during model development to optimize performance.
I
Imbalanced Data: When one class vastly outnumbers others in training data. Example: 95% normal chest X-rays, 5% with pneumonia. AI may achieve high accuracy by predicting “normal” for everything, but fail to detect rare disease. Requires special techniques to address.
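One common mitigation, class weighting, can be sketched as follows (scikit-learn assumed; the features here are random noise, used only to show the API):

```python
# Imbalance sketch: ~5% positive class; class_weight="balanced" penalizes missing it.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                       # placeholder features
y = (rng.uniform(size=1000) < 0.05).astype(int)      # ~5% "pneumonia", 95% "normal"

naive = LogisticRegression().fit(X, y)                              # tends to predict the majority class
weighted = LogisticRegression(class_weight="balanced").fit(X, y)    # up-weights the rare class
```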
Informed Consent: Ethical and legal requirement that patients understand and agree to medical interventions. For AI, raises questions: Must patients consent to AI use in their care? Be informed of AI limitations? Consent to data use for training?
Interpretability: The degree to which humans can understand why an AI model made a specific prediction. Linear models highly interpretable; deep neural networks less so. Critical for clinical adoption and regulatory approval.
L
Labeled Data: Training data with known outcomes or expert annotations. Example: chest X-rays labeled “pneumonia present” or “normal” by radiologists. Most supervised learning requires large amounts of labeled data—expensive and time-consuming to obtain.
Large Language Model (LLM): AI trained on vast amounts of text to generate human-like language. Examples: GPT-4, Claude, Med-PaLM. In medicine, LLMs answer clinical questions, summarize literature, generate documentation, but risk hallucinations.
Latent Space: Abstract representation of data learned by a neural network, typically a compressed embedding of the raw input. In medical imaging, latent space captures patterns (textures, shapes) that distinguish diseased from healthy tissue, even if not explicitly programmed.
M
Machine Learning (ML): Subset of AI where systems learn patterns from data without explicit programming. Includes supervised learning (learning from labeled data), unsupervised learning (finding patterns in unlabeled data), and reinforcement learning (learning through trial-and-error).
Model: The mathematical representation learned by an AI algorithm from training data. Once trained, the model makes predictions on new data. In medical AI, models range from simple logistic regression to complex deep neural networks.
Multimodal AI: AI systems integrating multiple data types—images, text, genomics, labs, wearables. Promises more holistic assessment than single-modality AI. Example: combining chest X-ray with clinical notes to improve pneumonia diagnosis.
N
Natural Language Processing (NLP): AI techniques for analyzing human language. In medicine, NLP extracts structured information from unstructured clinical notes, generates documentation, or powers chatbots answering patient questions.
Neural Network: AI model inspired by brain structure, consisting of interconnected nodes (neurons) organized in layers. Information flows from input layer through hidden layers to output layer. Deep neural networks have many hidden layers.
NPV (Negative Predictive Value): Probability that a patient with a negative AI prediction truly does not have the disease. Depends on disease prevalence—high when disease is rare, lower when common.
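The prevalence dependence can be made concrete with a short calculation, assuming a hypothetical test with 90% sensitivity and 90% specificity:

```python
# NPV sketch: same test, different prevalence, very different NPV.
def npv(sensitivity, specificity, prevalence):
    true_negatives = specificity * (1 - prevalence)
    false_negatives = (1 - sensitivity) * prevalence
    return true_negatives / (true_negatives + false_negatives)

for prevalence in [0.01, 0.10, 0.30]:
    print(f"prevalence {prevalence:.0%}: NPV = {npv(0.90, 0.90, prevalence):.3f}")
# Rare disease: NPV near 1.0; common disease: NPV noticeably lower.
```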
O
Overfitting: When an AI model memorizes training data (including noise and peculiarities) rather than learning generalizable patterns. Overfitted models perform well on training data but poorly on new data. Prevented through regularization, cross-validation, larger datasets.
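A small demonstration of the training/generalization gap, assuming scikit-learn and synthetic data:

```python
# Overfitting sketch: an unconstrained decision tree memorizes training data
# but performs worse under cross-validation.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)   # noisy signal

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print("training accuracy:", tree.score(X, y))                           # ~1.0 (memorized)
print("cross-validated accuracy:", cross_val_score(tree, X, y).mean())  # noticeably lower
```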
P
PCCP (Predetermined Change Control Plan): FDA framework allowing medical AI manufacturers to specify anticipated changes (e.g., retraining on new data) and receive approval for those changes in advance. Enables continuous improvement without repeated FDA submissions.
Precision (Positive Predictive Value): Proportion of positive AI predictions that are correct. High precision = few false positives. Critical when false positives are costly (unnecessary biopsies, anxiety, procedures).
Pre-training: Initial training of an AI model on a large, general dataset before fine-tuning on a specific medical task. Example: pre-train on millions of general images, then fine-tune on chest X-rays. Improves performance, especially with limited medical data.
Prospective Validation: Testing AI on data collected after model development, simulating real-world deployment. Gold standard for validation, much more rigorous than retrospective validation on historical data.
R
Radiomics: Extracting quantitative features from medical images (texture, shape, intensity patterns) and using them for AI prediction. Example: tumor heterogeneity on CT predicting treatment response.
Recall (Sensitivity): Proportion of actual cases correctly identified by AI. High recall = few false negatives. Critical when missing disease is dangerous (e.g., cancer screening).
Reinforcement Learning: AI learns through trial-and-error, receiving rewards for correct actions and penalties for incorrect ones. Rare in clinical medicine (ethical concerns about “trial-and-error” on patients) but used in drug discovery, treatment optimization simulations.
Retrospective Validation: Testing AI on historical data collected before model development. Easier and faster than prospective validation, but prone to biases and may overestimate real-world performance.
ROC Curve (Receiver Operating Characteristic): Graph plotting true positive rate (sensitivity) vs. false positive rate (1 − specificity) at various decision thresholds. Used to evaluate AI diagnostic performance. See also AUROC.
S
Sensitivity (Recall): Proportion of actual disease cases correctly identified by AI. Formula: True Positives / (True Positives + False Negatives). High sensitivity critical for screening tests where missing disease is dangerous.
Specificity: Proportion of healthy patients correctly identified as healthy by AI. Formula: True Negatives / (True Negatives + False Positives). High specificity critical when false positives costly.
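The formulas in the Sensitivity and Specificity entries above reduce to simple arithmetic on confusion-matrix counts; the numbers below are invented for illustration:

```python
# Sensitivity and specificity from confusion-matrix counts (toy numbers).
true_positives, false_negatives = 90, 10     # 100 patients with disease
true_negatives, false_positives = 850, 50    # 900 patients without disease

sensitivity = true_positives / (true_positives + false_negatives)    # 0.90
specificity = true_negatives / (true_negatives + false_positives)    # ~0.94
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```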
Supervised Learning: AI learning from labeled training data with known outcomes. Most medical AI uses supervised learning—e.g., training on images labeled by expert radiologists.
Synthetic Data: Artificial data generated by AI (e.g., fake medical images that look realistic but don’t correspond to real patients). Used for training, augmentation, or privacy protection. Risk: may not capture full complexity of real clinical data.
T
Test Set: Subset of data held back during model development and used only for final evaluation. Provides unbiased estimate of how AI will perform in real world. Must be completely separate from training data.
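A minimal hold-out sketch using a public scikit-learn dataset as a stand-in for clinical data:

```python
# Hold-out sketch: the test set is split off first and used only for final evaluation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))   # estimate of performance on unseen data
```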
Training Data: Data used to teach an AI model. Model learns patterns from this data. Quality and representativeness of training data critically determine model performance and bias.
Transfer Learning: Applying knowledge learned from one task to a different but related task. Example: AI trained to recognize cats and dogs can be fine-tuned to detect tumors in medical images. Reduces data requirements for medical AI.
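A hedged PyTorch/torchvision sketch of the idea (these libraries are assumed to be installed, and pre-trained ImageNet weights are downloaded on first use); the two-class head is hypothetical:

```python
# Transfer-learning sketch: reuse ImageNet features, retrain only the final layer.
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)   # pre-trained on general images
for param in backbone.parameters():
    param.requires_grad = False                       # freeze the general-purpose feature layers

backbone.fc = nn.Linear(backbone.fc.in_features, 2)   # new head, e.g., "nodule" vs. "no nodule"
# Only the new head is trained on the (much smaller) medical imaging dataset.
```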
Triage: AI that prioritizes cases by urgency. Example: flagging head CT with intracranial hemorrhage for immediate radiologist review, while normal studies reviewed later. Improves workflow efficiency and reduces time to treatment for critical findings.
U
Underfitting: When an AI model is too simple to capture patterns in data, performing poorly on both training and test data. Opposite of overfitting. Fixed by using more complex models or better features.
Unsupervised Learning: AI learning patterns from unlabeled data without explicit outcomes. Applications include clustering similar patients or discovering disease subtypes. Less common in clinical medicine than supervised learning.
V
Validation: Process of evaluating an AI model’s performance on data not used during training. Internal validation uses data from same institution; external validation uses data from different institutions, populations, or time periods. External validation essential for assessing generalizability.
Vanishing Gradient Problem: Technical challenge in training very deep neural networks where learning signals become too weak to update early layers. Addressed through architectural innovations (e.g., residual networks).
W
Weakly Supervised Learning: Training AI with imprecise or incomplete labels. Example: using billing codes (imperfect) instead of chart review (expensive) to label diagnoses. Reduces labeling burden but may introduce errors.
X
XAI: See Explainable AI.
Z
Zero-shot Learning: AI performing tasks without seeing any training examples for that specific task, relying on knowledge from related tasks. Example: Large language model answering medical questions about rare diseases never seen during training. Promising but risk of hallucinations.
This glossary provides a foundation for understanding medical AI terminology. For deeper exploration of specific concepts, refer to relevant handbook chapters.