3  AI in Medicine: A Brief History

Tip: Learning Objectives

This chapter traces AI’s seven-decade journey through medicine, from expert systems to foundation models. You will learn to:

  • Distinguish genuine clinical breakthroughs from recurring hype cycles (MYCIN, IBM Watson, diagnostic AI)
  • Recognize why technical excellence doesn’t guarantee clinical adoption
  • Identify patterns separating successful medical AI deployments from research prototypes
  • Understand FDA regulation evolution and current frameworks
  • Evaluate whether today’s foundation models represent a paradigm shift
  • Apply historical lessons about clinical integration, medical liability, and physician trust

No technical prerequisites required.

The Clinical Context: AI has experienced 70 years of boom-bust cycles—from 1970s expert systems to today’s foundation models—with each wave promising to revolutionize medicine but delivering narrow, often fragile applications. Understanding this history is essential for distinguishing genuine breakthroughs from marketing hype and avoiding expensive implementation failures.

The Cautionary Tales:

  • MYCIN (1972-1979): Stanford’s expert system matched infectious disease specialists in diagnosing bacterial blood infections (Shortliffe et al. 1975). Evaluation showed 65% acceptability and 90.9% accuracy in antimicrobial selection (Yu et al. 1979). Yet MYCIN was never used clinically—killed by liability concerns, FDA uncertainty, the impossibility of integrating it into hospital workflows, and physician trust barriers. Key lesson: Technical excellence ≠ clinical adoption.

  • IBM Watson for Oncology (2012-2018): After defeating Jeopardy! champions, Watson promised to revolutionize cancer treatment by analyzing medical literature and EHRs. Multiple hospitals worldwide adopted it. Reality: Watson produced unsafe and incorrect treatment recommendations (Ross and Swetlitz 2018). MD Anderson cancelled a $62 million implementation (Strickland 2019). Jupiter Hospital (India) found it recommended treatments unavailable in their country (Liu et al. 2020). Quietly discontinued. Key lesson: Marketing hype doesn’t equal clinical validity.

  • Google Flu Trends (2008-2015): Published in Nature (Ginsberg et al. 2009), GFT used search queries to predict influenza activity 1-2 weeks ahead of CDC surveillance. Initial success led to widespread adoption. Then it overestimated peak flu activity by 140% in 2012-2013 (Lazer et al. 2014). Why? Algorithm updates, changing search behavior, overfitting to spurious correlations. Discontinued 2015. Key lesson: Black-box algorithms fail when they can’t adapt to distribution shift.

The Pattern That Repeats:

  1. Breakthrough technology demonstrates impressive capabilities in controlled settings
  2. Overpromising: “AI will revolutionize medicine / replace radiologists / eliminate diagnostic errors”
  3. Pilot studies succeed with carefully curated datasets
  4. Reality: deployment reveals liability concerns, workflow disruption, integration challenges, trust gaps
  5. Disillusionment when technology falls short of marketing claims
  6. Eventual integration into narrow, well-defined applications (if evidence supports it)

What Actually Works in Clinical Medicine:

Diabetic retinopathy screening (IDx-DR, FDA-cleared 2018) (Abràmoff et al. 2018):

  • Specific, well-defined task with clear ground truth
  • Prospective validation in real clinical settings
  • Autonomous operation without physician interpretation
  • Currently deployed in primary care and endocrinology clinics

Computer-aided detection (CAD) in mammography (Lehman et al. 2015):

  • Augments radiologist interpretation, doesn’t replace it
  • Designed to reduce false negatives in screening
  • Integrated into radiology workflow
  • Early evidence suggested improved cancer detection; later prospective studies were mixed (see Section 3.4.2)

Sepsis prediction alerts (Epic Sepsis Model, others) (Wong et al. 2021):

  • High-stakes problem with clear intervention pathway
  • Alerts clinicians to deteriorating patients
  • But: high false positive rates remain problematic
  • Ongoing debate about clinical benefit vs. alert fatigue

AI-assisted pathology (Paige Prostate, FDA-cleared 2021) (Pantanowitz et al. 2020):

  • Flags suspicious regions for pathologist review
  • Reduces interpretation time
  • Maintains human-in-the-loop oversight

What Doesn’t Work (Documented Failures):

  • IBM Watson for Oncology: unsafe recommendations and poor real-world performance (Ross and Swetlitz 2018)
  • Epic Sepsis Model at Michigan Medicine: 33% sensitivity at its alerting threshold, missing two-thirds of sepsis cases while carrying a high false-alarm burden (Wong et al. 2021)
  • Skin cancer apps lacking validation: many show poor performance outside their training distributions (Freeman et al. 2020)
  • Autonomous diagnostic systems without human oversight: liability unclear, physician resistance high

The Critical Insight for Physicians:

Technical metrics (accuracy, AUC-ROC, sensitivity, specificity) do not predict clinical utility. The hardest problems in deploying medical AI are:

  • Medical liability: Who’s responsible when AI fails?
  • FDA regulation: Which devices require clearance? How much evidence is enough?
  • Clinical workflow integration: Does this fit how we actually practice?
  • Physician trust: Will clinicians follow AI recommendations?
  • Patient acceptance: Are patients comfortable with algorithmic decisions?

Why This Time Might Be Different:

  • Data availability: EHRs, genomics, imaging archives, wearables, multi-omic datasets
  • Computational power: Cloud computing, GPUs, TPUs make complex models feasible
  • Algorithmic breakthroughs: Transfer learning, foundation models (GPT-4, Med-PaLM 2), few-shot learning
  • Regulatory maturity: FDA has frameworks for AI/ML-based medical devices
  • Clinical acceptance: Younger physicians trained alongside AI tools show higher adoption

Yet fundamental challenges persist: Explainability (black-box models), fairness (algorithmic bias), reliability (distribution shift), deployment barriers, and the irreducible complexity of clinical medicine.

The Clinical Bottom Line:

Be skeptical of vendor claims. Demand prospective clinical trials, not just retrospective validation. Prioritize patient safety over efficiency. Understand that you remain medically and legally responsible for clinical decisions, regardless of AI recommendations. Start with narrow, well-defined problems rather than general diagnostic systems. Center physician and patient perspectives in AI development and deployment.

History shows: Most medical AI projects fail. Learning why matters more than celebrating the rare successes.

3.1 Introduction

Artificial intelligence in medicine isn’t new. The field has experienced multiple waves of excitement and bitter disillusionment over seven decades. Each cycle promised to revolutionize clinical practice. Each fell short.

So why should physicians believe that this time is different?

Understanding AI’s history in medicine isn’t just academic curiosity—it’s essential for navigating today’s hype, identifying genuinely transformative applications, and avoiding expensive failures that harm patients or waste resources. The patterns repeat: breathless promises, pilot studies that look impressive, deployment challenges nobody anticipated, and eventual disillusionment when technology doesn’t match marketing.

But history also reveals what works. Successful medical AI applications share common traits: they solve specific, well-defined clinical problems; they augment rather than replace physician expertise; they integrate into existing workflows rather than demanding wholesale practice transformation; and most importantly, they have prospective clinical evidence demonstrating patient benefit.

This chapter traces AI’s journey through medicine from philosophical thought experiment to today’s foundation models, with focus on lessons for practicing physicians.

timeline
    title AI Evolution in Medicine: From Expert Systems to Foundation Models
    1950s-1960s : Birth of Medical AI
                : Turing Test (1950)
                : Dartmouth Conference (1956)
                : Early diagnosis systems
    1970s-1980s : Expert Systems Era
                : MYCIN (1972): Infectious disease
                : INTERNIST-I (1974): Internal medicine
                : First AI Winter (late 1980s)
    1990s-2000s : Machine Learning Era
                : Support Vector Machines
                : FDA regulates CAD systems
                : Evidence-based medicine integration
    2010s : Deep Learning Revolution
          : AlexNet (2012): Computer vision
          : FDA clears first deep learning device (2017)
          : Diabetic retinopathy AI (IDx-DR, 2018)
    2020s : Foundation Model Era
          : GPT-3 (2020), ChatGPT (2022)
          : Med-PaLM 2 (Google, 2023)
          : FDA AI/ML framework evolves
Figure 3.1: Timeline of AI development in medicine from the 1950s to 2020s, showing major breakthroughs (summers) and setbacks (winters). Each era brought different approaches and capabilities, from expert systems and rule-based reasoning to machine learning, deep learning, and today’s foundation models. The cyclical pattern of hype and disillusionment has repeated multiple times, yet each wave built on lessons from previous attempts.

3.2 The Birth of Medical AI (1950s-1960s)

3.2.1 The Turing Test and Medical Diagnosis

In 1950, British mathematician Alan Turing published “Computing Machinery and Intelligence” (Turing 1950), opening with a deceptively simple question: “Can machines think?”

Rather than define “thinking” philosophically, Turing proposed a practical test: if a human evaluator couldn’t distinguish a machine’s responses from a human’s, the machine could be said to exhibit intelligence. This pragmatic approach—judge by outputs, not internal mechanisms—still influences how we evaluate medical AI systems today.

Note: Historical Context

When Turing wrote his paper, computers were room-sized calculators used primarily for mathematical computations and code-breaking (Turing himself had led cryptanalysis efforts at Bletchley Park during World War II). The idea that machines might one day diagnose diseases or recommend treatments seemed like science fiction. Yet Turing explicitly discussed medical diagnosis as a potential application of machine intelligence.

Relevance for physicians today: We still evaluate medical AI using a modified Turing-like framework: Can the algorithm’s diagnostic accuracy match or exceed expert physicians? But as MYCIN’s story will show, matching expert performance doesn’t guarantee clinical adoption.

3.2.2 The Dartmouth Conference (1956)

The field of AI was formally born at Dartmouth College (McCarthy et al. 2006) in summer 1956. John McCarthy, Marvin Minsky, Claude Shannon, and other luminaries gathered for a two-month workshop with an audacious premise:

“Every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.”

They were spectacularly overconfident about timelines. McCarthy predicted machines would achieve human-level intelligence within a generation. Instead, we got symbolic AI systems that could play checkers but couldn’t recognize a cat—a task any toddler performs effortlessly.

Relevance for clinical practice: This early overconfidence established a pattern physicians still see today—brilliant computer scientists underestimating how much of medical expertise is tacit, contextual, and embodied. Diagnosing a patient requires more than applying rules; it requires intuition built from thousands of cases, cultural competency, the ability to notice what’s not documented in the chart, and the capacity to synthesize ambiguous, incomplete, sometimes contradictory information under uncertainty and time pressure.

3.2.3 Early Medical AI Attempts

The first attempts to apply AI to medicine emerged in the 1960s:

  • DENDRAL (1965) (Lindsay et al. 1993): A Stanford system that identified molecular structures from mass spectrometry data. It worked, but only in a highly constrained domain with clear rules.

  • Pattern recognition for cancer diagnosis: Early computer vision systems attempted to identify cancerous cells from microscope images. Results were mixed, and 1960s computational power wasn’t sufficient for complex image analysis.

These early efforts revealed a fundamental challenge that persists today: medicine deals with messy, incomplete data about extraordinarily complex biological systems. Unlike chess (deterministic rules, perfect information) or mathematical theorem proving (formal logic), medical diagnosis involves uncertainty, missing data, biological variability, comorbidities, and patient preferences.

3.3 The Expert Systems Era (1970s-1980s): MYCIN’s Promise and Failure

The 1970s brought a new approach: if we can’t make machines think like humans, maybe we can capture human expertise in formal IF-THEN rules.

3.3.1 MYCIN: Technical Success, Clinical Failure

In 1972, Edward Shortliffe began developing MYCIN (Shortliffe et al. 1975) at Stanford, an expert system for diagnosing bacterial blood infections and recommending antibiotics. This was a seemingly perfect test case for AI in medicine:

Why MYCIN was promising:

  • Well-defined clinical problem: Identify causative bacteria and select appropriate antibiotics
  • Clear expertise: Infectious disease specialists followed identifiable reasoning patterns
  • Life-or-death stakes: Sepsis kills quickly; correct antibiotic choice dramatically affects mortality
  • Knowledge-intensive task: Success requires knowing hundreds of drug-bug interactions, resistance patterns, patient-specific factors

How MYCIN worked:

MYCIN used backward chaining through approximately 600 IF-THEN rules:

IF:
  1) Patient is immunocompromised host
  AND
  2) Site of infection is gastrointestinal tract
  AND
  3) Gram stain is gram-negative-rod
THEN:
  Evidence (0.7) that organism is E. coli

The system conducted structured consultations by asking questions, applying rules, and—crucially—explaining its reasoning. This explainability was revolutionary. MYCIN could answer “Why do you believe this?” and “How did you reach that conclusion?”—something today’s deep learning systems struggle to do convincingly.
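
To make the rule format concrete, here is a minimal Python sketch of how a rule like the one above might be represented and fired. It is purely illustrative: the conditions and the 0.7 certainty value come from the example rule, while the data structures, the consult function, and the omission of MYCIN’s certainty-factor combination logic are simplifications, not MYCIN’s actual implementation.

# Toy MYCIN-style rule: IF all conditions hold THEN conclude with a certainty factor.
# Illustrative sketch only; real MYCIN chained ~600 such rules backward from a goal.

from dataclasses import dataclass

@dataclass
class Rule:
    conditions: list   # findings that must all be present
    conclusion: str    # hypothesis supported by the rule
    certainty: float   # strength of evidence ("certainty factor")

RULES = [
    Rule(
        conditions=["immunocompromised host",
                    "infection site: gastrointestinal tract",
                    "gram stain: gram-negative rod"],
        conclusion="organism is E. coli",
        certainty=0.7,
    ),
]

def consult(findings):
    """Fire every rule whose conditions are all present and explain why."""
    for rule in RULES:
        if all(condition in findings for condition in rule.conditions):
            print(f"Evidence ({rule.certainty}) that {rule.conclusion}")
            # Explainability: the fired rule itself is the explanation.
            print("  Because:", "; ".join(rule.conditions))

consult({"immunocompromised host",
         "infection site: gastrointestinal tract",
         "gram stain: gram-negative rod"})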

The results were stunning:

Rigorous evaluation studies in the late 1970s found that MYCIN performed as well as infectious disease experts (Yu et al. 1979) and better than junior physicians. A landmark study published in JAMA showed:

  • 65% of MYCIN’s therapy recommendations were deemed acceptable by expert review
  • 90.9% accuracy in prescribing appropriate antimicrobial therapy
  • Performance comparable to ID faculty, superior to residents

Papers celebrated MYCIN as a breakthrough. It was featured in medical journals, computer science conferences, and popular media. Stanford Medicine showcased it as the future of clinical decision support.

The devastating reality:

Despite this technical and clinical success, MYCIN was never used to treat a single patient. Not once. Not in a clinical trial. Not even in a supervised pilot study.

Why did MYCIN fail so completely? The reasons had nothing to do with the AI’s performance:

Warning: Why MYCIN Failed: Lessons for Today’s Medical AI

Medical Liability: Who is legally responsible if MYCIN recommends the wrong antibiotic and the patient dies? The physician who followed its advice? Stanford University? The programmers? In the 1970s (and arguably still today), liability frameworks couldn’t accommodate algorithmic decision-making.

FDA Regulatory Uncertainty: The FDA had no framework for regulating software-based medical devices. Was MYCIN a medical device requiring approval? If so, what clinical evidence would be needed? These questions weren’t resolved until decades later.

Technical Integration Barriers: Getting MYCIN’s recommendations required a separate computer terminal, manual data entry, and stepping outside normal clinical workflow. In the 1970s, hospitals were just beginning to adopt electronic systems. MYCIN couldn’t integrate with existing infrastructure.

Physician Trust: Doctors weren’t comfortable following advice from a system they didn’t understand. The black-box problem persisted in practice: even though MYCIN could explain its rules, physicians couldn’t validate them independently during patient care.

Knowledge Maintenance Burden: Medical knowledge evolves. Keeping 600 hand-coded rules updated proved impractical. When new antibiotics became available or resistance patterns changed, updating MYCIN required programmer intervention.

Hospital Politics: Infectious disease specialists worried MYCIN would undermine their consultative role. Administrators couldn’t justify dedicated computer resources. Insurers wouldn’t reimburse for AI-assisted care.

The lesson for today’s physicians:

MYCIN proves that technical excellence doesn’t guarantee clinical adoption. The hardest problems deploying medical AI are rarely algorithmic—they’re legal, regulatory, social, organizational, and workflow-related.

This lesson remains profoundly relevant. Today’s deep learning models vastly exceed MYCIN’s capabilities. Yet they face identical deployment challenges: liability uncertainty, integration complexity, physician trust barriers, and unclear value proposition relative to existing clinical workflows.

3.3.2 Other Expert Systems: Similar Patterns

The 1980s saw dozens of medical expert systems, most following MYCIN’s pattern of impressive demonstrations but minimal clinical impact:

  • INTERNIST-I/CADUCEUS (1974-1985) (Miller, Pople, and Myers 1982): Diagnosed diseases across internal medicine (~1000 diseases, ~3,500 manifestations). Could rival internists in complex cases. Never clinically deployed.

  • DXplain (1984-present) (Barnett et al. 1987): Differential diagnosis support system. Still used today for education and clinical decision support, but as a reference tool, not autonomous diagnostic system. Rare success story due to modest scope and human-in-the-loop design.

  • ONCOCIN (1981): Guided cancer chemotherapy protocols. Impressive demonstrations, minimal clinical adoption.

3.3.3 Why Expert Systems Failed

By the late 1980s, the expert systems approach hit fundamental limits:

  1. Brittleness: Systems worked perfectly within their narrow domain but failed catastrophically on edge cases. A single unexpected finding could derail the entire diagnostic process.

  2. Knowledge acquisition bottleneck: Extracting rules from expert physicians was extraordinarily time-consuming and incomplete. Experts often couldn’t articulate their reasoning explicitly.

  3. Combinatorial explosion: Real-world medicine requires thousands of interacting rules. Managing complexity became unworkable.

  4. Maintenance burden: Medical knowledge evolves continuously. Hand-coded rules became outdated, requiring constant programmer intervention.

  5. Lack of learning: Expert systems couldn’t improve from experience. Every new case provided no feedback to enhance future performance.

  6. Deployment realities: As MYCIN demonstrated, technical performance didn’t address liability, regulation, integration, or trust.

The “AI Winter” of the late 1980s and 1990s arrived. Funding dried up. Researchers left the field. Companies removed “AI” from marketing materials. Medical AI seemed like a failed experiment.

The lesson: Rule-based approaches couldn’t capture the complexity, uncertainty, and nuance of real clinical practice. Medicine needed a fundamentally different approach.

3.4 The Machine Learning Revolution (1990s-2000s)

3.4.1 From Rules to Data

The 1990s brought a paradigm shift: instead of encoding expert rules manually, why not let algorithms learn patterns directly from data?

Machine learning—particularly supervised learning—offered a solution. Show an algorithm thousands of examples (X-rays labeled normal vs. abnormal, patient data labeled survived vs. died), and it learns to recognize patterns without explicit rule programming.
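
To make the contrast with hand-coded rules concrete, the short sketch below trains a support vector machine on synthetic labeled data with scikit-learn. Every detail here (the fabricated dataset, the feature count, the train/test split) is an arbitrary stand-in for illustration, not a recipe from any real clinical study.

# Supervised learning in miniature: learn a classifier from labeled examples
# rather than hand-coded rules. Synthetic data stands in for real clinical features.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# 1,000 fictitious "patients", 20 numeric features, binary outcome label
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = SVC(probability=True).fit(X_train, y_train)   # learn patterns from data
risk = model.predict_proba(X_test)[:, 1]              # predicted risk per patient
print(f"Held-out AUC-ROC: {roc_auc_score(y_test, risk):.2f}")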

Key developments enabling this shift:

  • Increased computational power: Moore’s Law made previously impractical computations feasible
  • Digitization of medical data: Picture Archiving and Communication Systems (PACS), early electronic health records
  • Algorithmic advances: Support Vector Machines (SVMs), decision trees, ensemble methods

3.4.2 Computer-Aided Detection (CAD) in Radiology

The first commercially successful medical AI applications emerged in radiology:

Computer-Aided Detection (CAD) for mammography:

  • FDA began approving CAD systems in the late 1990s
  • Designed to augment radiologist interpretation by flagging suspicious regions
  • Became widely adopted in breast cancer screening

Initial promise: Retrospective studies suggested CAD could reduce false negatives (missed cancers).

Reality check: Prospective studies showed mixed results (Lehman et al. 2015). CAD increased recall rates (more women called back for additional imaging) without consistently improving cancer detection rates. Some studies suggested CAD reduced radiologist specificity without improving sensitivity.
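
The arithmetic behind that finding is worth seeing once. The sketch below uses entirely hypothetical screening numbers (not drawn from Lehman et al. or any other study) to show how a CAD system can raise the recall rate while leaving sensitivity unchanged.

# Hypothetical screening numbers showing how adding CAD can increase recalls
# (women called back for more imaging) without finding more cancers.

def screening_summary(label, recalled, cancers_found, total_screened=10_000, total_cancers=50):
    recall_rate = recalled / total_screened      # fraction of screened women called back
    sensitivity = cancers_found / total_cancers  # fraction of true cancers detected
    print(f"{label}: recall rate {recall_rate:.1%}, sensitivity {sensitivity:.0%}")

screening_summary("Radiologist alone", recalled=950, cancers_found=42)
screening_summary("Radiologist + CAD", recalled=1150, cancers_found=42)
# More callbacks, same number of cancers found: higher recall rate, unchanged sensitivity.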

Current status: Second-generation AI systems using deep learning show more promise. The lesson: first-generation medical AI often underperforms expectations in real-world deployment.

3.4.3 FDA Begins Regulating Medical AI

The 1990s-2000s saw FDA establish regulatory frameworks for software-based medical devices:

  • 1990s: First CAD systems cleared through 510(k) pathway (substantial equivalence to existing devices)
  • 2000s: FDA establishes guidance for computer-assisted detection devices
  • Challenge: Rapid AI evolution outpaced regulatory frameworks designed for static medical devices

This regulatory evolution continues today, with FDA developing new frameworks for continuously learning AI systems.

3.4.4 Why This Era Mattered

The machine learning revolution established principles still guiding medical AI:

  • Data-driven approaches could achieve good performance without explicit rule encoding
  • Narrow, well-defined tasks (e.g., detecting lung nodules on CT) worked better than general diagnosis
  • Augmentation vs. replacement gained acceptance (radiologist + CAD performed better than either alone)
  • ⚠️ Prospective validation essential: Retrospective performance didn’t guarantee real-world utility
  • ⚠️ Workflow integration remained challenging despite better algorithms

The limitation: Traditional machine learning required careful feature engineering (hand-crafted measurements and patterns). Algorithms couldn’t learn complex representations directly from raw data.

The deep learning revolution would change that.

3.5 The Deep Learning Revolution (2010s): Imaging AI Comes of Age

3.5.1 AlexNet and the ImageNet Moment (2012)

In 2012, a deep convolutional neural network called AlexNet (Krizhevsky, Sutskever, and Hinton 2012) won the ImageNet visual recognition challenge by a massive margin, cutting the top-5 error rate from roughly 26% to 15% and far outpacing every competing approach.

This wasn’t just incremental progress—it was a paradigm shift. Deep learning could learn complex features directly from raw pixels without manual feature engineering. Suddenly, computer vision problems that had resisted decades of effort became tractable.

Implications for medical imaging:

Medical images (X-rays, CTs, MRIs, pathology slides, dermatology photos, retinal fundus images) are fundamentally visual pattern recognition problems. If deep learning could master general image classification, perhaps it could learn to detect pneumonia, identify cancers, or diagnose diabetic retinopathy.
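
For readers curious what “learning features directly from pixels” looks like in code, below is a minimal convolutional network sketch in PyTorch. The architecture, input size, and the random tensor standing in for an image batch are arbitrary illustrative choices, not the design of any published or FDA-cleared system.

# A tiny convolutional network for a binary imaging task ("finding" vs. "no finding").
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, 1)   # assumes 224x224 grayscale input

    def forward(self, x):
        x = self.features(x)                  # learned filters replace hand-engineered features
        x = torch.flatten(x, 1)
        return torch.sigmoid(self.classifier(x))   # probability of the finding

batch = torch.randn(4, 1, 224, 224)           # 4 fake grayscale "images"
print(TinyCNN()(batch).shape)                 # torch.Size([4, 1])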

3.5.2 Medical Imaging AI Explosion

The mid-2010s saw an explosion of medical imaging AI research:

Papers proliferated. Venture capital flooded into medical AI startups. Headlines proclaimed “AI Will Replace Radiologists.”

3.5.3 The First FDA-Cleared Autonomous AI (2018): IDx-DR

In April 2018, FDA granted the first authorization for an autonomous AI diagnostic system: IDx-DR for diabetic retinopathy screening (Abràmoff et al. 2018).

Why IDx-DR succeeded where MYCIN failed:

  • Narrow, well-defined task: Detect referable diabetic retinopathy (yes/no decision)
  • Clear clinical need: Primary care physicians need retinal screening but lack ophthalmology expertise
  • Prospective validation: 900-patient clinical trial in primary care settings (not just retrospective analysis)
  • Autonomous operation: Primary care staff operate the system without specialist interpretation
  • Regulatory clarity: FDA had developed frameworks for AI-based medical devices
  • Reimbursement: CPT codes established for AI-assisted diabetic retinopathy screening

Clinical bottom line: IDx-DR demonstrates the formula for successful medical AI deployment:

  1. Well-defined, high-value clinical problem
  2. Prospective validation in real-world settings
  3. Clear regulatory pathway
  4. Workflow integration
  5. Reimbursement model
  6. Physician and patient acceptance

3.5.4 The “AI Will Replace Radiologists” Debate

Geoffrey Hinton (deep learning pioneer) famously claimed in 2016: “It’s quite obvious that we should stop training radiologists.”

This sparked fierce debate. Would AI replace radiologists? Should medical students avoid imaging specialties?

What actually happened:

  • ❌ AI didn’t replace radiologists
  • ✅ AI began augmenting radiologist workflow
  • ✅ Radiologists increasingly use AI as assistive tools
  • ✅ AI handles some straightforward screening tasks
  • ✅ Complex cases still require radiologist expertise
  • ✅ New roles emerged: radiologists curating datasets, validating AI, interpreting edge cases

The lesson for physicians:

Dire predictions about AI replacing doctors are consistently wrong. What happens instead: AI augments clinical capabilities, handles routine tasks, and creates new workflows requiring physician oversight.

Fear replacement less. Focus on effective integration more.

3.5.5 Deep Learning’s Limitations in Medicine

Despite revolutionary capabilities, deep learning revealed serious limitations:

Black-box problem: Neural networks can’t explain why they make predictions. A radiologist can articulate reasoning (“spiculated mass in the upper lobe with associated lymphadenopathy suggests malignancy”). Deep learning models output probabilities without interpretable justification.

Data hunger: Deep learning requires massive labeled datasets (thousands to millions of examples). Curating these datasets requires enormous physician time.

Brittleness to distribution shift: Models trained on data from Hospital A often perform poorly at Hospital B due to different imaging equipment, patient populations, disease prevalence, or documentation practices.
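
The practical safeguard is external validation: evaluate the same frozen model on data from another site and compare. Below is a minimal sketch with scikit-learn; both “hospitals” are synthetic datasets, so the specific numbers mean nothing beyond illustrating the gap one looks for.

# External-validation check: one frozen model, evaluated on held-out data from the
# training site ("Hospital A") and on data from another site ("Hospital B").
# Both datasets are synthetic; B's different generating process is a crude stand-in
# for different scanners, populations, prevalence, and documentation practices.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X_a, y_a = make_classification(n_samples=2000, n_features=10, random_state=1)
X_b, y_b = make_classification(n_samples=1000, n_features=10, random_state=7)

model = LogisticRegression(max_iter=1000).fit(X_a[:1500], y_a[:1500])

auc_internal = roc_auc_score(y_a[1500:], model.predict_proba(X_a[1500:])[:, 1])
auc_external = roc_auc_score(y_b, model.predict_proba(X_b)[:, 1])
print(f"Hospital A (internal held-out) AUC: {auc_internal:.2f}")
print(f"Hospital B (external site) AUC:     {auc_external:.2f}")
# A large gap between the two numbers is the signature of dataset shift.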

Adversarial vulnerability: Tiny, imperceptible changes to input images can fool deep learning models completely—a major patient safety concern (Finlayson et al. 2019).

Fairness and bias: AI systems inherit biases from training data. If training data under-represents certain populations, model performance may be worse for those groups (Obermeyer et al. 2019).

These limitations drive ongoing research and FDA regulatory evolution.

3.6 IBM Watson for Oncology: The Highest-Profile Failure

No discussion of medical AI history would be complete without IBM Watson for Oncology—arguably the most expensive, highest-profile medical AI failure to date.

3.6.1 The Promise

In 2011, IBM’s Watson defeated human champions on Jeopardy!, demonstrating impressive natural language processing capabilities. IBM pivoted to healthcare, promising Watson would revolutionize medicine by:

  • Analyzing millions of medical journal articles
  • Synthesizing complex patient data from EHRs
  • Recommending personalized, evidence-based cancer treatments
  • Augmenting oncologist decision-making

Major cancer centers worldwide partnered with IBM: MD Anderson Cancer Center, Memorial Sloan Kettering, hospitals across India, China, and other countries. IBM invested billions. Expectations soared.

3.6.2 The Reality

Watson for Oncology failed spectacularly (Ross and Swetlitz 2018):

Unsafe recommendations: Internal documents revealed Watson recommended treatments that would have harmed patients. In one case, Watson suggested administering chemotherapy to a patient with severe bleeding—a contraindicated, potentially fatal recommendation (Ross and Swetlitz 2018).

Poor real-world performance: Oncologists found recommendations often didn’t match current evidence-based guidelines or were inappropriate for specific clinical contexts (Liu et al. 2020).

Geographic inappropriateness: Jupiter Hospital (India) reported Watson recommended treatments unavailable in India, ignoring local resource constraints and formulary restrictions (Liu et al. 2020).

Training data issues: Watson was trained primarily on synthetic cases created by Memorial Sloan Kettering physicians, not real-world patient data. It learned institutional preferences, not universal evidence (Strickland 2019).

MD Anderson debacle: MD Anderson Cancer Center spent $62 million on Watson implementation before canceling the project in 2016, concluding it wasn’t ready for clinical use (Strickland 2019).

Widespread abandonment: By 2019, multiple health systems had stopped using Watson. IBM sold Watson Health assets in 2021.

3.6.3 What Went Wrong: Lessons for Physicians

Warning: Critical Lessons from Watson’s Failure
  1. Marketing ≠ Clinical Validation: Jeopardy! success doesn’t translate to clinical competence. Demand prospective clinical trials, not just impressive demonstrations in controlled settings.

  2. Black-box algorithms are dangerous: Oncologists couldn’t understand Watson’s reasoning or override incorrect recommendations effectively.

  3. Training data matters immensely: Synthetic cases created by one institution don’t represent the diversity of real clinical practice.

  4. Physician involvement is essential: Watson was developed primarily by engineers, with insufficient oncologist input during design and training.

  5. Geographic and institutional context matters: Treatment recommendations must account for local resources, formularies, patient populations, and practice patterns.

  6. Financial incentives can override evidence: IBM’s business model prioritized deployment over clinical validation.

  7. Transparency and reporting: Watson’s failures remained largely hidden until investigative journalists uncovered them. Medical AI needs transparent reporting of failures.

The clinical bottom line:

Watson’s failure demonstrates why physicians must evaluate AI systems with the same rigor applied to new pharmaceuticals: demand prospective clinical trials, transparent reporting, independent validation, and post-market surveillance.

Don’t accept vendor claims without evidence.

3.7 The Foundation Model Era (2020-Present)

3.7.1 Large Language Models Arrive

November 2022 brought ChatGPT, introducing millions of people—including physicians—to large language models (LLMs) (Brown et al. 2020). These “foundation models” could:

  • Answer medical questions with apparently sophisticated reasoning
  • Draft clinical notes
  • Explain complex concepts
  • Translate between languages
  • Write code
  • Summarize literature

Unlike narrow AI systems designed for specific tasks, LLMs demonstrated general capabilities across diverse domains.

3.7.2 Medical-Specific Foundation Models

Recognizing both promise and risks of general LLMs in medicine, researchers developed medical-specific models:

Med-PaLM and Med-PaLM 2 (Google, 2022-2023):

  • Fine-tuned on medical text
  • Achieved passing scores on USMLE-style questions (Singhal et al. 2023)
  • Improved clinical accuracy compared to general LLMs

GPT-4 in medicine (OpenAI, 2023):

  • Demonstrated strong performance on medical licensing exams
  • Used by physicians for literature synthesis, clinical reasoning support, and patient education

Challenges remain:

  • Hallucinations: LLMs confidently generate plausible but incorrect information
  • Bias: They inherit biases from training data
  • Liability: Unclear legal framework for LLM-assisted clinical decisions
  • Privacy: Patient data security concerns
  • Validation: How to validate general-purpose models across diverse clinical scenarios?

3.7.3 Current State: Promise and Uncertainty

As of 2025, medical AI stands at an inflection point:

Undeniable progress:

  • FDA has cleared 500+ AI-based medical devices
  • Radiology, pathology, dermatology, and ophthalmology see routine AI use
  • Clinical decision support systems are deployed in major EHR systems
  • Foundation models demonstrate unprecedented versatility

Persistent challenges:

  • Most AI systems remain narrow, fragile, and context-dependent
  • Integration into clinical workflow remains difficult
  • Liability frameworks lag technological capabilities
  • Bias and fairness concerns are well-documented but incompletely solved
  • Physician trust varies widely by specialty, generation, and prior experience

The fundamental question:

Are today’s foundation models genuinely different from previous AI waves, or are we in another hype cycle that will end in disillusionment?

Evidence suggests: Both. Foundation models represent real algorithmic breakthroughs. But transforming clinical practice requires solving the same deployment challenges that defeated MYCIN and Watson: integration, liability, trust, validation, and proving clinical benefit beyond technical performance metrics.

3.8 Key Lessons from History

Seven decades of medical AI teach clear lessons:

Important: Historical Lessons for Physicians
  1. Technical performance ≠ clinical adoption (MYCIN)
  2. Marketing hype ≠ clinical validity (Watson)
  3. Retrospective validation ≠ prospective utility (Google Flu Trends)
  4. Narrow, well-defined tasks work best (IDx-DR)
  5. Augmentation > replacement (CAD in radiology)
  6. Physician involvement is essential (Watson’s failure)
  7. Regulatory frameworks evolve slowly (ongoing FDA challenges)
  8. Liability concerns shape adoption (MYCIN’s legal uncertainty)
  9. Workflow integration is harder than algorithmic development (expert systems)
  10. Patient safety must remain paramount (Watson’s unsafe recommendations)

The path forward:

Learn from failures. Demand evidence. Integrate thoughtfully. Maintain physician oversight. Prioritize patients.

History doesn’t repeat, but it rhymes. The physicians who understand AI’s history will navigate its future most effectively.


3.9 References

Abràmoff, Michael D., Philip T. Lavin, Michele Birch, Nilay Shah, and James C. Folk. 2018. “Pivotal Trial of an Autonomous AI-Based Diagnostic System for Detection of Diabetic Retinopathy in Primary Care Offices.” Npj Digital Medicine 1 (1): 1–8. https://doi.org/10.1038/s41746-018-0040-6.
Arbabshirani, Mohammad R., Brandon K. Fornwalt, Gregory J. Mongelluzzo, Jonathan D. Suever, Benjamin D. Geise, Aalpen A. Patel, and Gregory J. Moore. 2018. “Advanced Machine Learning in Action: Identification of Intracranial Hemorrhage on Computed Tomography Scans of the Head with Clinical Workflow Integration.” Npj Digital Medicine 1: 1–7. https://doi.org/10.1038/s41746-017-0015-z.
Attia, Zachi I., Peter A. Noseworthy, Francisco Lopez-Jimenez, Samuel J. Asirvatham, Abhishek J. Deshmukh, Bernard J. Gersh, Rickey E. Carter, et al. 2019. “An Artificial Intelligence-Enabled ECG Algorithm for the Identification of Patients with Atrial Fibrillation During Sinus Rhythm: A Retrospective Analysis of Outcome Prediction.” The Lancet 394 (10201): 861–67. https://doi.org/10.1016/S0140-6736(19)31721-0.
Barnett, G. Octo, James J. Cimino, Jon A. Hupp, and Edward P. Hoffer. 1987. “DXplain: An Evolving Diagnostic Decision-Support System.” JAMA 258 (1): 67–74. https://doi.org/10.1001/jama.258.1.67.
Bates, David W., Gilad J. Kuperman, Samuel Wang, Tejal Gandhi, Anne Kittler, Lynn Volk, Christiana Spurr, Ramin Khorasani, Milenko Tanasijevic, and Blackford Middleton. 2003. “Ten Commandments for Effective Clinical Decision Support: Making the Practice of Evidence-Based Medicine a Reality.” Journal of the American Medical Informatics Association 10 (6): 523–30. https://doi.org/10.1197/jamia.M1370.
Beam, Andrew L., Arjun K. Manrai, and Marzyeh Ghassemi. 2020. “Challenges to the Reproducibility of Machine Learning Models in Health Care.” JAMA 323 (4): 305–6. https://doi.org/10.1001/jama.2019.20866.
Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems 33: 1877–1901.
Campanella, Gabriele, Matthew G. Hanna, Luke Geneslaw, Allen Miraflor, Vitor Werneck Krauss Silva, Klaus J. Busam, Edi Brogi, Victor E. Reuter, David S. Klimstra, and Thomas J. Fuchs. 2019. “Clinical-Grade Computational Pathology Using Weakly Supervised Deep Learning on Whole Slide Images.” Nature Medicine 25 (8): 1301–9. https://doi.org/10.1038/s41591-019-0508-1.
Castro, E., J. S. Cardoso, and J. C. Pereira. 2020. “Validation of Artificial Intelligence for Prostate MRI Interpretation: A Multi-Center Study.” European Radiology 30: 6343–50. https://doi.org/10.1007/s00330-020-07035-2.
Char, Danton S., Nigam H. Shah, and David Magnus. 2018. “Implementing Machine Learning in Health Care: Addressing Ethical Challenges.” New England Journal of Medicine 378 (11): 981–83. https://doi.org/10.1056/NEJMp1714229.
Collins, Gary S., Karel G. M. Moons, Paula Dhiman, Richard D. Riley, Andrew L. Beam, Ben Van Calster, Marzyeh Ghassemi, et al. 2024. “TRIPOD+AI Statement: Updated Guidance for Reporting Clinical Prediction Models That Use Regression or Machine Learning Methods.” BMJ 385: e078378. https://doi.org/10.1136/bmj-2023-078378.
Daneshjou, Roxana, Kailas Vodrahalli, Roberto A. Novoa, Melissa Jenkins, Weixin Liang, Veronica Rotemberg, Justin Ko, et al. 2022. “Disparities in Dermatology AI Performance on a Diverse, Curated Clinical Image Set.” Science Advances 8 (32): eabq6147. https://doi.org/10.1126/sciadv.abq6147.
DeGrave, Alex J., Joseph D. Janizek, and Su-In Lee. 2021. “AI for Radiographic COVID-19 Detection Selects Shortcuts over Signal.” Nature Machine Intelligence 3: 610–19. https://doi.org/10.1038/s42256-021-00338-7.
Esteva, Andre, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M. Swetter, Helen M. Blau, and Sebastian Thrun. 2017. “Dermatologist-Level Classification of Skin Cancer with Deep Neural Networks.” Nature 542 (7639): 115–18. https://doi.org/10.1038/nature21056.
Finlayson, Samuel G., John D. Bowers, Joichi Ito, Jonathan L. Zittrain, Andrew L. Beam, and Isaac S. Kohane. 2019. “Adversarial Attacks on Medical Machine Learning.” Science 363 (6433): 1287–89. https://doi.org/10.1126/science.aaw4399.
Finlayson, Samuel G., Adarsh Subbaswamy, Karandeep Singh, John Bowers, Annabel Kupke, Jonathan Zittrain, Isaac S. Kohane, and Suchi Saria. 2021. “The Clinician and Dataset Shift in Artificial Intelligence.” New England Journal of Medicine 385 (3): 283–86. https://doi.org/10.1056/NEJMc2104626.
Fitzpatrick, Kathleen Kara, Alison Darcy, and Molly Vierhile. 2017. “Delivering Cognitive Behavior Therapy to Young Adults with Symptoms of Depression and Anxiety Using a Fully Automated Conversational Agent (Woebot): A Randomized Controlled Trial.” JMIR Mental Health 4 (2): e19. https://doi.org/10.2196/mental.7785.
Freeman, Kathleen, Jacqueline Dinnes, Naomi Chuchu, Yemisi Takwoingi, Susan E. Bayliss, Rubeta N. Matin, Abha Jain, Fiona M. Walter, Hywel C. Williams, and Jonathan J. Deeks. 2020. “Algorithm Based Smartphone Apps to Assess Risk of Skin Cancer in Adults: Systematic Review of Diagnostic Accuracy Studies.” BMJ 368. https://doi.org/10.1136/bmj.m127.
Ginsberg, Jeremy, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski, and Larry Brilliant. 2009. “Detecting Influenza Epidemics Using Search Engine Query Data.” Nature 457 (7232): 1012–14. https://doi.org/10.1038/nature07634.
Gulshan, Varun, Lily Peng, Marc Coram, Martin C. Stumpe, Derek Wu, Arunachalam Narayanaswamy, Subhashini Venugopalan, et al. 2016. “Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs.” JAMA 316 (22): 2402–10. https://doi.org/10.1001/jama.2016.17216.
Hashimoto, Daniel A., Guy Rosman, Daniela Rus, and Ozanan R. Meireles. 2018. “Artificial Intelligence in Surgery: Promises and Perils.” Annals of Surgery 268 (1): 70–76. https://doi.org/10.1097/SLA.0000000000002693.
He, Jiang, Sally L. Baxter, Jie Xu, Jiming Xu, Xingtao Zhou, and Kang Zhang. 2019. “The Practical Implementation of Artificial Intelligence Technologies in Medicine.” Nature Medicine 25 (1): 30–36. https://doi.org/10.1038/s41591-018-0307-0.
Hwang, E. J., S. Park, K. N. Jin, J. I. Kim, S. Y. Choi, J. H. Lee, J. M. Goo, et al. 2021. “Deep Learning for Chest Radiograph Diagnosis: A Retrospective Comparison of Convolutional Neural Networks.” Radiology 301 (2): 455–65. https://doi.org/10.1148/radiol.2021203115.
Isaac, Thomas, Jie Zheng, and Ashish Jha. 2012. “Overcoming Barriers to Using Evidence-Based Medicine in Primary Care: The Role of Technology.” JAMA 308 (18): 1883–84. https://doi.org/10.1001/jama.2012.13659.
Kansagara, Devan, Honora Englander, Amanda Salanitro, David Kagen, Cecelia Theobald, Michele Freeman, and Sunil Kripalani. 2011. “Risk Prediction Models for Hospital Readmission: A Systematic Review.” JAMA 306 (15): 1688–98. https://doi.org/10.1001/jama.2011.1515.
Kelly, Christopher J., Alan Karthikesalingam, Mustafa Suleyman, Greg Corrado, and Dominic King. 2019. “Key Challenges for Delivering Clinical Impact with Artificial Intelligence.” BMC Medicine 17 (1): 1–9. https://doi.org/10.1186/s12916-019-1426-2.
Kim, R. Y., C. Glick, and H. Kim. 2021. “Systematic Review of Artificial Intelligence for Detecting Pulmonary Diseases on Chest Radiographs.” Journal of Thoracic Disease 13 (12): 6861–70. https://doi.org/10.21037/jtd-21-1435.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2012. “ImageNet Classification with Deep Convolutional Neural Networks.” Advances in Neural Information Processing Systems 25: 1097–1105. https://doi.org/10.1145/3065386.
Lang, K., S. Hofvind, A. Rodriguez-Ruiz, and I. Andersson. 2023. “Impact of Artificial Intelligence-Based Detection System on the Workload of Screening Mammography.” Radiology 307 (2): e222097. https://doi.org/10.1148/radiol.222097.
Larson, D. B., H. Harvey, D. L. Rubin, N. Irani, J. R. Tse, and C. P. Langlotz. 2022. “Implementation and Evaluation of AI-Supported Chest Radiography Triage in the Emergency Department.” Radiology: Artificial Intelligence 4 (5): e210283. https://doi.org/10.1148/ryai.210283.
Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (6176): 1203–5. https://doi.org/10.1126/science.1248506.
Lehman, Constance D., Robert D. Wellman, Diana S. M. Buist, Karla Kerlikowske, Anna N. A. Tosteson, and Diana L. Miglioretti. 2015. “Diagnostic Accuracy of Digital Screening Mammography with and Without Computer-Aided Detection.” JAMA Internal Medicine 175 (11): 1828–37. https://doi.org/10.1001/jamainternmed.2015.5231.
Lindsay, Robert K., Bruce G. Buchanan, Edward A. Feigenbaum, and Joshua Lederberg. 1993. “DENDRAL: A Case Study of the First Expert System for Scientific Hypothesis Formation.” Artificial Intelligence 61 (2): 209–61. https://doi.org/10.1016/0004-3702(93)90068-M.
Liu, Xiaoxuan, Samantha Cruz Rivera, David Moher, Melanie J. Calvert, Alastair K. Denniston, Spirit-Ai, Consort-Ai Working Group, et al. 2020. “Reporting Guideline for Clinical Trial Reports of Artificial Intelligence in Healthcare: The CONSORT-AI Extension.” BMJ 370. https://doi.org/10.1136/bmj.m3164.
Lotter, William, Abdul Rahman Diab, Bryan Haslam, Jiye G. Kim, Giorgia Grisot, Eric Wu, Kevin Wu, et al. 2021. “Robust Breast Cancer Detection in Mammography and Digital Breast Tomosynthesis Using an Annotation-Efficient Deep Learning Approach.” Nature Medicine 27: 244–49. https://doi.org/10.1038/s41591-020-01174-9.
McCarthy, John, Marvin L. Minsky, Nathaniel Rochester, and Claude E. Shannon. 2006. “A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence, August 31, 1955.” AI Magazine 27 (4): 12–14. https://doi.org/10.1609/aimag.v27i4.1904.
McKinney, S. M., M. Sieniek, and V. Godbole. 2020. “Artificial Intelligence in Breast Cancer Diagnosis: A Systematic Review and Meta-Analysis.” NPJ Digital Medicine 3: 1–7. https://doi.org/10.1038/s41746-020-0346-5.
McLellan, Andrew M., Gabriel M. Rodrigues, Chaklam Silpasuwanchai, Bijoy K. Menon, Andrew M. Demchuk, Mayank Goyal, and Michael D. Hill. 2022. “Reducing Time to Endovascular Reperfusion in Acute Ischemic Stroke Through AI-Enabled Workflow: The DIRECT Study.” Stroke 53 (8): 2656–63. https://doi.org/10.1161/STROKEAHA.121.038217.
Miller, Randolph A., Harry E. Pople, and Jack D. Myers. 1982. “INTERNIST-1, an Experimental Computer-Based Diagnostic Consultant for General Internal Medicine.” New England Journal of Medicine 307 (8): 468–76. https://doi.org/10.1056/NEJM198208193070803.
Nagendran, Myura, Yang Chen, Christopher A. Lovejoy, Anthony C. Gordon, Matthieu Komorowski, Hugh Harvey, Eric J. Topol, John P. A. Ioannidis, Gary S. Collins, and Mahiben Maruthappu. 2020. “Artificial Intelligence Versus Clinicians: Systematic Review of Design, Reporting Standards, and Claims of Deep Learning Studies.” BMJ 368. https://doi.org/10.1136/bmj.m689.
Nagpal, Kunal, Davis Foote, Yun Liu, Po-Hsuan Cameron Chen, Ellery Wulczyn, Fraser Tan, Niels Olson, et al. 2019. “Development and Validation of a Deep Learning Algorithm for Improving Gleason Scoring of Prostate Cancer.” Npj Digital Medicine 2 (1): 1–10. https://doi.org/10.1038/s41746-019-0112-2.
Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. 2019. “Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations.” Science 366 (6464): 447–53. https://doi.org/10.1126/science.aax2342.
Omboni, Stefano, Richard J. McManus, Hayden B. Bosworth, Lucy C. Chappell, Beverly B. Green, Kazuomi Kario, Adam G. Logan, et al. 2020. “Telemedicine and mHealth in the Management of Hypertension: Technologies, Applications and Clinical Evidence.” High Blood Pressure & Cardiovascular Prevention 27: 347–65. https://doi.org/10.1007/s40292-020-00396-1.
Pantanowitz, Liron, Gabriela M. Quiroga-Garza, Lisanne Bien, Ronen Heled, Daphna Laifenfeld, Chaim Linhart, Judith Sandbank, et al. 2020. “An Artificial Intelligence Algorithm for Prostate Cancer Diagnosis in Whole Slide Images of Core Needle Biopsies: A Blinded Clinical Validation and Deployment Study.” The Lancet Digital Health 2 (8): e407–16. https://doi.org/10.1016/S2589-7500(20)30159-X.
Perez, Marco V., Kenneth W. Mahaffey, Haley Hedlin, John S. Rumsfeld, Ariadna Garcia, Todd Ferris, Vidhya Balasubramanian, et al. 2019. “Large-Scale Assessment of a Smartwatch to Identify Atrial Fibrillation.” New England Journal of Medicine 381 (20): 1909–17. https://doi.org/10.1056/NEJMoa1901183.
Poplin, Ryan, Avinash V. Varadarajan, Katy Blumer, Yun Liu, Michael V. McConnell, Greg S. Corrado, Lily Peng, and Dale R. Webster. 2018. “Prediction of Cardiovascular Risk Factors from Retinal Fundus Photographs via Deep Learning.” Nature Biomedical Engineering 2: 158–64. https://doi.org/10.1038/s41551-018-0195-0.
Price, W. Nicholson, and I. Glenn Cohen. 2019. “Privacy in the Age of Medical Big Data.” Nature Medicine 25 (1): 37–43. https://doi.org/10.1038/s41591-018-0272-7.
Rajkomar, Alvin, Jeffrey Dean, and Isaac Kohane. 2019. “Machine Learning in Medicine.” New England Journal of Medicine 380 (14): 1347–58. https://doi.org/10.1056/NEJMra1814259.
Rajpurkar, Pranav, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, et al. 2017. “CheXNet: Radiologist-Level Pneumonia Detection on Chest x-Rays with Deep Learning.” arXiv Preprint arXiv:1711.05225.
Rava, R. A., S. E. Seymour, M. E. LaQue, B. A. Peterson, K. V. Snyder, M. Mokin, and M. Waqas. 2021. “A Systematic Review and Meta-Analysis of AI in Detecting Intracranial Hemorrhage.” Neurosurgical Focus 51 (5): E5. https://doi.org/10.3171/2021.8.FOCUS21363.
Reddy, Sandeep, Sonia Allan, Simon Coghlan, and Paul Cooper. 2020. “A Governance Model for the Application of AI in Health Care.” Journal of the American Medical Informatics Association 27 (3): 491–97. https://doi.org/10.1093/jamia/ocz192.
Ross, Casey, and Ike Swetlitz. 2018. “Artificial Intelligence in Healthcare: IBM Watson and Oncology.” STAT News.
Rudin, Cynthia. 2019. “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead.” Nature Machine Intelligence 1: 206–15. https://doi.org/10.1038/s42256-019-0048-x.
Schork, Nicholas J. 2019. “Artificial Intelligence and Personalized Medicine.” Cancer Treatment and Research 178: 265–83. https://doi.org/10.1007/978-3-030-16391-4_11.
Semigran, Hannah L., Jeffrey A. Linder, Courtney Gidengil, and Ateev Mehrotra. 2015. “Evaluation of Symptom Checkers for Self Diagnosis and Triage: Audit Study.” BMJ 351: h3480. https://doi.org/10.1136/bmj.h3480.
Sendak, Mark P., William Ratliff, David Sarro, Elizabeth Alderton, Joseph Futoma, Michael Gao, Marshall Nichols, et al. 2020. “Real-World Integration of a Sepsis Deep Learning Technology into Routine Clinical Care: Implementation Study.” JMIR Medical Informatics 8 (7): e15182. https://doi.org/10.2196/15182.
Shortliffe, Edward H., Randall Davis, Scott G. Axline, Bruce G. Buchanan, Cordell C. Green, and Stanley N. Cohen. 1975. “Computer-Based Consultations in Clinical Therapeutics: Explanation and Rule Acquisition Capabilities of the MYCIN System.” Computers and Biomedical Research 8 (4): 303–20. https://doi.org/10.1016/0010-4809(75)90009-9.
Singhal, Karan, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, et al. 2023. “Large Language Models Encode Clinical Knowledge.” Nature 620 (7972): 172–80. https://doi.org/10.1038/s41586-023-06291-2.
Stone, E. G., S. C. Morton, M. E. Hulscher, M. A. Maglione, E. A. Roth, J. M. Grimshaw, B. S. Mittman, L. V. Rubenstein, L. Z. Rubenstein, and P. G. Shekelle. 2002. “Implementation of Computerized Decision Support for Health Maintenance in Primary Care.” Journal of the American Medical Informatics Association 9 (4): 395–407. https://doi.org/10.1197/jamia.M1056.
Strickland, Eliza. 2019. “IBM Watson, Heal Thyself: How IBM Overpromised and Underdelivered on AI Health Care.” IEEE Spectrum.
Topol, Eric J. 2019. “High-Performance Medicine: The Convergence of Human and Artificial Intelligence.” Nature Medicine 25 (1): 44–56. https://doi.org/10.1038/s41591-018-0300-7.
Turing, A. M. 1950. “Computing Machinery and Intelligence.” Mind 59 (236): 433–60. https://doi.org/10.1093/mind/LIX.236.433.
Wong, Andrew, Erkin Otles, John P. Donnelly, Andrew Krumm, Jeffrey McCullough, Olivia DeTroyer-Cooley, Justin Pestrue, et al. 2021. “External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients.” JAMA Internal Medicine 181 (8): 1065–70. https://doi.org/10.1001/jamainternmed.2021.2626.
Yu, Victor L., Bruce G. Buchanan, Edward H. Shortliffe, Sharon M. Wraith, Randall Davis, A. Carlisle Scott, and Stanley N. Cohen. 1979. “Antimicrobial Selection by a Computer: A Blinded Evaluation by Infectious Disease Experts.” JAMA 242 (12): 1279–82. https://doi.org/10.1001/jama.1979.03300120033018.
Zech, John R., Marcus A. Badgeley, Manway Liu, Anthony B. Costa, Joseph J. Titano, and Eric Karl Oermann. 2018. “Variable Generalization Performance of a Deep Learning Model to Detect Pneumonia in Chest Radiographs: A Cross-Sectional Study.” PLOS Medicine 15 (11): e1002683. https://doi.org/10.1371/journal.pmed.1002683.