AI in Medicine: A Brief History

In 1979, a blinded evaluation found that Stanford’s MYCIN expert system matched infectious disease specialists at treating bacterial blood infections, prescribing appropriate antimicrobial therapy in 90.9% of test cases. It was never used to treat a single patient. This pattern, technical excellence followed by clinical failure, has repeated for 70 years. Understanding why MYCIN, IBM Watson, and Google Flu Trends failed helps evaluate whether foundation models will be different.

Learning Objectives

After reading this chapter, you will be able to:

  • Distinguish genuine clinical breakthroughs from recurring hype cycles (MYCIN, IBM Watson, diagnostic AI)
  • Recognize why technical excellence doesn’t guarantee clinical adoption
  • Identify patterns separating successful medical AI deployments from research prototypes
  • Understand FDA regulation evolution and current frameworks
  • Evaluate whether today’s foundation models represent a paradigm shift
  • Apply historical lessons about clinical integration, medical liability, and physician trust

The Clinical Context: AI in medicine has experienced roughly 70 years of boom-bust cycles, from the symbolic systems of the 1950s and the expert systems of the 1970s to today’s foundation models, with each wave promising to reshape medicine but delivering narrow, often fragile applications. Understanding this history is essential for distinguishing genuine breakthroughs from marketing hype and avoiding expensive implementation failures.

The Cautionary Tales:

  • MYCIN (1972-1979): Stanford’s expert system matched infectious disease specialists in diagnosing bacterial blood infections (Shortliffe, 1976). Evaluation showed 65% acceptability and 90.9% accuracy in antimicrobial selection (Yu et al., 1979). Yet MYCIN was never used clinically, killed by liability concerns, FDA regulatory uncertainty, the lack of any hospital computing infrastructure to integrate with, and physician trust barriers. Key lesson: Technical excellence ≠ clinical adoption.

  • IBM Watson for Oncology (2012-2018): After defeating Jeopardy! champions, Watson promised to revolutionize cancer treatment by analyzing medical literature and EHRs. Multiple hospitals worldwide adopted it. Reality: Watson produced unsafe and incorrect treatment recommendations (Ross & Swetlitz, 2019). MD Anderson cancelled a $62 million implementation (Strickland, 2019). Jupiter Hospital (India) found it recommended treatments unavailable in their country (Liu et al., 2021). Quietly discontinued. Key lesson: Marketing hype doesn’t equal clinical validity.

  • Google Flu Trends (2008-2015): Published in Nature (Ginsberg et al., 2009), GFT used search queries to predict influenza activity 1-2 weeks ahead of CDC surveillance. After initial success and widespread adoption, GFT failed dramatically in the 2012-2013 flu season, overestimating peak activity by 140% (Lazer et al., 2014). Why? Algorithm updates, changing search behavior, overfitting to spurious correlations. Discontinued 2015. Key lesson: Black-box algorithms fail when they can’t adapt to distribution shift.

The Pattern That Repeats:

  1. Breakthrough technology demonstrates impressive capabilities in controlled settings
  2. Overpromising: “AI will transform medicine / replace radiologists / eliminate diagnostic errors”
  3. Pilot studies succeed with carefully curated datasets
  4. Reality: Deployment reveals liability concerns, workflow disruption, integration challenges, trust gaps
  5. Disillusionment when technology falls short of marketing claims
  6. Eventual integration into narrow, well-defined applications (if evidence supports it)

What Actually Works in Clinical Medicine:

Diabetic retinopathy screening (IDx-DR, FDA-cleared 2018) (Abràmoff et al., 2018):
  • Specific, well-defined task with clear ground truth
  • Prospective validation in real clinical settings
  • Autonomous operation without physician interpretation
  • Currently deployed in primary care and endocrinology clinics

Computer-aided detection (CAD) in mammography (Lehman et al., 2019):
  • Augments radiologist interpretation, doesn’t replace it
  • Aims to reduce false negatives in screening
  • Integrated into radiology workflow
  • Evidence on improved cancer detection is mixed; second-generation deep learning systems show more promise (discussed further below)

Sepsis prediction alerts (Epic Sepsis Model, others) (Wong et al., 2021):
  • High-stakes problem with clear intervention pathway
  • Alerts clinicians to deteriorating patients
  • But: High false positive rates remain problematic
  • Ongoing debate about clinical benefit vs. alert fatigue

AI-assisted pathology (Paige Prostate, FDA-cleared 2021) (Pantanowitz et al., 2020):
  • Flags suspicious regions for pathologist review
  • Reduces interpretation time
  • Maintains human-in-the-loop oversight

What Doesn’t Work (Documented Failures):

  • IBM Watson for Oncology: Unsafe recommendations, poor real-world performance (Ross & Swetlitz, 2019)
  • Epic Sepsis Model at Michigan Medicine: 33% sensitivity (missed 67% of sepsis cases) and 12% PPV (88% of alerts were false positives) (Wong et al., 2021)
  • Skin cancer apps lacking validation: Many show poor performance outside training distributions (Freeman et al., 2020)
  • Autonomous diagnostic systems without human oversight: Liability unclear, physician resistance high
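The Epic sepsis figures above illustrate a general point worth working through once: positive predictive value depends heavily on prevalence. The short calculation below uses made-up numbers, not the actual study counts, to show how a model with better-looking sensitivity and specificity than the figures above can still generate mostly false alarms when the condition is uncommon.

# Illustrative arithmetic with made-up numbers (not any specific product's data):
# even reasonable-looking sensitivity and specificity yield a low PPV when the
# condition is uncommon, which is what drives alert fatigue.

def confusion_metrics(prevalence: float, sensitivity: float,
                      specificity: float, n: int = 10_000) -> dict:
    diseased = prevalence * n
    healthy = n - diseased
    tp = sensitivity * diseased          # true positives (alerts on real cases)
    fn = diseased - tp                   # missed cases
    fp = (1 - specificity) * healthy     # false alarms
    tn = healthy - fp
    ppv = tp / (tp + fp)
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn, "PPV": round(ppv, 3)}

# Hypothetical scenario: 5% prevalence, 70% sensitivity, 85% specificity.
print(confusion_metrics(prevalence=0.05, sensitivity=0.70, specificity=0.85))
# PPV comes out near 0.20: roughly four of every five alerts are false positives.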

The Critical Insight for Physicians:

Technical metrics (accuracy, AUC-ROC, sensitivity, specificity) do not predict clinical utility. The hardest problems in deploying medical AI are:

  • Medical liability: Who’s responsible when AI fails?
  • FDA regulation: Which devices require clearance? How much evidence is enough?
  • Clinical workflow integration: Does this fit how we actually practice?
  • Physician trust: Will clinicians follow AI recommendations?
  • Patient acceptance: Are patients comfortable with algorithmic decisions?

Why This Time Might Be Different:

  • Data availability: EHRs, genomics, imaging archives, wearables, multi-omic datasets
  • Computational power: Cloud computing, GPUs, TPUs make complex models feasible
  • Algorithmic breakthroughs: Transfer learning, foundation models (GPT-4, Med-PaLM 2), few-shot learning
  • Regulatory maturity: FDA has frameworks for AI/ML-based medical devices
  • Clinical acceptance: Younger physicians trained alongside AI tools show higher adoption

Yet fundamental challenges persist: Explainability (black-box models), fairness (algorithmic bias), reliability (distribution shift), deployment barriers, and the irreducible complexity of clinical medicine.

The Clinical Bottom Line:

Be skeptical of vendor claims. Demand prospective clinical trials, not just retrospective validation. Prioritize patient safety over efficiency. Understand that you remain medically and legally responsible for clinical decisions, regardless of AI recommendations. Start with narrow, well-defined problems rather than general diagnostic systems. Center physician and patient perspectives in AI development and deployment.

History shows: Most medical AI projects fail. Learning why matters more than celebrating the rare successes.

Introduction

Artificial intelligence in medicine isn’t new. The field has experienced multiple waves of excitement and bitter disillusionment over seven decades. Each cycle promised to reshape clinical practice. Each fell short.

So why should physicians believe that this time is different?

Understanding AI’s history in medicine isn’t just academic curiosity. It’s essential for navigating today’s hype, identifying genuinely transformative applications, and avoiding expensive failures that harm patients or waste resources. The patterns repeat: breathless promises, pilot studies that look impressive, deployment challenges nobody anticipated, and eventual disillusionment when technology doesn’t match marketing.

But history also reveals what works. Successful medical AI applications share common traits: they solve specific, well-defined clinical problems; they augment rather than replace physician expertise; they integrate into existing workflows rather than demanding wholesale practice transformation; and most importantly, they have prospective clinical evidence demonstrating patient benefit.

This chapter traces AI’s journey through medicine from philosophical thought experiment to today’s foundation models, with focus on lessons for practicing physicians.

timeline
    title AI Evolution in Medicine: From Expert Systems to Foundation Models
    1950s-1960s : Birth of Medical AI
                : Turing Test (1950)
                : Dartmouth Conference (1956)
                : Early diagnosis systems
    1970s-1980s : Expert Systems Era
                : MYCIN (1972): Infectious disease
                : INTERNIST-I (1974): Internal medicine
                : First AI Winter (late 1980s)
    1990s-2000s : Machine Learning Era
                : Support Vector Machines
                : FDA regulates CAD systems
                : Evidence-based medicine integration
    2010s : Deep Learning Revolution
          : AlexNet (2012): Computer vision
          : FDA clears first deep learning device (2017)
          : Diabetic retinopathy AI (IDx-DR, 2018)
    2020s : Foundation Model Era
          : GPT-3 (2020), ChatGPT (2022)
          : Med-PaLM 2 (Google, 2023)
          : FDA AI/ML framework evolves
Figure 4.1: Timeline of AI development in medicine from the 1950s to 2020s, showing major breakthroughs (summers) and setbacks (winters). Each era brought different approaches and capabilities, from expert systems and rule-based reasoning to machine learning, deep learning, and today’s foundation models. The cyclical pattern of hype and disillusionment has repeated multiple times, yet each wave built on lessons from previous attempts.

The Birth of Medical AI (1950s-1960s)

The Turing Test and Medical Diagnosis

In 1950, Alan Turing asked a question that still haunts medical AI discussions: “Can machines think?”

What made Turing brilliant was skipping the philosophy entirely. He didn’t care about defining consciousness or machine sentience. He wanted a practical test. If you can’t tell whether you’re talking to a machine or a human, does it really matter which it is? Judge by outputs, not internal mechanisms.

We still evaluate medical AI this way. Can the algorithm’s diagnostic accuracy match or exceed expert physicians? Turns out that’s the easy part. What MYCIN taught us (painfully) is that matching expert performance doesn’t mean anyone will actually use your system.

Historical Context

When Turing wrote his paper, computers were room-sized calculators used for mathematical computations and code-breaking (Turing himself had led cryptanalysis at Bletchley Park during World War II). The idea that machines might diagnose diseases or recommend treatments seemed like science fiction. Yet Turing explicitly discussed medical diagnosis as a potential application of machine intelligence.

The Dartmouth Conference (1956)

The field of AI was formally born at Dartmouth College (McCarthy et al., 1955) in summer 1956, when John McCarthy, Marvin Minsky, Claude Shannon, and other luminaries gathered for a two-month workshop with an audacious premise:

“Every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.”

Their timeline predictions proved wildly optimistic. Some participants predicted machines would achieve human-level intelligence within a generation. Instead, we got symbolic AI systems that could play checkers but couldn’t recognize a cat, a task any toddler performs effortlessly.

This early overconfidence established a pattern physicians still see today: brilliant computer scientists underestimating how much of medical expertise is tacit, contextual, and embodied. Diagnosing a patient isn’t just applying rules. It requires intuition built from thousands of cases, cultural competency, noticing what’s not documented in the chart, and integrating ambiguous, incomplete, sometimes contradictory information under uncertainty and time pressure.

These tacit competencies remain difficult to encode algorithmically.

Early Medical AI Attempts

The first attempts to apply AI to medicine emerged in the 1960s:

  • DENDRAL (1965) (Lindsay et al., 1993): Stanford built a system that identified molecular structures from mass spectrometry data. It worked, but only in a highly constrained domain with clear rules.

  • Pattern recognition for cancer diagnosis: Early computer vision systems tried to identify cancerous cells from microscope images. Results were mixed, and 1960s computational power wasn’t remotely sufficient for complex image analysis.

These early efforts revealed a fundamental challenge that still persists: medicine deals with messy, incomplete data about extraordinarily complex biological systems. Unlike chess (deterministic rules, perfect information) or mathematical theorem proving (formal logic), medical diagnosis involves uncertainty, missing data, biological variability, comorbidities, and patient preferences.

The Expert Systems Era (1970s-1980s): MYCIN’s Promise and Failure

The 1970s brought a new approach: if we can’t make machines think like humans, maybe we can capture human expertise in formal IF-THEN rules.

MYCIN: Technical Success, Clinical Failure

In 1972, Edward Shortliffe began developing MYCIN (Shortliffe, 1976) at Stanford, an expert system for diagnosing bacterial blood infections and recommending antibiotics. This was a seemingly perfect test case for AI in medicine:

Why MYCIN was promising:

  • Well-defined clinical problem: Identify causative bacteria and select appropriate antibiotics
  • Clear expertise: Infectious disease specialists followed identifiable reasoning patterns
  • Life-or-death stakes: Sepsis kills quickly; correct antibiotic choice dramatically affects mortality
  • Knowledge-intensive task: Success requires knowing hundreds of drug-bug interactions, resistance patterns, patient-specific factors

How MYCIN worked:

MYCIN used backward chaining through approximately 600 IF-THEN rules:

IF:
  1) Patient is immunocompromised host
  AND
  2) Site of infection is gastrointestinal tract
  AND
  3) Gram stain is gram-negative-rod
THEN:
  Evidence (0.7) that organism is E. coli

The system conducted structured consultations by asking questions, applying rules, and (crucially) explaining its reasoning. This explainability was unprecedented. MYCIN could answer “Why do you believe this?” and “How did you reach that conclusion?”, something today’s deep learning systems struggle to do convincingly.
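For readers who want to see the mechanics, below is a minimal sketch in Python of backward chaining with certainty factors. It is an illustration of the general technique, not MYCIN’s implementation: MYCIN was written in Lisp, its certainty-factor calculus was richer, and the rules, findings, and numbers here are invented for the example.

# Minimal, simplified sketch of backward chaining with certainty factors,
# in the spirit of MYCIN. The rules and numbers below are illustrative only.

from dataclasses import dataclass

@dataclass
class Rule:
    premises: tuple          # conditions that must all be established
    conclusion: str          # hypothesis the rule supports
    cf: float                # certainty factor attached to the rule

RULES = [
    Rule(("immunocompromised-host", "gi-site-of-infection", "gram-negative-rod"),
         "organism-is-e-coli", 0.7),
    Rule(("recent-gi-surgery",), "gi-site-of-infection", 0.8),
]

def certainty(goal: str, findings: set, rules: list) -> float:
    """A goal is either directly observed, or supported by rules whose
    premises can themselves be established (backward chaining)."""
    if goal in findings:
        return 1.0
    belief = 0.0
    for rule in rules:
        if rule.conclusion != goal:
            continue
        # A rule's support is its CF scaled by its weakest premise.
        premise_cf = min(certainty(p, findings, rules) for p in rule.premises)
        if premise_cf <= 0.0:
            continue
        support = rule.cf * premise_cf
        # Corroborating positive evidence combines as cf_old + cf_new * (1 - cf_old).
        belief = belief + support * (1.0 - belief)
    return belief

if __name__ == "__main__":
    patient = {"immunocompromised-host", "gram-negative-rod", "recent-gi-surgery"}
    print(round(certainty("organism-is-e-coli", patient, RULES), 2))  # 0.56

Each rule contributes evidence weighted by its weakest established premise, and corroborating rules raise belief without ever exceeding certainty, which is the behavior that let MYCIN trace and explain how it arrived at a conclusion.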

The results were stunning:

Rigorous evaluation studies in the late 1970s found that MYCIN performed as well as infectious disease experts (Yu et al., 1979) and better than junior physicians. A landmark study published in JAMA showed:

  • 65% of MYCIN’s therapy recommendations were deemed acceptable by expert review
  • 90.9% accuracy in prescribing appropriate antimicrobial therapy
  • Performance comparable to ID faculty, superior to residents

Papers celebrated MYCIN as a breakthrough. It was featured in medical journals, computer science conferences, and popular media. Stanford Medicine showcased it as the future of clinical decision support.

The devastating reality:

Despite this technical and clinical success, MYCIN was never deployed in routine clinical care. Not once. Not in a clinical trial. Not even in a supervised pilot study.

Why did MYCIN fail so completely? The reasons had nothing to do with the AI’s performance:

Why MYCIN Failed: Lessons for Today’s Physician AI

Medical Liability: Who is legally responsible if MYCIN recommends the wrong antibiotic and the patient dies? The physician who followed its advice? Stanford University? The programmers? In the 1970s (and arguably still today), liability frameworks couldn’t accommodate algorithmic decision-making.

FDA Regulatory Uncertainty: The FDA had no framework for regulating software-based medical devices. Was MYCIN a medical device requiring approval? If so, what clinical evidence would be needed? These questions weren’t resolved until decades later.

Technical Integration Barriers: Getting MYCIN’s recommendations required a separate computer terminal, manual data entry, and stepping outside normal clinical workflow. In the 1970s, hospitals were just beginning to adopt electronic systems. MYCIN couldn’t integrate with existing infrastructure.

Physician Trust: Doctors weren’t comfortable following advice from a system they didn’t understand. Even though MYCIN could display its rules and explain its reasoning, physicians couldn’t independently validate that reasoning at the point of care.

Knowledge Maintenance Burden: Medical knowledge evolves. Keeping 600 hand-coded rules updated proved impractical. When new antibiotics became available or resistance patterns changed, updating MYCIN required programmer intervention.

Hospital Politics: Infectious disease specialists worried MYCIN would undermine their consultative role. Administrators couldn’t justify dedicated computer resources. Insurers wouldn’t reimburse for AI-assisted care.

The lesson for today’s physicians:

MYCIN proves that technical excellence doesn’t guarantee clinical adoption. The hardest problems deploying medical AI are rarely algorithmic. They’re legal, regulatory, social, organizational, and workflow-related.

Two decades after MYCIN’s failure, the Institute of Medicine formalized this insight in To Err is Human (Kohn et al., 2000): medical errors are primarily system failures, not individual failures. MYCIN failed not because of bad algorithms or bad physicians, but because the healthcare system lacked the infrastructure, regulatory frameworks, and organizational culture to deploy it safely. The IOM’s “good people working in bad systems” framing explains MYCIN’s fate and predicts similar challenges for today’s AI systems.

This lesson remains profoundly relevant. Today’s deep learning models vastly exceed MYCIN’s capabilities. Yet they face identical deployment challenges: liability uncertainty, integration complexity, physician trust barriers, and unclear value proposition relative to existing clinical workflows.

Other Expert Systems: Similar Patterns

The 1980s saw dozens of medical expert systems, most following MYCIN’s pattern of impressive demonstrations but minimal clinical impact:

  • INTERNIST-I/CADUCEUS (1974-1985) (Miller et al., 1982): Diagnosed diseases across internal medicine (~1,000 diseases, ~3,500 manifestations). Could rival internists in complex cases. Never clinically deployed.

  • DXplain (1984-present) (Barnett et al., 1987): Differential diagnosis support system. Still used today for education and clinical decision support, but as a reference tool, not autonomous diagnostic system. Rare success story due to modest scope and human-in-the-loop design.

  • ONCOCIN (1981): Guided cancer chemotherapy protocols. Impressive demonstrations, minimal clinical adoption.

Why Expert Systems Failed

By the late 1980s, the expert systems approach hit fundamental limits:

  1. Brittleness: Systems worked perfectly within their narrow domain but failed catastrophically on edge cases. A single unexpected finding could derail the entire diagnostic process.

  2. Knowledge acquisition bottleneck: Extracting rules from expert physicians was extraordinarily time-consuming and incomplete. Experts often couldn’t articulate their reasoning explicitly.

  3. Combinatorial explosion: Real-world medicine requires thousands of interacting rules. Managing complexity became unworkable.

  4. Maintenance burden: Medical knowledge evolves continuously. Hand-coded rules became outdated, requiring constant programmer intervention.

  5. Lack of learning: Expert systems couldn’t improve from experience. Every new case provided no feedback to enhance future performance.

  6. Deployment realities: As MYCIN demonstrated, technical performance didn’t address liability, regulation, integration, or trust.

The “AI Winter” of the late 1980s and 1990s arrived. Funding dried up. Researchers left the field. Companies removed “AI” from marketing materials. Medical AI seemed like a failed experiment.

The lesson: Rule-based approaches couldn’t capture the complexity, uncertainty, and nuance of real clinical practice. Medicine needed a fundamentally different approach.

The Machine Learning Revolution (1990s-2000s)

From Rules to Data

The 1990s brought a paradigm shift: instead of encoding expert rules manually, why not let algorithms learn patterns directly from data?

Machine learning, particularly supervised learning, offered a solution. Show an algorithm thousands of examples (X-rays labeled normal vs. abnormal, patient data labeled survived vs. died), and it learns to recognize patterns without explicit rule programming.
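As a concrete (and deliberately anachronistic) illustration, the snippet below uses scikit-learn, a modern library, to train a support vector machine on synthetic data. The features, labels, and performance are made up; the point is the workflow of this era: labeled examples go in, a learned decision rule comes out, and no one hand-writes diagnostic rules.

# Illustrative only: fit a support vector machine to synthetic, made-up
# "patient" features instead of hand-coding diagnostic rules.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2))                      # two fake features
# A made-up outcome that depends on both features plus noise.
y = ((0.8 * X[:, 0] + 0.6 * X[:, 1] + rng.normal(scale=0.5, size=n)) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), SVC(probability=True))
model.fit(X_train, y_train)                      # "learning" replaces rule-writing
probs = model.predict_proba(X_test)[:, 1]
print(f"AUC-ROC on held-out data: {roc_auc_score(y_test, probs):.2f}")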

Key developments enabling this shift:

  • Increased computational power: Moore’s Law made previously impractical computations feasible
  • Digitization of medical data: Picture Archiving and Communication Systems (PACS), early electronic health records
  • Algorithmic advances: Support Vector Machines (SVMs), decision trees, ensemble methods

Computer-Aided Detection (CAD) in Radiology

The first commercially successful medical AI applications emerged in radiology:

Computer-Aided Detection (CAD) for mammography:

  • FDA began approving CAD systems in the late 1990s
  • Designed to augment radiologist interpretation by flagging suspicious regions
  • Became widely adopted in breast cancer screening

Initial promise: Retrospective studies suggested CAD could reduce false negatives (missed cancers).

Reality check: Prospective studies showed mixed results (Lehman et al., 2019). CAD increased recall rates (more women called back for additional imaging) without consistently improving cancer detection rates. Some studies suggested CAD reduced radiologist specificity without improving sensitivity.

Current status: Second-generation AI systems using deep learning show more promise. The lesson: first-generation medical AI often underperforms expectations in real-world deployment.

FDA Begins Regulating Medical AI

The 1990s-2000s saw FDA establish regulatory frameworks for software-based medical devices:

  • 1990s: First CAD systems cleared through 510(k) pathway (substantial equivalence to existing devices)
  • 2000s: FDA establishes guidance for computer-assisted detection devices
  • Challenge: Rapid AI evolution outpaced regulatory frameworks designed for static medical devices

This regulatory evolution continues today, with FDA developing new frameworks for continuously learning AI systems.

Why This Era Mattered

The machine learning revolution established principles still guiding medical AI:

  • Data-driven approaches could achieve good performance without explicit rule encoding
  • Narrow, well-defined tasks (e.g., detecting lung nodules on CT) worked better than general diagnosis
  • Augmentation vs. replacement gained acceptance (radiologist + CAD performed better than either alone)
  • Prospective validation essential: Retrospective performance didn’t guarantee real-world utility
  • Workflow integration remained challenging despite better algorithms

The limitation: Traditional machine learning required careful feature engineering (hand-crafted measurements and patterns). Algorithms couldn’t learn complex representations directly from raw data.

The deep learning revolution would change that.

The Deep Learning Revolution (2010s): Imaging AI Comes of Age

AlexNet and the ImageNet Moment (2012)

In 2012, a deep convolutional neural network called AlexNet (Krizhevsky et al., 2012) won the ImageNet visual recognition challenge by a massive margin, nearly halving the error rate of the nearest competitor.

This wasn’t just incremental progress. It was a paradigm shift. Deep learning could learn complex features directly from raw pixels without manual feature engineering. Suddenly, computer vision problems that had resisted decades of effort became tractable.

Implications for medical imaging:

Medical images (X-rays, CTs, MRIs, pathology slides, dermatology photos, retinal fundus images) are fundamentally visual pattern recognition problems. If deep learning could master general image classification, perhaps it could learn to detect pneumonia, identify cancers, or diagnose diabetic retinopathy.
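To show what “learning features from raw pixels” looks like in code, here is a toy convolutional network in PyTorch. The architecture and input sizes are arbitrary choices for the sketch, not any published medical imaging model, and the input is random noise rather than a real image.

# Toy convolutional network: layers learn image features directly from pixels,
# with no hand-engineered measurements. Sizes are illustrative only.
import torch
import torch.nn as nn

class TinyImagingNet(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# One grayscale 224x224 "image" of random noise, just to show the shapes.
logits = TinyImagingNet()(torch.randn(1, 1, 224, 224))
print(logits.shape)  # torch.Size([1, 2])

Given enough labeled images, the convolutional filters themselves are learned during training, which is exactly the step that manual feature engineering used to require.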

Medical Imaging AI Explosion

The mid-2010s saw an explosion of medical imaging AI research:

Papers proliferated. Venture capital flooded into medical AI startups. Headlines proclaimed “AI Will Replace Radiologists.”

The First FDA-Cleared Autonomous AI (2018): IDx-DR

In April 2018, FDA granted the first authorization for an autonomous AI diagnostic system: IDx-DR for diabetic retinopathy screening (Abràmoff et al., 2018).

Why IDx-DR succeeded where MYCIN failed:

  • Narrow, well-defined task: Detect referable diabetic retinopathy (yes/no decision)
  • Clear clinical need: Primary care physicians need retinal screening but lack ophthalmology expertise
  • Prospective validation: 900-patient clinical trial in primary care settings (not just retrospective analysis)
  • Autonomous operation: Primary care staff operate the system without specialist interpretation
  • Regulatory clarity: FDA had developed frameworks for AI-based medical devices
  • Reimbursement: CPT codes established for AI-assisted diabetic retinopathy screening

Clinical bottom line: IDx-DR demonstrates the formula for successful medical AI deployment:

  1. Well-defined, high-value clinical problem
  2. Prospective validation in real-world settings
  3. Clear regulatory pathway
  4. Workflow integration
  5. Reimbursement model
  6. Physician and patient acceptance

The “AI Will Replace Radiologists” Debate

Geoffrey Hinton (deep learning pioneer) stated at a 2016 conference: “It’s quite obvious that we should stop training radiologists.”

This sparked fierce debate. Would AI replace radiologists? Should medical students avoid imaging specialties?

What actually happened:

  • AI didn’t replace radiologists
  • AI began augmenting radiologist workflow
  • Radiologists increasingly use AI as assistive tools
  • AI handles some straightforward screening tasks
  • Complex cases still require radiologist expertise
  • New roles emerged: radiologists curating datasets, validating AI, interpreting edge cases

The lesson for physicians:

Dire predictions about AI replacing doctors are consistently wrong. What happens instead: AI augments clinical capabilities, handles routine tasks, and creates new workflows requiring physician oversight.

Fear replacement less. Focus on effective integration more.

Deep Learning’s Limitations in Medicine

Despite these advances, deep learning revealed serious limitations:

Black-box problem: Neural networks can’t explain why they make predictions. A radiologist can articulate reasoning (“spiculated mass in the upper lobe with associated lymphadenopathy suggests malignancy”). Deep learning models output probabilities without interpretable justification.

Data hunger: Deep learning requires massive labeled datasets (thousands to millions of examples). Curating these datasets requires enormous physician time.

Brittleness to distribution shift: Models trained on data from Hospital A often perform poorly at Hospital B due to different imaging equipment, patient populations, disease prevalence, or documentation practices.

Adversarial vulnerability: Tiny, imperceptible changes to input images can fool deep learning models completely, a major patient safety concern (Finlayson et al., 2019).

Fairness and bias: AI systems inherit biases from training data. If training data under-represents certain populations, model performance may be worse for those groups (Obermeyer et al., 2019).

These limitations drive ongoing research and FDA regulatory evolution.
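Of these, distribution shift is the failure mode clinicians encounter most often when a model moves between institutions. The simulation below (entirely synthetic data, with an exaggerated effect for clarity) shows one common mechanism: the model learns a site-specific shortcut, such as a scanner or protocol marker that happens to track the label at the training hospital, and its discrimination degrades at a hospital where that shortcut no longer holds.

# Simulated illustration of distribution shift (all data are synthetic).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)

def make_site(n: int, shortcut_tracks_label: bool):
    signal = rng.normal(size=n)                         # true disease signal
    y = (signal + rng.normal(scale=0.8, size=n) > 0).astype(int)
    if shortcut_tracks_label:
        marker = y + rng.normal(scale=0.3, size=n)      # spurious but predictive here
    else:
        marker = rng.normal(size=n)                     # unrelated at the new site
    return np.column_stack([signal, marker]), y

X_a, y_a = make_site(5000, shortcut_tracks_label=True)    # training hospital
X_b, y_b = make_site(5000, shortcut_tracks_label=False)   # deployment hospital

model = LogisticRegression(max_iter=1000).fit(X_a, y_a)
auc_a = roc_auc_score(y_a, model.predict_proba(X_a)[:, 1])
auc_b = roc_auc_score(y_b, model.predict_proba(X_b)[:, 1])
print(f"AUC at Hospital A: {auc_a:.2f}   AUC at Hospital B: {auc_b:.2f}")

The model leans on the marker because it is highly predictive at the training site, so its apparent performance there is excellent; at the new site the marker is noise and discrimination falls, even though the underlying biology has not changed.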

IBM Watson for Oncology: The Highest-Profile Failure

No discussion of medical AI history would be complete without IBM Watson for Oncology, arguably the most expensive, highest-profile medical AI failure to date.

The Promise

In 2011, IBM’s Watson defeated human champions on Jeopardy!, demonstrating impressive natural language processing capabilities. IBM launched Watson for Oncology in 2013, partnering with Memorial Sloan Kettering to train the system. IBM promised Watson would reshape medicine by:

  • Analyzing millions of medical journal articles
  • Synthesizing complex patient data from EHRs
  • Recommending personalized, evidence-based cancer treatments
  • Augmenting oncologist decision-making

Major cancer centers worldwide partnered with IBM: MD Anderson Cancer Center, Memorial Sloan Kettering, hospitals across India, China, and other countries. IBM invested billions. Expectations soared.

The Reality

Watson for Oncology failed spectacularly (Ross & Swetlitz, 2019):

Unsafe recommendations: Internal documents revealed Watson recommended treatments that would have harmed patients. In one case, Watson suggested administering chemotherapy to a patient with severe bleeding, a contraindicated, potentially fatal recommendation (Ross & Swetlitz, 2019).

Poor real-world performance: Oncologists found recommendations often didn’t match current evidence-based guidelines or were inappropriate for specific clinical contexts (Liu et al., 2021).

Geographic inappropriateness: Jupiter Hospital (India) reported Watson recommended treatments unavailable in India, ignoring local resource constraints and formulary restrictions (Liu et al., 2021).

Training data issues: Watson was trained primarily on synthetic cases created by Memorial Sloan Kettering physicians, not real-world patient data. It learned institutional preferences, not universal evidence (Strickland, 2019).

MD Anderson debacle: MD Anderson Cancer Center spent $62 million on Watson implementation before canceling the project in 2016, concluding it wasn’t ready for clinical use (Strickland, 2019).

Widespread abandonment: By 2019, multiple health systems had stopped using Watson. IBM announced the sale of its Watson Health assets in early 2022.

What Went Wrong: Lessons for Physicians

Critical Lessons from Watson’s Failure
  1. Marketing ≠ Clinical Validation: Jeopardy! success doesn’t translate to clinical competence. Demand prospective clinical trials, not just impressive demonstrations in controlled settings.

  2. Black-box algorithms are dangerous: Oncologists couldn’t understand Watson’s reasoning or override incorrect recommendations effectively.

  3. Training data matters immensely: Synthetic cases created by one institution don’t represent the diversity of real clinical practice.

  4. Physician involvement is essential: Watson was developed primarily by engineers, with insufficient oncologist input during design and training.

  5. Geographic and institutional context matters: Treatment recommendations must account for local resources, formularies, patient populations, and practice patterns.

  6. Financial incentives can override evidence: IBM’s business model prioritized deployment over clinical validation.

  7. Transparency and reporting: Watson’s failures remained largely hidden until investigative journalists uncovered them. Medical AI needs transparent reporting of failures.

The clinical bottom line:

Watson’s failure demonstrates why physicians must evaluate AI systems with the same rigor applied to new pharmaceuticals: demand prospective clinical trials, transparent reporting, independent validation, and post-market surveillance.

Don’t accept vendor claims without evidence.

The Foundation Model Era (2020-Present)

Large Language Models Arrive

November 2022 brought ChatGPT, introducing millions of people (including physicians) to large language models (LLMs) (Brown et al., 2020). These “foundation models” could:

  • Answer medical questions with apparently sophisticated reasoning
  • Draft clinical notes
  • Explain complex concepts
  • Translate between languages
  • Write code
  • Summarize literature

Unlike narrow AI systems designed for specific tasks, LLMs demonstrated general capabilities across diverse domains.

Medical-Specific Foundation Models

Recognizing both promise and risks of general LLMs in medicine, researchers developed medical-specific models:

Med-PaLM (Google, 2022) and Med-PaLM 2 (2023):
  • Fine-tuned on medical text
  • Achieved passing scores on USMLE-style questions (Singhal et al., 2023)
  • Improved clinical accuracy compared to general LLMs

GPT-4 in medicine (OpenAI, 2023):
  • Demonstrated strong performance on medical licensing exams
  • Used by physicians for literature review and summarization, clinical reasoning support, and patient education

Challenges remain:
  • Hallucinations: LLMs confidently generate plausible but incorrect information
  • Bias: Inherit biases from training data
  • Liability: Unclear legal framework for LLM-assisted clinical decisions
  • Privacy: Patient data security concerns
  • Validation: How to validate general-purpose models across diverse clinical scenarios?

Current State: Promise and Uncertainty

As of 2025, medical AI stands at an inflection point:

Undeniable progress:
  • FDA has cleared hundreds of AI-based medical devices, with approvals accelerating since 2018 (see FDA AI/ML database)
  • Radiology, pathology, dermatology, and ophthalmology see routine AI use
  • Clinical decision support systems are deployed in major EHR systems
  • Foundation models demonstrate unprecedented versatility

Persistent challenges:
  • Most AI systems remain narrow, fragile, and context-dependent
  • Integration into clinical workflow remains difficult
  • Liability frameworks lag technological capabilities
  • Bias and fairness concerns are well-documented but incompletely solved
  • Physician trust varies widely by specialty, generation, and prior experience

The fundamental question:

Are today’s foundation models genuinely different from previous AI waves, or are we in another hype cycle that will end in disillusionment?

Evidence suggests: Both. Foundation models represent real algorithmic breakthroughs. But transforming clinical practice requires solving the same deployment challenges that defeated MYCIN and Watson: integration, liability, trust, validation, and proving clinical benefit beyond technical performance metrics.

Key Lessons from History

Seven decades of medical AI teach clear lessons:

Historical Lessons for Physicians
  1. Technical performance ≠ clinical adoption (MYCIN)
  2. Marketing hype ≠ clinical validity (Watson)
  3. Retrospective validation ≠ prospective utility (Google Flu Trends)
  4. Narrow, well-defined tasks work best (IDx-DR)
  5. Augmentation > replacement (CAD in radiology)
  6. Physician involvement is essential (Watson’s failure)
  7. Regulatory frameworks evolve slowly (ongoing FDA challenges)
  8. Liability concerns shape adoption (MYCIN’s legal uncertainty)
  9. Workflow integration is harder than algorithmic development (expert systems)
  10. Patient safety must remain paramount (Watson’s unsafe recommendations)

The path forward:

Learn from failures. Demand evidence. Integrate thoughtfully. Maintain physician oversight. Prioritize patients.

History doesn’t repeat, but it rhymes. The physicians who understand AI’s history will navigate its future most effectively.

Skill Degradation in the Age of AI: Historical Lessons

Medical AI’s most insidious risk may not be algorithmic error or liability uncertainty. It’s cognitive de-skilling, the gradual erosion of clinical expertise when physicians routinely defer to algorithmic recommendations.

This isn’t hypothetical. It’s documented across multiple domains where humans adopted decision support systems.

Parallel Examples: GPS Navigation and Mental Arithmetic

Consider GPS navigation. Before smartphones, drivers developed spatial reasoning and mental maps of their environments. They could navigate without turn-by-turn directions, improvise when routes were blocked, and maintain geographic awareness.

Research now shows systematic decline in these capabilities (Javadi et al., 2017). People who rely heavily on GPS navigation show reduced hippocampal activity and diminished spatial memory formation. They navigate successfully with technology but lose the underlying skill. When GPS fails or provides incorrect directions (which it does), they lack backup competencies.

The calculator provides another example. Students who use calculators extensively show worse mental arithmetic skills than previous generations. The technology is reliable for computation, but students never develop number sense, the ability to estimate whether answers are reasonable or catch obvious errors.

Neither GPS nor calculators are bad technologies. But their widespread adoption fundamentally changed human cognitive capabilities. Users gained convenience, lost skills.

Documented Skill Atrophy in Medical AI

Medical AI is following the same trajectory, with concerning evidence emerging from radiology and gastroenterology.

Computer-Aided Detection (CAD) in Radiology:

Early CAD systems for mammography promised to reduce missed cancers by flagging suspicious regions for radiologist review. The technology worked. What researchers didn’t anticipate: radiologists who routinely used CAD became worse at independent image interpretation.

Studies documented reduced diagnostic accuracy when CAD was unavailable (Alberdi et al., 2004). Radiologists developed dependency on algorithmic prompts rather than systematic image analysis. Their pattern recognition skills, the foundation of radiologic expertise, atrophied with disuse.

The mechanism is straightforward: if an algorithm highlights suspicious regions, why scrutinize the entire image? Over time, physicians come to scan for algorithmic flags rather than conducting their own systematic search. When the algorithm fails (which it will), physicians lack robust backup skills.

The Gastroenterology Study: Physicians Became Significantly Worse

Recent evidence from gastroenterology demonstrates the severity of this problem. A multicentre observational study found that endoscopists using AI assistance for polyp detection showed improved detection rates while using AI. But when performing colonoscopies without AI after months of AI-assisted practice, their adenoma detection rate dropped from 28.4% to 22.4%, a 6-percentage-point decline (Lancet Gastroenterology & Hepatology, 2025).

The AI improved immediate outcomes. It degraded physician expertise.

This creates a dangerous dependency: physicians rely on AI to compensate for skills they no longer maintain, making the technology indispensable even when its benefit diminishes or introduces new risks.

Training the Next Generation: Medical Students Who Never Develop Foundational Skills

The skill degradation problem becomes existential when considering medical education. If students learn medicine alongside AI tools that provide differential diagnoses, treatment recommendations, and clinical decision support, will they develop the same depth of clinical reasoning as previous generations?

Dhruv Khullar, a physician writing in The New Yorker, describes medical students expressing discomfort after using AI assistance: feeling “dirty presenting thoughts to attending physicians, knowing they were actually the A.I.’s thoughts” (Khullar, 2025).

This raises profound questions about what medical training should accomplish in an age of clinical AI:

  • If students can query AI for differential diagnoses, should they still memorize disease presentations?
  • If algorithms interpret ECGs more accurately than cardiologists, what ECG interpretation skills should medical students master?
  • If AI drafts clinical notes, how will students learn to synthesize complex information into coherent clinical narratives?
  • If radiology AI detects pathology automatically, what level of image interpretation competency should physicians maintain?

The risk isn’t that students use AI tools. It’s that they never develop robust foundational skills because AI assistance is always available. Then, when AI fails, is unavailable, or encounters edge cases beyond its training, there’s no physician expertise to fall back on.

Historical Lesson: Every Technology Changed What Physicians Knew

This pattern isn’t new. Each wave of medical technology transformed physician capabilities:

Calculators and clinical scoring systems: Physicians no longer perform complex risk calculations mentally. They input variables into validated scoring systems. Mental arithmetic skills declined, but risk stratification improved.

Electronic health records (EHRs): Physicians rely on structured templates, autocomplete, and copy-forward functionality. Documentation became standardized but less personalized. Narrative skills atrophied. Many physicians struggle to write coherent clinical summaries without EHR templates.

Automated laboratory analyzers: Physicians stopped performing manual blood counts and chemical assays. Lab interpretation skills improved (access to more data), but understanding of underlying physiology declined. Few physicians today could estimate hemoglobin from examining a peripheral blood smear.

Advanced imaging (CT, MRI): Physical examination skills declined as imaging became routine for diagnosis. Studies show physicians miss physical findings that would have been obvious to previous generations, because they order imaging instead of examining patients thoroughly.

Each technology brought real benefits. Each created dependencies. Each changed what it meant to be a competent physician.

AI may represent the most significant shift yet because it targets the core of clinical expertise: pattern recognition, diagnostic reasoning, and decision-making under uncertainty.

The Unsolved Tension: Competence vs. Dependency

There’s no simple resolution to this tension.

Refusing to use AI to preserve traditional skills would harm patients if AI genuinely improves outcomes. But uncritical adoption creates dangerous dependencies and erodes the expertise needed when technology fails.

Medical educators, licensing bodies, and specialty societies are grappling with fundamental questions:

  • What clinical skills are truly foundational, requiring mastery regardless of AI availability?
  • What tasks can we safely delegate to algorithms, accepting that human competency will decline?
  • How do we maintain physician capability to recognize and override algorithmic errors?
  • What backup competencies must physicians preserve for when AI systems fail?
  • How do we train medical students to use AI effectively without becoming entirely dependent on it?

No consensus exists. Medical schools are experimenting with different approaches: some integrate AI early in training, others restrict its use until students demonstrate independent competency. Some specialties (radiology, pathology) are redesigning curricula around AI-augmented practice. Others (surgery, primary care) emphasize skills that remain distinctly human.

The historical lesson is clear: Technology always changes physician capabilities. The question isn’t whether AI will transform medical expertise, but how we manage that transformation to preserve the irreplaceable elements of clinical judgment while embracing genuine improvements.

Physicians who understand this historical pattern, who recognize skill atrophy as an inevitable consequence of technological adoption, will be better positioned to maintain critical competencies while integrating AI thoughtfully.

But we’re still early in this transition. The long-term effects of AI on physician expertise remain uncertain. What’s certain: ignoring the problem guarantees we’ll repeat historical mistakes, creating dependencies we don’t fully understand until they become irreversible.


References