Large Language Models in Clinical Practice

Keywords

clinical LLM, clinical LLMs, large language models in medicine, GPT-4 medicine, Claude healthcare, medical AI chatbot, LLM hallucinations, HIPAA compliant AI, LLM reproducibility liability

Large Language Models (LLMs) like ChatGPT, GPT-4, Claude, and Med-PaLM represent a fundamentally different paradigm from narrow diagnostic AI. Unlike algorithms trained for single tasks (detect melanoma, predict sepsis), LLMs are general-purpose language systems that can write notes, answer questions, review literature, draft patient education, and assist clinical reasoning. They’re extraordinarily powerful but also uniquely dangerous, capable of generating confident, plausible, but completely false medical information (“hallucinations”).

Learning Objectives

After reading this chapter, you will be able to:

  • Understand how LLMs work and their fundamental capabilities and limitations in medical contexts
  • Identify appropriate vs. inappropriate clinical use cases based on risk-benefit assessment
  • Recognize and mitigate hallucinations, citation fabrication, and knowledge cutoff problems
  • Navigate privacy (HIPAA), liability, and ethical considerations specific to LLM use in medicine
  • Evaluate medical-specific LLMs (Med-PaLM, GPT-4 medical applications) vs. general-purpose models
  • Implement LLMs safely in clinical workflows with proper oversight and verification protocols
  • Communicate transparently with patients about LLM-assisted care
  • Apply vendor evaluation frameworks before adopting LLM tools for clinical practice

The Clinical Context:

Large Language Models (ChatGPT, GPT-4, Med-PaLM, Claude) have exploded into medical practice since ChatGPT’s public release in November 2022. Unlike narrow diagnostic AI trained for single tasks, LLMs are general-purpose systems that can write clinical notes, answer medical questions, summarize literature, draft patient education materials, generate differential diagnoses, and assist with complex clinical reasoning.

They represent a paradigm shift: AI that communicates in natural language, appears to “understand” medical concepts, and can perform diverse tasks without task-specific training. This versatility makes them extraordinarily useful and extraordinarily dangerous if used incorrectly.

The fundamental challenge: LLMs are statistical language models trained to predict plausible next words, not to retrieve medical truth. They can generate confident, coherent, authoritative-sounding but completely false medical information (“hallucinations”). A physician who trusts LLM output without verification risks patient harm.

Key Applications:

  • Ambient clinical documentation: Nuance DAX, Abridge convert conversations to clinical notes, 30-50% time savings validated
  • Literature synthesis and summarization: Summarize guidelines, compare treatment options (with citation verification)
  • Patient education materials: Generate health literacy-appropriate explanations (with physician review)
  • Differential diagnosis brainstorming: Suggest possibilities for complex cases (treat as idea generation, not diagnosis)
  • Medical coding assistance: Suggest ICD-10/CPT codes from clinical narratives (with compliance review)
  • Clinical decision support: Glass Health, other LLM-based systems provide treatment suggestions (requires rigorous verification)
  • Medical education: Explaining concepts, generating practice questions (risk: teaching hallucinated “facts”)
  • Autonomous patient advice: Patients asking LLMs medical questions without physician oversight (dangerous false reassurance)
  • Medication dosing without verification: LLMs fabricate plausible but incorrect dosages
  • Citation generation: LLMs routinely fabricate references to non-existent papers

What Actually Works:

  1. Nuance DAX ambient documentation: 50% reduction in documentation time, 77% physician satisfaction, deployed in 550+ health systems (not FDA-regulated; falls under CDS exemption as documentation tool)
  2. Abridge clinical documentation: 2-minute patient encounter → structured note in 30 seconds, 65% time savings in pilot studies
  3. Literature summarization (with verification): GPT-4/Claude accurately summarize guidelines 85-90% of time when facts are verifiable
  4. Patient education draft generation: Health literacy-appropriate materials in seconds (requires physician fact-checking before distribution)

What Doesn’t Work:

  1. Citation reliability: GPT-4 fabricates 15-30% of medical citations (authors, titles look real but papers don’t exist)
  2. Medication dosing: Multiple reported cases of LLMs suggesting incorrect pediatric dosages, dangerous drug combinations
  3. Medical calculations: NIH research found GPT-4 achieves only 50.9% accuracy on clinical calculator tasks (CHADS-VASc, GFR, risk scores), with three failure modes: wrong equations, parameter extraction errors, arithmetic mistakes
  4. Autonomous diagnosis: LLMs lack patient-specific data, physical exam findings, cannot replace clinical judgment
  5. Real-time medical knowledge: All LLMs have training data cutoffs, meaning they may be unaware of newer drugs, guidelines, or treatments published after training

Critical Insights:

Hallucinations are unavoidable, not bugs: LLMs predict plausible words, not truth; no amount of training eliminates hallucinations entirely

HIPAA compliance is non-negotiable: Public ChatGPT is NOT HIPAA-compliant; patient data entered is stored, potentially used for training

Physician remains legally responsible: “AI told me to” is not a malpractice defense; all LLM-assisted decisions require verification

Exam performance ≠ clinical utility: GPT-4 scores 86% on USMLE but multiple choice questions don’t test clinical judgment, patient communication, or risk management. When answer patterns are disrupted (NOTA test), LLM accuracy drops 26-38%, suggesting pattern matching over genuine reasoning (Bedi et al., 2025)

Ambient documentation shows clearest ROI: 50% time savings + high physician satisfaction = rare AI win-win

Prompting quality matters enormously: Specific, detailed prompts with requests for sourcing and uncertainty yield better outputs than vague questions

Clinical Bottom Line:

LLMs are powerful assistants for documentation, education, and brainstorming, but dangerous if used autonomously for diagnosis, treatment, or urgent decisions.

Safe use requires:

  • HIPAA-compliant systems only (never public ChatGPT for patient data)
  • Always verify medical facts against authoritative sources
  • Treat LLM output as drafts requiring physician review, never final decisions
  • Document verification steps
  • Transparent communication with patients about LLM assistance

Demand evidence:

  • Ask vendors for prospective validation studies (not just retrospective accuracy)
  • Request HIPAA compliance documentation and a Business Associate Agreement (BAA)
  • Validate locally before widespread deployment
  • Monitor continuously for errors, near-misses, and hallucinations

The promise is real (50% documentation time savings), but the risks are serious (hallucinations, privacy violations, liability). Proceed cautiously with proper safeguards.

Medico-Legal Considerations:

  • Physician liability remains unchanged: LLMs are tools, not practitioners; physician responsible for all clinical decisions
  • Standard of care evolving: As LLM use becomes widespread, failing to use available tools may become negligence (but using them incorrectly already is negligence)
  • Limited reproducibility creates unique liability: LLMs can produce different outputs for identical prompts, complicating documentation, peer review, and quality assurance
  • Documentation requirements: Note LLM assistance where material to decisions, document verification steps, record LLM version and timestamp
  • Testing before deployment: Run reproducibility tests (same prompt 5 times) to assess output variance before clinical use
  • Informed consent emerging: Some institutions now inform patients when LLMs assist documentation or clinical reasoning
  • HIPAA violations carry penalties: $137–$68,928 per violation (2025 inflation-adjusted); entering patient data into public ChatGPT violates HIPAA
  • Malpractice insurance may exclude AI: Check policy coverage explicitly, ask “Does this policy cover LLM use in clinical practice?” before deployment
  • Fabricated citations = academic dishonesty: Using LLM-generated fake references in publications, grant applications is fraud

Essential Reading:

  • Omiye JA et al. (2024). “Large Language Models in Medicine: The Potentials and Pitfalls: A Narrative Review.” Annals of Internal Medicine 177:210-220. (doi:10.7326/M23-2772) [Stanford thorough review covering LLM capabilities, limitations, bias, privacy concerns, and practical clinical applications]

  • Singhal K et al. (2023). “Large language models encode clinical knowledge.” Nature 620:172-180. [Med-PaLM validation on MedQA and other medical benchmarks; the follow-up Med-PaLM 2 reported 86.5% on MedQA]

  • Thirunavukarasu AJ et al. (2023). “Large language models in medicine.” Nature Medicine 29:1930-1940. [Review of medical LLM capabilities and limitations]

  • Nori H et al. (2023). “Capabilities of GPT-4 on Medical Challenge Problems.” Microsoft Research. [GPT-4 USMLE performance: 86%+]

  • Ayers JW et al. (2023). “Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum.” JAMA Internal Medicine 183:589-596. [LLM vs. physician responses quality comparison]

  • Lee P et al. (2023). “Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine.” New England Journal of Medicine 388:1233-1239. [Clinical use cases and risk assessment]



Introduction: A Paradigm Shift in Medical AI

Every previous chapter in this handbook examines narrow AI: algorithms trained for single, specific tasks.

  • Radiology AI detects pneumonia on chest X-rays (and nothing else)
  • Pathology AI grades prostate cancer histology (and nothing else)
  • Cardiology AI interprets ECGs for arrhythmias (and nothing else)

Large Language Models are fundamentally different: general-purpose systems that perform diverse tasks through natural language interaction.

Ask GPT-4 to summarize a medical guideline and it does. Ask it to draft a patient education handout and it does. Ask it to generate a differential diagnosis for chest pain and it does. No task-specific retraining required.

This versatility is unprecedented in medical AI. It’s also what makes LLMs uniquely dangerous.

A narrow diagnostic AI fails in predictable ways:

  • Pneumonia detection AI applied to a chest X-ray might miss a pneumonia (false negative) or flag normal lungs as abnormal (false positive)
  • Failure modes are bounded by the task

LLMs fail in unbounded ways:

  • Fabricate drug dosages that look correct but cause overdoses
  • Invent medical “facts” that sound authoritative but are false
  • Generate fake citations to real journals (the paper doesn’t exist)
  • Provide confident answers to questions where uncertainty is appropriate
  • Contradict themselves across responses
  • Recommend treatments that were standard of care in training data but have since been superseded

The clinical analogy: LLMs are like exceptionally well-read medical students who have:

  • Perfect recall of everything they’ve studied
  • No clinical experience
  • No ability to examine patients or access patient-specific data
  • No accountability for errors
  • A tendency to confidently bullshit when they don’t know the answer


Part 1: How LLMs Work (What Physicians Need to Know)

The Technical Basics (Simplified)

Training:

  1. Ingest massive text corpora (internet, books, journals, Wikipedia, Reddit, medical textbooks, PubMed abstracts)
  2. Learn statistical patterns: “Given these words, what word typically comes next?”
  3. Scale to billions of parameters (weights connecting neural network nodes)
  4. Fine-tune with human feedback (reinforcement learning from human preferences)

Inference (when you use it):

  1. You provide a prompt (“Generate a differential diagnosis for acute chest pain in a 45-year-old man”)
  2. The LLM predicts the most likely next word based on learned patterns
  3. It continues predicting words one by one until a stopping criterion is met
  4. It returns the generated text

Crucially:

  • LLMs don’t “look up” facts in a database
  • They don’t “reason” in the logical sense
  • They predict plausible text based on statistical patterns
  • Truth and plausibility are not the same thing
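
To make the “plausibility, not truth” point concrete, here is a minimal, purely illustrative Python sketch of next-word prediction using a toy bigram model. The tiny corpus, including its deliberately false statement, is invented for illustration; real LLMs differ enormously in scale, but the underlying training objective is the same.

# Toy bigram "language model": predicts the next word purely from co-occurrence
# counts in its training text. Illustrative only -- real LLMs use billions of
# parameters, but the objective is the same: predict a plausible next word,
# not a verified fact.
import random
from collections import defaultdict, Counter

corpus = (
    "amoxicillin is dosed by weight in children . "
    "warfarin is dosed by inr in adults . "
    "amoxicillin is dosed by inr in children ."   # a false statement in the training data
).split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1          # learn: "given this word, what follows?"

def generate(start, length=8):
    words = [start]
    for _ in range(length):
        candidates = bigram_counts[words[-1]]
        if not candidates:
            break
        # sample proportionally to frequency -- plausibility, not truth
        words.append(random.choices(list(candidates), weights=list(candidates.values()))[0])
    return " ".join(words)

print(generate("amoxicillin"))  # may reproduce the false "dosed by inr" pattern it absorbed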

Why Hallucinations Happen

Definition: LLM generates confident, coherent, plausible but factually incorrect text.

Mechanism: The training objective is “predict next plausible word,” not “retrieve correct fact.” As WHO’s 2025 guidance notes, LLMs have no conception of what they produce, only statistical patterns from training data (WHO, 2025). When uncertain, LLMs default to generating text that sounds correct rather than admitting uncertainty or refusing to answer.

Medical examples documented in literature:

  1. Fabricated drug dosages:
    • Prompt: “What is the pediatric dosing for amoxicillin?”
    • GPT-3.5 response: “20-40 mg/kg/day divided every 8 hours” (incorrect for many indications; standard is 25-50 mg/kg/day, some indications 80-90 mg/kg/day)
  2. Invented medical facts:
    • Prompt: “What are the contraindications to beta-blockers in heart failure?”
    • LLM includes “NYHA Class II heart failure” (false; beta-blockers are indicated, not contraindicated, in Class II HF)
  3. Fake citations:
    • Prompt: “Cite studies showing benefit of IV acetaminophen for postoperative pain”
    • GPT-4 generates: “Smith et al. (2019) in JAMA Surgery found 40% reduction in opioid use” (paper doesn’t exist; authors, journal, year all fabricated but plausible)
  4. Outdated recommendations:
    • All LLMs have training data cutoffs (check the specific model’s documentation)
    • May recommend drugs withdrawn from market after training
    • Unaware of updated guidelines published post-training

Why this matters clinically: A physician who trusts LLM output without verification risks:

  • Incorrect medication dosing → patient harm
  • Reliance on outdated treatment → suboptimal care
  • Academic dishonesty from fabricated citations → career consequences

Mitigation strategies:

  • Always verify drug information against pharmacy databases (Lexicomp, Micromedex, UpToDate)
  • Cross-check medical facts with authoritative sources (guidelines, textbooks, PubMed)
  • Never trust LLM citations without looking up the actual papers
  • Use LLMs for drafts and idea generation, never final medical decisions
  • Higher stakes = more verification required
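
As one concrete illustration of the “always verify drug information” rule, here is a minimal Python sketch of a weight-based dose sanity check. The reference ranges are placeholder values echoing the amoxicillin example above; in practice the ranges must come from your pharmacy database or institutional protocol, and the check supplements rather than replaces pharmacist review.

# Minimal "verify before use" check for an LLM-suggested weight-based dose.
# Reference ranges below are illustrative placeholders only -- source them from
# Lexicomp/Micromedex or your institutional protocol in real use.

REFERENCE_RANGES_MG_PER_KG_PER_DAY = {
    # drug/indication: (low, high) in mg/kg/day -- placeholder values
    "amoxicillin_standard": (25, 50),
    "amoxicillin_high_dose": (80, 90),
}

def dose_within_reference(drug, weight_kg, llm_daily_dose_mg):
    """Return True only if the LLM-suggested total daily dose falls inside the
    verified mg/kg/day range for this drug and indication."""
    low, high = REFERENCE_RANGES_MG_PER_KG_PER_DAY[drug]
    mg_per_kg = llm_daily_dose_mg / weight_kg
    return low <= mg_per_kg <= high

# Example: the LLM suggests 300 mg/day total for a 20 kg child (15 mg/kg/day)
if not dose_within_reference("amoxicillin_standard", weight_kg=20, llm_daily_dose_mg=300):
    print("Outside verified range -- confirm with pharmacy before prescribing")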

Retrieval-Augmented Generation (RAG) for Healthcare

The first comprehensive review of RAG for healthcare applications (Ng et al., NEJM AI, 2025) examines how retrieval-augmented generation addresses three core LLM limitations:

The Three Problems RAG Addresses:

  • Outdated information: retrieves from current knowledge bases, bypassing the training cutoff
  • Hallucinations: grounds responses in retrieved documents, enabling source verification
  • Reliance on public data: can query institutional guidelines, formularies, and proprietary sources

How RAG Works:

  1. Retrieval: When a query arrives, the system searches a curated knowledge base (guidelines, textbooks, institutional protocols)
  2. Augmentation: Retrieved passages are provided to the LLM as context
  3. Generation: LLM generates response grounded in retrieved documents, with source citations
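
The retrieve → augment → generate loop above can be sketched in a few lines of Python. This is an illustrative toy, not a production pattern: the knowledge base snippets, the keyword-overlap retrieval, and the call_llm placeholder are all assumptions standing in for a real vector store and a HIPAA-compliant LLM endpoint.

# Toy sketch of retrieval-augmented generation (RAG).
KNOWLEDGE_BASE = {
    "afib_anticoag": "Institutional guideline: offer oral anticoagulation when CHA2DS2-VASc >= 2 ...",
    "cap_treatment": "Institutional guideline: first-line therapy for community-acquired pneumonia ...",
}

def retrieve(query, k=1):
    # 1. Retrieval: naive keyword-overlap scoring against curated guideline snippets
    scored = sorted(
        KNOWLEDGE_BASE.values(),
        key=lambda doc: len(set(query.lower().split()) & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query):
    # 2. Augmentation: retrieved passages are supplied to the LLM as explicit context
    context = "\n".join(retrieve(query))
    return (
        "Answer using ONLY the sources below and cite them. "
        "If the sources do not contain the answer, say so.\n\n"
        f"SOURCES:\n{context}\n\nQUESTION: {query}"
    )

# 3. Generation: the grounded prompt is sent to the LLM (placeholder shown here)
prompt = build_prompt("When is anticoagulation indicated for atrial fibrillation?")
print(prompt)  # in production: response = call_llm(prompt)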

Clinical Applications:

  • Guideline-grounded responses: RAG can pull from current treatment guidelines, reducing outdated recommendations
  • Institutional integration: Hospital-specific formularies and protocols as knowledge sources
  • Citation verification: Responses include source documents that users can verify
  • Pharmaceutical industry: Drug information queries grounded in package inserts and regulatory documents

Limitations:

  • RAG reduces but does not eliminate hallucinations. LLMs can still misinterpret or misstate retrieved content
  • Retrieval quality depends on knowledge base curation and query matching
  • Adds latency and infrastructure complexity compared to base LLMs
  • Retrieved context may be outdated if knowledge base isn’t maintained

Clinical Implication: When evaluating clinical LLM tools, ask whether they use RAG or similar grounding approaches. Systems that cite sources enable verification; those that don’t require more skepticism.

State of Clinical AI 2026: LLM Diagnostic Performance

The inaugural State of Clinical AI Report from the Stanford-Harvard ARISE network provides a nuanced assessment of LLM diagnostic capabilities.

Impressive benchmark results:

Several studies published in 2025 showed LLMs matching or outperforming physicians on diagnostic reasoning and treatment planning when evaluated on fixed clinical cases (Brodeur et al., 2025, preprint). In some papers, this performance was described as “superhuman.”

The reality check:

Performance depends heavily on how narrowly the problem is framed:

  • When models had to ask follow-up questions, manage incomplete information, or revise decisions as new details emerged, performance dropped (Johri et al., Nature Medicine, 2025)
  • On tests measuring reasoning under uncertainty, AI systems performed closer to medical students than experienced physicians (McCoy et al., NEJM AI, 2025)
  • Models tended to commit strongly to answers even when clinical ambiguity was high
  • Accuracy dropped 26-38% when familiar answer patterns were disrupted (Bedi et al., JAMA Network Open, 2025)

Why this matters for clinical practice:

In everyday medicine, uncertainty is common. The gap between performance on fixed exam questions and performance in ambiguous real-world scenarios is substantial. The report concludes that much of what looks impressive in headline-grabbing studies may not hold up in clinical practice.

Clinical implication: Use LLMs for brainstorming and drafts where you can verify output, not for situations requiring judgment under uncertainty.

When structured data outperforms LLMs: For clinical prediction tasks on structured EHR data, LLMs may underperform traditional approaches. In antimicrobial resistance prediction for sepsis, LLMs analyzing clinical notes achieved AUROC 0.74 compared to 0.85 for deep learning on structured EHR data, with combined approaches offering no improvement (Hixon et al., 2025, conference abstract). LLMs excel at language tasks, not structured clinical prediction.


Part 2: Major Failure Case Studies, Hallucination Disasters

Case 1: The Fabricated Oncology Protocol

Scenario (reported 2023): Physician asked GPT-4 for dosing protocol for pediatric acute lymphoblastic leukemia (ALL) consolidation therapy.

LLM response: Generated detailed protocol with drug names, dosages, timing that looked professionally formatted and authoritative.

The problem:

  • Methotrexate dose: 50 mg/m² (LLM) vs. actual protocol: 5 g/m² (a 100x difference)
  • Vincristine timing: weekly (LLM) vs. protocol: every 3 weeks during consolidation
  • Dexamethasone duration: 5 days (LLM) vs. protocol: 28 days

If followed without verification: Patient would have received 1% of intended methotrexate dose (treatment failure, disease progression) and excessive vincristine (neurotoxicity risk).

Why it happened: LLM trained on general medical text, not specialized oncology protocols. Generated plausible-sounding but incorrect regimen by combining fragments from different contexts.

The lesson: Never use LLMs for medication dosing without rigorous verification against authoritative sources (protocol handbooks, institutional guidelines, pharmacy consultation).

Case 2: The Confident Misdiagnosis

Scenario (published case study): Emergency physician used GPT-4 to generate differential diagnosis for “32-year-old woman with sudden-onset severe headache, photophobia, neck stiffness.”

LLM differential:

  1. Migraine (most likely)
  2. Tension headache
  3. Sinusitis
  4. Meningitis
  5. Subarachnoid hemorrhage

The actual diagnosis: Subarachnoid hemorrhage (SAH) from ruptured aneurysm.

The problem: LLM ranked benign diagnoses (migraine, tension headache) above life-threatening emergencies (SAH, meningitis) despite classic “thunderclap headache + meningeal signs” presentation.

Why it happened:

  • Training data bias: migraine is far more common than SAH in text corpora
  • LLMs predict based on frequency in training data, not clinical risk stratification
  • No understanding of the “rule out the worst case first” principle of emergency medicine

The lesson: LLMs don’t triage by clinical urgency or risk. Physician must apply clinical judgment to LLM suggestions.

What the physician did right: Used LLM as brainstorming tool, not autonomous diagnosis. Recognized high-risk presentation and ordered CT + LP appropriately.

Case 3: The Citation Fabrication Scandal

Scenario: Medical student submitted literature review using GPT-4 to generate citations supporting statements about hypertension management.

LLM-generated citations (examples):

  1. “Johnson et al. (2020). ‘Intensive blood pressure control in elderly patients.’ New England Journal of Medicine 383:1825-1835.”
  2. “Patel et al. (2019). ‘Renal outcomes with SGLT2 inhibitors in diabetic hypertension.’ Lancet 394:1119-1128.”

The problem: Neither paper exists. Authors, journals, years, page numbers all plausible but fabricated.

Discovery: Faculty advisor attempted to retrieve papers for detailed review. None found in PubMed, journal archives, or citation databases.

Consequences:

  • The student received a failing grade for academic dishonesty
  • Faculty implemented a “verify all LLM-generated citations” policy
  • The medical school updated its honor code to address AI-assisted writing

Why this matters:

  • Citation fabrication in grant applications = federal research misconduct
  • In publications = retraction, career damage
  • In clinical guidelines = propagation of misinformation

The lesson: Never trust LLM-generated citations. Always verify papers exist and actually support the claims attributed to them.
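
A lightweight first-pass check is to confirm that a cited title exists in PubMed at all before reading further. The sketch below uses NCBI’s public E-utilities esearch endpoint; field syntax and rate limits should be confirmed against current NCBI documentation, and a match still does not prove the paper supports the claim attributed to it.

# Minimal sketch: does this cited title exist in PubMed at all?
import json
import urllib.parse
import urllib.request

def pubmed_hits_for_title(title):
    params = urllib.parse.urlencode({
        "db": "pubmed",
        "term": f'"{title}"[Title]',   # restrict the search to the title field
        "retmode": "json",
    })
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?{params}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    return int(data["esearchresult"]["count"])

# Zero hits is a strong signal the citation is fabricated; any hits still require
# opening the paper to confirm it supports the claim attributed to it.
print(pubmed_hits_for_title("Intensive blood pressure control in elderly patients"))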

Case 4: The Medical Calculation Gap

The problem: Physicians routinely use clinical calculators (CHADS-VASc for stroke risk, Cockcroft-Gault for GFR, HEART score for chest pain triage, LDL calculations). These quantitative tools drive treatment decisions daily.

What the research shows: MedCalc-Bench, a benchmark from NIH researchers evaluating LLMs on 55 different medical calculator tasks across 1,000+ patient scenarios, found that the best-performing model (GPT-4 with one-shot prompting) achieved only 50.9% accuracy (Khandekar et al., NeurIPS 2024).

Three distinct failure modes:

  1. Knowledge errors (Type A): LLM doesn’t know the correct equation or rule
    • Example: Asked to calculate CHADS-VASc score, assigns wrong points to criteria
    • Most common error in zero-shot prompting (over 50% of mistakes)
  2. Extraction errors (Type B): LLM extracts wrong parameters from patient note
    • Example: Misidentifies patient age, medication history, or lab values from clinical narrative
    • 16-31% of errors depending on model
  3. Computation errors (Type C): LLM performs arithmetic incorrectly
    • Example: Calculates LDL as 142 mg/dL when correct answer is 128 mg/dL
    • 13-17% of errors even when equation and parameters are correct

Why this matters clinically:

Medical calculations drive treatment decisions:

  • CHADS-VASc ≥2 → anticoagulation for atrial fibrillation
  • eGFR <30 → medication dose adjustments
  • HEART score ≥4 → admission vs. discharge decision

50.9% accuracy means LLMs are flipping a coin on tasks with direct treatment implications.

The performance gap:

Anthropic has reported that Claude Opus 4.5 achieves 98.1% accuracy on MedCalc-Bench. This would represent a substantial improvement if independently validated. Key caveats:

  • Anthropic’s figure is company-reported, not peer-reviewed
  • The original research (GPT-4): 50.9% accuracy
  • The 98.1% claim requires independent replication

The lesson: Never trust LLM-generated medical calculations without verification. Check all risk scores, GFR calculations, and dosing adjustments against established calculators (MDCalc, online tools, pharmacy databases).

Clinical workflow:

  1. LLM suggests a calculation (e.g., “Patient’s CHADS-VASc score is 4”)
  2. Verify independently: use MDCalc or manual calculation
  3. If there is a mismatch: trust the verified calculation, not the LLM
  4. Document the verification in the clinical note
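
For the verification step, recomputing the score yourself (or in a few lines of code) is straightforward. The sketch below implements the standard CHA2DS2-VASc criteria; treat it as an illustration of independent verification and confirm against MDCalc or your institutional calculator before acting on the result. The patient values and the LLM’s claimed score are invented for the example.

# Independent CHA2DS2-VASc recalculation, compared against the LLM's claim.
def cha2ds2_vasc(age, female, chf, htn, diabetes, stroke_tia, vascular_disease):
    score = 0
    score += 1 if chf else 0                               # C: CHF / LV dysfunction
    score += 1 if htn else 0                               # H: hypertension
    score += 2 if age >= 75 else (1 if age >= 65 else 0)   # A2 / A: age
    score += 1 if diabetes else 0                          # D: diabetes mellitus
    score += 2 if stroke_tia else 0                        # S2: prior stroke/TIA/thromboembolism
    score += 1 if vascular_disease else 0                  # V: MI, PAD, or aortic plaque
    score += 1 if female else 0                            # Sc: sex category (female)
    return score

llm_claimed_score = 4  # what the LLM asserted
verified = cha2ds2_vasc(age=72, female=True, chf=False, htn=True,
                        diabetes=True, stroke_tia=False, vascular_disease=True)
if verified != llm_claimed_score:
    print(f"Mismatch: verified score {verified} vs LLM claim {llm_claimed_score} -- use the verified value")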


Part 3: The Success Story, Ambient Clinical Documentation

Nuance DAX: Ambient Documentation AI

The problem DAX solves: Physicians spend 2+ hours per day on documentation, often completing notes after-hours. EHR documentation contributes significantly to burnout.

How DAX works:

  1. The physician wears a microphone during the patient encounter
  2. DAX records the conversation (with patient consent)
  3. An LLM transcribes the speech and converts it into a structured clinical note
  4. The note appears in the EHR for physician review and editing
  5. The physician reviews, makes corrections, and signs the note

Evidence base:

Regulatory status: Not FDA-regulated. Falls under CDS (Clinical Decision Support) exemption per 21st Century Cures Act because it generates documentation drafts, not diagnoses or treatment recommendations, and physicians independently review all output.

Clinical validation: Nuance-sponsored study (2023), 150 physicians, 5,000+ patient encounters:

  • Documentation time reduction: 50% (mean 5.5 min → 2.7 min per encounter)
  • Physician satisfaction: 77% would recommend it to colleagues
  • Note quality: no significant difference from physician-written notes (blinded expert review)
  • Error rate: 0.3% factual errors requiring correction (similar to the baseline physician error rate in dictation)

Real-world deployment:

  • 550+ health systems
  • 35,000+ clinicians using DAX
  • 85% user retention after 12 months

Cost-benefit:

  • DAX subscription: ~$369-600/month per physician (varies by contract; $700 one-time implementation fee)
  • Time savings: 1 hour/day × $200/hour physician cost = $4,000/month saved
  • ROI: positive in 1-3 months, depending on encounter volume

Pricing source: DAX Copilot pricing page, January 2026. Costs vary by volume and contract terms.

Why this works:

  • Well-defined task (transcription + note structuring)
  • Physician review catches errors before the note is finalized
  • Integration with the EHR workflow
  • Patient consent obtained upfront
  • HIPAA-compliant (BAA with healthcare systems)

Limitations:

  • Requires patient consent (some decline)
  • Poor audio quality → transcription errors
  • Complex cases with multiple topics may require substantial editing
  • Subscription cost is a barrier for small practices

Abridge: AI-Powered Medical Conversations

A similar ambient documentation tool with comparable performance:

  • 65% documentation time reduction in pilot studies
  • Focuses on primary care and specialty clinics
  • Generates patient-facing visit summaries automatically

The lesson: When LLMs are used for well-defined tasks with physician oversight and proper integration, they deliver genuine value.

Emerging: LLM-Based Clinical Copilots

Beyond documentation, early evidence suggests LLMs may function as real-time clinical safety nets. In a preprint study of 39,849 patient visits at Penda Health clinics in Kenya, clinicians using an LLM copilot (GPT-4o integrated into EHR workflow) showed 16% fewer diagnostic errors and 13% fewer treatment errors compared to controls (Korom et al., 2025, preprint). The copilot flagged potential errors for clinician review using a tiered alert system (green/yellow/red severity), maintaining physician control while providing a second-opinion safety net.

Key caveats: This is a preprint from an OpenAI partnership (company-reported data), patient outcomes showed no statistically significant difference, and a randomized controlled trial is underway. See Primary Care AI for detailed implementation analysis.


Part 4: Appropriate vs. Inappropriate Clinical Use Cases

SAFE Uses (With Physician Oversight)

1. Clinical Documentation Assistance

Use cases:

  • Draft progress notes from dictation
  • Generate discharge summaries
  • Suggest ICD-10/CPT codes
  • Create procedure notes

Workflow:

  1. Physician provides input (dictation, conversation recording, bullet points)
  2. LLM generates a structured note
  3. Physician reviews every detail, edits errors, and adds clinical judgment
  4. Physician signs the final note

The Automation Bias Trap: When Review Becomes Rubber-Stamping

The dangerous reality: As AI accuracy improves, human vigilance drops. Studies show physicians begin “rubber-stamping” AI-generated content after approximately 3 months of successful use (Goddard et al., 2012).

The pattern:

  • Month 1: Physician carefully reviews every word, catches errors
  • Month 3: Physician skims notes, catches obvious errors
  • Month 6: Physician clicks “Sign” with minimal review, trusts the AI
  • Month 12: Errors slip through; patient harm becomes possible

Counter-measures to maintain vigilance:

  1. Spot-check protocol: Verify at least one specific data point per note (e.g., check one lab value, one medication dose, one vital sign against the record)
  2. Rotation strategy: Vary which section you scrutinize each encounter
  3. Red flag awareness: Know the AI’s failure modes (medication names, dosing, dates, rare conditions)
  4. Scheduled deep review: Once weekly, do a line-by-line audit of a randomly selected AI-generated note
  5. Error tracking: Log every error you catch; if catches drop to zero, you may have stopped looking

The uncomfortable truth: “Physician in the loop” only works if the physician is actually paying attention. The AI doesn’t get tired; you do.
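
Counter-measures 4 and 5 above are easy to operationalize. Below is a minimal Python sketch that samples a few AI-generated notes for weekly line-by-line audit and appends the errors caught to a simple log; the note identifiers and log format are illustrative placeholders, not part of any vendor workflow.

# Weekly audit sampler: pick random AI-generated notes for line-by-line review
# and log the errors caught, so a sudden drop to zero prompts you to check
# whether you have stopped looking.
import csv
import random
from datetime import date

def select_notes_for_audit(note_ids, n=3, seed=None):
    rng = random.Random(seed)
    return rng.sample(note_ids, k=min(n, len(note_ids)))

def log_audit(note_id, errors_found, path="ai_note_audit_log.csv"):
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([date.today().isoformat(), note_id, errors_found])

this_weeks_ai_notes = ["note-1042", "note-1043", "note-1044", "note-1051", "note-1057"]
for note_id in select_notes_for_audit(this_weeks_ai_notes):
    # ... perform the line-by-line review, then record how many errors you caught
    log_audit(note_id, errors_found=0)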

Risk mitigation:

  • Physician remains legally responsible for note content
  • Review catches hallucinations, errors, and omissions
  • HIPAA-compliant systems only

Evidence: 50% time savings documented in multiple studies (see DAX above)

2. Literature Synthesis and Summarization

Use cases:

  • Summarize clinical guidelines
  • Compare treatment options from multiple sources
  • Generate literature review outlines
  • Identify relevant studies for research questions

Workflow:

  1. Provide the LLM with a specific question and context
  2. Request a summary with citations
  3. Verify all citations exist and support the claims
  4. Cross-check medical facts against primary sources

Example prompt:

"Summarize the 2023 AHA/ACC guidelines for management
of atrial fibrillation, focusing on anticoagulation
recommendations for patients with CHADS-VASc ≥2.
Include specific drug dosing and monitoring requirements.
Cite specific guideline sections."

Risk mitigation:

  • Verify citations before relying on the summary
  • Cross-check facts with the original guidelines
  • Use as a starting point, not a final analysis

3. Patient Education Materials

Use cases:

  • Explain diagnoses in health literacy-appropriate language
  • Create discharge instructions
  • Draft procedure consent explanations
  • Translate medical jargon into plain language

Workflow:

  1. Specify reading level, key concepts, and patient concerns
  2. LLM generates a draft
  3. Physician reviews for medical accuracy
  4. Physician edits for cultural sensitivity and individual patient factors
  5. Physician shares the handout with the patient

Example prompt:

"Create a patient handout about type 2 diabetes management
for a patient with 6th grade reading level. Cover: medication
adherence, blood sugar monitoring, dietary changes, exercise.
Use simple language, avoid jargon, 1-page limit."

Risk mitigation:

  • Fact-check all medical information
  • Customize to the individual patient (the LLM generates generic content)
  • Consider health literacy and cultural factors

4. Differential Diagnosis Brainstorming

Use cases:

  • Generate possibilities for complex cases
  • Identify rare diagnoses to consider
  • Broaden the differential when stuck

Workflow:

  1. Provide a detailed clinical vignette
  2. Request a differential with reasoning
  3. Treat the output as idea generation, not diagnosis
  4. Pursue the appropriate diagnostic workup based on clinical judgment

Example prompt:

"Generate differential diagnosis for 45-year-old woman
with 3 months of progressive dyspnea, dry cough, and
fatigue. Exam: fine bibasilar crackles, no wheezing.
CXR: reticular infiltrates. Consider both common and
rare etiologies. Provide likelihood and key diagnostic
tests for each."

Risk mitigation:

  • The LLM differential is brainstorming, not diagnosis
  • Verify that each possibility is clinically plausible for the patient
  • Pursue workup based on pretest probability, not the LLM’s ranking

5. Medical Coding Assistance

Use cases:

  • Suggest ICD-10/CPT codes from clinical notes
  • Identify documentation gaps for proper coding
  • Check code appropriateness

Workflow:

  1. LLM analyzes the clinical note
  2. LLM suggests codes with reasoning
  3. A coding specialist or physician reviews
  4. Reviewer confirms the codes match the care delivered and the documentation

Risk mitigation:

  • Compliance review is essential (fraudulent coding = federal offense)
  • Physician confirms the codes represent actual care
  • Regular audits of LLM-suggested codes

DANGEROUS Uses (Do NOT Do)

1. Autonomous Patient Advice

Why dangerous:

  • Patients ask LLMs medical questions without physician involvement
  • LLMs provide confident answers regardless of accuracy
  • Patients may delay appropriate care based on false reassurance

Documented harms:

  • A patient with chest pain asked ChatGPT “Is this heartburn or a heart attack?”
  • ChatGPT suggested antacids (without seeing the patient, knowing the history, or performing an exam)
  • The patient delayed the ER visit by 6 hours and presented with a STEMI

The lesson: Patients will use LLMs for medical advice regardless of physician recommendations. Educate patients about limitations, encourage them to contact you rather than rely on AI.

See Also: Major Health AI Product Launches (January 2026)

Both OpenAI and Anthropic launched dedicated healthcare products in January 2026; the safety considerations for patient-facing tools such as ChatGPT for Health are discussed in the Patient-Facing AI section below.

2. Medication Dosing Without Verification

Why dangerous:

  • LLMs fabricate plausible but incorrect dosages
  • Pediatric dosing is especially error-prone
  • Drug interaction checking is unreliable

Documented near-miss:

  • A physician asked GPT-4 for vancomycin dosing in renal failure
  • The LLM suggested a dose appropriate for normal renal function
  • A pharmacist caught the error before administration

The lesson: Never use LLM-generated medication dosing without verification against pharmacy databases, dose calculators, or pharmacist consultation.

Medical calculations beyond dosing:

The quantitative reasoning gap extends beyond medication dosing to all medical calculators:

  • Risk scores: CHADS-VASc, HEART score, Caprini VTE risk
  • GFR calculations: Cockcroft-Gault, MDRD equations
  • Lab-derived values: LDL calculation, anion gap
  • Clinical indices: Pneumonia severity index, Wells’ criteria

NIH research found LLMs achieve only 50.9% accuracy on medical calculation tasks, with three failure patterns: wrong equations, parameter extraction errors, and arithmetic mistakes (Khandekar et al., NeurIPS 2024).

The lesson: Verify all LLM calculations against established medical calculators (MDCalc, institutional tools, pharmacy databases). See Case 4: The Medical Calculation Gap for detailed failure modes.

3. Urgent or Emergent Clinical Decisions

Why dangerous:

  • Time pressure precludes adequate verification
  • High stakes magnify the consequences of errors
  • Clinical judgment + experience > LLM statistical patterns

The lesson: In emergencies, rely on clinical protocols, expert consultation, established guidelines, not LLM brainstorming.

4. Generating Citations Without Verification

Why dangerous:

  • LLMs fabricate 15-30% of medical citations
  • Using fake references = academic dishonesty, research misconduct
  • Propagates misinformation if not caught

The lesson: Never include LLM-generated citations in manuscripts, grants, presentations without verifying papers exist and support the claims.


Part 5: Prompting Techniques and Evidence-Based Approaches

Well-crafted prompts significantly improve LLM output quality. A scoping review of 114 prompt engineering studies found that structured prompting techniques can improve task performance substantially compared to naive prompts (Zaghir et al., 2024).

Core Prompting Paradigms

Zero-Shot Prompting

The simplest approach: ask a question without examples.

"What are the first-line treatments for community-acquired pneumonia
in an otherwise healthy adult?"

When to use: Simple factual questions, initial exploration, low-stakes queries

Limitations: Less reliable for complex reasoning, nuanced clinical scenarios, or specialized domains

Few-Shot Prompting

Provide examples of desired input-output pairs before your actual question.

Example 1:
Patient: 65-year-old male, chest pain radiating to left arm, diaphoresis
Assessment: High concern for ACS, recommend immediate ECG and troponins

Example 2:
Patient: 28-year-old female, sharp chest pain worse with inspiration
Assessment: Consider pleurisy, PE, or musculoskeletal cause

Now assess:
Patient: 72-year-old female with diabetes, fatigue and jaw pain for 2 days

When to use: When you need consistent output format, domain-specific reasoning patterns, or specialized terminology

Evidence: LLMs enhanced with clinical practice guidelines via few-shot prompting showed improved performance across GPT-4, GPT-3.5 Turbo, LLaMA, and PaLM 2 compared to zero-shot baselines (Oniani et al., 2024)

Chain-of-Thought (CoT) Prompting

Request step-by-step reasoning rather than direct answers.

"A 58-year-old man presents with progressive dyspnea and bilateral
leg edema. EF is 35%. Think through this step-by-step:
1) What are the key clinical findings?
2) What is the most likely primary diagnosis?
3) What additional workup is needed?
4) What are the initial management priorities?"

When to use: Complex diagnostic reasoning, treatment planning, cases with multiple interacting factors

Evidence: Chain-of-thought prompting allows GPT-4 to mimic clinical reasoning processes while maintaining diagnostic accuracy, improving interpretability (Savage et al., 2024)

When Chain-of-Thought Backfires

CoT is not universally beneficial. Tasks requiring implicit pattern recognition, exception handling, or subtle statistical learning may show reduced performance with CoT prompting. An NEJM AI study found that reasoning-optimized models showed overconfidence and premature commitment to incorrect hypotheses in clinical scenarios requiring flexibility under uncertainty (NEJM AI, 2025).

Practical implication: Use CoT for systematic diagnostic workups; avoid it for gestalt pattern recognition or rapid triage decisions where experienced clinicians rely on intuition.

Structured Clinical Reasoning Prompts

Organize clinical information into predefined categories before requesting analysis.

PATIENT INFORMATION:
- Age/Sex: 45-year-old female
- Chief Complaint: Progressive fatigue x 3 months

HISTORY:
- Duration: 3 months, gradual onset
- Associated: Weight gain, cold intolerance, constipation
- PMH: Type 2 diabetes, hypertension

PHYSICAL EXAM:
- VS: BP 142/88, HR 58, afebrile
- General: Appears fatigued, dry skin, periorbital edema

LABS:
- TSH: 12.4 mIU/L (0.4-4.0)
- Free T4: 0.6 ng/dL (0.8-1.8)

Based on this structured information, provide:
1. Primary diagnosis with reasoning
2. Differential diagnoses to consider
3. Recommended next steps

Evidence: Structured templates that organize clinical information before diagnosis improve LLM diagnostic capabilities compared to unstructured narratives (Sonoda et al., 2024)

Practical Prompting Framework (R-C-T-C-F)

For clinical prompts, include these components:

  • Role: define the LLM’s expertise level (example: “You are an internal medicine attending…”)
  • Context: provide relevant background (example: “…reviewing a case for morning report…”)
  • Task: specify exactly what you need (example: “…generate a differential diagnosis…”)
  • Constraints: set boundaries and requirements (example: “…focusing on reversible causes, avoiding rare conditions…”)
  • Format: specify the output structure (example: “…as a numbered list with likelihood estimates.”)

Poor prompt:

"What's wrong with this patient?"

Effective prompt:

"You are an internal medicine attending reviewing a case for
teaching purposes. A 55-year-old woman presents with fatigue,
unintentional weight loss of 15 lbs over 3 months, and new-onset
diabetes. Generate a differential diagnosis focusing on malignancy
and endocrine causes. Format as a numbered list with brief
reasoning for each, ordered by likelihood."
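
If you issue similar prompts often, the R-C-T-C-F components can be captured in a small reusable template. The Python sketch below is one possible way to do this; the function name, field names, and example content are illustrative, not part of any published implementation of the framework.

# One possible reusable template for the R-C-T-C-F components.
def build_clinical_prompt(role, context, task, constraints, output_format):
    return f"You are {role}. {context} {task} {constraints} {output_format}"

prompt = build_clinical_prompt(
    role="an internal medicine attending",
    context="You are reviewing a case for teaching purposes.",
    task=("A 55-year-old woman presents with fatigue, unintentional weight loss of "
          "15 lbs over 3 months, and new-onset diabetes. Generate a differential "
          "diagnosis focusing on malignancy and endocrine causes."),
    constraints="Avoid rare conditions unless they are clinically actionable.",
    output_format="Format as a numbered list with brief reasoning, ordered by likelihood.",
)
print(prompt)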

What the Evidence Shows

  • Zero-shot: best for simple queries and exploration; moderate evidence (baseline in most studies)
  • Few-shot: best for consistent formatting and specialized domains; strong evidence (Oniani et al., 2024)
  • Chain-of-thought: best for complex reasoning and teaching; strong evidence, with caveats (Savage et al., 2024)
  • Structured templates: best for diagnostic workups; moderate evidence (Sonoda et al., 2024)

Common Prompting Mistakes

  1. Vague requests: “Analyze this” vs. “Calculate the CHADS-VASc score and recommend anticoagulation”
  2. Missing context: Asking about drug dosing without patient weight, renal function, or indication
  3. Overloading: Combining multiple complex tasks in one prompt (ask sequentially instead)
  4. Assuming knowledge: LLMs may not know your institution’s specific protocols or formulary
  5. Skipping verification: Even excellent prompts produce outputs requiring clinical validation

Further Reading

  • Meskó, 2023: Tutorial on prompt engineering for medical professionals (JMIR)
  • Zaghir et al., 2024: Scoping review of 114 prompt engineering studies (JMIR)

Part 7: Vendor Evaluation Framework

Before Adopting an LLM Tool for Clinical Practice

Questions to ask vendors:

  1. “Is this system HIPAA-compliant? Can you provide a Business Associate Agreement?”
    • Essential for any system touching patient data
    • No BAA = no patient data entry
  2. “What is the LLM training data cutoff date?”
    • Cutoff dates vary by model and version (check vendor documentation)
    • Older cutoff = more outdated medical knowledge
    • Models with web search can access current information but still require verification
  3. “What peer-reviewed validation studies support clinical use?”
    • Demand JAMA, NEJM, Nature Medicine publications
    • User satisfaction ≠ clinical validation
    • Ask for prospective studies, not just retrospective benchmarks
  4. “What is the hallucination rate for medical content?”
    • If vendor can’t quantify, they haven’t tested rigorously
    • Accept that hallucinations are unavoidable; question is frequency
    • Rates vary dramatically by task: Clinical note summarization shows ~1.5% hallucination rates (Asgari et al., 2025); reference generation in systematic reviews reaches 28-39% (Chelli et al., 2024). RAG-augmented systems can reduce rates to 0-6%
    • Stanford Medicine’s ChatEHR reported 0.73 hallucinations + 1.60 inaccuracies per summarization across 23,000 sessions (Shah et al., 2026, manuscript)
  5. “How does the system handle uncertainty?”
    • Good LLMs express appropriate uncertainty (“I’m not certain, but…”)
    • Bad LLMs confidently hallucinate when uncertain
  6. “What verification/oversight mechanisms are built into the workflow?”
    • Best systems require physician review before acting on LLM output
    • Dangerous systems allow autonomous LLM actions
  7. “How does this integrate with our EHR?”
    • Practical integration essential for adoption
    • Clunky workarounds fail
  8. “What is the cost structure and ROI evidence?”
    • Subscription per physician? API usage fees?
    • Request time-savings data, physician satisfaction metrics
  9. “What testing validates consistency of outputs across multiple runs?”
    • Ask for reproducibility data: given the same input, how often does the output differ? (a minimal test sketch follows this list)
    • Critical for clinical decisions where consistency matters (dosing, treatment recommendations)
    • If vendor hasn’t tested, they haven’t validated for clinical use
  10. “Does your malpractice insurance explicitly cover LLM use?”
    • Many policies exclude AI-related claims or require explicit rider
    • Ask insurer directly, don’t rely on vendor assurances
    • Request coverage confirmation in writing before deployment
  11. “Who is liable if LLM output causes patient harm?”
    • Most vendors disclaim liability in contracts
    • Physician/institution bears risk
  12. “What data is retained, and can patients opt out?”
    • Data retention policies
    • Patient consent/opt-out mechanisms
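
For question 9, a do-it-yourself reproducibility check is simple to run before deployment. The sketch below sends an identical, fully de-identified prompt several times and counts the distinct outputs; call_llm is a placeholder for your vendor’s HIPAA-compliant endpoint, since no specific API is assumed here.

# Minimal reproducibility test: same prompt N times, count distinct outputs.
from collections import Counter

def call_llm(prompt):
    raise NotImplementedError("Replace with your vendor's HIPAA-compliant API call")

def reproducibility_report(prompt, runs=5):
    outputs = [call_llm(prompt) for _ in range(runs)]
    distinct = Counter(outputs)
    print(f"{len(distinct)} distinct outputs across {runs} identical runs")
    for text, count in distinct.most_common():
        print(f"  {count}x: {text[:80]}...")

# Example (de-identified prompt only -- never paste PHI into a test harness):
# reproducibility_report("What is the CHA2DS2-VASc score for a 72-year-old woman "
#                        "with hypertension and diabetes?")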

Red Flags (Walk Away If You See These)

  1. No HIPAA compliance for clinical use (public ChatGPT marketed for medical decisions)
  2. Claims of “replacing physician judgment” (LLMs assist, don’t replace)
  3. No prospective clinical validation (only benchmark exam scores)
  4. Autonomous actions without physician review (medication ordering, diagnosis without oversight)
  5. Vendor refuses to discuss hallucination rates (hasn’t tested or hiding poor performance)

Part 8: Cost-Benefit Reality

What Does LLM Technology Cost?

Ambient documentation (Nuance DAX, Abridge):

  • Cost: ~$369-600/month per physician (varies by contract and volume)
  • Benefit: 1 hour/day time savings × $200/hour = $4,000/month
  • ROI: positive in 1-3 months
  • Non-monetary benefit: reduced burnout, improved work-life balance

GPT-4 API (HIPAA-compliant):

  • Cost: ~$0.03 per 1,000 input tokens, $0.06 per 1,000 output tokens
  • Typical clinical note: 500 tokens input, 1,000 output = $0.075 per note
  • At 20 notes/day: $1.50/day ≈ $30/month (cheaper than a subscription)
  • But: requires technical integration and institutional IT support
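
The per-note arithmetic above is easy to recompute for your own volumes. The sketch below simply restates that arithmetic in Python; the per-token prices are the illustrative figures quoted in this section and should be checked against the vendor’s current price sheet.

# Restating the per-note API cost arithmetic for local volumes.
INPUT_PRICE_PER_1K = 0.03    # USD per 1,000 input tokens (illustrative figure from this section)
OUTPUT_PRICE_PER_1K = 0.06   # USD per 1,000 output tokens (illustrative figure from this section)

def note_cost(input_tokens, output_tokens):
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

per_note = note_cost(input_tokens=500, output_tokens=1000)   # $0.075
per_month = per_note * 20 * 20                               # 20 notes/day x ~20 workdays
print(f"${per_note:.3f} per note, about ${per_month:.0f} per month")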

Glass Health (LLM clinical decision support):

  • Cost: free tier available, paid tiers ~$100-300/month
  • Benefit: differential diagnosis brainstorming, treatment suggestions
  • ROI: unclear; depends on how often it is used for complex cases

Epic LLM integration (message drafting, note summarization):

  • Cost: bundled into EHR licensing for institutions
  • Benefit: incremental time savings across multiple workflows

Do These Tools Save Money?

Ambient documentation: YES

  • 50% time savings is substantial
  • Reduced after-hours charting improves physician well-being
  • Cost-effective based on time saved
  • Caveat: requires a subscription commitment; per-physician cost limits small-practice adoption

API-based documentation assistance: MAYBE

  • Much cheaper than subscriptions (~$30/month vs. $400-600/month)
  • But requires IT infrastructure and integration effort
  • ROI depends on institutional technical capacity

Literature summarization: UNCLEAR

  • Time savings are real (10 minutes to read a guideline vs. 2 minutes to review an LLM summary)
  • But the risk of hallucinations means verification is still required
  • Net time savings are modest

Patient education generation: PROBABLY

  • Faster than writing from scratch
  • But requires physician review
  • Best for high-volume needs (discharge instructions, common diagnoses)


Part 9: The Future of Medical LLMs

What’s Coming in the Next 3-5 Years

Likely developments:

  1. EHR-integrated LLMs become standard
    • Epic, Cerner, Oracle already deploying
    • Message drafting, note summarization, coding assistance
    • HIPAA-compliant by design
  2. Multimodal medical LLMs
    • Text + images + lab data + genomics
    • “Show me this rash” + clinical history → differential diagnosis
    • Radiology report + imaging → integrated assessment
  3. Reduced hallucinations
    • Retrieval-augmented generation (LLM + medical database lookup)
    • Better uncertainty quantification
    • Improved factuality through constrained generation
  4. Prospective clinical validation
    • RCTs showing improved outcomes (not just time savings)
    • Cost-effectiveness analyses
    • Comparative studies (LLM-assisted vs. standard care)
  5. Regulatory clarity
    • FDA guidance on LLM medical devices
    • State medical board policies on LLM use
    • Malpractice liability precedents
  6. Open-weight models democratizing global access
    • DeepSeek, Llama 3, Mistral offer computational efficiency for resource-constrained settings
    • Over 300 hospitals in China deployed DeepSeek since January 2025 for clinical decision support
    • Cost: 6.71% of proprietary models (OpenAI o1) with comparable performance on medical benchmarks
    • Critical caveat: Deployment at scale without prospective clinical validation raises safety concerns
    • See AI and Global Health Equity for implementation guidance

Unlikely (despite hype):

  1. Fully autonomous diagnosis/treatment
    • Too high-stakes for pure LLM decision-making
    • Human oversight will remain essential
  2. Complete elimination of hallucinations
    • Fundamental to how LLMs work
    • Mitigation, not elimination, is realistic goal
  3. Replacement of physician-patient relationship
    • LLMs assist communication, don’t replace human connection
    • Empathy, trust, shared decision-making remain human domains

Modern Clinical Decision Support: Evidence Synthesis Tools

The clinical decision support landscape has shifted dramatically with the emergence of large language model-based tools. While traditional CDS focused on rule-based alerts and drug interaction warnings, modern systems provide evidence synthesis, clinical reasoning support, and real-time guideline retrieval. Adoption has been rapid but evidence of clinical impact remains limited.

Adoption Landscape

OpenEvidence, launched in 2024, reached over 40% daily use among U.S. physicians by late 2025, handling over 8.5 million clinical consultations monthly (OpenEvidence, December 2025). The platform uses large language models to synthesize medical literature, clinical guidelines, and drug databases in response to clinical queries.

Other widely adopted tools include:

  • ChatGPT (16% of physicians): general-purpose LLM used for clinical queries despite not being health-specific
  • Abridge (5%): ambient clinical documentation
  • Claude (3%): general-purpose LLM
  • DAX Copilot (2.4%): Microsoft’s ambient documentation system
  • Doximity GPT (1.8%): clinical decision support embedded in a physician networking platform

(Axios, December 2025)

Growth trajectory: OpenEvidence grew 2,000%+ year-over-year, adding 65,000 new verified U.S. clinician registrations monthly. This represents the fastest adoption of any physician-facing application in history.

Integration Patterns

EHR-embedded deployment: Health systems have begun integrating evidence synthesis tools as SMART on FHIR apps within Epic, allowing physicians to query evidence without leaving the EHR workflow. This addresses a critical barrier: context switching between EHR and external tools.

Point-of-care use: Unlike traditional CDS that interrupts workflow with alerts, modern tools are pull-based. Clinicians query them when needed rather than receiving unsolicited pop-ups. This reduces alert fatigue but requires clinician initiative.

Evidence Gaps and Adoption Barriers

Despite widespread use, physicians express significant concerns. According to AMA surveys, nearly half of physicians (47%) ranked increased oversight as their top regulatory priority for AI tools, and 87% cited not being held liable for AI model errors as a critical factor for adoption (AMA, February 2025). Key concerns include accuracy and misinformation risk, lack of explainability, and legal liability.

The Stanford-Harvard ARISE assessment (2026): The State of Clinical AI Report reviewed 500+ clinical AI studies and found that while these tools handle millions of consultations monthly, rigorous outcome studies remain rare. Most evaluation relies on user satisfaction and engagement metrics rather than patient outcomes or diagnostic accuracy improvements.

Key concern: These tools often provide synthesized information with high confidence but limited transparency about source quality, evidence strength, or knowledge cutoff dates. When asked about emerging diseases, recent guideline updates, or off-label uses, LLM-based systems may hallucinate references or conflate older evidence with current recommendations.

When Modern CDS Works: Evidence Synthesis at Scale

Successful use case: Clinical guideline synthesis

OpenEvidence excels at queries like “What are the current USPSTF recommendations for colorectal cancer screening in average-risk adults?” where:

  • Guidelines are publicly available and well-established
  • The question has a clear, documented answer
  • Timeliness matters but changes are infrequent
  • Synthesis across multiple sources adds value

Problematic use case: Rare disease diagnosis

The same tools struggle with queries like “What’s the differential diagnosis for this constellation of symptoms?” where:

  • The medical literature is vast and pattern matching fails
  • Rare presentations require systematic reasoning, not synthesis
  • Hallucination risk is high when training data is sparse
  • Clinical judgment and experience are essential

Clinical Practice Implications

Hospital and clinic adoption: Healthcare systems increasingly use these tools for:

  • Point-of-care guideline lookup (e.g., “What are current AHA recommendations for NSTEMI antiplatelet therapy?”)
  • Drug information queries (e.g., “What’s the renal dosing adjustment for vancomycin?”)
  • Evidence synthesis for clinical decision-making

Risks in resource-limited settings: When CDS tools are the primary source of clinical guidance (e.g., rural hospitals without on-site specialists, solo practices), hallucinated information or outdated recommendations can propagate without detection. Unlike well-staffed academic centers where multiple physicians cross-check recommendations, single-provider settings lack redundancy.

Lessons from ARISE: Clinician-AI Collaboration Outperforms Replacement

The ARISE report found consistent evidence that AI-assisted care outperforms either AI alone or clinicians alone:

  • Radiologists consulting optional AI detected more breast cancers without increasing false positives
  • Physicians using AI plus standard resources made better treatment decisions than control groups

However, collaboration is not yet optimized. Deskilling remains a real concern: if clinicians defer to AI recommendations without understanding the reasoning, diagnostic skills atrophy. The optimal collaboration pattern requires:

  1. AI provides evidence synthesis, not definitive answers
  2. Clinician integrates context: patient preferences, comorbidities, social determinants
  3. Uncertainty is explicit: AI indicates confidence and knowledge gaps
  4. Auditability: Clinician can trace recommendations to source evidence

Comparison to Traditional CDS Failures

Modern evidence synthesis tools avoid some failure modes of traditional CDS:

Epic sepsis model:

  • Proprietary, black-box algorithm
  • High false positive rate (88%)
  • Alert fatigue disaster
  • No outcome benefit despite massive deployment

OpenEvidence/modern CDS:

  • Pull-based (clinician-initiated) rather than push-based (unsolicited alerts)
  • No false positives from unsolicited alerts
  • Transparency varies (some cite sources, others don’t)
  • Outcome evidence still absent

Shared risk: Both can be deployed at scale without rigorous external validation. Traditional CDS faced regulatory gaps (EHR-embedded tools avoided FDA oversight). Modern CDS faces similar issues: general-purpose LLMs marketed for clinical use aren’t regulated as medical devices.

Open Questions

  1. Outcome measurement: Does AI-assisted clinical decision-making improve patient outcomes? Reduce diagnostic errors? Shorten time to treatment?

  2. Liability: When AI provides incorrect information that a clinician follows, who is liable? The tool vendor? The clinician? The health system?

  3. Equity: Do these tools work equally well for questions about diseases affecting underrepresented populations? For conditions primarily researched in high-income countries?

  4. Knowledge currency: How do these tools handle emerging evidence? COVID-19 revealed that guidelines changed weekly. LLMs trained on historical data can’t capture real-time updates without continuous retraining or retrieval-augmented generation.

For comprehensive vendor evaluation criteria, see Appendix: Vendor Evaluation Framework. For regulatory frameworks, see Evaluating AI Clinical Decision Support Systems.


Patient-Facing AI: Unique Safety Considerations

AI systems increasingly interact directly with patients through chatbots, symptom checkers, health education platforms, and digital assistants. These applications present distinct safety challenges compared to clinician-facing tools because patients lack clinical training to identify AI errors and may act on incorrect information without professional oversight.

The Evidence Gap

The Stanford-Harvard ARISE State of Clinical AI Report (2026) found that patient-facing AI evaluation relies primarily on engagement metrics (user satisfaction, session length, return visits) rather than outcome-focused evidence. Few studies measure whether these tools improve health outcomes, reduce diagnostic errors, or facilitate appropriate care escalation (Stanford Medicine, January 2026).

What’s measured:

  • User satisfaction scores (typically 70-85%)
  • Engagement rates (time spent, return visits)
  • Completion rates for health assessments

What’s rarely measured:

  • Diagnostic accuracy for patient-reported symptoms
  • Appropriate triage and escalation to human care
  • Patient safety outcomes (delayed diagnosis, inappropriate self-treatment)
  • Health literacy impact (do users understand AI limitations?)

Risk Categories

1. Misplaced Patient Trust

Patients often cannot distinguish between:

  • Evidence-based health information
  • AI-generated plausible-sounding misinformation
  • General wellness advice vs. medical recommendations requiring professional oversight

Example failure mode: Patient uses symptom checker for chest pain. AI suggests gastroesophageal reflux (most common cause statistically) and recommends antacids. Patient self-treats for 3 days. Actual diagnosis: acute coronary syndrome. Outcome: Delayed treatment, preventable myocardial damage.

Why this happens: LLMs optimize for plausible responses, not safety. Chest pain + young patient + no risk factors → statistically likely to be benign. But rare dangerous causes require different reasoning: maximize safety, not likelihood.

2. Delayed Escalation to Professional Care

Patient-facing AI may inadvertently discourage appropriate care-seeking:

  • Chatbot provides reassurance for concerning symptoms
  • AI suggests home remedies for conditions requiring clinical evaluation
  • User perceives AI response as definitive medical advice

ARISE notes this as a critical gap: “evaluation frameworks must focus on outcomes rather than engagement alone, with particular attention to escalation pathways and safety rails for high-risk scenarios.”

3. Health Literacy and Informed Consent

Patients using AI health tools often don’t understand:

  • These are not medical devices (most aren’t FDA-regulated)
  • Recommendations aren’t reviewed by healthcare professionals
  • AI can hallucinate references, statistics, or treatment guidelines
  • Knowledge cutoff dates mean recent information may be missing

Current state: Few patient-facing AI systems explicitly disclose limitations, error rates, or when to seek professional care instead. Terms of service often include liability disclaimers buried in legal text.

Case Example: ChatGPT for Health Education

OpenAI announced ChatGPT for Health in January 2026, positioning it as a health education tool. The system answers health questions using GPT-4-level reasoning but without clinical validation or regulatory oversight.

Intended use: General health education, wellness information, understanding diagnoses

Actual use: Patients report using it for:

  • Self-diagnosis of symptoms
  • Medication guidance
  • Treatment decisions (e.g., whether to seek emergency care)
  • Second opinions on physician recommendations

Safety gap: No evidence that the system can reliably distinguish:

  • Emergency symptoms requiring immediate care
  • Serious conditions requiring professional evaluation within days
  • Self-limiting conditions safe to monitor at home

The system provides confident-sounding responses regardless of uncertainty, creating a false sense of security.

Physician Perspective: Managing Patient AI Use

Common scenario: Patient arrives with printout from ChatGPT suggesting diagnoses and treatments.

Challenges:

  • Correcting misinformation takes clinical time
  • Undermines physician-patient trust if AI contradicts physician
  • Patients may doctor-shop if physician disagrees with AI
  • Liability concerns if patient follows AI advice instead of medical recommendation

Communication strategies:

  • Acknowledge patient’s research initiative
  • Explain AI limitations (hallucinations, lack of individualization)
  • Review AI recommendations together, correct errors
  • Emphasize importance of individualized care vs. generic advice
  • Document patient’s AI use and physician guidance in chart

ARISE Recommendations

The ARISE report calls for:

  1. Clearer evidence requirements before widespread patient-facing deployment
  2. Stronger escalation pathways to human clinical oversight for concerning symptoms
  3. Evaluation frameworks focused on outcomes, not engagement:
    • Does AI improve health-seeking behavior for serious conditions?
    • Does AI reduce unnecessary ED visits for benign conditions?
    • Does AI correctly identify high-risk scenarios requiring immediate care?
  4. Transparency about limitations:
    • Explicit disclaimers about non-medical-device status
    • Clear guidance on when to seek professional care
    • Disclosure of knowledge cutoff dates and evidence quality

Safety Design Patterns

Pattern 1: Tiered Escalation

User query → AI assessment → Risk stratification:
- High risk (chest pain, severe headache, difficulty breathing) → Immediate redirect to 911 / emergency care
- Medium risk (persistent symptoms, worsening condition) → Prompt to schedule clinical visit within 24-48 hours
- Low risk (wellness, general information) → AI provides information with caveat to seek care if symptoms change
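A minimal sketch of Pattern 1, assuming hypothetical red-flag phrase lists and canned response text; a production triage layer would use a clinically validated rule set rather than keyword matching.

```python
from enum import Enum

class Risk(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

# Illustrative red-flag phrases only; not a validated triage vocabulary.
HIGH_RISK_PHRASES = ["chest pain", "difficulty breathing", "severe headache",
                     "suicidal", "stroke symptoms"]
MEDIUM_RISK_PHRASES = ["getting worse", "persistent", "for two weeks", "fever"]

def stratify(query: str) -> Risk:
    """Crude keyword-based risk stratification of a patient query."""
    q = query.lower()
    if any(p in q for p in HIGH_RISK_PHRASES):
        return Risk.HIGH
    if any(p in q for p in MEDIUM_RISK_PHRASES):
        return Risk.MEDIUM
    return Risk.LOW

def respond(query: str) -> str:
    risk = stratify(query)
    if risk is Risk.HIGH:
        # Redirect immediately; do not generate AI advice for emergencies.
        return ("Your symptoms may be an emergency. Call 911 or go to the "
                "nearest emergency department now.")
    if risk is Risk.MEDIUM:
        return ("These symptoms should be evaluated. Please schedule a "
                "clinical visit within 24-48 hours.")
    return ("General information only, not medical advice. "
            "Seek care promptly if symptoms change or worsen.")

print(respond("I've had chest pain since this morning"))
```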

Pattern 2: Uncertainty Disclosure

AI response includes:
- Confidence level (high/medium/low)
- Knowledge gaps ("I don't have information about interactions with your specific medications")
- Explicit limitations ("This is educational information, not medical advice")
- Actionable next steps ("If symptoms worsen or persist beyond X days, seek professional care")
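A minimal sketch of Pattern 2, assuming an illustrative response schema (the field names are not a standard); the point is that confidence, knowledge gaps, limitations, and next steps travel with every answer rather than being optional.

```python
from dataclasses import dataclass, field

@dataclass
class DisclosedResponse:
    """Bundles an educational answer with explicit uncertainty signals."""
    answer: str
    confidence: str                      # "high" | "medium" | "low"
    knowledge_gaps: list[str] = field(default_factory=list)
    next_steps: str = "If symptoms worsen or persist, seek professional care."

    def render(self) -> str:
        gaps = "".join(f"\n- {g}" for g in self.knowledge_gaps) or "\n- None identified"
        return (
            f"{self.answer}\n\n"
            f"Confidence: {self.confidence}\n"
            f"What I don't know:{gaps}\n"
            "This is educational information, not medical advice.\n"
            f"Next steps: {self.next_steps}"
        )

print(DisclosedResponse(
    answer="Most sore throats are viral and resolve within a week.",
    confidence="medium",
    knowledge_gaps=["Your vaccination history",
                    "Interactions with your current medications"],
).render())
```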

Pattern 3: Human-in-the-Loop for High Stakes

For scenarios with potential serious outcomes:

  • AI flags query as high-risk
  • Routes to nurse triage line or clinical decision support
  • Logs interaction for quality review
  • Does NOT provide definitive AI-generated recommendation without human oversight
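A minimal sketch of Pattern 3, assuming a hypothetical triage queue name and log format; the essential behavior is that a high-risk query produces a routing action and an audit record, not an AI-generated recommendation.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)

def handle_high_stakes(query: str, is_high_risk: bool) -> dict:
    """Route high-risk queries to human triage instead of returning an
    AI-generated recommendation, and log every decision for quality review."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "high_risk": is_high_risk,
    }
    if is_high_risk:
        record["action"] = "routed_to_nurse_triage"   # hypothetical queue name
        response = ("A nurse will review your question shortly. "
                    "If this is an emergency, call 911 now.")
    else:
        record["action"] = "ai_information_provided"
        response = "Educational information only; consult a clinician for medical advice."
    logging.info(json.dumps(record))  # audit trail for quality review
    return {"response": response, "audit_record": record}

handle_high_stakes("Crushing chest pressure and sweating for 30 minutes", is_high_risk=True)
```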

Failure Mode: The Safety Theatre Problem

Some patient-facing AI tools include disclaimers like “This is not medical advice” while functionally operating as medical decision support tools. This creates liability protection for vendors while failing to protect patients.

Example: Symptom checker provides differential diagnosis with probabilities, recommends specific tests, suggests when to seek care vs. self-treat. Footer says “Not medical advice - consult your doctor.” Patient acts on recommendations without professional consultation because AI seemed authoritative.

The disconnect: The legal disclaimer contradicts the functional design. If the tool isn’t meant for medical decision-making, why does it provide diagnoses and treatment recommendations?

Regulatory Gaps

Most patient-facing health AI isn’t regulated as medical devices because vendors market them as “wellness” or “educational” tools rather than diagnostic systems. This creates a regulatory arbitrage:

  • FDA-cleared diagnostic AI (e.g., IDx-DR for diabetic retinopathy): Rigorous clinical validation, performance standards, post-market surveillance
  • General wellness AI (most chatbots and symptom checkers): No validation requirements, no performance standards, no adverse event reporting

The distinction depends on marketing claims, not actual use. Patients don’t distinguish between regulated and unregulated tools.

Recommendations for Clinical Practice

When patients use AI health tools:

  1. Ask proactively: “Have you looked up your symptoms online or used any health apps?” This normalizes the discussion

  2. Review AI recommendations together: Don’t dismiss outright; use as teachable moment about evidence-based medicine

  3. Document: Note patient’s AI use and your clinical guidance in chart for liability protection

  4. Educate about escalation: Teach patients red flag symptoms requiring immediate care regardless of AI reassurance

  5. Equity considerations: Does AI work equally well for:

    • Limited English proficiency patients?
    • Low health literacy populations?
    • Patients without reliable internet access or smartphones?
    • Conditions affecting underrepresented communities?

For comprehensive safety evaluation frameworks, see Clinical AI Safety and Risk Management. For equity considerations, see Medical Ethics, Bias, and Health Equity.


Part 10: Implementation Guide

Safe LLM Implementation Checklist

Pre-Implementation:

During Use:

Post-Implementation:


Part 11: Institutional LLM Deployment: Real-World Evidence

While individual physicians experiment with LLMs, health systems face a distinct challenge: how to deploy LLMs at institutional scale with appropriate governance, workflow integration, and continuous monitoring. Stanford Medicine’s ChatEHR provides the first large-scale evidence on what institutional LLM deployment actually looks like in practice.

The Stanford ChatEHR Model

Stanford Health Care developed ChatEHR as an institutional capability rather than adopting external vendor solutions, enabling what they term a “build-from-within” strategy (Shah et al., 2026). The platform connects multiple LLMs (OpenAI, Anthropic, Google, Meta, DeepSeek) to the complete longitudinal patient record within the EHR.

Two deployment modes:

| Mode | Description | Use Case |
|------|-------------|----------|
| Automations | Static prompt + data combinations for fixed tasks | Transfer eligibility screening, surgical site infection monitoring, chart abstraction |
| Interactive UI | Chat interface within EHR for open-ended queries | Pre-visit chart review, summarization, clinical questions |

The key insight: Standalone LLM tools (like web-based ChatGPT) create “workflow friction” from manual data entry. Clinical adoption requires real-time access to the longitudinal medical record inside the LLM context window, direct embedding into clinical workflows, and continuous evaluation.
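A minimal sketch of the "record inside the context window" idea: flatten a patient's FHIR bundle into dated one-liners and place them directly in the prompt. The field names follow FHIR R4, but the two-resource bundle and the prompt wording are invented for illustration; this is not the ChatEHR implementation.

```python
# Toy FHIR bundle with two resources; field names follow FHIR R4, contents are invented.
bundle = {"entry": [
    {"resource": {"resourceType": "Condition", "recordedDate": "2024-03-02",
                  "code": {"text": "Heart failure with reduced ejection fraction"}}},
    {"resource": {"resourceType": "Observation", "effectiveDateTime": "2024-06-11",
                  "code": {"text": "NT-proBNP"},
                  "valueQuantity": {"value": 2100, "unit": "pg/mL"}}},
]}

def flatten_bundle(bundle: dict) -> str:
    """Render selected resources as dated one-liners small enough for a prompt."""
    lines = []
    for entry in bundle.get("entry", []):
        r = entry["resource"]
        if r["resourceType"] == "Condition":
            lines.append(f"{r.get('recordedDate', '?')}: Condition - {r['code']['text']}")
        elif r["resourceType"] == "Observation":
            v = r.get("valueQuantity", {})
            line = (f"{r.get('effectiveDateTime', '?')}: {r['code']['text']} "
                    f"{v.get('value', '?')} {v.get('unit', '')}")
            lines.append(line.strip())
    return "\n".join(lines)

prompt = ("Summarize this patient's record for a pre-visit review. "
          "Cite the date of every statement you make.\n\n" + flatten_bundle(bundle))
print(prompt)
```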

Quantified Hallucination Rates in Production

The most significant contribution of the Stanford deployment is quantified error rates from real-world clinical use, not laboratory benchmarks.

Summarization accuracy (the most common task, 30% of queries):

| Metric | Rate | Definition |
|--------|------|------------|
| Hallucinations per summary | 0.73 | Statements not found in or supported by the patient record (e.g., mentioning a procedure not documented) |
| Inaccuracies per summary | 1.60 | Statements contradicting information in the record (e.g., reporting a lab value of 5.0 when the record states 3.5) |
| Total unsupported claims | 2.33 per summary | Combined hallucinations + inaccuracies |
| Summaries with ≤1 error | 50% | Half of all summaries had at most one unsupported claim |

Error types identified:

  • Temporal sequence errors (misstating when events occurred)
  • Numeric value confusion (labs, vitals)
  • Role attribution errors (misstating who performed an action)
  • “Gestalt” care plan confusion (e.g., conflating pulmonary workup with cardiovascular workup)

Clinical implication: Every LLM-generated clinical summary requires verification. Approximately half of summaries contain more than one error. These are not catastrophic hallucinations in most cases, but inaccuracies that could mislead clinical reasoning if unverified.
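For concreteness, a sketch of how per-summary error metrics of this kind could be computed from a manually annotated sample of summaries; the annotation schema and toy counts are assumptions, not the Stanford review protocol.

```python
from statistics import mean

# Each manually reviewed summary gets counts of the two error types (toy data).
annotations = [
    {"hallucinations": 1, "inaccuracies": 2},
    {"hallucinations": 0, "inaccuracies": 1},
    {"hallucinations": 2, "inaccuracies": 2},
    {"hallucinations": 0, "inaccuracies": 0},
]

halluc_per_summary = mean(a["hallucinations"] for a in annotations)
inacc_per_summary = mean(a["inaccuracies"] for a in annotations)
unsupported_per_summary = halluc_per_summary + inacc_per_summary
share_at_most_one_error = mean(
    (a["hallucinations"] + a["inaccuracies"]) <= 1 for a in annotations
)

print(f"Hallucinations/summary:     {halluc_per_summary:.2f}")
print(f"Inaccuracies/summary:       {inacc_per_summary:.2f}")
print(f"Unsupported claims/summary: {unsupported_per_summary:.2f}")
print(f"Summaries with <=1 error:   {share_at_most_one_error:.0%}")
```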

Adoption and Usage Patterns

User training and adoption:

  • 1,075 clinicians completed mandatory training video
  • 99% reported the training video prepared them for use
  • Training emphasized: key features, prompting tips, and the non-negotiable requirement to verify outputs

Usage scale (first 3 months of broad deployment):

| Metric | Value |
|--------|-------|
| Total sessions | 23,000+ |
| Daily active users | ~100 |
| Tokens processed | 19 billion |
| Sessions using external HIE data | >50% |
| Most common task | Summarization (30%) |

User types: 424 physicians, 180 APPs, 151 residents, and 60 fellows used the system at least once.

Response times: Most queries returned in under 20 seconds, though complex patient records with large timelines could take up to 50 seconds (with 95%+ of time spent assembling the FHIR record bundle, not LLM inference).

Value Assessment Framework

Stanford developed a structured framework to quantify LLM deployment value across three categories:

| Category | Definition | Example |
|----------|------------|---------|
| Cost savings | Direct monetary reductions | Avoided manual chart review labor |
| Time savings | Decreased time on manual tasks (time repurposed, not eliminated) | 120 charts/day avoided manual review = ~4 hours saved |
| Revenue growth | Incremental revenue from new workflows or improved throughput | Increased transfer throughput freeing beds |

First-year estimated savings: $6 million (conservative, without quantifying the benefit of improved patient care).

Example automation ROI:

| Automation | Time Savings | Revenue Impact |
|------------|--------------|----------------|
| Transfer eligibility screening | ~$100K labor | $2.4–3.3M (1,700 transfers/year freeing beds) |
| Inpatient hospice identification | ~6,570 hours/year (3 FTEs) | Difficult to quantify |
| Pre-visit chart review | 120 charts/day avoided | Time returned to clinical care |

The UI value proposition: If 100 daily users running 3 queries each save 10 minutes per query searching charts, that translates to ~50 hours/day of physician time. At median physician hourly rates, this approximates $2.2 million annually in time savings, against ~$20,000 in LLM API costs.
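Reproducing that back-of-envelope estimate, with the hourly rate as an explicit assumption (the report refers only to "median physician hourly rates"):

```python
# Back-of-envelope check of the UI value proposition described above.
daily_users = 100
queries_per_user = 3
minutes_saved_per_query = 10
assumed_hourly_rate = 120          # USD; hypothetical median physician rate
days_per_year = 365

hours_saved_per_day = daily_users * queries_per_user * minutes_saved_per_query / 60
annual_value = hours_saved_per_day * days_per_year * assumed_hourly_rate

print(f"Physician hours saved per day: {hours_saved_per_day:.0f}")          # ~50
print(f"Approximate annual time-savings value: ${annual_value / 1e6:.1f}M")  # ~$2.2M
```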

Governance and Monitoring

Stanford implemented continuous monitoring across three dimensions:

1. System integrity monitoring:

  • Response times, error codes, timeouts
  • Token usage and cost tracking
  • Data retrieval verification (re-extracting benchmark patients to detect upstream changes)

2. Performance monitoring:

  • In-workflow feedback collection (thumbs up/down)
  • Task categorization from interaction logs
  • Unsupported claims rate estimation on a sample of sessions
  • Benchmark dataset maintenance for each automation

3. Impact monitoring:

  • Action rates (what proportion of flagged patients received recommended follow-up)
  • Documentation metrics (use of generated text in notes)
  • Engagement metrics (views, copies, repeat usage)

User feedback rates: ~5% of users provided feedback; of those, two-thirds were positive (thumbs up).
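A small sketch of how feedback-based impact metrics like these could be computed from interaction logs; the log fields and toy data are assumptions for illustration.

```python
# Toy interaction log; real deployments reported ~5% feedback rate, two-thirds positive.
sessions = [
    {"user": "a", "feedback": "up"},
    {"user": "b", "feedback": None},
    {"user": "c", "feedback": "down"},
    {"user": "d", "feedback": None},
    {"user": "e", "feedback": "up"},
]

with_feedback = [s for s in sessions if s["feedback"] is not None]
feedback_rate = len(with_feedback) / len(sessions)
positive_share = (
    sum(s["feedback"] == "up" for s in with_feedback) / len(with_feedback)
    if with_feedback else 0.0
)

print(f"Feedback rate:  {feedback_rate:.0%}")   # share of sessions with any feedback
print(f"Positive share: {positive_share:.0%}")  # thumbs-up among feedback given
```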

Lessons for Health Systems

What the Stanford deployment demonstrates:

  1. Workflow integration is essential. Copy-paste from standalone tools creates friction that limits adoption. LLMs embedded in EHR workflows see sustained use.

  2. Automations require different governance than interactive use. Predefined prompt + data combinations can be validated against truth sets and monitored systematically. Interactive chat requires task categorization, sampling, and ongoing quality assessment.

  3. Benchmark performance is necessary but insufficient. Stanford used MedHELM (Nature Medicine, 2026) for initial model selection, but real-world monitoring revealed error patterns (temporal confusion, numeric errors) that benchmarks don’t capture.

  4. User training drives safe adoption. Mandatory training video completion before access, combined with community support channels, correlated with high reported preparedness and appropriate skepticism toward outputs.

  5. Value quantification requires structured frameworks. Time savings are “soft” (repurposed, not eliminated) and harder to measure than cost savings or revenue growth. Combining all three provides a more complete picture.

What remains challenging:

  • Verification behavior erosion: As users become accustomed to AI-generated content, verification may decline
  • Automated fact verification is still maturing (see VeriFact for emerging approaches)
  • Value of improved patient care is difficult to quantify

The Vendor-Agnostic Advantage

The “build-from-within” approach provides institutional agency that vendor-dependent deployments do not:

| Factor | Vendor Solution | Institutional Platform |
|--------|-----------------|------------------------|
| Model selection | Vendor’s chosen model | Best model for each task |
| Data governance | Vendor’s terms | Institution controls data |
| Customization | Limited to vendor options | Task-specific automations |
| Monitoring | Vendor-provided dashboards | Custom metrics aligned to institutional priorities |
| Cost | Per-user licensing | Per-query API costs (~$0.16/query) |
| Continuity | Vendor business risk | Internal capability persists |

The strategic argument: LLM deployment as institutional capability compounds value over time as workflows evolve, data quality improves, and internal expertise deepens. Vendor-dependent deployment creates ongoing reliance and limits customization.


Key Takeaways

10 Principles for LLM Use in Medicine

  1. LLMs are assistants, not doctors: Always maintain human oversight and final decision-making

  2. Hallucinations are unavoidable: Verify all medical facts, never trust blindly

  3. HIPAA compliance is non-negotiable: Public ChatGPT is NOT appropriate for patient data

  4. Appropriate uses: Documentation drafts, literature review, education materials (with review)

  5. Inappropriate uses: Autonomous diagnosis/treatment, medication dosing without verification, urgent decisions

  6. Physician remains legally responsible: “AI told me to” is not a malpractice defense

  7. Evidence is evolving: USMLE performance ≠ clinical utility; demand prospective RCTs

  8. Ambient documentation shows clearest benefit: 50% time savings with high satisfaction

  9. Prompting quality matters: Specific, detailed prompts with sourcing requests yield better outputs

  10. The future is collaborative: Effective physician-LLM partnership, not replacement


Clinical Scenario: LLM Vendor Evaluation

Scenario: Your Hospital Is Considering Glass Health for Clinical Decision Support

The pitch: Glass Health provides LLM-powered differential diagnosis and treatment suggestions. Marketing claims:

  • “Physician-level diagnostic accuracy”
  • “Evidence-based treatment recommendations”
  • “Saves 20 minutes per complex case”
  • Cost: $200/month per physician

The CMO asks for your recommendation.

Questions to ask:

  1. “What peer-reviewed validation studies support Glass Health?”
    • Request JAMA, Annals, specialty journal publications
    • User testimonials ≠ clinical validation
  2. “Is this HIPAA-compliant? Where is the BAA?”
    • Essential for entering patient data
  3. “What is the hallucination rate?”
    • If vendor hasn’t quantified, they haven’t tested properly
  4. “How does Glass Health handle diagnostic uncertainty?”
    • Does it express appropriate uncertainty or confidently hallucinate?
  5. “What workflow oversight prevents acting on incorrect recommendations?”
    • Best systems require physician review before actions
  6. “Can we pilot with 10 physicians before hospital-wide deployment?”
    • Local validation essential
  7. “What happens if Glass Health recommendation causes harm?”
    • Read liability disclaimers in contract
  8. “What is actual time savings data?”
    • “20 minutes per complex case” claim: where’s the evidence?

Red Flags:

  • “Physician-level accuracy” without prospective validation
  • No discussion of hallucination rates or error modes
  • Marketing emphasizes speed over safety
  • No built-in verification mechanisms


Check Your Understanding

Scenario 1: The Medication Dosing Question

Clinical situation: You’re seeing a 4-year-old with otitis media requiring amoxicillin. You ask GPT-4 (via HIPAA-compliant API):

“What is the appropriate amoxicillin dosing for a 4-year-old child with acute otitis media?”

GPT-4 responds: “For acute otitis media in a 4-year-old, amoxicillin dosing is 40-50 mg/kg/day divided into two doses (every 12 hours). For a 15 kg child, this would be 300-375 mg twice daily.”

Question 1: Do you prescribe based on this recommendation?


Answer: No, verify against authoritative source first.

Why:

The LLM response is partially correct but incomplete:

  • Standard-dose amoxicillin: 40-50 mg/kg/day divided BID (LLM correct)
  • But: the AAP now recommends high-dose amoxicillin (80-90 mg/kg/day divided BID) for most cases of AOM due to increasing S. pneumoniae resistance
  • The LLM was likely trained on older guidelines pre-dating the high-dose recommendation

Correct dosing (per current AAP guidance):

  • High-dose: 80-90 mg/kg/day divided BID (first-line for most cases)
  • For a 15 kg child: 600-675 mg BID
  • Standard-dose: 40-50 mg/kg/day only for select cases (penicillin allergy evaluation, mild infection in low-resistance areas)

What you should do:

  1. Check UpToDate, Lexicomp, or AAP guidelines directly
  2. Confirm high-dose amoxicillin is indicated
  3. Prescribe 600-675 mg BID (not the LLM-suggested 300-375 mg), as worked through below
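The arithmetic behind those numbers, as a quick check (the weight and dose ranges are from the scenario; always confirm against a current pharmacy reference):

```python
# Worked check of weight-based amoxicillin dosing for the scenario above.
weight_kg = 15
high_dose_range = (80, 90)       # mg/kg/day, divided BID (current AAP first-line)
standard_dose_range = (40, 50)   # mg/kg/day, divided BID (what the LLM gave)

def per_dose_bid(mg_per_kg_per_day: int, weight: float) -> float:
    """Single dose when the daily total is split into two doses (BID)."""
    return mg_per_kg_per_day * weight / 2

high = [per_dose_bid(d, weight_kg) for d in high_dose_range]          # 600-675 mg BID
standard = [per_dose_bid(d, weight_kg) for d in standard_dose_range]  # 300-375 mg BID

print(f"High-dose:     {high[0]:.0f}-{high[1]:.0f} mg BID")
print(f"Standard-dose: {standard[0]:.0f}-{standard[1]:.0f} mg BID (LLM suggestion)")
```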

The lesson: LLMs may provide outdated recommendations or miss recent guideline updates. Always verify medication dosing against current pharmacy databases or guidelines.

If you had prescribed the LLM dose:

  • Child receives roughly 50% of the intended amoxicillin
  • Higher risk of treatment failure
  • Potential for antibiotic resistance development


Scenario 2: The Patient Education Handout

Clinical situation: You’re discharging a patient newly diagnosed with type 2 diabetes. You use GPT-4 to generate patient education handout:

“Create a one-page patient handout for newly diagnosed type 2 diabetes, 8th-grade reading level. Cover: medications, blood sugar monitoring, diet, exercise.”

GPT-4 generates professional-looking handout with sections on metformin, glucometer use, carb counting, and walking recommendations.

Question 2: Can you give this handout to the patient as-is, or do you need to review/edit first?


Answer: MUST review and edit before giving to patient.

Why:

Potential LLM errors to check:

  1. Medication information:
    • Is metformin dosing correct? (LLMs sometimes hallucinate dosages)
    • Are side effects accurately described?
    • Are contraindications mentioned? (metformin contraindicated in advanced CKD)
  2. Blood sugar targets:
    • ADA guidelines: Fasting <130 mg/dL, postprandial <180 mg/dL for most
    • LLM may use outdated targets or not individualize
  3. Dietary advice:
    • Is carb counting explained clearly?
    • Any culturally inappropriate diet suggestions?
    • Consistent with current ADA nutrition guidelines?
  4. Exercise recommendations:
    • 150 min/week moderate exercise is standard
    • Are contraindications mentioned? (avoid exercise if glucose >250 mg/dL + ketones)
  5. Individualization:
    • LLM generates generic handout
    • Does it fit THIS patient’s health literacy, culture, comorbidities?

What you should do:

  1. Read entire handout for medical accuracy
  2. Verify all medical facts (especially medication dosing, blood sugar targets)
  3. Customize to patient (add their specific meds if not metformin, adjust for comorbidities)
  4. Check cultural appropriateness
  5. Review with patient, answer questions

The lesson: LLM-generated patient materials save time but require physician review for accuracy and individualization. Never hand out LLM content without verification.

Time savings:

  • Writing from scratch: 15-20 minutes
  • LLM draft + physician review/editing: 5-7 minutes
  • Net benefit: 10+ minutes saved while maintaining quality


Scenario 3: The Citation Fabrication

Clinical situation: You’re writing a quality improvement manuscript on reducing hospital readmissions. You ask GPT-4:

“Cite 3 recent studies (2020-2023) showing effectiveness of AI for predicting 30-day readmissions in heart failure patients.”

GPT-4 provides:

  1. “Chen et al. (2022). ‘Machine learning for heart failure readmission prediction.’ JAMA Cardiology 7:445-453.”
  2. “Rodriguez-Lopez et al. (2021). ‘Deep learning models outperform LACE index for readmission risk.’ Circulation 144:892-901.”
  3. “Patel et al. (2023). ‘Prospective validation of AI readmission tool in community hospitals.’ New England Journal of Medicine 388:1122-1131.”

Question 3: Can you include these citations in your manuscript?


Answer: NO. You must verify each citation exists and actually supports your claim.

Why:

The LLM likely fabricated some or all of these citations. Here’s how to check:

Step 1: Search PubMed for each citation

For “Chen et al. (2022) JAMA Cardiology”:

  • Search: "Chen" AND "heart failure readmission" AND "machine learning" AND "JAMA Cardiology" AND 2022
  • If found: Read abstract, confirm it supports your claim
  • If NOT found: Citation is fake

Step 2: Verify journal, volume, pages

Even if an author “Chen” published in JAMA Cardiology in 2022, check:

  • Is the title correct?
  • Is the volume/page number correct?
  • Does the paper actually discuss AI for HF readmissions?

Step 3: Read the actual papers

If citations exist:

  • Do they support the claim you’re making?
  • Are study methods sound?
  • Are conclusions being accurately represented?

Likely outcome:

  • 1-2 of these 3 citations are completely fabricated
  • Even if a paper exists, it may not say what you (or the LLM) think it says
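One way to automate the existence check in Step 1 is NCBI's public E-utilities esearch endpoint. The fielded query below is an illustration for the first LLM-provided citation; zero hits is strong evidence of fabrication, while a nonzero count still requires reading the paper itself.

```python
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_hits(term: str) -> list[str]:
    """Return PubMed IDs matching a query (an empty list suggests a fabricated citation)."""
    resp = requests.get(
        ESEARCH,
        params={"db": "pubmed", "term": term, "retmode": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]

# Fielded query approximating the LLM-provided citation #1 (illustrative).
query = ('Chen[Author] AND "JAMA Cardiology"[Journal] AND 2022[dp] '
         'AND heart failure readmission machine learning')
ids = pubmed_hits(query)
print("Possible matches:",
      ids if ids else "none found - treat the citation as fabricated until proven otherwise")
```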

What you should do instead:

  1. Search PubMed yourself: ("heart failure" OR "HF") AND ("readmission" OR "rehospitalization") AND ("machine learning" OR "artificial intelligence" OR "AI") AND ("prediction" OR "risk score")

  2. Filter: Publication date 2020-2023, Clinical Trial or Review

  3. Read abstracts, select relevant papers

  4. Cite actual papers you’ve read

The lesson: Never trust LLM-generated citations. LLMs fabricate references 15-30% of the time. Always verify papers exist and support your claims.

Consequences of using fabricated citations:

  • Manuscript rejection
  • If published then discovered: retraction
  • Academic dishonesty allegations
  • Career damage

Time comparison:

  • LLM citations (unverified): 30 seconds
  • Manual PubMed search + reading abstracts: 15-20 minutes
  • Worth the extra time to avoid fabricated references