Large Language Models in Clinical Practice
clinical LLM, clinical LLMs, large language models in medicine, GPT-4 medicine, Claude healthcare, medical AI chatbot, LLM hallucinations, HIPAA compliant AI, LLM reproducibility liability
Large Language Models (LLMs) like ChatGPT, GPT-4, Claude, and Med-PaLM represent a fundamentally different paradigm from narrow diagnostic AI. Unlike algorithms trained for single tasks (detect melanoma, predict sepsis), LLMs are general-purpose language systems that can write notes, answer questions, review literature, draft patient education, and assist clinical reasoning. They’re extraordinarily powerful but also uniquely dangerous, capable of generating confident, plausible, but completely false medical information (“hallucinations”).
After reading this chapter, you will be able to:
- Understand how LLMs work and their fundamental capabilities and limitations in medical contexts
- Identify appropriate vs. inappropriate clinical use cases based on risk-benefit assessment
- Recognize and mitigate hallucinations, citation fabrication, and knowledge cutoff problems
- Navigate privacy (HIPAA), liability, and ethical considerations specific to LLM use in medicine
- Evaluate medical-specific LLMs (Med-PaLM, GPT-4 medical applications) vs. general-purpose models
- Implement LLMs safely in clinical workflows with proper oversight and verification protocols
- Communicate transparently with patients about LLM-assisted care
- Apply vendor evaluation frameworks before adopting LLM tools for clinical practice
Introduction: A Paradigm Shift in Medical AI
Every previous chapter in this handbook examines narrow AI: algorithms trained for single, specific tasks.
- Radiology AI detects pneumonia on chest X-rays (and nothing else)
- Pathology AI grades prostate cancer histology (and nothing else)
- Cardiology AI interprets ECGs for arrhythmias (and nothing else)
Large Language Models are fundamentally different: general-purpose systems that perform diverse tasks through natural language interaction.
Ask GPT-4 to summarize a medical guideline and it does. Ask it to draft a patient education handout and it does. Ask it to generate a differential diagnosis for chest pain and it does. No task-specific retraining required.
This versatility is unprecedented in medical AI. It’s also what makes LLMs uniquely dangerous.
A narrow diagnostic AI fails in predictable ways:
- Pneumonia detection AI applied to a chest X-ray might miss a pneumonia (false negative) or flag normal lungs as abnormal (false positive)
- Failure modes are bounded by the task
LLMs fail in unbounded ways:
- Fabricate drug dosages that look correct but cause overdoses
- Invent medical “facts” that sound authoritative but are false
- Generate fake citations to real journals (the paper doesn’t exist)
- Provide confident answers to questions where uncertainty is appropriate
- Contradict themselves across responses
- Recommend treatments that were standard of care in the training data but have since been superseded
The clinical analogy: LLMs are like exceptionally well-read medical students who have:
- Perfect recall of everything they’ve studied
- No clinical experience
- No ability to examine patients or access patient-specific data
- No accountability for errors
- A tendency to confidently bullshit when they don’t know the answer
Part 1: How LLMs Work (What Physicians Need to Know)
The Technical Basics (Simplified)
Training:
1. Ingest massive text corpora (internet, books, journals, Wikipedia, Reddit, medical textbooks, PubMed abstracts)
2. Learn statistical patterns: “Given these words, what word typically comes next?”
3. Scale to billions of parameters (the weights connecting neural network nodes)
4. Fine-tune with human feedback (reinforcement learning from human preferences)
Inference (when you use it):
1. You provide a prompt (“Generate a differential diagnosis for acute chest pain in a 45-year-old man”)
2. The LLM predicts the most likely next word based on learned patterns
3. It continues predicting words one by one until a stopping criterion is met
4. It returns the generated text
Crucially:
- LLMs don’t “look up” facts in a database
- They don’t “reason” in the logical sense
- They predict plausible text based on statistical patterns
- Truth and plausibility are not the same thing
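To make “predict the next plausible word” concrete, here is a deliberately tiny, hypothetical sketch that generates text from word-pair frequencies. Production LLMs learn billions of parameters rather than a lookup table, but the autoregressive loop (pick a plausible next token, append it, repeat) is the same idea.

```python
# Toy illustration (not a real LLM): next-word generation from observed
# word-pair frequencies in a two-sentence "corpus". The output is fluent and
# plausible, with no guarantee of being true.
import random
from collections import defaultdict

corpus = (
    "chest pain radiating to the left arm suggests acute coronary syndrome . "
    "chest pain worse with inspiration suggests pleurisy or pulmonary embolism ."
).split()

# Count which word follows which (a bigram model).
following = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word].append(next_word)

def generate(start: str, max_words: int = 12) -> str:
    words = [start]
    for _ in range(max_words):
        candidates = following.get(words[-1])
        if not candidates:
            break
        # Sample in proportion to observed frequency: plausibility, not truth.
        words.append(random.choice(candidates))
    return " ".join(words)

print(generate("chest"))
# One possible output: "chest pain radiating to the left arm suggests pleurisy
# or pulmonary embolism ." -- fluent, confident, and clinically wrong:
# a hallucination in miniature.
```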
Why Hallucinations Happen
Definition: LLM generates confident, coherent, plausible but factually incorrect text.
Mechanism: The training objective is “predict next plausible word,” not “retrieve correct fact.” As WHO’s 2025 guidance notes, LLMs have no conception of what they produce, only statistical patterns from training data (WHO, 2025). When uncertain, LLMs default to generating text that sounds correct rather than admitting uncertainty or refusing to answer.
Medical examples documented in literature:
- Fabricated drug dosages:
- Prompt: “What is the pediatric dosing for amoxicillin?”
- GPT-3.5 response: “20-40 mg/kg/day divided every 8 hours” (incorrect for many indications; standard is 25-50 mg/kg/day, some indications 80-90 mg/kg/day)
- Invented medical facts:
- Prompt: “What are the contraindications to beta-blockers in heart failure?”
- LLM includes “NYHA Class II heart failure” (false; beta-blockers are indicated, not contraindicated, in Class II HF)
- Fake citations:
- Prompt: “Cite studies showing benefit of IV acetaminophen for postoperative pain”
- GPT-4 generates: “Smith et al. (2019) in JAMA Surgery found 40% reduction in opioid use” (paper doesn’t exist; authors, journal, year all fabricated but plausible)
- Outdated recommendations:
- All LLMs have training data cutoffs (check the specific model’s documentation)
- May recommend drugs withdrawn from market after training
- Unaware of updated guidelines published post-training
Why this matters clinically: A physician who trusts LLM output without verification risks:
- Incorrect medication dosing → patient harm
- Reliance on outdated treatment → suboptimal care
- Academic dishonesty from fabricated citations → career consequences
Mitigation strategies:
- Always verify drug information against pharmacy databases (Lexicomp, Micromedex, UpToDate)
- Cross-check medical facts with authoritative sources (guidelines, textbooks, PubMed)
- Never trust LLM citations without looking up the actual papers
- Use LLMs for drafts and idea generation, never final medical decisions
- Higher stakes = more verification required
Retrieval-Augmented Generation (RAG) for Healthcare
The first comprehensive review of RAG for healthcare applications (Ng et al., NEJM AI, 2025) examines how retrieval-augmented generation addresses three core LLM limitations:
The Three Problems RAG Addresses:
| Problem | How RAG Helps |
|---|---|
| Outdated information | Retrieves from current knowledge bases, bypassing training cutoff |
| Hallucinations | Grounds responses in retrieved documents, enabling source verification |
| Reliance on public data | Can query institutional guidelines, formularies, and proprietary sources |
How RAG Works:
- Retrieval: When a query arrives, the system searches a curated knowledge base (guidelines, textbooks, institutional protocols)
- Augmentation: Retrieved passages are provided to the LLM as context
- Generation: LLM generates response grounded in retrieved documents, with source citations
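A minimal sketch of that retrieve-augment-generate loop appears below. The guideline snippets, the keyword-overlap retriever, and the `call_llm` stub are illustrative placeholders (real systems use vector search over a curated knowledge base and a HIPAA-compliant model endpoint), but they show where source grounding and citations enter the pipeline.

```python
# Minimal RAG sketch. KNOWLEDGE_BASE, the keyword-overlap retriever, and the
# call_llm stub are illustrative placeholders, not a real clinical knowledge
# base or vendor API.
from dataclasses import dataclass

@dataclass
class Document:
    source: str   # where a clinician could go to verify the passage
    text: str

KNOWLEDGE_BASE = [
    Document("Institutional AF protocol, section 4.2",
             "For nonvalvular AF with CHA2DS2-VASc >= 2, offer oral anticoagulation ..."),
    Document("Hospital formulary, anticoagulants",
             "Preferred DOAC is apixaban; adjust dose for renal function ..."),
]

def call_llm(prompt: str) -> str:
    """Placeholder for whatever HIPAA-compliant LLM endpoint the institution uses."""
    return "(grounded response citing the sources above would appear here)"

def retrieve(query: str, top_k: int = 2) -> list[Document]:
    """Rank documents by naive keyword overlap; production systems use vector search."""
    terms = set(query.lower().split())
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda d: len(terms & set(d.text.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def answer_with_rag(question: str) -> str:
    docs = retrieve(question)
    context = "\n".join(f"[{d.source}] {d.text}" for d in docs)
    prompt = ("Answer using ONLY the sources below and cite them by name. "
              "If the sources do not answer the question, say so.\n\n"
              f"SOURCES:\n{context}\n\nQUESTION: {question}")
    return call_llm(prompt)

print(answer_with_rag("What anticoagulation is recommended for nonvalvular AF?"))
```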
Clinical Applications:
- Guideline-grounded responses: RAG can pull from current treatment guidelines, reducing outdated recommendations
- Institutional integration: Hospital-specific formularies and protocols as knowledge sources
- Citation verification: Responses include source documents that users can verify
- Pharmaceutical industry: Drug information queries grounded in package inserts and regulatory documents
Limitations:
- RAG reduces but does not eliminate hallucinations. LLMs can still misinterpret or misstate retrieved content
- Retrieval quality depends on knowledge base curation and query matching
- Adds latency and infrastructure complexity compared to base LLMs
- Retrieved context may be outdated if knowledge base isn’t maintained
Clinical Implication: When evaluating clinical LLM tools, ask whether they use RAG or similar grounding approaches. Systems that cite sources enable verification; those that don’t require more skepticism.
State of Clinical AI 2026: LLM Diagnostic Performance
The inaugural State of Clinical AI Report from the Stanford-Harvard ARISE network provides a nuanced assessment of LLM diagnostic capabilities.
Impressive benchmark results:
Several studies published in 2025 showed LLMs matching or outperforming physicians on diagnostic reasoning and treatment planning when evaluated on fixed clinical cases (Brodeur et al., 2025, preprint). In some papers, this performance was described as “superhuman.”
The reality check:
Performance depends heavily on how narrowly the problem is framed:
- When models had to ask follow-up questions, manage incomplete information, or revise decisions as new details emerged, performance dropped (Johri et al., Nature Medicine, 2025)
- On tests measuring reasoning under uncertainty, AI systems performed closer to medical students than experienced physicians (McCoy et al., NEJM AI, 2025)
- Models tended to commit strongly to answers even when clinical ambiguity was high
- Accuracy dropped 26-38% when familiar answer patterns were disrupted (Bedi et al., JAMA Network Open, 2025)
Why this matters for clinical practice:
In everyday medicine, uncertainty is common. The gap between performance on fixed exam questions and performance in ambiguous real-world scenarios is substantial. The report concludes that much of what looks impressive in headline-grabbing studies may not hold up in clinical practice.
Clinical implication: Use LLMs for brainstorming and drafts where you can verify output, not for situations requiring judgment under uncertainty.
When structured data outperforms LLMs: For clinical prediction tasks on structured EHR data, LLMs may underperform traditional approaches. In antimicrobial resistance prediction for sepsis, LLMs analyzing clinical notes achieved AUROC 0.74 compared to 0.85 for deep learning on structured EHR data, with combined approaches offering no improvement (Hixon et al., 2025, conference abstract). LLMs excel at language tasks, not structured clinical prediction.
Part 2: Major Failure Case Studies: Hallucination Disasters
Case 1: The Fabricated Oncology Protocol
Scenario (reported 2023): Physician asked GPT-4 for dosing protocol for pediatric acute lymphoblastic leukemia (ALL) consolidation therapy.
LLM response: Generated detailed protocol with drug names, dosages, timing that looked professionally formatted and authoritative.
The problem:
- Methotrexate dose: 50 mg/m² (LLM suggestion) vs. actual protocol: 5 g/m² (a 100x difference)
- Vincristine timing: weekly (LLM) vs. protocol: every 3 weeks during consolidation
- Dexamethasone duration: 5 days (LLM) vs. protocol: 28 days
If followed without verification: Patient would have received 1% of intended methotrexate dose (treatment failure, disease progression) and excessive vincristine (neurotoxicity risk).
Why it happened: LLM trained on general medical text, not specialized oncology protocols. Generated plausible-sounding but incorrect regimen by combining fragments from different contexts.
The lesson: Never use LLMs for medication dosing without rigorous verification against authoritative sources (protocol handbooks, institutional guidelines, pharmacy consultation).
Case 2: The Confident Misdiagnosis
Scenario (published case study): Emergency physician used GPT-4 to generate differential diagnosis for “32-year-old woman with sudden-onset severe headache, photophobia, neck stiffness.”
LLM differential:
1. Migraine (most likely)
2. Tension headache
3. Sinusitis
4. Meningitis (mentioned fourth)
5. Subarachnoid hemorrhage (mentioned fifth)
The actual diagnosis: Subarachnoid hemorrhage (SAH) from ruptured aneurysm.
The problem: LLM ranked benign diagnoses (migraine, tension headache) above life-threatening emergencies (SAH, meningitis) despite classic “thunderclap headache + meningeal signs” presentation.
Why it happened:
- Training data bias: migraine is far more common than SAH in text corpora
- LLMs predict based on frequency in training data, not clinical risk stratification
- No understanding of the “rule out worst-case first” principle of emergency medicine
The lesson: LLMs don’t triage by clinical urgency or risk. Physician must apply clinical judgment to LLM suggestions.
What the physician did right: Used LLM as brainstorming tool, not autonomous diagnosis. Recognized high-risk presentation and ordered CT + LP appropriately.
Case 3: The Citation Fabrication Scandal
Scenario: Medical student submitted literature review using GPT-4 to generate citations supporting statements about hypertension management.
LLM-generated citations (examples):
1. “Johnson et al. (2020). ‘Intensive blood pressure control in elderly patients.’ New England Journal of Medicine 383:1825-1835.”
2. “Patel et al. (2019). ‘Renal outcomes with SGLT2 inhibitors in diabetic hypertension.’ Lancet 394:1119-1128.”
The problem: Neither paper exists. Authors, journals, years, page numbers all plausible but fabricated.
Discovery: Faculty advisor attempted to retrieve papers for detailed review. None found in PubMed, journal archives, or citation databases.
Consequences:
- Student received a failing grade for academic dishonesty
- Faculty implemented a “verify all LLM-generated citations” policy
- The medical school updated its honor code to address AI-assisted writing
Why this matters:
- Citation fabrication in grant applications = federal research misconduct
- In publications = retraction, career damage
- In clinical guidelines = propagation of misinformation
The lesson: Never trust LLM-generated citations. Always verify papers exist and actually support the claims attributed to them.
Case 4: The Medical Calculation Gap
The problem: Physicians routinely use clinical calculators (CHA₂DS₂-VASc for stroke risk, Cockcroft-Gault for creatinine clearance, HEART score for chest pain triage, LDL calculations). These quantitative tools drive treatment decisions daily.
What the research shows: MedCalc-Bench, a benchmark from NIH researchers evaluating LLMs on 55 different medical calculator tasks across 1,000+ patient scenarios, found that the best-performing model (GPT-4 with one-shot prompting) achieved only 50.9% accuracy (Khandekar et al., NeurIPS 2024).
Three distinct failure modes:
- Knowledge errors (Type A): LLM doesn’t know the correct equation or rule
- Example: Asked to calculate a CHA₂DS₂-VASc score, assigns wrong points to criteria
- Most common error in zero-shot prompting (over 50% of mistakes)
- Extraction errors (Type B): LLM extracts wrong parameters from patient note
- Example: Misidentifies patient age, medication history, or lab values from clinical narrative
- 16-31% of errors depending on model
- Computation errors (Type C): LLM performs arithmetic incorrectly
- Example: Calculates LDL as 142 mg/dL when correct answer is 128 mg/dL
- 13-17% of errors even when equation and parameters are correct
Why this matters clinically:
Medical calculations drive treatment decisions:
- CHA₂DS₂-VASc ≥2 → anticoagulation for atrial fibrillation
- eGFR <30 → medication dose adjustments
- HEART score ≥4 → admission vs. discharge decision
At 50.9% accuracy, roughly half of LLM-generated calculations are wrong on tasks with direct treatment implications.
The performance gap:
Anthropic reports that Claude Opus 4.5 achieves 98.1% accuracy on MedCalc-Bench. This would represent a substantial improvement IF independently validated. Key caveats:
- Anthropic’s metric is company-reported, not peer-reviewed
- The original research (GPT-4) found 50.9% accuracy
- The 98.1% claim requires independent replication
The lesson: Never trust LLM-generated medical calculations without verification. Check all risk scores, GFR calculations, and dosing adjustments against established calculators (MDCalc, online tools, pharmacy databases).
Clinical workflow:
1. LLM suggests a calculation (e.g., “Patient’s CHA₂DS₂-VASc score is 4”)
2. Verify independently: use MDCalc or manual calculation
3. If there is a mismatch: trust the verified calculation, not the LLM
4. Document the verification in the clinical note
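As a concrete example of step 2, the following minimal sketch recomputes a CHA₂DS₂-VASc score independently of the LLM. The function and the example patient are illustrative; confirm against MDCalc or your institutional calculator before acting.

```python
# Minimal CHA2DS2-VASc calculator for independently spot-checking an
# LLM-suggested score. Inputs are illustrative; confirm against MDCalc or
# your institutional tool before acting.
def cha2ds2_vasc(age: int, female: bool, chf: bool, hypertension: bool,
                 diabetes: bool, stroke_tia_thromboembolism: bool,
                 vascular_disease: bool) -> int:
    score = 0
    score += 1 if chf else 0                               # C: CHF / LV dysfunction
    score += 1 if hypertension else 0                      # H: hypertension
    score += 2 if age >= 75 else (1 if age >= 65 else 0)   # A2 / A: age
    score += 1 if diabetes else 0                          # D: diabetes
    score += 2 if stroke_tia_thromboembolism else 0        # S2: prior stroke/TIA/thromboembolism
    score += 1 if vascular_disease else 0                  # V: MI, PAD, aortic plaque
    score += 1 if female else 0                            # Sc: sex category (female)
    return score

# Example: 72-year-old woman with hypertension and diabetes
llm_claimed_score = 4
verified = cha2ds2_vasc(age=72, female=True, chf=False, hypertension=True,
                        diabetes=True, stroke_tia_thromboembolism=False,
                        vascular_disease=False)
assert verified == 4  # 1 (age 65-74) + 1 (HTN) + 1 (DM) + 1 (female)
print("Verified CHA2DS2-VASc:", verified, "| LLM claimed:", llm_claimed_score)
```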
Part 3: The Success Story: Ambient Clinical Documentation
Nuance DAX: Ambient Documentation AI
The problem DAX solves: Physicians spend 2+ hours per day on documentation, often completing notes after-hours. EHR documentation contributes significantly to burnout.
How DAX works:
1. Physician wears a microphone during the patient encounter
2. DAX records the conversation (with patient consent)
3. An LLM transcribes the speech and converts it to a structured clinical note
4. The note appears in the EHR for physician review/editing
5. The physician reviews, makes corrections, and signs the note
Evidence base:
Regulatory status: Not FDA-regulated. Falls under CDS (Clinical Decision Support) exemption per 21st Century Cures Act because it generates documentation drafts, not diagnoses or treatment recommendations, and physicians independently review all output.
Clinical validation: Nuance-sponsored study (2023), 150 physicians, 5,000+ patient encounters:
- Documentation time reduction: 50% (mean 5.5 min → 2.7 min per encounter)
- Physician satisfaction: 77% would recommend to colleagues
- Note quality: no significant difference from physician-written notes (blinded expert review)
- Error rate: 0.3% factual errors requiring correction (similar to the baseline physician error rate in dictation)
Real-world deployment:
- 550+ health systems
- 35,000+ clinicians using DAX
- 85% user retention after 12 months
Cost-benefit:
- DAX subscription: ~$369-600/month per physician (varies by contract; $700 one-time implementation fee)
- Time savings: 1 hour/day × $200/hour physician cost = $4,000/month saved
- ROI: positive in 1-3 months depending on encounter volume
Pricing source: DAX Copilot pricing page, January 2026. Costs vary by volume and contract terms.
Why this works:
- Well-defined task (transcription + note structuring)
- Physician review catches errors before note finalization
- Integration with the EHR workflow
- Patient consent obtained upfront
- HIPAA-compliant (BAA with healthcare systems)
Limitations:
- Requires patient consent (some decline)
- Poor audio quality → transcription errors
- Complex cases with multiple topics may require substantial editing
- Subscription cost is a barrier for small practices
Abridge: AI-Powered Medical Conversations
Similar ambient documentation tool with comparable performance:
- 65% documentation time reduction in pilot studies
- Focuses on primary care and specialty clinics
- Generates patient-facing visit summaries automatically
The lesson: When LLMs are used for well-defined tasks with physician oversight and proper integration, they deliver genuine value.
Emerging: LLM-Based Clinical Copilots
Beyond documentation, early evidence suggests LLMs may function as real-time clinical safety nets. In a preprint study of 39,849 patient visits at Penda Health clinics in Kenya, clinicians using an LLM copilot (GPT-4o integrated into EHR workflow) showed 16% fewer diagnostic errors and 13% fewer treatment errors compared to controls (Korom et al., 2025, preprint). The copilot flagged potential errors for clinician review using a tiered alert system (green/yellow/red severity), maintaining physician control while providing a second-opinion safety net.
Key caveats: This is a preprint from an OpenAI partnership (company-reported data), patient outcomes showed no statistically significant difference, and a randomized controlled trial is underway. See Primary Care AI for detailed implementation analysis.
Part 4: Appropriate vs. Inappropriate Clinical Use Cases
SAFE Uses (With Physician Oversight)
1. Clinical Documentation Assistance
Use cases:
- Draft progress notes from dictation
- Generate discharge summaries
- Suggest ICD-10/CPT codes
- Create procedure notes
Workflow:
1. Physician provides input (dictation, conversation recording, bullet points)
2. LLM generates a structured note
3. Physician reviews every detail, edits errors, adds clinical judgment
4. Physician signs the final note
The dangerous reality: As AI accuracy improves, human vigilance drops. Studies show physicians begin “rubber-stamping” AI-generated content after approximately 3 months of successful use (Goddard et al., 2012).
The pattern:
- Month 1: Physician carefully reviews every word, catches errors
- Month 3: Physician skims notes, catches obvious errors
- Month 6: Physician clicks “Sign” with minimal review, trusts the AI
- Month 12: Errors slip through; patient harm possible
Counter-measures to maintain vigilance:
- Spot-check protocol: Verify at least one specific data point per note (e.g., check one lab value, one medication dose, one vital sign against the record)
- Rotation strategy: Vary which section you scrutinize each encounter
- Red flag awareness: Know the AI’s failure modes (medication names, dosing, dates, rare conditions)
- Scheduled deep review: Once weekly, do a line-by-line audit of a randomly selected AI-generated note
- Error tracking: Log every error you catch; if catches drop to zero, you may have stopped looking
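A minimal, hypothetical sketch of the spot-check and error-tracking ideas above: randomly rotate which note section gets line-by-line scrutiny, and log every review so that a long run of zero catches becomes visible.

```python
# Hypothetical helper implementing the spot-check and error-tracking
# counter-measures: pick one random section to verify on every AI-drafted
# note, and log every review so a drop to zero catches is easy to spot.
import csv
import random
from datetime import date

SECTIONS = ["medications and doses", "lab values", "vital signs",
            "dates and timeline", "allergies", "assessment and plan"]

def todays_spot_check() -> str:
    """Rotate which section gets line-by-line scrutiny on this note."""
    return random.choice(SECTIONS)

def log_review(logfile: str, note_id: str, error_found: bool, description: str = "") -> None:
    """Append every review to a log; months of 'no errors' may mean you stopped looking."""
    with open(logfile, "a", newline="") as f:
        csv.writer(f).writerow([date.today().isoformat(), note_id, error_found, description])

print("For this note, verify:", todays_spot_check())
log_review("ai_note_audit.csv", note_id="encounter-123", error_found=True,
           description="metoprolol dose transcribed as 250 mg, actual order 25 mg")
```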
The uncomfortable truth: “Physician in the loop” only works if the physician is actually paying attention. The AI doesn’t get tired; you do.
Risk mitigation:
- Physician remains legally responsible for note content
- Review catches hallucinations, errors, and omissions
- HIPAA-compliant systems only
Evidence: 50% time savings documented in multiple studies (see DAX above)
2. Literature Synthesis and Summarization
Use cases:
- Summarize clinical guidelines
- Compare treatment options from multiple sources
- Generate literature review outlines
- Identify relevant studies for research questions
Workflow:
1. Provide the LLM with a specific question and context
2. Request a summary with citations
3. Verify all citations exist and support the claims
4. Cross-check medical facts against primary sources
Example prompt:
"Summarize the 2023 AHA/ACC guidelines for management
of atrial fibrillation, focusing on anticoagulation
recommendations for patients with CHA₂DS₂-VASc ≥2.
Include specific drug dosing and monitoring requirements.
Cite specific guideline sections."
Risk mitigation:
- Verify citations before relying on the summary
- Cross-check facts with the original guidelines
- Use as a starting point, not a final analysis
3. Patient Education Materials
Use cases:
- Explain diagnoses in health literacy-appropriate language
- Create discharge instructions
- Draft procedure consent explanations
- Translate medical jargon into plain language
Workflow:
1. Specify reading level, key concepts, patient concerns
2. LLM generates a draft
3. Physician reviews for medical accuracy
4. Physician edits for cultural sensitivity and individual patient factors
5. Physician shares the handout with the patient
Example prompt:
"Create a patient handout about type 2 diabetes management
for a patient with 6th grade reading level. Cover: medication
adherence, blood sugar monitoring, dietary changes, exercise.
Use simple language, avoid jargon, 1-page limit."
Risk mitigation:
- Fact-check all medical information
- Customize to the individual patient (LLMs generate generic content)
- Consider health literacy and cultural factors
4. Differential Diagnosis Brainstorming
Use cases:
- Generate possibilities for complex cases
- Identify rare diagnoses to consider
- Broaden the differential when stuck
Workflow:
1. Provide a detailed clinical vignette
2. Request a differential with reasoning
3. Treat the output as idea generation, not diagnosis
4. Pursue the appropriate diagnostic workup based on clinical judgment
Example prompt:
"Generate differential diagnosis for 45-year-old woman
with 3 months of progressive dyspnea, dry cough, and
fatigue. Exam: fine bibasilar crackles, no wheezing.
CXR: reticular infiltrates. Consider both common and
rare etiologies. Provide likelihood and key diagnostic
tests for each."
Risk mitigation:
- The LLM differential is brainstorming, not diagnosis
- Verify each possibility is clinically plausible for the patient
- Pursue workup based on pretest probability, not LLM ranking
5. Medical Coding Assistance
Use cases:
- Suggest ICD-10/CPT codes from clinical notes
- Identify documentation gaps for proper coding
- Check code appropriateness
Workflow:
1. LLM analyzes the clinical note
2. Suggests codes with reasoning
3. Coding specialist or physician reviews
4. Confirms the codes match the care delivered and the documentation
Risk mitigation:
- Compliance review is essential (fraudulent coding is a federal offense)
- Physician confirms the codes represent actual care
- Regular audits of LLM-suggested codes
DANGEROUS Uses (Do NOT Do)
1. Autonomous Patient Advice
Why dangerous:
- Patients ask LLMs medical questions without physician involvement
- LLMs provide confident answers regardless of accuracy
- Patients may delay appropriate care based on false reassurance
Documented harms:
- Patient with chest pain asked ChatGPT “Is this heartburn or heart attack?”
- ChatGPT suggested antacids (without seeing the patient, knowing the history, or performing an exam)
- Patient delayed the ER visit by 6 hours and presented with a STEMI
The lesson: Patients will use LLMs for medical advice regardless of physician recommendations. Educate patients about limitations, encourage them to contact you rather than rely on AI.
Both OpenAI and Anthropic launched dedicated healthcare products in January 2026:
- ChatGPT Health (OpenAI, January 2026): Consumer health AI with medical record and wellness app integration. See AI Tools Every Physician Should Know: Consumer Health AI.
- Claude for Healthcare (Anthropic, January 11, 2026): Enterprise-focused with BAA coverage via AWS/Google/Azure, plus consumer features for Pro/Max subscribers. See HIPAA-compliant alternatives below.
2. Medication Dosing Without Verification
Why dangerous:
- LLMs fabricate plausible but incorrect dosages
- Pediatric dosing is especially error-prone
- Drug interaction checking is unreliable
Documented near-miss:
- Physician asked GPT-4 for vancomycin dosing in renal failure
- LLM suggested a dose appropriate for normal renal function
- Pharmacist caught the error before administration
The lesson: Never use LLM-generated medication dosing without verification against pharmacy databases, dose calculators, or pharmacist consultation.
Medical calculations beyond dosing:
The quantitative reasoning gap extends beyond medication dosing to all medical calculators:
- Risk scores: CHA₂DS₂-VASc, HEART score, Caprini VTE risk
- Renal function estimates: Cockcroft-Gault (creatinine clearance), MDRD (eGFR)
- Lab-derived values: LDL calculation, anion gap
- Clinical indices: Pneumonia severity index, Wells’ criteria
NIH research found LLMs achieve only 50.9% accuracy on medical calculation tasks, with three failure patterns: wrong equations, parameter extraction errors, and arithmetic mistakes (Khandekar et al., NeurIPS 2024).
The lesson: Verify all LLM calculations against established medical calculators (MDCalc, institutional tools, pharmacy databases). See Case 4: The Medical Calculation Gap for detailed failure modes.
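As one hedged example of that verification, the sketch below recomputes a Cockcroft-Gault creatinine clearance independently. The example patient is illustrative, and institutional calculators or pharmacy databases remain the authoritative check.

```python
# Minimal Cockcroft-Gault check for independently verifying an LLM-suggested
# renal dose adjustment. Illustrative only -- confirm against your pharmacy
# database or institutional calculator before dosing.
def cockcroft_gault_crcl(age_years: int, weight_kg: float,
                         serum_creatinine_mg_dl: float, female: bool) -> float:
    """Estimated creatinine clearance in mL/min (actual body weight used here;
    institutions differ on which weight to use)."""
    crcl = ((140 - age_years) * weight_kg) / (72 * serum_creatinine_mg_dl)
    return crcl * 0.85 if female else crcl

# Example: 80-year-old woman, 60 kg, serum creatinine 1.5 mg/dL
crcl = cockcroft_gault_crcl(age_years=80, weight_kg=60,
                            serum_creatinine_mg_dl=1.5, female=True)
print(f"Estimated CrCl: {crcl:.0f} mL/min")  # ~28 mL/min -> many drugs need adjustment
```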
3. Urgent or Emergent Clinical Decisions
Why dangerous:
- Time pressure precludes adequate verification
- High stakes magnify the consequences of errors
- Clinical judgment + experience > LLM statistical patterns
The lesson: In emergencies, rely on clinical protocols, expert consultation, established guidelines, not LLM brainstorming.
4. Generating Citations Without Verification
Why dangerous:
- LLMs fabricate 15-30% of medical citations
- Using fake references = academic dishonesty, research misconduct
- Propagates misinformation if not caught
The lesson: Never include LLM-generated citations in manuscripts, grants, presentations without verifying papers exist and support the claims.
Part 5: Prompting Techniques and Evidence-Based Approaches
Well-crafted prompts significantly improve LLM output quality. A scoping review of 114 prompt engineering studies found that structured prompting techniques can improve task performance substantially compared to naive prompts (Zaghir et al., 2024).
Core Prompting Paradigms
Zero-Shot Prompting
The simplest approach: ask a question without examples.
"What are the first-line treatments for community-acquired pneumonia
in an otherwise healthy adult?"
When to use: Simple factual questions, initial exploration, low-stakes queries
Limitations: Less reliable for complex reasoning, nuanced clinical scenarios, or specialized domains
Few-Shot Prompting
Provide examples of desired input-output pairs before your actual question.
Example 1:
Patient: 65-year-old male, chest pain radiating to left arm, diaphoresis
Assessment: High concern for ACS, recommend immediate ECG and troponins
Example 2:
Patient: 28-year-old female, sharp chest pain worse with inspiration
Assessment: Consider pleurisy, PE, or musculoskeletal cause
Now assess:
Patient: 72-year-old female with diabetes, fatigue and jaw pain for 2 days
When to use: When you need consistent output format, domain-specific reasoning patterns, or specialized terminology
Evidence: LLMs enhanced with clinical practice guidelines via few-shot prompting showed improved performance across GPT-4, GPT-3.5 Turbo, LLaMA, and PaLM 2 compared to zero-shot baselines (Oniani et al., 2024)
Chain-of-Thought (CoT) Prompting
Request step-by-step reasoning rather than direct answers.
"A 58-year-old man presents with progressive dyspnea and bilateral
leg edema. EF is 35%. Think through this step-by-step:
1) What are the key clinical findings?
2) What is the most likely primary diagnosis?
3) What additional workup is needed?
4) What are the initial management priorities?"
When to use: Complex diagnostic reasoning, treatment planning, cases with multiple interacting factors
Evidence: Chain-of-thought prompting allows GPT-4 to mimic clinical reasoning processes while maintaining diagnostic accuracy, improving interpretability (Savage et al., 2024)
CoT is not universally beneficial. Tasks requiring implicit pattern recognition, exception handling, or subtle statistical learning may show reduced performance with CoT prompting. An NEJM AI study found that reasoning-optimized models showed overconfidence and premature commitment to incorrect hypotheses in clinical scenarios requiring flexibility under uncertainty (NEJM AI, 2025).
Practical implication: Use CoT for systematic diagnostic workups; avoid it for gestalt pattern recognition or rapid triage decisions where experienced clinicians rely on intuition.
Structured Clinical Reasoning Prompts
Organize clinical information into predefined categories before requesting analysis.
PATIENT INFORMATION:
- Age/Sex: 45-year-old female
- Chief Complaint: Progressive fatigue x 3 months
HISTORY:
- Duration: 3 months, gradual onset
- Associated: Weight gain, cold intolerance, constipation
- PMH: Type 2 diabetes, hypertension
PHYSICAL EXAM:
- VS: BP 142/88, HR 58, afebrile
- General: Appears fatigued, dry skin, periorbital edema
LABS:
- TSH: 12.4 mIU/L (0.4-4.0)
- Free T4: 0.6 ng/dL (0.8-1.8)
Based on this structured information, provide:
1. Primary diagnosis with reasoning
2. Differential diagnoses to consider
3. Recommended next steps
Evidence: Structured templates that organize clinical information before diagnosis improve LLM diagnostic capabilities compared to unstructured narratives (Sonoda et al., 2024)
Practical Prompting Framework (R-C-T-C-F)
For clinical prompts, include these components:
| Component | Description | Example |
|---|---|---|
| Role | Define the LLM’s expertise level | “You are an internal medicine attending…” |
| Context | Provide relevant background | “…reviewing a case for morning report…” |
| Task | Specify exactly what you need | “…generate a differential diagnosis…” |
| Constraints | Set boundaries and requirements | “…focusing on reversible causes, avoiding rare conditions…” |
| Format | Specify output structure | “…as a numbered list with likelihood estimates.” |
Poor prompt:
"What's wrong with this patient?"
Effective prompt:
"You are an internal medicine attending reviewing a case for
teaching purposes. A 55-year-old woman presents with fatigue,
unintentional weight loss of 15 lbs over 3 months, and new-onset
diabetes. Generate a differential diagnosis focusing on malignancy
and endocrine causes. Format as a numbered list with brief
reasoning for each, ordered by likelihood."
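A small, hypothetical helper makes the R-C-T-C-F framework mechanical; the function name and structure are illustrative, and the assembled prompt still requires the same clinical verification as any other LLM output.

```python
# Hypothetical helper assembling a prompt from the R-C-T-C-F components in the
# table above. The framework is the chapter's; this function is illustrative.
def build_clinical_prompt(role: str, context: str, task: str,
                          constraints: str, output_format: str) -> str:
    return (
        f"You are {role}, {context}. "
        f"{task}, {constraints}. "
        f"Format the response {output_format}."
    )

prompt = build_clinical_prompt(
    role="an internal medicine attending",
    context="reviewing a case for teaching purposes",
    task=("Generate a differential diagnosis for a 55-year-old woman with fatigue, "
          "15 lb unintentional weight loss over 3 months, and new-onset diabetes"),
    constraints="focusing on malignancy and endocrine causes",
    output_format="as a numbered list with brief reasoning for each, ordered by likelihood",
)
print(prompt)
```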
What the Evidence Shows
| Technique | Best Use Case | Evidence Quality | Key Citation |
|---|---|---|---|
| Zero-shot | Simple queries, exploration | Moderate | Baseline in most studies |
| Few-shot | Consistent formatting, specialized domains | Strong | Oniani et al., 2024 |
| Chain-of-thought | Complex reasoning, teaching | Strong (with caveats) | Savage et al., 2024 |
| Structured templates | Diagnostic workups | Moderate | Sonoda et al., 2024 |
Common Prompting Mistakes
- Vague requests: “Analyze this” vs. “Calculate the CHA₂DS₂-VASc score and recommend anticoagulation”
- Missing context: Asking about drug dosing without patient weight, renal function, or indication
- Overloading: Combining multiple complex tasks in one prompt (ask sequentially instead)
- Assuming knowledge: LLMs may not know your institution’s specific protocols or formulary
- Skipping verification: Even excellent prompts produce outputs requiring clinical validation
Further Reading
- Meskó, 2023: Tutorial on prompt engineering for medical professionals (JMIR)
- Zaghir et al., 2024: Scoping review of 114 prompt engineering studies (JMIR)
Part 6: Privacy, HIPAA, and Legal Considerations
The HIPAA Problem
CRITICAL: Public ChatGPT is NOT HIPAA-compliant
Why:
- OpenAI stores conversations
- Conversations may be used for model training (unless opt-out is configured)
- No Business Associate Agreement (BAA) for free/Plus tiers
- Data is transmitted through OpenAI servers
Consequences of HIPAA violation:
- Civil penalties: $137–$68,928 per violation, annual caps up to $2 million per category (2025 inflation-adjusted tiers)
- Criminal penalties: up to $250,000 and 10 years imprisonment (for willful violations)
- Institutional sanctions, career consequences
What NOT to enter into public ChatGPT:
- Patient names, MRNs, DOB, addresses
- Detailed clinical vignettes with rare diagnoses (re-identification possible)
- Protected health information of any kind
HIPAA-compliant alternatives:
- OpenAI for Healthcare (ChatGPT for Healthcare)
- Enterprise product with BAA available (January 2026)
- GPT-5.2 models, evidence retrieval with citations
- Data not used for training, customer-managed encryption
- See Enterprise AI Platforms for details
- ChatGPT Health
- Dedicated health conversation space launched January 2026
- Medical record integration (FHIR), wellness app connections (Apple Health, Function Health, Peloton)
- OpenAI claims HIPAA compliance for dedicated health environment
- Current evidence status: No peer-reviewed studies on clinical effectiveness; company-reported metrics only
- See AI Tools Every Physician Should Know for detailed coverage
- Claude for Healthcare
- Launched January 11, 2026 at JPM Healthcare Conference
- BAA available via AWS Bedrock, Google Cloud, Microsoft Azure
- Zero-training policy: health data siloed and never used for model training
- Enterprise features: native ICD-10, NPI Registry, PubMed integrations; FHIR development; prior authorization workflows
- Consumer features (US Pro/Max): Apple Health, HealthEx medical record sync; test result explanations; appointment prep
- Named partners: Banner Health, Novo Nordisk, Sanofi, AstraZeneca, Flatiron Health, Veeva
- Current evidence status: No peer-reviewed clinical validation; company-reported partner list
- Azure OpenAI Service
- GPT-4/GPT-5 via Microsoft Azure
- BAA available for healthcare customers
- Data not used for training
- Cost: API fees (usage-based)
- Google Cloud Vertex AI
- Med-PaLM 2, PaLM 2
- BAA for healthcare
- Enterprise controls
- Cost: Enterprise licensing
- Epic Integrated LLMs
- Built into EHR workflow
- HIPAA-compliant by design
- Deployment accelerating 2024-2026
- Vendor-specific medical LLMs
- Nuance DAX, Abridge, Glass Health
- BAA with healthcare systems
- Subscription models
Safe practices:
- Use only HIPAA-compliant systems for patient data
- De-identify cases before entering them into public LLMs (but de-identification is imperfect)
- Obtain institutional approval before LLM deployment
- Document patient consent where appropriate
Medical Liability Landscape
Current legal framework (evolving):
Physician remains responsible:
- The LLM is a tool, not a practitioner
- The physician is liable for all clinical decisions
- “The AI told me to” is not a malpractice defense
Standard of care questions:
1. Is a physician negligent for NOT using available LLM tools?
- Currently: no clear standard
- Future: may become expected for documentation efficiency
2. Is a physician negligent for USING an LLM incorrectly?
- Yes: using public ChatGPT for patient data = HIPAA violation
- Yes: following an LLM recommendation without verification that causes harm
- Yes: delegating clinical judgment to the LLM
The reproducibility problem:
LLMs produce different outputs for identical prompts, creating unique liability challenges. Unlike traditional medical software that produces deterministic results (same input always yields same output), LLMs use probabilistic sampling, meaning the same clinical question asked twice may generate different recommendations.
Documentation implications (Maddox et al., 2025):
- If LLM-generated clinical note varies between runs, which version becomes the legal record?
- Peer review of LLM-assisted decisions becomes difficult when outputs aren’t reproducible
- Quality assurance audits cannot validate LLM recommendations after the fact if the system produces different outputs when tested
Defensive documentation strategies:
- Note the specific LLM version and timestamp (e.g., “GPT-4 Turbo via Azure, January 15, 2025, 14:32”)
- Document key LLM outputs verbatim in clinical notes when material to decisions
- Explicitly note verification steps taken (“Differential diagnosis generated by AI, reviewed against UpToDate guidelines, decision based on clinical judgment”)
- Save LLM conversation logs where institutional policy and technical capacity allow
Malpractice insurance:
- Check policy coverage for AI-assisted care
- Some policies may exclude AI-related claims
- Ask explicitly: “Does this policy cover LLM use in clinical practice?” (Missouri Medicine, 2025)
- Notify the insurer of LLM tool use before deployment, not after adverse events
Testing LLM consistency before deployment:
Before adopting any LLM tool for clinical use, test reproducibility:
- Select 10-20 representative clinical prompts (differential diagnosis questions, treatment recommendations, documentation tasks)
- Run each prompt 5 times with identical inputs
- Assess variance: Do outputs differ substantively or only stylistically?
- Document acceptable thresholds: Stylistic variation (word choice) acceptable; factual variation (different drug dosages) unacceptable
- Red flags: If the same prompt yields contradictory recommendations (e.g., “start beta-blocker” vs. “beta-blockers contraindicated”), do not deploy without vendor explanation
For medication dosing, diagnostic recommendations, or high-stakes decisions, reproducibility testing is essential before clinical deployment.
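A minimal harness for the consistency test described above might look like the following. `query_llm` is a placeholder for whatever HIPAA-compliant endpoint the institution has approved, and the exact-match comparison is deliberately crude, so flagged prompts still need side-by-side clinician review.

```python
# Minimal reproducibility harness: run each representative prompt several
# times and flag substantive variation. query_llm is a placeholder for your
# institution's approved LLM endpoint; the normalization here is intentionally
# simple and does not replace clinician review of the outputs.
from collections import Counter

TEST_PROMPTS = [
    "Recommended anticoagulation for nonvalvular AF with CHA2DS2-VASc of 4?",
    "Empiric antibiotic choice for community-acquired pneumonia, no comorbidities?",
]

def query_llm(prompt: str) -> str:
    raise NotImplementedError("Call your institution's approved, HIPAA-compliant LLM endpoint here")

def reproducibility_report(prompts: list[str], runs: int = 5) -> None:
    for prompt in prompts:
        outputs = [query_llm(prompt).strip().lower() for _ in range(runs)]
        distinct = Counter(outputs)
        print(f"PROMPT: {prompt}")
        print(f"  {len(distinct)} distinct outputs across {runs} runs")
        if len(distinct) > 1:
            print("  -> Review manually: stylistic variation may be acceptable;")
            print("     contradictory clinical recommendations are a red flag.")
```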
Risk mitigation:
- Use only validated, HIPAA-compliant systems
- Always verify LLM outputs
- Maintain human oversight for all decisions
- Document verification
- Obtain consent where appropriate
- Monitor for errors continuously
- Test reproducibility before deployment (see Physician AI Liability and Regulatory Compliance for a detailed liability framework)
System Prompts as Clinical Policy
Beyond user prompts (how you ask questions), system prompts define an LLM’s underlying persona and behavioral parameters. These are typically set by vendors or IT departments, not end users, but they significantly affect clinical outputs.
Why this matters for physicians:
A Mount Sinai study tested 20 LLMs across 5 million clinical decisions (ED vignettes and discharge summaries) and found that assigning different “physician personas” (ethical orientation crossed with cognitive style) shifted affirmative action rates from 36.9% to 46.4% under identical clinical evidence. This 9.5 percentage-point swing represents a substantial change in treatment recommendations, autonomy decisions, and resource utilization, without any change in the underlying clinical facts (Klang et al., 2026, preprint).
Practical implications:
System prompts function as policy settings. The same LLM can behave conservatively or liberally depending on how its persona is configured at deployment.
Organizations deploying clinical LLMs should:
- Document the system prompt configuration
- Version control system prompts like any clinical policy
- Audit how persona settings affect outputs in their specific use cases
- Involve clinical leadership in persona selection decisions
Individual physicians should ask:
- “What system prompt or persona is configured for this tool?”
- “Has the organization tested how different configurations affect clinical recommendations?”
- “Who approved the current configuration, and when was it last reviewed?”
The governance gap: Most current LLM deployment guidance focuses on data privacy and verification workflows. System prompt configuration, which can shift clinical action thresholds by nearly 10 percentage points, receives less attention but may require equivalent governance oversight.
Part 7: Vendor Evaluation Framework
Before Adopting an LLM Tool for Clinical Practice
Questions to ask vendors:
- “Is this system HIPAA-compliant? Can you provide a Business Associate Agreement?”
- Essential for any system touching patient data
- No BAA = no patient data entry
- “What is the LLM training data cutoff date?”
- Cutoff dates vary by model and version (check vendor documentation)
- Older cutoff = more outdated medical knowledge
- Models with web search can access current information but still require verification
- “What peer-reviewed validation studies support clinical use?”
- Demand JAMA, NEJM, Nature Medicine publications
- User satisfaction ≠ clinical validation
- Ask for prospective studies, not just retrospective benchmarks
- “What is the hallucination rate for medical content?”
- If vendor can’t quantify, they haven’t tested rigorously
- Accept that hallucinations are unavoidable; question is frequency
- Rates vary dramatically by task: Clinical note summarization shows ~1.5% hallucination rates (Asgari et al., 2025); reference generation in systematic reviews reaches 28-39% (Chelli et al., 2024). RAG-augmented systems can reduce rates to 0-6%
- Stanford Medicine’s ChatEHR reported 0.73 hallucinations + 1.60 inaccuracies per summarization across 23,000 sessions (Shah et al., 2026, manuscript)
- “How does the system handle uncertainty?”
- Good LLMs express appropriate uncertainty (“I’m not certain, but…”)
- Bad LLMs confidently hallucinate when uncertain
- “What verification/oversight mechanisms are built into the workflow?”
- Best systems require physician review before acting on LLM output
- Dangerous systems allow autonomous LLM actions
- “How does this integrate with our EHR?”
- Practical integration essential for adoption
- Clunky workarounds fail
- “What is the cost structure and ROI evidence?”
- Subscription per physician? API usage fees?
- Request time-savings data, physician satisfaction metrics
- “What testing validates consistency of outputs across multiple runs?”
- Ask for reproducibility data: same input, how often does output differ?
- Critical for clinical decisions where consistency matters (dosing, treatment recommendations)
- If vendor hasn’t tested, they haven’t validated for clinical use
- “Does your malpractice insurance explicitly cover LLM use?”
- Many policies exclude AI-related claims or require explicit rider
- Ask insurer directly, don’t rely on vendor assurances
- Request coverage confirmation in writing before deployment
- “Who is liable if LLM output causes patient harm?”
- Most vendors disclaim liability in contracts
- Physician/institution bears risk
- “What data is retained, and can patients opt out?”
- Data retention policies
- Patient consent/opt-out mechanisms
Red Flags (Walk Away If You See These)
- No HIPAA compliance for clinical use (public ChatGPT marketed for medical decisions)
- Claims of “replacing physician judgment” (LLMs assist, don’t replace)
- No prospective clinical validation (only benchmark exam scores)
- Autonomous actions without physician review (medication ordering, diagnosis without oversight)
- Vendor refuses to discuss hallucination rates (hasn’t tested or hiding poor performance)
Part 8: Cost-Benefit Reality
What Does LLM Technology Cost?
Ambient documentation (Nuance DAX, Abridge):
- Cost: ~$369-600/month per physician (varies by contract and volume)
- Benefit: 1 hour/day time savings × $200/hour = $4,000/month
- ROI: positive in 1-3 months
- Non-monetary benefit: reduced burnout, improved work-life balance
GPT-4 API (HIPAA-compliant):
- Cost: ~$0.03 per 1,000 input tokens, $0.06 per 1,000 output tokens
- Typical clinical note: 500 tokens input, 1,000 tokens output = $0.075 per note
- At 20 notes/day: $1.50/day ≈ $30/month (cheaper than a subscription)
- But: requires technical integration and institutional IT support
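The arithmetic behind those estimates is simple enough to script; the rates below mirror the illustrative figures above and will drift as vendor pricing changes.

```python
# Back-of-envelope API cost check using the illustrative rates above
# (prices vary by model and change over time -- check current vendor pricing).
INPUT_RATE = 0.03 / 1000    # dollars per input token
OUTPUT_RATE = 0.06 / 1000   # dollars per output token

def monthly_api_cost(input_tokens: int, output_tokens: int,
                     notes_per_day: int, workdays_per_month: int = 20) -> float:
    per_note = input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE
    return per_note * notes_per_day * workdays_per_month

print(f"${monthly_api_cost(500, 1000, notes_per_day=20):.2f}/month")  # $30.00 at these rates
```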
Glass Health (LLM clinical decision support):
- Cost: free tier available, paid tiers ~$100-300/month
- Benefit: differential diagnosis brainstorming, treatment suggestions
- ROI: unclear; depends on how often you use it for complex cases
Epic LLM integration (message drafting, note summarization):
- Cost: bundled into EHR licensing for institutions
- Benefit: incremental time savings across multiple workflows
Do These Tools Save Money?
Ambient documentation: YES
- 50% time savings is substantial
- Reduced after-hours charting improves physician well-being
- Cost-effective based on time saved
- Caveat: requires a subscription commitment; per-physician cost limits small-practice adoption
API-based documentation assistance: MAYBE
- Much cheaper than subscriptions (~$30/month vs. $400-600/month)
- But requires IT infrastructure and integration effort
- ROI depends on institutional technical capacity
Literature summarization: UNCLEAR
- Time savings are real (10 min to read a guideline vs. 2 min to review an LLM summary)
- But the risk of hallucinations means verification is still required
- Net time savings are modest
Patient education generation: PROBABLY
- Faster than writing from scratch
- But requires physician review
- Best for high-volume needs (discharge instructions, common diagnoses)
Part 9: The Future of Medical LLMs
What’s Coming in the Next 3-5 Years
Likely developments:
- EHR-integrated LLMs become standard
- Epic, Cerner, Oracle already deploying
- Message drafting, note summarization, coding assistance
- HIPAA-compliant by design
- Multimodal medical LLMs
- Text + images + lab data + genomics
- “Show me this rash” + clinical history → differential diagnosis
- Radiology report + imaging → integrated assessment
- Reduced hallucinations
- Retrieval-augmented generation (LLM + medical database lookup)
- Better uncertainty quantification
- Improved factuality through constrained generation
- Prospective clinical validation
- RCTs showing improved outcomes (not just time savings)
- Cost-effectiveness analyses
- Comparative studies (LLM-assisted vs. standard care)
- Regulatory clarity
- FDA guidance on LLM medical devices
- State medical board policies on LLM use
- Malpractice liability precedents
- Open-weight models democratizing global access
- DeepSeek, Llama 3, Mistral offer computational efficiency for resource-constrained settings
- Over 300 hospitals in China deployed DeepSeek since January 2025 for clinical decision support
- Cost: reported at 6.71% of proprietary alternatives (OpenAI o1) with comparable performance on medical benchmarks
- Critical caveat: Deployment at scale without prospective clinical validation raises safety concerns
- See AI and Global Health Equity for implementation guidance
Unlikely (despite hype):
- Fully autonomous diagnosis/treatment
- Too high-stakes for pure LLM decision-making
- Human oversight will remain essential
- Complete elimination of hallucinations
- Fundamental to how LLMs work
- Mitigation, not elimination, is realistic goal
- Replacement of physician-patient relationship
- LLMs assist communication, don’t replace human connection
- Empathy, trust, shared decision-making remain human domains
Modern Clinical Decision Support: Evidence Synthesis Tools
The clinical decision support landscape has shifted dramatically with the emergence of large language model-based tools. While traditional CDS focused on rule-based alerts and drug interaction warnings, modern systems provide evidence synthesis, clinical reasoning support, and real-time guideline retrieval. Adoption has been rapid but evidence of clinical impact remains limited.
Adoption Landscape
OpenEvidence, launched in 2024, reached over 40% daily use among U.S. physicians by late 2025, handling over 8.5 million clinical consultations monthly (OpenEvidence, December 2025). The platform uses large language models to synthesize medical literature, clinical guidelines, and drug databases in response to clinical queries.
Other widely adopted tools include:
- ChatGPT (16% of physicians) - general-purpose LLM used for clinical queries despite not being health-specific
- Abridge (5%) - ambient clinical documentation
- Claude (3%) - general-purpose LLM
- DAX Copilot (2.4%) - Microsoft’s ambient documentation system
- Doximity GPT (1.8%) - clinical decision support embedded in a physician networking platform
Growth trajectory: OpenEvidence grew 2,000%+ year-over-year, adding 65,000 new verified U.S. clinician registrations monthly. This represents the fastest adoption of any physician-facing application in history.
Integration Patterns
EHR-embedded deployment: Health systems have begun integrating evidence synthesis tools as SMART on FHIR apps within Epic, allowing physicians to query evidence without leaving the EHR workflow. This addresses a critical barrier: context switching between EHR and external tools.
Point-of-care use: Unlike traditional CDS that interrupts workflow with alerts, modern tools are pull-based. Clinicians query them when needed rather than receiving unsolicited pop-ups. This reduces alert fatigue but requires clinician initiative.
Evidence Gaps and Adoption Barriers
Despite widespread use, physicians express significant concerns. According to AMA surveys, nearly half of physicians (47%) ranked increased oversight as their top regulatory priority for AI tools, and 87% cited not being held liable for AI model errors as a critical factor for adoption (AMA, February 2025). Key concerns include accuracy and misinformation risk, lack of explainability, and legal liability.
The Stanford-Harvard ARISE assessment (2026): The State of Clinical AI Report reviewed 500+ clinical AI studies and found that while these tools handle millions of consultations monthly, rigorous outcome studies remain rare. Most evaluation relies on user satisfaction and engagement metrics rather than patient outcomes or diagnostic accuracy improvements.
Key concern: These tools often provide synthesized information with high confidence but limited transparency about source quality, evidence strength, or knowledge cutoff dates. When asked about emerging diseases, recent guideline updates, or off-label uses, LLM-based systems may hallucinate references or conflate older evidence with current recommendations.
When Modern CDS Works: Evidence Synthesis at Scale
Successful use case: Clinical guideline synthesis
OpenEvidence excels at queries like “What are the current USPSTF recommendations for colorectal cancer screening in average-risk adults?” where:
- Guidelines are publicly available and well-established
- The question has a clear, documented answer
- Timeliness matters but changes are infrequent
- Synthesis across multiple sources adds value
Problematic use case: Rare disease diagnosis
The same tools struggle with queries like “What’s the differential diagnosis for this constellation of symptoms?” where:
- The medical literature is vast and pattern-matching fails
- Rare presentations require systematic reasoning, not synthesis
- Hallucination risk is high when training data is sparse
- Clinical judgment and experience are essential
Clinical Practice Implications
Hospital and clinic adoption: Healthcare systems increasingly use these tools for:
- Point-of-care guideline lookup (e.g., “What are current AHA recommendations for NSTEMI antiplatelet therapy?”)
- Drug information queries (e.g., “What’s the renal dosing adjustment for vancomycin?”)
- Evidence synthesis for clinical decision-making
Risks in resource-limited settings: When CDS tools are the primary source of clinical guidance (e.g., rural hospitals without on-site specialists, solo practices), hallucinated information or outdated recommendations can propagate without detection. Unlike well-staffed academic centers where multiple physicians cross-check recommendations, single-provider settings lack redundancy.
Lessons from ARISE: Clinician-AI Collaboration Outperforms Replacement
The ARISE report found consistent evidence that AI-assisted care outperforms either AI alone or clinicians alone:
- Radiologists consulting optional AI detected more breast cancers without increasing false positives
- Physicians using AI plus standard resources made better treatment decisions than control groups
However, collaboration is not yet optimized. Deskilling remains a real concern: if clinicians defer to AI recommendations without understanding the reasoning, diagnostic skills atrophy. The optimal collaboration pattern requires:
- AI provides evidence synthesis, not definitive answers
- Clinician integrates context: patient preferences, comorbidities, social determinants
- Uncertainty is explicit: AI indicates confidence and knowledge gaps
- Auditability: Clinician can trace recommendations to source evidence
Comparison to Traditional CDS Failures
Modern evidence synthesis tools avoid some failure modes of traditional CDS:
Epic sepsis model:
- Proprietary, black-box algorithm
- High false positive rate (88%)
- Alert fatigue disaster
- No outcome benefit despite massive deployment
OpenEvidence/modern CDS:
- Pull-based (clinician-initiated) rather than push-based (unsolicited alerts)
- No false positives from unsolicited alerts
- Transparency varies (some cite sources, others don’t)
- Outcome evidence still absent
Shared risk: Both can be deployed at scale without rigorous external validation. Traditional CDS faced regulatory gaps (EHR-embedded tools avoided FDA oversight). Modern CDS faces similar issues: general-purpose LLMs marketed for clinical use aren’t regulated as medical devices.
Open Questions
Outcome measurement: Does AI-assisted clinical decision-making improve patient outcomes? Reduce diagnostic errors? Shorten time to treatment?
Liability: When AI provides incorrect information that a clinician follows, who is liable? The tool vendor? The clinician? The health system?
Equity: Do these tools work equally well for questions about diseases affecting underrepresented populations? For conditions primarily researched in high-income countries?
Knowledge currency: How do these tools handle emerging evidence? COVID-19 revealed that guidelines changed weekly. LLMs trained on historical data can’t capture real-time updates without continuous retraining or retrieval-augmented generation.
For comprehensive vendor evaluation criteria, see Appendix: Vendor Evaluation Framework. For regulatory frameworks, see Evaluating AI Clinical Decision Support Systems.
Patient-Facing AI: Unique Safety Considerations
AI systems increasingly interact directly with patients through chatbots, symptom checkers, health education platforms, and digital assistants. These applications present distinct safety challenges compared to clinician-facing tools because patients lack clinical training to identify AI errors and may act on incorrect information without professional oversight.
The Evidence Gap
The Stanford-Harvard ARISE State of Clinical AI Report (2026) found that patient-facing AI evaluation relies primarily on engagement metrics (user satisfaction, session length, return visits) rather than outcome-focused evidence. Few studies measure whether these tools improve health outcomes, reduce diagnostic errors, or facilitate appropriate care escalation (Stanford Medicine, January 2026).
What’s measured:
- User satisfaction scores (typically 70-85%)
- Engagement rates (time spent, return visits)
- Completion rates for health assessments
What’s rarely measured:
- Diagnostic accuracy for patient-reported symptoms
- Appropriate triage and escalation to human care
- Patient safety outcomes (delayed diagnosis, inappropriate self-treatment)
- Health literacy impact (do users understand AI limitations?)
Risk Categories
1. Misplaced Patient Trust
Patients often cannot distinguish between: - Evidence-based health information - AI-generated plausible-sounding misinformation - General wellness advice vs. medical recommendations requiring professional oversight
Example failure mode: Patient uses symptom checker for chest pain. AI suggests gastroesophageal reflux (most common cause statistically) and recommends antacids. Patient self-treats for 3 days. Actual diagnosis: acute coronary syndrome. Outcome: Delayed treatment, preventable myocardial damage.
Why this happens: LLMs optimize for plausible responses, not safety. Chest pain + young patient + no risk factors → statistically likely to be benign. But rare dangerous causes require different reasoning: maximize safety, not likelihood.
2. Delayed Escalation to Professional Care
Patient-facing AI may inadvertently discourage appropriate care-seeking: - Chatbot provides reassurance for concerning symptoms - AI suggests home remedies for conditions requiring clinical evaluation - User perceives AI response as definitive medical advice
ARISE notes this as a critical gap: “evaluation frameworks must focus on outcomes rather than engagement alone, with particular attention to escalation pathways and safety rails for high-risk scenarios.”
3. Health Literacy and Informed Consent
Patients using AI health tools often don’t understand: - These are not medical devices (most aren’t FDA-regulated) - Recommendations aren’t reviewed by healthcare professionals - AI can hallucinate references, statistics, or treatment guidelines - Knowledge cutoff dates mean recent information may be missing
Current state: Few patient-facing AI systems explicitly disclose limitations, error rates, or when to seek professional care instead. Terms of service often include liability disclaimers buried in legal text.
Case Example: ChatGPT for Health Education
OpenAI announced ChatGPT for Health in January 2026, positioning it as a health education tool. The system answers health questions using GPT-4-level reasoning but without clinical validation or regulatory oversight.
Intended use: General health education, wellness information, understanding diagnoses
Actual use: Patients report using it for: - Self-diagnosis of symptoms - Medication guidance - Treatment decisions (e.g., whether to seek emergency care) - Second opinions on physician recommendations
Safety gap: No evidence that the system can reliably distinguish: - Emergency symptoms requiring immediate care - Serious conditions requiring professional evaluation within days - Self-limiting conditions safe to monitor at home
The system provides confident-sounding responses regardless of uncertainty, creating a false sense of security.
Physician Perspective: Managing Patient AI Use
Common scenario: Patient arrives with printout from ChatGPT suggesting diagnoses and treatments.
Challenges: - Correcting misinformation takes clinical time - Undermines physician-patient trust if AI contradicts physician - Patients may doctor-shop if physician disagrees with AI - Liability concerns if patient follows AI advice instead of medical recommendation
Communication strategies: - Acknowledge patient’s research initiative - Explain AI limitations (hallucinations, lack of individualization) - Review AI recommendations together, correct errors - Emphasize importance of individualized care vs. generic advice - Document patient’s AI use and physician guidance in chart
ARISE Recommendations
The ARISE report calls for:
- Clearer evidence requirements before widespread patient-facing deployment
- Stronger escalation pathways to human clinical oversight for concerning symptoms
- Evaluation frameworks focused on outcomes, not engagement:
- Does AI improve health-seeking behavior for serious conditions?
- Does AI reduce unnecessary ED visits for benign conditions?
- Does AI correctly identify high-risk scenarios requiring immediate care?
- Transparency about limitations:
- Explicit disclaimers about non-medical-device status
- Clear guidance on when to seek professional care
- Disclosure of knowledge cutoff dates and evidence quality
Safety Design Patterns
Pattern 1: Tiered Escalation
User query → AI assessment → Risk stratification (a minimal routing sketch follows this list):
- High risk (chest pain, severe headache, difficulty breathing) → Immediate redirect to 911 / emergency care
- Medium risk (persistent symptoms, worsening condition) → Prompt to schedule clinical visit within 24-48 hours
- Low risk (wellness, general information) → AI provides information with caveat to seek care if symptoms change
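A minimal sketch of the tiered pattern in Python. The red-flag terms and tier boundaries are illustrative assumptions, not a validated triage rule; a production system would use a clinically governed rule set.

```python
# Illustrative tiered-escalation router for a patient-facing assistant.
# The red-flag terms and tiers below are examples only, not a validated
# triage rule.

HIGH_RISK_TERMS = {"chest pain", "difficulty breathing", "severe headache",
                   "slurred speech", "suicidal"}
MEDIUM_RISK_TERMS = {"worsening", "persistent", "blood in"}

def stratify(query: str) -> str:
    q = query.lower()
    if any(term in q for term in HIGH_RISK_TERMS):
        return "high"
    if any(term in q for term in MEDIUM_RISK_TERMS):
        return "medium"
    return "low"

def route(query: str) -> str:
    tier = stratify(query)
    if tier == "high":
        return "Call 911 or go to the nearest emergency department now."
    if tier == "medium":
        return "Please arrange a clinical visit within 24-48 hours."
    # Low risk: educational content plus a standing safety caveat.
    return ("Here is general information. If symptoms change or worsen, "
            "seek professional care.")

print(route("I have crushing chest pain and difficulty breathing"))
```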
Pattern 2: Uncertainty Disclosure
AI response includes (a structured-response sketch follows this list):
- Confidence level (high/medium/low)
- Knowledge gaps ("I don't have information about interactions with your specific medications")
- Explicit limitations ("This is educational information, not medical advice")
- Actionable next steps ("If symptoms worsen or persist beyond X days, seek professional care")
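One way to enforce this pattern is to make the disclosure part of the response object itself, as in the sketch below; the class and field names are hypothetical.

```python
# Structured response envelope so the uncertainty disclosure cannot be dropped.
# Field names are illustrative; the point is that confidence, gaps, and next
# steps travel with the answer instead of living in a footer disclaimer.

from dataclasses import dataclass, field

@dataclass
class HealthInfoResponse:
    answer: str
    confidence: str                       # "high" | "medium" | "low"
    knowledge_gaps: list[str] = field(default_factory=list)
    limitations: str = "This is educational information, not medical advice."
    next_steps: str = "If symptoms worsen or persist, seek professional care."

    def render(self) -> str:
        gaps = "; ".join(self.knowledge_gaps) or "none identified"
        return (f"{self.answer}\n\n"
                f"Confidence: {self.confidence}\n"
                f"Knowledge gaps: {gaps}\n"
                f"{self.limitations} {self.next_steps}")

print(HealthInfoResponse(
    answer="Mild seasonal allergy symptoms are often managed with...",
    confidence="medium",
    knowledge_gaps=["No information about your current medications"],
).render())
```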
Pattern 3: Human-in-the-Loop for High Stakes
For scenarios with potential serious outcomes: - AI flags query as high-risk - Routes to nurse triage line or clinical decision support - Logs interaction for quality review - Does NOT provide definitive AI-generated recommendation without human oversight
Failure Mode: The Safety Theatre Problem
Some patient-facing AI tools include disclaimers like “This is not medical advice” while functionally operating as medical decision support tools. This creates liability protection for vendors while failing to protect patients.
Example: Symptom checker provides differential diagnosis with probabilities, recommends specific tests, suggests when to seek care vs. self-treat. Footer says “Not medical advice - consult your doctor.” Patient acts on recommendations without professional consultation because AI seemed authoritative.
The disconnect: Legal disclaimer contradicts functional design. If tool isn’t meant for medical decision-making, why provide diagnosis and treatment recommendations?
Regulatory Gaps
Most patient-facing health AI isn’t regulated as medical devices because vendors market them as “wellness” or “educational” tools rather than diagnostic systems. This creates a regulatory arbitrage:
- FDA-cleared diagnostic AI (e.g., IDx-DR for diabetic retinopathy): Rigorous clinical validation, performance standards, post-market surveillance
- General wellness AI (most chatbots and symptom checkers): No validation requirements, no performance standards, no adverse event reporting
The distinction depends on marketing claims, not actual use. Patients don’t distinguish between regulated and unregulated tools.
Recommendations for Clinical Practice
When patients use AI health tools:
Ask proactively: “Have you looked up your symptoms online or used any health apps?” This normalizes the discussion
Review AI recommendations together: Don’t dismiss outright; use as teachable moment about evidence-based medicine
Document: Note patient’s AI use and your clinical guidance in chart for liability protection
Educate about escalation: Teach patients red flag symptoms requiring immediate care regardless of AI reassurance
Equity considerations: Does AI work equally well for:
- Limited English proficiency patients?
- Low health literacy populations?
- Patients without reliable internet access or smartphones?
- Conditions affecting underrepresented communities?
For comprehensive safety evaluation frameworks, see Clinical AI Safety and Risk Management. For equity considerations, see Medical Ethics, Bias, and Health Equity.
Part 10: Implementation Guide
Safe LLM Implementation Checklist
Pre-Implementation:
- Confirm HIPAA compliance and a signed BAA before any patient data is involved
- Define appropriate vs. inappropriate use cases (documentation drafts, literature review, education materials with review; not autonomous diagnosis, unverified dosing, or urgent decisions)
- Require user training that emphasizes verification of all outputs
- Pilot with a small group and validate locally before broad deployment
During Use:
- Verify all medical facts, dosages, and citations against authoritative sources
- Keep the physician responsible for every patient-affecting decision
- Never enter identifiable patient data into public, non-BAA tools
- Document AI assistance and physician review where it affects care
Post-Implementation:
- Monitor hallucination and unsupported-claim rates on sampled outputs
- Track usage, user feedback, and downstream action rates
- Reassess cost, time, and revenue impact against the original business case
- Watch for verification-behavior erosion and deskilling over time
Part 11: Institutional LLM Deployment: Real-World Evidence
While individual physicians experiment with LLMs, health systems face a distinct challenge: how to deploy LLMs at institutional scale with appropriate governance, workflow integration, and continuous monitoring. Stanford Medicine’s ChatEHR provides the first large-scale evidence on what institutional LLM deployment actually looks like in practice.
The Stanford ChatEHR Model
Stanford Health Care developed ChatEHR as an institutional capability rather than adopting external vendor solutions, enabling what they term a “build-from-within” strategy (Shah et al., 2026). The platform connects multiple LLMs (OpenAI, Anthropic, Google, Meta, DeepSeek) to the complete longitudinal patient record within the EHR.
Two deployment modes:
| Mode | Description | Use Case |
|---|---|---|
| Automations | Static prompt + data combinations for fixed tasks | Transfer eligibility screening, surgical site infection monitoring, chart abstraction |
| Interactive UI | Chat interface within EHR for open-ended queries | Pre-visit chart review, summarization, clinical questions |
The key insight: Standalone LLM tools (like web-based ChatGPT) create “workflow friction” from manual data entry. Clinical adoption requires real-time access to the longitudinal medical record inside the LLM context window, direct embedding into clinical workflows, and continuous evaluation.
Quantified Hallucination Rates in Production
The most significant contribution of the Stanford deployment is quantified error rates from real-world clinical use, not laboratory benchmarks.
Summarization accuracy (the most common task, 30% of queries):
| Metric | Rate | Definition |
|---|---|---|
| Hallucinations per summary | 0.73 | Statements not found in or supported by the patient record (e.g., mentioning a procedure not documented) |
| Inaccuracies per summary | 1.60 | Statements contradicting information in the record (e.g., reporting lab value of 5.0 when record states 3.5) |
| Total unsupported claims | 2.33 per summary | Combined hallucinations + inaccuracies |
| Summaries with ≤1 error | 50% | Half of all summaries had at most one unsupported claim |
Error types identified: - Temporal sequence errors (misstating when events occurred) - Numeric value confusion (labs, vitals) - Role attribution errors (misstating who performed an action) - “Gestalt” care plan confusion (e.g., conflating pulmonary workup with cardiovascular workup)
Clinical implication: Every LLM-generated clinical summary requires verification. Approximately half of summaries contain more than one error. These are not catastrophic hallucinations in most cases, but inaccuracies that could mislead clinical reasoning if unverified.
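The sketch below shows how these per-summary metrics could be computed from a manually audited sample of sessions; the audit records in it are illustrative, not the Stanford data.

```python
# Computing per-summary error metrics from a manually audited sample, mirroring
# the hallucination/inaccuracy split reported above. The audit records here are
# illustrative placeholders.

audited = [  # each record: reviewer counts for one LLM-generated summary
    {"hallucinations": 1, "inaccuracies": 2},
    {"hallucinations": 0, "inaccuracies": 1},
    {"hallucinations": 2, "inaccuracies": 1},
    {"hallucinations": 0, "inaccuracies": 0},
]

n = len(audited)
hallucinations_per_summary = sum(r["hallucinations"] for r in audited) / n
inaccuracies_per_summary = sum(r["inaccuracies"] for r in audited) / n
unsupported_per_summary = hallucinations_per_summary + inaccuracies_per_summary
share_with_at_most_one_error = sum(
    1 for r in audited if r["hallucinations"] + r["inaccuracies"] <= 1
) / n

print(f"Hallucinations per summary:     {hallucinations_per_summary:.2f}")
print(f"Inaccuracies per summary:       {inaccuracies_per_summary:.2f}")
print(f"Unsupported claims per summary: {unsupported_per_summary:.2f}")
print(f"Summaries with ≤1 error:        {share_with_at_most_one_error:.0%}")
```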
Adoption and Usage Patterns
User training and adoption:
- 1,075 clinicians completed mandatory training video
- 99% reported the training video prepared them for use
- Training emphasized: key features, prompting tips, and the non-negotiable requirement to verify outputs
Usage scale (first 3 months of broad deployment):
| Metric | Value |
|---|---|
| Total sessions | 23,000+ |
| Daily active users | ~100 |
| Tokens processed | 19 billion |
| Sessions using external HIE data | >50% |
| Most common task | Summarization (30%) |
User types: 424 physicians, 180 APPs, 151 residents, and 60 fellows each used the system at least once.
Response times: Most queries returned in under 20 seconds, though complex patient records with large timelines could take up to 50 seconds (with 95%+ of time spent assembling the FHIR record bundle, not LLM inference).
Value Assessment Framework
Stanford developed a structured framework to quantify LLM deployment value across three categories:
| Category | Definition | Example |
|---|---|---|
| Cost savings | Direct monetary reductions | Avoided manual chart review labor |
| Time savings | Decreased time on manual tasks (time repurposed, not eliminated) | 120 charts/day avoided manual review = ~4 hours saved |
| Revenue growth | Incremental revenue from new workflows or improved throughput | Increased transfer throughput freeing beds |
First-year estimated savings: $6 million (conservative, without quantifying the benefit of improved patient care).
Example automation ROI:
| Automation | Time Savings | Revenue Impact |
|---|---|---|
| Transfer eligibility screening | ~$100K labor | $2.4–3.3M (1,700 transfers/year freeing beds) |
| Inpatient hospice identification | ~6,570 hours/year (3 FTEs) | Difficult to quantify |
| Pre-visit chart review | 120 charts/day avoided | Time returned to clinical care |
The UI value proposition: If 100 daily users each run 3 queries and save 10 minutes per query of chart searching, that translates to ~50 hours/day of physician time. At median physician hourly rates, this approximates $2.2 million annually in time savings, against roughly $20,000 in LLM API costs.
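The arithmetic behind that estimate, spelled out below; the hourly rate is an assumed illustrative figure, and the ~$20,000 API cost is treated here as an annual total.

```python
# Back-of-envelope check of the time-savings estimate above.
# The physician hourly rate is an assumed illustrative figure, and the
# ~$20,000 API cost from the text is treated as an annual figure.

daily_users = 100
queries_per_user = 3
minutes_saved_per_query = 10
assumed_hourly_rate_usd = 120      # illustrative median physician rate
llm_api_cost_usd = 20_000          # from the deployment figures above

hours_saved_per_day = daily_users * queries_per_user * minutes_saved_per_query / 60
annual_value_usd = hours_saved_per_day * 365 * assumed_hourly_rate_usd

print(f"Hours saved per day: {hours_saved_per_day:.0f}")          # ~50
print(f"Annual value of time saved: ${annual_value_usd:,.0f}")    # ~$2.2M
print(f"Value-to-API-cost ratio: {annual_value_usd / llm_api_cost_usd:.0f}x")
```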
Governance and Monitoring
Stanford implemented continuous monitoring across three dimensions:
1. System integrity monitoring: - Response times, error codes, timeouts - Token usage and cost tracking - Data retrieval verification (re-extracting benchmark patients to detect upstream changes)
2. Performance monitoring: - In-workflow feedback collection (thumbs up/down) - Task categorization from interaction logs - Unsupported claims rate estimation on sample of sessions - Benchmark dataset maintenance for each automation
3. Impact monitoring: - Action rates (what proportion of flagged patients received recommended follow-up) - Documentation metrics (use of generated text in notes) - Engagement metrics (views, copies, repeat usage)
User feedback rates: ~5% of users provided feedback; of that feedback, two-thirds was positive (thumbs up).
Lessons for Health Systems
What the Stanford deployment demonstrates:
Workflow integration is essential. Copy-paste from standalone tools creates friction that limits adoption. LLMs embedded in EHR workflows see sustained use.
Automations require different governance than interactive use. Predefined prompt + data combinations can be validated against truth sets and monitored systematically. Interactive chat requires task categorization, sampling, and ongoing quality assessment.
Benchmark performance is necessary but insufficient. Stanford used MedHELM (Nature Medicine, 2026) for initial model selection, but real-world monitoring revealed error patterns (temporal confusion, numeric errors) that benchmarks don’t capture.
User training drives safe adoption. Mandatory training video completion before access, combined with community support channels, correlated with high reported preparedness and appropriate skepticism toward outputs.
Value quantification requires structured frameworks. Time savings are “soft” (repurposed, not eliminated) and harder to measure than cost savings or revenue growth. Combining all three provides a more complete picture.
What remains challenging:
- Verification behavior erosion: As users become accustomed to AI-generated content, verification may decline
- Automated fact verification is still maturing (see VeriFact for emerging approaches)
- Value of improved patient care is difficult to quantify
The Vendor-Agnostic Advantage
The “build-from-within” approach provides institutional agency that vendor-dependent deployments do not:
| Factor | Vendor Solution | Institutional Platform |
|---|---|---|
| Model selection | Vendor’s chosen model | Best model for each task |
| Data governance | Vendor’s terms | Institution controls data |
| Customization | Limited to vendor options | Task-specific automations |
| Monitoring | Vendor-provided dashboards | Custom metrics aligned to institutional priorities |
| Cost | Per-user licensing | Per-query API costs (~$0.16/query) |
| Continuity | Vendor business risk | Internal capability persists |
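For intuition on the cost row, a hedged break-even sketch: it assumes a hypothetical $200/month per-user license (the figure quoted in the vendor scenario later in this chapter) against the ~$0.16 per-query API cost above.

```python
# Illustrative break-even between per-user licensing and per-query API costs.
# The $200/month license fee is a hypothetical figure; actual vendor pricing varies.

license_per_user_per_month_usd = 200.0
api_cost_per_query_usd = 0.16            # from the Stanford figures above

breakeven_queries_per_month = license_per_user_per_month_usd / api_cost_per_query_usd
print(f"Queries per user per month at which API spend matches the license: "
      f"{breakeven_queries_per_month:.0f}")   # ~1,250 (roughly 60 per working day)
```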
The strategic argument: LLM deployment as institutional capability compounds value over time as workflows evolve, data quality improves, and internal expertise deepens. Vendor-dependent deployment creates ongoing reliance and limits customization.
Key Takeaways
10 Principles for LLM Use in Medicine
1. LLMs are assistants, not doctors: Always maintain human oversight and final decision-making
2. Hallucinations are unavoidable: Verify all medical facts, never trust blindly
3. HIPAA compliance is non-negotiable: Public ChatGPT is NOT appropriate for patient data
4. Appropriate uses: Documentation drafts, literature review, education materials (with review)
5. Inappropriate uses: Autonomous diagnosis/treatment, medication dosing without verification, urgent decisions
6. Physician remains legally responsible: “AI told me to” is not a malpractice defense
7. Evidence is evolving: USMLE performance ≠ clinical utility; demand prospective RCTs
8. Ambient documentation shows clearest benefit: 50% time savings with high satisfaction
9. Prompting quality matters: Specific, detailed prompts with sourcing requests yield better outputs
10. The future is collaborative: Effective physician-LLM partnership, not replacement
Clinical Scenario: LLM Vendor Evaluation
Scenario: Your Hospital Is Considering Glass Health for Clinical Decision Support
The pitch: Glass Health provides LLM-powered differential diagnosis and treatment suggestions. Marketing claims: - “Physician-level diagnostic accuracy” - “Evidence-based treatment recommendations” - “Saves 20 minutes per complex case” - Cost: $200/month per physician
The CMO asks for your recommendation.
Questions to ask:
1. “What peer-reviewed validation studies support Glass Health?”
   - Request JAMA, Annals, or specialty journal publications
   - User testimonials ≠ clinical validation
2. “Is this HIPAA-compliant? Where is the BAA?”
   - Essential before entering any patient data
3. “What is the hallucination rate?”
   - If the vendor hasn’t quantified it, they haven’t tested properly
4. “How does Glass Health handle diagnostic uncertainty?”
   - Does it express appropriate uncertainty or confidently hallucinate?
5. “What workflow oversight prevents acting on incorrect recommendations?”
   - The best systems require physician review before actions
6. “Can we pilot with 10 physicians before hospital-wide deployment?”
   - Local validation is essential
7. “What happens if a Glass Health recommendation causes harm?”
   - Read the liability disclaimers in the contract
8. “What is the actual time-savings data?”
   - The “20 minutes per complex case” claim: where is the evidence?
Red Flags:
- “Physician-level accuracy” without prospective validation
- No discussion of hallucination rates or error modes
- Marketing emphasizes speed over safety
- No built-in verification mechanisms
Check Your Understanding
Scenario 1: The Medication Dosing Question
Clinical situation: You’re seeing a 4-year-old with otitis media requiring amoxicillin. You ask GPT-4 (via HIPAA-compliant API):
“What is the appropriate amoxicillin dosing for a 4-year-old child with acute otitis media?”
GPT-4 responds: “For acute otitis media in a 4-year-old, amoxicillin dosing is 40-50 mg/kg/day divided into two doses (every 12 hours). For a 15 kg child, this would be 300-375 mg twice daily.”
Question 1: Do you prescribe based on this recommendation?
Answer: No, verify against authoritative source first.
Why:
The LLM response is partially correct but incomplete: - Standard-dose amoxicillin: 40-50 mg/kg/day divided BID (LLM correct) - But: AAP now recommends high-dose amoxicillin (80-90 mg/kg/day divided BID) for most cases of AOM due to increasing S. pneumoniae resistance - LLM likely trained on older guidelines pre-dating high-dose recommendations
Correct dosing (per current AAP guidance): - High-dose: 80-90 mg/kg/day divided BID (first-line for most cases) - For 15 kg child: 600-675 mg BID - Standard-dose: 40-50 mg/kg/day only for select cases (penicillin allergy evaluation, mild infection in low-resistance areas)
What you should do: 1. Check UpToDate, Lexicomp, or AAP guidelines directly 2. Confirm high-dose amoxicillin is indicated 3. Prescribe 600-675 mg BID, not the LLM-suggested 300-375 mg (see the arithmetic check below)
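The weight-based arithmetic behind that corrected dose, as a quick check; it is a math aid only, and verification against a current pharmacy database or guideline is still required.

```python
# Weight-based dosing arithmetic behind the corrected recommendation above.
# For checking the math only; always confirm the regimen against a current
# pharmacy database or guideline before prescribing.

weight_kg = 15
high_dose_mg_kg_day = (80, 90)      # high-dose amoxicillin, divided BID
standard_dose_mg_kg_day = (40, 50)  # standard-dose amoxicillin, divided BID

def per_dose_bid(mg_per_kg_per_day: float, kg: float) -> float:
    return mg_per_kg_per_day * kg / 2   # two divided doses per day

high = [per_dose_bid(d, weight_kg) for d in high_dose_mg_kg_day]
std = [per_dose_bid(d, weight_kg) for d in standard_dose_mg_kg_day]

print(f"High-dose:     {high[0]:.0f}-{high[1]:.0f} mg PO BID")  # 600-675 mg BID
print(f"Standard-dose: {std[0]:.0f}-{std[1]:.0f} mg PO BID")    # 300-375 mg BID (the LLM's answer)
```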
The lesson: LLMs may provide outdated recommendations or miss recent guideline updates. Always verify medication dosing against current pharmacy databases or guidelines.
If you had prescribed LLM dose: - Child receives 50% of intended amoxicillin - Higher risk of treatment failure - Potential for antibiotic resistance development
Scenario 2: The Patient Education Handout
Clinical situation: You’re discharging a patient newly diagnosed with type 2 diabetes. You use GPT-4 to generate patient education handout:
“Create a one-page patient handout for newly diagnosed type 2 diabetes, 8th-grade reading level. Cover: medications, blood sugar monitoring, diet, exercise.”
GPT-4 generates professional-looking handout with sections on metformin, glucometer use, carb counting, and walking recommendations.
Question 2: Can you give this handout to the patient as-is, or do you need to review/edit first?
Answer: MUST review and edit before giving to patient.
Why:
Potential LLM errors to check:
- Medication information:
- Is metformin dosing correct? (LLMs sometimes hallucinate dosages)
- Are side effects accurately described?
- Are contraindications mentioned? (metformin contraindicated in advanced CKD)
- Blood sugar targets:
- ADA guidelines: Fasting <130 mg/dL, postprandial <180 mg/dL for most
- LLM may use outdated targets or not individualize
- Dietary advice:
- Is carb counting explained clearly?
- Any culturally inappropriate diet suggestions?
- Consistent with current ADA nutrition guidelines?
- Exercise recommendations:
- 150 min/week moderate exercise is standard
- Are contraindications mentioned? (avoid exercise if glucose >250 mg/dL + ketones)
- Individualization:
- LLM generates generic handout
- Does it fit THIS patient’s health literacy, culture, comorbidities?
What you should do: 1. Read entire handout for medical accuracy 2. Verify all medical facts (especially medication dosing, blood sugar targets) 3. Customize to patient (add their specific meds if not metformin, adjust for comorbidities) 4. Check cultural appropriateness 5. Review with patient, answer questions
The lesson: LLM-generated patient materials save time but require physician review for accuracy and individualization. Never hand out LLM content without verification.
Time savings: - Writing from scratch: 15-20 minutes - LLM draft + physician review/editing: 5-7 minutes - Net benefit: 10+ minutes saved while maintaining quality
Scenario 3: The Citation Fabrication
Clinical situation: You’re writing a quality improvement manuscript on reducing hospital readmissions. You ask GPT-4:
“Cite 3 recent studies (2020-2023) showing effectiveness of AI for predicting 30-day readmissions in heart failure patients.”
GPT-4 provides:
- “Chen et al. (2022). ‘Machine learning for heart failure readmission prediction.’ JAMA Cardiology 7:445-453.”
- “Rodriguez-Lopez et al. (2021). ‘Deep learning models outperform LACE index for readmission risk.’ Circulation 144:892-901.”
- “Patel et al. (2023). ‘Prospective validation of AI readmission tool in community hospitals.’ New England Journal of Medicine 388:1122-1131.”
Question 3: Can you include these citations in your manuscript?
Answer: NO. You must verify each citation exists and actually supports your claim.
Why:
The LLM likely fabricated some or all of these citations. Here’s how to check:
Step 1: Search PubMed for each citation
For “Chen et al. (2022) JAMA Cardiology”: - Search: "Chen" AND "heart failure readmission" AND "machine learning" AND "JAMA Cardiology" AND 2022 - If found: Read abstract, confirm it supports your claim - If NOT found: Citation is fake
Step 2: Verify journal, volume, pages
Even if an author “Chen” published in JAMA Cardiology in 2022, check: - Is the title correct? - Is the volume/page number correct? - Does the paper actually discuss AI for HF readmissions?
Step 3: Read the actual papers
If citations exist: - Do they support the claim you’re making? - Are study methods sound? - Are conclusions being accurately represented?
Likely outcome: - 1-2 of these 3 citations are completely fabricated - Even if a paper exists, it may not say what you think (or what the LLM implies it says)
What you should do instead (a verification sketch follows these steps):
Search PubMed yourself:
("heart failure" OR "HF") AND ("readmission" OR "rehospitalization") AND ("machine learning" OR "artificial intelligence" OR "AI") AND ("prediction" OR "risk score")Filter: Publication date 2020-2023, Clinical Trial or Review
Read abstracts, select relevant papers
Cite actual papers you’ve read
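If you want to automate the existence check, here is a hedged sketch using the public NCBI E-utilities `esearch` endpoint. It only surfaces candidate records; reading the papers and confirming they support your claim remains your job.

```python
# Sketch: automate the "does this citation exist?" check with NCBI E-utilities.
# Requires the `requests` package. A non-empty result only means candidate
# records exist; you still have to open each paper and confirm the exact
# citation and that it supports your claim.

import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_candidates(keywords: list[str], author: str, year: str) -> list[str]:
    parts = [f'"{kw}"[Title/Abstract]' for kw in keywords]
    parts += [f"{author}[Author]", f"{year}[pdat]"]
    term = " AND ".join(parts)
    resp = requests.get(
        ESEARCH,
        params={"db": "pubmed", "term": term, "retmode": "json", "retmax": 5},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]  # PubMed IDs of candidate matches

pmids = pubmed_candidates(
    ["heart failure", "readmission", "machine learning"], "Chen", "2022"
)
print("Candidate PMIDs:", pmids or "none found (the citation may be fabricated)")
```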
The lesson: Never trust LLM-generated citations. LLMs fabricate references 15-30% of the time. Always verify papers exist and support your claims.
Consequences of using fabricated citations: - Manuscript rejection - If published then discovered: Retraction - Academic dishonesty allegations - Career damage
Time comparison: - LLM citations (unverified): 30 seconds - Manual PubMed search + reading abstracts: 15-20 minutes - Worth the extra time to avoid fabricated references