Large Language Models in Clinical Practice

Keywords

clinical LLM, clinical LLMs, large language models in medicine, GPT-4 medicine, Claude healthcare, medical AI chatbot, LLM hallucinations, HIPAA compliant AI, LLM reproducibility liability

Large Language Models (LLMs) like ChatGPT, GPT-4, Claude, and Med-PaLM represent a fundamentally different paradigm from narrow diagnostic AI. Unlike algorithms trained for single tasks (detect melanoma, predict sepsis), LLMs are general-purpose language systems that can write notes, answer questions, synthesize literature, draft patient education, and assist clinical reasoning. They’re extraordinarily powerful but also uniquely dangerous, capable of generating confident, plausible, but completely false medical information (“hallucinations”). This chapter provides evidence-based guidance for safe, effective clinical use.

Learning Objectives

After reading this chapter, you will be able to:

  • Understand how LLMs work and their fundamental capabilities and limitations in medical contexts
  • Identify appropriate vs. inappropriate clinical use cases based on risk-benefit assessment
  • Recognize and mitigate hallucinations, citation fabrication, and knowledge cutoff problems
  • Navigate privacy (HIPAA), liability, and ethical considerations specific to LLM use in medicine
  • Evaluate medical-specific LLMs (Med-PaLM, GPT-4 medical applications) vs. general-purpose models
  • Implement LLMs safely in clinical workflows with proper oversight and verification protocols
  • Communicate transparently with patients about LLM-assisted care
  • Apply vendor evaluation frameworks before adopting LLM tools for clinical practice

The Clinical Context:

Large Language Models (ChatGPT, GPT-4, Med-PaLM, Claude) have exploded into medical practice since ChatGPT’s public release in November 2022. Unlike narrow diagnostic AI trained for single tasks, LLMs are general-purpose systems that can write clinical notes, answer medical questions, summarize literature, draft patient education materials, generate differential diagnoses, and assist with complex clinical reasoning.

They represent a paradigm shift: AI that communicates in natural language, appears to “understand” medical concepts, and can perform diverse tasks without task-specific training. This versatility makes them extraordinarily useful and extraordinarily dangerous if used incorrectly.

The fundamental challenge: LLMs are statistical language models trained to predict plausible next words, not to retrieve medical truth. They can generate confident, coherent, authoritative-sounding but completely false medical information (“hallucinations”). A physician who trusts LLM output without verification risks patient harm.

Key Applications (and Common Misuses):

  • Ambient clinical documentation: Nuance DAX, Abridge convert conversations to clinical notes, 30-50% time savings validated
  • Literature synthesis and summarization: Summarize guidelines, compare treatment options (with citation verification)
  • Patient education materials: Generate health literacy-appropriate explanations (with physician review)
  • Differential diagnosis brainstorming: Suggest possibilities for complex cases (treat as idea generation, not diagnosis)
  • Medical coding assistance: Suggest ICD-10/CPT codes from clinical narratives (with compliance review)
  • Clinical decision support: Glass Health, other LLM-based systems provide treatment suggestions (requires rigorous verification)
  • Medical education: Explaining concepts, generating practice questions (risk: teaching hallucinated “facts”)
  • Autonomous patient advice: Patients asking LLMs medical questions without physician oversight (dangerous false reassurance)
  • Medication dosing without verification: LLMs fabricate plausible but incorrect dosages
  • Citation generation: LLMs routinely fabricate references to non-existent papers

What Actually Works:

  1. Nuance DAX ambient documentation: 50% reduction in documentation time, 77% physician satisfaction, deployed in 550+ health systems (not FDA-regulated; falls under CDS exemption as documentation tool)
  2. Abridge clinical documentation: 2-minute patient encounter → structured note in 30 seconds, 65% time savings in pilot studies
  3. Literature summarization (with verification): GPT-4/Claude accurately summarize guidelines 85-90% of time when facts are verifiable
  4. Patient education draft generation: Health literacy-appropriate materials in seconds (requires physician fact-checking before distribution)

What Doesn’t Work:

  1. Citation reliability: GPT-4 fabricates 15-30% of medical citations (authors, titles look real but papers don’t exist)
  2. Medication dosing: Multiple reported cases of LLMs suggesting incorrect pediatric dosages, dangerous drug combinations
  3. Autonomous diagnosis: LLMs lack patient-specific data, physical exam findings, cannot replace clinical judgment
  4. Real-time medical knowledge: All LLMs have training data cutoffs, meaning they may be unaware of newer drugs, guidelines, or treatments published after training

Critical Insights:

Hallucinations are unavoidable, not bugs: LLMs predict plausible words, not truth; no amount of training eliminates hallucinations entirely

HIPAA compliance is non-negotiable: Public ChatGPT is NOT HIPAA-compliant; patient data entered is stored, potentially used for training

Physician remains legally responsible: “AI told me to” is not a malpractice defense; all LLM-assisted decisions require verification

Exam performance ≠ clinical utility: GPT-4 scores 86% on the USMLE, but multiple-choice questions don’t test clinical judgment, patient communication, or risk management. When answer patterns are disrupted (the “none of the above” [NOTA] substitution test), LLM accuracy drops 26-38%, suggesting pattern matching rather than genuine reasoning (Bedi et al., 2025)

Ambient documentation shows clearest ROI: 50% time savings + high physician satisfaction = rare AI win-win

Prompting quality matters enormously: Specific, detailed prompts with requests for sourcing and uncertainty yield better outputs than vague questions

Clinical Bottom Line:

LLMs are powerful assistants for documentation, education, and brainstorming, but dangerous if used autonomously for diagnosis, treatment, or urgent decisions.

Safe use requires:
  • HIPAA-compliant systems only (never public ChatGPT for patient data)
  • Always verify medical facts against authoritative sources
  • Treat LLM output as drafts requiring physician review, never final decisions
  • Document verification steps
  • Transparent communication with patients about LLM assistance

Demand evidence:
  • Ask vendors for prospective validation studies (not just retrospective accuracy)
  • Request HIPAA compliance documentation and a Business Associate Agreement (BAA)
  • Validate locally before widespread deployment
  • Monitor continuously for errors, near-misses, and hallucinations

The promise is real (50% documentation time savings), but the risks are serious (hallucinations, privacy violations, liability). Proceed cautiously with proper safeguards.

Medico-Legal Considerations:

  • Physician liability remains unchanged: LLMs are tools, not practitioners; physician responsible for all clinical decisions
  • Standard of care evolving: As LLM use becomes widespread, failing to use available tools may become negligence (but using them incorrectly already is negligence)
  • Lack of reproducibility creates unique liability: LLMs can produce different outputs for identical prompts, complicating documentation, peer review, and quality assurance
  • Documentation requirements: Note LLM assistance where material to decisions, document verification steps, record LLM version and timestamp
  • Testing before deployment: Run reproducibility tests (same prompt 5 times) to assess output variance before clinical use
  • Informed consent emerging: Some institutions now inform patients when LLMs assist documentation or clinical reasoning
  • HIPAA violations carry penalties: $100-$50,000 per violation; entering patient data into public ChatGPT violates HIPAA
  • Malpractice insurance may exclude AI: Check policy coverage explicitly, ask “Does this policy cover LLM use in clinical practice?” before deployment
  • Fabricated citations = academic dishonesty: Using LLM-generated fake references in publications, grant applications is fraud

Essential Reading:

  • Omiye JA et al. (2024). “Large Language Models in Medicine: The Potentials and Pitfalls: A Narrative Review.” Annals of Internal Medicine 177:210-220. (doi:10.7326/M23-2772) [Stanford thorough review covering LLM capabilities, limitations, bias, privacy concerns, and practical clinical applications]

  • Singhal K et al. (2023). “Large language models encode clinical knowledge.” Nature 620:172-180. [Med-PaLM development and validation across medical question-answering benchmarks, including MedQA]

  • Thirunavukarasu AJ et al. (2023). “Large language models in medicine.” Nature Medicine 29:1930-1940. [Comprehensive review of medical LLM capabilities and limitations]

  • Nori H et al. (2023). “Capabilities of GPT-4 on Medical Challenge Problems.” Microsoft Research. [GPT-4 USMLE performance: 86%+]

  • Ayers JW et al. (2023). “Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum.” JAMA Internal Medicine 183:589-596. [LLM vs. physician responses quality comparison]

  • Lee P et al. (2023). “Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine.” New England Journal of Medicine 388:1233-1239. [Clinical use cases and risk assessment]


Introduction: A Paradigm Shift in Medical AI

Every previous chapter in this handbook examines narrow AI: algorithms trained for single, specific tasks.

  • Radiology AI detects pneumonia on chest X-rays (and nothing else)
  • Pathology AI grades prostate cancer histology (and nothing else)
  • Cardiology AI interprets ECGs for arrhythmias (and nothing else)

Large Language Models are fundamentally different: general-purpose systems that perform diverse tasks through natural language interaction.

Ask GPT-4 to summarize a medical guideline and it does. Ask it to draft a patient education handout and it does. Ask it to generate a differential diagnosis for chest pain and it does. No task-specific retraining required.

This versatility is unprecedented in medical AI. It’s also what makes LLMs uniquely dangerous.

A narrow diagnostic AI fails in predictable ways:
  • Pneumonia detection AI applied to a chest X-ray might miss a pneumonia (false negative) or flag normal lungs as abnormal (false positive)
  • Failure modes are bounded by the task

LLMs fail in unbounded ways:
  • Fabricate drug dosages that look correct but cause overdoses
  • Invent medical “facts” that sound authoritative but are false
  • Generate fake citations to real journals (the paper doesn’t exist)
  • Provide confident answers to questions where uncertainty is appropriate
  • Contradict themselves across responses
  • Recommend treatments that were standard of care in the training data but have since been superseded

The clinical analogy: LLMs are like exceptionally well-read medical students who have:
  • Perfect recall of everything they’ve studied
  • No clinical experience
  • No ability to examine patients or access patient-specific data
  • No accountability for errors
  • A tendency to confidently fabricate an answer when they don’t know it

This chapter teaches you to harness LLM capabilities while protecting patients from LLM failures.


Part 1: How LLMs Work (What Physicians Need to Know)

The Technical Basics (Simplified)

Training:
  1. Ingest massive text corpora (internet, books, journals, Wikipedia, Reddit, medical textbooks, PubMed abstracts)
  2. Learn statistical patterns: “Given these words, what word typically comes next?”
  3. Scale to billions of parameters (weights connecting neural network nodes)
  4. Fine-tune with human feedback (reinforcement learning from human preferences)

Inference (when you use it):
  1. You provide a prompt (“Generate a differential diagnosis for acute chest pain in a 45-year-old man”)
  2. The LLM predicts the most likely next word based on learned patterns
  3. It continues predicting words one by one until a stopping criterion is met
  4. It returns the generated text

Crucially:
  • LLMs don’t “look up” facts in a database
  • They don’t “reason” in the logical sense
  • They predict plausible text based on statistical patterns
  • Truth and plausibility are not the same thing
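To make “plausible, not true” concrete, here is a deliberately toy Python sketch of next-word prediction. The probability table is invented for illustration and stands in for the statistics a real LLM learns from its corpus; the “model” simply emits whichever continuation was most frequent, with no notion of clinical correctness.

# Toy next-word predictor: frequency in, "plausible" text out.
# The probability table below is invented for illustration only.
next_word_probs = {
    ("amoxicillin", "dose"): {"is": 0.9, "was": 0.1},
    ("dose", "is"): {"25": 0.5, "40": 0.3, "80": 0.2},   # frequency, not correctness
    ("is", "25"): {"mg/kg/day": 1.0},
    ("is", "40"): {"mg/kg/day": 1.0},
    ("is", "80"): {"mg/kg/day": 1.0},
}

def generate(prompt_words, max_new_words=3):
    """Greedy 'generation': always pick the statistically most common continuation."""
    words = list(prompt_words)
    for _ in range(max_new_words):
        context = tuple(words[-2:])
        options = next_word_probs.get(context)
        if not options:
            break
        # Emit whatever was most frequent in the training text,
        # regardless of whether that number is clinically right.
        words.append(max(options, key=options.get))
    return " ".join(words)

print(generate(["amoxicillin", "dose"]))  # e.g. "amoxicillin dose is 25 mg/kg/day"

A real LLM works over tens of thousands of tokens and billions of parameters, but the failure mode is the same: the output reflects what was common in the training text, not what is correct for a given patient.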

Why Hallucinations Happen

Definition: LLM generates confident, coherent, plausible but factually incorrect text.

Mechanism: The training objective is “predict next plausible word,” not “retrieve correct fact.” When uncertain, LLMs default to generating text that sounds correct rather than admitting uncertainty or refusing to answer.

Medical examples documented in literature:

  1. Fabricated drug dosages:
    • Prompt: “What is the pediatric dosing for amoxicillin?”
    • GPT-3.5 response: “20-40 mg/kg/day divided every 8 hours” (incorrect for many indications; standard is 25-50 mg/kg/day, some indications 80-90 mg/kg/day)
  2. Invented medical facts:
    • Prompt: “What are the contraindications to beta-blockers in heart failure?”
    • LLM includes “NYHA Class II heart failure” (false; beta-blockers are indicated, not contraindicated, in Class II HF)
  3. Fake citations:
    • Prompt: “Cite studies showing benefit of IV acetaminophen for postoperative pain”
    • GPT-4 generates: “Smith et al. (2019) in JAMA Surgery found 40% reduction in opioid use” (paper doesn’t exist; authors, journal, year all fabricated but plausible)
  4. Outdated recommendations:
    • All LLMs have training data cutoffs (check the specific model’s documentation)
    • May recommend drugs withdrawn from market after training
    • Unaware of updated guidelines published post-training

Why this matters clinically: A physician who trusts LLM output without verification risks:
  • Incorrect medication dosing → patient harm
  • Reliance on outdated treatment → suboptimal care
  • Academic dishonesty from fabricated citations → career consequences

Mitigation strategies:
  • Always verify drug information against pharmacy databases (Lexicomp, Micromedex, UpToDate)
  • Cross-check medical facts with authoritative sources (guidelines, textbooks, PubMed)
  • Never trust LLM citations without looking up the actual papers
  • Use LLMs for drafts and idea generation, never final medical decisions
  • Higher stakes = more verification required


Part 2: Major Failure Case Studies, Hallucination Disasters

Case 1: The Fabricated Oncology Protocol

Scenario (reported 2023): Physician asked GPT-4 for dosing protocol for pediatric acute lymphoblastic leukemia (ALL) consolidation therapy.

LLM response: Generated detailed protocol with drug names, dosages, timing that looked professionally formatted and authoritative.

The problem:
  • Methotrexate dose: 50 mg/m² (LLM suggestion) vs. actual protocol: 5 g/m² (a 100x difference)
  • Vincristine timing: weekly (LLM) vs. protocol: every 3 weeks during consolidation
  • Dexamethasone duration: 5 days (LLM) vs. protocol: 28 days

If followed without verification: Patient would have received 1% of intended methotrexate dose (treatment failure, disease progression) and excessive vincristine (neurotoxicity risk).

Why it happened: LLM trained on general medical text, not specialized oncology protocols. Generated plausible-sounding but incorrect regimen by combining fragments from different contexts.

The lesson: Never use LLMs for medication dosing without rigorous verification against authoritative sources (protocol handbooks, institutional guidelines, pharmacy consultation).

Case 2: The Confident Misdiagnosis

Scenario (published case study): Emergency physician used GPT-4 to generate differential diagnosis for “32-year-old woman with sudden-onset severe headache, photophobia, neck stiffness.”

LLM differential:
  1. Migraine (most likely)
  2. Tension headache
  3. Sinusitis
  4. Meningitis (mentioned fourth)
  5. Subarachnoid hemorrhage (mentioned fifth)

The actual diagnosis: Subarachnoid hemorrhage (SAH) from ruptured aneurysm.

The problem: LLM ranked benign diagnoses (migraine, tension headache) above life-threatening emergencies (SAH, meningitis) despite classic “thunderclap headache + meningeal signs” presentation.

Why it happened:
  • Training data bias: migraine is far more common than SAH in text corpora
  • LLMs predict based on frequency in training data, not clinical risk stratification
  • No understanding of the “rule out the worst case first” principle of emergency medicine

The lesson: LLMs don’t triage by clinical urgency or risk. Physician must apply clinical judgment to LLM suggestions.

What the physician did right: Used LLM as brainstorming tool, not autonomous diagnosis. Recognized high-risk presentation and ordered CT + LP appropriately.

Case 3: The Citation Fabrication Scandal

Scenario: Medical student submitted literature review using GPT-4 to generate citations supporting statements about hypertension management.

LLM-generated citations (examples):
  1. “Johnson et al. (2020). ‘Intensive blood pressure control in elderly patients.’ New England Journal of Medicine 383:1825-1835.”
  2. “Patel et al. (2019). ‘Renal outcomes with SGLT2 inhibitors in diabetic hypertension.’ Lancet 394:1119-1128.”

The problem: Neither paper exists. Authors, journals, years, page numbers all plausible but fabricated.

Discovery: Faculty advisor attempted to retrieve papers for detailed review. None found in PubMed, journal archives, or citation databases.

Consequences:
  • Student received a failing grade for academic dishonesty
  • Faculty implemented a “verify all LLM-generated citations” policy
  • The medical school updated its honor code to address AI-assisted writing

Why this matters:
  • Citation fabrication in grant applications = federal research misconduct
  • In publications = retraction, career damage
  • In clinical guidelines = propagation of misinformation

The lesson: Never trust LLM-generated citations. Always verify papers exist and actually support the claims attributed to them.


Part 3: The Success Story, Ambient Clinical Documentation

Nuance DAX: Ambient AI Scribe

The problem DAX solves: Physicians spend 2+ hours per day on documentation, often completing notes after-hours. EHR documentation contributes significantly to burnout.

How DAX works:
  1. Physician wears a microphone during the patient encounter
  2. DAX records the conversation (with patient consent)
  3. An LLM transcribes the speech and converts it to a structured clinical note
  4. The note appears in the EHR for physician review and editing
  5. The physician reviews, makes corrections, and signs the note

Evidence base:

Regulatory status: Not FDA-regulated. Falls under CDS (Clinical Decision Support) exemption per 21st Century Cures Act because it generates documentation drafts, not diagnoses or treatment recommendations, and physicians independently review all output.

Clinical validation: Nuance-sponsored study (2023), 150 physicians, 5,000+ patient encounters:
  • Documentation time reduction: 50% (mean 5.5 min → 2.7 min per encounter)
  • Physician satisfaction: 77% would recommend to colleagues
  • Note quality: no significant difference from physician-written notes (blinded expert review)
  • Error rate: 0.3% factual errors requiring correction (similar to baseline physician error rate in dictation)

Real-world deployment:
  • 550+ health systems
  • 35,000+ clinicians using DAX
  • 85% user retention after 12 months

Cost-benefit:
  • DAX subscription: ~$369-600/month per physician (varies by contract; $700 one-time implementation fee)
  • Time savings: 1 hour/day × $200/hour physician cost = $4,000/month saved
  • ROI: positive in 1-3 months depending on encounter volume

Pricing source: DAX Copilot pricing page, December 2024. Costs vary by volume and contract terms.

Why this works:
  • Well-defined task (transcription + note structuring)
  • Physician review catches errors before note finalization
  • Integration with EHR workflow
  • Patient consent obtained upfront
  • HIPAA-compliant (BAA with healthcare systems)

Limitations:
  • Requires patient consent (some decline)
  • Poor audio quality → transcription errors
  • Complex cases with multiple topics may require substantial editing
  • Subscription cost is a barrier for small practices

Abridge: AI-Powered Medical Conversations

Similar ambient documentation tool with comparable performance:
  • 65% documentation time reduction in pilot studies
  • Focuses on primary care and specialty clinics
  • Generates patient-facing visit summaries automatically

The lesson: When LLMs are used for well-defined tasks with physician oversight and proper integration, they deliver genuine value.


Part 4: Appropriate vs. Inappropriate Clinical Use Cases

SAFE Uses (With Physician Oversight)

1. Clinical Documentation Assistance

Use cases:
  • Draft progress notes from dictation
  • Generate discharge summaries
  • Suggest ICD-10/CPT codes
  • Create procedure notes

Workflow:
  1. Physician provides input (dictation, conversation recording, bullet points)
  2. LLM generates a structured note
  3. Physician reviews every detail, edits errors, adds clinical judgment
  4. Physician signs the final note

The Automation Bias Trap: When Review Becomes Rubber-Stamping

The dangerous reality: As AI accuracy improves, human vigilance drops. Studies show physicians begin “rubber-stamping” AI-generated content after approximately 3 months of successful use (Goddard et al., 2012).

The pattern:
  • Month 1: Physician carefully reviews every word, catches errors
  • Month 3: Physician skims notes, catches obvious errors
  • Month 6: Physician clicks “Sign” with minimal review, trusts the AI
  • Month 12: Errors slip through; patient harm possible

Counter-measures to maintain vigilance:

  1. Spot-check protocol: Verify at least one specific data point per note (e.g., check one lab value, one medication dose, one vital sign against the record)
  2. Rotation strategy: Vary which section you scrutinize each encounter
  3. Red flag awareness: Know the AI’s failure modes (medication names, dosing, dates, rare conditions)
  4. Scheduled deep review: Once weekly, do a line-by-line audit of a randomly selected AI-generated note
  5. Error tracking: Log every error you catch; if catches drop to zero, you may have stopped looking (a minimal tracking sketch follows this list)
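
One way to operationalize counter-measure 5, sketched here in Python under the assumption of a simple local CSV log. The file name, fields, and example entry are illustrative, not part of any vendor product; the point is that a monthly catch rate falling to zero more likely reflects reduced vigilance than a suddenly perfect model.

import csv
from collections import Counter
from datetime import date
from pathlib import Path

LOG = Path("ai_note_error_log.csv")  # illustrative local file name

def log_error(note_id: str, error_type: str, description: str) -> None:
    """Append one caught error (wrong dose, wrong date, etc.) to the log."""
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["date", "note_id", "error_type", "description"])
        writer.writerow([date.today().isoformat(), note_id, error_type, description])

def monthly_catch_counts() -> Counter:
    """Count caught errors per month; a sudden drop to zero should trigger a deep review."""
    counts = Counter()
    if LOG.exists():
        with LOG.open() as f:
            for row in csv.DictReader(f):
                counts[row["date"][:7]] += 1  # key by YYYY-MM
    return counts

# Hypothetical example entry
log_error("note-0042", "medication_dose", "Amoxicillin listed as 300 mg BID; chart says 600 mg BID")
print(monthly_catch_counts())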

The uncomfortable truth: “Physician in the loop” only works if the physician is actually paying attention. The AI doesn’t get tired; you do.

Risk mitigation:
  • Physician remains legally responsible for note content
  • Review catches hallucinations, errors, omissions
  • HIPAA-compliant systems only

Evidence: 50% time savings documented in multiple studies (see DAX above)

2. Literature Synthesis and Summarization

Use cases:
  • Summarize clinical guidelines
  • Compare treatment options from multiple sources
  • Generate literature review outlines
  • Identify relevant studies for research questions

Workflow:
  1. Provide the LLM with a specific question and context
  2. Request a summary with citations
  3. Verify all citations exist and support the claims
  4. Cross-check medical facts against primary sources

Example prompt:

"Summarize the 2023 AHA/ACC guidelines for management
of atrial fibrillation, focusing on anticoagulation
recommendations for patients with CHA2DS2-VASc ≥2.
Include specific drug dosing and monitoring requirements.
Cite specific guideline sections."

Risk mitigation:
  • Verify citations before relying on the summary
  • Cross-check facts with the original guidelines
  • Use as a starting point, not a final analysis

3. Patient Education Materials

Use cases:
  • Explain diagnoses in health literacy-appropriate language
  • Create discharge instructions
  • Draft procedure consent explanations
  • Translate medical jargon to plain language

Workflow:
  1. Specify reading level, key concepts, patient concerns
  2. LLM generates a draft
  3. Physician reviews for medical accuracy
  4. Edits for cultural sensitivity and individual patient factors
  5. Shares with the patient

Example prompt:

"Create a patient handout about type 2 diabetes management
for a patient with 6th grade reading level. Cover: medication
adherence, blood sugar monitoring, dietary changes, exercise.
Use simple language, avoid jargon, 1-page limit."

Risk mitigation:
  • Fact-check all medical information
  • Customize to the individual patient (LLM generates generic content)
  • Consider health literacy and cultural factors

4. Differential Diagnosis Brainstorming

Use cases:
  • Generate possibilities for complex cases
  • Identify rare diagnoses to consider
  • Broaden the differential when stuck

Workflow:
  1. Provide a detailed clinical vignette
  2. Request a differential with reasoning
  3. Treat it as idea generation, not diagnosis
  4. Pursue the appropriate diagnostic workup based on clinical judgment

Example prompt:

"Generate differential diagnosis for 45-year-old woman
with 3 months of progressive dyspnea, dry cough, and
fatigue. Exam: fine bibasilar crackles, no wheezing.
CXR: reticular infiltrates. Consider both common and
rare etiologies. Provide likelihood and key diagnostic
tests for each."

Risk mitigation:
  • The LLM differential is brainstorming, not diagnosis
  • Verify each possibility is clinically plausible for this patient
  • Pursue workup based on pretest probability, not LLM ranking

5. Medical Coding Assistance

Use cases:
  • Suggest ICD-10/CPT codes from clinical notes
  • Identify documentation gaps for proper coding
  • Check code appropriateness

Workflow:
  1. LLM analyzes the clinical note
  2. Suggests codes with reasoning
  3. Coding specialist or physician reviews
  4. Confirms codes match the care delivered and the documentation

Risk mitigation:
  • Compliance review is essential (fraudulent coding = federal offense)
  • Physician confirms codes represent actual care
  • Regular audits of LLM-suggested codes

DANGEROUS Uses (Do NOT Do)

1. Autonomous Patient Advice

Why dangerous:
  • Patients ask LLMs medical questions without physician involvement
  • LLMs provide confident answers regardless of accuracy
  • Patients may delay appropriate care based on false reassurance

Documented harms:
  • A patient with chest pain asked ChatGPT “Is this heartburn or a heart attack?”
  • ChatGPT suggested antacids (without seeing the patient, knowing the history, or performing an exam)
  • The patient delayed the ER visit by 6 hours and presented with a STEMI

The lesson: Patients will use LLMs for medical advice regardless of physician recommendations. Educate patients about limitations, encourage them to contact you rather than rely on AI.

2. Medication Dosing Without Verification

Why dangerous:
  • LLMs fabricate plausible but incorrect dosages
  • Pediatric dosing is especially error-prone
  • Drug interaction checking is unreliable

Documented near-miss:
  • A physician asked GPT-4 for vancomycin dosing in renal failure
  • The LLM suggested a dose appropriate for normal renal function
  • A pharmacist caught the error before administration

The lesson: Never use LLM-generated medication dosing without verification against pharmacy databases, dose calculators, or pharmacist consultation.

3. Urgent or Emergent Clinical Decisions

Why dangerous:
  • Time pressure precludes adequate verification
  • High stakes magnify the consequences of errors
  • Clinical judgment + experience > LLM statistical patterns

The lesson: In emergencies, rely on clinical protocols, expert consultation, established guidelines, not LLM brainstorming.

4. Generating Citations Without Verification

Why dangerous:
  • LLMs fabricate 15-30% of medical citations
  • Using fake references = academic dishonesty, research misconduct
  • Propagates misinformation if not caught

The lesson: Never include LLM-generated citations in manuscripts, grants, presentations without verifying papers exist and support the claims.


Part 6: Vendor Evaluation Framework

Before Adopting an LLM Tool for Clinical Practice

Questions to ask vendors:

  1. “Is this system HIPAA-compliant? Can you provide a Business Associate Agreement?”
    • Essential for any system touching patient data
    • No BAA = no patient data entry
  2. “What is the LLM training data cutoff date?”
    • Cutoff dates vary by model and version (check vendor documentation)
    • Older cutoff = more outdated medical knowledge
    • Models with web search can access current information but still require verification
  3. “What peer-reviewed validation studies support clinical use?”
    • Demand JAMA, NEJM, Nature Medicine publications
    • User satisfaction ≠ clinical validation
    • Ask for prospective studies, not just retrospective benchmarks
  4. “What is the hallucination rate for medical content?”
    • If vendor can’t quantify, they haven’t tested rigorously
    • Accept that hallucinations are unavoidable; question is frequency
  5. “How does the system handle uncertainty?”
    • Good LLMs express appropriate uncertainty (“I’m not certain, but…”)
    • Bad LLMs confidently hallucinate when uncertain
  6. “What verification/oversight mechanisms are built into the workflow?”
    • Best systems require physician review before acting on LLM output
    • Dangerous systems allow autonomous LLM actions
  7. “How does this integrate with our EHR?”
    • Practical integration essential for adoption
    • Clunky workarounds fail
  8. “What is the cost structure and ROI evidence?”
    • Subscription per physician? API usage fees?
    • Request time-savings data, physician satisfaction metrics
  9. “What testing validates consistency of outputs across multiple runs?”
    • Ask for reproducibility data: same input, how often does output differ?
    • Critical for clinical decisions where consistency matters (dosing, treatment recommendations)
    • If the vendor hasn’t tested, they haven’t validated for clinical use (a minimal reproducibility check is sketched after this list)
  10. “Does your malpractice insurance explicitly cover LLM use?”
    • Many policies exclude AI-related claims or require explicit rider
    • Ask insurer directly, don’t rely on vendor assurances
    • Request coverage confirmation in writing before deployment
  11. “Who is liable if LLM output causes patient harm?”
    • Most vendors disclaim liability in contracts
    • Physician/institution bears risk
  12. “What data is retained, and can patients opt out?”
    • Data retention policies
    • Patient consent/opt-out mechanisms
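
To make question 9 concrete, and to mirror the “same prompt five times” test recommended under medico-legal considerations, here is a minimal sketch assuming the OpenAI Python client and an institutionally approved, HIPAA-compliant endpoint. The model name, prompt, and exact-string comparison are illustrative choices, not a validated protocol, and no patient data should be involved.

# A minimal reproducibility check, assuming the OpenAI Python client (pip install openai)
# and an API key in the OPENAI_API_KEY environment variable.
from collections import Counter

from openai import OpenAI

client = OpenAI()
PROMPT = "List first-line antibiotic options for uncomplicated cystitis in a healthy adult woman."
MODEL = "gpt-4o"  # placeholder; use whatever model your institution has approved

def run_repro_test(prompt: str, n: int = 5, temperature: float = 0.0) -> Counter:
    """Send the same prompt n times and count distinct outputs (exact-match comparison)."""
    outputs = []
    for _ in range(n):
        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        outputs.append(response.choices[0].message.content.strip())
    return Counter(outputs)

results = run_repro_test(PROMPT)
print(f"{len(results)} distinct outputs across 5 identical runs")
# Even at temperature 0, outputs can differ; any variance matters for dosing or treatment advice.

Exact string matching is a crude metric; in practice you would also compare the clinical content of the answers, but any variance at all is worth documenting before deployment.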

Red Flags (Walk Away If You See These)

  1. No HIPAA compliance for clinical use (public ChatGPT marketed for medical decisions)
  2. Claims of “replacing physician judgment” (LLMs assist, don’t replace)
  3. No prospective clinical validation (only benchmark exam scores)
  4. Autonomous actions without physician review (medication ordering, diagnosis without oversight)
  5. Vendor refuses to discuss hallucination rates (hasn’t tested or hiding poor performance)

Part 7: Cost-Benefit Reality

What Does LLM Technology Cost?

Ambient documentation (Nuance DAX, Abridge):
  • Cost: ~$369-600/month per physician (varies by contract and volume)
  • Benefit: 1 hour/day time savings × $200/hour = $4,000/month
  • ROI: positive in 1-3 months
  • Non-monetary benefit: reduced burnout, improved work-life balance

GPT-4 API (HIPAA-compliant):
  • Cost: ~$0.03 per 1,000 input tokens, $0.06 per 1,000 output tokens
  • Typical clinical note: 500 tokens input, 1,000 output = $0.075 per note
  • At 20 notes/day: $1.50/day = $30/month (cheaper than a subscription)
  • But: requires technical integration and institutional IT support

Glass Health (LLM clinical decision support):
  • Cost: free tier available, paid tiers ~$100-300/month
  • Benefit: differential diagnosis brainstorming, treatment suggestions
  • ROI: unclear; depends on how often you use it for complex cases

Epic LLM integration (message drafting, note summarization):
  • Cost: bundled into EHR licensing for institutions
  • Benefit: incremental time savings across multiple workflows
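
The per-note API arithmetic above can be sanity-checked in a few lines of Python; the prices and token counts are simply the illustrative figures from this section, not a pricing guarantee.

# Back-of-the-envelope note cost from the per-token prices quoted above
# (illustrative; actual API prices and token counts vary by model and note length).
INPUT_PRICE_PER_1K = 0.03    # USD per 1,000 input tokens
OUTPUT_PRICE_PER_1K = 0.06   # USD per 1,000 output tokens

def note_cost(input_tokens: int = 500, output_tokens: int = 1000) -> float:
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

per_note = note_cost()            # $0.075 per note
per_month = per_note * 20 * 20    # 20 notes/day, ~20 working days/month -> ~$30
print(f"${per_note:.3f} per note, ~${per_month:.0f} per month")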

Do These Tools Save Money?

Ambient documentation: YES
  • 50% time savings is substantial
  • Reduced after-hours charting improves physician well-being
  • Cost-effective based on time saved
  • Caveat: requires a subscription commitment; per-physician cost limits small-practice adoption

API-based documentation assistance: MAYBE
  • Much cheaper than subscriptions (~$30/month vs. $400-600/month)
  • But requires IT infrastructure and integration effort
  • ROI depends on institutional technical capacity

Literature summarization: UNCLEAR
  • Time savings are real (10 min to read a guideline vs. 2 min to review an LLM summary)
  • But the risk of hallucinations means verification is still required
  • Net time savings are modest

Patient education generation: PROBABLY
  • Faster than writing from scratch
  • But requires physician review
  • Best for high-volume needs (discharge instructions, common diagnoses)


Part 8: The Future of Medical LLMs

What’s Coming in the Next 3-5 Years

Likely developments:

  1. EHR-integrated LLMs become standard
    • Epic, Cerner, Oracle already deploying
    • Message drafting, note summarization, coding assistance
    • HIPAA-compliant by design
  2. Multimodal medical LLMs
    • Text + images + lab data + genomics
    • “Show me this rash” + clinical history → differential diagnosis
    • Radiology report + imaging → integrated assessment
  3. Reduced hallucinations
    • Retrieval-augmented generation (LLM + medical database lookup)
    • Better uncertainty quantification
    • Improved factuality through constrained generation
  4. Prospective clinical validation
    • RCTs showing improved outcomes (not just time savings)
    • Cost-effectiveness analyses
    • Comparative studies (LLM-assisted vs. standard care)
  5. Regulatory clarity
    • FDA guidance on LLM medical devices
    • State medical board policies on LLM use
    • Malpractice liability precedents

Unlikely (despite hype):

  1. Fully autonomous diagnosis/treatment
    • Too high-stakes for pure LLM decision-making
    • Human oversight will remain essential
  2. Complete elimination of hallucinations
    • Fundamental to how LLMs work
    • Mitigation, not elimination, is realistic goal
  3. Replacement of physician-patient relationship
    • LLMs assist communication, don’t replace human connection
    • Empathy, trust, shared decision-making remain human domains

Part 9: Implementation Guide

Safe LLM Implementation Checklist

Pre-Implementation:
  ☐ Institutional approval obtained (IT, privacy, compliance, legal)
  ☐ HIPAA compliance verified (BAA in place)
  ☐ Malpractice insurance notified
  ☐ Clear use case defined (documentation, education, brainstorming)
  ☐ Physician training completed (capabilities, limitations, verification)
  ☐ Patient consent process established (if applicable)

During Use:
  ☐ Never enter patient identifiers into public LLMs
  ☐ Always verify medical facts against authoritative sources
  ☐ Treat as a draft/assistant, never an autonomous decision-maker
  ☐ Document verification steps in clinical notes
  ☐ Maintain physician oversight for all decisions

Post-Implementation:
  ☐ Monitor for errors, near-misses, hallucinations
  ☐ Track time savings and physician satisfaction
  ☐ Review a random sample of LLM-assisted notes for quality
  ☐ Update policies based on experience
  ☐ Stay current with evolving regulations


Key Takeaways

10 Principles for LLM Use in Medicine

  1. LLMs are assistants, not doctors: Always maintain human oversight and final decision-making

  2. Hallucinations are unavoidable: Verify all medical facts, never trust blindly

  3. HIPAA compliance is non-negotiable: Public ChatGPT is NOT appropriate for patient data

  4. Appropriate uses: Documentation drafts, literature review, education materials (with review)

  5. Inappropriate uses: Autonomous diagnosis/treatment, medication dosing without verification, urgent decisions

  6. Physician remains legally responsible: “AI told me to” is not a malpractice defense

  7. Evidence is evolving: USMLE performance ≠ clinical utility; demand prospective RCTs

  8. Ambient documentation shows clearest benefit: 50% time savings with high satisfaction

  9. Prompting quality matters: Specific, detailed prompts with sourcing requests yield better outputs

  10. The future is collaborative: Effective physician-LLM partnership, not replacement


Clinical Scenario: LLM Vendor Evaluation

Scenario: Your Hospital Is Considering Glass Health for Clinical Decision Support

The pitch: Glass Health provides LLM-powered differential diagnosis and treatment suggestions. Marketing claims:
  • “Physician-level diagnostic accuracy”
  • “Evidence-based treatment recommendations”
  • “Saves 20 minutes per complex case”
  • Cost: $200/month per physician

The CMO asks for your recommendation.

Questions to ask:

  1. “What peer-reviewed validation studies support Glass Health?”
    • Request JAMA, Annals, specialty journal publications
    • User testimonials ≠ clinical validation
  2. “Is this HIPAA-compliant? Where is the BAA?”
    • Essential for entering patient data
  3. “What is the hallucination rate?”
    • If vendor hasn’t quantified, they haven’t tested properly
  4. “How does Glass Health handle diagnostic uncertainty?”
    • Does it express appropriate uncertainty or confidently hallucinate?
  5. “What workflow oversight prevents acting on incorrect recommendations?”
    • Best systems require physician review before actions
  6. “Can we pilot with 10 physicians before hospital-wide deployment?”
    • Local validation essential
  7. “What happens if Glass Health recommendation causes harm?”
    • Read liability disclaimers in contract
  8. “What is actual time savings data?”
    • “20 minutes per complex case” claim: where’s the evidence?

Red Flags:

  • “Physician-level accuracy” without prospective validation
  • No discussion of hallucination rates or error modes
  • Marketing emphasizes speed over safety
  • No built-in verification mechanisms


Check Your Understanding

Scenario 1: The Medication Dosing Question

Clinical situation: You’re seeing a 4-year-old with otitis media requiring amoxicillin. You ask GPT-4 (via HIPAA-compliant API):

“What is the appropriate amoxicillin dosing for a 4-year-old child with acute otitis media?”

GPT-4 responds: “For acute otitis media in a 4-year-old, amoxicillin dosing is 40-50 mg/kg/day divided into two doses (every 12 hours). For a 15 kg child, this would be 300-375 mg twice daily.”

Question 1: Do you prescribe based on this recommendation?

Click to reveal answer

Answer: No, verify against authoritative source first.

Why:

The LLM response is partially correct but incomplete:
  • Standard-dose amoxicillin: 40-50 mg/kg/day divided BID (the LLM is correct on this)
  • But: the AAP now recommends high-dose amoxicillin (80-90 mg/kg/day divided BID) for most cases of AOM because of increasing S. pneumoniae resistance
  • The LLM was likely trained on older guidance predating the high-dose recommendation

Correct dosing (per current AAP guidance):
  • High-dose: 80-90 mg/kg/day divided BID (first-line for most cases)
  • For a 15 kg child: 600-675 mg BID
  • Standard-dose: 40-50 mg/kg/day only for select cases (penicillin allergy evaluation, mild infection in low-resistance areas)

What you should do:
  1. Check UpToDate, Lexicomp, or AAP guidance directly
  2. Confirm high-dose amoxicillin is indicated
  3. Prescribe roughly 600-675 mg BID (not the LLM-suggested 300-375 mg)
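
The per-dose figures above come from straightforward weight-based arithmetic; a hypothetical helper like this illustrates the calculation but is no substitute for a pharmacy database or dose calculator.

def mg_per_dose(weight_kg: float, mg_per_kg_per_day: float, doses_per_day: int = 2) -> float:
    """Weight-based dosing arithmetic: total daily dose split across doses."""
    return weight_kg * mg_per_kg_per_day / doses_per_day

# High-dose amoxicillin for a 15 kg child, 80-90 mg/kg/day divided BID
low = mg_per_dose(15, 80)   # 600 mg per dose
high = mg_per_dose(15, 90)  # 675 mg per dose
print(f"High-dose range: {low:.0f}-{high:.0f} mg BID")

# The LLM's standard-dose suggestion (40-50 mg/kg/day) gives 300-375 mg BID,
# roughly half the currently recommended dose for most AOM cases.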

The lesson: LLMs may provide outdated recommendations or miss recent guideline updates. Always verify medication dosing against current pharmacy databases or guidelines.

If you had prescribed the LLM dose:
  • The child receives 50% of the intended amoxicillin
  • Higher risk of treatment failure
  • Potential for antibiotic resistance development


Scenario 2: The Patient Education Handout

Clinical situation: You’re discharging a patient newly diagnosed with type 2 diabetes. You use GPT-4 to generate patient education handout:

“Create a one-page patient handout for newly diagnosed type 2 diabetes, 8th-grade reading level. Cover: medications, blood sugar monitoring, diet, exercise.”

GPT-4 generates professional-looking handout with sections on metformin, glucometer use, carb counting, and walking recommendations.

Question 2: Can you give this handout to the patient as-is, or do you need to review/edit first?

Click to reveal answer

Answer: MUST review and edit before giving to patient.

Why:

Potential LLM errors to check:

  1. Medication information:
    • Is metformin dosing correct? (LLMs sometimes hallucinate dosages)
    • Are side effects accurately described?
    • Are contraindications mentioned? (metformin contraindicated in advanced CKD)
  2. Blood sugar targets:
    • ADA guidelines: Fasting <130 mg/dL, postprandial <180 mg/dL for most
    • LLM may use outdated targets or not individualize
  3. Dietary advice:
    • Is carb counting explained clearly?
    • Any culturally inappropriate diet suggestions?
    • Consistent with current ADA nutrition guidelines?
  4. Exercise recommendations:
    • 150 min/week moderate exercise is standard
    • Are contraindications mentioned? (avoid exercise if glucose >250 mg/dL + ketones)
  5. Individualization:
    • LLM generates generic handout
    • Does it fit THIS patient’s health literacy, culture, comorbidities?

What you should do:
  1. Read the entire handout for medical accuracy
  2. Verify all medical facts (especially medication dosing and blood sugar targets)
  3. Customize to the patient (add their specific medications if not metformin, adjust for comorbidities)
  4. Check cultural appropriateness
  5. Review with the patient and answer questions

The lesson: LLM-generated patient materials save time but require physician review for accuracy and individualization. Never hand out LLM content without verification.

Time savings:
  • Writing from scratch: 15-20 minutes
  • LLM draft + physician review/editing: 5-7 minutes
  • Net benefit: 10+ minutes saved while maintaining quality


Scenario 3: The Citation Fabrication

Clinical situation: You’re writing a quality improvement manuscript on reducing hospital readmissions. You ask GPT-4:

“Cite 3 recent studies (2020-2023) showing effectiveness of AI for predicting 30-day readmissions in heart failure patients.”

GPT-4 provides:

  1. “Chen et al. (2022). ‘Machine learning for heart failure readmission prediction.’ JAMA Cardiology 7:445-453.”
  2. “Rodriguez-Lopez et al. (2021). ‘Deep learning models outperform LACE index for readmission risk.’ Circulation 144:892-901.”
  3. “Patel et al. (2023). ‘Prospective validation of AI readmission tool in community hospitals.’ New England Journal of Medicine 388:1122-1131.”

Question 3: Can you include these citations in your manuscript?

Click to reveal answer

Answer: NO. You must verify each citation exists and actually supports your claim.

Why:

The LLM likely fabricated some or all of these citations. Here’s how to check:

Step 1: Search PubMed for each citation

For “Chen et al. (2022) JAMA Cardiology”:
  • Search: "Chen" AND "heart failure readmission" AND "machine learning" AND "JAMA Cardiology" AND 2022
  • If found: read the abstract and confirm it supports your claim
  • If NOT found: the citation is fake
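
If you verify citations frequently, the existence check in Step 1 can be scripted against NCBI’s public E-utilities API. The sketch below is a rough screen, not proof: the title strings are the suspect citations from this scenario, a zero-hit result may simply reflect a paraphrased title or indexing lag, and a hit still requires reading the actual paper.

import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_hits(title: str) -> int:
    """Return how many PubMed records match a quoted title search."""
    params = urlencode({"db": "pubmed", "term": f'"{title}"[Title]', "retmode": "json", "retmax": 0})
    with urlopen(f"{ESEARCH}?{params}") as resp:
        data = json.load(resp)
    return int(data["esearchresult"]["count"])

# Titles taken from the suspect LLM-generated citations in this scenario
suspect_titles = [
    "Machine learning for heart failure readmission prediction",
    "Deep learning models outperform LACE index for readmission risk",
]
for title in suspect_titles:
    hits = pubmed_hits(title)
    flag = "VERIFY MANUALLY" if hits == 0 else f"{hits} candidate record(s); still read the paper"
    print(f"{title!r}: {flag}")

Whatever the script reports, Steps 2 and 3 below remain mandatory: confirm the bibliographic details and read the paper itself.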

Step 2: Verify journal, volume, pages

Even if an author “Chen” published in JAMA Cardiology in 2022, check:
  • Is the title correct?
  • Is the volume/page number correct?
  • Does the paper actually discuss AI for HF readmissions?

Step 3: Read the actual papers

If the citations exist:
  • Do they support the claim you’re making?
  • Are the study methods sound?
  • Are the conclusions accurately represented?

Likely outcome:
  • 1-2 of these 3 citations are completely fabricated
  • Even if a paper exists, it may not say what you think (or what the LLM suggests)

What you should do instead:

  1. Search PubMed yourself: ("heart failure" OR "HF") AND ("readmission" OR "rehospitalization") AND ("machine learning" OR "artificial intelligence" OR "AI") AND ("prediction" OR "risk score")

  2. Filter: Publication date 2020-2023, Clinical Trial or Review

  3. Read abstracts, select relevant papers

  4. Cite actual papers you’ve read

The lesson: Never trust LLM-generated citations. LLMs fabricate references 15-30% of the time. Always verify papers exist and support your claims.

Consequences of using fabricated citations:
  • Manuscript rejection
  • If published and then discovered: retraction
  • Academic dishonesty allegations
  • Career damage

Time comparison:
  • LLM citations (unverified): 30 seconds
  • Manual PubMed search + reading abstracts: 15-20 minutes
  • Worth the extra time to avoid fabricated references


References