25  Large Language Models in Clinical Practice

Tip: Learning Objectives

Large Language Models (LLMs) like ChatGPT, GPT-4, and Med-PaLM represent a new paradigm in medical AI. This chapter provides evidence-based guidance for safe, effective clinical use. You will learn to:

  • Understand how LLMs work and their fundamental capabilities/limitations
  • Identify appropriate vs. inappropriate clinical use cases
  • Recognize and mitigate hallucinations and errors
  • Navigate privacy, liability, and ethical considerations
  • Evaluate medical-specific LLMs (Med-PaLM, GPT-4 medical applications)
  • Implement LLMs safely in clinical workflows
  • Communicate with patients about LLM-assisted care

Essential for all physicians considering LLM use in practice.

The Clinical Context: Large Language Models (ChatGPT, GPT-4, Med-PaLM) have exploded into medical practice since late 2022. Unlike narrow diagnostic AI, LLMs are general-purpose systems that can write notes, answer questions, summarize literature, draft patient education, and assist clinical reasoning. They’re powerful but dangerous if used incorrectly—capable of generating confident, plausible, but completely false medical information (“hallucinations”).

What Are LLMs?

Technical: LLMs are neural networks trained on massive text corpora (internet, books, journals) to predict the next word. They learn statistical patterns in language, not truth or medical knowledge per se.

Clinical analogy: Like a medical student who has read everything but has:
  • No clinical experience
  • No ability to examine patients
  • No access to patient-specific data
  • No accountability for errors
  • Perfect memory but imperfect understanding
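
To make "predicts next words" concrete, the minimal sketch below inspects a model's next-token probabilities. The choice of the small open GPT-2 model and the Hugging Face transformers library is purely illustrative (the chapter does not prescribe any implementation); clinical-scale LLMs are far larger, but the core operation is the same: ranking plausible continuations, with no lookup of facts.

```python
# Minimal sketch: next-token prediction, the core operation behind LLMs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small open model used purely for illustration; clinical LLMs are far larger.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The first-line treatment for uncomplicated hypertension is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # one score per vocabulary token, per position

# Probabilities for the NEXT token only: the model ranks plausible continuations;
# it does not look anything up and performs no fact-checking.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(token_id)])!r:>20}  p={float(prob):.3f}")
```

Whatever continuation ranks highest is what the model emits, whether or not it is medically correct, which is the mechanical root of the hallucination problem discussed below.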

Key LLMs in Medicine (2024-2025):

GPT-4 (OpenAI):
  • General-purpose LLM, not medical-specific
  • Strong performance on USMLE-style questions (86%+, well above the passing threshold)
  • Widely accessible (ChatGPT Plus, API)
  • No HIPAA compliance for public ChatGPT

Med-PaLM 2 (Google):
  • Medical-specific LLM
  • 86.5% on MedQA (medical licensing questions)
  • Better medical accuracy than GPT-4 on some benchmarks (Singhal et al. 2023)
  • Not publicly available (research/enterprise only)

Claude (Anthropic):
  • General LLM with strong reasoning
  • Constitutional AI (safety-focused training)
  • Healthcare-specific enterprise offerings

Commercial Medical LLMs:
  • Glass Health (clinical decision support)
  • Nabla Copilot (medical documentation)
  • Various EHR vendors integrating LLMs

What LLMs Can Do Well:

  ✅ Literature synthesis: Summarizing research papers, guidelines
  ✅ Patient education materials: Generating health literacy-appropriate explanations
  ✅ Documentation assistance: Drafting clinical notes (with physician review)
  ✅ Differential diagnosis brainstorming: Generating possibilities for complex cases
  ✅ Medical coding: Suggesting ICD-10/CPT codes from clinical notes
  ✅ Language translation: Medical terminology across languages
  ✅ Teaching/explaining concepts: Medical education, CME content

What LLMs CANNOT Do (Critical Limitations):

  ❌ Access real-time patient data: No connection to EHR, labs, imaging
  ❌ Examine patients: No physical exam, vital signs, clinical gestalt
  ❌ Provide reliable citations: Often fabricates references
  ❌ Avoid hallucinations: Confidently generates plausible but false information
  ❌ Update in real time: Training data has a cutoff date (knowledge gaps for new information)
  ❌ Take responsibility: No medical license, no liability, no accountability
  ❌ Replace physician judgment: Context, nuance, patient preferences require human input

The Hallucination Problem (CRITICAL):

Definition: LLMs generate confident, coherent, plausible but factually incorrect text

Medical examples:
  • Fabricating drug dosages that look correct but are wrong
  • Inventing medical “facts” that sound authoritative
  • Creating fake citations to real journals (title and authors look real, but the paper doesn’t exist)
  • Contradicting itself between responses
  • Providing outdated treatment recommendations

Why hallucinations occur: LLMs predict plausible next words, not truth. No internal fact-checking, no database lookup, no uncertainty quantification.

Clinical danger: Physician trusts LLM output without verification → patient harm

Mitigation strategies:
  • Always verify medical facts against authoritative sources
  • Cross-check drug information with pharmacy databases
  • Validate citations (LLMs commonly fabricate references)
  • Use LLMs for drafts/ideas, never final medical decisions
  • Higher stakes = more verification required
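
Citation validation in particular can be partly automated. The sketch below is one hedged illustration: it queries the public NCBI E-utilities endpoint to check whether a title an LLM cited exists in PubMed at all. The helper name and exact query format are assumptions for illustration, not a vetted tool, and a match still does not prove the authors, journal, or year are correct.

```python
# Illustrative check for fabricated references: does the cited title exist in PubMed?
import requests

def pubmed_hit_count(citation_title: str) -> int:
    """Return how many PubMed records match the quoted article title."""
    resp = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
        params={
            "db": "pubmed",
            "term": f'"{citation_title}"[Title]',
            "retmode": "json",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return int(resp.json()["esearchresult"]["count"])

title = "Large language models encode clinical knowledge"  # Singhal et al. 2023
if pubmed_hit_count(title) == 0:
    print("No PubMed match: treat this citation as likely fabricated.")
else:
    print("Title found in PubMed: still verify authors, journal, and year.")
```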

Appropriate Clinical Use Cases:

✅ SAFE - With Physician Oversight:

1. Documentation Assistance:
  • Draft progress notes from physician dictation
  • Generate discharge summaries (physician reviews/edits)
  • Suggest ICD-10/CPT codes
  • Workflow: LLM drafts → Physician reviews, edits, verifies → Signs note (see the workflow sketch after this list)
  • Risk mitigation: Physician remains responsible, reviews every detail

2. Literature Synthesis:
  • Summarize recent guidelines
  • Compare treatment options from multiple sources
  • Generate literature review drafts
  • Risk mitigation: Verify citations, cross-check facts, use as a starting point

3. Patient Education:
  • Draft explanations of diagnoses and procedures
  • Create health literacy-appropriate materials
  • Translate medical jargon into plain language
  • Risk mitigation: Physician reviews for accuracy before sharing with patients

4. Clinical Reasoning Support:
  • Generate differential diagnoses for complex cases
  • Suggest diagnostic workup considerations
  • Identify potential drug interactions
  • Risk mitigation: Treat as a brainstorming tool, verify all suggestions, physician makes final decisions

5. Medical Coding Assistance:
  • Suggest appropriate codes from clinical notes
  • Identify documentation gaps for coding
  • Risk mitigation: Compliance review, physician confirms codes match the care delivered

6. Administrative Tasks:
  • Draft prior authorization letters
  • Generate referral summaries
  • Create patient handouts
  • Risk mitigation: Review for accuracy and completeness
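
As a concrete illustration of the documentation workflow in use case 1 (LLM drafts → physician reviews, edits, verifies → signs), the sketch below models that gate in code. The class and function names are hypothetical, and a real deployment would call your institution's approved, HIPAA-compliant LLM endpoint; the point is simply that nothing is filed until a named physician has reviewed and signed the draft.

```python
# Hypothetical sketch of a draft-then-sign documentation workflow (Python 3.10+).
from dataclasses import dataclass
from datetime import datetime

@dataclass
class DraftNote:
    encounter_id: str
    llm_draft: str                      # text produced by an approved LLM endpoint
    reviewed_by: str | None = None
    edits_made: bool = False
    signed_at: datetime | None = None

    @property
    def filable(self) -> bool:
        # A draft is never filed to the chart until a physician has reviewed and signed it.
        return self.reviewed_by is not None and self.signed_at is not None

def physician_review(note: DraftNote, physician: str, edited_text: str | None = None) -> DraftNote:
    """Physician reads the whole draft, corrects it, and takes responsibility by signing."""
    if edited_text is not None:
        note.llm_draft = edited_text
        note.edits_made = True
    note.reviewed_by = physician
    note.signed_at = datetime.now()
    return note
```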

❌ DANGEROUS - Do NOT Do:

1. Autonomous patient advice:
  • Patients asking LLMs medical questions without physician involvement
  • Risk: Hallucinations, outdated information, missing context

2. Medication dosing without verification:
  • LLMs can generate plausible but incorrect dosages
  • Risk: Overdose, underdose, missed contraindications

3. Urgent/emergent decisions:
  • Time-sensitive clinical decisions without verification
  • Risk: Delays or errors in critical care

4. Replacing specialist consultation:
  • Complex cases requiring expert judgment
  • Risk: Missing nuances, specialized knowledge

5. Generating citations without checking:
  • LLMs fabricate plausible-looking references
  • Risk: Academic dishonesty, spreading misinformation

6. Diagnosis without examination:
  • LLMs lack patient-specific data and physical exam findings
  • Risk: Misdiagnosis, missed critical findings

Privacy and HIPAA Considerations:

⚠️ CRITICAL: Public ChatGPT is NOT HIPAA-compliant

Public LLMs (ChatGPT, Claude, etc.):
  • Data may be stored and used for training
  • No Business Associate Agreement (BAA)
  • NEVER enter patient identifiers (names, MRNs, DOB, SSN)
  • De-identification required but still risky (re-identification possible with detailed cases)

HIPAA-compliant alternatives:
  • GPT-4 API with BAA (enterprise agreements)
  • Azure OpenAI Service (healthcare tier)
  • Google Cloud Vertex AI with BAA
  • Vendor-specific medical LLMs with BAA
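
For illustration, the sketch below shows what routing a request through a BAA-covered Azure OpenAI deployment (rather than public ChatGPT) can look like with the openai Python SDK. The endpoint, environment-variable names, deployment name, and API version are placeholders specific to whatever contract your institution actually holds; confirm with your privacy officer that the BAA covers the service before any clinical text is sent.

```python
# Sketch of calling a BAA-covered Azure OpenAI deployment (placeholders throughout).
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://<your-resource>.openai.azure.com
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="your-gpt4-deployment-name",  # the deployment your institution provisioned
    messages=[
        {"role": "system", "content": "You draft patient-education text for physician review."},
        {"role": "user", "content": "Explain, at a 6th-grade reading level, what an HbA1c test measures."},
    ],
)
print(response.choices[0].message.content)
```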

Safe practices:
  • Use only HIPAA-compliant systems for patient data
  • De-identify cases thoroughly
  • Obtain institutional approval
  • Document consent where appropriate
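
De-identification deserves particular care. The sketch below is a deliberately incomplete illustration of a first-pass scrubber: regular expressions catch obvious identifiers such as MRNs, dates, phone numbers, and SSNs, but miss names, locations, and contextual clues, which is exactly why detailed cases can remain re-identifiable. Treat it as a starting point under those stated assumptions, not as HIPAA Safe Harbor compliance.

```python
# Illustrative first-pass de-identification; NOT sufficient on its own.
import re

PATTERNS = {
    "[MRN]":   re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
    "[DATE]":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "[PHONE]": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "[SSN]":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace obvious identifiers with placeholder tags before any LLM prompt."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(tag, text)
    return text

note = "MRN: 00482913. Seen 03/14/2025 for chest pain; callback 555-867-5309."
print(scrub(note))
# -> "[MRN]. Seen [DATE] for chest pain; callback [PHONE]."
```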

Evidence Base for Medical LLMs:

Performance on Medical Licensing Exams:

GPT-4:
  • USMLE Step 1: 86%+ (passing score ~60%)
  • USMLE Step 2 CK: 86%+
  • USMLE Step 3: 86%+
  • Caveat: Multiple choice ≠ clinical practice

Med-PaLM 2:
  • MedQA (USMLE-style): 86.5%
  • Outperforms physicians on some benchmarks
  • Better calibration (knows when uncertain) than GPT-4 (Singhal et al. 2023)

Clinical Reasoning Tasks:

Mixed results:
  • Good at pattern matching and recall
  • Struggles with complex multi-step reasoning
  • Lacks clinical judgment and gestalt
  • Overconfident (doesn’t express uncertainty well)

Prospective Clinical Validation:

Limited data:
  • Most studies: retrospective chart review or simulated cases
  • Few prospective real-world clinical deployments
  • No RCTs showing improved patient outcomes
  • Evidence gap: Performance on exams ≠ clinical utility

Documentation Assistance:

Promising early evidence:
  • High physician satisfaction
  • Time savings of 30-50%
  • Quality concerns remain (accuracy, completeness)
  • Ongoing studies

Medical-Specific LLM Enhancements:

Med-PaLM (Google):
  • Fine-tuned on medical text
  • Better medical terminology understanding
  • Improved accuracy on medical questions
  • Status: Research/enterprise only

Clinical BERT/BioBERT:
  • Specialized for biomedical text understanding
  • Used for information extraction from notes
  • Not general conversational AI

Vendor Implementations:

Glass Health:
  • LLM-powered clinical decision support
  • Generates differential diagnoses and treatment plans
  • Physician review required
  • Evidence: User satisfaction; limited clinical validation

Nabla Copilot:
  • Medical documentation assistant
  • Ambient listening + LLM note generation
  • Evidence: Time savings, user satisfaction

Epic (LLM integration):
  • Message drafting, note summarization
  • Rolling out to health systems
  • Evidence: Early deployment; validation ongoing

Prompt Engineering for Medical Use:

Effective prompting improves output quality:

✅ Good prompts:
  • Specific, detailed clinical scenarios
  • Request sourcing (“cite guidelines”)
  • Ask for a differential diagnosis, not a definitive diagnosis
  • Request uncertainty (“what are you uncertain about?”)

Example:

"Generate a differential diagnosis for a 45-year-old man
with acute chest pain, considering both cardiac and
non-cardiac causes. Include likelihood and key
differentiating features for each."

❌ Poor prompts:
  • Vague (“tell me about chest pain”)
  • Requesting a definitive diagnosis without full information
  • No request for reasoning or uncertainty
  • Treating the LLM as an oracle rather than an assistant

Iterative refinement:
  • Follow-up questions clarify and narrow focus
  • Request explanations for suggestions
  • Ask the LLM to critique its own reasoning
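
One way to operationalize these prompting principles is a small reusable template. The function below is an illustrative sketch only; the exact wording of the system and user messages is an assumption, not a validated prompt, but it encodes the habits above: ask for a differential rather than a diagnosis, request sourcing, and require an explicit statement of uncertainty.

```python
# Sketch of a reusable prompt template that requests a differential, sources, and uncertainty.
def build_ddx_messages(age: int, sex: str, presentation: str) -> list[dict]:
    system = (
        "You are a clinical reasoning assistant supporting a licensed physician. "
        "Offer a differential diagnosis, never a definitive diagnosis. "
        "Name the guideline or source each suggestion relies on, and state explicitly "
        "what you are uncertain about or what data are missing."
    )
    user = (
        f"Patient: {age}-year-old {sex}. Presentation: {presentation}. "
        "List a differential with approximate likelihood and the key differentiating "
        "features or next tests for each item."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_ddx_messages(45, "man", "acute chest pain, non-exertional, no ECG yet")
# Pass `messages` to whichever chat-completion endpoint your institution has approved,
# then verify every suggestion before it influences care.
```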

Limitations and Failure Modes:

1. Knowledge Cutoff:
  • Training data ends at a specific date
  • New drugs, guidelines, and treatments are unknown
  • Example: An LLM is unaware of 2024 guidelines published after its training

2. Reasoning Failures:
  • Appears logical but reaches wrong conclusions
  • Misapplies guidelines to specific cases
  • Confuses similar conditions

3. Statistical Bias:
  • Reflects biases in training data
  • May perpetuate healthcare disparities
  • Underrepresentation of rare diseases and diverse populations

4. Context Window Limits:
  • Can only “remember” the recent conversation
  • Loses context in long exchanges
  • May contradict earlier statements
  • See the token-count sketch after this list

5. Inability to Say “I Don’t Know”:
  • Tends to generate a plausible answer even when uncertain
  • Rarely expresses appropriate uncertainty
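
For the context-window limitation specifically, it is easy to check input length before sending a long document. The sketch below uses the tiktoken tokenizer; the 128,000-token limit shown is an assumption (the advertised GPT-4 Turbo context size), so substitute the limit of whichever model and deployment you actually use.

```python
# Sketch of checking input length against a model's context window before submission.
import tiktoken

def fits_in_context(text: str, limit_tokens: int = 128_000, reserve_for_reply: int = 4_000) -> bool:
    """Return True if the prompt plus a reserved reply budget fits in the context window."""
    enc = tiktoken.encoding_for_model("gpt-4")
    n_tokens = len(enc.encode(text))
    print(f"Prompt is {n_tokens} tokens.")
    return n_tokens + reserve_for_reply <= limit_tokens

guideline_text = "Chest pain evaluation guideline text ... " * 10_000  # stand-in for a long document
if not fits_in_context(guideline_text):
    print("Too long: split or summarize in sections, or the model will lose context.")
```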

Medical Liability Considerations:

Current Legal Landscape (evolving):

Physician remains responsible:
  • The LLM is a tool, not a practitioner
  • The physician is liable for all clinical decisions
  • “AI told me to” is not a defense

Standard of care questions:
  • Is a physician negligent for NOT using an LLM if available?
  • Is a physician negligent for USING an LLM incorrectly?
  • Currently unclear, with state-by-state variation

Documentation requirements:
  • Document LLM use where material to decisions
  • Document verification of LLM outputs
  • Informed consent for LLM-assisted care (emerging practice)

Malpractice insurance:
  • Check coverage for AI-assisted care
  • Some policies may exclude or limit it
  • Notify your insurer of AI tool use

Risk mitigation strategies:
  • Use only validated, HIPAA-compliant systems
  • Always verify LLM outputs
  • Maintain human oversight and final decision-making
  • Document verification steps
  • Obtain appropriate consents
  • Stay informed on evolving regulations

Ethical Considerations:

Transparency:
  • Should patients be told when LLMs assisted their care?
  • Emerging consensus: Yes, transparency builds trust
  • Analogy: Disclosing the use of other assistive technologies

Equity:
  • LLM performance may vary by demographics
  • Training data biases → biased outputs
  • Access disparities (who can afford LLM tools?)

Autonomy:
  • Patients should have the option to decline LLM-assisted care
  • Respect patient preferences

Quality:
  • Benefit (efficiency) vs. risk (errors)
  • When do benefits outweigh risks?
  • Continuous monitoring is essential

Professional integrity:
  • Is LLM use consistent with professionalism?
  • Does it enhance or diminish the physician-patient relationship?

Practical Implementation Guide:

Important: Safe LLM Implementation Checklist

Pre-Implementation:
  ✅ Institutional approval/policy review
  ✅ HIPAA compliance verification
  ✅ Malpractice insurance notification
  ✅ Privacy officer consultation
  ✅ Clear use case definition (documentation, education, etc.)

During Use:
  ✅ Never enter patient identifiers into public LLMs
  ✅ Always verify medical facts against authoritative sources
  ✅ Treat as a draft/assistant, never an autonomous decision-maker
  ✅ Document verification steps
  ✅ Maintain physician oversight

Post-Implementation:
  ✅ Monitor for errors and near-misses (see the logging sketch after this list)
  ✅ Collect user feedback
  ✅ Track outcomes (time savings, error rates, patient satisfaction)
  ✅ Update policies based on experience
  ✅ Stay current with evolving regulations
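
To make the post-implementation monitoring step concrete, the sketch below keeps a simple structured log of each LLM-assisted task, whether the output needed correction, and whether a near-miss occurred. The field names, file location, and example entry are illustrative assumptions for a small local pilot; a real program would feed your institution's quality and safety reporting systems instead.

```python
# Illustrative pilot log for LLM-assisted tasks, corrections, and near-misses.
import csv
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path("llm_pilot_log.csv")
FIELDS = ["timestamp", "use_case", "output_corrected", "near_miss", "notes"]

def log_llm_use(use_case: str, output_corrected: bool, near_miss: bool, notes: str = "") -> None:
    """Append one structured record per LLM-assisted task."""
    new_file = not LOG_PATH.exists()
    with LOG_PATH.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "use_case": use_case,
            "output_corrected": output_corrected,
            "near_miss": near_miss,
            "notes": notes,
        })

# Example: a drafted discharge summary carried forward a discontinued medication,
# caught and removed during physician review.
log_llm_use("discharge_summary_draft", output_corrected=True, near_miss=True,
            notes="Draft listed a discontinued medication; removed on review.")
```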

Teaching and Learning:

LLMs as educational tools:

Appropriate uses:
  • Explaining complex concepts (physiology, pharmacology)
  • Generating practice questions
  • Summarizing research for journal clubs
  • Creating teaching cases (with fact-checking)
  • Language learning (medical terminology)

⚠️ Cautions:
  • Students may rely on LLMs instead of learning
  • Hallucinations can teach incorrect information
  • No substitute for clinical experience
  • Reinforces the importance of verification

Training medical students/residents on LLM use:
  • When it is appropriate to use vs. avoid
  • How to prompt effectively
  • Recognizing hallucinations
  • Verification strategies
  • Ethical considerations

The Future of Medical LLMs:

Near-term (1-3 years):
  • EHR integration becomes standard
  • More medical-specific LLMs with better accuracy
  • Prospective validation studies
  • Regulatory frameworks clarify
  • Widespread documentation assistance

Medium-term (3-7 years):
  • Multimodal LLMs (text + images + genomics + EHR data)
  • Real-time clinical decision support
  • Personalized patient education at scale
  • Reduced hallucinations through better training
  • Better uncertainty quantification

Long-term (7+ years):
  • AI medical reasoning approaching expert level (with caveats)
  • Continuous learning from clinical practice
  • Seamless physician-AI collaboration
  • BUT: Human oversight will likely always be required for high-stakes decisions

Comparison: General vs. Medical-Specific LLMs:

Feature                  | GPT-4 (General)     | Med-PaLM 2 (Medical)
Medical accuracy         | Good                | Better
Availability             | Public API          | Enterprise only
HIPAA options            | Azure/API with BAA  | Google Cloud with BAA
Cost                     | API fees            | Enterprise licensing
Medical terminology      | Good                | Excellent
Citation quality         | Poor (fabricates)   | Poor (fabricates)
Hallucinations           | Frequent            | Somewhat reduced
Uncertainty expression   | Poor                | Better calibrated

The Clinical Bottom Line:

Tip: Key Takeaways for Physicians
  1. LLMs are powerful assistants, not autonomous doctors: Always maintain human oversight

  2. Hallucinations are the critical danger: Verify all medical facts, never trust blindly

  3. HIPAA compliance essential: Public ChatGPT is NOT appropriate for patient data

  4. Appropriate uses: Documentation drafts, literature synthesis, education materials (with review)

  5. Inappropriate uses: Autonomous diagnosis/treatment, urgent decisions, generating citations without checking

  6. Physician remains legally responsible: “AI told me to” is not a defense

  7. Transparency matters: Consider informing patients when an LLM assisted their care

  8. Evidence is evolving: Exam performance ≠ clinical utility; demand prospective validation

  9. Privacy first: De-identify or use HIPAA-compliant systems only

  10. The future is collaborative: Effective physician-LLM partnership, not replacement

  11. Start small, learn, monitor: Pilot low-risk applications, collect data, expand cautiously

  12. Stay informed: Field evolving rapidly, regulations emerging, best practices developing

Hands-On: Trying LLMs Safely:

Low-risk experimentation:
  1. Use for literature summaries (public papers, no patient data)
  2. Draft patient education materials (verify accuracy before sharing)
  3. Brainstorm differential diagnoses for teaching cases (fictional patients)
  4. Generate medical documentation templates (review thoroughly)

Tools to try:
  • ChatGPT (free tier) for general experimentation (NO PATIENT DATA)
  • Claude (Anthropic) for reasoning tasks
  • Perplexity AI (includes citations, though still verify)
  • Glass Health (medical-specific, free tier)

Learning resources:
  • OpenAI documentation on medical use cases
  • AMIA resources on AI in medicine
  • Institutional policies at your hospital
  • Medical informatics literature

Next Chapter: We’ll examine how to rigorously evaluate any AI system before clinical deployment—essential skills for the AI-augmented physician.


25.1 References