25 Large Language Models in Clinical Practice
Large Language Models (LLMs) like ChatGPT, GPT-4, and Med-PaLM represent a new paradigm in medical AI. This chapter provides evidence-based guidance for safe, effective clinical use. You will learn to:
- Understand how LLMs work and their fundamental capabilities/limitations
 - Identify appropriate vs. inappropriate clinical use cases
 - Recognize and mitigate hallucinations and errors
 - Navigate privacy, liability, and ethical considerations
 - Evaluate medical-specific LLMs (Med-PaLM, GPT-4 medical applications)
 - Implement LLMs safely in clinical workflows
 - Communicate with patients about LLM-assisted care
 
Essential for all physicians considering LLM use in practice.
The Clinical Context: Large Language Models (ChatGPT, GPT-4, Med-PaLM) have exploded into medical practice since late 2022. Unlike narrow diagnostic AI, LLMs are general-purpose systems that can write notes, answer questions, summarize literature, draft patient education, and assist clinical reasoning. They’re powerful but dangerous if used incorrectly—capable of generating confident, plausible, but completely false medical information (“hallucinations”).
What Are LLMs?
Technical: LLMs are neural networks trained on massive text corpora (internet text, books, journals) to predict the next word. They learn statistical patterns in language, not truth or medical knowledge per se.
Clinical analogy: Like a medical student who has read everything but has:
- No clinical experience
- No ability to examine patients
- No access to patient-specific data
- No accountability for errors
- Perfect memory but imperfect understanding
Key LLMs in Medicine (2024-2025):
GPT-4 (OpenAI):
- General-purpose LLM, not medical-specific
- Strong performance on USMLE-style questions (86%+, vs. a passing threshold of roughly 60%)
- Widely accessible (ChatGPT Plus, API)
- No HIPAA compliance for public ChatGPT

Med-PaLM 2 (Google):
- Medical-specific LLM
- 86.5% on MedQA (medical licensing questions)
- Better medical accuracy than GPT-4 on some benchmarks (Singhal et al. 2023)
- Not publicly available (research/enterprise only)

Claude (Anthropic):
- General LLM with strong reasoning
- Constitutional AI (safety-focused training)
- Healthcare-specific enterprise offerings

Commercial Medical LLMs:
- Glass Health (clinical decision support)
- Nabla Copilot (medical documentation)
- Various EHR vendors integrating LLMs
What LLMs Can Do Well:
✅ Literature synthesis: Summarizing research papers, guidelines
✅ Patient education materials: Generating health literacy-appropriate explanations
✅ Documentation assistance: Drafting clinical notes (with physician review)
✅ Differential diagnosis brainstorming: Generating possibilities for complex cases
✅ Medical coding: Suggesting ICD-10/CPT codes from clinical notes
✅ Language translation: Medical terminology across languages
✅ Teaching/explaining concepts: Medical education, CME content
What LLMs CANNOT Do (Critical Limitations):
❌ Access real-time patient data: No connection to EHR, labs, imaging
❌ Examine patients: No physical exam, vital signs, clinical gestalt
❌ Provide reliable citations: Often fabricates references
❌ Avoid hallucinations: Confidently generates plausible but false information
❌ Update in real-time: Training data has a cutoff date (knowledge gaps for newer information)
❌ Take responsibility: No medical license, no liability, no accountability
❌ Replace physician judgment: Context, nuance, and patient preferences require human input
The Hallucination Problem (CRITICAL):
Definition: LLMs generate confident, coherent, plausible but factually incorrect text
Medical examples:
- Fabricating drug dosages that look correct but are wrong
- Inventing medical “facts” that sound authoritative
- Creating fake citations to real journals (title and authors look real, but the paper doesn’t exist)
- Contradicting itself between responses
- Providing outdated treatment recommendations
Why hallucinations occur: LLMs predict plausible next words, not truth. No internal fact-checking, no database lookup, no uncertainty quantification.
Clinical danger: Physician trusts LLM output without verification → patient harm
Mitigation strategies:
- Always verify medical facts against authoritative sources
- Cross-check drug information with pharmacy databases
- Validate citations (LLMs commonly fabricate references)
- Use LLMs for drafts and ideas, never for final medical decisions
- Higher stakes = more verification required
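As one concrete example of the verification habit, the sketch below checks whether a DOI cited by an LLM actually resolves in the public Crossref index (api.crossref.org). The helper name and example DOI are illustrative assumptions, and a real DOI only proves the paper exists, not that it supports the claim the LLM attached to it.

```python
import requests  # third-party: pip install requests

def doi_exists(doi: str) -> bool:
    """Return True if the DOI resolves in the public Crossref index.

    A resolving DOI proves only that the paper exists -- it does not prove
    the paper says what the LLM claims. Read the source yourself.
    """
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

# Screen every DOI an LLM included in a draft before trusting it.
cited_dois = ["10.1000/fake-doi-from-llm"]  # illustrative placeholder
for doi in cited_dois:
    status = "found in Crossref" if doi_exists(doi) else "NOT FOUND - likely fabricated"
    print(f"{doi}: {status}")
```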
Appropriate Clinical Use Cases:
✅ SAFE - With Physician Oversight:
1. Documentation Assistance:
   - Draft progress notes from physician dictation
   - Generate discharge summaries (physician reviews/edits)
   - Suggest ICD-10/CPT codes
   - Workflow: LLM drafts → Physician reviews, edits, verifies → Signs note (a minimal code sketch of this workflow follows this list)
   - Risk mitigation: Physician remains responsible and reviews every detail
2. Literature Synthesis:
   - Summarize recent guidelines
   - Compare treatment options from multiple sources
   - Generate literature review drafts
   - Risk mitigation: Verify citations, cross-check facts, use as a starting point
3. Patient Education:
   - Draft explanations of diagnoses and procedures
   - Create health literacy-appropriate materials
   - Translate medical jargon into plain language
   - Risk mitigation: Physician reviews for accuracy before sharing with patients
4. Clinical Reasoning Support:
   - Generate differential diagnoses for complex cases
   - Suggest diagnostic workup considerations
   - Identify potential drug interactions
   - Risk mitigation: Treat as a brainstorming tool, verify all suggestions, physician makes final decisions
5. Medical Coding Assistance:
   - Suggest appropriate codes from clinical notes
   - Identify documentation gaps for coding
   - Risk mitigation: Compliance review; physician confirms codes match the care delivered
6. Administrative Tasks:
   - Draft prior authorization letters
   - Generate referral summaries
   - Create patient handouts
   - Risk mitigation: Review for accuracy and completeness
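To make the review gate concrete, here is a minimal sketch of the "LLM drafts → physician reviews → signs" workflow from use case 1, using the OpenAI Python client (v1+) as an illustrative backend. The model name, prompt wording, and sample dictation are assumptions for illustration only; in practice this would run only through a HIPAA-compliant, BAA-covered endpoint, and the physician edits and verifies every detail before anything is signed.

```python
from openai import OpenAI  # pip install openai; assumes an API key in OPENAI_API_KEY

client = OpenAI()  # for real patient data, use a BAA-covered endpoint instead of the public API

def draft_progress_note(deidentified_dictation: str) -> str:
    """Ask the model for a draft note. The output is a DRAFT only."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[
            {"role": "system",
             "content": ("Draft a SOAP-format progress note from the physician's dictation. "
                         "Mark anything you are unsure about with [VERIFY].")},
            {"role": "user", "content": deidentified_dictation},
        ],
    )
    return response.choices[0].message.content

draft = draft_progress_note("55yo with stable HTN, BP 132/84 today, continue lisinopril.")
print(draft)
# The draft is never filed automatically: the physician reviews, edits,
# verifies every fact (doses, labs, plans), and only then signs the note.
```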
❌ DANGEROUS - Do NOT Do:
1. Autonomous patient advice:
   - Patients asking LLMs medical questions without physician involvement
   - Risk: Hallucinations, outdated information, missing context
2. Medication dosing without verification:
   - LLMs can generate plausible but incorrect dosages
   - Risk: Overdose, underdose, missed contraindications
3. Urgent/emergent decisions:
   - Time-sensitive clinical decisions without verification
   - Risk: Delays or errors in critical care
4. Replacing specialist consultation:
   - Complex cases requiring expert judgment
   - Risk: Missing nuances and specialized knowledge
5. Generating citations without checking:
   - LLMs fabricate plausible-looking references
   - Risk: Academic dishonesty, spreading misinformation
6. Diagnosis without examination:
   - LLMs lack patient-specific data and physical exam findings
   - Risk: Misdiagnosis, missed critical findings
Privacy and HIPAA Considerations:
⚠️ CRITICAL: Public ChatGPT is NOT HIPAA-compliant
Public LLMs (ChatGPT, Claude, etc.):
- Data may be stored and used for training
- No Business Associate Agreement (BAA)
- NEVER enter patient identifiers (names, MRNs, DOB, SSN)
- De-identification required but risky (re-identification possible with detailed cases)

HIPAA-compliant alternatives:
- GPT-4 API with BAA (enterprise agreements)
- Azure OpenAI Service (healthcare tier)
- Google Cloud Vertex AI with BAA
- Vendor-specific medical LLMs with BAA

Safe practices:
- Use only HIPAA-compliant systems for patient data
- De-identify cases thoroughly
- Institutional approval required
- Document consent where appropriate
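The sketch below illustrates a lightweight pre-submission screen that flags obvious identifier formats (dates, phone numbers, SSNs, MRN-like numbers) before any text is pasted into an LLM. It is an illustration only: pattern matching misses names and free-text identifiers, is not HIPAA Safe Harbor de-identification, and does not replace institutional de-identification tooling or policy.

```python
import re

# Patterns for a few obvious identifier formats (illustrative, not exhaustive):
IDENTIFIER_PATTERNS = {
    "date": r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
    "phone": r"\b\d{3}[-.]\d{3}[-.]\d{4}\b",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "long_id_number": r"\b\d{7,}\b",  # MRN-like numeric identifiers
}

def flag_identifiers(text: str) -> dict[str, list[str]]:
    """Return any obvious identifier-like strings found in the text."""
    return {label: re.findall(pattern, text)
            for label, pattern in IDENTIFIER_PATTERNS.items()
            if re.search(pattern, text)}

case = "Seen on 03/14/2025, MRN 00123456, call 555-867-5309 with results."
hits = flag_identifiers(case)
if hits:
    print("Do NOT paste into an LLM until these are removed:", hits)
```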
Evidence Base for Medical LLMs:
Performance on Medical Licensing Exams:
GPT-4:
- USMLE Step 1: 86%+ (passing threshold ~60%)
- USMLE Step 2 CK: 86%+
- USMLE Step 3: 86%+
- Caveat: Multiple choice ≠ clinical practice

Med-PaLM 2:
- MedQA (USMLE-style): 86.5%
- Outperforms physicians on some benchmarks
- Better calibration (knows when it is uncertain) than GPT-4 (Singhal et al. 2023)
Clinical Reasoning Tasks:
Mixed results:
- Good at pattern matching and recall
- Struggles with complex multi-step reasoning
- Lacks clinical judgment and gestalt
- Overconfident (doesn’t express uncertainty well)
Prospective Clinical Validation:
Limited data:
- Most studies: retrospective chart review, simulated cases
- Few prospective real-world clinical deployments
- No RCTs showing improved patient outcomes
- Evidence gap: performance on exams ≠ clinical utility
Documentation Assistance:
Promising early evidence:
- High physician satisfaction
- Time savings of 30-50%
- Quality concerns remain (accuracy, completeness)
- Ongoing studies
Medical-Specific LLM Enhancements:
Med-PaLM (Google):
- Fine-tuned on medical text
- Better medical terminology understanding
- Improved accuracy on medical questions
- Status: Research/enterprise only

Clinical BERT/BioBERT:
- Specialized for biomedical text understanding
- Used for information extraction from notes
- Not general conversational AI

Vendor Implementations:

Glass Health:
- LLM-powered clinical decision support
- Generates differential diagnoses and treatment plans
- Physician review required
- Evidence: User satisfaction; limited clinical validation

Nabla Copilot:
- Medical documentation assistant
- Ambient listening + LLM note generation
- Evidence: Time savings, user satisfaction

Epic LLM integration:
- Message drafting, note summarization
- Rolling out to health systems
- Evidence: Early deployment; validation ongoing
Prompt Engineering for Medical Use:
Effective prompting improves output quality:
✅ Good prompts:
- Specific, detailed clinical scenarios
- Request sourcing (“cite guidelines”)
- Ask for a differential diagnosis, not a definitive diagnosis
- Request uncertainty (“what are you uncertain about?”)
Example:
"Generate a differential diagnosis for a 45-year-old man
with acute chest pain, considering both cardiac and
non-cardiac causes. Include likelihood and key
differentiating features for each."
❌ Poor prompts:
- Vague (“tell me about chest pain”)
- Requesting a definitive diagnosis without full information
- No request for reasoning or uncertainty
- Treating the LLM as an oracle rather than an assistant

Iterative refinement:
- Follow-up questions clarify and narrow the focus
- Request explanations for suggestions
- Ask the LLM to critique its own reasoning
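The good-prompt principles above can be baked into a reusable template. The sketch below is one hedged example: the system message forces a ranked differential (never a single definitive diagnosis), explicit uncertainty, and guideline references flagged for physician verification. The wording and structure are illustrative assumptions, not a validated prompt.

```python
# Reusable message template following the prompting principles above.
SYSTEM_PROMPT = (
    "You are a clinical reasoning assistant supporting a licensed physician.\n"
    "For every query:\n"
    "1. Give a ranked differential diagnosis, never a single definitive diagnosis.\n"
    "2. List key differentiating features and suggested workup for each item.\n"
    "3. State explicitly what you are uncertain about and what information is missing.\n"
    "4. If you reference a guideline, name it and mark it [VERIFY] so the physician "
    "checks the original source."
)

def build_messages(clinical_question: str) -> list[dict]:
    """Package a clinical question into a chat-style message list."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": clinical_question},
    ]

messages = build_messages(
    "45-year-old man with acute chest pain; consider cardiac and non-cardiac causes."
)
# `messages` can be passed to any chat-style LLM API; a useful follow-up turn for
# iterative refinement is "Critique your own differential: what did you likely miss?"
```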
Limitations and Failure Modes:
1. Knowledge Cutoff:
   - Training data ends at a specific date
   - New drugs, guidelines, and treatments are unknown to the model
   - Example: an LLM unaware of 2024 guidelines published after its training
2. Reasoning Failures:
   - Appears logical but reaches wrong conclusions
   - Misapplies guidelines to specific cases
   - Confuses similar conditions
3. Statistical Bias:
   - Reflects biases in training data
   - May perpetuate healthcare disparities
   - Underrepresentation of rare diseases and diverse populations
4. Context Window Limits:
   - Can only “remember” a limited amount of recent conversation
   - Loses context in long exchanges
   - May contradict earlier statements
   - (A rough token-budget sketch follows this list)
5. Inability to Say “I Don’t Know”:
   - Tends to generate a plausible answer even when uncertain
   - Rarely expresses appropriate uncertainty
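For limitation 4, a rough pre-flight check can estimate whether a long transcript will even fit in a model’s context window before it is sent. The ~4 characters-per-token figure is a coarse rule of thumb for English prose and the 128,000-token limit is an assumed example; exact counts require the provider’s own tokenizer.

```python
ASSUMED_CONTEXT_WINDOW_TOKENS = 128_000  # example limit; check your model's documentation
CHARS_PER_TOKEN_ESTIMATE = 4             # rough rule of thumb for English prose

def estimate_tokens(text: str) -> int:
    """Coarse token estimate; use the provider's tokenizer for exact counts."""
    return len(text) // CHARS_PER_TOKEN_ESTIMATE

def fits_in_context(text: str, reserve_for_reply: int = 4_000) -> bool:
    """Leave headroom for the model's reply; past the limit, earlier content is
    truncated or 'forgotten', which is when contradictions with earlier statements appear."""
    return estimate_tokens(text) + reserve_for_reply <= ASSUMED_CONTEXT_WINDOW_TOKENS

long_transcript = "Patient interview transcript... " * 20_000
print(estimate_tokens(long_transcript), fits_in_context(long_transcript))
```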
Medical Liability Considerations:
Current Legal Landscape (evolving):
Physician remains responsible:
- The LLM is a tool, not a practitioner
- The physician is liable for all clinical decisions
- “AI told me to” is not a defense

Standard of care questions:
- Is a physician negligent for NOT using an LLM if one is available?
- Is a physician negligent for USING an LLM incorrectly?
- Currently unclear, with state-by-state variation

Documentation requirements:
- Document LLM use where material to decisions
- Document verification of LLM outputs
- Informed consent for LLM-assisted care (emerging practice)

Malpractice insurance:
- Check coverage for AI-assisted care
- Some policies may exclude or limit AI-assisted care
- Notify your insurer of AI tool use

Risk mitigation strategies:
- Use only validated, HIPAA-compliant systems
- Always verify LLM outputs
- Maintain human oversight and final decision-making
- Document verification steps
- Obtain appropriate consents
- Stay informed on evolving regulations
Ethical Considerations:
Transparency:
- Should patients be told when LLMs assisted their care?
- Emerging consensus: yes, transparency builds trust
- Analogy: disclosing the use of other assistive technologies

Equity:
- LLM performance may vary by demographics
- Training data biases → biased outputs
- Access disparities (who can afford LLM tools?)

Autonomy:
- Patients should have the option to decline LLM-assisted care
- Respect patient preferences

Quality:
- Weigh benefit (efficiency) against risk (errors)
- When do benefits outweigh risks?
- Continuous monitoring is essential

Professional integrity:
- Is LLM use consistent with professionalism?
- Does it enhance or diminish the physician-patient relationship?
Practical Implementation Guide:
Pre-Implementation:
- ✅ Institutional approval/policy review
- ✅ HIPAA compliance verification
- ✅ Malpractice insurance notification
- ✅ Privacy officer consultation
- ✅ Clear use case definition (documentation, education, etc.)

During Use:
- ✅ Never enter patient identifiers into public LLMs
- ✅ Always verify medical facts against authoritative sources
- ✅ Treat the LLM as a draft/assistant, never an autonomous decision-maker
- ✅ Document verification steps (a logging sketch follows below)
- ✅ Maintain physician oversight

Post-Implementation:
- ✅ Monitor for errors and near-misses
- ✅ Collect user feedback
- ✅ Track outcomes (time savings, error rates, patient satisfaction)
- ✅ Update policies based on experience
- ✅ Stay current with evolving regulations
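One way to make “document verification steps” routine is a structured audit entry recorded every time an LLM draft is verified. The sketch below is illustrative only: the field names and local JSON-lines file are assumptions, and real audit trails belong in institution-approved systems rather than loose files.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class LLMVerificationRecord:
    """One audit entry documenting human verification of an LLM output."""
    use_case: str            # e.g., "discharge summary draft"
    llm_tool: str            # which system produced the draft
    facts_checked: list[str]
    sources_used: list[str]
    reviewer: str            # physician who verified and takes responsibility
    timestamp: str = ""

    def __post_init__(self):
        if not self.timestamp:
            self.timestamp = datetime.now(timezone.utc).isoformat()

def log_verification(record: LLMVerificationRecord, path: str = "llm_audit_log.jsonl") -> None:
    """Append the record as one JSON line (illustrative local file)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")

log_verification(LLMVerificationRecord(
    use_case="discharge summary draft",
    llm_tool="EHR-integrated LLM assistant",
    facts_checked=["medication doses", "follow-up plan"],
    sources_used=["pharmacy database", "original chart"],
    reviewer="Dr. Example",
))
```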
Teaching and Learning:
LLMs as educational tools:
✅ Appropriate uses:
- Explaining complex concepts (physiology, pharmacology)
- Generating practice questions
- Summarizing research for journal clubs
- Creating teaching cases (with fact-checking)
- Language learning (medical terminology)

⚠️ Cautions:
- Students may rely on LLMs instead of learning
- Hallucinations can teach incorrect information
- No substitute for clinical experience
- Reinforces the importance of verification

Training medical students/residents on LLM use:
- When it is appropriate to use vs. avoid
- How to prompt effectively
- Recognizing hallucinations
- Verification strategies
- Ethical considerations
The Future of Medical LLMs:
Near-term (1-3 years):
- EHR integration becomes standard
- More medical-specific LLMs with better accuracy
- Prospective validation studies
- Regulatory frameworks clarify
- Widespread documentation assistance

Medium-term (3-7 years):
- Multimodal LLMs (text + images + genomics + EHR data)
- Real-time clinical decision support
- Personalized patient education at scale
- Reduced hallucinations through better training
- Better uncertainty quantification

Long-term (7+ years):
- AI medical reasoning approaching expert level (with caveats)
- Continuous learning from clinical practice
- Seamless physician-AI collaboration
- BUT: human oversight likely always required for high-stakes decisions
Comparison: General vs. Medical-Specific LLMs:
| Feature | GPT-4 (General) | Med-PaLM 2 (Medical) | 
|---|---|---|
| Medical accuracy | Good | Better | 
| Availability | Public API | Enterprise only | 
| HIPAA options | Azure/API with BAA | Google Cloud with BAA | 
| Cost | API fees | Enterprise licensing | 
| Medical terminology | Good | Excellent | 
| Citation quality | Poor (fabricates) | Poor (fabricates) | 
| Hallucinations | Frequent | Somewhat reduced | 
| Uncertainty expression | Poor | Better calibrated | 
The Clinical Bottom Line:
LLMs are powerful assistants, not autonomous doctors: Always maintain human oversight
Hallucinations are the critical danger: Verify all medical facts, never trust blindly
HIPAA compliance essential: Public ChatGPT is NOT appropriate for patient data
Appropriate uses: Documentation drafts, literature synthesis, education materials (with review)
Inappropriate uses: Autonomous diagnosis/treatment, urgent decisions, generating citations without checking
Physician remains legally responsible: “AI told me to” is not a defense
Transparency matters: Consider informing patients when an LLM assisted their care
Evidence is evolving: Exam performance ≠ clinical utility; demand prospective validation
Privacy first: De-identify or use HIPAA-compliant systems only
The future is collaborative: Effective physician-LLM partnership, not replacement
Start small, learn, monitor: Pilot low-risk applications, collect data, expand cautiously
Stay informed: Field evolving rapidly, regulations emerging, best practices developing
Hands-On: Trying LLMs Safely:
Low-risk experimentation:
1. Use for literature summaries (public papers, no patient data)
2. Draft patient education materials (verify accuracy before sharing)
3. Brainstorm differential diagnoses for teaching cases (fictional patients)
4. Generate medical documentation templates (review thoroughly)

Tools to try:
- ChatGPT (free tier) for general experimentation (NO PATIENT DATA)
- Claude (Anthropic) for reasoning tasks
- Perplexity AI (includes citations, though still verify)
- Glass Health (medical-specific, free tier)

Learning resources:
- OpenAI documentation on medical use cases
- AMIA resources on AI in medicine
- Institutional policies at your hospital
- Medical informatics literature
Next Chapter: We’ll examine how to rigorously evaluate any AI system before clinical deployment—essential skills for the AI-augmented physician.