Large Language Models in Clinical Practice
Large Language Models (LLMs) like ChatGPT, GPT-4, Claude, and Med-PaLM represent a fundamentally different paradigm from narrow diagnostic AI. Unlike algorithms trained for single tasks (detect melanoma, predict sepsis), LLMs are general-purpose language systems that can write notes, answer questions, synthesize literature, draft patient education, and assist clinical reasoning. They’re extraordinarily powerful but also uniquely dangerous, capable of generating confident, plausible, but completely false medical information (“hallucinations”). This chapter provides evidence-based guidance for safe, effective clinical use.
After reading this chapter, you will be able to:
- Understand how LLMs work and their fundamental capabilities and limitations in medical contexts
- Identify appropriate vs. inappropriate clinical use cases based on risk-benefit assessment
- Recognize and mitigate hallucinations, citation fabrication, and knowledge cutoff problems
- Navigate privacy (HIPAA), liability, and ethical considerations specific to LLM use in medicine
- Evaluate medical-specific LLMs (Med-PaLM, GPT-4 medical applications) vs. general-purpose models
- Implement LLMs safely in clinical workflows with proper oversight and verification protocols
- Communicate transparently with patients about LLM-assisted care
- Apply vendor evaluation frameworks before adopting LLM tools for clinical practice
Introduction: A Paradigm Shift in Medical AI
Every previous chapter in this handbook examines narrow AI: algorithms trained for single, specific tasks.
- Radiology AI detects pneumonia on chest X-rays (and nothing else)
- Pathology AI grades prostate cancer histology (and nothing else)
- Cardiology AI interprets ECGs for arrhythmias (and nothing else)
Large Language Models are fundamentally different: general-purpose systems that perform diverse tasks through natural language interaction.
Ask GPT-4 to summarize a medical guideline and it does. Ask it to draft a patient education handout and it does. Ask it to generate a differential diagnosis for chest pain and it does. No task-specific retraining required.
This versatility is unprecedented in medical AI. It’s also what makes LLMs uniquely dangerous.
A narrow diagnostic AI fails in predictable ways:
- Pneumonia detection AI applied to chest X-ray might miss a pneumonia (false negative) or flag normal lungs as abnormal (false positive)
- Failure modes are bounded by the task
LLMs fail in unbounded ways:
- Fabricate drug dosages that look correct but cause overdoses
- Invent medical “facts” that sound authoritative but are false
- Generate fake citations to real journals (paper doesn’t exist)
- Provide confident answers to questions where uncertainty is appropriate
- Contradict themselves across responses
- Recommend treatments that were standard of care in training data but have been superseded
The clinical analogy: LLMs are like exceptionally well-read medical students who have:
- Perfect recall of everything they’ve studied
- No clinical experience
- No ability to examine patients or access patient-specific data
- No accountability for errors
- A tendency to confidently bullshit when they don’t know the answer
This chapter teaches you to harness LLM capabilities while protecting patients from LLM failures.
Part 1: How LLMs Work (What Physicians Need to Know)
The Technical Basics (Simplified)
Training:
1. Ingest massive text corpora (internet, books, journals, Wikipedia, Reddit, medical textbooks, PubMed abstracts)
2. Learn statistical patterns: “Given these words, what word typically comes next?”
3. Scale to billions of parameters (weights connecting neural network nodes)
4. Fine-tune with human feedback (reinforcement learning from human preferences)
Inference (when you use it):
1. You provide a prompt (“Generate a differential diagnosis for acute chest pain in a 45-year-old man”)
2. LLM predicts the most likely next word based on learned patterns
3. Continues predicting words one-by-one until a stopping criterion is met
4. Returns generated text
Crucially:
- LLMs don’t “look up” facts in a database
- They don’t “reason” in the logical sense
- They predict plausible text based on statistical patterns
- Truth and plausibility are not the same thing
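To make the point concrete, here is a deliberately toy sketch in Python of what “predict the next plausible word” means. The probability table is invented for illustration and bears no relation to any real model; the point is that nothing in the generation loop ever consults a source of truth.

```python
# Toy illustration (NOT a real LLM): generation is sampling from a learned
# distribution over plausible continuations, not a database lookup.
# The vocabulary and probabilities below are invented for demonstration only.
import random

next_token_probs = {
    ("amoxicillin", "dosing", "is"): {"25": 0.4, "40": 0.3, "80": 0.2, "500": 0.1},
}

def sample_next(context, temperature=1.0):
    probs = next_token_probs[context]
    # Temperature reshapes the distribution; it never checks whether a token is true.
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return random.choices(list(probs.keys()), weights=weights, k=1)[0]

# Two identical "prompts" can yield different, equally fluent continuations.
print(sample_next(("amoxicillin", "dosing", "is")))
print(sample_next(("amoxicillin", "dosing", "is")))
```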
Why Hallucinations Happen
Definition: LLM generates confident, coherent, plausible but factually incorrect text.
Mechanism: The training objective is “predict next plausible word,” not “retrieve correct fact.” When uncertain, LLMs default to generating text that sounds correct rather than admitting uncertainty or refusing to answer.
Medical examples documented in literature:
- Fabricated drug dosages:
- Prompt: “What is the pediatric dosing for amoxicillin?”
- GPT-3.5 response: “20-40 mg/kg/day divided every 8 hours” (incorrect for many indications; standard is 25-50 mg/kg/day, some indications 80-90 mg/kg/day)
- Invented medical facts:
- Prompt: “What are the contraindications to beta-blockers in heart failure?”
- LLM includes “NYHA Class II heart failure” (false; beta-blockers are indicated, not contraindicated, in Class II HF)
- Fake citations:
- Prompt: “Cite studies showing benefit of IV acetaminophen for postoperative pain”
- GPT-4 generates: “Smith et al. (2019) in JAMA Surgery found 40% reduction in opioid use” (paper doesn’t exist; authors, journal, year all fabricated but plausible)
- Outdated recommendations:
- All LLMs have training data cutoffs (check the specific model’s documentation)
- May recommend drugs withdrawn from market after training
- Unaware of updated guidelines published post-training
Why this matters clinically: A physician who trusts LLM output without verification risks:
- Incorrect medication dosing → patient harm
- Reliance on outdated treatment → suboptimal care
- Academic dishonesty from fabricated citations → career consequences
Mitigation strategies:
- Always verify drug information against pharmacy databases (Lexicomp, Micromedex, UpToDate)
- Cross-check medical facts with authoritative sources (guidelines, textbooks, PubMed)
- Never trust LLM citations without looking up the actual papers
- Use LLMs for drafts and idea generation, never final medical decisions
- Higher stakes = more verification required
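One way to operationalize the first mitigation is a simple sanity check that compares an LLM-suggested weight-based dose against a locally maintained reference range. The sketch below is illustrative only: the drug key, range, and function are placeholders, and a real check belongs in the pharmacy system or formulary, not a script.

```python
# Hypothetical sketch: flag LLM-suggested weight-based doses that fall outside
# a reference range. The values below are placeholders; a real implementation
# would pull from the institutional formulary or a pharmacy database.

REFERENCE_MG_PER_KG_PER_DAY = {
    "amoxicillin (acute otitis media, high-dose)": (80, 90),  # placeholder range
}

def check_dose(drug: str, suggested_mg_per_kg_per_day: float) -> str:
    low, high = REFERENCE_MG_PER_KG_PER_DAY[drug]
    if low <= suggested_mg_per_kg_per_day <= high:
        return f"{drug}: {suggested_mg_per_kg_per_day} mg/kg/day is within the reference range"
    return (f"{drug}: {suggested_mg_per_kg_per_day} mg/kg/day is OUTSIDE the reference range "
            f"({low}-{high} mg/kg/day); verify with pharmacy before prescribing")

print(check_dose("amoxicillin (acute otitis media, high-dose)", 45))  # flags a low suggestion
```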
Part 2: Major Failure Case Studies: Hallucination Disasters
Case 1: The Fabricated Oncology Protocol
Scenario (reported 2023): Physician asked GPT-4 for dosing protocol for pediatric acute lymphoblastic leukemia (ALL) consolidation therapy.
LLM response: Generated detailed protocol with drug names, dosages, timing that looked professionally formatted and authoritative.
The problem:
- Methotrexate dose: 50 mg/m² (LLM suggested) vs. actual protocol: 5 g/m² (100x difference)
- Vincristine timing: Weekly (LLM) vs. protocol: Every 3 weeks during consolidation
- Dexamethasone duration: 5 days (LLM) vs. protocol: 28 days
If followed without verification: Patient would have received 1% of intended methotrexate dose (treatment failure, disease progression) and excessive vincristine (neurotoxicity risk).
Why it happened: LLM trained on general medical text, not specialized oncology protocols. Generated plausible-sounding but incorrect regimen by combining fragments from different contexts.
The lesson: Never use LLMs for medication dosing without rigorous verification against authoritative sources (protocol handbooks, institutional guidelines, pharmacy consultation).
Case 2: The Confident Misdiagnosis
Scenario (published case study): Emergency physician used GPT-4 to generate differential diagnosis for “32-year-old woman with sudden-onset severe headache, photophobia, neck stiffness.”
LLM differential:
1. Migraine (most likely)
2. Tension headache
3. Sinusitis
4. Meningitis (mentioned fourth)
5. Subarachnoid hemorrhage (mentioned fifth)
The actual diagnosis: Subarachnoid hemorrhage (SAH) from ruptured aneurysm.
The problem: LLM ranked benign diagnoses (migraine, tension headache) above life-threatening emergencies (SAH, meningitis) despite classic “thunderclap headache + meningeal signs” presentation.
Why it happened:
- Training data bias: Migraine is far more common than SAH in text corpora
- LLMs predict based on frequency in training data, not clinical risk stratification
- No understanding of the “rule out worst-case-first” emergency medicine principle
The lesson: LLMs don’t triage by clinical urgency or risk. Physician must apply clinical judgment to LLM suggestions.
What the physician did right: Used LLM as brainstorming tool, not autonomous diagnosis. Recognized high-risk presentation and ordered CT + LP appropriately.
Case 3: The Citation Fabrication Scandal
Scenario: Medical student submitted literature review using GPT-4 to generate citations supporting statements about hypertension management.
LLM-generated citations (examples):
1. “Johnson et al. (2020). ‘Intensive blood pressure control in elderly patients.’ New England Journal of Medicine 383:1825-1835.”
2. “Patel et al. (2019). ‘Renal outcomes with SGLT2 inhibitors in diabetic hypertension.’ Lancet 394:1119-1128.”
The problem: Neither paper exists. Authors, journals, years, page numbers all plausible but fabricated.
Discovery: Faculty advisor attempted to retrieve papers for detailed review. None found in PubMed, journal archives, or citation databases.
Consequences:
- Student received failing grade for academic dishonesty
- Faculty implemented “verify all LLM-generated citations” policy
- Medical school updated honor code to address AI-assisted writing
Why this matters:
- Citation fabrication in grant applications = federal research misconduct
- In publications = retraction, career damage
- In clinical guidelines = propagation of misinformation
The lesson: Never trust LLM-generated citations. Always verify papers exist and actually support the claims attributed to them.
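A quick programmatic check can catch many fabrications before a human reviewer does. The sketch below (Python, calling the public NCBI E-utilities esearch endpoint) searches PubMed for an exact title match; zero hits is a strong signal the reference is invented, while a nonzero count still requires reading the paper to confirm it supports the claim attributed to it.

```python
# Sketch: does an LLM-supplied citation title return any PubMed hits?
# Uses the public NCBI E-utilities esearch endpoint; requires network access.
import json
import urllib.parse
import urllib.request

def pubmed_title_hits(title: str) -> int:
    """Return the number of PubMed records whose title matches the quoted phrase."""
    params = urllib.parse.urlencode({
        "db": "pubmed",
        "term": f'"{title}"[Title]',
        "retmode": "json",
    })
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?{params}"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return int(data["esearchresult"]["count"])

# Example: check one of the LLM-supplied titles from the case above.
print(pubmed_title_hits("Renal outcomes with SGLT2 inhibitors in diabetic hypertension"))
```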
Part 3: The Success Story: Ambient Clinical Documentation
Nuance DAX: The Ambient AI Scribe
The problem DAX solves: Physicians spend 2+ hours per day on documentation, often completing notes after-hours. EHR documentation contributes significantly to burnout.
How DAX works:
1. Physician wears microphone during patient encounter
2. DAX records conversation (with patient consent)
3. LLM transcribes speech → converts to structured clinical note
4. Note appears in EHR for physician review/editing
5. Physician reviews, makes corrections, signs note
Evidence base:
Regulatory status: Not FDA-regulated. Falls under CDS (Clinical Decision Support) exemption per 21st Century Cures Act because it generates documentation drafts, not diagnoses or treatment recommendations, and physicians independently review all output.
Clinical validation: Nuance-sponsored study (2023), 150 physicians, 5,000+ patient encounters:
- Documentation time reduction: 50% (mean 5.5 min → 2.7 min per encounter)
- Physician satisfaction: 77% would recommend to colleagues
- Note quality: No significant difference from physician-written notes (blinded expert review)
- Error rate: 0.3% factual errors requiring correction (similar to baseline physician error rate in dictation)
Real-world deployment:
- 550+ health systems
- 35,000+ clinicians using DAX
- 85% user retention after 12 months
Cost-benefit:
- DAX subscription: ~$369-600/month per physician (varies by contract; $700 one-time implementation fee)
- Time savings: 1 hour/day × $200/hour physician cost = $4,000/month saved
- ROI: Positive in 1-3 months depending on encounter volume
Pricing source: DAX Copilot pricing page, December 2024. Costs vary by volume and contract terms.
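A quick worked version of that back-of-envelope arithmetic. The subscription, hours saved, physician cost, and working days are the chapter's illustrative figures, not contract quotes; substitute your own numbers.

```python
# Illustrative ROI calculation using the figures quoted above (assumptions, not quotes).
subscription_per_month = 500.0      # assumed, within the ~$369-600/month range above
implementation_fee = 700.0          # one-time, per the pricing note above
hours_saved_per_day = 1.0
physician_cost_per_hour = 200.0
working_days_per_month = 20

monthly_savings = hours_saved_per_day * physician_cost_per_hour * working_days_per_month
net_monthly_benefit = monthly_savings - subscription_per_month
months_to_recoup_setup = implementation_fee / net_monthly_benefit

print(f"Monthly time-cost savings: ${monthly_savings:,.0f}")      # $4,000/month
print(f"Net monthly benefit:       ${net_monthly_benefit:,.0f}")
print(f"Months to recoup setup:    {months_to_recoup_setup:.1f}")
```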
Why this works:
- Well-defined task (transcription + note structuring)
- Physician review catches errors before note finalization
- Integration with EHR workflow
- Patient consent obtained upfront
- HIPAA-compliant (BAA with healthcare systems)
Limitations:
- Requires patient consent (some decline)
- Poor audio quality → transcription errors
- Complex cases with multiple topics may require substantial editing
- Subscription cost barrier for small practices
Abridge: AI-Powered Medical Conversations
Similar ambient documentation tool with comparable performance:
- 65% documentation time reduction in pilot studies
- Focuses on primary care and specialty clinics
- Generates patient-facing visit summaries automatically
The lesson: When LLMs are used for well-defined tasks with physician oversight and proper integration, they deliver genuine value.
Part 4: Appropriate vs. Inappropriate Clinical Use Cases
SAFE Uses (With Physician Oversight)
1. Clinical Documentation Assistance
Use cases:
- Draft progress notes from dictation
- Generate discharge summaries
- Suggest ICD-10/CPT codes
- Create procedure notes
Workflow:
1. Physician provides input (dictation, conversation recording, bullet points)
2. LLM generates structured note
3. Physician reviews every detail, edits errors, adds clinical judgment
4. Physician signs final note
The dangerous reality: As AI accuracy improves, human vigilance drops. Studies show physicians begin “rubber-stamping” AI-generated content after approximately 3 months of successful use (Goddard et al., 2012).
The pattern:
- Month 1: Physician carefully reviews every word, catches errors
- Month 3: Physician skims notes, catches obvious errors
- Month 6: Physician clicks “Sign” with minimal review, trusts the AI
- Month 12: Errors slip through; patient harm possible
Counter-measures to maintain vigilance:
- Spot-check protocol: Verify at least one specific data point per note (e.g., check one lab value, one medication dose, one vital sign against the record)
- Rotation strategy: Vary which section you scrutinize each encounter
- Red flag awareness: Know the AI’s failure modes (medication names, dosing, dates, rare conditions)
- Scheduled deep review: Once weekly, do a line-by-line audit of a randomly selected AI-generated note
- Error tracking: Log every error you catch; if catches drop to zero, you may have stopped looking
The uncomfortable truth: “Physician in the loop” only works if the physician is actually paying attention. The AI doesn’t get tired; you do.
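A minimal sketch of the error-tracking counter-measure listed above: an append-only log of every error caught during note review. The file name and fields are assumptions for illustration, not part of any product; the practical point is that a catch rate that drops to zero is itself a warning sign that review has stopped.

```python
# Sketch of an error-catch log for AI-assisted notes. Fields and file name are
# illustrative assumptions; adapt to institutional policy and systems.
import csv
import datetime
import pathlib

LOG_PATH = pathlib.Path("ai_note_error_log.csv")

def log_catch(note_id: str, error_type: str, description: str) -> None:
    """Append one caught error to the review log, creating the header if needed."""
    new_file = not LOG_PATH.exists()
    with LOG_PATH.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["timestamp", "note_id", "error_type", "description"])
        writer.writerow([datetime.datetime.now().isoformat(), note_id, error_type, description])

# Example: a wrong medication dose caught during a spot-check review.
log_catch("note-0042", "medication_dose", "Draft listed 300 mg BID; chart order is 600 mg BID")
```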
Risk mitigation:
- Physician remains legally responsible for note content
- Review catches hallucinations, errors, omissions
- HIPAA-compliant systems only
Evidence: 50% time savings documented in multiple studies (see DAX above)
2. Literature Synthesis and Summarization
Use cases:
- Summarize clinical guidelines
- Compare treatment options from multiple sources
- Generate literature review outlines
- Identify relevant studies for research questions
Workflow:
1. Provide LLM with specific question and context
2. Request summary with citations
3. Verify all citations exist and support claims
4. Cross-check medical facts against primary sources
Example prompt:
"Summarize the 2023 AHA/ACC guidelines for management
of atrial fibrillation, focusing on anticoagulation
recommendations for patients with CHA2DS2-VASc ≥2.
Include specific drug dosing and monitoring requirements.
Cite specific guideline sections."
Risk mitigation:
- Verify citations before relying on summary
- Cross-check facts with original guidelines
- Use as starting point, not final analysis
3. Patient Education Materials
Use cases:
- Explain diagnoses in health literacy-appropriate language
- Create discharge instructions
- Draft procedure consent explanations
- Translate medical jargon to plain language
Workflow:
1. Specify reading level, key concepts, patient concerns
2. LLM generates draft
3. Physician reviews for medical accuracy
4. Edits for cultural sensitivity, individual patient factors
5. Shares with patient
Example prompt:
"Create a patient handout about type 2 diabetes management
for a patient with 6th grade reading level. Cover: medication
adherence, blood sugar monitoring, dietary changes, exercise.
Use simple language, avoid jargon, 1-page limit."
Risk mitigation:
- Fact-check all medical information
- Customize to individual patient (LLM generates generic content)
- Consider health literacy, cultural factors
4. Differential Diagnosis Brainstorming
Use cases:
- Generate possibilities for complex cases
- Identify rare diagnoses to consider
- Broaden differential when stuck
Workflow:
1. Provide detailed clinical vignette
2. Request differential with reasoning
3. Treat as idea generation, not diagnosis
4. Pursue appropriate diagnostic workup based on clinical judgment
Example prompt:
"Generate differential diagnosis for 45-year-old woman
with 3 months of progressive dyspnea, dry cough, and
fatigue. Exam: fine bibasilar crackles, no wheezing.
CXR: reticular infiltrates. Consider both common and
rare etiologies. Provide likelihood and key diagnostic
tests for each."
Risk mitigation:
- LLM differential is brainstorming, not diagnosis
- Verify each possibility is clinically plausible for the patient
- Pursue workup based on pretest probability, not LLM ranking
5. Medical Coding Assistance
Use cases:
- Suggest ICD-10/CPT codes from clinical notes
- Identify documentation gaps for proper coding
- Check code appropriateness
Workflow:
1. LLM analyzes clinical note
2. Suggests codes with reasoning
3. Coding specialist or physician reviews
4. Confirms codes match care delivered and documentation
Risk mitigation:
- Compliance review essential (fraudulent coding = federal offense)
- Physician confirms codes represent actual care
- Regular audits of LLM-suggested codes
DANGEROUS Uses (Do NOT Do)
1. Autonomous Patient Advice
Why dangerous:
- Patients ask LLMs medical questions without physician involvement
- LLMs provide confident answers regardless of accuracy
- Patients may delay appropriate care based on false reassurance
Documented harms:
- Patient with chest pain asked ChatGPT “Is this heartburn or heart attack?”
- ChatGPT suggested antacids (without seeing patient, knowing history, performing exam)
- Patient delayed ER visit 6 hours, presented with STEMI
The lesson: Patients will use LLMs for medical advice regardless of physician recommendations. Educate patients about limitations, encourage them to contact you rather than rely on AI.
2. Medication Dosing Without Verification
Why dangerous:
- LLMs fabricate plausible but incorrect dosages
- Pediatric dosing especially error-prone
- Drug interaction checking unreliable
Documented near-miss:
- Physician asked GPT-4 for vancomycin dosing in renal failure
- LLM suggested dose appropriate for normal renal function
- Pharmacist caught error before administration
The lesson: Never use LLM-generated medication dosing without verification against pharmacy databases, dose calculators, or pharmacist consultation.
3. Urgent or Emergent Clinical Decisions
Why dangerous:
- Time pressure precludes adequate verification
- High stakes magnify consequence of errors
- Clinical judgment + experience > LLM statistical patterns
The lesson: In emergencies, rely on clinical protocols, expert consultation, established guidelines, not LLM brainstorming.
4. Generating Citations Without Verification
Why dangerous:
- LLMs fabricate 15-30% of medical citations
- Using fake references = academic dishonesty, research misconduct
- Propagates misinformation if not caught
The lesson: Never include LLM-generated citations in manuscripts, grants, presentations without verifying papers exist and support the claims.
Part 5: Privacy, HIPAA, and Legal Considerations
The HIPAA Problem
CRITICAL: Public ChatGPT is NOT HIPAA-compliant
Why:
- OpenAI stores conversations
- May use data for model training (unless opt-out configured)
- No Business Associate Agreement (BAA) for free/Plus tiers
- Data transmitted through OpenAI servers
Consequences of HIPAA violation:
- Civil penalties: $100-$50,000 per violation
- Criminal penalties: Up to $250,000 and 10 years imprisonment (for willful violations)
- Institutional sanctions, career consequences
What NOT to enter into public ChatGPT:
- Patient names, MRNs, DOB, addresses
- Detailed clinical vignettes with rare diagnoses (re-identification possible)
- Protected health information of any kind
HIPAA-compliant alternatives:
- Azure OpenAI Service
- GPT-4 via Microsoft Azure
- BAA available for healthcare customers
- Data not used for training
- Cost: API fees (usage-based)
- Google Cloud Vertex AI
- Med-PaLM 2, PaLM 2
- BAA for healthcare
- Enterprise controls
- Cost: Enterprise licensing
- Epic Integrated LLMs
- Built into EHR workflow
- HIPAA-compliant by design
- Deployment accelerating 2024-2025
- Vendor-specific medical LLMs
- Nuance DAX, Abridge, Glass Health
- BAA with healthcare systems
- Subscription models
Safe practices:
- Use only HIPAA-compliant systems for patient data
- De-identify cases before entering into public LLMs (but de-identification is imperfect)
- Institutional approval before LLM deployment
- Document patient consent where appropriate
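For teams taking the API route, here is a minimal sketch of calling a GPT-4-class model through Azure OpenAI using the openai Python SDK (v1+). It assumes your institution has a BAA and an approved deployment; the endpoint, deployment name, and environment variable are placeholders, and nothing containing PHI should be sent unless compliance has approved the service.

```python
# Minimal sketch (assumes openai>=1.0 and an institution-approved, BAA-covered
# Azure OpenAI deployment). Endpoint, deployment name, and key variable are
# placeholders. Do not send PHI unless your compliance office has approved it.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="YOUR-GPT4-DEPLOYMENT",  # the Azure deployment name, not "gpt-4"
    temperature=0.2,
    messages=[
        {"role": "system", "content": "You are a clinical documentation assistant. "
                                      "Flag uncertainty explicitly; do not invent citations."},
        {"role": "user", "content": "Draft a patient-friendly explanation of metformin's "
                                    "common side effects at an 8th-grade reading level."},
    ],
)
print(response.choices[0].message.content)
```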
Medical Liability Landscape
Current legal framework (evolving):
Physician remains responsible:
- LLM is a tool, not a practitioner
- Physician liable for all clinical decisions
- “AI told me to” is not a malpractice defense
Standard of care questions:
1. Is physician negligent for NOT using available LLM tools?
   - Currently: No clear standard
   - Future: May become expected for documentation efficiency
2. Is physician negligent for USING LLM incorrectly?
   - Yes: Using public ChatGPT for patient data = HIPAA violation
   - Yes: Following LLM recommendation without verification that causes harm
   - Yes: Delegating clinical judgment to LLM
The reproducibility problem:
LLMs produce different outputs for identical prompts, creating unique liability challenges. Unlike traditional medical software that produces deterministic results (same input always yields same output), LLMs use probabilistic sampling, meaning the same clinical question asked twice may generate different recommendations.
Documentation implications (Maddox et al., 2025):
- If LLM-generated clinical note varies between runs, which version becomes the legal record?
- Peer review of LLM-assisted decisions becomes difficult when outputs aren’t reproducible
- Quality assurance audits cannot validate LLM recommendations after the fact if the system produces different outputs when tested
Defensive documentation strategies:
- Note the specific LLM version and timestamp (e.g., “GPT-4 Turbo via Azure, January 15, 2025, 14:32”)
- Document key LLM outputs verbatim in clinical notes when material to decisions
- Explicitly note verification steps taken (“Differential diagnosis generated by AI, reviewed against UpToDate guidelines, decision based on clinical judgment”)
- Save LLM conversation logs where institutional policy and technical capacity allow
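One lightweight way to implement those strategies is to capture each material LLM interaction as a structured record (model, timestamp, prompt, verbatim output, verification note) and store it wherever institutional policy allows. A sketch, with field names that are assumptions rather than any standard schema:

```python
# Illustrative audit record for a material LLM interaction. Field names are
# assumptions; storage location and retention are governed by institutional policy.
import dataclasses
import datetime
import hashlib
import json

@dataclasses.dataclass
class LlmAuditRecord:
    model: str               # e.g., "GPT-4 Turbo via Azure"
    timestamp: str
    prompt: str
    output: str              # verbatim LLM output relied on in the decision
    verification_note: str   # e.g., "Reviewed against UpToDate; clinical judgment applied"

    def to_json(self) -> str:
        record = dataclasses.asdict(self)
        # Hash of the output makes later tampering or substitution detectable.
        record["output_sha256"] = hashlib.sha256(self.output.encode()).hexdigest()
        return json.dumps(record, indent=2)

record = LlmAuditRecord(
    model="GPT-4 Turbo via Azure",
    timestamp=datetime.datetime.now().isoformat(),
    prompt="Differential diagnosis for progressive dyspnea with bibasilar crackles",
    output="(verbatim LLM output pasted here)",
    verification_note="Differential reviewed against UpToDate; decision based on clinical judgment",
)
print(record.to_json())
```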
Malpractice insurance:
- Check policy coverage for AI-assisted care
- Some policies may exclude AI-related claims
- Ask explicitly: “Does this policy cover LLM use in clinical practice?” (Missouri Medicine, 2025)
- Notify insurer of LLM tool use before deployment, not after adverse events
Testing LLM consistency before deployment:
Before adopting any LLM tool for clinical use, test reproducibility:
- Select 10-20 representative clinical prompts (differential diagnosis questions, treatment recommendations, documentation tasks)
- Run each prompt 5 times with identical inputs
- Assess variance: Do outputs differ substantively or only stylistically?
- Document acceptable thresholds: Stylistic variation (word choice) acceptable; factual variation (different drug dosages) unacceptable
- Red flags: If the same prompt yields contradictory recommendations (e.g., “start beta-blocker” vs. “beta-blockers contraindicated”), do not deploy without vendor explanation
For medication dosing, diagnostic recommendations, or high-stakes decisions, reproducibility testing is essential before clinical deployment.
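A sketch of that test harness is below. `ask_llm` is a placeholder for whichever approved, BAA-covered client your institution uses (for example, the Azure sketch earlier in this chapter); difflib measures only textual similarity, so a human reviewer still has to classify differences as stylistic versus factual.

```python
# Sketch of a pre-deployment reproducibility check: run one prompt several
# times and report how much the outputs differ. `ask_llm` is a placeholder.
import difflib

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("Wire this to your approved, BAA-covered LLM endpoint")

def reproducibility_report(prompt: str, runs: int = 5) -> None:
    outputs = [ask_llm(prompt) for _ in range(runs)]
    baseline = outputs[0]
    for i, out in enumerate(outputs[1:], start=2):
        ratio = difflib.SequenceMatcher(None, baseline, out).ratio()
        print(f"Run {i} vs run 1: {ratio:.0%} textual similarity")
    # A reviewer then classifies differences as stylistic (acceptable) or
    # factual, e.g., conflicting drug doses (unacceptable).

# Example (uncomment once ask_llm is implemented):
# reproducibility_report("Recommended empiric antibiotics for community-acquired pneumonia?")
```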
Risk mitigation:
- Use only validated, HIPAA-compliant systems
- Always verify LLM outputs
- Maintain human oversight for all decisions
- Document verification
- Obtain consent where appropriate
- Monitor for errors continuously
- Test reproducibility before deployment (see Chapter 21 for detailed liability framework)
Part 6: Vendor Evaluation Framework
Before Adopting an LLM Tool for Clinical Practice
Questions to ask vendors:
- “Is this system HIPAA-compliant? Can you provide a Business Associate Agreement?”
- Essential for any system touching patient data
- No BAA = no patient data entry
- “What is the LLM training data cutoff date?”
- Cutoff dates vary by model and version (check vendor documentation)
- Older cutoff = more outdated medical knowledge
- Models with web search can access current information but still require verification
- “What peer-reviewed validation studies support clinical use?”
- Demand JAMA, NEJM, Nature Medicine publications
- User satisfaction ≠ clinical validation
- Ask for prospective studies, not just retrospective benchmarks
- “What is the hallucination rate for medical content?”
- If vendor can’t quantify, they haven’t tested rigorously
- Accept that hallucinations are unavoidable; question is frequency
- “How does the system handle uncertainty?”
- Good LLMs express appropriate uncertainty (“I’m not certain, but…”)
- Bad LLMs confidently hallucinate when uncertain
- “What verification/oversight mechanisms are built into the workflow?”
- Best systems require physician review before acting on LLM output
- Dangerous systems allow autonomous LLM actions
- “How does this integrate with our EHR?”
- Practical integration essential for adoption
- Clunky workarounds fail
- “What is the cost structure and ROI evidence?”
- Subscription per physician? API usage fees?
- Request time-savings data, physician satisfaction metrics
- “What testing validates consistency of outputs across multiple runs?”
- Ask for reproducibility data: same input, how often does output differ?
- Critical for clinical decisions where consistency matters (dosing, treatment recommendations)
- If vendor hasn’t tested, they haven’t validated for clinical use
- “Does your malpractice insurance explicitly cover LLM use?”
- Many policies exclude AI-related claims or require explicit rider
- Ask insurer directly, don’t rely on vendor assurances
- Request coverage confirmation in writing before deployment
- “Who is liable if LLM output causes patient harm?”
- Most vendors disclaim liability in contracts
- Physician/institution bears risk
- “What data is retained, and can patients opt out?”
- Data retention policies
- Patient consent/opt-out mechanisms
Red Flags (Walk Away If You See These)
- No HIPAA compliance for clinical use (public ChatGPT marketed for medical decisions)
- Claims of “replacing physician judgment” (LLMs assist, don’t replace)
- No prospective clinical validation (only benchmark exam scores)
- Autonomous actions without physician review (medication ordering, diagnosis without oversight)
- Vendor refuses to discuss hallucination rates (hasn’t tested or hiding poor performance)
Part 7: Cost-Benefit Reality
What Does LLM Technology Cost?
Ambient documentation (Nuance DAX, Abridge):
- Cost: ~$369-600/month per physician (varies by contract and volume)
- Benefit: 1 hour/day time savings × $200/hour = $4,000/month
- ROI: Positive in 1-3 months
- Non-monetary benefit: Reduced burnout, improved work-life balance
GPT-4 API (HIPAA-compliant):
- Cost: ~$0.03 per 1,000 input tokens, $0.06 per 1,000 output tokens
- Typical clinical note: 500 tokens input, 1,000 output = $0.075 per note
- If 20 notes/day: $1.50/day = $30/month (cheaper than subscription)
- But: Requires technical integration, institutional IT support
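The per-note arithmetic, spelled out (the rates are the illustrative figures quoted above; actual API pricing varies by model and contract):

```python
# Illustrative token-cost arithmetic using the rates quoted above (assumptions).
input_rate = 0.03 / 1000    # dollars per input token
output_rate = 0.06 / 1000   # dollars per output token

def note_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * input_rate + output_tokens * output_rate

per_note = note_cost(500, 1000)          # $0.075
per_month = per_note * 20 * 20           # 20 notes/day, ~20 working days/month
print(f"${per_note:.3f} per note, about ${per_month:.0f} per month")
```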
Glass Health (LLM clinical decision support):
- Cost: Free tier available, paid tiers ~$100-300/month
- Benefit: Differential diagnosis brainstorming, treatment suggestions
- ROI: Unclear; depends on how often you use it for complex cases
Epic LLM integration (message drafting, note summarization):
- Cost: Bundled into EHR licensing for institutions
- Benefit: Incremental time savings across multiple workflows
Do These Tools Save Money?
Ambient documentation: YES
- 50% time savings is substantial
- Reduced after-hours charting improves physician well-being
- Cost-effective based on time saved
- Caveat: Requires subscription commitment; per-physician cost limits small practice adoption
API-based documentation assistance: MAYBE
- Much cheaper than subscriptions (~$30/month vs. $400-600/month)
- But requires IT infrastructure, integration effort
- ROI depends on institutional technical capacity
Literature summarization: UNCLEAR
- Time savings real (10 min to read guideline vs. 2 min to review LLM summary)
- But risk of hallucinations means verification still required
- Net time savings modest
Patient education generation: PROBABLY
- Faster than writing from scratch
- But requires physician review
- Best for high-volume needs (discharge instructions, common diagnoses)
Part 8: The Future of Medical LLMs
What’s Coming in the Next 3-5 Years
Likely developments:
- EHR-integrated LLMs become standard
- Epic, Cerner, Oracle already deploying
- Message drafting, note summarization, coding assistance
- HIPAA-compliant by design
- Multimodal medical LLMs
- Text + images + lab data + genomics
- “Show me this rash” + clinical history → differential diagnosis
- Radiology report + imaging → integrated assessment
- Reduced hallucinations
- Retrieval-augmented generation (LLM + medical database lookup)
- Better uncertainty quantification
- Improved factuality through constrained generation
- Prospective clinical validation
- RCTs showing improved outcomes (not just time savings)
- Cost-effectiveness analyses
- Comparative studies (LLM-assisted vs. standard care)
- Regulatory clarity
- FDA guidance on LLM medical devices
- State medical board policies on LLM use
- Malpractice liability precedents
Unlikely (despite hype):
- Fully autonomous diagnosis/treatment
- Too high-stakes for pure LLM decision-making
- Human oversight will remain essential
- Complete elimination of hallucinations
- Fundamental to how LLMs work
- Mitigation, not elimination, is realistic goal
- Replacement of physician-patient relationship
- LLMs assist communication, don’t replace human connection
- Empathy, trust, shared decision-making remain human domains
Part 9: Implementation Guide
Safe LLM Implementation Checklist
Pre-Implementation:
- ☐ Institutional approval obtained (IT, privacy, compliance, legal)
- ☐ HIPAA compliance verified (BAA in place)
- ☐ Malpractice insurance notified
- ☐ Clear use case defined (documentation, education, brainstorming)
- ☐ Physician training completed (capabilities, limitations, verification)
- ☐ Patient consent process established (if applicable)
During Use:
- ☐ Never enter patient identifiers into public LLMs
- ☐ Always verify medical facts against authoritative sources
- ☐ Treat as draft/assistant, never autonomous decision-maker
- ☐ Document verification steps in clinical notes
- ☐ Maintain physician oversight for all decisions
Post-Implementation:
- ☐ Monitor for errors, near-misses, hallucinations
- ☐ Track time savings, physician satisfaction
- ☐ Review random sample of LLM-assisted notes for quality
- ☐ Update policies based on experience
- ☐ Stay current with evolving regulations
Key Takeaways
10 Principles for LLM Use in Medicine
1. LLMs are assistants, not doctors: Always maintain human oversight and final decision-making
2. Hallucinations are unavoidable: Verify all medical facts, never trust blindly
3. HIPAA compliance is non-negotiable: Public ChatGPT is NOT appropriate for patient data
4. Appropriate uses: Documentation drafts, literature review, education materials (with review)
5. Inappropriate uses: Autonomous diagnosis/treatment, medication dosing without verification, urgent decisions
6. Physician remains legally responsible: “AI told me to” is not a malpractice defense
7. Evidence is evolving: USMLE performance ≠ clinical utility; demand prospective RCTs
8. Ambient documentation shows clearest benefit: 50% time savings with high satisfaction
9. Prompting quality matters: Specific, detailed prompts with sourcing requests yield better outputs
10. The future is collaborative: Effective physician-LLM partnership, not replacement
Clinical Scenario: LLM Vendor Evaluation
Scenario: Your Hospital Is Considering Glass Health for Clinical Decision Support
The pitch: Glass Health provides LLM-powered differential diagnosis and treatment suggestions. Marketing claims:
- “Physician-level diagnostic accuracy”
- “Evidence-based treatment recommendations”
- “Saves 20 minutes per complex case”
- Cost: $200/month per physician
The CMO asks for your recommendation.
Questions to ask:
- “What peer-reviewed validation studies support Glass Health?”
- Request JAMA, Annals, specialty journal publications
- User testimonials ≠ clinical validation
- “Is this HIPAA-compliant? Where is the BAA?”
- Essential for entering patient data
- “What is the hallucination rate?”
- If vendor hasn’t quantified, they haven’t tested properly
- “How does Glass Health handle diagnostic uncertainty?”
- Does it express appropriate uncertainty or confidently hallucinate?
- “What workflow oversight prevents acting on incorrect recommendations?”
- Best systems require physician review before actions
- “Can we pilot with 10 physicians before hospital-wide deployment?”
- Local validation essential
- “What happens if Glass Health recommendation causes harm?”
- Read liability disclaimers in contract
- “What is actual time savings data?”
- “20 minutes per complex case” claim: where’s the evidence?
Red Flags:
- “Physician-level accuracy” without prospective validation
- No discussion of hallucination rates or error modes
- Marketing emphasizes speed over safety
- No built-in verification mechanisms
Check Your Understanding
Scenario 1: The Medication Dosing Question
Clinical situation: You’re seeing a 4-year-old with otitis media requiring amoxicillin. You ask GPT-4 (via HIPAA-compliant API):
“What is the appropriate amoxicillin dosing for a 4-year-old child with acute otitis media?”
GPT-4 responds: “For acute otitis media in a 4-year-old, amoxicillin dosing is 40-50 mg/kg/day divided into two doses (every 12 hours). For a 15 kg child, this would be 300-375 mg twice daily.”
Question 1: Do you prescribe based on this recommendation?
Click to reveal answer
Answer: No, verify against authoritative source first.
Why:
The LLM response is partially correct but incomplete:
- Standard-dose amoxicillin: 40-50 mg/kg/day divided BID (LLM correct)
- But: AAP now recommends high-dose amoxicillin (80-90 mg/kg/day divided BID) for most cases of AOM due to increasing S. pneumoniae resistance
- LLM likely trained on older guidelines pre-dating high-dose recommendations
Correct dosing (per 2023 AAP guidelines):
- High-dose: 80-90 mg/kg/day divided BID (first-line for most cases)
- For 15 kg child: 600-675 mg BID
- Standard-dose: 40-50 mg/kg/day only for select cases (penicillin allergy evaluation, mild infection in low-resistance areas)
What you should do:
1. Check UpToDate, Lexicomp, or AAP guidelines directly
2. Confirm high-dose amoxicillin is indicated
3. Prescribe 600-675 mg BID (not the LLM-suggested 300-375 mg)
The lesson: LLMs may provide outdated recommendations or miss recent guideline updates. Always verify medication dosing against current pharmacy databases or guidelines.
If you had prescribed the LLM dose:
- Child receives 50% of intended amoxicillin
- Higher risk of treatment failure
- Potential for antibiotic resistance development
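For reference, the weight-based arithmetic behind those numbers. This is illustrative only; confirm dosing against current guidelines or a pharmacy database before prescribing.

```python
# Illustrative weight-based BID dose calculation for the 15 kg child in the scenario.
def bid_dose_range_mg(weight_kg: float, low_mg_kg_day: float, high_mg_kg_day: float):
    """Per-dose range (mg) for twice-daily (BID) dosing at the given daily mg/kg range."""
    return weight_kg * low_mg_kg_day / 2, weight_kg * high_mg_kg_day / 2

weight_kg = 15
print("High-dose (80-90 mg/kg/day): %.0f-%.0f mg per dose BID" % bid_dose_range_mg(weight_kg, 80, 90))
print("LLM-suggested (40-50 mg/kg/day): %.0f-%.0f mg per dose BID" % bid_dose_range_mg(weight_kg, 40, 50))
```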
Scenario 2: The Patient Education Handout
Clinical situation: You’re discharging a patient newly diagnosed with type 2 diabetes. You use GPT-4 to generate patient education handout:
“Create a one-page patient handout for newly diagnosed type 2 diabetes, 8th-grade reading level. Cover: medications, blood sugar monitoring, diet, exercise.”
GPT-4 generates professional-looking handout with sections on metformin, glucometer use, carb counting, and walking recommendations.
Question 2: Can you give this handout to the patient as-is, or do you need to review/edit first?
Click to reveal answer
Answer: MUST review and edit before giving to patient.
Why:
Potential LLM errors to check:
- Medication information:
- Is metformin dosing correct? (LLMs sometimes hallucinate dosages)
- Are side effects accurately described?
- Are contraindications mentioned? (metformin contraindicated in advanced CKD)
- Blood sugar targets:
- ADA guidelines: Fasting <130 mg/dL, postprandial <180 mg/dL for most
- LLM may use outdated targets or not individualize
- Dietary advice:
- Is carb counting explained clearly?
- Any culturally inappropriate diet suggestions?
- Consistent with current ADA nutrition guidelines?
- Exercise recommendations:
- 150 min/week moderate exercise is standard
- Are contraindications mentioned? (avoid exercise if glucose >250 mg/dL + ketones)
- Individualization:
- LLM generates generic handout
- Does it fit THIS patient’s health literacy, culture, comorbidities?
What you should do:
1. Read entire handout for medical accuracy
2. Verify all medical facts (especially medication dosing, blood sugar targets)
3. Customize to patient (add their specific meds if not metformin, adjust for comorbidities)
4. Check cultural appropriateness
5. Review with patient, answer questions
The lesson: LLM-generated patient materials save time but require physician review for accuracy and individualization. Never hand out LLM content without verification.
Time savings:
- Writing from scratch: 15-20 minutes
- LLM draft + physician review/editing: 5-7 minutes
- Net benefit: 10+ minutes saved while maintaining quality
Scenario 3: The Citation Fabrication
Clinical situation: You’re writing a quality improvement manuscript on reducing hospital readmissions. You ask GPT-4:
“Cite 3 recent studies (2020-2023) showing effectiveness of AI for predicting 30-day readmissions in heart failure patients.”
GPT-4 provides:
- “Chen et al. (2022). ‘Machine learning for heart failure readmission prediction.’ JAMA Cardiology 7:445-453.”
- “Rodriguez-Lopez et al. (2021). ‘Deep learning models outperform LACE index for readmission risk.’ Circulation 144:892-901.”
- “Patel et al. (2023). ‘Prospective validation of AI readmission tool in community hospitals.’ New England Journal of Medicine 388:1122-1131.”
Question 3: Can you include these citations in your manuscript?
Click to reveal answer
Answer: NO. You must verify each citation exists and actually supports your claim.
Why:
The LLM likely fabricated some or all of these citations. Here’s how to check:
Step 1: Search PubMed for each citation
For “Chen et al. (2022) JAMA Cardiology”:
- Search: "Chen" AND "heart failure readmission" AND "machine learning" AND "JAMA Cardiology" AND 2022
- If found: Read abstract, confirm it supports your claim
- If NOT found: Citation is fake
Step 2: Verify journal, volume, pages
Even if an author “Chen” published in JAMA Cardiology in 2022, check:
- Is the title correct?
- Is the volume/page number correct?
- Does the paper actually discuss AI for HF readmissions?
Step 3: Read the actual papers
If citations exist:
- Do they support the claim you’re making?
- Are study methods sound?
- Are conclusions being accurately represented?
Likely outcome:
- 1-2 of these 3 citations are completely fabricated
- Even if a paper exists, it may not say what you think (or what the LLM suggests)
What you should do instead:
Search PubMed yourself:
("heart failure" OR "HF") AND ("readmission" OR "rehospitalization") AND ("machine learning" OR "artificial intelligence" OR "AI") AND ("prediction" OR "risk score")Filter: Publication date 2020-2023, Clinical Trial or Review
Read abstracts, select relevant papers
Cite actual papers you’ve read
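If you prefer to script that search, the same NCBI E-utilities endpoint used earlier in this chapter can run it with a date filter. This is a sketch; the returned PMIDs still need to be read, not just counted.

```python
# Sketch: run the PubMed query above via NCBI E-utilities with a date filter.
import json
import urllib.parse
import urllib.request

query = ('("heart failure" OR "HF") AND ("readmission" OR "rehospitalization") '
         'AND ("machine learning" OR "artificial intelligence" OR "AI") '
         'AND ("prediction" OR "risk score")')
params = urllib.parse.urlencode({
    "db": "pubmed", "term": query, "retmode": "json", "retmax": 20,
    "datetype": "pdat", "mindate": "2020", "maxdate": "2023",
})
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?{params}"
with urllib.request.urlopen(url) as resp:
    result = json.load(resp)["esearchresult"]
print(result["count"], "matching records; first PMIDs:", result["idlist"])
```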
The lesson: Never trust LLM-generated citations. LLMs fabricate references 15-30% of the time. Always verify papers exist and support your claims.
Consequences of using fabricated citations:
- Manuscript rejection
- If published then discovered: Retraction
- Academic dishonesty allegations
- Career damage
Time comparison:
- LLM citations (unverified): 30 seconds
- Manual PubMed search + reading abstracts: 15-20 minutes
- Worth the extra time to avoid fabricated references