AI Tools Every Physician Should Know

Over 1,300 AI medical devices have FDA authorization (FDA AI-Enabled Medical Devices), but most physicians don’t know which ones actually work. This chapter cuts through marketing hype to identify validated tools you can deploy today: diagnostic AI with prospective trial data, ambient scribes that save 1-2 hours daily, and specialty-specific applications backed by peer-reviewed evidence.

Learning Objectives

After reading this chapter, you will be able to:

  • Identify FDA-cleared and clinically validated AI tools by specialty
  • Understand capabilities and limitations of each tool category
  • Evaluate tools appropriate for your practice setting
  • Navigate privacy, liability, and reimbursement considerations
  • Distinguish evidence-based tools from marketing hype
  • Access hands-on resources for learning AI tools

The Clinical Context: Hundreds of AI tools are marketed to physicians, and most lack rigorous validation. This chapter cuts through the hype to identify evidence-based, FDA-cleared, or widely adopted tools physicians can actually use safely and effectively.

Tool Categories:

  1. Clinical Decision Support
  2. Diagnostic AI (Imaging, Pathology, Dermatology)
  3. Documentation and Ambient Scribe
  4. Literature Search and Synthesis
  5. Patient Communication
  6. Specialty-Specific Tools

Critical Selection Criteria:

  • FDA clearance or peer-reviewed validation
  • Evidence from prospective studies
  • Real-world deployment at multiple institutions
  • Clear clinical use case
  • Reasonable cost and ROI
  • EHR integration or minimal workflow disruption

Top Tier: Highest Evidence Tools

IDx-DR (Diabetic Retinopathy Screening)

  • FDA-cleared, autonomous diagnostic
  • Prospective RCT validation
  • CPT reimbursement established
  • Deployed widely in primary care

Viz.ai (Stroke, PE Detection)

  • FDA-cleared multiple indications
  • Reduces time-to-treatment
  • Deployed at 1700+ hospitals
  • Strong evidence base

Paige Prostate (Pathology AI)

  • FDA-cleared
  • Prospective validation
  • Deployed in clinical pathology labs

OpenEvidence (Clinical Decision Support)

  • Company reports daily use by 40% of U.S. physicians
  • 18 million monthly consultations (December 2025)
  • Evidence-based answers from peer-reviewed literature
  • Peer-reviewed validation in primary care settings

Doximity GPT (Documentation & Clinical Assistant)

  • Free, HIPAA-compliant
  • Built into platform most U.S. physicians already use
  • Documentation, coding, differential diagnosis support
  • 10,000+ Scribe beta testers

Nuance DAX / Ambient Documentation

  • High physician satisfaction
  • ~50% documentation time reduction (vendor studies); 8-15% (independent)
  • Widely adopted

Strong Evidence Tier:

Aidoc (Multiple Radiology Applications)

  • ICH, PE, pneumothorax, C-spine fractures
  • Multiple FDA clearances
  • Deployed at 1000+ sites

Arterys / Circle CVI (Cardiac MRI Quantification)

  • FDA-cleared
  • Improves measurement standardization
  • Integrated into scanner workflows

Lunit INSIGHT (Chest X-Ray, Mammography)

  • FDA-cleared
  • Strong performance in trials
  • International deployment

Emerging But Promising:

AI Scribes (Suki, Abridge, DeepScribe)

  • High user satisfaction
  • Limited long-term outcome data
  • Privacy considerations

UpToDate AI Features

  • Literature synthesis
  • Trusted source + AI enhancement
  • Validation ongoing

Low Evidence / Avoid:

  • Most direct-to-consumer symptom checkers - Poor accuracy
  • Unvalidated chatbots for medical advice - Risk of misinformation
  • “AI diagnoses everything” systems - Marketing > evidence
  • Tools without peer-reviewed publications - Unproven claims

See Also: Specialty-Specific AI Tools for a comprehensive catalog of FDA-cleared devices by clinical specialty.

Introduction: Navigating the AI Tool Landscape

As of late 2025, the FDA has authorized over 1,300 AI/ML-based medical devices (FDA AI-Enabled Medical Devices), with hundreds more marketed without FDA oversight (clinical decision support, wellness applications). For physicians, the challenge isn’t finding AI tools; it’s identifying which ones actually work, have solid evidence, integrate into workflows, and provide clinical value.

This chapter provides:

  • Evidence-based tool recommendations by category
  • Specific product names and validation evidence
  • Implementation considerations
  • Cost and reimbursement information where available
  • Hands-on resources for evaluation

What this chapter is NOT:

  • Comprehensive product catalog (tools evolve rapidly)
  • Vendor endorsements (evidence-based assessments only)
  • Substitutes for your own due diligence


Category 1: Clinical Decision Support (CDS)

Traditional CDS (Pre-AI Era)

UpToDate

  • Type: Evidence-based clinical reference
  • AI Features: Recently adding AI-powered literature review, question answering
  • Evidence: Widely adopted, associated with improved outcomes in observational studies (Isaac et al., 2012)
  • Cost: Institutional/individual subscriptions ($500-700/year individual)
  • Strength: Trusted source, regularly updated, covers all major specialties
  • Limitation: Not “AI” in the modern sense, though adding AI features

DynaMedex with Dyna AI

  • Type: Clinical decision support combining DynaMed (evidence-based clinical information) and Micromedex (drug information) with AI integration
  • AI Features: Dyna AI surfaces concise, evidence-based clinical information quickly from exclusively DynaMedex sources
  • Cost: Free for ACP members
  • Access: Part of the ACP AI Resource Hub (https://www.acponline.org/clinical-information/clinical-resources-products/artificial-intelligence-ai-resource-hub)

OpenEvidence

  • Type: AI copilot for evidence-based clinical decision support at point of care
  • Function: Natural language clinical questions answered with evidence from peer-reviewed literature (NEJM, JAMA, Cochrane, specialty journals)
  • Adoption: Company reports daily use by 40% of U.S. physicians across 10,000+ hospitals, with 18 million clinical consultations in December 2025 (Sermo, 2026; STAT News, January 2026)
  • Evidence: PMC-published study found OpenEvidence provided accurate, evidence-based recommendations aligned with physician treatment plans in primary care settings, with high ratings for clarity, relevance, and evidence-based support (PubMed, 2026)
  • Funding: $6 billion valuation after raising $200 million twice in 2025 (STAT News, January 2026)
  • Cost: Subscription-based; institutional licensing available
  • Strength: High physician adoption, evidence grounded in peer-reviewed literature, continuous updates
  • Limitation: Requires subscription; independent validation beyond primary care setting limited; evidence synthesis quality depends on underlying literature availability
  • Verdict: Widely adopted AI-powered CDS tool with peer-reviewed validation in primary care

DXplain (Massachusetts General Hospital)

  • Type: Differential diagnosis generator
  • Function: Enter findings → generates ranked differential
  • Evidence: Used since the 1980s; comparative studies show solid diagnostic accuracy, scoring 3.45/5, tied with Isabel for top DDx generator performance (Bond et al., 2012); original system description (Barnett et al., 1987)
  • Cost: Free for medical professionals
  • Strength: Broad knowledge base
  • Limitation: Doesn’t narrow the differential without clinical judgment

Isabel Healthcare

  • Type: Differential diagnosis support
  • Function: Enter patient presentation → suggests diagnoses
  • Evidence: Comparative evaluation showed Isabel tied with DXplain at 3.45/5 for diagnostic accuracy, outperforming other DDx generators (Bond et al., 2012); company reports 96% accuracy for including correct diagnosis in top suggestions
  • Cost: Subscription-based
  • Limitation: Accuracy variable, requires clinical interpretation

Modern AI-Enhanced CDS

Epic Sepsis Model

  • Type: EHR-integrated sepsis prediction
  • Function: Real-time risk score based on vital signs, labs
  • Evidence: CONTROVERSIAL - External validation showed 33% sensitivity (missed 67% of cases), 12% PPV (Wong et al., 2021)
  • Cost: Included with Epic EHR
  • Strength: Integrated workflow
  • Limitation: High false positive rates, mixed evidence for clinical benefit
  • Verdict: Use with caution, understand limitations
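
The PPV problem here is largely arithmetic: at the low prevalence typical of inpatient sepsis, even a modest false positive rate swamps the true positives. The sketch below is illustrative only; the specificity and prevalence values are assumptions chosen to show the effect, not the Epic model’s published operating characteristics, though the first line happens to land near the PPV reported by Wong et al.

```python
# Illustrative only: how positive predictive value collapses at low prevalence.
# Sensitivity is taken from the external validation cited above; the specificity
# and prevalence values are assumptions for demonstration.

def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for spec in (0.83, 0.90, 0.95):
    print(f"sens=0.33, spec={spec:.2f}, prevalence=0.07 -> PPV={ppv(0.33, spec, 0.07):.2f}")
```

Even a hypothetical model with 95% specificity would generate roughly two false alerts for every true case at this assumed prevalence, which is why alert fatigue dominates the real-world experience with these scores.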

WAVE Clinical Platform (ExcelMedical)

  • Type: Continuous vital sign monitoring + early warning scores
  • Function: ICU/step-down monitoring, deterioration prediction
  • FDA Clearance: K171056 (January 2018, first FDA-cleared patient surveillance system)
  • Evidence: UPMC study showed detection of MET events 6.33 hours in advance and 58% reduction in critical instability events after implementation (Hravnak et al., 2008)
  • Cost: Institutional licensing
  • Use case: Hospital early warning systems


Category 2: Diagnostic AI

Radiology AI (Comprehensive List)

Intracranial Hemorrhage Detection

Aidoc has become a widely deployed solution for many radiology departments, with implementation at 1,000+ hospitals. The system detects intracranial hemorrhages, pulmonary embolisms, and C-spine fractures that might otherwise be missed or delayed in the queue.

Validation studies show 92.3% sensitivity and 97.7% specificity for ICH detection (Voter et al., JACR 2021). A real-world evaluation across 17 facilities (101,944 examinations) found 82.2% overall sensitivity and 97.6% specificity, with 95% sensitivity for large hemorrhages (npj Digit Med 2025). Unlike many AI tools that radiologists ignore, this one gets used because the mobile notifications integrate smoothly into existing PACS workflows. Cost is $20-50K per scanner annually, which adds up for multi-scanner facilities. However, if it catches one missed bleed or gets neurosurgery mobilized 30 minutes faster, the return on investment is substantial.

Viz.ai takes a different approach that focuses on care coordination. Beyond detection, when it flags a large vessel occlusion stroke, it simultaneously alerts the stroke team, neurology, and interventional radiology. The published data shows reduction in time-to-treatment (Figurelle et al., AJNR 2023), with the VISIION study demonstrating 39% door-to-groin reduction for off-hours large vessel occlusion cases.

The platform has since expanded to pulmonary embolism, aortic dissection, and other time-critical diagnoses. Deployed at 1,700+ hospitals, Viz.ai works best in systems where rapid team mobilization impacts patient outcomes. Cost varies by institutional contract.

RapidAI has significant presence in the neuroradiology space. Beyond hemorrhage detection, it provides ASPECTS scoring, perfusion analysis, and hemorrhage volume quantification. A CADTH health technology assessment reviewed 13 diagnostic accuracy studies (NCBI NBK611329). These tools help neurology and neurosurgery make treatment decisions faster. For institutions handling complex stroke cases, RapidAI’s feature set warrants consideration.

Chest X-Ray AI

Lunit INSIGHT CXR tackles the bread-and-butter of emergency and hospital radiology: pneumothorax, nodules, consolidation, pleural effusion, cardiomegaly. A head-to-head validation study in Radiology (2024) showed Lunit achieved highest AUC of 0.93 for lung nodule detection, surpassing other AI vendors and human readers. A European study demonstrated 36.2% workload reduction while maintaining 95% sensitivity for urgent cases.

What’s interesting about Lunit is they’ve focused on doing common things well rather than promising to detect everything. That focus shows in the performance metrics. The company has published over 100 peer-reviewed papers across SCI journals including The Lancet Digital Health, JAMA Oncology, and Radiology.

Oxipit ChestLink is a comprehensive autonomous chest X-ray reporting system that classifies studies as normal versus abnormal across multiple pathologies. Validation studies show 99.1% sensitivity for abnormal findings versus 72.3% for radiologists, with 99.8% sensitivity for critical findings (Plesner et al., 2023). The system autonomously reports normal cases without radiologist involvement, making it suitable for high-volume screening settings where radiologist bandwidth is limited.

qXR from Qure.ai detects 29 different chest X-ray findings. Validation studies show 96.38% sensitivity for abnormal CXR classification (BJR|Open 2024), and the RADICAL study in BMJ Open evaluated performance in a UK lung cancer screening setting (BMJ Open 2024). Their strongest evidence comes from TB-endemic regions, which makes sense given the company’s focus on emerging markets. The pricing is competitive, and for resource-limited settings where radiologist access is scarce, qXR provides real value. Just understand that a tool optimized for India’s public health challenges may perform differently in US community hospitals.

Mammography AI

iCAD ProFound AI dominates the US mammography AI market, and the dominance is earned. A Massachusetts General Hospital study found ProFound AI 3.0 achieved AUC of 0.93, with 100% sensitivity and 67.0% specificity at the rule-out threshold (Resch et al., Radiol Imaging Cancer 2024). Integration with major mammography vendors means most radiology practices can deploy it without replacing existing infrastructure.

The evidence base is solid. Widespread US deployment gives you the advantage of learning from other institutions’ implementation experiences. If you’re evaluating mammography AI, iCAD is the benchmark against which you’ll compare alternatives.

Lunit INSIGHT MMG offers a credible alternative, particularly if you’re in Europe or Asia where it has stronger market presence. A JAMA Oncology study found Lunit achieved highest accuracy among AI solutions for screening mammography (Salim et al., JAMA Oncol 2020). A BreastScreen Norway study of 660,000+ examinations showed AUC of 0.93. What makes Lunit interesting is their international validation. If your patient population is demographically diverse, evidence from European and Asian deployments matters.

Hologic Genius AI is your choice if you already use Hologic mammography equipment. A Massachusetts General Hospital retrospective analysis of 7,500 DBT screening exams found AI flagged 32% of false-negative cases and correctly localized 90% of previously identified cancers. The smooth native integration is compelling: no middleware, no separate workstation, just AI baked into your existing workflow. Performance is solid, though not necessarily better than standalone options. The decision here is mostly about workflow simplicity versus best-in-class performance from independent vendors.

Other Imaging Modalities

Arterys / Tempus Pixel (Cardiac MRI, CT Angiography)

  • FDA Clearance: Multiple (first FDA clearance for cloud-based AI medical imaging, January 2017)
  • Function: Automated cardiac chamber quantification, vessel analysis
  • Note: Arterys was acquired by Tempus Labs; the cardiac MRI AI product is now called Tempus Pixel Cardio (FDA-cleared K203744)
  • Evidence: Multi-vendor comparison study validated performance (Ruijsink et al., 2022)
  • Deployment: Academic medical centers, cardiology practices
  • Verdict: Leading cardiac imaging AI

HeartFlow FFR-CT

  • FDA Clearance: Yes (De Novo)
  • Function: CT-based fractional flow reserve (non-invasive)
  • Evidence: PRECISE trial (n=2,103) demonstrated reduced unnecessary catheterization without increase in death or MI (JAMA Cardiol 2023); PACIFIC trial showed 90% sensitivity, 86% specificity per vessel (JACC 2018)
  • Reimbursement: CPT codes established
  • Cost-effectiveness: FISH&CHIPS study (>90,000 NHS patients) demonstrated prognostic value and cost savings
  • Verdict: Excellent evidence, clinically impactful

Paige Prostate represents what medical AI should look like: narrow scope, prospective validation, clear clinical use case. FDA granted De Novo clearance in 2021 after the company demonstrated that their prostate biopsy cancer detection algorithm actually improves detection of high-grade cancer (Perincheri et al., Mod Pathol 2021).

This isn’t replacing pathologists. It’s flagging suspicious regions for their review, reducing false negatives while maintaining human-in-the-loop oversight. Several pathology labs have deployed it clinically, which tells you the evidence convinced the people who actually have to stake their reputation on the results.

PathAI and Proscia are worth watching. PathAI’s AIM-MASH received FDA Drug Development Tool (DDT) Qualification in December 2025, the first AI-powered pathology tool qualified for MASH clinical trials; clinical validation published in Nature Medicine (Sanyal et al., 2025). Proscia focuses on digital pathology infrastructure plus AI modules; their platform validation showed robust performance on uncurated multi-site data (Ianni et al., Sci Rep 2020). Workflow integration often determines whether AI gets used or ignored, and both companies are building systematically rather than rushing to market.

Dermatology AI

3Derm (now part of Digital Diagnostics) received FDA Breakthrough Device Designation in January 2020 for autonomous detection of melanoma, squamous cell carcinoma, and basal cell carcinoma. The system is still under clinical investigation and has not yet received FDA clearance as of January 2026. The validation studies that exist show performance varies significantly by skin type, a recurring problem with dermatology AI trained predominantly on lighter skin.

SkinVision and similar direct-to-consumer smartphone apps? I wouldn’t recommend them clinically. Variable validation, inconsistent performance, and the systematic review by Freeman et al. (Freeman et al., BMJ 2020) showed most consumer dermatology apps lack rigorous validation, with sensitivity ranging from 7-73%. Note: SkinVision is not available in the US; it holds CE marking as a Class IIa medical device under EU MDR. Patients will use these apps anyway, so you should know what they are. Just don’t endorse them.

Ophthalmology AI

IDx-DR / LumineticsCore was the first autonomous AI diagnostic system FDA cleared (De Novo pathway, April 2018). The prospective RCT showed 87.2% sensitivity and 90.7% specificity for diabetic retinopathy screening (Abràmoff et al., npj Digit Med 2018). The product was rebranded to LumineticsCore in 2023.

Factors contributing to IDx-DR’s clinical adoption include narrow application (referable diabetic retinopathy: yes or no), clear clinical need (primary care physicians need retinal screening but lack ophthalmology expertise), reimbursement pathway (CPT 92229, approximately $50-80), and validation in real primary care settings, not just retrospective datasets.

Deployment has expanded in primary care offices, endocrinology clinics, and federally qualified health centers. For practices seeing diabetic patients with limited ophthalmology referral access, this system warrants consideration.

EyeArt from Eyenuk offers comparable performance, with FDA 510(k) clearance (August 2020). A pivotal evaluation in JAMA Network Open showed high accuracy for autonomous detection of referrable and vision-threatening DR (Ipp et al., JAMA Netw Open 2021); a study of 100,000+ patients validated real-world performance (Bhaskaranand et al., Diabetes Technol Ther 2019). RetCAD tackles both diabetic retinopathy and age-related macular degeneration, with validation across European settings (González-Gonzalo et al., Acta Ophthalmol 2020). Note: RetCAD is CE marked but not FDA cleared for the US market.


Category 3: Documentation and Ambient Scribe AI

FDA Note: These tools are NOT FDA-regulated (they assist with documentation rather than make diagnostic claims)

Ambient documentation represents a significant application of AI for reducing physician administrative burden. These systems address documentation time rather than diagnostic accuracy.

Nuance DAX (Dragon Ambient eXperience) is widely deployed in the ambient scribe market. The system listens to patient encounters, transcribes conversations, extracts clinical information, and auto-generates SOAP note drafts. Physicians review, edit, and sign the generated notes.

Vendor-funded studies report ~50% reduction in documentation time, though independent peer-reviewed studies show more modest but significant results. A cohort study at Intermountain Health found 28.8% lower documentation time per encounter among high users (those using DAX in >60% of visits), translating to 1.8 minutes saved per visit (Haberle et al., JAMIA 2024). A Stanford study of 45 physicians found median daily documentation time decreased by 6.89 minutes, after-hours EHR time by 5.17 minutes, and total EHR time by 19.95 minutes per day (Ma et al., JAMIA 2025). Thousands of physicians across specialties have adopted ambient scribe technology. Cost runs approximately $600 per physician per month plus setup fees.

Critical caveat: physicians must review AI-generated notes. These systems make errors, miss nuance, and occasionally misinterpret what was said. However, editing a partially correct note is faster than generating one from scratch.

Abridge creates patient-shareable visit summaries in addition to clinical documentation. Recordings and written summaries allow patients to review encounters afterward. A quality improvement study at University of Kansas Medical Center found clinicians using Abridge were 7x more likely to find their workflow easy and 5x more likely to complete notes before the next patient visit; 67% reported feeling less at risk of burnout (Albrecht et al., JAMIA Open 2025). A randomized crossover trial showed 46.6% reduction in cognitive load as measured by NASA-TLX (Hudson et al., Mayo Clin Proc Digit Health 2025). Cost is competitive with DAX.

Suki adds features beyond transcription, including order placement, ICD/CPT code lookup, and voice-enabled EHR navigation. A peer-reviewed validation study assessed note quality using the modified PDQI-9 metric across general medicine, pediatrics, OB/GYN, orthopedics, and cardiology, showing high interrater agreement for most specialties (Palm et al., Front Artif Intell 2025). The AAFP Innovation Laboratory found 60% of participating family physicians adopted the solution after a 30-day trial. Deployment is growing, particularly among physicians seeking complete voice-enabled workflows.

Doximity GPT is a free, HIPAA-compliant AI assistant integrated into the Doximity platform. Unlike standalone products, Doximity GPT is already accessible to the majority of U.S. physicians who use Doximity for professional networking. The tool generates clinical documentation (SOAP notes, discharge summaries, patient education materials), assists with insurance appeals and referral letters, provides evidence-based differential diagnosis suggestions citing medical literature (PubMed, UpToDate, clinical guidelines), and supports ICD-10 coding.

Doximity acquired Pathway Medical for $63 million to integrate clinical datasets and AI capabilities (FierceHealthcare, 2025). The combined product is in beta testing. Doximity’s AI suite includes Scribe (ambient documentation), GPT (clinical assistant), and Pathway (clinical datasets). Company reports Scribe alone has 10,000+ beta testers, with 75% using it weekly (Yahoo Finance, 2025).

Unlike general-purpose chatbots (ChatGPT, Claude, Gemini), Doximity GPT is specifically designed for medical use with HIPAA security infrastructure. The platform reported first-quarter fiscal 2026 revenues of $145.9 million (up 15% year-over-year) with 55% adjusted EBITDA margin (Yahoo Finance, 2025).

  • Cost: Free for Doximity members (Scribe has tiered pricing for heavy users)
  • Strength: Zero-cost entry point, HIPAA-compliant, already integrated into existing physician workflow
  • Limitation: Requires Doximity account; Scribe features require separate subscription for high-volume use; independent clinical validation studies not yet published
  • Verdict: Most accessible AI clinical assistant for U.S. physicians; no financial barrier to trial

DeepScribe and Freed AI target smaller practices. DeepScribe offers ambient transcription for primary care and specialty clinics. Freed AI has a free tier, making it accessible for solo practitioners or small groups testing ambient documentation.

Implementation note: Ambient scribe technology can reduce documentation burden substantially. Physician satisfaction appears genuine, and time savings are measurable. Start with a trial period, expect an initial learning curve, and maintain the practice of reviewing all AI-generated content.

Implementation Considerations:

Benefits:

  • Significant time savings (1-2 hours/day documentation)
  • Improved patient eye contact
  • Reduced burnout
  • After-hours documentation reduced

Considerations:

  • Requires physician review (AI makes errors)
  • Patient consent for recording
  • Privacy/security (HIPAA-compliant vendors only)
  • Cost (ROI depends on time saved, productivity gains)
  • Learning curve (initial weeks slower as physician adapts)


Category 4: Literature Search and Synthesis

PubMed / MEDLINE (with AI enhancements)

  • Free, covers most clinical scenarios
  • New features: AI-powered search refinement (limited)
  • Verdict: Still the gold standard, but time-consuming
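
For physicians or informatics staff who want to script routine literature checks rather than click through the web interface, PubMed exposes a free public API (NCBI E-utilities). A minimal sketch, assuming the standard esearch endpoint; the query shown is an example, not a recommended search strategy.

```python
# Minimal PubMed query via NCBI E-utilities; returns matching PMIDs as strings.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def pubmed_search(term: str, retmax: int = 5) -> list[str]:
    resp = requests.get(f"{EUTILS}/esearch.fcgi", params={
        "db": "pubmed",
        "term": term,
        "retmax": retmax,
        "retmode": "json",
    }, timeout=30)
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]

# Example query; adjust the terms and field tags to your own question.
print(pubmed_search('"artificial intelligence"[Title] AND "diabetic retinopathy"'))
```

NCBI asks heavy users to register a free API key; the AI literature tools below essentially wrap this kind of search with ranking and synthesis layers.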

Consensus

  • Function: AI searches 220+ million peer-reviewed papers, compiles findings
  • Use case: Quick literature review, evidence synthesis
  • Evidence: A systematic review found limited peer-reviewed validation; researchers raised concerns about accuracy and transparency (Apata et al., Cureus 2025)
  • Cost: Free tier, paid for advanced features
  • Verdict: Useful for rapid evidence gathering; verify key findings independently

Elicit

  • Function: AI research assistant - finds papers, extracts key info from 125+ million papers
  • Use case: Literature review, research questions
  • Evidence: Validation studies show average sensitivity of 39.5% (systematic reviews require ≥90%); useful as complement, not replacement for traditional searching (Lau & Golder, Cochrane Evid Synth Methods 2025)
  • Cost: Free tier, paid plans
  • Verdict: Helpful for scoping searches; insufficient alone for systematic reviews

Scite.ai

  • Function: Citation analysis - shows how papers cite each other (supporting, contrasting); 1.5 billion citation statements indexed
  • Use case: Evaluating strength of evidence, finding contradictory studies
  • Evidence: Technical validation showed citation matching F-score of 95.4% (Nicholson et al., Quant Sci Stud 2021); however, independent evaluation found low accuracy for classifying supporting vs. contrasting citations (F-measures 0.0-0.58) (Bakker et al., Hypothesis 2023)
  • Cost: Subscription (now part of Research Solutions)
  • Verdict: Valuable for finding citation context; interpret supporting/contrasting classifications with caution

ResearchRabbit

  • Function: Literature mapping, citation networks (uses PubMed for medical sciences, Semantic Scholar for other fields)
  • Note: Acquired by Litmaps in November 2025
  • Cost: Free
  • Verdict: Excellent for exploring research landscapes; no peer-reviewed validation studies

Connected Papers

  • Function: Visual citation networks using co-citation and bibliographic coupling analysis
  • Use case: Finding related papers
  • Cost: Free tier (limited to 5 graphs/month)
  • Verdict: Great visualization tool for exploratory discovery; not suitable for systematic reviews due to coverage limitations


Category 5: Patient Communication

Patient Education

ChatGPT / GPT-4 (with extreme caution)

  • Capabilities: Generate patient education materials, explain diagnoses
  • Evidence: Large language models can produce accurate information for common medical topics; Med-PaLM achieved expert-level performance on medical licensing exams (Singhal et al., 2023)
  • Critical limitations:
    • Hallucinates (makes up plausible-sounding false information)
    • No access to patient-specific data
    • No liability/accountability
    • May generate outdated or incorrect guidance
  • Appropriate use:
    • Draft patient education materials (physician reviews/edits)
    • Simplify complex medical concepts (verify accuracy)
    • NOT for patient-specific medical advice
  • Verdict: Useful tool with physician oversight, NEVER autonomous patient advice

Google Med-PaLM 2 / MedLM

  • Type: Medical-specific LLM
  • Evidence: Better performance than GPT-4 on medical licensing exams
  • Status: Now commercially available through MedLM to allowlisted Google Cloud healthcare customers in the U.S., not approved as a medical device
  • Verdict: Limited commercial availability, requires Google Cloud approval for medical use cases

Symptom Checkers (Patient-Facing)

Ada Health

  • Function: Symptom assessment, triage guidance
  • Evidence: Comparative study showed 70.5% condition coverage rate and 97.0% safe triage; accuracy improved to 78.5% with patient input for chief complaint (Gilbert et al., BMJ Open 2020)
  • Use case: Patient triage (ED vs. urgent care vs. PCP)
  • Verdict: Triage tool, not diagnostic

Buoy Health

  • Function: Symptom checker + care navigation (developed at Harvard Innovation Laboratory)
  • Evidence: Published validation in peer-reviewed literature limited; company reports 95% safe triage recommendations
  • Partnerships: Major health systems integrating for patient navigation
  • Verdict: Promising for patient navigation; independent validation needed

K Health

  • Function: AI symptom assessment + telemedicine
  • Model: Subscription-based primary care ($12/month for unlimited doctor visits)
  • Evidence: Claims based on 2+ billion clinical data points; limited peer-reviewed validation
  • Verdict: Integrated care model; combines AI triage with physician access

Caution on Symptom Checkers:

  • Accuracy limited (patients may not describe symptoms accurately)
  • Liability unclear if patients rely on recommendations
  • Best use: Triage, not diagnosis
  • Physicians should be cautious recommending specific tools


Category 6: Specialty-Specific Tools

Cardiology

HeartFlow FFR-CT (covered above)

Caption Health (GE HealthCare)

  • FDA Clearance: Yes (510(k) K190887; acquired by GE HealthCare in February 2023)
  • Function: AI-guided cardiac ultrasound acquisition
  • Use case: Point-of-care echo by non-experts
  • Evidence: Pivotal study showed novice nurses using Caption AI achieved diagnostically equivalent images to expert sonographers with 10+ years experience (Narang et al., JAMA Cardiol 2021)
  • Verdict: Democratizes cardiac ultrasound; enables screening where sonographers unavailable

Eko Analysis

  • FDA Clearance: Yes (510(k) cleared for murmur detection)
  • Function: Digital stethoscope + AI murmur detection
  • Use case: Primary care screening for valvular heart disease
  • Evidence: EXPAND trial (n=3,456) showed 85.6% sensitivity and 84.4% specificity for detecting structural heart disease; compared to traditional auscultation sensitivity of 44.9% (Chorba et al., JACC 2021)
  • Verdict: Significant improvement over standard auscultation for VHD screening

Oncology

Tempus

  • Function: Genomic sequencing, AI-driven treatment matching, clinical data analysis
  • Use case: Precision oncology decision support
  • Evidence: Referenced in NCCN guidelines; xT assay validated for comprehensive solid tumor genomic profiling; AI platform matches patients to clinical trials (Beaubier et al., JCO Precis Oncol 2019)
  • Verdict: Leading precision oncology platform with extensive real-world data

Foundation Medicine

  • Function: Comprehensive genomic profiling (CGP)
  • Use case: Cancer treatment selection based on tumor mutations
  • Evidence: FoundationOne CDx received FDA approval (P170019) as companion diagnostic for multiple targeted therapies; analytical validation published (Frampton et al., Nat Biotechnol 2013)
  • FDA Status: FoundationOne CDx (PMA approved), FoundationOne Liquid CDx (510(k) cleared)
  • Verdict: Gold standard tumor profiling with established FDA approval

IBM Watson for Oncology (FAILED)

  • Status: DISCONTINUED after failures
  • Lesson: Marketing ≠ clinical validity

Emergency Medicine

Viz.ai suite (covered above - stroke, PE)

Epic Deterioration Index

  • Function: Patient deterioration prediction
  • Evidence: Variable - some validation, implementation challenges
  • Cost: Included with Epic
  • Verdict: Use with caution, understand limitations

Gastroenterology

Medtronic GI Genius

  • FDA Clearance: Yes (De Novo DEN200055, April 2021; first AI device for colonoscopy in US)
  • Function: Real-time AI-assisted polyp detection during colonoscopy
  • Evidence: Randomized controlled trial (n=685) showed 50% relative increase in adenoma detection rate (34.4% vs 22.7%) without increase in procedure time (Wallace et al., Gastroenterology 2022); meta-analysis of 13 RCTs (n=17,354) confirmed ADR improvement (Spadaccini et al., Gut 2024)
  • Use case: Improving adenoma detection during screening/surveillance colonoscopy
  • Verdict: Strong RCT evidence; improves polyp detection for quality colonoscopy


Hands-On: Evaluating AI Tools for Your Practice

Step 1: Identify Clinical Need

Ask:

  • What problem am I trying to solve?
  • Is this a real workflow pain point?
  • Will an AI solution improve patient outcomes, efficiency, or satisfaction?

Step 2: Evidence Review

Essential questions:

  • FDA-cleared? (Check FDA database: accessdata.fda.gov/scripts/cdrh/cfdocs/cfPMN/pmn.cfm)
  • Peer-reviewed publications? (PubMed search)
  • Prospective validation? (Not just retrospective)
  • External validation? (Multiple institutions, populations)
  • Performance in MY setting? (Demographics, EHR, workflow)
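
The clearance check can be partially scripted against openFDA, the agency’s public API mirror of its device databases. A minimal sketch, assuming openFDA’s 510(k) endpoint; the device-name query is a placeholder, and keep in mind that De Novo and PMA products (and the curated AI-enabled device list) are tracked separately.

```python
# Look up 510(k) records on openFDA. Confirm anything important on accessdata.fda.gov,
# since openFDA mirrors the official database with some lag.
import requests

def search_510k(device_name: str, limit: int = 5) -> list[dict]:
    resp = requests.get(
        "https://api.fda.gov/device/510k.json",
        params={"search": f'device_name:"{device_name}"', "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])

for rec in search_510k("computer-assisted triage"):   # placeholder search term
    print(rec.get("k_number"), rec.get("decision_date"), rec.get("applicant"))
```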

Step 3: Workflow Assessment

Integration:

  • EHR-integrated or standalone?
  • Number of clicks?
  • Time added or saved?
  • Who operates it? (Physician, MA, nurse?)

Step 4: Financial Analysis

Costs:

  • Licensing fees (annual, per-study, per-patient)
  • Hardware (servers, cameras, specialized equipment)
  • Personnel (training, IT support, clinical champions)
  • Maintenance and updates

ROI:

  • Time saved (value your time)
  • Reimbursement (CPT codes available?)
  • Quality metrics (value-based care bonuses)
  • Risk reduction (fewer malpractice claims)
  • Patient satisfaction (retention, referrals)
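
A back-of-the-envelope model makes the ROI question concrete. Every number in the sketch below is a hypothetical placeholder; substitute your own license quote, the time savings you actually measure in a pilot, and an honest estimate of what an hour of physician time is worth to your practice.

```python
# Hypothetical ambient-scribe ROI; replace every input with your own figures.
annual_license = 600 * 12           # e.g., $600 per physician per month
minutes_saved_per_day = 45          # measured during your pilot, not a vendor claim
clinic_days_per_year = 200
physician_hour_value = 150          # opportunity cost of physician time, $/hour

hours_saved = minutes_saved_per_day / 60 * clinic_days_per_year
value_of_time = hours_saved * physician_hour_value

print(f"Hours saved per year:  {hours_saved:.0f}")
print(f"Value of time saved:   ${value_of_time:,.0f}")
print(f"Annual license cost:   ${annual_license:,.0f}")
print(f"Net (time value only): ${value_of_time - annual_license:,.0f}")
# Excludes setup and training costs, and excludes upside from added visit capacity,
# reimbursement, or quality bonuses listed above.
```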

Step 5: Pilot Testing

Before full deployment:

  • Retrospective testing on YOUR data
  • Small pilot with limited users
  • Collect feedback (physician, patient, staff)
  • Measure impact (time, accuracy, satisfaction)
  • Identify failure modes

Step 6: Continuous Monitoring

Post-deployment:

  • Quarterly performance reviews
  • User feedback collection
  • False positive/negative tracking
  • Clinical outcome monitoring
  • Vendor support responsiveness
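
False positive/negative tracking does not require sophisticated tooling; quarterly counts from your own chart review are enough to detect drift. A minimal sketch with hypothetical counts:

```python
# Quarterly check of a deployed alerting tool using locally reviewed cases.
# The counts below are hypothetical placeholders.
tp, fp, fn, tn = 42, 310, 9, 4650

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)

print(f"Sensitivity {sensitivity:.0%} | Specificity {specificity:.0%} | PPV {ppv:.0%}")
print(f"Alerts reviewed per true case: {(tp + fp) / tp:.1f}")
# Compare each quarter against the tool's published validation figures; sustained
# drift is the trigger to recalibrate, retrain, or reconsider the tool.
```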


Enterprise AI Platforms for Healthcare Organizations

While consumer health AI products like ChatGPT Health target patients directly, a parallel category of enterprise healthcare AI platforms is emerging for deployment within health systems. These products offer HIPAA-compliant infrastructure, Business Associate Agreements (BAAs), and integration with clinical workflows.

OpenAI for Healthcare (January 2026)

OpenAI launched OpenAI for Healthcare in January 2026, a set of enterprise products designed for healthcare organizations, including ChatGPT for Healthcare and API access with BAA support.

Key distinction from ChatGPT Health:

| Product | Target | HIPAA Status | BAA Available |
| --- | --- | --- | --- |
| ChatGPT Health (consumer) | Patients directly | Not compliant | No |
| ChatGPT for Healthcare (enterprise) | Healthcare organizations | Supports compliance | Yes |

ChatGPT for Healthcare features:

  • Evidence retrieval with citations: Responses grounded in peer-reviewed research, clinical guidelines, and public health guidance with transparent citations including titles, journals, and publication dates
  • Institutional policy integration: Connects with enterprise tools (Microsoft SharePoint) to incorporate organization-approved policies and care pathways
  • Reusable templates: Shared templates for discharge summaries, patient instructions, prior authorization support
  • Role-based access controls: Centralized workspace with SAML SSO, SCIM user management
  • Data controls: Audit logs, customer-managed encryption keys, data residency options

HIPAA and compliance:

  • BAA available for healthcare customers through ChatGPT for Healthcare
  • Content not used for model training
  • PHI remains under organization control
  • This represents a significant shift from public ChatGPT (see HIPAA considerations)

Early hospital partners (as of January 2026):

OpenAI lists eight hospital partners, though independent verification varies:

  • Boston Children’s Hospital (verified): John Brownstein (SVP/Chief Innovation Officer) confirmed adoption of ChatGPT Team with custom OpenAI-powered solution and governance foundations
  • UCSF (verified): Chancellor’s announcement confirms ChatGPT Enterprise deployment for 9,000 users in early 2026
  • Stanford Medicine Children’s Health, AdventHealth, HCA Healthcare, Baylor Scott & White Health, Cedars-Sinai Medical Center, Memorial Sloan Kettering Cancer Center (listed by OpenAI; independent press releases not located as of this writing)

Underlying model: GPT-5.2

OpenAI for Healthcare runs on GPT-5.2 models, which OpenAI describes as “optimized for healthcare.” Performance claims cite:

  • HealthBench scores (vendor-developed benchmark; see caveats in the Evaluation chapter)
  • GDPval performance (internal OpenAI benchmark claiming 70.9% win/tie rate versus human baselines, not “better across every role” as sometimes characterized)

Clinical evidence: Penda Health study

OpenAI cites a study conducted with Penda Health in Kenya (OpenAI et al., 2025, preprint):

  • Results: 16% relative reduction in diagnostic errors, 13% reduction in treatment errors across 39,849 patient visits (15 clinics in Nairobi)
  • Critical caveats:
    • 35% of critical safety alerts were initially ignored by clinicians
    • Two patient deaths during the study were “deemed potentially preventable if AI alerts had been followed”
    • Clinicians using AI spent more time per patient (16.4 vs. 13.0 minutes median)
    • Non-randomized design; a randomized controlled trial with PATH is underway
    • Preprint, not peer-reviewed

Enterprise AI Platform Evaluation

When evaluating enterprise healthcare AI platforms:

  1. Verify BAA terms: What exactly is covered? What are the liability provisions?
  2. Understand data handling: Where is PHI stored? Who has access? Is it used for any training?
  3. Check independent validation: Vendor benchmarks (HealthBench, GDPval) are not substitutes for peer-reviewed clinical validation
  4. Assess integration requirements: What EHR/enterprise tool integration is needed?
  5. Calculate total cost of ownership: Licensing, implementation, training, ongoing support

The same due diligence applied to any clinical system applies to enterprise AI platforms.

Claude for Healthcare (Anthropic, January 2026)

Anthropic launched Claude for Healthcare in January 2026, introducing HIPAA-ready enterprise tools for healthcare providers and payers. The announcement followed OpenAI’s healthcare launch by one week and was made at the J.P. Morgan Healthcare Conference.

Key distinction from consumer Claude:

| Product | Target | HIPAA Status | BAA Available |
| --- | --- | --- | --- |
| Claude Pro/Max (consumer) | Individual users | Not compliant | No |
| Claude for Healthcare (enterprise) | Healthcare organizations | Supports compliance | Yes |

Healthcare connectors:

Claude for Healthcare includes connectors that allow Claude to pull information from industry-standard systems:

  • CMS Coverage Database: Local and National Coverage Determinations for prior authorization verification, coverage requirements, and claims appeals
  • ICD-10: Diagnosis and procedure code lookup via CMS and CDC data for medical coding and billing accuracy
  • National Provider Identifier Registry: Provider verification, credentialing, and claims validation
  • PubMed: Access to 35+ million biomedical literature citations for literature reviews and evidence retrieval

Agent skills:

  • FHIR development: Skill for building interoperable healthcare data exchanges using the HL7 FHIR standard (a minimal FHIR read request is sketched after this list)
  • Prior authorization review: Sample skill template (customizable to organization policies) for cross-referencing coverage requirements, clinical guidelines, and patient records
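
Underneath the tooling, FHIR is a REST API exchanging typed JSON resources, so an agent skill is largely scaffolding around requests like the one below. This is a minimal sketch with a placeholder server URL and patient ID; real EHR endpoints require SMART on FHIR (OAuth 2.0) authorization and appropriate PHI handling.

```python
# Minimal FHIR R4 read: fetch a single Patient resource as JSON.
# The base URL and patient ID are placeholders, not a real endpoint.
import requests

FHIR_BASE = "https://fhir.example.org/baseR4"

def read_patient(patient_id: str) -> dict:
    resp = requests.get(
        f"{FHIR_BASE}/Patient/{patient_id}",
        headers={"Accept": "application/fhir+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

patient = read_patient("example-id")
name = (patient.get("name") or [{}])[0]
print(name.get("family"), " ".join(name.get("given", [])))
```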

Use cases highlighted by Anthropic:

  1. Prior authorization: Pull coverage requirements, check clinical criteria against patient records, propose determinations with supporting materials
  2. Claims appeals: Assemble documentation from patient records, coverage policies, clinical guidelines for appeal preparation
  3. Care coordination: Triage patient portal messages, identify urgent items, track referrals and handoffs

HIPAA and compliance:

  • BAA available for Claude for Enterprise healthcare customers
  • Health data accessed through connectors not stored in Claude’s memory
  • Data not used for model training
  • PHI remains under organization control

Early adopter (verified):

Additional partners listed by Anthropic (independent confirmation not located): Stanford Healthcare, Novo Nordisk, Sanofi, AbbVie, Genmab

Performance benchmarks:

Anthropic reports Claude Opus 4.5 with extended thinking (64k tokens) achieves:

  • MedCalc-Bench: 98.1% accuracy on medical calculations (Anthropic-reported). MedCalc-Bench is an NIH-developed benchmark testing LLMs across 55 medical calculator tasks (CHA2DS2-VASc, GFR equations, risk scores, dosing calculations) with 1,000+ patient scenarios (Khandekar et al., NeurIPS 2024). Context: Original benchmark research found GPT-4 achieved 50.9% accuracy, suggesting substantial improvement if Claude’s 98.1% is independently validated. See Medical Calculation Limitations for clinical implications, and the worked example after this list
  • MedAgentBench: 91.4% on Stanford’s medical agent benchmark. MedAgentBench tests LLM agent capabilities across 300 clinically-derived tasks in a realistic EHR environment (Schmidgall et al., NEJM AI 2025)
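
Benchmarks like MedCalc-Bench matter because clinical scores are deterministic: the inputs fully determine the answer, so any arithmetic slip is an outright error. As an illustration of the task type (not a suggestion to hand-roll risk scores for patient care, where validated calculators should be used), a minimal CHA2DS2-VASc implementation looks like this:

```python
# Illustrative CHA2DS2-VASc calculator: the kind of deterministic scoring task
# MedCalc-Bench poses to language models.

def cha2ds2_vasc(age: int, female: bool, chf: bool, hypertension: bool,
                 diabetes: bool, stroke_or_tia: bool, vascular_disease: bool) -> int:
    score = 2 if age >= 75 else (1 if age >= 65 else 0)
    score += 1 if female else 0
    score += 1 if chf else 0
    score += 1 if hypertension else 0
    score += 1 if diabetes else 0
    score += 2 if stroke_or_tia else 0
    score += 1 if vascular_disease else 0
    return score

# Example: a 72-year-old woman with hypertension and diabetes scores 4.
print(cha2ds2_vasc(age=72, female=True, chf=False, hypertension=True,
                   diabetes=True, stroke_or_tia=False, vascular_disease=False))
```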

Life sciences features:

Anthropic also expanded Claude for Life Sciences with connectors to Medidata (clinical trial data), ClinicalTrials.gov, bioRxiv/medRxiv, ChEMBL, and Open Targets. These features target pharmaceutical R&D and clinical trial operations rather than clinical care delivery.

API Access with BAA

Organizations building custom healthcare AI applications can access OpenAI’s API with BAA coverage:

  • API access to GPT-5.2 models
  • Eligible customers can apply for BAA through OpenAI
  • Used by ambient documentation companies (Ambience, EliseAI; note that Abridge uses primarily proprietary models)
  • Enterprise API customers contact account teams for access

Consumer Health AI: What Your Patients Are Using

Beyond clinical AI tools that physicians deploy, a growing category of consumer health AI products is reaching patients directly. Understanding these tools helps physicians contextualize patient-reported AI recommendations and have informed conversations about their use.

ChatGPT Health (OpenAI, January 2026)

OpenAI launched ChatGPT Health as a dedicated health experience within ChatGPT, representing a significant expansion of consumer health AI from the world’s largest AI company.

What it does:

ChatGPT Health allows users to connect their health information and wellness apps to ChatGPT for personalized health conversations. OpenAI reports that health is already one of the most common ways people use ChatGPT, with over 230 million people globally asking health-related questions weekly.

Key features:

  • Medical record integration: Connection to medical records via b.well’s health data network (U.S. only; supports Epic, Cerner, Meditech EHRs)
  • Wellness app integrations: Apple Health (iOS required), Function, MyFitnessPal, Weight Watchers, AllTrails, Instacart, Peloton
  • Personalized responses: Conversations grounded in user’s connected health data
  • Separate health space: Health conversations isolated from regular ChatGPT chats with separate memories
  • Health insurance navigation: OpenAI reports 1.6-1.9 million health insurance questions weekly (plan comparisons, claims, billing, coverage queries)

Development approach:

OpenAI collaborated with 262 physicians across 60 countries and 26 medical specialties during development. The company created HealthBench, an open-source evaluation framework with 5,000 multi-turn health conversations and 48,562 rubric criteria developed with physician input (OpenAI et al., 2025, preprint). See AI Evaluation and Validation for detailed analysis of HealthBench methodology.

Privacy and security:

  • Health conversations not used for model training (by default)
  • Purpose-built encryption and isolation for health data
  • Conversations stored separately from non-health chats
  • Multi-factor authentication recommended
  • Important: OpenAI explicitly states ChatGPT Health is not HIPAA compliant, as it operates outside the covered entity framework

Clinical considerations:

| Aspect | Details |
| --- | --- |
| FDA status | Not FDA-cleared; positioned as wellness/information tool, not medical device |
| HIPAA | Not compliant; consumer product outside covered entity framework |
| Availability | U.S. medical record integration only; not available in the EEA, Switzerland, or UK |
| Target users | Consumers, not healthcare providers |

For Physicians: Patient Conversations About ChatGPT Health

Patients may present with AI-interpreted lab results, health recommendations, or questions derived from ChatGPT Health conversations. Consider:

  1. Ask about AI tool usage: “Have you looked up anything about this online or with AI tools?”
  2. Review source material: If patient shares AI-generated interpretation, compare to actual clinical data
  3. Provide context: Explain limitations of consumer AI tools for medical decision-making
  4. Document appropriately: Note when clinical discussion addresses patient-generated AI content
  5. Stay current: ChatGPT Health and similar tools will evolve; understand what patients are using

The goal is not to dismiss patient engagement with health AI, but to ensure clinical decisions are based on appropriate medical evaluation.

Independent research context:

A study published in JAMA Network Open (January 2026) found that LLMs systematically mishandle probabilistic risk communication in medical contexts. When LLMs defined risk terms numerically, they drifted significantly from medical standards: “rare” meant up to 4% (vs. medical standard of 0.1%), “common” meant up to 36% (vs. standard of 10%) (Jackson et al., 2026). Physicians should be aware that patients may receive risk information that differs from clinical conventions.

Claude Health Integrations (Anthropic, January 2026)

Anthropic introduced consumer health data integrations for Claude Pro and Max subscribers alongside its enterprise healthcare launch (Anthropic, January 2026).

Health data connections (U.S. only, beta):

  • Apple Health: iOS integration for fitness and health metrics (rolling out)
  • Android Health Connect: Android equivalent for health data access (rolling out)
  • HealthEx: Lab results and health records connector (available in beta)
  • Function: Health data integration (available in beta)

What users can do:

  • Summarize medical history from connected records
  • Explain lab results in plain language
  • Detect patterns across fitness and health metrics
  • Prepare questions for medical appointments

Privacy protections:

  • Users must explicitly opt in to enable access
  • Users can disconnect or edit permissions at any time
  • Health data excluded from Claude’s memory
  • Data not used for model training

Clinical considerations:

| Aspect | Details |
| --- | --- |
| FDA status | Not FDA-cleared; wellness/information positioning |
| HIPAA | Not compliant; consumer product |
| Target users | Consumers (Pro/Max subscribers), not providers |

As with ChatGPT Health, patients may present with Claude-interpreted health information. The same guidance applies: ask about AI tool usage, review source material, and ensure clinical decisions are based on appropriate medical evaluation.

Other Consumer Health AI Products

Apple Health AI features: Integrated health insights within Apple’s ecosystem, leveraging Apple Watch and iPhone sensor data.

Google Health initiatives: Various consumer health tools including AI-powered features in Google Fit and experimental health applications.

Symptom checker apps: Babylon Health, Ada Health, K Health, and others provide AI-driven symptom assessment with varying levels of validation.

Direct-to-consumer lab interpretation: Services that apply AI to interpret lab results outside clinical context.

Consumer AI vs. Clinical AI: Key Distinctions

| Consumer Health AI | Clinical AI |
| --- | --- |
| Direct to patients | Deployed by healthcare systems |
| Not HIPAA compliant | HIPAA compliant (with BAA) |
| Wellness positioning | Medical device regulation |
| Self-reported symptoms | Clinical data integration |
| No physician oversight | Physician review required |

Consumer health AI fills a different niche than clinical AI. Understanding both helps physicians navigate conversations with patients who use these tools.


Red Flags: When to Avoid AI Tools

No FDA clearance for diagnostic applications (wellness/CDS exceptions)

No peer-reviewed publications (only vendor whitepapers)

No external validation (only tested at vendor institution)

Vendor refuses to share performance data (lack of transparency)

Claims that seem too good to be true (“99.9% accuracy,” “replaces physicians”)

Unclear data use policies (who owns data, how is it used)

Poor customer references (other physicians had negative experiences)

Overly complex integration (requires major workflow changes)

No clear clinical value proposition (solution looking for problem)


Resources for Staying Current

FDA AI Device Database: https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-enabled-medical-devices

Medical AI Research:

  • npj Digital Medicine (Nature)
  • The Lancet Digital Health
  • JAMA Network Open (AI sections)
  • Radiology: Artificial Intelligence

Professional Organizations:

  • Society for Imaging Informatics in Medicine (SIIM)
  • American Medical Informatics Association (AMIA)
  • Radiological Society of North America (RSNA) AI sessions

Conferences:

  • RSNA (annual AI showcase)
  • HIMSS (health IT focus)
  • ML4H (Machine Learning for Health - NeurIPS workshop)

AI Safety Evaluation Tools:

  • Petri (Anthropic, open-source, 2025): Automated behavioral audit tool for evaluating LLM safety characteristics including sycophancy, deception, power-seeking, and crisis handling. Uses multi-turn conversations with simulated users to probe model behavior across diverse scenarios. Useful for comparing safety profiles across different LLMs. Note: Developed by Anthropic; interpret results with awareness that tool design may favor Claude models (Anthropic Research, 2025)
  • Bloom (Anthropic, open-source, 2025): Complementary tool that generates in-depth evaluation suites for specific behaviors, quantifying severity and frequency. Benchmarks available for delusional sycophancy, self-preservation, and self-preferential bias (Anthropic Research, 2025)

The Clinical Bottom Line

Key Takeaways
  1. Prioritize evidence: FDA clearance, peer-reviewed validation, prospective studies

  2. Start with proven applications: Diabetic retinopathy screening, ambient documentation, specific radiology tasks

  3. Evaluate for YOUR setting: External validation data, your patient population, your workflow

  4. Calculate real ROI: Time savings, quality metrics, reimbursement, risk reduction

  5. Pilot before full deployment: Test on your data, collect feedback, identify failures

  6. Avoid red flags: No evidence, no transparency, too-good-to-be-true claims

  7. Continuous monitoring essential: Performance can drift, vigilance required

  8. Patient communication matters: Transparency, consent, addressing concerns

  9. You remain responsible: AI is a tool; liability stays with the physician

  10. Field evolving rapidly: Stay current, re-evaluate tools regularly

Next Chapter: We’ll dive deep into Large Language Models (ChatGPT, GPT-4, Med-PaLM) for clinical practice: capabilities, limitations, and safe usage guidelines.