Cardiology and Cardiothoracic Surgery
Cardiology has used AI longer than most physicians realize. Automated ECG interpretation algorithms have analyzed billions of heartbeats since the 1990s. Today’s AI can detect hidden patterns in normal-appearing ECGs: low ejection fraction, hyperkalemia, even biological age. But alongside these validated tools, unproven heart failure prediction models generate 80% false positives, and smartwatch AFib detection creates clinical dilemmas that evidence-based guidelines don’t address. This chapter separates what actually works from what doesn’t in cardiovascular AI.
After reading this chapter, you will be able to:
- Evaluate AI systems for ECG interpretation and arrhythmia detection, including FDA-cleared algorithms and emerging applications
- Critically assess AI applications in echocardiography, cardiac MRI, and coronary CT angiography
- Understand heart failure prediction models and their clinical limitations, including false positive rates
- Analyze wearable device AI for atrial fibrillation detection and cardiovascular monitoring
- Recognize major failures in cardiovascular AI, including IBM Watson’s cardiology applications
- Apply evidence-based frameworks for evaluating cardiology AI tools before clinical adoption
- Navigate medico-legal implications of AI-assisted cardiovascular decision-making
Introduction: Cardiovascular AI’s Promise and Pitfalls
Cardiology generates more structured data than perhaps any other medical specialty. Every heartbeat produces electrical signals. Every cardiac cycle can be imaged with ultrasound, MRI, or CT. Decades of epidemiologic studies have linked cardiovascular biomarkers to outcomes in millions of patients.
This data richness makes cardiology theoretically ideal for AI applications. And indeed, AI has been used in cardiology longer than most physicians realize. Automated ECG interpretation algorithms have been FDA-cleared since the 1990s, analyzing billions of ECGs over three decades.
But the history of cardiovascular AI includes spectacular failures alongside well-validated successes. IBM Watson’s cardiology applications produced unsafe recommendations. Proprietary heart failure prediction models achieve impressive AUC scores while generating 80% false positives. Smartwatch AFib detection creates clinical management dilemmas that evidence-based guidelines don’t address.
This chapter examines what actually works, what has failed, and how to evaluate cardiovascular AI tools critically before clinical adoption.
Part 1: ECG Interpretation AI
The 30-Year History You Didn’t Know About
If you’ve ordered an ECG in the past two decades, AI has already interpreted it. The automated interpretation printed at the top of every ECG (“Sinus rhythm,” “Acute anterior STEMI,” “Left ventricular hypertrophy”) comes from algorithms developed in the 1980s-1990s and refined over millions of ECGs.
These aren’t new AI tools. They’re three-decade-old expert systems and pattern recognition algorithms that have become so ubiquitous that we forget they’re algorithmic at all.
Performance of traditional ECG algorithms:
- STEMI detection: sensitivity 80-90%, specificity 95-98% (Willems et al., 2009)
- Atrial fibrillation: sensitivity >95%, specificity >98%
- Left ventricular hypertrophy: sensitivity 60-70%, specificity 85-95%
These algorithms work. They’re validated. Guidelines from the American College of Cardiology support their use. They’ve become standard of care.
But they have limitations:
- Sensitivity-specificity tradeoffs: STEMI algorithms optimized for sensitivity (to avoid missing MIs) produce false positives that experienced clinicians routinely override
- Population-specific performance: algorithms trained on predominantly white populations show worse performance for LVH detection in Black patients (Sokolow-Lyon criteria)
- Over-reading and under-reading: automated interpretations sometimes flag “abnormal ECG” for clinically insignificant findings while missing subtle ST changes that experienced cardiologists detect
The clinical lesson: Even 30-year-old, well-validated ECG AI requires physician review. Blindly accepting automated interpretations causes errors.
Implementation Reality: Why Accurate Algorithms Still Miss STEMIs
ECG algorithms achieve >90% sensitivity for STEMI detection. So why do hospitals still miss STEMIs?
The implementation failures:
1. Alert fatigue: false positive STEMI alerts (especially in patients with old infarcts, LBBB, or LVH) cause providers to ignore or delay response to true positives
2. EHR integration problems: STEMI alerts buried in EHR notifications alongside medication warnings and “patient census updated” messages
3. Workflow design failures: ECG interpretations printed on paper that doesn’t trigger emergency response protocols
4. Over-reliance on AI: providers skip careful ECG review because “the computer would have caught it”

A 2018 study of STEMI detection in 12 U.S. hospitals found (Khera et al., 2018):
- 23% of STEMIs were missed initially despite the ECG algorithm correctly identifying them
- Median delay to the catheterization lab was 47 minutes longer for algorithm-detected but clinician-missed STEMIs
- Cause: clinicians didn’t see or act on algorithm alerts due to poor integration
The lesson: Implementation > accuracy. A 95% accurate algorithm is worthless if clinicians don’t see or trust its alerts.
Part 2: Cardiac Imaging AI
Echocardiography: Where AI Actually Helps
Echocardiography is operator-dependent, time-consuming, and plagued by inter-observer variability. EF measurements by different sonographers on the same patient can vary by ±15%.
AI-assisted echocardiography addresses these problems:
FDA-cleared automated echo analysis systems:
- Caption Health (acquired by GE): autonomous EF calculation from parasternal and apical views, FDA-cleared in 2020
- Ultromics (UK): automated strain analysis for coronary artery disease detection
- Bay Labs Echo IQ: autonomous EF measurement and view optimization

Performance:
- Inter-observer variability for EF measurement reduced from ±15% to ±5% (Omar et al., 2023)
- Time savings: 5-10 minutes per study for standard views and measurements
- Accuracy: correlation with expert cardiologist readings r = 0.93-0.96

What AI does well in echo:
1. Automated endocardial border detection: traces the LV cavity more consistently than manual tracing
2. View optimization: guides the sonographer to acquire standard views correctly
3. Quantitative measurements: chamber volumes, wall thickness, and valve areas measured more reproducibly than with manual calipers
4. Strain analysis: automated global longitudinal strain calculation (time-consuming when done manually)

What AI doesn’t do well yet:
- Complex valve pathology: AI struggles with multiple jets, eccentric regurgitation, and prosthetic valves
- Technically difficult studies: poor acoustic windows, obesity, COPD (AI can’t compensate for fundamentally inadequate images)
- Novel findings: AI detects what it was trained to detect; it won’t identify rare pathology
Current clinical use: Growing adoption in community hospitals and primary care clinics, where access to expert echo readers is limited. Academic medical centers use AI for efficiency (automated measurements) but rely on cardiologist over-reads for complex cases.
Equity concerns: Most echo AI systems trained predominantly on white populations. Performance in Black patients (who have higher rates of hypertensive heart disease with different remodeling patterns) not well-studied.
Cardiac MRI and CT: Technical Excellence, Clinical Validation Pending
AI for cardiac MRI and CT shows impressive technical performance but lacks the decades of clinical validation that ECG algorithms have.
Applications:
- Automated segmentation: LV/RV/atrial volume calculation from cine MRI
- Perfusion defect detection: stress MRI ischemia analysis
- Coronary CT angiography analysis: stenosis grading, plaque characterization, FFR-CT
- Calcium scoring: automated Agatston score calculation

Performance: technical accuracy rivals expert readers (correlation r = 0.90-0.95), but several problems remain:
1. No outcome studies: do these algorithms improve patient outcomes? Unknown.
2. Vendor lock-in: most algorithms are proprietary, embedded in scanner software, and can’t be independently validated
3. Overdiagnosis risk: highly sensitive algorithms may detect “abnormalities” of uncertain clinical significance
4. Cost: CT-FFR costs $1,500-2,000 per study; its clinical benefit over standard CCTA is uncertain
Clinical bottom line: Use AI-assisted cardiac MRI/CT for efficiency (automated measurements save radiologist time), but don’t change clinical management based on AI findings without expert review.
Part 3: Heart Failure Prediction and the 80% False Positive Problem
Heart failure readmission prediction is a classic AI overpromise story.
The pitch: “Our proprietary machine learning algorithm predicts 30-day HF readmission with AUC 0.85! Identify high-risk patients for intensive case management!”
The reality: An AUC of 0.85 sounds impressive. But at 20% HF readmission prevalence, achieving clinically useful sensitivity (e.g., 80% to catch most readmissions) means that roughly 75-80% of flagged patients will be false positives.
The math:
- Population: 1,000 HF discharges
- Actual readmissions: 200 (20% rate)
- Algorithm at 80% sensitivity: detects 160/200 true positives
- But it also flags 600 false positives (from the 800 patients who won’t be readmitted)
- Result: 760 patients flagged, of whom 160 (21%) actually readmit

Why this matters: Intensive case management costs $500-1,000 per patient. Applying it to all 760 flagged patients, only 160 of whom will actually readmit, costs $380,000-760,000 per year. Many of those readmissions are unpreventable (sudden cardiac death, acute MI, etc.).
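A minimal Python sketch of this arithmetic, using only the illustrative numbers above (the counts and cost range are the chapter’s worked example, not any vendor’s figures):

```python
# Back-of-envelope check of the HF readmission example above.
discharges = 1_000
prevalence = 0.20            # 30-day HF readmission rate
sensitivity = 0.80           # fraction of true readmissions the model flags
false_positives = 600        # flagged patients who will not readmit (per the text)

true_readmissions = int(discharges * prevalence)        # 200
true_positives = int(true_readmissions * sensitivity)   # 160
flagged = true_positives + false_positives              # 760
ppv = true_positives / flagged                          # ~0.21

low, high = 500, 1_000       # intensive case management cost per patient ($)
print(f"Flagged: {flagged}; will actually readmit: {true_positives} (PPV {ppv:.0%})")
print(f"Program cost: ${flagged * low:,} to ${flagged * high:,} per year")
```

Swapping in your own discharge volume, readmission rate, and the vendor’s claimed operating point makes it easy to see whether the flagged list is actionable before any contract is signed.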
What are these algorithms actually detecting? Angraal et al. (2020) analyzed HF readmission prediction models and found that most achieve their AUC through demographic and comorbidity proxies: age, CKD, COPD, and prior admissions.
In other words, the algorithm isn’t discovering novel insights. It’s learning that 85-year-old patients with CKD stage 4, COPD, and three prior HF admissions are high-risk. You didn’t need machine learning to know that.
Are any HF prediction models clinically useful? CardioMEMS (an implantable pulmonary artery pressure sensor) reduced HF hospitalizations by 37% in a randomized trial (Abraham et al., 2016). But this is a device that enables early intervention based on hemodynamic data, not a prediction algorithm based on EHR data.
Clinical bottom line: Be skeptical of HF readmission prediction algorithms. Ask:
1. “What is the false positive rate at the sensitivity threshold you recommend?”
2. “What interventions will we apply to algorithm-flagged patients, and what’s the evidence those interventions prevent readmissions?”
3. “How does this algorithm perform in our specific patient population?” (Most are validated only in their development cohort.)
Part 4: Wearable Device AI and the Asymptomatic AFib Dilemma
Apple Heart Study: 419,297 Participants, 84% PPV, Massive Clinical Uncertainty
The Apple Heart Study (Perez et al., 2019) was the largest prospective study of wearable AFib detection.
Study design:
- 419,297 participants wore an Apple Watch with photoplethysmography (PPG)-based irregular pulse detection
- When the algorithm detected an irregular pulse, the participant received an ECG patch to confirm AFib
- Primary outcome: PPV of the algorithm (what percentage of alerts represented true AFib)
Results:
- 2,161 participants (0.52%) received irregular pulse notifications
- 450 of them returned ECG patches
- 153 of those patches (34%) showed AFib
- The PPV of an individual irregular-pulse notification, judged against simultaneous ECG patch recordings, was 84% (better than expected for a screening test)

But here’s the clinical problem:
- Even among the 0.52% who were notified, only about a third of those who wore a confirmatory patch had AFib documented
- Most were asymptomatic
- Most had paroxysmal AFib (brief episodes)
- Clinical question: should asymptomatic paroxysmal AFib detected by a smartwatch be treated with anticoagulation?
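To keep the denominators straight, here is a small sketch of the study funnel using the figures quoted above; the final percentage is a lower bound, since most notified participants never returned a patch:

```python
# Apple Heart Study funnel, using the figures quoted above.
enrolled = 419_297
notified = 2_161          # irregular-pulse notifications
patches_returned = 450
afib_on_patch = 153

print(f"Notified: {notified / enrolled:.2%} of participants")               # ~0.52%
print(f"AFib on returned patches: {afib_on_patch / patches_returned:.0%}")  # ~34%
# Lower bound only: most notified participants never returned a patch.
print(f"Patch-confirmed AFib: {afib_on_patch / enrolled:.3%} of all participants")
```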
CHA2DS2-VASc doesn’t answer this: the score was developed for AFib detected clinically or on ECG, not for asymptomatic device-detected episodes. Stroke risk for smartwatch-detected paroxysmal AFib is uncertain.
Ongoing trials:
- HEARTLINE: Apple Watch AFib detection for stroke prevention (results pending)
- GUARD-AF: impact of early detection on outcomes
Current clinical management: No consensus. Some cardiologists anticoagulate all AFib regardless of how detected. Others require symptoms or prolonged episodes (>24 hours). Many patients end up in a clinical gray zone.
The lesson: Technology often runs ahead of evidence. We can detect things we don’t know how to manage.
Part 5: The IBM Watson Cardiology Disaster
What Went Wrong With Watson for Oncology (and Its Cardiology Applications)
IBM Watson for Oncology promised “AI-powered treatment recommendations” based on analysis of medical literature and clinical guidelines. It was deployed in oncology, but IBM also developed Watson applications for cardiology and other specialties.
What happened:
- Watson produced unsafe treatment recommendations contradicting evidence-based guidelines (Ross and Swetlitz, 2018)
- It recommended chemotherapy for patients unlikely to benefit
- It suggested medications with dangerous drug interactions
- It was never validated in randomized controlled trials
- No peer-reviewed evidence of clinical benefit was ever published
Why did hospitals buy it? Aggressive marketing, partnerships with major academic centers (Memorial Sloan Kettering), and promises of “AI-assisted decision-making” that sounded impressive to hospital executives.
Why did it fail?
1. Training data problem: Watson was trained on expert preferences (what MSK oncologists recommended), not evidence (what RCTs showed worked)
2. No clinical validation: deployed without prospective trials showing benefit
3. Black box: physicians couldn’t understand why Watson made its recommendations, eroding trust
4. Overpromising: marketed as “thinking like a doctor” when it was really echoing MSK treatment patterns

Watson cardiology applications: IBM developed Watson-based tools for:
- HF treatment optimization
- Cardiovascular risk prediction
- Medication management in complex cardiac patients
None were validated in prospective trials. All were withdrawn by 2019 after oncology failures.
The lessons for cardiovascular AI:
1. Demand RCT evidence: if a vendor can’t show published outcome studies, don’t deploy their tool
2. Beware proprietary algorithms: black boxes hide methodological flaws
3. Marketing ≠ evidence: partnerships with prestigious institutions don’t prove clinical benefit
4. Physician judgment remains essential: no AI should make autonomous treatment recommendations
Part 6: Equity in Cardiovascular AI
The Pulse Oximetry Problem Extends to ECG and Echo
In 2020, researchers discovered that pulse oximeters systematically overestimate oxygen saturation in Black patients, leading to delayed treatment for hypoxemia (Sjoding et al., 2020).
Similar equity problems exist in cardiovascular AI:
ECG algorithms:
- Sokolow-Lyon criteria for LVH (embedded in automated ECG algorithms) have lower sensitivity in Black patients
- QTc prolongation thresholds don’t account for race-specific differences in baseline QT intervals
- Most ECG AI is trained on predominantly white populations from tertiary care centers

Echo AI:
- Black patients have different LV remodeling patterns (more concentric hypertrophy vs. eccentric dilatation)
- AI trained on predominantly white populations may miscategorize Black patients’ LV geometry
- Race-specific performance metrics are rarely reported in FDA submissions

Smartwatch AFib detection:
- PPG accuracy varies with skin tone (melanin affects light absorption)
- In the Apple Heart Study, 88% of participants were white; performance in other populations is uncertain

What can you do?
1. Ask vendors for race-stratified performance metrics: if they can’t provide them, the algorithm wasn’t validated equitably
2. Validate locally: test algorithm performance in your patient population before widespread deployment
3. Monitor outcomes by race: track algorithm errors, false positives, and false negatives by race/ethnicity
4. Maintain clinical skepticism: AI is a tool, not truth. Your clinical judgment remains essential.
Part 7: Implementation Framework
Before Adopting Cardiovascular AI Tools
Questions to ask vendors:
- “Where is the peer-reviewed publication showing this algorithm improves patient outcomes?”
- If the answer is “we have internal validation data,” that’s insufficient
- Demand New England Journal of Medicine, JAMA, Circulation, Lancet publications
- “What is the algorithm’s performance stratified by race, age, and sex?”
- If vendor doesn’t have this data, the algorithm wasn’t validated equitably
- Don’t accept overall performance metrics
- “What is the false positive rate at your recommended sensitivity threshold?”
- AUC alone is meaningless
- Need to understand PPV/NPV at clinically relevant operating points
- “How does the algorithm integrate with our EHR and clinical workflow?”
- Poor integration causes alert fatigue and missed diagnoses
- Demand live demonstrations in your specific EHR environment
- “What happens when the algorithm fails? What are the failure modes?”
- All algorithms fail sometimes
- You need to understand when and how to recognize failures
- “Can we validate this algorithm on our patient population before deployment?”
- Local validation is essential
- Algorithm trained at Mayo Clinic may not work at your community hospital
- “What is the cost per analysis, and what’s the evidence of cost-effectiveness?”
- CT-FFR costs $1,500-2,000 per study
- Has it been shown to improve outcomes or reduce costs compared to standard care?
- “Who is liable if the algorithm produces an incorrect result that harms a patient?”
- Most vendor contracts disclaim liability
- Liability falls on the ordering physician
- “Can you provide references from cardiologists at hospitals similar to ours who use this tool?”
- Talk to actual users, not marketing testimonials
- Ask about problems, workflow disruptions, false positives
- “Is the algorithm FDA-cleared? If yes, through what pathway?”
- 510(k) clearance ≠ clinical validation
- 510(k) requires only “substantial equivalence” to existing device
Red Flags (Walk Away If You See These)
- Vendor refuses to share peer-reviewed publications (“Our algorithm is proprietary”)
- No external validation studies (validated only on development cohort)
- Performance metrics not stratified by demographics (equity not assessed)
- Black box with no explainability (can’t understand why algorithm made recommendation)
- Vendor claims algorithm is “better than cardiologists” without RCT evidence
Part 8: Cost-Benefit Reality
What Does Cardiovascular AI Actually Cost?
ECG AI:
- Most automated ECG interpretation: included in the ECG machine purchase (no marginal cost)
- Mayo Clinic AI-ECG for low EF screening: not yet commercially available (research tool)

Echo AI:
- Caption Health: ~$1,000/month subscription plus per-study fees
- Ultromics: ~$50-100 per study
- Value proposition: saves sonographer/cardiologist time, reduces variability

Cardiac MRI/CT AI:
- CT-FFR (HeartFlow): $1,500-2,000 per study
- Automated MRI segmentation: bundled into scanner software

Wearable AFib detection:
- Apple Watch: $400-800 (consumer device, not a medical expense)
- ECG patch for confirmation: $150-300

HF prediction algorithms:
- Mostly proprietary, bundled into population health contracts
- Cost-effectiveness difficult to assess without published studies
Do These Tools Save Money?
Theoretically, yes:
- Earlier detection of low EF → initiate GDMT → prevent HF progression → fewer hospitalizations
- Automated echo measurements → save cardiologist time → increase throughput

In practice, uncertain:
- No published cost-effectiveness analyses exist for most cardiovascular AI tools
- Mayo Clinic AI-ECG: no data on whether earlier EF detection reduces downstream costs
- CT-FFR: cost-effectiveness vs. invasive FFR has been shown, but vs. standard care it remains unclear (Hlatky et al., 2015)

The implementation cost no one talks about:
- IT integration: $50,000-200,000 depending on complexity
- Workflow redesign: cardiologist/administrator time
- Training: sonographer/tech/physician education
- Maintenance: software updates, troubleshooting
- Alert management: triaging false positives

A realistic scenario:
- A hospital purchases echo AI for $50,000/year
- It saves 10 minutes per study × 5,000 studies/year = 833 hours
- At a $200/hour cardiologist cost, that is roughly $166,600 saved
- Assuming automation doesn’t reduce quality or increase errors
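A minimal sketch of that break-even arithmetic; the license cost, time savings, and hourly rate are the scenario’s assumptions, not real pricing:

```python
# Break-even sketch for the hypothetical echo-AI purchase described above.
annual_license = 50_000            # $/year (scenario assumption)
minutes_saved_per_study = 10
studies_per_year = 5_000
cardiologist_cost_per_hour = 200   # $/hour (scenario assumption)

hours_saved = minutes_saved_per_study * studies_per_year / 60
gross_savings = hours_saved * cardiologist_cost_per_hour
net_savings = gross_savings - annual_license

print(f"Hours saved: {hours_saved:.0f}")
print(f"Gross savings: ${gross_savings:,.0f}; net of license: ${net_savings:,.0f}")
# Assumes the freed time is redeployed productively and that automation
# does not reduce quality, add rework, or increase error rates.
```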
The question: Are those assumptions valid? We don’t have good data.
Part 9: The Future of Cardiovascular AI
What’s Coming in the Next 5 Years
Likely to reach clinical use:
1. Expanded ECG AI applications: detection of pulmonary hypertension, aortic stenosis, and HCM from the ECG
2. Wearable integration with the EHR: smartwatch data flowing into medical records with clinical decision support
3. Automated echo AI in primary care: point-of-care echo by non-cardiologists with AI guidance
4. Predictive models for sudden cardiac death: risk stratification for ICD placement beyond EF alone

Promising but uncertain:
1. AI-guided medication optimization: automated titration of GDMT in HF
2. Real-time procedural guidance: AI-assisted PCI, ablation, and structural interventions
3. Precision medicine for CAD: genetic + imaging + clinical data to personalize revascularization decisions

Overhyped and unlikely:
1. Autonomous cardiovascular diagnosis: AI replacing cardiologist clinical judgment
2. Smartwatch-only AFib management: anticoagulation decisions without ECG confirmation
The rate-limiting step is not algorithmic accuracy; it is prospective randomized trials showing improved outcomes.
Most cardiovascular AI has impressive technical performance. What we lack is evidence that deploying these tools actually helps patients live longer or better.
Professional Society Guidelines on AI in Cardiology
The 2023 ESC and 2025 ACC/AHA guidelines have not yet provided specific recommendations for clinical use of artificial intelligence, highlighting a significant evidence gap. However, recent guidelines acknowledge AI’s emerging role:
2024 ACC/AHA Perioperative Guidelines: “Incorporation of artificial intelligence and machine-learning may improve risk assessment, but future studies are needed to evaluate risk-reduction strategies.”
Key Observations from Recent Guidelines:
- Prospective RCTs are needed to confirm AI’s efficacy and cost-effectiveness
- Deep learning algorithms show promise in outperforming clinicians and conventional software for ECG diagnosis
- AI enables accurate quantitative and qualitative plaque evaluation with coronary CT angiography and OCT
- Multicenter validation and standardization are essential before guideline integration
AHA Scientific Sessions AI Highlights (2024)
At AHA 2024, AI in cardiology featured prominently:
AI-ECHO and PanEcho Studies:
- Machine learning algorithms trained on millions of echocardiographic images
- Promise for automating and improving diagnostic accuracy
- Potential to streamline imaging for large patient populations
Endorsed Risk Calculators
The ACC/AHA endorse several validated risk calculators that incorporate statistical modeling:
- ASCVD Risk Estimator Plus: 10-year and lifetime cardiovascular risk
- Pooled Cohort Equations: Primary prevention statin therapy decisions
- CHA2DS2-VASc: Stroke risk in atrial fibrillation
- HAS-BLED: Bleeding risk with anticoagulation
These represent validated, guideline-integrated predictive tools that precede modern AI but establish the framework for algorithmic clinical decision support.
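One reason these endorsed tools are easy to trust is that they are fully transparent; the entire CHA2DS2-VASc calculation fits in a few lines. A minimal sketch using the standard point assignments (the function and argument names are illustrative, not from any library):

```python
def cha2ds2_vasc(age: int, female: bool, chf: bool, hypertension: bool,
                 diabetes: bool, stroke_or_tia: bool, vascular_disease: bool) -> int:
    """Standard CHA2DS2-VASc point assignments."""
    score = 0
    score += 1 if chf else 0                               # C: congestive heart failure
    score += 1 if hypertension else 0                      # H: hypertension
    score += 2 if age >= 75 else (1 if age >= 65 else 0)   # A2 / A: age bands
    score += 1 if diabetes else 0                          # D: diabetes mellitus
    score += 2 if stroke_or_tia else 0                     # S2: prior stroke/TIA/thromboembolism
    score += 1 if vascular_disease else 0                  # V: vascular disease
    score += 1 if female else 0                            # Sc: sex category (female)
    return score

# Example: a 52-year-old man whose only risk factor is hypertension scores 1.
print(cha2ds2_vasc(age=52, female=False, chf=False, hypertension=True,
                   diabetes=False, stroke_or_tia=False, vascular_disease=False))
```

Contrast this with a proprietary readmission model: every input and weight is visible, so clinicians can audit exactly why a patient received a given score.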
European Society of Cardiology (ESC)
ESC has engaged with AI through:
- Digital Health Committee guidance on AI validation
- Position papers on wearable device data integration
- Framework for evaluating AI-enhanced diagnostic tools
Implementation Principle: ESC emphasizes that AI tools must demonstrate clinical utility beyond improved accuracy metrics, including impact on patient outcomes, workflow efficiency, and cost-effectiveness.
Heart Rhythm Society (HRS)
HRS has addressed AI in the context of:
- Automated ECG interpretation algorithms
- Wearable device-detected arrhythmias
- AI-assisted electrophysiology mapping
Clinical Guidance: HRS notes that AI-detected arrhythmias from consumer devices require clinical confirmation and that management pathways for AI-flagged findings remain under development.
Key Takeaways
10 Principles for Cardiovascular AI
1. ECG algorithms work, but require physician review: decades of validation, but false positives and equity gaps exist
2. Hidden ECG patterns are real, not hype: Mayo Clinic AI-ECG detection of low EF from normal-appearing ECGs is validated science
3. Echo AI reduces variability: automated measurements are more reproducible than manual ones, but can’t replace expert interpretation for complex cases
4. Wearable AFib detection creates clinical dilemmas: high sensitivity, but management of asymptomatic paroxysmal AFib is uncertain
5. HF prediction models have 75-80% false positive rates: an AUC of 0.85 sounds great until you calculate the PPV
6. Demand prospective outcome trials: technical accuracy ≠ clinical benefit
7. Equity gaps are substantial: most algorithms are trained on predominantly white populations
8. Implementation > accuracy: poor EHR integration causes missed diagnoses despite accurate algorithms
9. IBM Watson failed; learn from it: no RCT evidence = don’t deploy, no matter how prestigious the vendor
10. You remain responsible: AI assists, but all clinical decisions and their consequences are yours
Clinical Scenario: Vendor Evaluation
Scenario: Your Cardiology Department Is Considering Purchasing an AI Tool
The pitch: A vendor demonstrates an AI tool that predicts 30-day cardiovascular mortality risk for hospitalized cardiology patients. They show you:
- AUC 0.92 in internal validation
- “Outperforms traditional risk scores”
- Integration with your EHR
- Cost: $150,000/year
The department chair asks for your recommendation.
Questions to Ask Before Recommending Purchase:
- “What peer-reviewed publications support this algorithm?”
- Look for Circulation, JACC, JAMA Cardiology publications
- Internal validation white papers are insufficient
- “What is the positive predictive value at clinically useful sensitivity thresholds?”
- If 30-day mortality is 3%, even an AUC of 0.92 may yield a terrible PPV (a worked sketch follows this question list)
- Ask for sensitivity/specificity table at multiple thresholds
- “How does this algorithm perform in patient populations similar to ours?”
- Algorithm validated at academic medical center may fail at community hospital
- Request performance stratified by age, race, sex, comorbidities
- “What interventions will we apply to high-risk patients identified by this algorithm?”
- If the answer is “closer monitoring,” what’s the evidence that it prevents deaths?
- Many high-risk patients die despite optimal care
- “What are this algorithm’s failure modes?”
- Does it underestimate risk in young patients? Overestimate in elderly?
- What clinical situations does it handle poorly?
- “Can we pilot this on 500 patients before committing to $150,000/year?”
- Local validation essential
- Compare algorithm predictions to actual outcomes in your population
- “Who is liable if a patient predicted low-risk by the algorithm dies unexpectedly?”
- Read the vendor contract carefully
- Most disclaim all liability
- “What is the cost-effectiveness compared to existing risk stratification?”
- How many lives saved per $150,000 spent?
- Any published cost-effectiveness analyses?
- “How will this integrate with nursing workflow? Who triages the high-risk alerts?”
- Implementation costs often exceed purchase price
- Alert fatigue is real
- “Can I speak with cardiologists at 3 other hospitals who use this tool?”
- Get real user experiences, not marketing testimonials
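As promised in the PPV question above, a short sketch of why a high AUC can still yield a poor PPV at 3% 30-day mortality. The sensitivity/specificity pairs are hypothetical operating points; an AUC of 0.92 does not determine any single pair:

```python
# Why a high AUC can still mean a poor PPV at low prevalence.
prevalence = 0.03   # 30-day mortality in the scenario

for sensitivity, specificity in [(0.90, 0.80), (0.85, 0.85), (0.80, 0.90)]:
    tp = sensitivity * prevalence              # true positives per patient screened
    fp = (1 - specificity) * (1 - prevalence)  # false positives per patient screened
    ppv = tp / (tp + fp)
    print(f"sens {sensitivity:.0%}, spec {specificity:.0%} -> PPV {ppv:.0%}")
# At 3% prevalence these operating points yield PPVs of roughly 12-20%:
# most patients flagged as high mortality risk will not die within 30 days.
```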
Red Flags in This Scenario:
- AUC 0.92 reported without PPV/NPV: useless without knowing the false positive rate
- “Outperforms traditional risk scores”: were the comparisons done in the same patient cohort? Published?
- No mention of prospective validation: if the algorithm hasn’t been tested prospectively, it’s experimental
- High annual cost without cost-effectiveness data: $150K/year is substantial; where’s the ROI evidence?
- Vendor can’t explain what the algorithm learned: black box = red flag
Check Your Understanding
Scenario 1: The AI-Detected Low EF
Clinical situation: A 58-year-old woman with hypertension presents to primary care for annual physical. ECG ordered as part of routine screening shows normal sinus rhythm, normal intervals, no ST/T changes. However, the ECG machine’s AI algorithm flags: “Low ejection fraction predicted. Recommend echocardiogram.”
Patient is asymptomatic. No dyspnea, no edema, no chest pain. Physical exam normal. You’ve never seen this AI alert before.
Question 1: Do you order the echocardiogram based on this AI prediction?
Answer: Yes, order the echocardiogram.
Reasoning: The Mayo Clinic AI-ECG for low EF detection has been prospectively validated and published in Nature Medicine. The algorithm has 86.3% sensitivity and 85.7% specificity for detecting EF ≤35% (Attia et al., 2019).

Key points:
- This is validated technology, not experimental AI
- An echocardiogram is a low-risk test with potentially high yield (early detection of reduced EF enables initiation of GDMT)
- The ECG appearing normal doesn’t invalidate the algorithm; the whole point is that AI detects hidden patterns that cardiologists can’t see
- Many patients with asymptomatic reduced EF benefit from early ACE inhibitor/beta-blocker therapy

However:
- Counsel the patient that this is a screening test and may be a false positive
- Explain that AI detected subtle ECG patterns suggesting possible heart dysfunction
- Don’t alarm the patient unnecessarily before the echo confirms
If echo confirms reduced EF: Initiate GDMT (ACE-I, beta-blocker, consider SGLT2i)
If echo normal: Reassure patient, document that AI alert was false positive
Bottom line: AI-ECG low EF screening has sufficient validation to act on, especially when the confirmatory test (an echocardiogram) is low-risk.
Scenario 2: The Apple Watch AFib Alert
Clinical situation: A 52-year-old man with hypertension (CHA2DS2-VASc score of 1) presents with his Apple Watch showing irregular pulse notifications. He received 3 alerts over the past week, all while asymptomatic. No palpitations, no dyspnea, no dizziness.

You order an ECG in the office: normal sinus rhythm. You order a 24-hour Holter: it shows a 2-hour episode of atrial fibrillation at 3 AM (patient asleep, asymptomatic).
Question 2: Do you start anticoagulation for asymptomatic, device-detected paroxysmal AFib?
Answer: Unclear. This is a genuine clinical gray zone.
Arguments FOR anticoagulation:
- CHA2DS2-VASc ≥1 in male patients generally indicates anticoagulation benefit
- AFib is AFib regardless of how it is detected; the stroke mechanism (atrial stasis → thrombus → embolism) doesn’t require symptoms
- Subclinical AFib detected by pacemakers has been associated with increased stroke risk (though those episodes were typically >24 hours)
- The Apple Heart Study showed 84% PPV for AFib detection; this is real AFib, not artifact

Arguments AGAINST anticoagulation:
- CHA2DS2-VASc was derived from symptomatic AFib populations; its applicability to device-detected asymptomatic AFib is uncertain
- Paroxysmal AFib (2-hour episodes) may carry a lower stroke risk than persistent AFib
- Bleeding risk with anticoagulation (1-2% major bleeding per year) may outweigh the benefit in a very low-risk patient
- There is no RCT evidence that treating device-detected AFib reduces stroke risk
Ongoing trials:
- HEARTLINE: Apple Watch AFib detection for stroke prevention (results pending)
- GUARD-AF: impact of early detection and treatment

Current practice:
- Reasonable approach 1: anticoagulate based on CHA2DS2-VASc ≥1 (guideline-concordant)
- Reasonable approach 2: extended monitoring (30-day patch) to assess AFib burden; anticoagulate if >6-24 hours/day (expert opinion threshold)
- Reasonable approach 3: shared decision-making with the patient about the uncertain benefit

What I would do: Discuss with the patient:
- “You have real AFib, detected by your watch and confirmed on the Holter monitor”
- “The stroke risk is uncertain because you’re asymptomatic and the episodes are brief”
- “Standard guidelines would recommend blood thinners for a CHA2DS2-VASc score ≥1”
- “But those guidelines weren’t designed for smartwatch-detected AFib”
- “We’re waiting for research studies to clarify this, but they’re not done yet”
- “Options: start apixaban now, or extend monitoring to see how much AFib you’re having”
Bottom line: This is cutting-edge medicine where technology has outpaced evidence. Either approach (anticoagulate or monitor) is defensible. Document your reasoning carefully.
Scenario 3: The Proprietary HF Readmission Model
Clinical situation: Your hospital’s population health team purchased a proprietary AI tool that predicts 30-day HF readmission risk. The vendor claims AUC 0.87. The tool flags 35% of HF discharges as “high-risk.”
Your case management team asks: Should we apply intensive post-discharge interventions (home visits, daily phone calls, nurse case management) to all algorithm-flagged patients?
Cost of intensive intervention: $800 per patient. Your hospital discharges 400 HF patients/year.
Question 3: Do you implement the algorithm-driven intervention program?
Answer: No, not without further analysis.
Problems with this scenario:
1. Roughly half of the flagged patients will be false positives:
- Baseline HF readmission rate: ~20%
- The algorithm flags 35% of patients (140 of 400 discharges)
- At 80% sensitivity, it will detect ~64 of the 80 actual readmissions (true positives)
- But it will also flag ~76 patients who won’t readmit (false positives)
- Only 64/140 = 46% of flagged patients will actually readmit
2. Cost-effectiveness is questionable:
- 140 patients × $800 = $112,000 annual cost
- To break even, you need to prevent readmissions that would have cost more than $112,000
- Average HF readmission cost: ~$10,000
- So you need to prevent more than 11 readmissions (14% of the actual readmissions)
- Many HF readmissions are unpreventable (sudden cardiac death, acute MI, progression despite optimal therapy)

3. The intervention evidence is weak:
- What is the evidence that home visits plus phone calls prevent HF readmissions?
- Some studies show benefit, others don’t
- Even in positive studies, the NNT is typically 20-30 patients to prevent one readmission

4. Algorithm transparency is absent:
- What features is the algorithm using?
- If it is primarily age plus comorbidities (as most HF models are), you could achieve similar performance with a simpler rule: “flag all patients >80 with CKD and COPD”
- Paying for a proprietary algorithm to learn what you already know is wasteful
What to do instead:
- Request vendor provide:
- Peer-reviewed publication of algorithm validation
- Performance stratified by demographics
- PPV/NPV at multiple sensitivity thresholds
- Feature importance (what is the algorithm learning?)
- Pilot study:
- Apply algorithm to 100 consecutive HF discharges
- Track: How many flagged? How many actually readmit?
- Calculate: PPV in your population (may differ from vendor’s validation)
- Evaluate intervention evidence:
- Systematic review of transitional care interventions for HF
- What actually works? (Hint: Early post-discharge cardiology follow-up, medication reconciliation, patient education)
- Consider simpler approach:
- Apply intensive interventions to all HF discharges (not algorithm-selected subset)
- If intervention costs $800 and prevents even 5% of readmissions, it’s cost-effective for entire population
- Simpler than algorithmic triage
Bottom line: Proprietary algorithms with impressive AUCs often provide minimal value over clinical judgment. Demand evidence of clinical benefit and cost-effectiveness before implementation.
The algorithm isn’t necessarily wrong. It’s just not clear it adds value over existing approaches.