Appendix G — Clinical AI Vendor Evaluation Toolkit

Appendix G: AI Vendor Evaluation Checklist for Physicians

Tip: Purpose

This appendix provides a comprehensive, actionable framework for evaluating commercial AI products for clinical practice. Use this before purchasing or deploying any vendor AI system in your hospital, clinic, or health system.

Who should use this:

  • Hospital administrators evaluating AI vendors
  • Department chairs considering AI tools for their specialty
  • Chief Medical Information Officers (CMIOs) assessing AI products
  • Physician practice leaders making technology decisions
  • IRB/ethics committees reviewing AI deployments

What you’ll find:

  • Structured evaluation scorecards
  • Red flag identification guides
  • Sample questions to ask vendors
  • Decision frameworks for Go/No-Go
  • Procurement contract language recommendations


Introduction: Why You Need This Checklist

The Clinical AI Morgue (Appendix E) documented $100M+ in failed AI investments. Many failures were predictable. Warning signs existed but were ignored. Vendor claims went unquestioned. Due diligence was insufficient.

This toolkit helps you avoid those mistakes.

The Vendor-Physician Information Asymmetry

The vendor knows:

  • Where the model was trained
  • What its clinical limitations are
  • Where external validation failed
  • What patient populations it doesn’t work for
  • What the false alarm rate is in real-world use
  • Which hospitals abandoned the system after pilot

You know:

  • What the vendor tells you in sales materials
  • (Often: not much more)

This checklist helps level the playing field.


Quick Reference: The 6-Domain Evaluation Framework

Use this framework to systematically evaluate any clinical AI vendor:

| Domain | Key Questions | Red Flags |
|---|---|---|
| 1. Clinical Validation | External validation? Peer-reviewed publication? Improved outcomes? | Internal validation only; no publications; only AUC reported |
| 2. Patient Safety | Safety testing? Adverse events tracked? Failure mode analysis? | No safety data; no mention of harms; “100% accurate” claims |
| 3. Fairness & Equity | Performance across demographics? Bias audits? | No fairness testing; “we don’t use race so it’s fair” |
| 4. Privacy & Security | HIPAA compliant? BAA provided? Encryption? | Vague privacy claims; no BAA; data sent to foreign servers |
| 5. Workflow Integration | End-user testing? EHR integration? Training provided? | No physician research; “plug and play”; minimal training |
| 6. Business Viability | Company financially stable? Hospital references? FDA clearance? | Startup with no revenue; no references; unclear roadmap |

Domain 1: Clinical Validation 🔍

The Questions to Ask

Important: Critical Validation Questions
  1. Training Data
  2. Validation Studies
  3. Clinical Performance Metrics
  4. Clinical Outcomes (MOST IMPORTANT)

Red Flags 🚩

Proceed with extreme caution if:

  • “Validated on 100,000 patients” - But all from the same institution (not external validation)
  • “95% accuracy” - On cherry-picked test set; no external validation
  • “AUC 0.92” - But no data on whether clinical outcomes improved
  • “Deployed in 150+ hospitals” - Deployment ≠ Effectiveness; no outcome data
  • “Proprietary validation” - No peer-reviewed publications; “trust us”
  • Internal validation only - Models always perform better on development data
  • Vendor-funded validation studies - Conflicts of interest
  • Only technical metrics reported - AUC, accuracy without clinical outcomes

Lessons from Epic Sepsis Model

The Epic sepsis model had:

  • ✅ High AUC (0.76-0.83): impressive technical performance
  • ❌ But it detected only 7% of sepsis cases before clinical recognition in external validation
  • ❌ 67% of sepsis cases never triggered an alert
  • ❌ No evidence of improved patient outcomes (mortality, length of stay)
  • ❌ Alert fatigue caused clinicians to ignore warnings

Don’t repeat this mistake. Demand outcome data, not just AUC.
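Base-rate arithmetic shows why headline sensitivity and specificity can still mean mostly false alarms at the bedside. A quick sketch (all numbers here are hypothetical, not from any specific vendor or study):

```python
# Base-rate arithmetic: why impressive sensitivity/specificity can still
# mean mostly false alarms. All numbers here are hypothetical.

def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value (precision) via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# A "90% sensitive, 90% specific" sepsis alert on a ward with 2% prevalence:
print(f"PPV = {ppv(0.90, 0.90, 0.02):.1%}")  # PPV = 15.5%
```

At 2% prevalence, roughly 6 of every 7 alerts are false positives, which is exactly the alert-fatigue mechanism described above. This is why real-world false positive rates and outcome data matter more than AUC.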

Scoring Rubric

Rate each item (0-2 points):

| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| External Validation | None or internal only | 1 site | 3+ independent sites |
| Peer-Reviewed Publication | No | Conference abstract | Full peer-reviewed paper in major journal |
| Independent Researchers | Vendor employees only | Partial independence | Fully independent validation |
| Clinical Outcomes | None reported | Retrospective outcomes | Prospective RCT or quasi-experimental |
| Generalizability Evidence | No evidence | Similar populations | Validated on YOUR patient population |

Scoring:

  • 8-10 points: Strong validation evidence
  • 5-7 points: Moderate evidence; pilot testing recommended
  • 0-4 points: Insufficient evidence; do NOT deploy
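The rubric is simple enough to automate. A minimal sketch of the 0-2-point scoring with the verdict bands from this domain (the item names mirror the Domain 1 table; the example scores are illustrative):

```python
# Scoring a domain rubric: five items at 0-2 points each, mapped to the
# Domain 1 verdict bands. Example scores are illustrative only.

def score_domain(item_scores: dict[str, int]) -> tuple[int, str]:
    assert all(s in (0, 1, 2) for s in item_scores.values()), "items are 0-2"
    total = sum(item_scores.values())
    if total >= 8:
        verdict = "Strong validation evidence"
    elif total >= 5:
        verdict = "Moderate evidence; pilot testing recommended"
    else:
        verdict = "Insufficient evidence; do NOT deploy"
    return total, verdict

total, verdict = score_domain({
    "External Validation": 1,
    "Peer-Reviewed Publication": 2,
    "Independent Researchers": 1,
    "Clinical Outcomes": 0,
    "Generalizability Evidence": 1,
})
print(total, verdict)  # 5 Moderate evidence; pilot testing recommended
```

The same function applies unchanged to the rubrics in Domains 2-6; only the verdict wording differs slightly per domain.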


Domain 2: Patient Safety 🏥

The Questions to Ask

Important: Critical Safety Questions
  1. Safety Testing
  2. Clinical Outcomes
  3. Alert Burden
  4. Human Factors

Red Flags 🚩

Proceed with extreme caution if:

  • “No reported adverse events” - Likely means no monitoring system, not that it’s safe
  • “Clinicians love it” - No quantitative data on alert fatigue or response rates
  • “Seamless integration” - No workflow analysis or physician research
  • High false positive rate (>20%) - Will cause alert fatigue
  • No outcome data - Only technical performance (AUC) reported
  • “100% accurate” - Overconfident claims; every system has failure modes
  • No failure mode analysis - Every AI system can fail; what happens when it does?

Lessons from IBM Watson for Oncology

IBM Watson for Oncology:

  • ✅ Massive training on medical literature: impressive technology
  • ❌ But it recommended chemotherapy for patients with bleeding (contraindicated!)
  • ❌ Trained on synthetic hypothetical cases, not real patient outcomes
  • ❌ Contradicted evidence-based guidelines
  • ❌ Multiple hospitals canceled contracts after recognizing unsafe recommendations

Don’t repeat this mistake. Demand safety testing and real-world validation.

Scoring Rubric

Rate each item (0-2 points):

| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| Safety Testing | None mentioned | Some testing | Comprehensive failure mode analysis |
| Outcome Evidence | None | Retrospective analysis | Prospective RCT or quasi-experimental |
| Alert Burden | Unknown | <50% false positive | <20% false positive + manageable alert rate |
| Human Factors | No testing | Limited usability testing | Comprehensive workflow analysis with physicians |
| Adverse Event Monitoring | No system | Passive reporting | Active surveillance system (like drug safety) |

Scoring:

  • 8-10 points: Strong safety evidence
  • 5-7 points: Moderate evidence; close monitoring required
  • 0-4 points: Insufficient safety evidence; do NOT deploy


Domain 3: Fairness & Equity ⚖️

The Questions to Ask

Important: Critical Fairness Questions
  1. Bias Testing
  2. Training Data Representativeness
  3. Proxy Variables
  4. Health Equity Impact

Red Flags 🚩

Proceed with extreme caution if:

  • “We don’t use race as a feature, so it’s fair” - Fairness through unawareness doesn’t work; race is correlated with many other features
  • “Our algorithm is objective” - Algorithms encode human biases in historical data
  • “High accuracy means fair” - Accuracy ≠ Fairness (see OPTUM case)
  • No fairness testing - Bias is the default; fairness must be tested, not assumed
  • Using costs as proxy for health needs - See OPTUM case: costs reflect access barriers, not just illness severity
  • Trained on non-representative data - Academic medical centers only, commercially insured only, etc.

Lessons from OPTUM Algorithmic Bias

The OPTUM algorithm for care management:

  • ✅ Accurately predicted healthcare costs: high technical performance
  • ❌ But costs ≠ health needs, especially for Black patients
  • ❌ Systematically underestimated Black patients’ health needs
  • ❌ Result: Black patients were less likely to receive needed care coordination
  • ❌ At unbiased thresholds, Black patients would have made up 46.5% of those enrolled (versus 17.7% under the algorithm)

Don’t repeat this mistake. Test for fairness explicitly across all patient demographics.

Scoring Rubric

Rate each item (0-2 points):

| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| Fairness Audit | None conducted | Internal audit | Independent external audit |
| Subgroup Analysis | No reporting | Some subgroups | Comprehensive (race, ethnicity, age, sex, insurance) |
| Training Data Diversity | Homogeneous (AMCs only) | Somewhat diverse | Highly representative of your population |
| Proxy Variable Assessment | No assessment | Acknowledged | Validated against direct clinical outcomes |
| Equity Impact Plan | No plan | Monitoring planned | Active mitigation strategies for disparities |

Scoring:

  • 8-10 points: Strong fairness evidence
  • 5-7 points: Moderate evidence; continuous monitoring essential
  • 0-4 points: High bias risk; do NOT deploy without mitigation


Domain 4: Privacy & Security 🔒

The Questions to Ask

Important: Critical Privacy & Security Questions
  1. Regulatory Compliance
  2. Data Handling
  3. Security Measures
  4. Privacy by Design
  5. Transparency & Accountability

Red Flags 🚩

Proceed with extreme caution if:

  • Refuses to sign BAA - Non-starter for HIPAA compliance
  • “We’ll sign BAA later” - Must be in place BEFORE any PHI access
  • Vague about data storage location - “Cloud” is not specific enough; which cloud? Which region?
  • Data sent to foreign servers - Compliance and privacy risks
  • “We anonymize data so HIPAA doesn’t apply” - Re-identification risk is real; HIPAA’s de-identification standards (Safe Harbor or expert determination) are strict, and poorly anonymized data is still PHI
  • No SOC 2 or security certification - Unvetted security practices
  • “Trust us with your data” - Trust requires verification
  • Unclear data retention/deletion - Your patients’ data may persist indefinitely

Lessons from DeepMind Streams

DeepMind Streams (AKI detection):

  • ❌ Collected entire medical histories (not just kidney-related data): a data minimization failure
  • ❌ Patients were not informed: a consent violation
  • ❌ No proper legal basis for data sharing
  • ❌ Result: Ruled unlawful by the UK Information Commissioner’s Office

Lesson: Privacy promises must be legally binding and technically enforced. Data minimization is mandatory, not optional.

Scoring Rubric

Rate each item (0-2 points):

| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| HIPAA Compliance | No BAA or refuses | Will sign BAA | BAA + SOC 2 + HITRUST |
| Data Minimization | Collects everything | Some minimization | Strict minimization; edge deployment option |
| Security Certifications | None | SOC 2 Type I | SOC 2 Type II + penetration testing |
| Transparency | Vague policies | Clear policies | Detailed + third-party audit |
| Data Control | Vendor retains indefinitely | Retention period defined | You control data; deletion guaranteed |

Scoring:

  • 8-10 points: Strong privacy & security
  • 5-7 points: Moderate; additional safeguards needed
  • 0-4 points: Unacceptable risk; do NOT proceed


Domain 5: Workflow Integration 🔄

The Questions to Ask

Important: Critical Workflow Integration Questions
  1. Physician Research
  2. EHR Integration
  3. Training & Support
  4. Customization
  5. Monitoring & Feedback

Red Flags 🚩

Proceed with extreme caution if:

  • “Plug and play” - Clinical medicine is complex; no system is truly plug-and-play
  • “Works with all EHRs” - Each EHR integration is custom; this claim is implausible
  • “No training needed” - Physicians always need training for clinical decision support tools
  • “One-size-fits-all” - Different hospitals have different workflows and patient populations
  • “We can implement in 2 weeks” - Unrealistic for complex clinical systems; implementation takes months
  • No physician research - Designed in isolation from actual clinical workflows
  • Minimal support - Email-only support; no phone; no dedicated account manager
  • Black box, no customization - Can’t adjust thresholds or workflows to fit your practice

Lessons from Google Health India

Google’s diabetic retinopathy AI in India:

  • ✅ 96% accuracy in the lab with research-grade cameras
  • ❌ But 55% of images were ungradable in the field with portable cameras
  • ❌ Nurses couldn’t operate the system effectively (2-hour training was inadequate)
  • ❌ 5 min/patient workflow disruption overwhelmed clinics
  • ❌ No offline mode; internet connectivity required (unreliable in rural clinics)
  • ❌ Result: Pilot abandoned

Lesson: Lab performance ≠ Field performance. Real-world workflow integration is critical.

Scoring Rubric

Rate each item (0-2 points):

| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| Physician Research | None | Some physician testing | Extensive ethnographic research with your specialty |
| EHR Integration | No integration or manual entry | Some EHR support | Native integration with YOUR specific EHR |
| Training Program | Minimal (<2 hours) | Half-day training | Comprehensive with ongoing support |
| Customization | Black box, no customization | Limited adjustments | Highly customizable to your workflows |
| Support Quality | Email only | Email + phone | Dedicated account manager + on-site clinical support |

Scoring:

  • 8-10 points: Strong workflow integration
  • 5-7 points: Moderate; expect implementation challenges
  • 0-4 points: High risk of failure; don’t proceed without extensive pilot


Domain 6: Business Viability 💼

The Questions to Ask

Important: Critical Business Viability Questions
  1. Company Stability
  2. Customer Base
  3. Product Maturity
  4. Regulatory Status
  5. Pricing & Contracts

Red Flags 🚩

Proceed with extreme caution if:

  • Early-stage startup, no revenue - High risk of going out of business; your investment lost
  • Can’t provide physician references - No one willing to vouch for them; bad sign
  • Version 1.0 product - Expect bugs and instability; you’re the beta tester
  • Vague about pricing - “It depends”; no transparency; potential for unexpected costs
  • Long-term contract with no exit clause - You’re locked in even if it doesn’t work
  • No regulatory clearance when required - Legal risk; FDA may force you to discontinue
  • Leadership with no healthcare experience - Tech team with no clinical domain expertise
  • Recent layoffs or leadership turnover - Financial instability

Lessons from IBM Watson for Oncology

IBM Watson:

  • ✅ IBM is a massive, financially stable company
  • ❌ But even IBM couldn’t make Watson for Oncology work clinically
  • ❌ Hospitals that bought in early lost millions in licensing fees
  • ❌ Time was wasted training staff on a system that was ultimately abandoned
  • ❌ IBM sold its Watson Health division (2021), acknowledging failure

Lesson: Big company ≠ Good product. Clinical validation matters more than brand name.

Scoring Rubric

Rate each item (0-2 points):

| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| Company Stability | Startup, no revenue | Funded startup or small profitable company | Established, profitable, >5 years in healthcare |
| Customer Base | <5 customers or no references | 5-20 hospital customers, some references | 20+ hospital customers, multiple physician references |
| Product Maturity | V1.0 | V2-3 | V4+ with track record |
| Regulatory | No clearance (when required) | Clearance in progress | FDA cleared/approved |
| Pricing Transparency | Vague or hidden | Somewhat clear | Fully transparent, fair terms |

Scoring:

  • 8-10 points: Financially stable, low risk
  • 5-7 points: Moderate risk; negotiate favorable contract terms
  • 0-4 points: High financial risk; consider waiting for product maturity


Putting It All Together: The Overall Evaluation Matrix

Use this to synthesize scores across all domains:

| Domain | Weight | Your Score (0-10) | Weighted Score |
|---|---|---|---|
| 1. Clinical Validation | 30% | _____ | _____ |
| 2. Patient Safety | 25% | _____ | _____ |
| 3. Fairness & Equity | 20% | _____ | _____ |
| 4. Privacy & Security | 15% | _____ | _____ |
| 5. Workflow Integration | 5% | _____ | _____ |
| 6. Business Viability | 5% | _____ | _____ |
| TOTAL | 100% | | _____ / 10 |
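The weighted total is a straightforward calculation. A sketch using the weights from the matrix (the domain scores below are placeholders, not a real evaluation):

```python
# Computing the weighted total from the evaluation matrix. Domain scores
# (0-10) below are placeholders, not a real evaluation.

WEIGHTS = {
    "Clinical Validation": 0.30,
    "Patient Safety": 0.25,
    "Fairness & Equity": 0.20,
    "Privacy & Security": 0.15,
    "Workflow Integration": 0.05,
    "Business Viability": 0.05,
}

def overall_score(domain_scores: dict[str, float]) -> float:
    """Weighted sum of six 0-10 domain scores, rounded to two decimals."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must total 100%"
    return round(sum(WEIGHTS[d] * domain_scores[d] for d in WEIGHTS), 2)

scores = {"Clinical Validation": 6, "Patient Safety": 5, "Fairness & Equity": 4,
          "Privacy & Security": 8, "Workflow Integration": 6, "Business Viability": 9}
print(overall_score(scores))  # 5.8
```

Note how heavily the weighting favors clinical validation and safety: a vendor can score perfectly on business viability and workflow and still fail overall if the evidence base is weak.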

Decision Framework

Overall Score Interpretation:

  • 8.0 - 10.0: Proceed with Pilot Deployment
    • Strong evidence across all domains
    • Still: Start with pilot in 1-2 clinical units before hospital-wide deployment
    • Monitor closely for first 6 months
    • Measure clinical outcomes, not just technical metrics
  • 6.0 - 7.9: Conditional Pilot with Mitigation
    • Identify weak domains and create mitigation plans
    • Example: Weak fairness score → Implement continuous bias monitoring
    • Pilot with intensive monitoring and frequent assessment
    • Re-evaluate after 6 months
    • Negotiate performance guarantees in contract
  • 4.0 - 5.9: Do NOT Deploy; Negotiate Improvements
    • Too many gaps in evidence
    • Go back to vendor with requirements:
      • “We need external validation study at 3+ hospitals before we’ll consider”
      • “We need fairness audit with subgroup analysis before we’ll proceed”
      • “We need prospective outcome data showing improved patient outcomes”
    • Consider waiting for product maturity
  • 0 - 3.9: Do NOT Deploy
    • Insufficient evidence
    • High risk of failure or patient harm
    • Wait for better products or invest in developing your own solution
    • Document decision for institutional records
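The four bands above map directly to a recommendation. A minimal lookup, with band edges taken from the text:

```python
# The Go/No-Go decision bands, as a simple lookup. Band edges follow the
# decision framework above.

def recommendation(overall: float) -> str:
    if not 0.0 <= overall <= 10.0:
        raise ValueError("overall score must be between 0 and 10")
    if overall >= 8.0:
        return "Proceed with pilot deployment"
    if overall >= 6.0:
        return "Conditional pilot with mitigation"
    if overall >= 4.0:
        return "Do NOT deploy; negotiate improvements"
    return "Do NOT deploy"

print(recommendation(4.5))  # Do NOT deploy; negotiate improvements
```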

Sample Questions for Vendor Meetings

Use these scripts to extract critical information:

Clinical Validation Questions

Script:

> “Can you provide the peer-reviewed publication of your external validation study? We’d like to see performance metrics at hospitals not involved in development, with results stratified by patient demographics and clinical outcomes data.”

Follow-ups if the vendor hesitates:

  • “If there’s no peer-reviewed external validation, when do you plan to conduct one?”
  • “Can you share the names of hospitals where validation occurred so we can contact physician references?”
  • “What were the clinical outcomes (mortality, complications, readmissions) at hospitals using this system?”

Fairness Questions

Script:

> “We’re committed to health equity. Can you show us the fairness audit results? Specifically, we need sensitivity, specificity, and PPV broken down by race, ethnicity, age, sex, and insurance type.”

Follow-ups:

  • “If no fairness audit has been done, why not?”
  • “What is your plan for ongoing bias monitoring after deployment?”
  • “What happens if we discover bias affecting our patient population after deployment?”

Privacy Questions

Script:

> “Walk us through exactly what patient data leaves our hospital, where it goes, how it’s stored, and how we can verify this. Can we see the Business Associate Agreement and SOC 2 Type II report?”

Follow-ups:

  • “What specific PHI elements does your system need?”
  • “Can the system work on-premise without sending data to the cloud?”
  • “What happens to our patients’ data if we terminate the contract?”
  • “Have there been any data breaches or security incidents?”

Safety Questions

Script:

> “What patient outcomes have improved at hospitals using your system? Can you provide data on mortality, complications, length of stay, readmissions, or quality of life from prospective studies?”

Follow-ups if the vendor only cites AUC or accuracy:

  • “AUC is a technical metric. Has deployment demonstrably improved patient outcomes?”
  • “What is the false positive rate in real-world clinical use?”
  • “What are the failure modes? What happens when the model fails?”
  • “Have there been any adverse events or patient harm attributed to the system?”

Workflow Integration Questions

Script:

> “Has your system been tested with physicians and nurses at hospitals like ours? What did the usability testing reveal? How much time does it add or save per patient?”

Follow-ups:

  • “What is the typical implementation timeline from contract signing to go-live?”
  • “What ongoing clinical support do you provide?”
  • “Can we speak with 3 attending physician users at other hospitals about their experience?”


Procurement Contract Language Recommendations

If you decide to proceed with a pilot, include these provisions in your contract:

1. Performance Guarantees

Vendor guarantees that the AI system will achieve the following performance metrics
at [Hospital/Health System Name] during the pilot period:

- Sensitivity ≥ [threshold]% (or other appropriate metric)
- Specificity ≥ [threshold]%
- Positive Predictive Value ≥ [threshold]%
- False positive rate ≤ [threshold]%
- Physician satisfaction ≥ [threshold]/5 (measured by survey)
- [Clinical outcome metric] improved by ≥ [threshold]% vs. baseline

If performance falls below these thresholds for [timeframe], [Hospital] may
terminate the contract without penalty and receive full refund.

2. Fairness Requirements

Vendor warrants that the AI system has undergone bias testing and demonstrates
equitable performance across patient demographic groups (race, ethnicity, age, sex,
insurance status).

Vendor will provide [Hospital] with quarterly bias audit reports showing
performance metrics (sensitivity, specificity, PPV) stratified by:
- Race/ethnicity (White, Black, Hispanic, Asian, Other)
- Age (<65, ≥65)
- Sex (Male, Female)
- Insurance (Commercial, Medicare, Medicaid, Uninsured)

If disparate impact is identified (performance difference >10% across groups),
Vendor will work with [Hospital] to mitigate bias within [timeframe] or
[Hospital] may terminate without penalty.
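The clause’s >10% disparate-impact trigger is easy to operationalize when the quarterly audit report arrives. A sketch (the subgroup values are hypothetical):

```python
# Operationalizing the clause's disparate-impact trigger: flag any metric
# whose max-min spread across subgroups exceeds 10 percentage points.
# Subgroup values below are hypothetical.

def disparate_impact(metric_by_group: dict[str, float], threshold: float = 0.10) -> bool:
    """True if the spread of a metric across groups exceeds the threshold."""
    values = metric_by_group.values()
    return (max(values) - min(values)) > threshold

sensitivity = {"White": 0.91, "Black": 0.78, "Hispanic": 0.88, "Asian": 0.90}
print(disparate_impact(sensitivity))  # True: a 13-point spread triggers mitigation
```

Run the same check on each stratified metric (sensitivity, specificity, PPV) for each stratification axis (race/ethnicity, age, sex, insurance) required by the clause.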

3. Data Privacy & Security

Vendor agrees to:
- Sign HIPAA Business Associate Agreement (BAA) prior to any PHI access
- Store all data in HIPAA-compliant infrastructure in [Country/Region]
- Encrypt data at rest (AES-256 minimum) and in transit (TLS 1.3+ minimum)
- Provide SOC 2 Type II audit report annually
- Not use [Hospital] data for Vendor's own R&D without explicit written consent
- Delete all [Hospital] patient data within 30 days of contract termination
- Provide audit logs of all data access quarterly
- Notify [Hospital] within 24 hours of any data breach or security incident

4. Clinical Validation & Monitoring

[Hospital] has the right to:
- Conduct independent validation studies of the AI system
- Publish validation results (positive or negative) in peer-reviewed journals
- Access model performance dashboards in real-time
- Receive quarterly performance reports from Vendor
- Audit Vendor's quality management system

Vendor will provide:
- Technical documentation for independent validation
- API access for performance monitoring
- Support for [Hospital]'s evaluation efforts
- Notification of any FDA adverse event reports or regulatory actions

5. Liability & Indemnification

Vendor agrees to indemnify [Hospital] for:
- Any patient harm caused by AI system errors or failures
- Regulatory fines resulting from Vendor's non-compliance (HIPAA, FDA, etc.)
- Data breaches resulting from Vendor's security failures
- Malpractice claims arising from AI system recommendations

Liability cap: $[Amount] (no less than annual contract value x 10)

Vendor maintains professional liability insurance of at least $[Amount] and will
provide certificate of insurance to [Hospital].

6. Termination Rights

[Hospital] may terminate this agreement:
- For cause (breach of contract): Immediate termination, full refund
- For convenience: 90-day notice, pro-rated refund for remaining term
- For patient safety concerns: Immediate termination if AI system poses risk
- For non-performance: If system fails to meet performance guarantees
- For regulatory action: If FDA issues warning letter or requires modification

Upon termination:
- Vendor must delete all [Hospital] patient data within 30 days
- Vendor must provide data export in standard format (CSV, FHIR, etc.)
- [Hospital] retains all rights to its data and any derivatives
- Vendor returns all documentation and provides transition assistance

Pilot Implementation Plan

Even after thorough evaluation, always start with a pilot:

Phase 1: Controlled Pilot (Months 1-3)

Scope:

  • 1-2 clinical units (e.g., one ICU, one medical floor, one outpatient clinic)
  • 20-100 patients/day
  • Intensive monitoring
  • Dedicated clinical champion

Pre-Defined Success Criteria (DECIDE BEFORE PILOT):

  • Technical: Sensitivity ≥ [X]%, Specificity ≥ [Y]%, PPV ≥ [Z]%
  • Clinical: [Primary outcome] improved by ≥ [X]% vs. baseline
  • User: Physician satisfaction ≥ 4/5, response rate ≥ 80%
  • Safety: Zero patient harm incidents
  • Equity: No performance disparities >10% across demographic groups

Metrics to Track:

  • Technical performance (sensitivity, specificity, PPV, NPV, AUC)
  • Alert burden (alerts/day, false positive rate, response time)
  • Physician experience (satisfaction, time spent per alert, override rate)
  • Workflow impact (time added/saved per patient)
  • Clinical outcomes (compare to baseline: mortality, complications, length of stay)
  • Equity impact (outcomes by race, ethnicity, age, sex, insurance)
  • Adverse events (any patient harm attributed to AI)

Go/No-Go Decision (Month 3):

  • ✅ Proceed to Phase 2 if ALL success criteria met
  • ⚠️ Iterate/adjust if most criteria met (address specific gaps)
  • ❌ Terminate if major criteria not met (don’t escalate commitment)
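The Month-3 gate can be made explicit so no one relitigates it under deployment pressure. A sketch, reading “most criteria met” as all but one (that cutoff is an assumption, not defined above; agree on your own before the pilot starts):

```python
# The Month-3 Go/No-Go gate as explicit logic. "Most criteria met" is read
# here as all but one -- that cutoff is an assumption; fix yours in advance.

def go_no_go(criteria: dict[str, bool]) -> str:
    passed = sum(criteria.values())
    if passed == len(criteria):
        return "Proceed to Phase 2"
    if passed >= len(criteria) - 1:
        return "Iterate/adjust; address specific gaps"
    return "Terminate; do not escalate commitment"

pilot = {
    "technical thresholds met": True,
    "primary outcome improved": True,
    "physician satisfaction >= 4/5": False,
    "zero patient harm": True,
    "no disparity > 10%": True,
}
print(go_no_go(pilot))  # Iterate/adjust; address specific gaps
```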

Phase 2: Expanded Pilot (Months 4-9)

Scope:

  • 5-10 units across multiple departments
  • Continue intensive monitoring
  • Broader physician engagement

Objectives:

  • Validate Phase 1 results at a larger scale
  • Test in diverse clinical settings (ICU, floor, ED, outpatient)
  • Identify implementation challenges
  • Refine workflows and alert thresholds

Phase 3: Full Deployment (Month 10+)

Scope:

  • Hospital-wide or health system-wide

Requirements:

  • Phase 2 demonstrated sustained clinical benefit
  • Physician training completed for all users
  • Ongoing monitoring system in place
  • Regular re-auditing planned (quarterly bias audits, performance monitoring)
  • Governance structure for AI oversight

Never skip the pilot phase.


Real-World Case Study: Using the Checklist

Example: Evaluating a Hypothetical Sepsis Prediction Tool

Vendor Claims:

  • “AI predicts sepsis 6 hours before clinical recognition”
  • “92% sensitivity, 87% specificity”
  • “Deployed in 200+ hospitals”
  • “$400K/year for hospital-wide license”

Your Evaluation Using This Checklist:

Domain 1: Clinical Validation (Score: 4/10)

  • ✅ Published in peer-reviewed journal
  • ❌ Internal validation only (same health system, 3 hospitals)
  • ❌ No independent external validation
  • ❌ 92% sensitivity in paper, but what about at external sites?
  • ❌ No prospective outcome data (mortality, length of stay)
  • Red flag: “Deployed in 200+ hospitals” ≠ Evidence of effectiveness (Epic sepsis model lesson!)

Domain 2: Patient Safety (Score: 3/10)

  • ✅ Safety mentioned in paper
  • ❌ No prospective outcome studies showing mortality benefit
  • ❌ No data on whether deployment reduced deaths or complications
  • ❌ False positive rate not clearly reported for real-world use
  • ❌ Alert burden unknown
  • Red flag: Only technical metrics (AUC, sensitivity), no patient outcomes

Domain 3: Fairness & Equity (Score: 2/10)

  • ❌ No fairness audit mentioned
  • ❌ No performance stratified by race/ethnicity
  • ❌ When asked, vendor says “we don’t use race as a feature, so it’s fair”
  • Red flag: Fairness through unawareness (doesn’t work!)

Domain 4: Privacy & Security (Score: 7/10)

  • ✅ Will sign BAA
  • ✅ SOC 2 Type II certified
  • ✅ Data encrypted at rest and in transit
  • ⚠️ Data stored in vendor cloud (no on-premise option)
  • ⚠️ No HITRUST certification

Domain 5: Workflow Integration (Score: 5/10)

  • ✅ Integrates with your EHR (Epic)
  • ⚠️ Implementation takes 3-6 months
  • ⚠️ Training: 2-hour online module (seems insufficient)
  • ❌ No customization; one-size-fits-all alert thresholds
  • ❌ No usability testing data shared

Domain 6: Business Viability (Score: 8/10)

  • ✅ Established company, 7 years in business
  • ✅ 200 hospital customers (they claim)
  • ✅ Willing to provide 2 physician references
  • ⚠️ Pricing seems high ($400K/year)
  • ✅ FDA 510(k) cleared

Overall Weighted Score: 4.1 / 10 (0.30×4 + 0.25×3 + 0.20×2 + 0.15×7 + 0.05×5 + 0.05×8 ≈ 4.1)

Decision: Do NOT Deploy

  • Insufficient clinical validation (internal only, no external validation)
  • No outcome evidence (AUC/sensitivity are not enough; need mortality/LOS data)
  • No fairness testing (high bias risk)
  • Workflow concerns (alert burden unknown, may cause alert fatigue)

Recommendation to Hospital Leadership:

“We evaluated [Vendor]’s sepsis prediction tool using the Clinical AI Vendor Evaluation Framework. The system scores 4.1/10, below our threshold for deployment.

Key concerns:

  • No external validation (validation only within the vendor’s own health system)
  • No evidence of improved patient outcomes (only sensitivity/specificity reported, no mortality or LOS data)
  • No fairness audit (risk of bias affecting minority patients, similar to the Epic sepsis model)
  • High cost ($400K/year) without demonstrated ROI

We recommend:

  1. Request an external validation study at 3+ independent hospitals
  2. Request a fairness audit with performance by race/ethnicity/insurance
  3. Request prospective outcome data (mortality, length of stay, time to antibiotics)
  4. Pilot at 2-3 peer hospitals before we consider deployment
  5. Re-evaluate in 12 months if the vendor addresses these gaps

Alternative: Continue using existing sepsis protocols (qSOFA, SIRS) while monitoring the field for better-validated AI systems. The $400K could fund 2 additional ICU nurses, which has proven mortality benefit.”


Summary: Key Principles for Vendor Evaluation

  1. Clinical validation is non-negotiable - External validation at 3+ hospitals required
  2. Outcomes > Accuracy - Sensitivity/specificity don’t save lives; improved patient outcomes do
  3. Demand prospective evidence - Retrospective studies tend to overstate real-world benefit; only prospective data shows whether deployment actually helps patients
  4. Fairness testing is mandatory - Bias is the default; fairness must be proven
  5. Start small, scale slowly - Pilot → Evaluate → Scale only if successful
  6. Negotiate strong contracts - Performance guarantees, termination rights, data control
  7. You can say no - Bad AI is worse than no AI; don’t deploy systems that aren’t ready

The most important lesson: You are not obligated to buy a product just because it has “AI” in the name. Demand evidence. Ask hard questions. Walk away if the evidence isn’t there.

Your patients deserve better than unvalidated AI systems.



Remember: The best AI system is one that improves patient outcomes, operates fairly, respects privacy, integrates into clinical workflows, and has strong evidence supporting its use. Don’t settle for less.