Appendix G — Clinical AI Vendor Evaluation Toolkit

Appendix G: AI Vendor Evaluation Checklist for Physicians

Tip: Purpose

This appendix provides a comprehensive, actionable framework for evaluating commercial AI products for clinical practice. Use this before purchasing or deploying any vendor AI system in your hospital, clinic, or health system.

Who should use this:

  • Hospital administrators evaluating AI vendors
  • Department chairs considering AI tools for their specialty
  • Chief Medical Information Officers (CMIOs) assessing AI products
  • Physician practice leaders making technology decisions
  • IRB/ethics committees reviewing AI deployments

What you’ll find:

  • Structured evaluation scorecards
  • Red flag identification guides
  • Sample questions to ask vendors
  • Decision frameworks for Go/No-Go
  • Procurement contract language recommendations


Introduction: Why You Need This Checklist

The Clinical AI Morgue (Appendix E) documented $100M+ in failed AI investments. Many failures were predictable. Warning signs existed but were ignored. Vendor claims went unquestioned. Due diligence was insufficient.

This toolkit helps you avoid those mistakes.

The Vendor-Physician Information Asymmetry

The vendor knows:

  • Where the model was trained
  • What its clinical limitations are
  • Where external validation failed
  • What patient populations it doesn’t work for
  • What the false alarm rate is in real-world use
  • Which hospitals abandoned the system after pilot

You know:

  • What the vendor tells you in sales materials
  • (Often: not much more)

This checklist helps level the playing field.


Quick Reference: The 6-Domain Evaluation Framework

Use this framework to systematically evaluate any clinical AI vendor:

| Domain | Key Questions | Red Flags |
|---|---|---|
| 1. Clinical Validation | External validation? Peer-reviewed publication? Improved outcomes? | Internal validation only; no publications; only AUC reported |
| 2. Patient Safety | Safety testing? Adverse events tracked? Failure mode analysis? | No safety data; no mention of harms; “100% accurate” claims |
| 3. Fairness & Equity | Performance across demographics? Bias audits? | No fairness testing; “we don’t use race so it’s fair” |
| 4. Privacy & Security | HIPAA compliant? BAA provided? Encryption? | Vague privacy claims; no BAA; data sent to foreign servers |
| 5. Workflow Integration | End-user testing? EHR integration? Training provided? | No physician research; “plug and play”; minimal training |
| 6. Business Viability | Company financially stable? Hospital references? FDA clearance? | Startup with no revenue; no references; unclear roadmap |

Domain 1: Clinical Validation 🔍

The Questions to Ask

Important: Critical Validation Questions
  1. Training Data
  2. Validation Studies
  3. Clinical Performance Metrics
  4. Clinical Outcomes (MOST IMPORTANT)

Red Flags 🚩

Proceed with extreme caution if:

  • “Validated on 100,000 patients” - But all from the same institution (not external validation)
  • “95% accuracy” - On cherry-picked test set; no external validation
  • “AUC 0.92” - But no data on whether clinical outcomes improved
  • “Deployed in 150+ hospitals” - Deployment ≠ Effectiveness; no outcome data
  • “Proprietary validation” - No peer-reviewed publications; “trust us”
  • Internal validation only - Models always perform better on development data
  • Vendor-funded validation studies - Conflicts of interest
  • Only technical metrics reported - AUC, accuracy without clinical outcomes

Lessons from Epic Sepsis Model

The Epic sepsis model had:

  • ✅ High AUC (0.76-0.83): impressive technical performance
  • ❌ But it detected only 7% of sepsis cases before clinical recognition in external validation
  • ❌ 67% of sepsis cases never triggered an alert
  • ❌ No evidence of improved patient outcomes (mortality, length of stay)
  • ❌ Alert fatigue caused clinicians to ignore warnings

Don’t repeat this mistake. Demand outcome data, not just AUC.
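Base-rate arithmetic shows why headline sensitivity and specificity can still mean mostly false alarms at the bedside. A quick sketch (all numbers here are hypothetical, not from any specific vendor or study):

```python
# Base-rate arithmetic: why impressive sensitivity/specificity can still
# mean mostly false alarms. All numbers here are hypothetical.

def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value (precision) via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# A "90% sensitive, 90% specific" sepsis alert on a ward with 2% prevalence:
print(f"PPV = {ppv(0.90, 0.90, 0.02):.1%}")  # PPV = 15.5%
```

At 2% prevalence, roughly 6 of every 7 alerts are false positives, which is exactly the alert-fatigue mechanism described above. This is why real-world false positive rates and outcome data matter more than AUC.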

Scoring Rubric

Rate each item (0-2 points):

| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| External Validation | None or internal only | 1 site | 3+ independent sites |
| Peer-Reviewed Publication | No | Conference abstract | Full peer-reviewed paper in major journal |
| Independent Researchers | Vendor employees only | Partial independence | Fully independent validation |
| Clinical Outcomes | None reported | Retrospective outcomes | Prospective RCT or quasi-experimental |
| Generalizability Evidence | No evidence | Similar populations | Validated on YOUR patient population |

Scoring:

  • 8-10 points: Strong validation evidence
  • 5-7 points: Moderate evidence; pilot testing recommended
  • 0-4 points: Insufficient evidence; do NOT deploy
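The rubric is simple enough to automate. A minimal sketch of the 0-2-point scoring with the verdict bands from this domain (the item names mirror the Domain 1 table; the example scores are illustrative):

```python
# Scoring a domain rubric: five items at 0-2 points each, mapped to the
# Domain 1 verdict bands. Example scores are illustrative only.

def score_domain(item_scores: dict[str, int]) -> tuple[int, str]:
    assert all(s in (0, 1, 2) for s in item_scores.values()), "items are 0-2"
    total = sum(item_scores.values())
    if total >= 8:
        verdict = "Strong validation evidence"
    elif total >= 5:
        verdict = "Moderate evidence; pilot testing recommended"
    else:
        verdict = "Insufficient evidence; do NOT deploy"
    return total, verdict

total, verdict = score_domain({
    "External Validation": 1,
    "Peer-Reviewed Publication": 2,
    "Independent Researchers": 1,
    "Clinical Outcomes": 0,
    "Generalizability Evidence": 1,
})
print(total, verdict)  # 5 Moderate evidence; pilot testing recommended
```

The same function applies unchanged to the rubrics in Domains 2-6; only the verdict wording differs slightly per domain.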


Domain 2: Patient Safety 🏥

The Questions to Ask

Important: Critical Safety Questions
  1. Safety Testing
  2. Clinical Outcomes
  3. Alert Burden
  4. Human Factors

Red Flags 🚩

Proceed with extreme caution if:

  • “No reported adverse events” - Likely means no monitoring system, not that it’s safe
  • “Clinicians love it” - No quantitative data on alert fatigue or response rates
  • “Seamless integration” - No workflow analysis or physician research
  • High false positive rate (>20%) - Will cause alert fatigue
  • No outcome data - Only technical performance (AUC) reported
  • “100% accurate” - Overconfident claims; every system has failure modes
  • No failure mode analysis - Every AI system can fail; what happens when it does?

Lessons from IBM Watson for Oncology

IBM Watson for Oncology:

  • ✅ Massive training on medical literature: impressive technology
  • ❌ But it recommended chemotherapy for patients with bleeding (contraindicated!)
  • ❌ Trained on synthetic hypothetical cases, not real patient outcomes
  • ❌ Contradicted evidence-based guidelines
  • ❌ Multiple hospitals canceled contracts after recognizing unsafe recommendations

Don’t repeat this mistake. Demand safety testing and real-world validation.

Scoring Rubric

Rate each item (0-2 points):

| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| Safety Testing | None mentioned | Some testing | Comprehensive failure mode analysis |
| Outcome Evidence | None | Retrospective analysis | Prospective RCT or quasi-experimental |
| Alert Burden | Unknown | <50% false positive | <20% false positive + manageable alert rate |
| Human Factors | No testing | Limited usability testing | Comprehensive workflow analysis with physicians |
| Adverse Event Monitoring | No system | Passive reporting | Active surveillance system (like drug safety) |

Scoring:

  • 8-10 points: Strong safety evidence
  • 5-7 points: Moderate evidence; close monitoring required
  • 0-4 points: Insufficient safety evidence; do NOT deploy


Domain 3: Fairness & Equity ⚖️

The Questions to Ask

Important: Critical Fairness Questions
  1. Bias Testing
  2. Training Data Representativeness
  3. Proxy Variables
  4. Health Equity Impact

Red Flags 🚩

Proceed with extreme caution if:

  • “We don’t use race as a feature, so it’s fair” - Fairness through unawareness doesn’t work; race is correlated with many other features
  • “Our algorithm is objective” - Algorithms encode human biases in historical data
  • “High accuracy means fair” - Accuracy ≠ Fairness (see OPTUM case)
  • No fairness testing - Bias is the default; fairness must be tested, not assumed
  • Using costs as proxy for health needs - See OPTUM case: costs reflect access barriers, not just illness severity
  • Trained on non-representative data - Academic medical centers only, commercially insured only, etc.

Lessons from OPTUM Algorithmic Bias

The OPTUM algorithm for care management:

  • ✅ Accurately predicted healthcare costs: high technical performance
  • ❌ But costs ≠ health needs, especially for Black patients
  • ❌ Systematically underestimated Black patients’ health needs
  • ❌ Result: Black patients were less likely to receive needed care coordination
  • ❌ At unbiased thresholds, Black patients would have made up 46.5% of those enrolled (versus 17.7% under the algorithm)

Don’t repeat this mistake. Test for fairness explicitly across all patient demographics.

Scoring Rubric

Rate each item (0-2 points):

| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| Fairness Audit | None conducted | Internal audit | Independent external audit |
| Subgroup Analysis | No reporting | Some subgroups | Comprehensive (race, ethnicity, age, sex, insurance) |
| Training Data Diversity | Homogeneous (AMCs only) | Somewhat diverse | Highly representative of your population |
| Proxy Variable Assessment | No assessment | Acknowledged | Validated against direct clinical outcomes |
| Equity Impact Plan | No plan | Monitoring planned | Active mitigation strategies for disparities |

Scoring:

  • 8-10 points: Strong fairness evidence
  • 5-7 points: Moderate evidence; continuous monitoring essential
  • 0-4 points: High bias risk; do NOT deploy without mitigation


Domain 4: Privacy & Security 🔒

The Questions to Ask

Important: Critical Privacy & Security Questions
  1. Regulatory Compliance
  2. Data Handling
  3. Security Measures
  4. Privacy by Design
  5. Transparency & Accountability

Red Flags 🚩

Proceed with extreme caution if:

  • Refuses to sign BAA - Non-starter for HIPAA compliance
  • “We’ll sign BAA later” - Must be in place BEFORE any PHI access
  • Vague about data storage location - “Cloud” is not specific enough; which cloud? Which region?
  • Data sent to foreign servers - Compliance and privacy risks
  • “We anonymize data so HIPAA doesn’t apply” - Re-identification risk is real; HIPAA’s de-identification standards (Safe Harbor or expert determination) are strict, and poorly anonymized data is still PHI
  • No SOC 2 or security certification - Unvetted security practices
  • “Trust us with your data” - Trust requires verification
  • Unclear data retention/deletion - Your patients’ data may persist indefinitely

Lessons from DeepMind Streams

DeepMind Streams (AKI detection):

  • ❌ Collected entire medical histories (not just kidney-related data): a data minimization failure
  • ❌ Patients were not informed: a consent violation
  • ❌ No proper legal basis for data sharing
  • ❌ Result: Ruled unlawful by the UK Information Commissioner’s Office

Lesson: Privacy promises must be legally binding and technically enforced. Data minimization is mandatory, not optional.

Scoring Rubric

Rate each item (0-2 points):

| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| HIPAA Compliance | No BAA or refuses | Will sign BAA | BAA + SOC 2 + HITRUST |
| Data Minimization | Collects everything | Some minimization | Strict minimization; edge deployment option |
| Security Certifications | None | SOC 2 Type I | SOC 2 Type II + penetration testing |
| Transparency | Vague policies | Clear policies | Detailed + third-party audit |
| Data Control | Vendor retains indefinitely | Retention period defined | You control data; deletion guaranteed |

Scoring:

  • 8-10 points: Strong privacy & security
  • 5-7 points: Moderate; additional safeguards needed
  • 0-4 points: Unacceptable risk; do NOT proceed


Domain 5: Workflow Integration 🔄

The Questions to Ask

Important: Critical Workflow Integration Questions
  1. Physician Research
  2. EHR Integration
  3. Training & Support
  4. Customization
  5. Monitoring & Feedback

Red Flags 🚩

Proceed with extreme caution if:

  • “Plug and play” - Clinical medicine is complex; no system is truly plug-and-play
  • “Works with all EHRs” - Each EHR integration is custom; this claim is implausible
  • “No training needed” - Physicians always need training for clinical decision support tools
  • “One-size-fits-all” - Different hospitals have different workflows and patient populations
  • “We can implement in 2 weeks” - Unrealistic for complex clinical systems; implementation takes months
  • No physician research - Designed in isolation from actual clinical workflows
  • Minimal support - Email-only support; no phone; no dedicated account manager
  • Black box, no customization - Can’t adjust thresholds or workflows to fit your practice

Lessons from Google Health India

Google’s diabetic retinopathy AI in India:

  • ✅ 96% accuracy in the lab with research-grade cameras
  • ❌ But 55% of images were ungradable in the field with portable cameras
  • ❌ Nurses couldn’t operate the system effectively (2-hour training was inadequate)
  • ❌ 5 min/patient workflow disruption overwhelmed clinics
  • ❌ No offline mode; internet connectivity required (unreliable in rural clinics)
  • ❌ Result: Pilot abandoned

Lesson: Lab performance ≠ Field performance. Real-world workflow integration is critical.

Scoring Rubric

Rate each item (0-2 points):

| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| Physician Research | None | Some physician testing | Extensive ethnographic research with your specialty |
| EHR Integration | No integration or manual entry | Some EHR support | Native integration with YOUR specific EHR |
| Training Program | Minimal (<2 hours) | Half-day training | Comprehensive with ongoing support |
| Customization | Black box, no customization | Limited adjustments | Highly customizable to your workflows |
| Support Quality | Email only | Email + phone | Dedicated account manager + on-site clinical support |

Scoring:

  • 8-10 points: Strong workflow integration
  • 5-7 points: Moderate; expect implementation challenges
  • 0-4 points: High risk of failure; don’t proceed without extensive pilot


Domain 6: Business Viability 💼

The Questions to Ask

Important: Critical Business Viability Questions
  1. Company Stability
  2. Customer Base
  3. Product Maturity
  4. Regulatory Status
  5. Pricing & Contracts

Red Flags 🚩

Proceed with extreme caution if:

  • Early-stage startup, no revenue - High risk of going out of business; your investment lost
  • Can’t provide physician references - No one willing to vouch for them; bad sign
  • Version 1.0 product - Expect bugs and instability; you’re the beta tester
  • Vague about pricing - “It depends”; no transparency; potential for unexpected costs
  • Long-term contract with no exit clause - You’re locked in even if it doesn’t work
  • No regulatory clearance when required - Legal risk; FDA may force you to discontinue
  • Leadership with no healthcare experience - Tech team with no clinical domain expertise
  • Recent layoffs or leadership turnover - Financial instability

Lessons from IBM Watson for Oncology

IBM Watson:

  • ✅ IBM is a massive, financially stable company
  • ❌ But even IBM couldn’t make Watson for Oncology work clinically
  • ❌ Hospitals that bought in early lost millions in licensing fees
  • ❌ Time was wasted training staff on a system that was ultimately abandoned
  • ❌ IBM sold its Watson Health division (2021), acknowledging failure

Lesson: Big company ≠ Good product. Clinical validation matters more than brand name.

Scoring Rubric

Rate each item (0-2 points):

| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| Company Stability | Startup, no revenue | Funded startup or small profitable company | Established, profitable, >5 years in healthcare |
| Customer Base | <5 customers or no references | 5-20 hospital customers, some references | 20+ hospital customers, multiple physician references |
| Product Maturity | V1.0 | V2-3 | V4+ with track record |
| Regulatory | No clearance (when required) | Clearance in progress | FDA cleared/approved |
| Pricing Transparency | Vague or hidden | Somewhat clear | Fully transparent, fair terms |

Scoring:

  • 8-10 points: Financially stable, low risk
  • 5-7 points: Moderate risk; negotiate favorable contract terms
  • 0-4 points: High financial risk; consider waiting for product maturity


Putting It All Together: The Overall Evaluation Matrix

Use this to synthesize scores across all domains:

| Domain | Weight | Your Score (0-10) | Weighted Score |
|---|---|---|---|
| 1. Clinical Validation | 30% | _____ | _____ |
| 2. Patient Safety | 25% | _____ | _____ |
| 3. Fairness & Equity | 20% | _____ | _____ |
| 4. Privacy & Security | 15% | _____ | _____ |
| 5. Workflow Integration | 5% | _____ | _____ |
| 6. Business Viability | 5% | _____ | _____ |
| TOTAL | 100% | | _____ / 10 |
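The weighted total is a straightforward calculation. A sketch using the weights from the matrix (the domain scores below are placeholders, not a real evaluation):

```python
# Computing the weighted total from the evaluation matrix. Domain scores
# (0-10) below are placeholders, not a real evaluation.

WEIGHTS = {
    "Clinical Validation": 0.30,
    "Patient Safety": 0.25,
    "Fairness & Equity": 0.20,
    "Privacy & Security": 0.15,
    "Workflow Integration": 0.05,
    "Business Viability": 0.05,
}

def overall_score(domain_scores: dict[str, float]) -> float:
    """Weighted sum of six 0-10 domain scores, rounded to two decimals."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must total 100%"
    return round(sum(WEIGHTS[d] * domain_scores[d] for d in WEIGHTS), 2)

scores = {"Clinical Validation": 6, "Patient Safety": 5, "Fairness & Equity": 4,
          "Privacy & Security": 8, "Workflow Integration": 6, "Business Viability": 9}
print(overall_score(scores))  # 5.8
```

Note how heavily the weighting favors clinical validation and safety: a vendor can score perfectly on business viability and workflow and still fail overall if the evidence base is weak.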

Decision Framework

Overall Score Interpretation:

  • 8.0 - 10.0: Proceed with Pilot Deployment
    • Strong evidence across all domains
    • Still: Start with pilot in 1-2 clinical units before hospital-wide deployment
    • Monitor closely for first 6 months
    • Measure clinical outcomes, not just technical metrics
  • 6.0 - 7.9: Conditional Pilot with Mitigation
    • Identify weak domains and create mitigation plans
    • Example: Weak fairness score → Implement continuous bias monitoring
    • Pilot with intensive monitoring and frequent assessment
    • Re-evaluate after 6 months
    • Negotiate performance guarantees in contract
  • 4.0 - 5.9: Do NOT Deploy; Negotiate Improvements
    • Too many gaps in evidence
    • Go back to vendor with requirements:
      • “We need external validation study at 3+ hospitals before we’ll consider”
      • “We need fairness audit with subgroup analysis before we’ll proceed”
      • “We need prospective outcome data showing improved patient outcomes”
    • Consider waiting for product maturity
  • 0 - 3.9: Do NOT Deploy
    • Insufficient evidence
    • High risk of failure or patient harm
    • Wait for better products or invest in developing your own solution
    • Document decision for institutional records
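The four bands above map directly to a recommendation. A minimal lookup, with band edges taken from the text:

```python
# The Go/No-Go decision bands, as a simple lookup. Band edges follow the
# decision framework above.

def recommendation(overall: float) -> str:
    if not 0.0 <= overall <= 10.0:
        raise ValueError("overall score must be between 0 and 10")
    if overall >= 8.0:
        return "Proceed with pilot deployment"
    if overall >= 6.0:
        return "Conditional pilot with mitigation"
    if overall >= 4.0:
        return "Do NOT deploy; negotiate improvements"
    return "Do NOT deploy"

print(recommendation(4.5))  # Do NOT deploy; negotiate improvements
```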

Sample Questions for Vendor Meetings

Use these scripts to extract critical information:

Clinical Validation Questions

Script:

> “Can you provide the peer-reviewed publication of your external validation study? We’d like to see performance metrics at hospitals not involved in development, with results stratified by patient demographics and clinical outcomes data.”

Follow-ups if the vendor hesitates:

  • “If there’s no peer-reviewed external validation, when do you plan to conduct one?”
  • “Can you share the names of hospitals where validation occurred so we can contact physician references?”
  • “What were the clinical outcomes (mortality, complications, readmissions) at hospitals using this system?”

Fairness Questions

Script:

> “We’re committed to health equity. Can you show us the fairness audit results? Specifically, we need sensitivity, specificity, and PPV broken down by race, ethnicity, age, sex, and insurance type.”

Follow-ups:

  • “If no fairness audit has been done, why not?”
  • “What is your plan for ongoing bias monitoring after deployment?”
  • “What happens if we discover bias affecting our patient population after deployment?”

Privacy Questions

Script:

> “Walk us through exactly what patient data leaves our hospital, where it goes, how it’s stored, and how we can verify this. Can we see the Business Associate Agreement and SOC 2 Type II report?”

Follow-ups:

  • “What specific PHI elements does your system need?”
  • “Can the system work on-premise without sending data to the cloud?”
  • “What happens to our patients’ data if we terminate the contract?”
  • “Have there been any data breaches or security incidents?”

Safety Questions

Script:

> “What patient outcomes have improved at hospitals using your system? Can you provide data on mortality, complications, length of stay, readmissions, or quality of life from prospective studies?”

Follow-ups if the vendor only cites AUC or accuracy:

  • “AUC is a technical metric. Has deployment demonstrably improved patient outcomes?”
  • “What is the false positive rate in real-world clinical use?”
  • “What are the failure modes? What happens when the model fails?”
  • “Have there been any adverse events or patient harm attributed to the system?”

Workflow Integration Questions

Script:

> “Has your system been tested with physicians and nurses at hospitals like ours? What did the usability testing reveal? How much time does it add or save per patient?”

Follow-ups:

  • “What is the typical implementation timeline from contract signing to go-live?”
  • “What ongoing clinical support do you provide?”
  • “Can we speak with 3 attending physician users at other hospitals about their experience?”


Procurement Contract Language Recommendations

If you decide to proceed with a pilot, include these provisions in your contract:

1. Performance Guarantees

Vendor guarantees that the AI system will achieve the following performance metrics
at [Hospital/Health System Name] during the pilot period:

- Sensitivity ≥ [threshold]% (or other appropriate metric)
- Specificity ≥ [threshold]%
- Positive Predictive Value ≥ [threshold]%
- False positive rate ≤ [threshold]%
- Physician satisfaction ≥ [threshold]/5 (measured by survey)
- [Clinical outcome metric] improved by ≥ [threshold]% vs. baseline

If performance falls below these thresholds for [timeframe], [Hospital] may
terminate the contract without penalty and receive full refund.

2. Fairness Requirements

Vendor warrants that the AI system has undergone bias testing and demonstrates
equitable performance across patient demographic groups (race, ethnicity, age, sex,
insurance status).

Vendor will provide [Hospital] with quarterly bias audit reports showing
performance metrics (sensitivity, specificity, PPV) stratified by:
- Race/ethnicity (White, Black, Hispanic, Asian, Other)
- Age (<65, ≥65)
- Sex (Male, Female)
- Insurance (Commercial, Medicare, Medicaid, Uninsured)

If disparate impact is identified (performance difference >10% across groups),
Vendor will work with [Hospital] to mitigate bias within [timeframe] or
[Hospital] may terminate without penalty.
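The clause’s >10% disparate-impact trigger is easy to operationalize when the quarterly audit report arrives. A sketch (the subgroup values are hypothetical):

```python
# Operationalizing the clause's disparate-impact trigger: flag any metric
# whose max-min spread across subgroups exceeds 10 percentage points.
# Subgroup values below are hypothetical.

def disparate_impact(metric_by_group: dict[str, float], threshold: float = 0.10) -> bool:
    """True if the spread of a metric across groups exceeds the threshold."""
    values = metric_by_group.values()
    return (max(values) - min(values)) > threshold

sensitivity = {"White": 0.91, "Black": 0.78, "Hispanic": 0.88, "Asian": 0.90}
print(disparate_impact(sensitivity))  # True: a 13-point spread triggers mitigation
```

Run the same check on each stratified metric (sensitivity, specificity, PPV) for each stratification axis (race/ethnicity, age, sex, insurance) required by the clause.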

3. Data Privacy & Security

Vendor agrees to:
- Sign HIPAA Business Associate Agreement (BAA) prior to any PHI access
- Store all data in HIPAA-compliant infrastructure in [Country/Region]
- Encrypt data at rest (AES-256 minimum) and in transit (TLS 1.3+ minimum)
- Provide SOC 2 Type II audit report annually
- Not use [Hospital] data for Vendor's own R&D without explicit written consent
- Delete all [Hospital] patient data within 30 days of contract termination
- Provide audit logs of all data access quarterly
- Notify [Hospital] within 24 hours of any data breach or security incident

4. Clinical Validation & Monitoring

[Hospital] has the right to:
- Conduct independent validation studies of the AI system
- Publish validation results (positive or negative) in peer-reviewed journals
- Access model performance dashboards in real-time
- Receive quarterly performance reports from Vendor
- Audit Vendor's quality management system

Vendor will provide:
- Technical documentation for independent validation
- API access for performance monitoring
- Support for [Hospital]'s evaluation efforts
- Notification of any FDA adverse event reports or regulatory actions

5. Liability & Indemnification

Vendor agrees to indemnify [Hospital] for:
- Any patient harm caused by AI system errors or failures
- Regulatory fines resulting from Vendor's non-compliance (HIPAA, FDA, etc.)
- Data breaches resulting from Vendor's security failures
- Malpractice claims arising from AI system recommendations

Liability cap: $[Amount] (no less than annual contract value x 10)

Vendor maintains professional liability insurance of at least $[Amount] and will
provide certificate of insurance to [Hospital].

6. Termination Rights

[Hospital] may terminate this agreement:
- For cause (breach of contract): Immediate termination, full refund
- For convenience: 90-day notice, pro-rated refund for remaining term
- For patient safety concerns: Immediate termination if AI system poses risk
- For non-performance: If system fails to meet performance guarantees
- For regulatory action: If FDA issues warning letter or requires modification

Upon termination:
- Vendor must delete all [Hospital] patient data within 30 days
- Vendor must provide data export in standard format (CSV, FHIR, etc.)
- [Hospital] retains all rights to its data and any derivatives
- Vendor returns all documentation and provides transition assistance

Pilot Implementation Plan

Even after thorough evaluation, always start with a pilot:

Phase 1: Controlled Pilot (Months 1-3)

Scope:

  • 1-2 clinical units (e.g., one ICU, one medical floor, one outpatient clinic)
  • 20-100 patients/day
  • Intensive monitoring
  • Dedicated clinical champion

Pre-Defined Success Criteria (DECIDE BEFORE PILOT):

  • Technical: Sensitivity ≥ [X]%, Specificity ≥ [Y]%, PPV ≥ [Z]%
  • Clinical: [Primary outcome] improved by ≥ [X]% vs. baseline
  • User: Physician satisfaction ≥ 4/5, response rate ≥ 80%
  • Safety: Zero patient harm incidents
  • Equity: No performance disparities >10% across demographic groups

Metrics to Track:

  • Technical performance (sensitivity, specificity, PPV, NPV, AUC)
  • Alert burden (alerts/day, false positive rate, response time)
  • Physician experience (satisfaction, time spent per alert, override rate)
  • Workflow impact (time added/saved per patient)
  • Clinical outcomes (compare to baseline: mortality, complications, length of stay)
  • Equity impact (outcomes by race, ethnicity, age, sex, insurance)
  • Adverse events (any patient harm attributed to AI)

Go/No-Go Decision (Month 3):

  • ✅ Proceed to Phase 2 if ALL success criteria met
  • ⚠️ Iterate/adjust if most criteria met (address specific gaps)
  • ❌ Terminate if major criteria not met (don’t escalate commitment)
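The Month-3 gate can be made explicit so no one relitigates it under deployment pressure. A sketch, reading “most criteria met” as all but one (that cutoff is an assumption, not defined above; agree on your own before the pilot starts):

```python
# The Month-3 Go/No-Go gate as explicit logic. "Most criteria met" is read
# here as all but one -- that cutoff is an assumption; fix yours in advance.

def go_no_go(criteria: dict[str, bool]) -> str:
    passed = sum(criteria.values())
    if passed == len(criteria):
        return "Proceed to Phase 2"
    if passed >= len(criteria) - 1:
        return "Iterate/adjust; address specific gaps"
    return "Terminate; do not escalate commitment"

pilot = {
    "technical thresholds met": True,
    "primary outcome improved": True,
    "physician satisfaction >= 4/5": False,
    "zero patient harm": True,
    "no disparity > 10%": True,
}
print(go_no_go(pilot))  # Iterate/adjust; address specific gaps
```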

Phase 2: Expanded Pilot (Months 4-9)

Scope:

  • 5-10 units across multiple departments
  • Continue intensive monitoring
  • Broader physician engagement

Objectives:

  • Validate Phase 1 results at a larger scale
  • Test in diverse clinical settings (ICU, floor, ED, outpatient)
  • Identify implementation challenges
  • Refine workflows and alert thresholds

Phase 3: Full Deployment (Month 10+)

Scope:

  • Hospital-wide or health system-wide

Requirements:

  • Phase 2 demonstrated sustained clinical benefit
  • Physician training completed for all users
  • Ongoing monitoring system in place
  • Regular re-auditing planned (quarterly bias audits, performance monitoring)
  • Governance structure for AI oversight

Never skip the pilot phase.


Real-World Case Study: Using the Checklist

Example: Evaluating a Hypothetical Sepsis Prediction Tool

Vendor Claims:

  • “AI predicts sepsis 6 hours before clinical recognition”
  • “92% sensitivity, 87% specificity”
  • “Deployed in 200+ hospitals”
  • “$400K/year for hospital-wide license”

Your Evaluation Using This Checklist:

Domain 1: Clinical Validation (Score: 4/10)

  • ✅ Published in peer-reviewed journal
  • ❌ Internal validation only (same health system, 3 hospitals)
  • ❌ No independent external validation
  • ❌ 92% sensitivity in paper, but what about at external sites?
  • ❌ No prospective outcome data (mortality, length of stay)
  • Red flag: “Deployed in 200+ hospitals” ≠ Evidence of effectiveness (Epic sepsis model lesson!)

Domain 2: Patient Safety (Score: 3/10)

  • ✅ Safety mentioned in paper
  • ❌ No prospective outcome studies showing mortality benefit
  • ❌ No data on whether deployment reduced deaths or complications
  • ❌ False positive rate not clearly reported for real-world use
  • ❌ Alert burden unknown
  • Red flag: Only technical metrics (AUC, sensitivity), no patient outcomes

Domain 3: Fairness & Equity (Score: 2/10)

  • ❌ No fairness audit mentioned
  • ❌ No performance stratified by race/ethnicity
  • ❌ When asked, vendor says “we don’t use race as a feature, so it’s fair”
  • Red flag: Fairness through unawareness (doesn’t work!)

Domain 4: Privacy & Security (Score: 7/10)

  • ✅ Will sign BAA
  • ✅ SOC 2 Type II certified
  • ✅ Data encrypted at rest and in transit
  • ⚠️ Data stored in vendor cloud (no on-premise option)
  • ⚠️ No HITRUST certification

Domain 5: Workflow Integration (Score: 5/10)

  • ✅ Integrates with your EHR (Epic)
  • ⚠️ Implementation takes 3-6 months
  • ⚠️ Training: 2-hour online module (seems insufficient)
  • ❌ No customization; one-size-fits-all alert thresholds
  • ❌ No usability testing data shared

Domain 6: Business Viability (Score: 8/10)

  • ✅ Established company, 7 years in business
  • ✅ 200 hospital customers (they claim)
  • ✅ Willing to provide 2 physician references
  • ⚠️ Pricing seems high ($400K/year)
  • ✅ FDA 510(k) cleared

Overall Weighted Score: 4.1 / 10 (0.30×4 + 0.25×3 + 0.20×2 + 0.15×7 + 0.05×5 + 0.05×8 ≈ 4.1)

Decision: Do NOT Deploy

  • Insufficient clinical validation (internal only, no external validation)
  • No outcome evidence (AUC/sensitivity are not enough; need mortality/LOS data)
  • No fairness testing (high bias risk)
  • Workflow concerns (alert burden unknown, may cause alert fatigue)

Recommendation to Hospital Leadership:

“We evaluated [Vendor]’s sepsis prediction tool using the Clinical AI Vendor Evaluation Framework. The system scores 4.1/10, below our threshold for deployment.

Key concerns:

  • No external validation (validation only within the vendor’s own health system)
  • No evidence of improved patient outcomes (only sensitivity/specificity reported, no mortality or LOS data)
  • No fairness audit (risk of bias affecting minority patients, similar to the Epic sepsis model)
  • High cost ($400K/year) without demonstrated ROI

We recommend:

  1. Request an external validation study at 3+ independent hospitals
  2. Request a fairness audit with performance by race/ethnicity/insurance
  3. Request prospective outcome data (mortality, length of stay, time to antibiotics)
  4. Pilot at 2-3 peer hospitals before we consider deployment
  5. Re-evaluate in 12 months if the vendor addresses these gaps

Alternative: Continue using existing sepsis protocols (qSOFA, SIRS) while monitoring the field for better-validated AI systems. The $400K could fund 2 additional ICU nurses, which has proven mortality benefit.”


Summary: Key Principles for Vendor Evaluation

  1. Clinical validation is non-negotiable - External validation at 3+ hospitals required
  2. Outcomes > Accuracy - Sensitivity/specificity don’t save lives; improved patient outcomes do
  3. Demand prospective evidence - Retrospective studies tend to overstate real-world benefit; only prospective data shows whether deployment actually helps patients
  4. Fairness testing is mandatory - Bias is the default; fairness must be proven
  5. Start small, scale slowly - Pilot → Evaluate → Scale only if successful
  6. Negotiate strong contracts - Performance guarantees, termination rights, data control
  7. You can say no - Bad AI is worse than no AI; don’t deploy systems that aren’t ready

The most important lesson: You are not obligated to buy a product just because it has “AI” in the name. Demand evidence. Ask hard questions. Walk away if the evidence isn’t there.

Your patients deserve better than unvalidated AI systems.



Remember: The best AI system is one that improves patient outcomes, operates fairly, respects privacy, integrates into clinical workflows, and has strong evidence supporting its use. Don’t settle for less.