27  AI in Clinical Research and Drug Discovery

Learning Objectives

AI is transforming clinical research across every stage—from literature synthesis and trial recruitment to real-world evidence generation and drug discovery. This chapter examines AI applications throughout the research enterprise, with emphasis on methodological rigor and reproducibility. You will learn to:

  • Leverage AI tools for literature review, systematic reviews, and evidence synthesis
  • Apply AI for clinical trial recruitment, patient matching, and adaptive trial design
  • Critically evaluate AI-generated real-world evidence and comparative effectiveness studies
  • Understand AI’s role in drug discovery, target identification, and precision medicine research
  • Implement TRIPOD-AI and CONSORT-AI reporting guidelines for AI research
  • Ensure reproducibility, transparency, and ethical conduct in AI research
  • Recognize the potential and limitations of AI in advancing medical knowledge

Essential for physician-scientists, clinical researchers, and all physicians involved in translational medicine.

AI Across the Research Enterprise:

1. Literature Synthesis and Evidence Discovery:

AI-Powered Literature Search: - PubMed AI, Semantic Scholar use NLP to surface relevant papers beyond keyword matching - Reduce literature review time by 40-60% - Better capture of relevant studies (improved recall) - Examples: Elicit, Consensus, Scite.ai (Singhal et al. 2023)

Automated Systematic Reviews: - AI screens abstracts for inclusion/exclusion (90-95% accuracy) - Automated data extraction from full-text articles - Risk of bias assessment assistance - Tools: Rayyan AI, Covidence, DistillerSR - Human oversight still essential for final decisions (Nagendran et al. 2020)

Knowledge Synthesis: - AI summarizes findings across multiple studies - Identifies research gaps and emerging trends - Meta-analysis support (though not full automation) - Caution: AI may miss nuance or misinterpret study quality

2. Clinical Trial Recruitment and Design:

AI-Powered Patient Recruitment: - EHR-based identification of eligible patients - Automated screening against inclusion/exclusion criteria - Increases trial enrollment by 20-40% (Rajkomar, Dean, and Kohane 2019) - Reduces recruitment time and cost - Examples: Deep 6 AI, TriNetX, Antidote

Trial Design Optimization: - AI suggests optimal sample sizes, endpoints, stratification - Adaptive trial designs with real-time data analysis - Basket trials and master protocols for precision medicine - Simulation of trial scenarios before execution
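
The last point, simulating trial scenarios before execution, can be illustrated with a short Monte Carlo power calculation. This is a minimal sketch under hypothetical assumptions (the event rates, sample sizes, and alpha below are placeholders), not a substitute for formal biostatistical design.

```python
# Minimal Monte Carlo sketch: estimate power of a two-arm trial under
# hypothetical assumptions (not a formal sample-size calculation).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def simulated_power(n_per_arm, p_control, p_treatment, alpha=0.05, n_sims=5000):
    """Fraction of simulated trials whose two-proportion z-test reaches p < alpha."""
    rejections = 0
    for _ in range(n_sims):
        control = rng.binomial(1, p_control, n_per_arm)
        treated = rng.binomial(1, p_treatment, n_per_arm)
        # Pooled two-proportion z-test
        p_pool = (control.sum() + treated.sum()) / (2 * n_per_arm)
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / n_per_arm)
        if se == 0:
            continue
        z = (treated.mean() - control.mean()) / se
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))
        rejections += p_value < alpha
    return rejections / n_sims

# Explore hypothetical scenarios before committing to a fixed design
for n in (200, 400, 800):
    print(n, round(simulated_power(n, p_control=0.30, p_treatment=0.22), 3))
```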

⚠️ Limitations: - EHR data quality issues (missing data, errors) - Structured data preference (may miss narrative notes) - Equity concerns (AI may preferentially identify patients already engaged with healthcare system)

3. Real-World Evidence (RWE) Generation:

AI for Comparative Effectiveness: - Analysis of EHR data for treatment outcomes - Cohort construction and propensity score matching - Rapid hypothesis generation - Complement (not replace) RCTs (Beam, Manrai, and Ghassemi 2020)

⚠️ Fundamental Limitations: - Confounding: Cannot fully adjust for unmeasured confounders - Selection bias: Who gets treatment A vs. B is non-random - Data quality: Missing data, coding errors, measurement error - Causal inference: Association ≠ causation; RWE cannot replace RCTs for definitive evidence - AI doesn’t solve these problems—it scales them

4. Drug Discovery and Development:

AI Applications: - Target identification: Predict disease-relevant proteins/pathways from omics data - Lead optimization: Structure-activity relationship prediction - De novo design: Generate novel molecular structures with desired properties - Repurposing: Identify existing drugs for new indications - Clinical trial optimization: Patient stratification, endpoint selection (Schork 2019)

⚠️ Reality Check: - Hype exceeds reality; few AI-discovered drugs have reached market - Exscientia, Recursion Pharmaceuticals have drugs in trials (phase 1-2) - Traditional drug development bottlenecks remain (safety, efficacy testing) - AI accelerates early stages but doesn’t eliminate need for rigorous clinical trials - Most AI predictions fail in wet-lab validation

5. Genomics and Precision Medicine Research:

AI for Genomic Analysis: - Variant interpretation and pathogenicity prediction - Polygenic risk score development - Gene-disease association discovery - Multi-omics integration (genomics + transcriptomics + proteomics) - Pharmacogenomics (predicting drug response from genotype) (Topol 2019)

Examples: - Deep variant calling (more accurate than traditional methods) - Splice variant prediction - Non-coding variant function prediction - Drug-gene interaction prediction

Methodological Rigor for AI Research:

TRIPOD-AI Guidelines (Transparent Reporting of AI Prediction Models):

- Extension of TRIPOD for AI/ML prediction models
- Requires reporting:
  - Training data characteristics and preprocessing
  - Model architecture and hyperparameters
  - Performance metrics (with confidence intervals)
  - Validation strategy (internal, external, temporal)
  - Code availability and reproducibility (Collins et al. 2024)

CONSORT-AI Guidelines (Clinical Trials Reporting):

- Extension of CONSORT for trials evaluating AI interventions
- Requires:
  - Detailed AI system description
  - Version control and updates during trial
  - Performance monitoring
  - Human-AI interaction description
  - Analysis plan for AI-specific outcomes (Liu et al. 2020)

Best Practices: - External validation mandatory: Test on data from different institutions, time periods, populations - Prospective validation: Retrospective performance ≠ prospective performance - Code and data sharing: Publish code repositories, de-identified data when possible - Pre-registration: Register AI studies like clinical trials (prevents p-hacking, selective reporting) - Transparent limitations: Acknowledge bias, generalizability limits, failure modes

27.1 AI for Literature Synthesis and Knowledge Discovery

27.1.1 The Information Overload Problem

Medical literature grows exponentially: - 5,000+ new articles published daily in PubMed-indexed journals - Systematic reviews take 12-24 months to complete - Out of date by publication: Evidence evolves faster than review process - Physicians cannot keep up with literature in their own specialty, let alone broader medicine

AI offers tools to manage this information deluge (Singhal et al. 2023).

27.1.3 Automated Systematic Reviews

Traditional Systematic Review Process:

1. Define research question (PICO format)
2. Search multiple databases (PubMed, Embase, Cochrane, etc.)
3. Screen titles/abstracts for relevance (2 independent reviewers)
4. Full-text review of included studies
5. Data extraction
6. Risk of bias assessment
7. Meta-analysis (if appropriate)
8. Write report

Bottleneck: Steps 3-6 are labor-intensive, taking months.

AI Assistance:

1. Abstract Screening (High Maturity): - AI trained on prior systematic reviews learns inclusion/exclusion patterns - Screens abstracts, ranks by relevance - Sensitivity: 90-95% (misses 5-10% of relevant studies) - Can reduce human screening workload by 50-70% - Critical: Human review of borderline cases essential (Nagendran et al. 2020)

Tools: - Rayyan AI: Collaborative systematic review platform with AI screening - Covidence: Cochrane-affiliated platform with ML prioritization - DistillerSR: Advanced features for complex reviews - ASReview: Open-source active learning for screening
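
A minimal sketch of the prioritization idea behind screening tools such as ASReview: train a simple text classifier on reviewer-labeled abstracts, then rank the unlabeled pile so likely-includes surface first. The abstracts and labels below are hypothetical, and a TF-IDF plus logistic regression model is a deliberate simplification of what production tools use.

```python
# Sketch of AI-assisted abstract screening: rank unlabeled abstracts by predicted
# relevance so human reviewers see likely-includes first. Hypothetical data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_abstracts = [
    "RCT of drug X vs placebo for heart failure outcomes",
    "Case report of a rare dermatologic reaction",
    "Randomized trial of drug X dosing in HFrEF",
    "Review of unrelated surgical techniques",
]
labels = [1, 0, 1, 0]  # 1 = include, 0 = exclude (decided by human reviewers)

unlabeled_abstracts = [
    "Drug X and hospitalization for heart failure: a multicenter RCT",
    "Survey of nursing workflow in outpatient clinics",
]

vectorizer = TfidfVectorizer(stop_words="english")
X_labeled = vectorizer.fit_transform(labeled_abstracts)
X_unlabeled = vectorizer.transform(unlabeled_abstracts)

model = LogisticRegression().fit(X_labeled, labels)
scores = model.predict_proba(X_unlabeled)[:, 1]

# Present highest-scoring abstracts to reviewers first; humans make final calls.
for score, abstract in sorted(zip(scores, unlabeled_abstracts), reverse=True):
    print(f"{score:.2f}  {abstract}")
```

In practice this loop is repeated (active learning): each new human decision retrains the model and re-ranks the remaining abstracts, while borderline and low-scoring records still receive human review.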

2. Data Extraction (Moderate Maturity): - AI extracts structured data from full-text articles - Sample size, intervention details, outcomes, effect sizes - Accuracy: 70-90% depending on data complexity - Works best for structured data (tables, standardized reporting) - Struggles with narrative syntheses, complex interventions - Human verification required for all extracted data

3. Risk of Bias Assessment (Low Maturity): - AI can identify presence/absence of features (randomization mentioned, blinding reported) - Cannot make nuanced quality judgments (adequacy of randomization, likelihood of selective reporting) - Current tools provide “flags” for human reviewers, not autonomous assessment

4. Evidence Synthesis (Emerging): - Large language models (GPT-4, Claude) can summarize findings across studies - Generate draft GRADE evidence summaries - Caution: May miss contradictions, overstate consistency, or misinterpret statistical significance - Human expert oversight non-negotiable

Bottom Line on AI Systematic Reviews: - AI accelerates screening and data extraction (saves time) - Cannot replace human judgment on study quality and synthesis - Best use: Human-AI collaboration (AI does first pass, humans verify and finalize) - Cochrane guidance: AI acceptable for screening with human oversight, not for final decisions (Topol 2019)

27.1.4 Knowledge Graph and Trend Analysis

AI for Research Trend Identification: - Analyzes citation networks, topic modeling, and publication patterns - Identifies emerging research areas before they become mainstream - Detects “sleeping beauties” (important papers that were initially overlooked) - Predicts future research directions

Applications: - Funding agencies use AI to identify promising research areas - Researchers identify gaps in literature - Journal editors spot emerging topics for special issues - Industry tracks competitive landscape

Tools: - Dimensions AI: Research intelligence platform - Lens.org: Patent and literature linkage - ResearchRabbit: AI-powered literature exploration

27.2 AI in Clinical Trial Recruitment and Design

27.2.1 The Clinical Trial Recruitment Crisis

Problem: - 80% of clinical trials fail to meet enrollment goals on time - 30% terminate early due to inadequate enrollment - Recruitment delays cost $600,000-$8 million per day for phase III trials - Average trial takes 600 days to enroll (vs. planned 300-400 days)

Why Recruitment Fails: - Eligible patients not identified (buried in EHR) - Physicians unaware of trials at their institution - Complex eligibility criteria - Patient unwillingness to participate - Geographic barriers (Rajkomar, Dean, and Kohane 2019)

27.2.2 AI-Powered Patient Recruitment

How It Works:

1. EHR-Based Cohort Discovery: - AI screens entire EHR database for potentially eligible patients - Applies inclusion/exclusion criteria automatically - Flags candidates for physician review - Can process millions of patient records in minutes

2. Natural Language Processing: - Extracts information from clinical notes (not just structured data) - Example: “Patient reports smoking 1 pack/day” captured from free text - Finds nuanced eligibility (e.g., “failed 2 prior lines of therapy”); see the sketch after this list

3. Predictive Enrichment: - Identifies patients most likely to benefit from intervention (precision medicine trials) - Predicts likelihood of treatment response based on biomarkers, clinical features - Reduces sample size requirements (enrolls enriched population)

4. Real-Time Alerts: - AI monitors EHR for new patients matching trial criteria - Alerts research coordinators or clinicians - “This patient may be eligible for Trial X” notification in EHR
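
A minimal sketch of the cohort-discovery and free-text checks in steps 1-2 above, using pandas and a simple regular expression. The patient table, column names, and eligibility criteria are hypothetical; production systems use full clinical NLP pipelines rather than regex.

```python
# Sketch: apply structured inclusion/exclusion criteria, then a free-text check,
# and flag candidates for human (research coordinator) review. Hypothetical data.
import pandas as pd

patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "age": [58, 44, 71],
    "egfr": [65, 32, 80],
    "dx_codes": [["E11.9", "I10"], ["E11.9"], ["I10"]],
    "last_note": [
        "Patient reports smoking 1 pack/day, counseled on cessation.",
        "Former smoker, quit 10 years ago.",
        "Never smoker. No acute complaints.",
    ],
})

# Structured criteria: adults with type 2 diabetes (E11.*) and eGFR >= 45
structured_ok = (
    (patients["age"] >= 18)
    & (patients["egfr"] >= 45)
    & patients["dx_codes"].apply(lambda codes: any(c.startswith("E11") for c in codes))
)

# Free-text criterion: current smoking mentioned in the most recent note
current_smoker = patients["last_note"].str.contains(r"\bsmok(?:es|ing)\b", case=False, regex=True)

candidates = patients[structured_ok & current_smoker]
print(candidates[["patient_id", "age", "egfr"]])  # flagged for coordinator review
```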

Evidence:

TriNetX Study (2022): - AI identified 40% more eligible patients than manual methods - Reduced time to recruit by 30% - Increased diversity of enrolled population (Rajkomar, Dean, and Kohane 2019)

Deep 6 AI Implementation: - Major cancer center reduced screening time from 8 hours/week to 30 minutes/week - Enrollment increased 35% - Identified eligible patients physicians didn’t know about

Memorial Sloan Kettering Cancer Center: - AI-driven recruitment increased trial enrollment by 43% - Particularly effective for rare cancer trials with restrictive eligibility

Limitations:

EHR Data Quality: - Missing data (e.g., social history, functional status often incomplete) - Coding errors (diagnoses miscoded) - Lag time (recent lab results not yet in system)

Structured Data Bias: - AI performs best on structured data (labs, medications, diagnoses) - Less effective for subjective information (symptom severity, functional status) - May miss patients whose key information is only in free-text notes

Equity Concerns: - AI may preferentially identify patients who are highly engaged with healthcare system - Underserved populations with fragmented care may be missed - Could worsen disparities in trial enrollment if not carefully monitored (Obermeyer et al. 2019)

False Positives: - AI overpredicts eligibility (30-50% of AI-flagged patients not actually eligible on manual review) - Requires human verification (research coordinator reviews AI suggestions) - Still saves time vs. manual chart review of all patients

27.2.3 Adaptive Trial Design with AI

Traditional Trials: - Fixed sample size, randomization ratio, endpoints - Determined at trial start, cannot change - Inefficient: may continue enrolling in ineffective arm

Adaptive Trials: - Pre-specified rules for modifying trial based on accumulating data - Examples: response-adaptive randomization, sample size re-estimation, seamless phase II/III

AI’s Role:

1. Real-Time Data Monitoring: - AI analyzes interim trial data - Detects efficacy or futility signals earlier - Suggests adaptations (e.g., increase enrollment in responder subgroup)

2. Bayesian Adaptive Randomization: - AI adjusts randomization ratio based on observed outcomes - Allocates more patients to effective arm, fewer to placebo - Increases statistical power, reduces exposure to ineffective treatment (Schork 2019); see the sketch after this list

3. Biomarker-Driven Subgroup Identification: - AI identifies biomarkers predicting treatment response during trial - Adaptive enrichment: enroll more patients with favorable biomarkers - Enables precision medicine trials
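
A minimal sketch of the Bayesian response-adaptive randomization described in item 2 above, using Beta-Binomial posteriors and Thompson sampling. The response rates and enrollment size are hypothetical; real trials require pre-specified adaptation rules, type I error control, and regulatory review.

```python
# Sketch of Bayesian response-adaptive randomization (Thompson sampling):
# arms with better observed responses gradually receive more allocations.
import numpy as np

rng = np.random.default_rng(0)
true_response = {"control": 0.30, "experimental": 0.45}  # unknown in practice
successes = {arm: 1 for arm in true_response}            # Beta(1, 1) priors
failures = {arm: 1 for arm in true_response}
allocations = {arm: 0 for arm in true_response}

for _ in range(400):  # enroll 400 hypothetical patients one at a time
    # Thompson sampling: draw a plausible response rate from each arm's posterior
    draws = {arm: rng.beta(successes[arm], failures[arm]) for arm in true_response}
    arm = max(draws, key=draws.get)          # allocate to the arm with the best draw
    allocations[arm] += 1
    responded = rng.random() < true_response[arm]
    successes[arm] += responded
    failures[arm] += not responded

print(allocations)   # most patients end up on the better-performing arm
print({arm: round(successes[arm] / (successes[arm] + failures[arm]), 2)
       for arm in true_response})            # posterior mean response rates
```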

Examples: - I-SPY 2 trial (breast cancer): Adaptive platform trial using Bayesian methods; AI suggests which drugs to graduate to phase III - COVID-19 vaccine trials: Adaptive designs allowed rapid dose selection and efficacy assessment - Oncology basket trials: AI identifies biomarker-defined subgroups likely to respond

Challenges: - Regulatory complexity (FDA requires clear pre-specification of adaptation rules) - Statistical complexity (Type I error control) - Operational complexity (trial teams must be prepared to implement adaptations)

27.3 Real-World Evidence and AI

27.3.1 The Promise of RWE

Real-World Evidence (RWE):

- Evidence from real-world data (EHRs, claims, registries, wearables)
- Observational data from routine clinical practice
- Complements RCTs by providing:
  - Broader patient populations (real-world diversity vs. trial eligibility restrictions)
  - Longer follow-up
  - Comparative effectiveness (head-to-head comparisons not feasible in RCTs)
  - Faster, cheaper than RCTs

AI’s Role: - Analyze large-scale EHR data quickly - Construct matched cohorts (propensity score matching, inverse probability weighting) - Identify confounders and effect modifiers - Generate hypotheses for RCTs (Beam, Manrai, and Ghassemi 2020)

27.3.2 AI for Comparative Effectiveness Research

Typical Use Case: - Compare outcomes of patients receiving Treatment A vs. Treatment B in routine practice - AI identifies patients, extracts outcomes, adjusts for confounders - Generates comparative effectiveness estimate
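
A minimal sketch of propensity-score-based cohort construction with scikit-learn, on simulated data. The confounders, treatment model, and 1:1 nearest-neighbor matching are illustrative assumptions; real analyses require careful confounder selection, overlap diagnostics, and sensitivity analyses.

```python
# Sketch: estimate propensity scores and form a 1:1 nearest-neighbor matched
# cohort for a Treatment A vs. B comparison. Simulated data; illustration only.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "age": rng.normal(65, 10, n),
    "comorbidity_score": rng.poisson(2, n),
})
# Treatment assignment depends on measured confounders (confounding by indication)
logit = -4 + 0.05 * df["age"] + 0.3 * df["comorbidity_score"]
df["treated"] = rng.random(n) < 1 / (1 + np.exp(-logit))

# 1) Propensity score: P(treatment | measured confounders)
ps_model = LogisticRegression().fit(df[["age", "comorbidity_score"]], df["treated"])
df["ps"] = ps_model.predict_proba(df[["age", "comorbidity_score"]])[:, 1]

# 2) 1:1 nearest-neighbor matching on the propensity score
treated = df[df["treated"]]
control = df[~df["treated"]]
nn = NearestNeighbors(n_neighbors=1).fit(control[["ps"]])
_, idx = nn.kneighbors(treated[["ps"]])
matched_controls = control.iloc[idx.ravel()]

# Matched cohort = treated + nearest controls; unmeasured confounding remains.
print(len(treated), "treated matched to", len(matched_controls), "controls")
print("Mean age, treated vs matched controls:",
      round(treated["age"].mean(), 1), round(matched_controls["age"].mean(), 1))
```

Matching balances only the measured covariates; the limitations in the next subsection still apply in full.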

Example Studies:

1. Antidiabetic Medications (2020): - AI analyzed EHRs of 500,000 patients with type 2 diabetes - Compared cardiovascular outcomes across drug classes (SGLT2i, GLP-1RA, DPP4i) - Found SGLT2i associated with lower CV events (consistent with RCT data) - Generated hypothesis for new comparisons lacking RCT evidence

2. COVID-19 Treatments (2020-2021): - Rapid observational studies of dexamethasone, remdesivir, tocilizumab - AI-enabled cohort construction and outcome ascertainment - Informed clinical practice before RCT results available - Later RCTs confirmed (dexamethasone) or refuted (hydroxychloroquine) observational findings

27.3.3 Fundamental Limitations of RWE (AI Doesn’t Solve)

Confounding: - Patients receiving Treatment A differ from those receiving B (not randomized) - Measured confounders: can adjust (age, comorbidities) - Unmeasured confounders: cannot adjust (socioeconomic status, frailty, patient preferences) - AI can only adjust for what’s measured in data - Residual confounding always remains (Finlayson et al. 2021)

Selection Bias: - Who gets Treatment A vs. B is non-random - Healthier patients may get newer drugs; sicker patients get older drugs - “Confounding by indication” - Propensity scores and matching reduce but don’t eliminate bias

Measurement Error: - EHR data not collected for research (missing data, coding errors) - Outcome misclassification (e.g., cause of death not reliably captured) - Exposure misclassification (medication adherence unknown) - AI can’t create data that wasn’t collected

Causality: - Observational data shows association, not causation - Bradford Hill criteria and causal inference methods (instrumental variables, regression discontinuity) help but have strong assumptions - RWE cannot definitively prove causality; RCTs remain gold standard for causal claims

27.3.4 Best Practices for AI-Generated RWE

Transparent Methods: - Report data source, cohort construction, confounders adjusted for - Sensitivity analyses (varying analytic choices) - Acknowledge unmeasured confounding
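
Beyond varying analytic choices, one widely used sensitivity analysis for unmeasured confounding is the E-value. A minimal sketch, with hypothetical risk ratios:

```python
# Sketch: E-value for an observed risk ratio -- the minimum strength of
# association an unmeasured confounder would need with both treatment and
# outcome to fully explain away the observed association.
import math

def e_value(risk_ratio: float) -> float:
    """E-value for a point estimate; risk ratios < 1 are inverted first."""
    rr = risk_ratio if risk_ratio >= 1 else 1 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1))

# Hypothetical RWE finding: treatment associated with RR 0.75 (95% CI upper bound 0.90)
print(round(e_value(0.75), 2))  # E-value for the point estimate
print(round(e_value(0.90), 2))  # E-value for the CI limit closest to the null
```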

Validation: - Compare RWE findings to known RCT results (does RWE replicate RCT findings?) - External validation in independent datasets

Appropriate Claims: - Avoid causal language (“Treatment A causes better outcomes”) - Use associational language (“Treatment A was associated with better outcomes, after adjusting for measured confounders”) - Acknowledge limitations

Hypothesis Generation: - RWE best used to generate hypotheses for RCTs, not replace them - Inform trial design (endpoints, subgroups, sample size) - Identify promising signals worth testing rigorously (Topol 2019)

27.4 AI in Drug Discovery and Development

27.4.1 The Drug Development Crisis

Traditional Drug Development:

- 10-15 years from target identification to FDA approval
- $2.6 billion average cost per approved drug (including failures)
- 90% of drugs fail in clinical trials (mostly phase II/III)
- High failure rate due to:
  - Wrong target (disease mechanism misunderstood)
  - Poor pharmacokinetics (drug doesn’t reach target)
  - Toxicity (unforeseen side effects)
  - Lack of efficacy (doesn’t work in humans)

AI promises to accelerate early stages and reduce attrition (Schork 2019).

27.4.2 AI Applications in Drug Discovery

1. Target Identification:

- Goal: Find disease-relevant proteins/genes to drug
- AI Approach:
  - Integrate multi-omics data (genomics, transcriptomics, proteomics)
  - Network analysis (protein-protein interaction networks)
  - Predict which targets are “druggable” and disease-relevant
- Examples:
  - BenevolentAI identified baricitinib (JAK inhibitor) for COVID-19 by AI target analysis
  - Recursion Pharmaceuticals uses AI on cellular imaging to identify disease mechanisms

2. Lead Optimization:

- Goal: Optimize molecular structure for potency, selectivity, pharmacokinetics
- AI Approach:
  - Structure-activity relationship (SAR) modeling (see the sketch after this list)
  - Predict binding affinity, solubility, toxicity from molecular structure
  - Generative models suggest chemical modifications
- Examples:
  - Atomwise uses deep learning for virtual screening (tests millions of compounds computationally)
  - Insilico Medicine’s AI-designed drug for idiopathic pulmonary fibrosis (phase II trial)

3. De Novo Molecule Design:

- Goal: Generate entirely novel molecular structures with desired properties
- AI Approach:
  - Generative adversarial networks (GANs), variational autoencoders (VAEs)
  - AI “dreams up” molecules that don’t exist yet
  - Filter for drug-like properties, synthesizability
- Examples:
  - Exscientia designed a drug for obsessive-compulsive disorder (first AI-designed drug to reach clinical trials)
  - Generate Biomedicines uses AI for protein therapeutics

4. Drug Repurposing:

- Goal: Identify new indications for existing drugs
- AI Approach:
  - Network analysis (drug-disease-gene relationships)
  - Phenotypic screening data
  - Real-world data mining (off-label use patterns)
- Examples:
  - BenevolentAI: baricitinib for COVID-19
  - Multiple repurposing efforts for cancer (AI identifies oncology drugs for new tumor types)

5. Predictive Toxicology:

- Goal: Predict adverse effects before animal/human testing
- AI Approach:
  - Models trained on toxicity databases (ToxCast, Tox21)
  - Predict hepatotoxicity, cardiotoxicity, genotoxicity from structure
  - Reduces animal testing, catches problems earlier
- Accuracy: Moderate (70-80% for some endpoints); cannot replace in vivo testing yet
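
A minimal sketch of the structure-to-property idea behind SAR and predictive-toxicology models (items 2 and 5 above), assuming RDKit and scikit-learn are installed. The SMILES strings and assay labels are hypothetical toy data, and the fingerprint/random-forest pipeline is a simplification of what production models use; API details may differ across RDKit versions.

```python
# Sketch: predict a binary property (e.g., an assay/toxicity flag) from Morgan
# fingerprints of molecular structures. Toy SMILES and labels; illustration only.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def morgan_features(smiles: str, n_bits: int = 1024) -> np.ndarray:
    """ECFP-like bit vector for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    arr = np.zeros((n_bits,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

# Hypothetical training set: 1 = flagged in an assay, 0 = not flagged
train_smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O",
                "CCN(CC)CC", "C1CCCCC1", "O=C(N)c1ccccc1"]
train_labels = [0, 1, 0, 0, 1, 0]

X = np.vstack([morgan_features(s) for s in train_smiles])
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, train_labels)

# Score a new candidate structure; wet-lab validation is still required
candidate = "CC(C)Cc1ccc(cc1)C(C)C(=O)O"  # ibuprofen, used only as an example structure
prob = model.predict_proba(morgan_features(candidate).reshape(1, -1))[0, 1]
print(f"Predicted probability of assay flag: {prob:.2f}")
```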

27.4.3 Reality Check: Hype vs. Progress

Hype: - “AI will reduce drug development time to 1-2 years” - “AI will design perfect drugs with no side effects” - “AI will eliminate need for clinical trials”

Reality:

- Few AI-discovered drugs have reached market (as of 2024):
  - Exscientia and Insilico drugs in phase I-II trials
  - None approved yet (but promising early data)
- AI accelerates early stages (target ID, lead optimization) but not clinical trials (still years)
- Wet-lab validation required: Most AI predictions fail when tested in lab
  - Only 10-30% of AI-predicted molecules have desired activity in assays
  - Still better than random screening, but far from perfect
- Clinical trial bottleneck remains: Safety and efficacy testing still takes years; AI doesn’t change this
- Long-term view promising: AI improving rapidly; will have significant impact over next decade (Schork 2019)

27.4.4 Challenges in AI Drug Discovery

Data Limitations: - Drug discovery data is sparse (millions of possible molecules, data on only thousands) - Negative data (compounds that failed) often unpublished - AI models extrapolate from limited data

Biological Complexity: - Human disease is multifactorial (AI trained on single-target assays) - Pharmacokinetics hard to predict (absorption, distribution, metabolism, excretion) - Off-target effects and polypharmacology

Validation Gap: - AI predictions are computational; require wet-lab validation - Many academic AI drug discovery papers don’t validate in lab - “Garbage in, garbage out”: low-quality training data = poor predictions

Regulatory Uncertainty: - FDA hasn’t approved an AI-designed drug yet (regulatory pathway unclear) - Will the AI design process require disclosure? - Who is liable if an AI-designed drug causes harm?

27.5 AI in Genomics and Precision Medicine Research

27.5.1 Genomic Variant Interpretation

Challenge: - Whole genome sequencing generates 3-4 million variants per individual - 99.9% are common variants (not disease-causing) - Identifying the 1-10 variants causing disease = needle in haystack

AI for Variant Pathogenicity Prediction:

- Models trained on ClinVar (database of known pathogenic variants)
- Predict whether novel variant is benign or pathogenic
- Features: conservation across species, protein structure impact, population frequency
- Examples:
  - PrimateAI: Deep learning model, 88% accuracy for pathogenic variant prediction
  - SpliceAI: Predicts impact on RNA splicing (high accuracy for splice variants)
  - AlphaMissense: DeepMind model predicts missense variant effects
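
A minimal sketch of the tabular-features version of this idea (production tools such as PrimateAI and SpliceAI use deep sequence models instead). The feature names, values, and labels below are hypothetical.

```python
# Sketch: classify variants as pathogenic vs. benign from simple annotation
# features (conservation, population frequency, predicted protein impact).
# Toy data; real classifiers train on curated sets such as ClinVar.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

training = pd.DataFrame({
    "conservation_score": [0.95, 0.10, 0.88, 0.20, 0.99, 0.05],
    "gnomad_allele_freq": [0.00001, 0.12, 0.0, 0.05, 0.0, 0.30],
    "protein_impact":     [1, 0, 1, 0, 1, 0],   # 1 = missense/truncating, 0 = synonymous
    "label":              [1, 0, 1, 0, 1, 0],   # 1 = pathogenic, 0 = benign (expert-curated)
})

features = ["conservation_score", "gnomad_allele_freq", "protein_impact"]
clf = GradientBoostingClassifier(random_state=0).fit(training[features], training["label"])

# Score a variant of uncertain significance (VUS); a geneticist makes the final call
vus = pd.DataFrame([[0.91, 0.0, 1]], columns=features)
print(f"Predicted probability pathogenic: {clf.predict_proba(vus)[0, 1]:.2f}")
```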

Clinical Use: - AI assists geneticists in interpreting variants of uncertain significance (VUS) - Reduces time to diagnosis for rare diseases - Still requires human expert review: AI provides prediction, geneticist makes final call (Topol 2019)

27.5.2 Polygenic Risk Scores (PRS)

Goal: Predict disease risk from genome-wide common variants

AI Approach: - Integrate hundreds to millions of variants - Weight each variant by effect size - Aggregate into risk score - Machine learning optimizes weighting and feature selection
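
The core PRS computation is an effect-size-weighted sum of allele dosages, as in the minimal sketch below. The variant IDs, weights, and genotypes are hypothetical; real scores use genome-wide variant panels, quality control, and ancestry-aware calibration.

```python
# Sketch: polygenic risk score as the effect-size-weighted sum of allele dosages.
import numpy as np
import pandas as pd

# Per-variant effect sizes (e.g., log odds ratios from a GWAS) -- hypothetical
weights = pd.Series({"rs0001": 0.12, "rs0002": -0.05, "rs0003": 0.08, "rs0004": 0.02})

# Genotype dosages (0/1/2 copies of the effect allele) per individual -- hypothetical
genotypes = pd.DataFrame(
    [[0, 1, 2, 1],
     [2, 2, 1, 0],
     [1, 0, 0, 2]],
    index=["person_A", "person_B", "person_C"],
    columns=weights.index,
)

raw_prs = genotypes @ weights                       # weighted sum per person
z_prs = (raw_prs - raw_prs.mean()) / raw_prs.std()  # standardize within the cohort
print(pd.DataFrame({"raw_prs": raw_prs, "z_score": z_prs}))
```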

Examples: - Coronary artery disease PRS: Identifies individuals with 3-5x increased risk - Breast cancer PRS: Comparable to BRCA mutations for risk stratification - Type 2 diabetes PRS: Predicts lifetime risk, informs prevention strategies

Clinical Applications: - Screening (identify high-risk individuals for closer monitoring) - Prevention (statin therapy for high CAD PRS) - Clinical trials (enrich for high-risk participants)

Limitations: - Ancestry bias: PRS developed in European populations perform poorly in non-European populations - Modest predictive value: Most PRS explain <10% of disease variance - Ethical concerns: Risk of genetic discrimination (insurance, employment) (Obermeyer et al. 2019)

27.5.3 Multi-Omics Integration

Challenge: - Integrate genomics + transcriptomics + proteomics + metabolomics + imaging - Traditional statistical methods struggle with high-dimensional multi-omics data

AI Approach: - Deep learning integrates multiple data modalities - Identifies molecular signatures of disease - Predicts drug response based on multi-omics profile
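
A minimal sketch of late-fusion deep learning for two omics modalities, assuming PyTorch. The layer sizes, feature dimensions, and random data are hypothetical placeholders for real expression and proteomic matrices.

```python
# Sketch: encode each omics modality separately, concatenate the embeddings,
# and predict an outcome (e.g., treatment response). Random data; illustration only.
import torch
import torch.nn as nn

class MultiOmicsNet(nn.Module):
    def __init__(self, n_genes=1000, n_proteins=200, embed_dim=32):
        super().__init__()
        self.gene_encoder = nn.Sequential(nn.Linear(n_genes, embed_dim), nn.ReLU())
        self.protein_encoder = nn.Sequential(nn.Linear(n_proteins, embed_dim), nn.ReLU())
        self.head = nn.Linear(2 * embed_dim, 1)  # logit for a binary outcome

    def forward(self, expression, proteomics):
        fused = torch.cat([self.gene_encoder(expression),
                           self.protein_encoder(proteomics)], dim=1)
        return self.head(fused).squeeze(1)

torch.manual_seed(0)
model = MultiOmicsNet()
expression = torch.randn(64, 1000)   # hypothetical RNA-seq features for 64 patients
proteomics = torch.randn(64, 200)    # hypothetical proteomic features
labels = torch.randint(0, 2, (64,)).float()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for epoch in range(5):               # tiny training loop for illustration
    optimizer.zero_grad()
    loss = loss_fn(model(expression, proteomics), labels)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```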

Applications: - Cancer subtyping: Identify molecular subtypes beyond histology - Drug response prediction: Predict which cancer patients respond to immunotherapy - Disease mechanism discovery: Reveal pathways linking genetic variants to disease

Examples: - The Cancer Genome Atlas (TCGA): AI analysis identified novel cancer subtypes with distinct prognoses - Pharmacogenomics: AI predicts warfarin dose from genetic + clinical data (better than clinical algorithms) (Schork 2019)

27.6 Methodological Rigor and Reporting Standards

27.6.1 The Reproducibility Crisis in AI Research

Problem:

- Many AI studies cannot be reproduced
- Reasons:
  - Code not shared
  - Data not available (privacy concerns)
  - Insufficient methodological detail
  - Overfitting (model works on training data, fails on new data)
  - Publication bias (only positive results published)

27.6.2 TRIPOD-AI Guidelines

TRIPOD: Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis

TRIPOD-AI: Extension for AI/ML models (Collins et al. 2024)

Key Requirements:

1. Title and Abstract: - Clearly state AI/ML model is used - Report key performance metrics

2. Introduction: - Research question and rationale - Existing prediction models

3. Methods - Data: - Data source (EHR, registry, trial) - Eligibility criteria - Sample size - Missing data handling - Data preprocessing (normalization, imputation)

4. Methods - Model: - Model type (random forest, neural network, etc.) - Hyperparameters and tuning process - Training/validation/test split - Feature selection method - Software and version

5. Results - Performance: - Discrimination (AUROC, C-statistic) - Calibration (observed vs. predicted outcomes) - Performance by subgroups (age, sex, race) - Confidence intervals for all metrics (see the bootstrap sketch after this list)

6. Results - Validation: - Internal validation (cross-validation, bootstrap) - External validation (independent dataset from different institution/time period) - Temporal validation (trained on old data, tested on new data)

7. Discussion: - Limitations (bias, generalizability, missing data) - Clinical implications - Comparison to existing models

8. Supplementary Materials: - Code availability (GitHub, Zenodo) - Model parameters (for reproducibility) - Data availability statement (de-identified data if possible)
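
Item 5 above asks for confidence intervals around every metric. A minimal percentile-bootstrap sketch for AUROC with scikit-learn, using hypothetical random labels and risk scores:

```python
# Sketch: percentile bootstrap confidence interval for AUROC, the kind of
# uncertainty reporting TRIPOD-AI expects. Random predictions; illustration only.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, 500)                              # hypothetical outcomes
y_pred = np.clip(y_true * 0.3 + rng.random(500) * 0.7, 0, 1)  # hypothetical risk scores

point_estimate = roc_auc_score(y_true, y_pred)

boot_aucs = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), len(y_true))    # resample with replacement
    if len(np.unique(y_true[idx])) < 2:                # need both classes in the sample
        continue
    boot_aucs.append(roc_auc_score(y_true[idx], y_pred[idx]))

lower, upper = np.percentile(boot_aucs, [2.5, 97.5])
print(f"AUROC {point_estimate:.3f} (95% CI {lower:.3f}-{upper:.3f})")
```

The same resampling loop can wrap other metrics; for calibration, scikit-learn's calibration_curve provides the observed-vs-predicted bins typically plotted alongside discrimination.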

27.6.3 CONSORT-AI Guidelines

CONSORT: Consolidated Standards of Reporting Trials

CONSORT-AI: Extension for trials of AI interventions (Liu et al. 2020)

Key Requirements:

1. AI Intervention Description: - Name, version, manufacturer - Intended use (diagnosis, treatment recommendation, etc.) - FDA clearance status - Training data characteristics - Model architecture

2. Human-AI Interaction: - How clinicians use AI (decision support, autonomous, etc.) - Training provided to clinicians - Ability to override AI

3. AI System Updates: - Was AI updated during trial? - Version control - Performance monitoring during trial

4. Outcome Assessment: - AI performance metrics (in addition to clinical outcomes) - Subgroup performance (by demographics, disease severity)

5. Blinding: - Was AI output blinded? - Were outcome assessors blinded to AI group?

6. Statistical Analysis: - Plan for AI-specific outcomes - Handling of AI errors or failures - Prespecified subgroups

27.6.4 Best Practices for Reproducible AI Research

Code Sharing: - Publish code on GitHub, Zenodo, or similar platform - Include dependencies, environment specifications - Document code clearly - Provide example data (de-identified or synthetic)

External Validation: - Test on data from different institution (geographic validation) - Test on data from different time period (temporal validation) - Test on different patient populations (demographic validation) - Report performance stratified by key subgroups (Nagendran et al. 2020)
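
A minimal sketch of subgroup-stratified performance reporting, the kind of breakdown external validation reports should include. The predictions, labels, and site/sex columns are hypothetical.

```python
# Sketch: report discrimination stratified by validation site and by subgroup,
# rather than a single pooled number. Hypothetical data; illustration only.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n = 1200
results = pd.DataFrame({
    "y_true": rng.integers(0, 2, n),
    "site": rng.choice(["hospital_A", "hospital_B", "hospital_C"], n),
    "sex": rng.choice(["female", "male"], n),
})
# Hypothetical model scores, made slightly informative for the illustration
results["y_score"] = np.clip(0.3 * results["y_true"] + 0.7 * rng.random(n), 0, 1)

for column in ("site", "sex"):
    print(f"\nAUROC by {column}:")
    for group, rows in results.groupby(column):
        auc = roc_auc_score(rows["y_true"], rows["y_score"])
        print(f"  {group}: {auc:.3f} (n={len(rows)})")
```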

Pre-Registration: - Register study protocol before analysis (ClinicalTrials.gov, OSF) - Pre-specify analysis plan, outcomes, subgroups - Prevents p-hacking and selective reporting

Transparent Limitations: - Acknowledge bias (selection bias, measurement bias, missing data) - Discuss generalizability limits (which populations, settings) - Describe failure modes (when does model perform poorly?) - Avoid overstating clinical utility

Ethical Review: - IRB approval for human subjects research - Data use agreements - Address privacy and consent - Equity and fairness analysis (performance by demographics)

27.7 Clinical Bottom Line

🎯 Key Takeaways

AI is Transforming Clinical Research—But With Important Caveats:

What Works Well:

✅ Literature synthesis: AI accelerates abstract screening, data extraction (saves 40-60% of time); human oversight required for quality decisions
✅ Trial recruitment: AI increases enrollment by 20-40% via EHR-based patient identification; addresses major bottleneck
✅ Genomic analysis: AI improves variant interpretation, polygenic risk scores; advancing precision medicine
✅ Hypothesis generation: AI identifies patterns in large datasets, generates research questions for rigorous testing

What Doesn’t Work (Yet):

❌ Replacing RCTs with RWE: AI can analyze observational data at scale, but cannot overcome confounding and causality limitations; RWE complements, not replaces, RCTs
❌ Autonomous drug discovery: AI accelerates early stages, but most AI-predicted drugs fail in wet-lab validation; no AI-designed drugs approved yet (though several in trials)
❌ Fully automated research: AI assists but cannot replace human judgment in study design, quality assessment, interpretation

Methodological Imperatives:

1. External validation mandatory: Test on independent data from different institutions, time periods, populations
2. Use reporting guidelines: TRIPOD-AI for prediction models, CONSORT-AI for trials
3. Share code and data: Enable reproducibility (within privacy constraints)
4. Transparent limitations: Acknowledge bias, generalizability limits, ethical concerns
5. Pre-register studies: Prevent selective reporting and p-hacking

The Future: - AI will increasingly accelerate research pipeline from discovery to clinical translation - Human-AI collaboration model: AI handles scale and speed, humans provide judgment and creativity - Reproducibility and transparency are essential for trust - Equity considerations: ensure AI research benefits all populations, not just those with abundant data

For Physician-Scientists: - Learn AI basics (collaborate with data scientists effectively) - Maintain methodological rigor (AI doesn’t excuse poor study design) - Prioritize external validation (don’t trust in-sample performance) - Advocate for open science (code sharing, data sharing, pre-registration)

The Promise: AI has genuine potential to accelerate medical discovery and improve patient outcomes. Realizing this potential requires rigorous methodology, transparency, and humility about current limitations. Hype-free, evidence-based evaluation is essential.

References