Clinical Research with AI
Literature reviews that took weeks now take hours. Clinical trials recruit patients 20-40% faster with AI-powered matching. AI analyzes EHR data for comparative effectiveness in days instead of months. These tools accelerate research at every stage, from automated systematic reviews to drug discovery, but they also scale existing methodological problems. This chapter shows physician-researchers which AI tools work, which don’t, and how to maintain scientific rigor.
After reading this chapter, you will be able to:
- Leverage AI tools for literature review, systematic reviews, and evidence synthesis
- Apply AI for clinical trial recruitment, patient matching, and adaptive trial design
- Critically evaluate AI-generated real-world evidence and comparative effectiveness studies
- Understand AI’s role in drug discovery, target identification, and precision medicine research
- Implement TRIPOD-AI and CONSORT-AI reporting guidelines for AI research
- Ensure reproducibility, transparency, and ethical conduct in AI research
- Recognize the potential and limitations of AI in advancing medical knowledge
AI for Literature Synthesis and Knowledge Discovery
The Information Overload Problem
Medical literature grows exponentially:
- 5,000+ new articles are published daily in PubMed-indexed journals
- Systematic reviews take 12-24 months to complete
- Reviews are out of date by publication: evidence evolves faster than the review process
- Physicians cannot keep up with the literature in their own specialty, let alone broader medicine
AI offers tools to manage this information deluge (Singhal et al., 2023).
AI-Powered Literature Search
Traditional Search Limitations:
- Keyword-based (PubMed, Google Scholar)
- Requires perfect query formulation
- Misses relevant articles that use different terminology
- High sensitivity (finds many articles) but low precision (many irrelevant)
AI-Enhanced Search:
1. Semantic Search:
   - Understands meaning, not just keywords
   - Finds papers about X, not just papers containing the word “X”
   - Examples:
     - Semantic Scholar: AI-powered search engine for scientific literature
     - PubMed Best Match: uses machine learning to rank results by relevance
     - Elicit: AI research assistant that answers questions using the literature
   - A toy illustration of embedding-based search appears after this list
2. Citation Network Analysis:
   - AI maps relationships between papers (who cites whom)
   - Identifies seminal papers and research trends
   - Tools: Connected Papers, Inciteful
3. Question-Answering Systems:
   - Ask natural-language questions, get answers with citations
   - Examples:
     - Consensus: “What are the effects of X on Y?” returns a synthesized answer from multiple studies
     - Scite.ai: shows how a paper has been cited (supporting vs. contrasting evidence)
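To make the contrast with keyword search concrete, here is a minimal sketch of embedding-based retrieval. It assumes the open-source sentence-transformers library and the example model all-MiniLM-L6-v2; production systems such as Semantic Scholar use far larger pipelines, so treat this as an illustration of the idea, not any vendor’s implementation.

```python
# Minimal sketch of embedding-based semantic search over abstracts.
# Assumes the sentence-transformers package; model choice is illustrative.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # example open model

abstracts = [
    "SGLT2 inhibitors reduce cardiovascular events in type 2 diabetes.",
    "A qualitative study of nurse staffing ratios in rural hospitals.",
    "Empagliflozin lowers heart failure hospitalizations in diabetic patients.",
]
query = "Do glucose-lowering drugs protect the heart?"

# Embed the query and abstracts into the same vector space.
doc_vecs = model.encode(abstracts, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

# Cosine similarity = dot product of normalized vectors.
scores = doc_vecs @ query_vec
for rank in np.argsort(-scores):
    print(f"{scores[rank]:.2f}  {abstracts[rank]}")
```

Note that the top-ranked abstracts share almost no keywords with the query; the match happens in meaning space, which is exactly what keyword search misses.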
Evidence:
- AI search reduces time to find relevant articles by 40-60%
- Improved recall (finds more relevant papers)
- Trade-off: may surface papers missed by traditional search, but also false positives
- Human expert review is still essential (Rajkomar et al., 2019)
Automated Systematic Reviews
Traditional Systematic Review Process:
1. Define research question (PICO format)
2. Search multiple databases (PubMed, Embase, Cochrane, etc.)
3. Screen titles/abstracts for relevance (2 independent reviewers)
4. Full-text review of included studies
5. Data extraction
6. Risk of bias assessment
7. Meta-analysis (if appropriate)
8. Write report
Bottleneck: Steps 3-6 are labor-intensive, taking months.
AI Assistance:
1. Abstract Screening (High Maturity):
   - AI trained on prior systematic reviews learns inclusion/exclusion patterns
   - Screens abstracts, ranks by relevance
   - Sensitivity: 90-95% (misses 5-10% of relevant studies)
   - Can reduce human screening workload by 50-70%
   - Critical: human review of borderline cases is essential (Nagendran et al., 2020)
   - A toy version of the underlying active-learning loop appears after this list
Tools:
- Rayyan AI: collaborative systematic review platform with AI screening
- Covidence: Cochrane-affiliated platform with ML prioritization
- DistillerSR: advanced features for complex reviews
- ASReview: open-source active learning for screening
2. Data Extraction (Moderate Maturity):
   - AI extracts structured data from full-text articles
   - Sample size, intervention details, outcomes, effect sizes
   - Accuracy: 70-90%, depending on data complexity
   - Works best for structured data (tables, standardized reporting)
   - Struggles with narrative syntheses and complex interventions
   - Human verification required for all extracted data
3. Risk of Bias Assessment (Low Maturity):
   - AI can identify the presence or absence of features (randomization mentioned, blinding reported)
   - Cannot make nuanced quality judgments (adequacy of randomization, likelihood of selective reporting)
   - Current tools provide “flags” for human reviewers, not autonomous assessment
4. Evidence Synthesis (Emerging):
   - Large language models (GPT-4, Claude) can summarize findings across studies
   - Generate draft GRADE evidence summaries
   - Caution: may miss contradictions, overstate consistency, or misinterpret statistical significance
   - Human expert oversight is non-negotiable
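Screening tools such as ASReview implement variants of active learning. The toy loop below, a sketch assuming scikit-learn and a hypothetical `screening_round` helper of our own, shows the core idea: train on the labels collected so far, rank the unscreened pool by predicted relevance, and send the highest-ranked abstracts back to the human reviewer.

```python
# Toy active-learning screening loop in the spirit of ASReview.
# Illustrative only; real tools add stopping rules and audit trails.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def screening_round(abstracts, labels):
    """labels: 1=include, 0=exclude, None=not yet screened.
    The seed labels must contain at least one include and one exclude."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(abstracts)
    labeled = [i for i, y in enumerate(labels) if y is not None]
    pool = [i for i, y in enumerate(labels) if y is None]
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[labeled], [labels[i] for i in labeled])
    # Rank the unscreened pool by predicted probability of inclusion.
    p_include = clf.predict_proba(X[pool])[:, 1]
    return [pool[i] for i in np.argsort(-p_include)]

# Each round, the reviewer labels the top-ranked abstracts and the model
# retrains; borderline scores are exactly where human review matters most.
```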
Bottom Line on AI Systematic Reviews:
- AI accelerates screening and data extraction (saves time)
- It cannot replace human judgment on study quality and synthesis
- Best use: human-AI collaboration (AI does the first pass, humans verify and finalize)
- Cochrane guidance: AI is acceptable for screening with human oversight, not for final decisions (Topol, 2019)
Knowledge Graph and Trend Analysis
AI for Research Trend Identification:
- Analyzes citation networks, topic modeling, and publication patterns
- Identifies emerging research areas before they become mainstream
- Detects “sleeping beauties” (important papers that were initially overlooked)
- Predicts future research directions
Applications:
- Funding agencies use AI to identify promising research areas
- Researchers identify gaps in the literature
- Journal editors spot emerging topics for special issues
- Industry tracks the competitive landscape
Tools:
- Dimensions AI: research intelligence platform
- Lens.org: patent and literature linkage
- ResearchRabbit: AI-powered literature exploration
AI in Clinical Trial Recruitment and Design
The Clinical Trial Recruitment Crisis
Problem:
- 80% of clinical trials fail to meet enrollment goals on time
- 30% terminate early due to inadequate enrollment
- Recruitment delays cost $600,000-$8 million per day for phase III trials
- The average trial takes 600 days to enroll (vs. a planned 300-400 days)
Why Recruitment Fails:
- Eligible patients are not identified (buried in the EHR)
- Physicians are unaware of trials at their institution
- Complex eligibility criteria
- Patient unwillingness to participate
- Geographic barriers (Rajkomar et al., 2019)
AI-Powered Patient Recruitment
How It Works:
1. EHR-Based Cohort Discovery:
   - AI screens the entire EHR database for potentially eligible patients
   - Applies inclusion/exclusion criteria automatically
   - Flags candidates for physician review
   - Can process millions of patient records in minutes (a simplified version appears after this list)
2. Natural Language Processing:
   - Extracts information from clinical notes (not just structured data)
   - Example: “Patient reports smoking 1 pack/day” captured from free text
   - Finds nuanced eligibility (e.g., “failed 2 prior lines of therapy”)
3. Predictive Enrichment:
   - Identifies patients most likely to benefit from the intervention (precision medicine trials)
   - Predicts likelihood of treatment response based on biomarkers and clinical features
   - Reduces sample size requirements (enrolls an enriched population)
4. Real-Time Alerts:
   - AI monitors the EHR for new patients matching trial criteria
   - Alerts research coordinators or clinicians
   - “This patient may be eligible for Trial X” notification in the EHR
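As a simplified illustration of cohort discovery, the pandas sketch below applies structured criteria and then a crude free-text check. The column names and the regex are hypothetical stand-ins; real systems run full clinical NLP over millions of records rather than a single pattern over three.

```python
# Simplified cohort-discovery pass over an EHR extract.
# Columns (age, egfr, on_insulin, notes) are hypothetical.
import re
import pandas as pd

patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "age": [54, 71, 38],
    "egfr": [62, 41, 95],
    "on_insulin": [False, True, False],
    "notes": [
        "Patient reports smoking 1 pack/day.",
        "Former smoker, quit 2005. Failed 2 prior lines of therapy.",
        "Never smoker.",
    ],
})

# Structured inclusion/exclusion criteria.
eligible = patients[(patients.age >= 40) & (patients.egfr >= 45) & ~patients.on_insulin]

# Free-text criterion: current smoker, captured only in the notes.
smoker = eligible.notes.str.contains(r"smoking \d+ pack", flags=re.IGNORECASE)
flagged = eligible[smoker]

print(flagged.patient_id.tolist())  # candidates flagged for coordinator review
```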
Evidence:
TriNetX Study (2022):
- AI identified 40% more eligible patients than manual methods
- Reduced time to recruit by 30%
- Increased diversity of the enrolled population (Rajkomar et al., 2019)
Deep 6 AI Implementation:
- A major cancer center reduced screening time from 8 hours/week to 30 minutes/week
- Enrollment increased 35%
- Identified eligible patients physicians didn’t know about
Memorial Sloan Kettering Cancer Center:
- AI-driven recruitment increased trial enrollment by 43%
- Particularly effective for rare cancer trials with restrictive eligibility
Limitations:
EHR Data Quality:
- Missing data (e.g., social history and functional status are often incomplete)
- Coding errors (diagnoses miscoded)
- Lag time (recent lab results not yet in the system)
Structured Data Bias:
- AI performs best on structured data (labs, medications, diagnoses)
- Less effective for subjective information (symptom severity, functional status)
- May miss patients whose key information is only in free-text notes
Equity Concerns:
- AI may preferentially identify patients who are highly engaged with the healthcare system
- Underserved populations with fragmented care may be missed
- Could worsen disparities in trial enrollment if not carefully monitored (Obermeyer et al., 2019)
False Positives:
- AI overpredicts eligibility (30-50% of AI-flagged patients are not actually eligible on manual review)
- Requires human verification (a research coordinator reviews AI suggestions)
- Still saves time vs. manual chart review of all patients
Adaptive Trial Design with AI
Traditional Trials:
- Fixed sample size, randomization ratio, and endpoints
- Determined at trial start; cannot change
- Inefficient: may continue enrolling in an ineffective arm
Adaptive Trials:
- Pre-specified rules for modifying the trial based on accumulating data
- Examples: response-adaptive randomization, sample size re-estimation, combined phase II/III designs
AI’s Role:
1. Real-Time Data Monitoring:
   - AI analyzes interim trial data
   - Detects efficacy or futility signals earlier
   - Suggests adaptations (e.g., increase enrollment in a responder subgroup)
2. Bayesian Adaptive Randomization:
   - AI adjusts the randomization ratio based on observed outcomes
   - Allocates more patients to the effective arm, fewer to placebo
   - Increases statistical power and reduces exposure to ineffective treatment (Schork, 2019); a toy simulation appears after this list
3. Biomarker-Driven Subgroup Identification:
   - AI identifies biomarkers predicting treatment response during the trial
   - Adaptive enrichment: enroll more patients with favorable biomarkers
   - Enables precision medicine trials
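Here is a minimal simulation of Bayesian response-adaptive randomization using Thompson sampling with Beta-Bernoulli posteriors. The response rates are invented for the simulation; an actual trial would pre-specify burn-in periods, allocation caps, and type I error control.

```python
# Toy Thompson-sampling sketch of Bayesian adaptive randomization
# for a two-arm trial with a binary outcome.
import numpy as np

rng = np.random.default_rng(0)
true_response = {"control": 0.30, "treatment": 0.45}  # unknown in practice
successes = {"control": 0, "treatment": 0}
failures = {"control": 0, "treatment": 0}
allocations = {"control": 0, "treatment": 0}

for _ in range(400):
    # Sample a plausible response rate for each arm from its Beta posterior.
    draws = {
        arm: rng.beta(1 + successes[arm], 1 + failures[arm])
        for arm in true_response
    }
    arm = max(draws, key=draws.get)          # randomize toward the better draw
    allocations[arm] += 1
    if rng.random() < true_response[arm]:    # simulate the patient's outcome
        successes[arm] += 1
    else:
        failures[arm] += 1

print(allocations)  # allocation drifts toward the more effective arm
```

As the posteriors sharpen, the sampler allocates an increasing share of patients to the better-performing arm, which is the behavior adaptive designs exploit.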
Examples:
- I-SPY 2 trial (breast cancer): adaptive platform trial using Bayesian methods; AI suggests which drugs to graduate to phase III
- COVID-19 vaccine trials: adaptive designs allowed rapid dose selection and efficacy assessment
- Oncology basket trials: AI identifies biomarker-defined subgroups likely to respond
Challenges:
- Regulatory complexity (the FDA requires clear pre-specification of adaptation rules)
- Statistical complexity (type I error control)
- Operational complexity (trial teams must be prepared to implement adaptations)
Real-World Evidence and AI
The Promise of RWE
Real-World Evidence (RWE):
- Evidence from real-world data (EHRs, claims, registries, wearables)
- Observational data from routine clinical practice
- Complements RCTs by providing:
  - Broader patient populations (real-world diversity vs. trial eligibility restrictions)
  - Longer follow-up
  - Comparative effectiveness (head-to-head comparisons not feasible in RCTs)
  - Faster, cheaper evidence than RCTs
AI’s Role:
- Analyze large-scale EHR data quickly
- Construct matched cohorts (propensity score matching, inverse probability weighting; a matching sketch follows this list)
- Identify confounders and effect modifiers
- Generate hypotheses for RCTs (Beam et al., 2020)
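Below is a bare-bones sketch of the matched-cohort step, assuming scikit-learn: fit a logistic regression for treatment assignment on measured confounders, then pair each treated patient with the nearest-propensity control (here, matching with replacement, for simplicity). It adjusts only for what is in `X`; unmeasured confounding is untouched, which is the fundamental limitation discussed below.

```python
# Bare-bones propensity-score matching sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def match_cohorts(X, treated):
    """X: (n, p) array of measured confounders; treated: boolean array.
    Returns (treated_index, matched_control_index) pairs."""
    # Propensity score = modeled probability of receiving treatment.
    ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
    t_idx = np.where(treated)[0]
    c_idx = np.where(~treated)[0]
    # For each treated patient, find the control with the closest score.
    nn = NearestNeighbors(n_neighbors=1).fit(ps[c_idx].reshape(-1, 1))
    _, j = nn.kneighbors(ps[t_idx].reshape(-1, 1))
    return list(zip(t_idx, c_idx[j.ravel()]))

# Outcomes are then compared within matched pairs; sensitivity analyses
# should probe how strong an unmeasured confounder would need to be to
# explain away the observed difference.
```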
AI for Comparative Effectiveness Research
Typical Use Case:
- Compare outcomes of patients receiving Treatment A vs. Treatment B in routine practice
- AI identifies patients, extracts outcomes, and adjusts for confounders
- Generates a comparative effectiveness estimate
Example Studies:
1. Antidiabetic Medications (2020):
   - AI analyzed EHRs of 500,000 patients with type 2 diabetes
   - Compared cardiovascular outcomes across drug classes (SGLT2i, GLP-1RA, DPP4i)
   - Found SGLT2i associated with lower CV events (consistent with RCT data)
   - Generated hypotheses for new comparisons lacking RCT evidence
2. COVID-19 Treatments (2020-2021):
   - Rapid observational studies of dexamethasone, remdesivir, and tocilizumab
   - AI-enabled cohort construction and outcome ascertainment
   - Informed clinical practice before RCT results were available
   - Later RCTs confirmed (dexamethasone) or refuted (hydroxychloroquine) observational findings
Fundamental Limitations of RWE (AI Doesn’t Solve)
Confounding:
- Patients receiving Treatment A differ from those receiving Treatment B (not randomized)
- Measured confounders can be adjusted for (age, comorbidities)
- Unmeasured confounders cannot (socioeconomic status, frailty, patient preferences)
- AI can only adjust for what is measured in the data
- Residual confounding always remains (Finlayson et al., 2021)
Selection Bias:
- Who gets Treatment A vs. B is non-random
- Healthier patients may get newer drugs; sicker patients get older drugs
- “Confounding by indication”
- Propensity scores and matching reduce but don’t eliminate bias
Measurement Error:
- EHR data were not collected for research (missing data, coding errors)
- Outcome misclassification (e.g., cause of death not reliably captured)
- Exposure misclassification (medication adherence unknown)
- AI can’t create data that wasn’t collected
Causality:
- Observational data show association, not causation
- Bradford Hill criteria and causal inference methods (instrumental variables, regression discontinuity) help but rest on strong assumptions
- RWE cannot definitively prove causality; RCTs remain the gold standard for causal claims
Best Practices for AI-Generated RWE
Transparent Methods:
- Report the data source, cohort construction, and confounders adjusted for
- Sensitivity analyses (varying analytic choices)
- Acknowledge unmeasured confounding
Validation:
- Compare RWE findings to known RCT results (does the RWE replicate RCT findings?)
- External validation in independent datasets
Appropriate Claims:
- Avoid causal language (“Treatment A causes better outcomes”)
- Use associational language (“Treatment A was associated with better outcomes, after adjusting for measured confounders”)
- Acknowledge limitations
Hypothesis Generation:
- RWE is best used to generate hypotheses for RCTs, not replace them
- Inform trial design (endpoints, subgroups, sample size)
- Identify promising signals worth testing rigorously (Topol, 2019)
AI in Drug Discovery and Development
The Drug Development Crisis
Traditional Drug Development:
- 10-15 years from target identification to FDA approval
- $2.6 billion average cost per approved drug (including failures)
- 90% of drugs fail in clinical trials (mostly phase II/III)
- High failure rate due to:
  - Wrong target (disease mechanism misunderstood)
  - Poor pharmacokinetics (drug doesn’t reach its target)
  - Toxicity (unforeseen side effects)
  - Lack of efficacy (doesn’t work in humans)
AI promises to accelerate the early stages and reduce attrition (Schork, 2019).
AI Applications in Drug Discovery
1. Target Identification:
   - Goal: find disease-relevant proteins/genes to drug
   - AI approach:
     - Integrate multi-omics data (genomics, transcriptomics, proteomics)
     - Network analysis (protein-protein interaction networks)
     - Predict which targets are “druggable” and disease-relevant
   - Examples:
     - BenevolentAI identified baricitinib (a JAK inhibitor) for COVID-19 via AI target analysis
     - Recursion Pharmaceuticals uses AI on cellular imaging to identify disease mechanisms
2. Lead Optimization:
   - Goal: optimize molecular structure for potency, selectivity, and pharmacokinetics
   - AI approach:
     - Structure-activity relationship (SAR) modeling
     - Predict binding affinity, solubility, and toxicity from molecular structure
     - Generative models suggest chemical modifications
   - Examples:
     - Atomwise uses deep learning for virtual screening (tests millions of compounds computationally)
     - Insilico Medicine’s AI-designed drug for idiopathic pulmonary fibrosis (phase II trial)
3. De Novo Molecule Design:
   - Goal: generate entirely novel molecular structures with desired properties
   - AI approach:
     - Generative adversarial networks (GANs), variational autoencoders (VAEs)
     - AI “dreams up” molecules that don’t exist yet
     - Filter for drug-like properties and synthesizability
   - Examples:
     - Exscientia designed a drug for obsessive-compulsive disorder (the first AI-designed drug to reach clinical trials)
     - Generate Biomedicines uses AI for protein therapeutics
4. Drug Repurposing:
   - Goal: identify new indications for existing drugs
   - AI approach:
     - Network analysis (drug-disease-gene relationships)
     - Phenotypic screening data
     - Real-world data mining (off-label use patterns)
   - Examples:
     - BenevolentAI: baricitinib for COVID-19
     - Multiple repurposing efforts in cancer (AI identifies oncology drugs for new tumor types)
5. Predictive Toxicology:
   - Goal: predict adverse effects before animal/human testing
   - AI approach:
     - Models trained on toxicity databases (ToxCast, Tox21)
     - Predict hepatotoxicity, cardiotoxicity, and genotoxicity from structure (a toy model appears after this list)
     - Reduces animal testing, catches problems earlier
   - Accuracy: moderate (70-80% for some endpoints); cannot replace in vivo testing yet
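To ground the structure-to-property idea behind lead optimization and predictive toxicology, here is a toy QSAR-style model assuming RDKit and scikit-learn: molecules are featurized as Morgan fingerprints and a random forest maps structure to a binary toxicity label. The labels here are placeholders; real models train on datasets such as Tox21.

```python
# Toy QSAR-style sketch: Morgan fingerprints -> random forest.
# Labels are placeholders, not real toxicity data.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def featurize(smiles):
    """Convert a SMILES string into a 2048-bit Morgan fingerprint array."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    arr = np.zeros((2048,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "C1CCCCC1"]
labels = [0, 1, 0, 0]  # placeholder "toxic" labels for illustration

X = np.vstack([featurize(s) for s in smiles])
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
print(model.predict_proba(X[:1])[0])  # predicted toxicity probability
```

The same featurize-then-predict pattern underlies virtual screening, just run across millions of candidate structures instead of four.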
Reality Check: Hype vs. Progress
Hype:
- “AI will reduce drug development time to 1-2 years”
- “AI will design perfect drugs with no side effects”
- “AI will eliminate the need for clinical trials”
Reality:
- No AI-discovered drug has reached the market (as of 2024):
  - Exscientia and Insilico Medicine drugs are in phase I-II trials
  - None approved yet (but with promising early data)
- AI accelerates the early stages (target ID, lead optimization) but not clinical trials (still years)
- Wet-lab validation required: most AI predictions fail when tested in the lab
  - Only 10-30% of AI-predicted molecules show the desired activity in assays
  - Still better than random screening, but far from perfect
- The clinical trial bottleneck remains: safety and efficacy testing still takes years; AI doesn’t change this
- The long-term view is promising: AI is improving rapidly and will have significant impact over the next decade (Schork, 2019)
Challenges in AI Drug Discovery
Data Limitations:
- Drug discovery data is sparse (millions of possible molecules, data on only thousands)
- Negative data (compounds that failed) often go unpublished
- AI models must extrapolate from limited data
Biological Complexity:
- Human disease is multifactorial (AI is often trained on single-target assays)
- Pharmacokinetics is hard to predict (absorption, distribution, metabolism, excretion)
- Off-target effects and polypharmacology
Validation Gap:
- AI predictions are computational; they require wet-lab validation
- Many academic AI drug discovery papers don’t validate in the lab
- “Garbage in, garbage out”: low-quality training data yields poor predictions
Regulatory Uncertainty:
- The FDA hasn’t approved an AI-designed drug yet (the regulatory pathway is unclear)
- Will the AI design process require disclosure?
- Who is liable if an AI-designed drug causes harm?
AI in Genomics and Precision Medicine Research
Genomic Variant Interpretation
Challenge:
- Whole genome sequencing generates 3-4 million variants per individual
- 99.9% are common variants (not disease-causing)
- Identifying the 1-10 variants causing disease is a needle-in-a-haystack problem
AI for Variant Pathogenicity Prediction:
- Models trained on ClinVar (a database of known pathogenic variants)
- Predict whether a novel variant is benign or pathogenic
- Features: conservation across species, protein structure impact, population frequency
- Examples:
  - PrimateAI: deep learning model, 88% accuracy for pathogenic variant prediction
  - SpliceAI: predicts impact on RNA splicing (high accuracy for splice variants)
  - AlphaMissense: DeepMind model that predicts missense variant effects
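The supervised-learning framing behind these tools can be illustrated with a toy classifier over hand-crafted variant features. All numbers are invented; production models such as PrimateAI learn from raw sequence rather than three summary features, so this only shows the shape of the problem.

```python
# Toy pathogenicity classifier over hand-crafted variant features.
# Invented data; illustrates the supervised framing, not a real model.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Columns: [conservation (0-1), log10 allele frequency, protein impact (0-1)]
X_train = np.array([
    [0.95, -5.2, 0.90],   # rare, conserved, disruptive -> pathogenic
    [0.20, -1.1, 0.10],   # common, unconserved, mild   -> benign
    [0.88, -4.7, 0.75],
    [0.35, -0.8, 0.05],
])
y_train = np.array([1, 0, 1, 0])  # 1 = pathogenic (e.g., labels from ClinVar)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
novel_variant = np.array([[0.91, -4.9, 0.80]])
print(clf.predict_proba(novel_variant)[0, 1])  # P(pathogenic), for expert review
```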
Clinical Use:
- AI assists geneticists in interpreting variants of uncertain significance (VUS)
- Reduces time to diagnosis for rare diseases
- Still requires human expert review: AI provides a prediction; the geneticist makes the final call (Topol, 2019)
Polygenic Risk Scores (PRS)
Goal: Predict disease risk from genome-wide common variants
AI Approach:
- Integrate hundreds to millions of variants
- Weight each variant by its effect size
- Aggregate into a risk score (the weighted sum is sketched below)
- Machine learning optimizes the weighting and feature selection
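The weighted sum at the core of a PRS fits in a few lines of numpy. The effect sizes and dosages below are invented; real scores use thousands to millions of variants, with weights derived from GWAS summary statistics and often tuned by machine learning.

```python
# Core of a polygenic risk score: each variant's allele dosage
# (0, 1, or 2 copies) times its GWAS effect size (beta). Toy numbers.
import numpy as np

betas = np.array([0.12, -0.05, 0.30, 0.08])   # per-allele log-odds weights
dosages = np.array([
    [2, 0, 1, 1],   # patient A's risk-allele counts at each variant
    [0, 2, 0, 1],   # patient B
])

prs = dosages @ betas
# Scores are then standardized against a reference population so that
# individuals can be reported as, e.g., "top 5% of genetic risk".
print(prs)
```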
Examples:
- Coronary artery disease PRS: identifies individuals with 3-5x increased risk
- Breast cancer PRS: comparable to BRCA mutations for risk stratification
- Type 2 diabetes PRS: predicts lifetime risk, informs prevention strategies
Clinical Applications:
- Screening (identify high-risk individuals for closer monitoring)
- Prevention (statin therapy for a high CAD PRS)
- Clinical trials (enrich for high-risk participants)
Limitations:
- Ancestry bias: PRS developed in European populations perform poorly in non-European populations
- Modest predictive value: most PRS explain <10% of disease variance
- Ethical concerns: risk of genetic discrimination (insurance, employment) (Obermeyer et al., 2019)
Multi-Omics Integration
Challenge:
- Integrate genomics + transcriptomics + proteomics + metabolomics + imaging
- Traditional statistical methods struggle with high-dimensional multi-omics data
AI Approach:
- Deep learning integrates multiple data modalities
- Identifies molecular signatures of disease
- Predicts drug response based on a multi-omics profile
Applications:
- Cancer subtyping: identify molecular subtypes beyond histology
- Drug response prediction: predict which cancer patients respond to immunotherapy
- Disease mechanism discovery: reveal pathways linking genetic variants to disease
Examples:
- The Cancer Genome Atlas (TCGA): AI analysis identified novel cancer subtypes with distinct prognoses
- Pharmacogenomics: AI predicts warfarin dose from genetic + clinical data (better than clinical algorithms) (Schork, 2019)
Methodological Rigor and Reporting Standards
The Reproducibility Crisis in AI Research
Problem:
- Many AI studies cannot be reproduced
- Reasons:
  - Code not shared
  - Data not available (privacy concerns)
  - Insufficient methodological detail
  - Overfitting (model works on training data, fails on new data)
  - Publication bias (only positive results published)
TRIPOD-AI Guidelines
TRIPOD: Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis
TRIPOD-AI: Extension for AI/ML models (Collins et al., 2024)
Key Requirements:
1. Title and Abstract:
   - Clearly state that an AI/ML model is used
   - Report key performance metrics
2. Introduction:
   - Research question and rationale
   - Existing prediction models
3. Methods - Data:
   - Data source (EHR, registry, trial)
   - Eligibility criteria
   - Sample size
   - Missing data handling
   - Data preprocessing (normalization, imputation)
4. Methods - Model:
   - Model type (random forest, neural network, etc.)
   - Hyperparameters and tuning process
   - Training/validation/test split
   - Feature selection method
   - Software and version
5. Results - Performance (a computational sketch follows this list):
   - Discrimination (AUROC, C-statistic)
   - Calibration (observed vs. predicted outcomes)
   - Performance by subgroups (age, sex, race)
   - Confidence intervals for all metrics
6. Results - Validation:
   - Internal validation (cross-validation, bootstrap)
   - External validation (independent dataset from a different institution or time period)
   - Temporal validation (trained on old data, tested on new data)
7. Discussion:
   - Limitations (bias, generalizability, missing data)
   - Clinical implications
   - Comparison to existing models
8. Supplementary Materials:
   - Code availability (GitHub, Zenodo)
   - Model parameters (for reproducibility)
   - Data availability statement (de-identified data if possible)
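The discrimination and calibration reporting in item 5 is straightforward to compute with scikit-learn, as in the sketch below; the predictions are simulated stand-ins for a model’s held-out test-set output. Because TRIPOD-AI asks for confidence intervals on all metrics, a bootstrap CI for the AUROC is included.

```python
# Sketch of TRIPOD-AI-style performance reporting: discrimination
# (AUROC + bootstrap CI) and calibration (observed vs. predicted risk).
# y_prob / y_true are simulated stand-ins for held-out model output.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, 500)          # placeholder predicted risks
y_true = rng.binomial(1, y_prob)         # outcomes consistent with them

print(f"AUROC: {roc_auc_score(y_true, y_prob):.3f}")

# Bootstrap 95% CI for the AUROC.
boots = []
for _ in range(1000):
    i = rng.integers(0, len(y_true), len(y_true))
    if len(np.unique(y_true[i])) == 2:   # need both classes in the resample
        boots.append(roc_auc_score(y_true[i], y_prob[i]))
print("95% CI:", np.percentile(boots, [2.5, 97.5]).round(3))

# Calibration: mean observed outcome per bin of predicted risk.
obs, pred = calibration_curve(y_true, y_prob, n_bins=10)
for p, o in zip(pred, obs):
    print(f"predicted {p:.2f} -> observed {o:.2f}")
```

The same code, rerun within demographic subgroups, produces the stratified performance tables the guideline calls for.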
CONSORT-AI Guidelines
CONSORT: Consolidated Standards of Reporting Trials
CONSORT-AI: Extension for trials of AI interventions (Liu et al., 2020)
Key Requirements:
1. AI Intervention Description:
   - Name, version, manufacturer
   - Intended use (diagnosis, treatment recommendation, etc.)
   - FDA clearance status
   - Training data characteristics
   - Model architecture
2. Human-AI Interaction:
   - How clinicians use the AI (decision support, autonomous, etc.)
   - Training provided to clinicians
   - Ability to override the AI
3. AI System Updates:
   - Was the AI updated during the trial?
   - Version control
   - Performance monitoring during the trial
4. Outcome Assessment:
   - AI performance metrics (in addition to clinical outcomes)
   - Subgroup performance (by demographics, disease severity)
5. Blinding:
   - Was the AI output blinded?
   - Were outcome assessors blinded to the AI group?
6. Statistical Analysis:
   - Plan for AI-specific outcomes
   - Handling of AI errors or failures
   - Prespecified subgroups
Best Practices for Reproducible AI Research
Code Sharing:
- Publish code on GitHub, Zenodo, or a similar platform
- Include dependencies and environment specifications
- Document code clearly
- Provide example data (de-identified or synthetic)
External Validation:
- Test on data from a different institution (geographic validation)
- Test on data from a different time period (temporal validation)
- Test on different patient populations (demographic validation)
- Report performance stratified by key subgroups (Nagendran et al., 2020)
Pre-Registration:
- Register the study protocol before analysis (ClinicalTrials.gov, OSF)
- Pre-specify the analysis plan, outcomes, and subgroups
- Prevents p-hacking and selective reporting
Transparent Limitations:
- Acknowledge bias (selection bias, measurement bias, missing data)
- Discuss generalizability limits (which populations, which settings)
- Describe failure modes (when does the model perform poorly?)
- Avoid overstating clinical utility
Ethical Review:
- IRB approval for human subjects research
- Data use agreements
- Address privacy and consent
- Equity and fairness analysis (performance by demographics)