Pediatrics and Neonatology
Children are not small adults, and pediatric AI must account for this fundamental reality. A normal heart rate for a neonate would trigger tachycardia alerts in adults. Growth trajectories, vital sign ranges, and medication dosing vary continuously with age. Most medical AI has been developed in adults, leaving pediatric populations systematically underrepresented and creating both safety concerns and opportunities for child-specific applications.
After reading this chapter, you will be able to:
- Evaluate AI systems for neonatal intensive care monitoring and prediction
- Understand pediatric imaging AI and developmental considerations
- Assess AI tools for growth monitoring, developmental screening, and chronic disease management
- Navigate ethical challenges of AI in pediatric medicine
- Identify failure modes specific to pediatric populations
- Recognize equity concerns in pediatric AI (algorithmic bias against children)
- Apply evidence-based frameworks for pediatric AI adoption
AI Applications in Neonatology (NICU):
1. Neonatal Sepsis Prediction
Early-Onset Sepsis (EOS) Risk Calculators:
Traditional approach: CDC guidelines use maternal risk factors + infant clinical signs
AI enhancement: Kaiser Permanente Neonatal Sepsis Calculator (Kuzniewicz et al., 2017)
Evidence: - Prospectively validated across 608,000+ births (Kuzniewicz et al., 2017) - Reduces unnecessary antibiotic exposure by 48% compared to CDC guidelines - Maintains safety (no increase in missed sepsis cases) - Endorsed by AAP as alternative to CDC guidelines (Puopolo et al., 2018) - External validation shows variable performance across settings (Achten et al., 2019)
How it works: - Integrates maternal risk factors (GBS status, intrapartum antibiotics, temperature, ROM duration) - Infant clinical signs (activity, respiratory status, temperature) - Provides personalized infection risk estimate - Guides empiric antibiotic decision-making
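To make the risk-integration idea concrete, here is a minimal sketch of a logistic-style risk estimate. The variables, coefficients, and function name are hypothetical placeholders for illustration only; the published Kaiser Permanente calculator uses its own fitted model (Kuzniewicz et al., 2017) and should be used for actual care.

```python
import math

def illustrative_eos_risk(gestational_age_weeks: float,
                          max_maternal_temp_c: float,
                          rom_hours: float,
                          gbs_positive: bool,
                          adequate_intrapartum_abx: bool) -> float:
    """Toy logistic model for early-onset sepsis risk per 1,000 births.

    All coefficients are hypothetical placeholders; this is not the
    Kaiser Permanente model.
    """
    log_odds = (
        -6.0                                    # baseline (hypothetical)
        - 0.15 * (gestational_age_weeks - 40)   # lower GA -> higher risk
        + 0.8 * max(max_maternal_temp_c - 37.5, 0)
        + 0.02 * rom_hours
        + 0.7 * gbs_positive
        - 1.0 * adequate_intrapartum_abx
    )
    risk = 1 / (1 + math.exp(-log_odds))
    return risk * 1000  # express per 1,000 live births

# Example: GBS-positive mother, 18 h ROM, peak temp 38.2 C, no adequate prophylaxis
print(f"{illustrative_eos_risk(39, 38.2, 18, True, False):.2f} per 1,000 births")
```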
Clinical impact: - Reduces NICU admissions for rule-out sepsis by 33% (Benitz, 2010) - Decreases blood culture utilization - Minimizes antibiotic exposure in well-appearing infants - Cost savings estimated at $6 million annually per 100,000 births (Puopolo et al., 2011)
Limitations: - Applies only to ≥35 weeks gestation term/late-preterm infants - Does not replace clinical judgment - Requires accurate input data (garbage in, garbage out) - Less effective in resource-limited settings with incomplete maternal data
The Kaiser Permanente calculator is one of the few pediatric AI tools with strong prospective validation (Kuzniewicz et al., 2017) and AAP endorsement (Puopolo et al., 2018). Use it to inform decisions about empiric antibiotics, but don’t let it override your clinical assessment of a sick-appearing infant.
Late-Onset Sepsis (LOS) Prediction in Preterm Infants:
Challenge: Preterm infants are at high risk for LOS (5-20% incidence; Puopolo et al., 2018), but clinical signs are nonspecific
ML approaches: - Continuous monitoring of heart rate variability, vital sign patterns - Lab trajectory analysis (CRP, CBC trends) - Combines physiologic and clinical data
Evidence: - HeRO (Heart Rate Observation) monitor (Moorman et al., 2011): - Analyzes heart rate characteristics (reduced variability, decelerations) - Randomized trial (N=3003 VLBW infants) showed 22% relative reduction in mortality (Moorman et al., 2011) - Detects sepsis 6-24 hours before clinical diagnosis - FDA-cleared for NICU use - Published in Journal of Pediatrics (Moorman et al., 2011)
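The HeRO index itself is proprietary, but the underlying concept of flagging a sustained loss of heart-rate variability can be sketched simply. The window size, baseline, and threshold below are arbitrary placeholders; this is an illustration of the idea, not the HeRO algorithm (Moorman et al., 2011).

```python
import statistics
from collections import deque

def reduced_variability_flag(rr_intervals_ms, window=300, baseline_sd_ms=40.0,
                             threshold_fraction=0.5):
    """Illustrative check for sustained loss of heart-rate variability.

    Flags when the standard deviation of recent R-R intervals falls below a
    fraction of the infant's own baseline. Thresholds are placeholders; the
    commercial HeRO index is a different, validated calculation.
    """
    recent = deque(rr_intervals_ms, maxlen=window)
    if len(recent) < window:
        return False  # not enough data yet
    current_sd = statistics.pstdev(recent)
    return current_sd < threshold_fraction * baseline_sd_ms
```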
Subsequent implementation studies: - Variable mortality benefit in real-world settings (Kumar et al., 2020) - Depends on care team responses to alerts - Alert fatigue documented in 30% of units (Zimmet et al., 2020)
Limitations: - False positive rates 15-30% (alert fatigue risk) - Doesn’t identify pathogen (empiric antibiotics still required) - Requires continuous cardiorespiratory monitoring infrastructure - Training required for appropriate alert interpretation
Implementation challenges: - Integration with existing NICU monitors - Nurse and physician education on alert response - Protocols needed to avoid reflexive antibiotics for every alert
The HeRO monitor has FDA clearance and RCT evidence of mortality benefit (Moorman et al., 2011), a rare achievement for medical AI. But implementation requires thoughtful protocols to avoid alert fatigue and antibiotic overuse (Kumar et al., 2020; Zimmet et al., 2020).
2. Neonatal Respiratory Support
Automated Oxygen Titration for Preterm Infants:
Clinical problem: Preterm infants require narrow oxygen saturation targets (88-95%) to minimize retinopathy of prematurity (ROP) risk and bronchopulmonary dysplasia (BPD) while preventing hypoxia (Stenson et al., 2013).
Manual titration limitations: - Frequent SpO2 fluctuations - Nurse workload (adjustments every 15-30 minutes) - Time outside target range 30-50% in manual mode (Claure et al., 2011)
AI solution: Closed-loop automated oxygen controllers
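To illustrate the closed-loop concept, here is a minimal proportional-style controller sketch. It is not the algorithm used by any commercial system (e.g., CLiO2-type controllers include artifact rejection, safety limits, and validated control logic); step size, target band handling, and bounds are illustrative assumptions.

```python
def adjust_fio2(current_fio2: float, spo2: float,
                target_low: float = 0.88, target_high: float = 0.95,
                step: float = 0.02) -> float:
    """Minimal proportional-style FiO2 adjustment toward an SpO2 target band.

    Illustrative only; not a clinical control algorithm.
    """
    if spo2 < target_low:
        # Hypoxemia: increase FiO2, larger steps when further below target
        error = target_low - spo2
        current_fio2 += step * (1 + error / 0.02)
    elif spo2 > target_high:
        # Hyperoxemia: wean FiO2 gradually
        current_fio2 -= step
    # Clamp to physiologic range (0.21 room air to 1.0)
    return min(max(current_fio2, 0.21), 1.0)

# Example: infant on 35% FiO2 with SpO2 of 84%
print(round(adjust_fio2(0.35, 0.84), 2))
```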
Evidence: - Multiple RCTs show automated systems improve time-in-target-range (Claure & Bancalari, 2015): - 75-85% time in target range (automated) vs. 50-60% (manual) - Reduced hypoxemia episodes by 50% - Reduced hyperoxemia episodes by 40% - No increase in ROP or BPD rates (van Kaam et al., 2015) - Cochrane meta-analysis (N=394 infants) confirmed benefits (Stafford et al., 2023)
FDA status: Several systems cleared (Avea ventilator with closed-loop targeting, others)
Long-term outcomes: - No difference in neurodevelopment at 2 years (Lal et al., 2015) - Reduced severe ROP in some studies (Zapata et al., 2014)
Limitations: - Requires reliable pulse oximetry (motion artifacts problematic) - Doesn’t replace clinical assessment for escalation/de-escalation of support - Alarms still require nurse response - Cost of implementation ($10,000-30,000 per bed)
Strong RCT evidence and Cochrane meta-analysis support (Claure & Bancalari, 2015; Stafford et al., 2023) make this one of the better-validated NICU interventions. If your unit cares for preterm infants requiring prolonged oxygen support, this technology deserves serious consideration.
3. Retinopathy of Prematurity (ROP) Screening
AI-Assisted ROP Detection:
Clinical problem: ROP affects 14,000+ preterm infants annually in US (Hellström et al., 2013). Requires serial dilated retinal exams by ophthalmologists. Severe ROP requires urgent treatment to prevent blindness.
Traditional screening: Modified from AAP guidelines (Fierson et al., 2018) - Infants <1500g birthweight or ≤30 weeks gestation - First exam at 31 weeks postmenstrual age or 4 weeks chronologic age - Serial exams until retina mature - Ophthalmologist-intensive process
AI solution: Automated ROP detection from retinal images
Evidence: - i-ROP system (Brown et al., 2018): - Identifies plus disease (severe ROP) with 93% sensitivity, 94% specificity - Validated across 5511 retinal image sessions from 870 infants - Published in JAMA Ophthalmology (Brown et al., 2018) - Matches expert consensus better than individual ophthalmologists
- Automated detection of treatment-requiring ROP (Chen et al., 2021):
- Sensitivity 91%, specificity 84% for referral-warranted ROP
- Reduces need for ophthalmologist exam in low-risk infants
- Published in Ophthalmology Retina (Chen et al., 2021)
- Deep learning models (Redd et al., 2019):
- Models trained on 50,000+ retinal images
- Detect referral-warranted ROP with AUC 0.94-0.97
- Performance comparable across sites and cameras
Current status: - Not yet FDA-cleared for autonomous diagnosis - Used as screening tool requiring ophthalmologist confirmation - Telemedicine applications for under-resourced NICUs (Wang et al., 2020)
Limitations: - Image quality critical (hazy media, poor dilation reduce accuracy) - Peripheral retina visualization challenging - Does not eliminate need for ophthalmologist expertise - Rare cases may be missed (sensitivity not 100%) - Most studies from academic centers with high-quality imaging
Equity implications: - Could improve access to ROP screening in rural/under-resourced areas - Telemedicine + AI may reduce disparities in ophthalmologist availability - But requires imaging infrastructure and technical support
Future direction: FDA clearance for autonomous screening likely. Could enable ROP screening in settings lacking pediatric ophthalmologists.
The evidence from well-designed studies is promising (Brown et al., 2018; Chen et al., 2021), with AI matching or exceeding individual ophthalmologist performance. Not yet ready for autonomous use. Ophthalmologist confirmation still required. But this is a valuable screening adjunct, particularly for underserved areas lacking subspecialty access.
4. Neonatal Neuroimaging and Brain Injury Prediction
Hypoxic-Ischemic Encephalopathy (HIE) Severity Assessment:
Clinical problem: HIE affects 1-2/1000 term births (Kurinczuk et al., 2010). Therapeutic hypothermia improves outcomes if initiated <6 hours after birth. Severity assessment guides cooling decisions and prognostication.
Traditional assessment: Clinical exam (Sarnat staging) + aEEG or EEG
AI approaches: - MRI analysis for injury prediction (Martinez-Biarge et al., 2012): - Deep learning models segment brain injury patterns on MRI - Predict neurodevelopmental outcomes at 18-24 months - Accuracy 85-90% for moderate-severe disability - Published in Neurology (Martinez-Biarge et al., 2012)
- EEG pattern recognition (Pavel et al., 2020):
- Automated seizure detection in neonates
- Predicts HIE severity and outcomes
- Requires continuous amplitude-integrated EEG (aEEG)
- Sensitivity 85%, specificity 90% for adverse outcomes (Murray et al., 2016)
- Multi-modal prediction models (Wusthoff et al., 2022):
- Combine clinical data, MRI, EEG, biomarkers
- Predict outcomes at 18-24 months with AUC 0.88-0.92
- Published in Pediatric Research (Wusthoff et al., 2022)
Limitations: - MRI typically performed day 4-7 (after acute decisions made) - aEEG expertise limited outside major centers - Prediction models need prospective validation in diverse populations - Outcome prediction at individual level still imperfect (75-85% accuracy) - Long-term outcomes (school age, adolescence) less well predicted
Ethical considerations: - Outcome predictions influence decisions about withdrawal of life-sustaining treatment - False predictions have devastating consequences (both directions) - Must not be sole basis for prognostic discussions - Family values and goals central to decision-making - Cultural attitudes toward disability and life-sustaining treatment vary
The research is promising (Martinez-Biarge et al., 2012; Wusthoff et al., 2022), but prediction algorithms should never determine prognostic decisions. Individual outcome prediction remains imperfect (75-85% accuracy), and the stakes couldn’t be higher. Use these tools to inform conversations with families, not to make withdrawal-of-care decisions. Clinical assessment and serial examinations remain the foundation.
AI Applications in General Pediatrics:
5. Growth and Development Monitoring
Automated Growth Chart Analysis:
Application: - WHO/CDC growth chart plotting from EHR weight/height data - Identification of abnormal growth patterns (failure to thrive, obesity, growth deceleration) - Alerts for crossing percentiles
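The WHO/CDC references underlying these tools use the standard LMS method, so a z-score and percentile can be computed directly from the age- and sex-specific L, M, and S parameters. The sketch below shows that arithmetic; the percentile-crossing rule and the example LMS values are illustrative assumptions, not the alert logic of any specific EHR (e.g., Daymont et al., 2017).

```python
import math

def lms_zscore(measurement: float, L: float, M: float, S: float) -> float:
    """Convert a measurement to a z-score with the LMS method used by the
    WHO/CDC growth references (L = skewness, M = median, S = coefficient of
    variation for the child's age and sex)."""
    if L == 0:
        return math.log(measurement / M) / S
    return ((measurement / M) ** L - 1) / (L * S)

def percentile(z: float) -> float:
    """Normal-CDF percentile corresponding to a z-score."""
    return 50 * (1 + math.erf(z / math.sqrt(2)))

def crossed_major_percentiles(z_then: float, z_now: float) -> bool:
    """Illustrative rule: flag a downward crossing of roughly two major
    percentile lines (about 0.67 z-score units apart), a common trigger for
    growth-faltering review."""
    return (z_then - z_now) >= 2 * 0.67

# Example with hypothetical LMS parameters for illustration only:
z = lms_zscore(measurement=9.2, L=-0.2, M=10.0, S=0.11)
print(round(z, 2), round(percentile(z), 1))
```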
Evidence: - Improves detection of growth abnormalities by 30-40% compared to manual charting (Daymont et al., 2017) - Reduces missed diagnoses of Turner syndrome (short stature), celiac disease (growth deceleration), growth hormone deficiency - Published in JAMIA (Daymont et al., 2017)
Implementation: - Built into most modern EHR systems (Epic, Cerner) - Requires accurate measurement documentation - False positives with measurement errors (incorrect length/height)
Limitations: - Depends on accurate anthropometric measurements - Growth chart reference populations may not represent all ethnic groups - Doesn’t replace clinical judgment (constitutional growth delay vs. pathology)
Low-risk, high-value application. Automated growth chart analysis should be standard in every pediatric practice (Daymont et al., 2017).
Developmental Screening AI:
Traditional screening: AAP recommends standardized developmental screening at 9, 18, 30 months (Lipkin et al., 2020) - Ages and Stages Questionnaires (ASQ) - Parents’ Evaluation of Developmental Status (PEDS) - Modified Checklist for Autism in Toddlers (M-CHAT)
AI-enhanced tools: - Automated analysis of screening questionnaires - Video analysis of infant motor development - Speech/language delay detection from parent-recorded videos
Evidence: - Cognoa (AI-based autism screening) (Kanne et al., 2018): - Analyzes parent questionnaires + home videos - Identifies autism spectrum disorder in children 18-72 months - Sensitivity 84%, specificity 81% - PPV 69% (moderate false positive rate) - FDA granted Breakthrough Device designation (not full clearance) - Published in Autism Research (Kanne et al., 2018)
Limitations: - Cannot replace clinical diagnosis by developmental pediatrician - Cultural and linguistic bias in screening tools (Zuckerman et al., 2014) - Video quality and parent compliance variable - Overdiagnosis risk (low PPV in low-prevalence populations) - Delays in accessing diagnostic services after positive screen
Ethical concerns: - Stigma of early autism labeling - Parental anxiety from false positives - Access to diagnostic services after positive screen variable (6-12 month waits common) - Insurance discrimination concerns
The Cognoa system (Kanne et al., 2018) has FDA Breakthrough Device designation but not full clearance. It’s a screening adjunct, not diagnostic. Before deploying this or similar tools, ensure you have robust developmental evaluation pathways. Positive screens without access to diagnosis and intervention cause more harm than good.
6. Pediatric Emergency Department AI
Pediatric Sepsis Early Warning Systems:
Challenge: Pediatric sepsis causes 7,000+ US deaths annually (Weiss et al., 2020). Early recognition difficult (nonspecific symptoms in children). Published in Pediatric Critical Care Medicine (Weiss et al., 2020).
Traditional tools: Pediatric Early Warning Scores (PEWS), pediatric SIRS criteria
AI-enhanced systems: - Continuous monitoring of vitals, labs, clinical documentation - Age-adjusted warning criteria (pediatric SIRS not sensitive (Goldstein et al., 2005))
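Age adjustment is the core difference from adult alerting: the same heart rate can be normal in an infant and alarming in a teenager. The sketch below shows a simple age-banded cutoff lookup; the values only loosely approximate the pediatric SIRS bands (Goldstein et al., 2005) and are placeholders for illustration, not clinical thresholds.

```python
# Illustrative age-banded tachycardia cutoffs (beats/min). Approximate,
# non-clinical placeholders loosely based on pediatric SIRS age bands
# (Goldstein et al., 2005); use the published criteria for clinical work.
TACHYCARDIA_CUTOFFS = [
    (1 / 12, 180),   # < 1 month
    (1, 180),        # 1 month to 1 year
    (6, 140),        # roughly 1-5 years
    (13, 130),       # roughly 6-12 years
    (18, 110),       # 13-17 years
]

def tachycardia_threshold(age_years: float) -> int:
    """Return an age-appropriate upper heart-rate limit (illustrative)."""
    for max_age, cutoff in TACHYCARDIA_CUTOFFS:
        if age_years < max_age:
            return cutoff
    return 100  # adult cutoff

# A heart rate of 150 is unremarkable in a 2-month-old but alarming in a teenager.
print(tachycardia_threshold(2 / 12), tachycardia_threshold(15))
```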
Evidence: - PEWS enhanced with ML (Parshuram et al., 2018): - Meta-analysis of 15 studies showed PEWS sensitivity 77-93% for deterioration - ML enhancements improve to 85-95% sensitivity - Reduced PICU transfers and cardiac arrests - Published in JAMA (Parshuram et al., 2018)
- Epic Sepsis Model in pediatrics (Giannini et al., 2019):
- Limited validation in children
- High false positive rates (40-60%)
- Performance inferior to adult sepsis models
- Published in Critical Care Medicine (Giannini et al., 2019)
- Pediatric-specific sepsis ML models (Masino et al., 2019):
- Trained on pediatric EHR data
- Predict sepsis 4-12 hours before clinical recognition
- Sensitivity 82%, specificity 88%
- Published in PLoS ONE (Masino et al., 2019)
Critical limitations: - Most sepsis AI trained on adults, inadequate pediatric validation - Age-appropriate vital sign thresholds essential - Parental recognition of illness often precedes algorithmic detection - Alert fatigue major implementation challenge
Pediatric Early Warning Scores have value (Parshuram et al., 2018), and machine learning enhancements show promise. But most systems lack adequate pediatric-specific development and validation (Giannini et al., 2019). Don’t deploy adult sepsis algorithms in children and expect them to work.
Fracture Detection AI in Pediatric Imaging:
Application: AI analysis of pediatric radiographs for fracture detection
Evidence: - Commercial systems (Aidoc, Annalise.ai) show 90-95% sensitivity for pediatric fractures (Rayan et al., 2019) - Useful for triage in busy EDs - Reduces missed subtle fractures (buckle fractures, Salter-Harris I injuries) - Published in Radiology: AI (Rayan et al., 2019)
- Buckle fracture detection (Kim & MacKinnon, 2018):
- AI identifies subtle distal radius buckle fractures
- Sensitivity 94%, specificity 88%
- Reduces missed injuries by 30%
- Published in Clinical Radiology (Kim & MacKinnon, 2018)
Limitations: - Growth plates mimic fracture lines (AI false positives) - Child abuse screening requires clinical correlation (algorithmic detection insufficient) - Does not replace radiologist interpretation - Performance varies by fracture location and subtlety
Medicolegal considerations: - Missed fractures in child abuse cases have severe consequences - AI should assist, not replace, careful skeletal survey interpretation - Documentation of AI use important for liability protection
These systems provide useful support for ED triage and radiologist workflow (Rayan et al., 2019; Kim & MacKinnon, 2018), particularly for subtle buckle fractures that get missed. But they shouldn’t be the sole determinant of clinical management, especially when child abuse is a consideration.
7. Pediatric Chronic Disease Management
Type 1 Diabetes AI Applications:
Artificial Pancreas Systems (Hybrid Closed-Loop Insulin Delivery):
Systems: - Medtronic 670G/780G (FDA-approved ages ≥7 years) - Tandem Control-IQ (FDA-approved ages ≥6 years) - Omnipod 5 (FDA-approved ages ≥2 years)
Evidence: - Pediatric RCTs (Breton et al., 2020): - Time-in-range improved from 53% → 71% (Control-IQ) - Reduced nocturnal hypoglycemia by 50% - HbA1c reduction 0.3-0.5% (clinically significant) - Published in NEJM (Breton et al., 2020)
- Real-world outcomes (Pinsker et al., 2020):
- Similar benefits in routine clinical use
- Quality of life improvements for children and parents (Kudva et al., 2021)
- Reduced diabetes distress and parental fear of hypoglycemia
- Published in Diabetes Technology & Therapeutics (Pinsker et al., 2020)
- Very young children (Forlenza et al., 2021):
- Control-IQ safe and effective ages 2-6 years
- Time-in-range 68% vs. 51% standard care
- Published in Diabetes Care (Forlenza et al., 2021)
Limitations: - Requires continuous glucose monitor (CGM) + insulin pump (technology burden) - User training essential (5-10 hours initial education) - Cost $5,000-8,000 annually (insurance coverage variable) - Alert fatigue from device alarms (10-20 alerts/day typical) - Does not eliminate need for carbohydrate counting and diabetes self-management - System failures require backup conventional insulin regimen
Equity concerns: - Access limited by insurance, SES, health literacy - Disparities in technology use by race/ethnicity (Agarwal et al., 2021) - Published in Diabetes Technology & Therapeutics (Agarwal et al., 2021)
This is evidence-based, FDA-approved technology with clear benefits for pediatric Type 1 diabetes management (Breton et al., 2020; Forlenza et al., 2021). The RCT data and real-world outcomes are compelling. Offer it to appropriate families, but recognize that the 5-10 hours of initial education, $5,000-8,000 annual cost, and daily technology burden won’t work for everyone. And we need to address the equity gaps. This life-changing technology shouldn’t be available only to families with good insurance and high health literacy.
AI-Enhanced Asthma Management:
Applications: - Inhaler adherence monitoring (smart inhalers with Bluetooth) - Exacerbation prediction from symptom tracking apps - Environmental trigger identification (pollen, air quality, allergens)
Evidence: - Smart inhalers (Chan et al., 2015): - Improve medication adherence by 20-30% - Real-time feedback on inhaler technique - Published in Lancet Respiratory Medicine (Chan et al., 2015)
- Exacerbation prediction (Finkelstein & Jeong, 2017):
- ML models predict asthma exacerbations 3-7 days in advance
- Accuracy modest (AUC 0.70-0.75)
- Published in Ann NY Acad Sci (Finkelstein & Jeong, 2017)
- Pediatric-specific validation limited:
- Most studies in adults
- Adherence improvement not consistently translated to outcome improvement (ED visits, hospitalizations)
Smart inhalers (Chan et al., 2015) are promising tools for motivated families struggling with medication adherence. But we need pediatric-specific RCTs showing they actually reduce exacerbations and hospitalizations, not just improve adherence metrics.
8. Pediatric Oncology AI
Pediatric Cancer Diagnosis and Risk Stratification:
Applications: - Neuroblastoma risk stratification from genomics - Leukemia subtype classification from blast morphology - Brain tumor segmentation and classification from MRI
Evidence: - Neuroblastoma genomic classifiers (Cohn et al., 2009): - Integrate genomic data to refine risk stratification - Improve prediction of treatment response - Published in Journal of Clinical Oncology (Cohn et al., 2009)
- ALL subtype classification (Arber et al., 2016):
- AI analysis of bone marrow aspirates identifies ALL subtypes
- Accuracy 95% for major subtypes (T-cell, B-cell precursor)
- Published in Blood (Arber et al., 2016)
- Pediatric brain tumor classification (Tampu et al., 2025):
- MRI-based deep learning models classify tumor types
- Accuracy matches pathologist performance in some series (80-90%)
- Published in Neuro-Oncology Advances (Tampu et al., 2025)
Critical limitations: - Pediatric cancer rare (limited training data) - Genomic classifiers expensive, not universally available - Clinical validation in prospective pediatric trials lacking - Most studies retrospective, single-institution - Integration with established risk stratification systems (COG protocols) incomplete
Ethical concerns: - Prognostic predictions influence treatment intensity decisions (more vs. less chemotherapy) - False reassurance (underestimating risk) or false alarm (overestimating risk) both problematic - Family involvement in research consent complex (parental permission + child assent)
This is exciting research (Cohn et al., 2009; Tampu et al., 2025) but not yet ready for routine clinical use. Pediatric cancer is rare. Training data is limited. Most studies are retrospective, single-institution efforts. Before integrating these tools into treatment decisions, we need multi-institutional prospective validation and integration with Children’s Oncology Group protocols.
9. Pediatric Mental and Behavioral Health AI
Suicide Risk Prediction:
Application: ML models analyzing EHR data to identify children/adolescents at high suicide risk
Evidence: - Suicide attempt prediction (Walsh et al., 2017): - Models identify 50-60% of suicide attempts using EHR data - Better than clinical intuition alone but high false positive rates (PPV 5-10%) - Published in Clinical Psychological Science (Walsh et al., 2017)
- Adolescent-specific models (Su et al., 2020):
- Sensitivity 70-80% for suicide attempts within 90 days
- Specificity 60-70% (high false positives)
- Published in Translational Psychiatry (Su et al., 2020)
Implementation challenges: - What to do with high-risk predictions? (Resource-intensive interventions) - False positives cause family distress and labeling concerns - True positives may not be preventable with current interventions - Liability if identified patient not contacted and dies by suicide
Ethical concerns: - Screening vs. surveillance (are we identifying risk to help or monitor?) - Adolescent privacy and confidentiality (HIPAA allows parental access to minor records, but teens may not disclose SI if parents informed) - Parental notification requirements (varies by state) - Potential for discrimination (insurance, employment, education)
This is NOT ready for clinical implementation (Walsh et al., 2017). The ethical, legal, and practical challenges remain unresolved. High false positive rates (PPV 5-10%) mean you’ll be flagging hundreds of low-risk adolescents for every true case. What intervention do you provide? What do you tell the family? Risk identification without effective intervention pathways is premature at best, harmful at worst. Neither AAP nor AACAP has endorsed algorithmic suicide screening, and for good reason.
ADHD Diagnosis Support:
Tools: - AI analysis of continuous performance tests (CPTs) - Classroom behavior observation algorithms - Parent/teacher rating scale analysis
Evidence: - Objective measures correlate with ADHD diagnosis but do not replace clinical assessment (Hall et al., 2018) - No AI system FDA-cleared for ADHD diagnosis - DSM-5 criteria remain gold standard (requires clinical judgment, developmental history, functional impairment assessment) - Published in Behavioral and Brain Functions (Hall et al., 2018)
Limitations: - ADHD heterogeneous (inattentive, hyperactive, combined types) - Comorbidities common (anxiety, depression, learning disabilities) - Cultural and contextual factors influence symptom expression - No biomarker or objective test diagnostic
AI tools analyzing continuous performance tests or behavior ratings may support clinical assessment (Hall et al., 2018), but they cannot replace comprehensive ADHD evaluation. You still need developmental history, school performance data, family assessment, and comorbidity screening. DSM-5 criteria require clinical judgment, and no FDA-cleared AI system changes that.
Equity and Bias Concerns in Pediatric AI:
Training Data Bias: - Most medical AI trained on adult populations - Pediatric data scarce, often from academic medical centers - Underrepresentation of minority children, rural children, low-income children (Rajkomar et al., 2018)
Examples of Documented Bias:
1. Pulse Oximetry in Darkly Pigmented Skin: - Overestimates oxygen saturation in Black children by 2-3% (Sjoding et al., 2020) - Published in NEJM (Sjoding et al., 2020) - AI relying on pulse ox data inherits this bias - Hypoxemia undetected, sepsis alerts delayed - Disproportionate harm to Black and Hispanic children
2. Neonatal Sepsis Calculators: - Validation studies predominantly white populations - Performance in diverse populations uncertain - Social determinants of health not incorporated (maternal prenatal care access, housing stability)
3. Developmental Screening Tools: - Cultural and linguistic bias in questionnaires (Zuckerman et al., 2014) - Video analysis trained on majority populations - Autism screening tools show racial disparities in referral (Constantino et al., 2020) - Black and Hispanic children diagnosed later, at higher severity (Constantino et al., 2020) - Published in Pediatrics (Constantino et al., 2020)
4. Growth Charts: - WHO/CDC charts based on predominantly white, middle-class populations - May misclassify children from other ethnic backgrounds - Breastfeeding vs. formula feeding growth trajectories differ (Dewey et al., 1992)
5. Asthma Prediction Models: - Many trained on insured, suburban populations - Underperform in urban, low-income settings - Miss environmental triggers specific to disadvantaged neighborhoods (mold, pests, pollution)
Consequences: - Delayed diagnosis in minority children - Overdiagnosis or underdiagnosis based on race/ethnicity - Widening of existing health disparities (Obermeyer et al., 2019) - Erosion of trust in pediatric care systems among minority families
Mitigation Strategies: - Require diverse pediatric training datasets (by race, ethnicity, SES, geography) - Validate algorithms across demographic subgroups - Report performance stratified by demographics (mandate transparency) - Engage community stakeholders in AI development - Continuous monitoring for bias after deployment - Independent equity audits before and after implementation
Ethical Frameworks for Pediatric AI:
1. Best Interest Standard: - AI must serve child’s best interest, not just efficiency or cost reduction - Long-term consequences matter (children have decades ahead) - Parents and children should participate in AI deployment decisions - Published AAP guidance on AI ethics (Johnson et al., 2025)
2. Informed Consent/Assent: - Parental permission required for AI use in care - Age-appropriate child assent (≥7 years typically) - Right to opt out of AI-assisted care when alternatives available - Explanation must be understandable to parents and (when appropriate) children
3. Privacy and Confidentiality: - Children’s health data requires special protection (Platt et al., 2019) - Longitudinal records follow children into adulthood - Data sharing for AI training must have strict safeguards - Adolescent confidentiality particularly sensitive (reproductive health, mental health, substance use) - COPPA (Children’s Online Privacy Protection Act) applies to apps/wearables
4. Equity and Justice: - AI must not worsen existing disparities in pediatric care (Obermeyer et al., 2019) - Access to beneficial AI should not depend on insurance status - Validation in diverse populations mandatory before deployment - Attention to digital divide (not all families have smartphones, reliable internet)
5. Avoid Premature Deployment: - Higher bar for pediatric AI evidence than adult AI - Vulnerable population justifies extra caution (precautionary principle) - Pilot studies in pediatric populations essential before broad deployment - Long-term safety monitoring required
6. Transparency: - Families should know when AI influences their child’s care - Explainable AI particularly important for parental trust - Physicians must be able to explain AI recommendations in plain language - Black-box algorithms ethically problematic in pediatrics
Clinical Practice Guidelines for Pediatric AI:
Before Adopting Pediatric AI:
- Demand pediatric-specific validation:
- Adult validation insufficient
- Stratify performance by age groups (<1 year, 1-5, 6-12, 13-18)
- Include diverse populations (race, ethnicity, SES, geography)
- Published prospective studies, not just retrospective accuracy (Taylor et al., 2019)
- Assess benefit-risk for children:
- Does this improve outcomes or just efficiency?
- What are failure modes and consequences?
- Are there safer alternatives?
- Is the benefit worth the risk? (especially for vulnerable neonates)
- Evaluate equity implications:
- Will this widen or narrow disparities?
- Is training data representative?
- Can all families access this technology? (SES, insurance, language, health literacy)
- Published equity analysis required
- Consider family preferences:
- Some families prefer human-only care (religious, cultural, personal reasons)
- Cultural attitudes toward technology vary
- Offer alternatives when possible
- Respect parental autonomy
- Ensure child-appropriate interfaces:
- Language and visuals appropriate for developmental stage
- Avoid frightening or confusing children
- Involve child life specialists in design
- Gamification should not trivialize medical care
Safe Implementation:
- Staged rollout: Start with oldest children, expand to younger ages only with evidence of safety
- Enhanced monitoring: More frequent safety checks than adult AI (monthly vs. quarterly)
- Incident reporting: Capture adverse events and near-misses; report to FDA MAUDE database
- Family feedback: Systematically collect parent and adolescent experiences
- Physician oversight: AI should support, not replace, pediatrician judgment
- Continuous validation: Monitor real-world performance across demographic subgroups
Red Flags (Avoid These Systems):
- No pediatric validation (only adult data)
- Claims to diagnose complex conditions autonomously (autism, ADHD, mental health)
- Lack of age-stratified performance data
- No mechanism for parents to review AI inputs/outputs
- Vendor resistance to equity audits
- Black-box models without explanation capability
- No FDA clearance when clearance required
Future Directions in Pediatric AI:
Near-Term (2-5 years): - Expanded use of neonatal sepsis calculators and ROP screening AI - Growth monitoring AI standard in all pediatric EHRs - Closed-loop insulin delivery for younger children (toddlers, infants with neonatal diabetes) - Improved fracture detection in pediatric radiology - Medication dosing calculators integrated into CPOE systems (weight-based)
Medium-Term (5-10 years): - AI-assisted developmental screening integrated into well-child visits - Personalized vaccine schedule optimization (immunocompromised children, international adoptees) - Rare disease diagnosis from combined clinical + genomic data (GeneDx, others) - Mental health screening tools with better positive predictive value - Wearable devices for continuous monitoring of children with chronic conditions (CHD, epilepsy, asthma)
Long-Term (10+ years): - Predictive models for chronic disease risk from early childhood data - AI-guided personalized medicine based on pharmacogenomics - Integration of social determinants of health into clinical decision support - Early intervention for neurodevelopmental disorders based on digital phenotyping - School-based AI health monitoring (controversial privacy implications)
Unlikely Despite Hype: - AI replacing pediatrician for primary care (trust and family relationship are central) - Fully automated diagnosis in complex developmental or behavioral conditions - Elimination of parental role in medical decision-making - One-size-fits-all AI (developmental variability is too great)
Key Research Gaps:
Validation Studies: - Prospective RCTs of AI interventions in children - Multi-site validation across diverse populations - Long-term outcome studies (does AI improve health trajectories to adulthood?) - Cost-effectiveness analyses from healthcare system and family perspectives
Equity Research: - Performance of AI across racial/ethnic groups (stratified reporting mandatory) - Impact on health disparities (helpful or harmful?) - Access barriers to beneficial AI technologies - Community-based participatory research in AI development
Implementation Science: - Best practices for integrating AI into pediatric workflows - Training needs for pediatricians, pediatric nurses, pediatric specialists - Family acceptance and preferences across cultures - Strategies to minimize alert fatigue in pediatric settings
Ethics Research: - How to obtain meaningful consent/assent for AI use (developmental stage considerations) - When is AI use in children justified? (ethical frameworks) - Balancing innovation with precautionary principle - Long-term consequences of childhood health data collection
Safety Research: - Adverse event surveillance for pediatric AI - Failure mode analysis specific to children - Human factors research (how do pediatricians interact with AI?)
Conclusion
Pediatric AI holds tremendous promise for improving child health, from saving lives of preterm infants with sepsis prediction (Moorman et al., 2011), to preventing blindness from ROP (Brown et al., 2018), to improving diabetes management for children and families (Breton et al., 2020). But children’s unique vulnerabilities demand higher standards of evidence, greater attention to equity, and more careful consideration of long-term consequences than AI for adults.
Pediatricians should embrace AI tools with robust evidence while advocating for children in AI development, demanding diverse representation in training data, and insisting on pediatric-specific validation before deployment.
The principle remains constant: First, do no harm, especially to children who cannot fully advocate for themselves.
As Dr. Christoph Lehmann wrote in Pediatrics: “We must ensure that artificial intelligence serves the best interests of all children, not just those who are well-represented in training datasets” (Lehmann, 2019).
Check Your Understanding
Scenario 1: AI Suicide Risk Algorithm in Adolescent Medicine
You’re a pediatrician at a large academic children’s hospital. The hospital implements a Vanderbilt-style AI suicide risk prediction algorithm for all adolescent ED visits and inpatient admissions.
AI system: Analyzes EHR data (diagnoses, medications, prior visits, social history) and flags patients at high risk for suicide attempt within 30 days.
Month 1 performance (adolescent patients aged 12-17): - Patients flagged as high-risk: 487 out of 1,200 adolescent encounters (41%) - Actual suicide attempts within 30 days: 3 patients - True positives: AI correctly identified 3/3 (100% sensitivity) - False positives: 484 patients flagged but no suicide attempt - Positive predictive value: 0.6% (3/487)
Clinical workflow impact: - All flagged patients require: Psychiatric consult, safety plan, social work assessment, close follow-up - Psychiatric service overwhelmed: 487 consults vs. usual 120/month - Wait time for psych consult: 8 hours → 24+ hours - Parents of flagged children upset: “Why does AI think my child will hurt themselves?”
Week 3 event: - 15-year-old girl presents to ED with asthma exacerbation - AI flags as high suicide risk (prior depression diagnosis 2 years ago, now in remission) - Psychiatric consult delayed 26 hours due to backlog - During wait, girl becomes agitated, family frustrated - Girl not suicidal, discharged home after 30-hour ED stay - Family files complaint about unnecessary psychiatric hold
Answer 1: What is the problem with this AI implementation?
Unacceptably low positive predictive value (0.6%): - 99.4% of flagged patients are false positives - For every 1 true suicidal patient, AI flags 161 non-suicidal patients
Why such low PPV?: 1. Low base rate: Suicide attempts rare (3/1,200 = 0.25% prevalence) 2. High sensitivity optimization: AI designed to catch all true cases (100% sensitivity) 3. Result: At 0.25% prevalence, even 90% specificity yields a PPV of only about 2%
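The collapse in PPV follows directly from Bayes' theorem. A short sketch of the arithmetic, using the scenario's month-1 numbers (3 true positives, 484 false positives out of 1,200 encounters):

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from Bayes' theorem."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp)

prevalence = 3 / 1200          # 0.25% of adolescent encounters
# Even a screener with perfect sensitivity and 90% specificity has a tiny PPV
# at this base rate:
print(f"{ppv(1.00, 0.90, prevalence):.1%}")            # ~2.4%
# Matching the observed month-1 numbers (3 true and 484 false positives)
# implies a much lower effective specificity:
print(f"{ppv(1.00, 1 - 484 / 1197, prevalence):.1%}")  # ~0.6%
```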
System overload: - Psych service cannot handle 4× increase in consult volume - Delays in care for truly high-risk patients - Resource diversion from patients who need help
Clinical harm: - False positive patients subjected to unnecessary psychiatric evaluation - Stigma, family distress, prolonged ED stays - Delayed care for asthma (primary presenting complaint)
Answer 2: Is this similar to the Vanderbilt suicide algorithm failure?
Yes, nearly identical failure mode:
Vanderbilt algorithm (2018-2020): - Implemented for all patients (adults + children) - Flagged 5,000+ patients as high suicide risk - Actual suicides: 31 (PPV ~0.6%) - False positive rate: ~99.4% - Result: Alert fatigue, discontinued after analysis showed minimal clinical utility
Your hospital’s pediatric implementation: - Same problem: Optimized for sensitivity at cost of specificity - Same PPV (~0.6%) - Same clinical consequence: System overwhelm, false positive burden
Why low PPV is worse in pediatrics: - Parents involved in all decisions (family distress amplified) - Psychiatric resources more limited in pediatrics - Stigma potentially greater for children/adolescents - Longer-term implications of psychiatric labeling in childhood
Answer 3: What are the liability implications?
Potential liability for hospital:
If AI misses a suicide (false negative): - Plaintiff argument: Hospital deployed suicide prediction tool but failed to flag patient → negligence - Defense: AI had 100% sensitivity in pilot; this case was unpredictable
Liability unlikely because AI caught all 3 suicide attempts (100% sensitivity)
If false positive causes harm: - Plaintiff argument (asthma patient delayed care): - AI incorrectly flagged patient as suicidal - Triggered unnecessary 26-hour psychiatric hold - Delayed asthma treatment - Family trauma, stigma - Defense: Hospital acting in abundance of caution, suicide risk assessment standard of care - Outcome: Defense likely prevails (suicide screening justified)
More likely liability: Failure to monitor and adjust system: - Hospital deploys system with 99.4% false positive rate - Continues deployment despite evidence of harm (psych service overload, care delays) - Does not recalibrate or pause system - Plaintiff: Hospital knew system was causing harm but continued anyway
Answer 4: How should suicide risk AI be implemented in pediatrics?
Pre-implementation considerations:
- Accept that high PPV is impossible at low prevalence
- At 0.25% prevalence, no AI achieves PPV >5% with acceptable sensitivity
- Question: Is 95-99% false positive rate acceptable?
- Resource capacity check
- Can psychiatric service handle 2-4× increase in consults?
- If no, system will fail (alert fatigue, delays, staff burnout)
- Tiered risk stratification:
- Very high risk (PPV ~5-10%): Immediate psychiatric consult
- Moderate risk (PPV ~2-3%): Social work assessment, brief screening
- Low-moderate risk (PPV <1%): Informational only, no mandatory intervention
- Reserve intensive interventions for highest-risk tier (a capacity-planning sketch follows this list)
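The capacity check and tier design can be stress-tested with simple arithmetic before go-live. The sketch below estimates monthly flag volume, true positives, and PPV at a few hypothetical operating points; the sensitivity/specificity pairs and tier labels are illustrative assumptions, not properties of any specific algorithm.

```python
def expected_monthly_load(encounters: int, prevalence: float,
                          sensitivity: float, specificity: float):
    """Estimate monthly flag volume, true positives, and PPV for a screening
    threshold, to check whether consult capacity can absorb the alerts."""
    cases = encounters * prevalence
    non_cases = encounters - cases
    flags = sensitivity * cases + (1 - specificity) * non_cases
    ppv = (sensitivity * cases) / flags if flags else 0.0
    return round(flags), round(sensitivity * cases, 1), ppv

# 1,200 adolescent encounters/month, 0.25% prevalence of attempts within 30 days.
# Operating points below are hypothetical.
for label, sens, spec in [("flag everything plausible", 1.00, 0.60),
                          ("moderate-risk tier", 0.90, 0.90),
                          ("very-high-risk tier only", 0.60, 0.99)]:
    flags, tps, ppv = expected_monthly_load(1200, 0.0025, sens, spec)
    print(f"{label}: ~{flags} flags/month, ~{tps} true positives, PPV {ppv:.1%}")
```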
Implementation safeguards:
- Clinical override:
- Pediatrician reviews AI flag, determines if psychiatric consult truly needed
- AI is screening tool, not mandate
- Family communication:
- Explain AI screening to families: “Routine screening tool flagged some concerns. We’ll ask some questions to better assess.”
- Avoid alarming language: “AI thinks your child is suicidal”
- Continuous monitoring:
- Track PPV monthly
- If PPV <2% and false positives causing harm → pause system, recalibrate
- Alternative approach: Universal brief screening
- Instead of AI, use validated brief tools (ASQ, C-SSRS) for all adolescents
- More clinically actionable, less false positive burden
Documentation: - “AI suicide risk algorithm flagged patient. Clinical assessment: No current suicidal ideation, intent, or plan. Patient cooperative, family supportive. Low acute risk. Outpatient f/u arranged.”
Lesson: Suicide prediction AI in pediatrics faces fundamental challenge: low base rate → low PPV → unsustainable false positive burden. Implementation requires tiered approach, clinical oversight, resource capacity, and continuous monitoring. Universal brief clinical screening may be more effective than AI flagging.
Scenario 2: Retinopathy of Prematurity (ROP) Screening AI Bias
You’re a neonatologist at Level III NICU using AI-based ROP screening system (similar to i-ROP DL or IRIS).
AI system: Analyzes retinal images, classifies as: - Plus disease present → Urgent ophthalmology referral - Pre-plus disease → Close monitoring - No plus disease → Routine screening
Your NICU demographics: - 60% Black/Hispanic infants - 30% White infants - 10% Asian infants - Serves predominantly low-income community
Month 3 performance review:
| Race/Ethnicity | AI Sensitivity (Plus Disease) | AI Specificity | Ophthalmology-Confirmed Plus Disease |
|---|---|---|---|
| White infants | 95% (19/20 cases detected) | 88% | 20 cases |
| Black infants | 78% (25/32 cases detected) | 85% | 32 cases |
| Hispanic infants | 72% (18/25 cases detected) | 83% | 25 cases |
Missed cases: - 7 Black infants with plus disease misclassified as “no disease” by AI - 7 Hispanic infants with plus disease misclassified as “no disease” by AI - All received delayed treatment (2-3 weeks later than optimal) - 3 infants progressed to Stage 3+ ROP requiring laser therapy (might have been prevented with earlier treatment)
Answer 1: What caused the racial disparity in AI performance?
Training data bias:
- Dataset composition: AI likely trained predominantly on images from White infants
- Most ROP research datasets: 60-80% White infants
- Underrepresentation of Black/Hispanic infants in training data
- Fundus imaging differences by race:
- Darker pigmentation: Black/Hispanic infants have more heavily pigmented fundi
- Vascular contrast: Blood vessels appear differently against darker vs. lighter fundus
- AI challenge: Trained on lighter fundi, struggles to identify vascular changes on darker backgrounds
- Plus disease features:
- Plus disease = arterial tortuosity + venous dilation
- These features subtler on heavily pigmented fundi
- AI trained on high-contrast (light fundus) images doesn’t generalize to low-contrast (dark fundus) images
Similar to dermatology AI bias: - Melanoma detection AI: 92% sensitivity on light skin (Fitzpatrick I-II) vs. 65% on dark skin (Fitzpatrick V-VI) - Same root cause: Training data bias + imaging physics differences
Answer 2: What are the liability implications of using biased AI?
Hospital/physician liability for missed ROP cases:
Plaintiff argument (parents of Black infant with missed ROP): - Hospital deployed AI system that performed worse on Black infants (78% vs. 95% sensitivity) - This is discriminatory medicine, a different standard of care based on race - Our child was harmed (delayed treatment, worse outcome) because of racial bias in AI - Hospital knew or should have known AI had lower sensitivity for Black/Hispanic infants
Legal framework (civil rights implications): - Title VI of the Civil Rights Act prohibits discrimination in healthcare - Deploying AI with known racial performance disparities may violate civil rights - Disparate impact: Even without intent, if AI causes worse outcomes for racial minorities, potentially illegal
Plaintiff damages: - Child now requires laser therapy (might have been prevented with earlier detection) - Potential vision impairment - Lifelong consequences of preventable ROP progression
Defense arguments: - AI overall performance acceptable (average 82% sensitivity) - Physician still reviewed images (AI was decision support, not final decision) - ROP difficult to detect even for human experts
Likely outcome: - Plaintiff verdict possible, especially if: - Hospital aware of racial performance disparity but continued deployment - No additional safeguards for higher-risk populations - Multiple missed cases demonstrating pattern
Regulatory implications: - FDA increasingly scrutinizes AI for bias - May require race-stratified performance data in clearance applications - Post-market surveillance for equity outcomes
Answer 3: How should ROP screening AI be implemented equitably?
Pre-deployment validation:
- Stratified performance testing:
- Test AI on representative sample of your NICU population
- Report sensitivity/specificity by race/ethnicity
- If disparities >10 percentage points → pause deployment, work with vendor (see the audit sketch after this list)
- Vendor accountability:
- Demand race-stratified validation data from vendor
- Ask: “What is sensitivity for Black, Hispanic, Asian infants specifically?”
- If vendor cannot provide → do not deploy until data available
- Accept that current AI may not be ready for diverse populations:
- If AI validated only on 80% White datasets → not appropriate for diverse NICU
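A stratified audit of this kind can be run from a simple log of screened infants. The sketch below computes per-group sensitivity for plus disease and flags a best-versus-worst disparity; the record format, field names, and the 10-point threshold are illustrative choices, and the example numbers are taken from the scenario's month-3 table.

```python
from collections import defaultdict

def stratified_sensitivity(records):
    """Compute per-group sensitivity for plus disease from audit records.

    Each record is (group, ai_flagged_plus_disease, ophtho_confirmed_plus_disease);
    the record format is an illustrative assumption.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0})
    for group, ai_positive, truth_positive in records:
        if truth_positive:
            counts[group]["tp" if ai_positive else "fn"] += 1
    return {g: c["tp"] / (c["tp"] + c["fn"])
            for g, c in counts.items() if c["tp"] + c["fn"]}

def disparity_exceeds(sens_by_group, threshold=0.10):
    """Flag when best-vs-worst group sensitivity differs by more than the threshold."""
    values = list(sens_by_group.values())
    return max(values) - min(values) > threshold

# Example using the month-3 numbers from the scenario:
records = ([("White", True, True)] * 19 + [("White", False, True)] * 1 +
           [("Black", True, True)] * 25 + [("Black", False, True)] * 7 +
           [("Hispanic", True, True)] * 18 + [("Hispanic", False, True)] * 7)
sens = stratified_sensitivity(records)
print({g: round(s, 2) for g, s in sens.items()}, disparity_exceeds(sens))
```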
Implementation safeguards:
- Higher scrutiny for at-risk populations:
- For Black/Hispanic infants: Lower threshold for ophthalmology referral
- If AI says “no disease” but clinical concern → refer anyway (do not defer to AI)
- Hybrid approach:
- AI screens all infants
- Ophthalmologist reviews images from Black/Hispanic infants (double-check AI)
- May reduce efficiency gains but prevents missed cases
- Human expert involvement:
- Neonatologist or ophthalmologist reviews AI classifications
- Clinical override when AI classification conflicts with exam findings
Monitoring:
- Ongoing surveillance:
- Track missed ROP cases by race
- If pattern emerges (more misses in Black/Hispanic infants) → investigate AI bias
- Quarterly audits:
- Compare AI performance to ophthalmology gold standard by race
- Report to hospital equity committee
Advocacy:
- Demand better AI:
- Work with AI vendors to improve performance on diverse populations
- Insist on diverse training datasets
- Research funding:
- Support research creating diverse ROP datasets
- Partner with vendors to improve AI for underrepresented populations
Documentation: - “AI ROP screening classified as [result]. Reviewed images personally. [Agree/Disagree] with AI assessment. Ophthalmology referral [made/deferred] based on clinical judgment.”
Lesson: AI trained on predominantly White populations may perform worse on racial/ethnic minorities, creating health disparities. Deploying biased AI without safeguards is ethically unacceptable and legally risky. Require race-stratified validation, implement monitoring, maintain human oversight, and demand equity from AI vendors.
Scenario 3: Neonatal Jaundice Management AI in Resource-Limited Setting
You’re a general pediatrician at a rural health clinic in an underserved area. The nearest children’s hospital is 120 miles away.
Clinical challenge: Managing neonatal jaundice. You see 15-20 newborns/month. Transcutaneous bilirubinometer available, but serum bilirubin testing requires sending blood to reference lab (24-48 hour turnaround).
AI tool: Smartphone app (BiliScan) that estimates bilirubin level from photo of infant. - Marketing claims: “93% accuracy, replace transcutaneous bilirubin measurements, FDA-cleared” - Cost: $50/month subscription vs. $15,000 for transcutaneous bilirubinometer
Your usage: - Use BiliScan for initial screening - If BiliScan suggests bilirubin >15 mg/dL → send serum bilirubin, consider referral
Case 1 - Success: - 4-day-old term infant, looks jaundiced - BiliScan estimate: 17.2 mg/dL - Serum bilirubin (sent immediately): 16.8 mg/dL - Transfer to children’s hospital for phototherapy - Good outcome, kernicterus prevented
Case 2 - Near miss: - 3-day-old term infant, appears mildly jaundiced - BiliScan estimate: 11.2 mg/dL (below phototherapy threshold) - Parents reassured, discharged home - Infant returns 2 days later (day 5) lethargic, poor feeding - Emergency serum bilirubin: 24.8 mg/dL (critical) - Emergency transfer, exchange transfusion required - Infant survives but develops kernicterus (bilirubin encephalopathy)
Investigation: Why did BiliScan underestimate? - Infant has darker skin tone (Hispanic) - Room lighting was fluorescent (not natural light as recommended) - BiliScan validated primarily on White infants in optimal lighting
Answer 1: What was the error in using BiliScan?
Over-reliance on AI without validation for your population:
- Skin tone bias: BiliScan accuracy lower on darker skin (same issue as transcutaneous bilirubinometers)
- Yellow skin discoloration harder to detect on darker baseline pigmentation
- AI trained on mostly light-skinned infants
- Lighting conditions: Smartphone camera + room lighting ≠ controlled medical device
- BiliScan requires natural indirect lighting (not fluorescent)
- Color temperature, shadows, reflections affect accuracy
- User error: Smartphone camera positioning, focus, distance affect reading
- Not standardized like medical device
- No local validation: You did not test BiliScan against serum bilirubin in YOUR patient population before clinical use
FDA clearance does not guarantee accuracy in all populations/settings: - FDA clearance based on pivotal trial (likely controlled conditions, selected patients) - Real-world performance may differ
Answer 2: Are you liable for the kernicterus case?
Possibly yes. Key questions:
Standard of care: - What is standard approach for neonatal jaundice screening in resource-limited settings? - If standard is: Visual assessment + transcutaneous bilirubin + low threshold for serum testing - Did you fall below standard by relying on unvalidated smartphone app?
Plaintiff argument: - You used smartphone app instead of validated medical device (transcutaneous bilirubinometer) - App underestimated bilirubin due to known limitations (skin tone, lighting) - Failed to obtain confirmatory serum bilirubin despite moderately elevated visual jaundice - Infant developed preventable kernicterus due to missed diagnosis
Defense argument: - BiliScan is FDA-cleared device - Used appropriately per manufacturer instructions - Resource constraints (cannot afford $15,000 bilirubinometer) - BiliScan estimate (11.2 mg/dL) was below phototherapy threshold, so it was reasonable to discharge with close follow-up
Medical expert testimony: - Plaintiff expert: “Smartphone apps have known inaccuracy on darker skin. Physician should have obtained serum bilirubin given moderate visual jaundice, regardless of BiliScan reading.” - Defense expert: “BiliScan is FDA-cleared, widely used. Physician acted reasonably given resource limitations.”
Likely outcome: - Depends on jurisdiction and expert testimony - Plaintiff has strong case if: - Evidence that BiliScan known to underestimate on darker skin - Physician did not obtain serum bili despite moderate clinical jaundice - Standard of care requires serum confirmation when clinical exam conflicts with screening tool
Settlement likely to avoid prolonged litigation given severity (kernicterus)
Answer 3: How should smartphone-based medical AI be used safely in resource-limited settings?
Validation before deployment:
- Local validation study:
- Use BiliScan + serum bilirubin on 50-100 infants
- Compare BiliScan estimates to gold standard
- Stratify by skin tone (light, medium, dark)
- Determine accuracy in YOUR setting with YOUR population (a simple agreement-analysis sketch follows this list)
- Lighting standardization:
- Take all photos in same location (near window, natural light)
- Avoid fluorescent/LED lighting
- Use white background, standard distance
- Operator training:
- All staff using BiliScan trained on proper technique
- Inter-rater reliability testing
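The local validation step above can be summarized with a basic agreement analysis. The sketch below reports mean bias (app minus serum) and its spread per skin-tone stratum, in the spirit of a Bland-Altman analysis; the data structure, field names, and pilot numbers are hypothetical illustrations, not BiliScan's actual performance.

```python
import statistics

def bias_by_skin_tone(paired_results):
    """Summarize app-vs-serum bilirubin agreement stratified by skin tone.

    `paired_results` is a list of (skin_tone, app_mg_dl, serum_mg_dl) tuples
    from a local validation run; the format is an illustrative assumption.
    Returns (mean bias, SD of differences, n) per stratum.
    """
    strata = {}
    for tone, app, serum in paired_results:
        strata.setdefault(tone, []).append(app - serum)
    summary = {}
    for tone, diffs in strata.items():
        mean_bias = statistics.mean(diffs)
        sd = statistics.stdev(diffs) if len(diffs) > 1 else 0.0
        summary[tone] = (round(mean_bias, 1), round(sd, 1), len(diffs))
    return summary

# Hypothetical pilot data: a consistent negative bias on darker skin would
# argue against relying on the app in that group without serum confirmation.
pilot = [("light", 11.8, 11.5), ("light", 14.1, 13.8),
         ("medium", 12.0, 13.1), ("medium", 10.9, 12.2),
         ("dark", 10.5, 13.4), ("dark", 11.2, 14.0)]
print(bias_by_skin_tone(pilot))
```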
Clinical integration:
- Use as screening tool, not replacement for serum bilirubin:
- BiliScan for initial risk stratification
- Low threshold for confirmatory serum bilirubin:
- If BiliScan >12 mg/dL → serum bilirubin
- If visual exam suggests moderate jaundice → serum bilirubin (even if BiliScan reassuring)
- Do not rely on AI when clinical exam conflicts:
- If infant “looks jaundiced” but BiliScan says 10 mg/dL → serum bilirubin
- Clinical judgment overrides AI when discrepant
- Lower threshold for higher-risk infants:
- Preterm, breastfeeding difficulties, weight loss, darker skin tone → serum bilirubin regardless of BiliScan
Documentation: - “BiliScan estimate: 11.2 mg/dL. However, infant appears moderately jaundiced on exam. Obtaining serum bilirubin for confirmation given clinical-AI discrepancy.”
Advocacy for access: - Telemedicine: Photo sent to neonatologist for assessment - Point-of-care serum bilirubin: Advocate for access to rapid testing - Equipment grants: Apply for funding for transcutaneous bilirubinometer
Lesson: Smartphone-based medical AI can improve access in resource-limited settings BUT has significant limitations (skin tone bias, lighting sensitivity, user variability). Must be validated locally before clinical use. Use as screening tool with low threshold for confirmatory testing. Clinical judgment overrides AI when discrepant. Resource constraints do not eliminate duty to meet standard of care. Advocate for access to validated tools.