19  Medical Ethics, Bias, and Health Equity

Learning Objectives

AI in medicine raises profound ethical questions: algorithmic bias, health equity, informed consent, autonomy, and the changing physician-patient relationship. This chapter examines ethical frameworks for responsible AI deployment in healthcare. You will learn to:

  • Apply bioethical principles (autonomy, beneficence, non-maleficence, justice) to medical AI
  • Recognize and mitigate algorithmic bias and health disparities
  • Navigate informed consent challenges with AI-assisted care
  • Assess fairness and equity implications of AI systems
  • Understand professional responsibilities when using AI
  • Apply ethical frameworks for AI development and deployment

Essential for all physicians and healthcare leaders.

Core Ethical Principles for Medical AI:

19.0.2 Beneficence and Non-Maleficence

  • AI must improve outcomes, not just efficiency (Topol 2019)
  • Rigorous validation required before deployment (Nagendran et al. 2020)
  • Ongoing monitoring for harms essential (performance drift, unexpected failures)
  • First, do no harm—applies to algorithms as much as treatments

19.0.3 Justice and Equity

  • AI must not worsen health disparities (Obermeyer et al. 2019)
  • Training data must represent diverse populations (Daneshjou et al. 2022)
  • Access to beneficial AI should be equitable, not just for well-resourced institutions
  • Algorithmic fairness requires intentional design, not just technical accuracy

19.0.4 Algorithmic Bias Is Pervasive

  • Documented across radiology, dermatology, risk prediction, and resource allocation
  • Often reflects biases in training data and healthcare systems (Gichoya et al. 2022)
  • Can amplify existing health disparities if not actively mitigated
  • Requires diverse development teams and inclusive validation studies

Landmark Case Studies:

❌ The Obermeyer Study (Science, 2019):

  • Commercial algorithm used healthcare costs as a proxy for health needs
  • Black patients were systematically under-prioritized despite being sicker
  • The algorithm affected millions of patients across the U.S.
  • Demonstrated how seemingly neutral technical choices encode racial bias (Obermeyer et al. 2019)

❌ Dermatology AI Bias:

  • Skin cancer detection AI trained predominantly on light skin tones
  • Performance degradation on dark skin (higher false negatives) (Daneshjou et al. 2022)
  • Risk: Delayed cancer diagnosis in underrepresented populations

✅ IDx-DR Inclusive Validation:

  • Diabetic retinopathy AI validated across diverse racial/ethnic groups
  • Prospective trial deliberately enrolled a representative population
  • Demonstrated equitable performance across subgroups (Abràmoff et al. 2018)
  • Model for inclusive AI development

Professional Responsibilities:

  • Physicians remain ethically and legally responsible for AI-assisted decisions (Char, Shah, and Magnus 2018)
  • Responsibility cannot be delegated to algorithms (“the AI said so” is not a defense)
  • Physicians must understand AI limitations and failure modes before clinical use
  • Advocate for patients over algorithmic recommendations when appropriate
  • Duty to report and address bias when observed

Clinical Bottom Line: Medical AI is not ethically neutral technology. Every AI system embeds values, makes trade-offs, and can perpetuate or mitigate inequities. Physicians have a professional obligation to ensure AI serves all patients equitably, respects autonomy, and improves—not just automates—care.

19.1 Introduction

Medical AI is often framed as a purely technical challenge: train algorithms on medical data, validate performance, deploy in clinical settings. But this framing obscures profound ethical questions:

  • Who benefits from medical AI? Patients at well-resourced academic centers, or also those in rural clinics and safety-net hospitals?
  • Who is harmed when AI fails? Disproportionately, underrepresented populations excluded from training data.
  • Who decides when AI is “good enough”? Developers optimizing for overall accuracy, or patients for whom the algorithm performs poorly?
  • How does AI change the physician-patient relationship? Does it enhance clinical judgment or erode physician autonomy and accountability?

These are not hypothetical concerns. As this chapter documents, medical AI has already perpetuated racial bias (Obermeyer et al. 2019), performed inequitably across skin tones (Daneshjou et al. 2022), and raised fundamental questions about informed consent and physician responsibility (Char, Shah, and Magnus 2018).

The ethical deployment of medical AI requires more than technical validation. It demands intentional attention to fairness, equity, transparency, and accountability—principles central to medical professionalism but often absent from algorithmic development.

This chapter applies traditional bioethical frameworks to medical AI, examines documented cases of bias and inequity, and provides practical guidance for physicians navigating ethical challenges in AI-augmented practice.

19.2 Bioethical Principles Applied to Medical AI

The four principles of biomedical ethics—autonomy, beneficence, non-maleficence, and justice—provide a framework for evaluating medical AI (Char, Shah, and Magnus 2018).

19.2.2 Beneficence: AI Must Benefit Patients

The Principle: Medical interventions should improve patient outcomes. “Do good.”

Challenges in Medical AI:

1. Efficiency ≠ Benefit: Many AI systems improve workflow efficiency (faster reads, reduced documentation burden) but don’t clearly improve patient outcomes. Efficiency benefits institutions and physicians; do they benefit patients?

2. Surrogate Outcomes: AI is often validated on surrogate outcomes (detection accuracy, sensitivity, specificity) rather than clinical outcomes (mortality, morbidity, quality of life). High accuracy doesn’t guarantee patient benefit (Nagendran et al. 2020).

3. Unintended Consequences: AI can have hidden downsides: alert fatigue, deskilling of physicians, over-diagnosis from ultra-sensitive algorithms.

Evidence Standard for Beneficence:

Medical AI should meet the same evidence standard as medications or procedures:

  • Preclinical validation: Retrospective accuracy studies (analogous to Phase I/II trials)
  • Clinical validation: Prospective studies demonstrating clinical benefit (analogous to Phase III trials)
  • Post-market surveillance: Ongoing monitoring for harms and performance drift (analogous to Phase IV)

Critical Question: Has this AI been shown to improve patient outcomes in prospective studies, or only to achieve high accuracy in retrospective datasets?

Most medical AI has only retrospective validation. This is insufficient for claiming beneficence (Topol 2019).

Example: IDx-DR (Diabetic Retinopathy Screening)

✅ Demonstrated Beneficence:

  • Prospective trial showed the AI enabled screening in primary care clinics where ophthalmologists were unavailable
  • Increased screening rates for underserved populations
  • Detected vision-threatening retinopathy that would otherwise have been missed
  • Clear patient benefit: prevented vision loss through earlier detection (Abràmoff et al. 2018)

Example: Epic Sepsis Model

❌ Failed to Demonstrate Beneficence:

  • Retrospective studies suggested high accuracy
  • External validation showed poor real-world performance (Wong et al. 2021)
  • High false positive rate caused alert fatigue
  • No evidence that deployment improved sepsis outcomes
  • The efficiency goal (early sepsis detection) did not translate into patient benefit

19.2.3 Non-Maleficence: First, Do No Harm

The Principle: Medical interventions should not harm patients. When harm is unavoidable (e.g., chemotherapy side effects), benefits must outweigh harms.

Harms from Medical AI:

1. Direct Harms:

  • False negatives: missed diagnoses (cancer, fractures, strokes) leading to delayed treatment
  • False positives: unnecessary testing, procedures, anxiety, overtreatment
  • Incorrect treatment recommendations: AI suggesting the wrong medication, wrong dose, or a contraindicated therapy

2. Indirect Harms:

  • Alert fatigue: too many AI warnings causing physicians to ignore all alerts, including true positives
  • Deskilling: over-reliance on AI eroding clinical skills (radiologists losing the ability to detect findings without AI)
  • Delayed care: AI-driven workflows introducing bottlenecks (e.g., waiting for an AI report before human review)

3. Equity Harms:

  • Algorithmic bias: AI performing worse for underrepresented groups, causing disparate harm (Obermeyer et al. 2019; Daneshjou et al. 2022)
  • Access disparities: beneficial AI available only to wealthy institutions, widening quality gaps

4. Psychological Harms:

  • Loss of trust: patients losing confidence in physicians who defer to algorithms
  • Dehumanization: care feeling automated and impersonal

Risk Mitigation Strategies:

Preventing AI Harm in Clinical Practice

Before Deployment:

  1. Demand prospective validation in populations similar to yours (Nagendran et al. 2020)
  2. Assess subgroup performance: how does the AI perform for your patient demographics?
  3. Understand failure modes: how and why does the AI fail?
  4. Ensure human oversight: AI should augment, not replace, physician judgment

During Use:

  5. Monitor real-world performance: does the AI perform as expected in your setting?
  6. Track harms: capture false negatives, alert fatigue, workflow disruptions
  7. Maintain clinical skills: don’t let AI erode your ability to practice without it
  8. Preserve physician final authority: algorithms recommend; physicians decide

After Adverse Events:

  9. Root cause analysis: was the AI contributory? How can recurrence be prevented?
  10. Reporting mechanisms: report AI failures to vendors, the FDA (if applicable), and institutional quality/safety teams

Precautionary Principle: When AI evidence is uncertain, err on the side of caution. The burden of proof for safety and efficacy rests with those deploying AI, not with the patients who may be harmed (Vayena, Blasimme, and Cohen 2018).

19.2.4 Justice and Health Equity

The Principle: Medical resources should be distributed fairly. Benefits and burdens should not fall disproportionately on particular groups.

Justice Challenges in Medical AI:

1. Training Data Bias: AI trained predominantly on data from well-resourced academic centers, affluent populations, and racial/ethnic majorities performs worse for underrepresented groups (Gichoya et al. 2022).

2. Access Disparities:

  • Advanced AI is often deployed first at wealthy institutions
  • Rural, safety-net, and under-resourced hospitals lack infrastructure for AI
  • Widens existing quality gaps between haves and have-nots

3. Algorithmic Amplification of Bias: AI can amplify existing healthcare disparities:

  • If training data reflect biased medical practice (e.g., Black patients receiving less pain medication), the AI learns and perpetuates that bias
  • If outcome proxies are biased (e.g., healthcare costs correlating with access, not need), AI recommendations will be biased (Obermeyer et al. 2019)

4. Representation in AI Development:

  • Lack of diversity in AI development teams can lead to blind spots about bias and harm
  • Lack of diversity in leadership means equity considerations may be deprioritized

19.3 Algorithmic Bias: Documented Cases and Lessons

Algorithmic bias in medicine is not theoretical—it’s documented, measurable, and consequential.

19.3.1 Case 1: Racial Bias in Healthcare Resource Allocation

The Obermeyer Study (Science, 2019) (Obermeyer et al. 2019):

Background:

  • Commercial algorithm used by healthcare systems to allocate care management resources
  • Predicted which patients would benefit from extra support (care coordination, disease management)
  • Used healthcare costs as a proxy for healthcare needs

The Bias:

  • At any given risk score, Black patients were significantly sicker than white patients
  • The algorithm systematically under-predicted risk for Black patients
  • Result: Black patients needed to be much sicker than white patients to receive the same level of care

Why It Happened:

  • Healthcare costs reflect access to care, not just medical need
  • Black patients face barriers to accessing care (insurance, transportation, discrimination), leading to lower healthcare spending despite higher illness burden
  • The algorithm learned that “lower spending = lower need,” perpetuating inequity

Impact:

  • Affected millions of patients across U.S. healthcare systems
  • Reduced the number of Black patients flagged for high-risk care management programs by more than 50%
  • Vendors corrected the algorithm after publication, but similar bias likely exists in other systems

Lessons:

  • Choice of outcome variable matters: “cost” and “need” are not the same
  • Historical biases propagate: if training data reflect biased systems, AI learns the bias
  • Disparities can be subtle: the algorithm didn’t explicitly use race, but outcomes were racially biased
  • External auditing is essential: the bias was discovered by independent researchers, not the algorithm’s developers (see the audit sketch below)
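
The kind of external audit that surfaced this bias can be sketched in a few lines. Below is a minimal, illustrative example (not the study’s actual code), assuming patient-level data with an algorithmic risk score, a group label, and a direct measure of illness burden; the column names are hypothetical.

```python
# Minimal label-choice bias audit, in the spirit of Obermeyer et al. (2019).
# Assumes a DataFrame with hypothetical columns: risk_score, race, n_chronic_conditions.
import pandas as pd

def label_bias_audit(df: pd.DataFrame, n_bins: int = 10) -> pd.DataFrame:
    """Mean illness burden by group within each risk-score decile."""
    df = df.copy()
    df["risk_decile"] = pd.qcut(df["risk_score"], q=n_bins,
                                labels=False, duplicates="drop")
    # If one group is consistently sicker than another at the same decile,
    # the training label (e.g., cost) is a biased proxy for medical need.
    return (df.groupby(["risk_decile", "race"])["n_chronic_conditions"]
              .mean()
              .unstack("race"))

# Usage (hypothetical data): audit = label_bias_audit(claims_df); print(audit)
```

The key design point is that the audit compares groups at equal algorithm scores using a burden measure independent of the training label, which is exactly where a biased proxy reveals itself.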

19.3.2 Case 2: Dermatology AI Bias Across Skin Tones

Multiple Studies (Daneshjou et al. 2022; Esteva et al. 2017):

Background:

  • AI for skin cancer detection trained predominantly on images of light skin
  • Dermatology training datasets severely underrepresent darker skin tones (Fitzpatrick IV-VI)

The Bias:

  • AI performance degrades on darker skin tones
  • Higher false negative rates for melanoma in Black and Latino patients
  • Risk: delayed cancer diagnosis in precisely the populations with worse melanoma outcomes

Why It Happened:

  • Training data reflect existing disparities: dermatology textbooks and databases predominantly feature light skin
  • Developers didn’t intentionally exclude dark skin—they used the data that were available
  • Performance testing often doesn’t stratify by skin tone, hiding the problem

Lessons:

  • Representation in training data is critical
  • Performance must be evaluated across relevant subgroups (not just overall accuracy)
  • “Color-blind” algorithms are not fair algorithms—ignoring race/ethnicity can perpetuate disparities
  • Field-specific challenges: dermatology must actively recruit diverse image datasets

19.3.3 Case 3: Pulse Oximetry Bias

Recent Findings (Sjoding et al. 2020):

Background:

  • Pulse oximeters measure oxygen saturation non-invasively
  • Critical for managing COVID-19, sepsis, and respiratory failure
  • Generally considered accurate and unbiased

The Bias:

  • Pulse oximeters overestimate oxygen saturation in Black patients (compared with arterial blood gas)
  • Black patients are more likely to have hidden hypoxemia (low oxygen despite a “normal” pulse ox reading)
  • May delay recognition of deterioration and treatment escalation

Why It Happened:

  • Medical devices calibrated and validated primarily on white participants
  • Skin pigmentation affects light absorption, but devices are not adjusted for this
  • Decades of use passed before the bias was recognized

Lessons:

  • Bias exists even in established, widely used technologies
  • Real-world performance monitoring is essential, not just initial validation (see the sketch below)
  • Equity requires intentionality—assuming fairness is insufficient
  • Simple technologies can have complex bias (not just an AI-specific problem)
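
As a concrete illustration, this kind of disparity can be checked with a simple stratified analysis of paired pulse-oximetry and arterial-blood-gas readings, in the spirit of Sjoding et al. (2020). The sketch below uses hypothetical column names (spo2, sao2, race) and is not the published analysis code.

```python
# Rate of hidden (occult) hypoxemia by group: SaO2 < 88% despite SpO2 of 92-96%.
# Hypothetical columns: spo2, sao2, race.
import pandas as pd

def hidden_hypoxemia_rate(df: pd.DataFrame) -> pd.Series:
    looks_ok = df["spo2"].between(92, 96)   # pulse oximeter in the "acceptable" range
    hidden = df["sao2"] < 88                # arterial blood gas says otherwise
    paired = df.loc[looks_ok].assign(hidden=hidden[looks_ok])
    return paired.groupby("race")["hidden"].mean()

# A markedly higher rate in one group is exactly the disparity that
# overall accuracy figures conceal.
```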

19.4 Mitigating Bias and Promoting Equity

Addressing algorithmic bias requires action across the AI lifecycle: development, validation, deployment, and monitoring.

19.4.1 During AI Development

1. Diverse and Representative Training Data:

  • Ensure training datasets include adequate representation of all relevant demographic groups
  • Stratify by race, ethnicity, sex, age, geography, and socioeconomic status
  • Partner with diverse healthcare institutions (not just academic centers)

2. Diverse Development Teams:

  • Include clinicians who care for underserved populations
  • Involve ethicists, health equity experts, and community representatives
  • Team diversity increases the likelihood of identifying potential biases early

3. Equity as a Design Goal:

  • Define fairness explicitly: equal performance across groups? Equal access? Equal benefit?
  • Different fairness definitions involve trade-offs—be transparent about choices (Parikh, Teeple, and Navathe 2019); see the sketch after this list
  • Test for bias proactively (don’t assume fairness)

4. Choice of Outcome Variables:

  • Scrutinize proxies for actual outcomes (cost ≠ need, admissions ≠ severity)
  • Consider how historical disparities may bias outcome definitions
  • Validate that proxy measures don’t encode existing inequities
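
To make the fairness trade-offs in item 3 concrete, the sketch below computes three common group-fairness quantities for a binary classifier: selection rate (demographic parity), true positive rate (equal opportunity), and false positive rate (which, together with TPR, gives equalized odds). When base rates differ across groups, these generally cannot all be equalized at once, which is why the choice must be explicit. Function and variable names are illustrative assumptions.

```python
# Compare candidate fairness definitions for a binary classifier.
# Illustrative only; y_true/y_pred are 0/1 arrays, group is a label per patient.
import numpy as np

def fairness_report(y_true, y_pred, group):
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    out = {}
    for g in np.unique(group):
        m = group == g
        pos, neg = y_true[m] == 1, y_true[m] == 0
        out[g] = {
            "selection_rate": float(y_pred[m].mean()),                           # demographic parity
            "tpr": float(y_pred[m][pos].mean()) if pos.any() else float("nan"),  # equal opportunity
            "fpr": float(y_pred[m][neg].mean()) if neg.any() else float("nan"),  # equalized odds (with tpr)
        }
    return out

# Example: fairness_report([1, 0, 1, 0], [1, 0, 0, 1], ["A", "A", "B", "B"])
```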

19.4.2 During Validation

5. Subgroup Analysis is Mandatory:

  • Report algorithm performance stratified by race, ethnicity, sex, age, and insurance status
  • Don’t hide subgroup disparities inside overall accuracy metrics
  • Identify where performance is acceptable vs. inadequate (Nagendran et al. 2020)

6. External Validation in Diverse Populations:

  • Validate in institutions serving underrepresented populations
  • Test across geographic, socioeconomic, and practice-setting diversity
  • Don’t assume generalizability—prove it (Vabalas et al. 2019)

7. Assess Calibration, Not Just Discrimination (see the sketch after this list):

  • Discrimination (AUC-ROC) measures the ability to rank risk
  • Calibration measures whether predicted probabilities match actual outcomes
  • Calibration often differs across subgroups even when discrimination appears similar
  • Miscalibration can lead to disparate treatment

8. Consider Intersectionality:

  • Bias often compounds across multiple identities (e.g., Black and female, or rural and elderly)
  • Test performance in intersectional subgroups, not just single demographic categories
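
A minimal sketch of how items 5 and 7 could be reported together, assuming scikit-learn is available and hypothetical column names (predicted_risk, outcome, group): per-group AUC for discrimination plus a calibration-in-the-large comparison of mean predicted risk against the observed event rate.

```python
# Per-group discrimination (AUC) and calibration-in-the-large.
# Hypothetical columns: predicted_risk (0-1), outcome (0/1), group.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_validation(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for g, sub in df.groupby("group"):
        auc = (roc_auc_score(sub["outcome"], sub["predicted_risk"])
               if sub["outcome"].nunique() == 2 else np.nan)
        rows.append({
            "group": g,
            "n": len(sub),
            "auc": auc,                                        # discrimination
            "mean_predicted": sub["predicted_risk"].mean(),    # calibration-in-the-large:
            "observed_rate": sub["outcome"].mean(),            # these two should roughly match
        })
    return pd.DataFrame(rows)

# Two groups can share the same AUC yet have very different calibration gaps,
# which is why both are reported here.
```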

19.4.3 During Deployment

9. Transparent Communication:

  • Disclose known limitations and subgroup performance differences
  • Don’t hide uncertainties or deployment risks (Char, Shah, and Magnus 2018)
  • If the AI performs worse for certain groups, inform clinicians and patients

10. Human Oversight and Physician Autonomy:

  • AI should inform, not dictate, decisions
  • Physicians must be able to override algorithms when clinical judgment differs
  • Preserve physician accountability (the algorithm cannot be blamed for bad decisions)

11. Equitable Access:

  • Ensure beneficial AI is available across practice settings (not just wealthy institutions)
  • Consider cost, infrastructure requirements, and workflow fit
  • Avoid creating two-tiered healthcare: AI-augmented care for the wealthy, standard care for everyone else

19.4.4 During Ongoing Monitoring

12. Continuous Performance Monitoring (see the sketch after this list):

  • Track real-world performance across demographic subgroups
  • Monitor for performance drift over time (models degrade as clinical practice evolves)
  • Establish thresholds for acceptable performance and trigger investigations when they are crossed (Gichoya et al. 2022)

13. Adverse Event Reporting:

  • Create mechanisms for clinicians and patients to report suspected AI harms
  • Investigate patterns suggesting bias (e.g., more false negatives in a particular group)
  • Share lessons learned across institutions

14. Regular Bias Audits:

  • Independent audits of AI fairness (not just self-reporting by vendors)
  • Involve community stakeholders and equity experts
  • Make audit results public (transparency increases accountability)

15. Willingness to Decommission:

  • If an AI system is found to perpetuate bias and cannot be corrected, stop using it
  • Don’t continue harmful AI just because it’s already deployed
  • Patient welfare > sunk costs
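
One way to operationalize item 12 is a scheduled job that recomputes a key metric per month and per subgroup and raises an alert when it falls below a pre-agreed floor. The sketch below is illustrative; the schema (date, group, y_true, y_pred) and the threshold are assumptions, not a standard.

```python
# Monthly, per-group sensitivity with a simple alert threshold.
# Hypothetical columns: date, group, y_true (0/1), y_pred (0/1).
import pandas as pd

ALERT_FLOOR = 0.80  # illustrative minimum acceptable sensitivity

def monthly_sensitivity(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["month"] = pd.to_datetime(df["date"]).dt.to_period("M")
    positives = df[df["y_true"] == 1]            # sensitivity is computed on true cases
    sens = (positives.groupby(["month", "group"])["y_pred"]
                     .mean()                     # fraction of true cases the model flagged
                     .rename("sensitivity")
                     .reset_index())
    sens["alert"] = sens["sensitivity"] < ALERT_FLOOR
    return sens

# Rows with alert == True should trigger investigation (and, if uncorrectable,
# decommissioning) rather than silent acceptance of degraded performance.
```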

19.6 Professional Responsibilities in the AI Era

Medical AI doesn’t eliminate physician responsibility—it increases it.

19.6.1 Physician as Steward of AI

1. Understand Before Using:

  • Don’t use AI as a black box: “I don’t know how it works, but it’s accurate”
  • Understand, at least in general terms, what data the AI uses, how it was trained, and what its known limitations are
  • Know how the AI was validated and in which populations

2. Maintain Independent Judgment:

  • AI recommendations are inputs to decision-making, not final decisions
  • The physician must independently assess the patient, formulate a differential, and consider the AI output in clinical context
  • Avoid automation bias (uncritically accepting AI recommendations)

3. Recognize and Override When Appropriate:

  • If an AI recommendation conflicts with clinical judgment, investigate
  • Don’t defer to the algorithm when patient-specific factors (not captured by the AI) are relevant
  • Document your reasoning when overriding the AI

4. Protect Patient Interests:

  • Advocate for patients over algorithmic efficiency
  • If an AI-driven workflow harms patients (delays, errors, loss of personalized care), escalate concerns
  • Professional obligation to prioritize patient welfare over institutional AI investments

19.6.2 Physician as Advocate for Equity

5. Demand Evidence of Fairness:

  • Ask: “How does this AI perform for my patient population?”
  • Refuse to use AI with known bias unless no alternative exists and the bias is disclosed to patients

6. Monitor for Disparate Impact:

  • If you suspect the AI performs worse for certain patients, document and report it
  • Advocate for inclusive validation and bias mitigation

7. Ensure Equitable Access:

  • Support policies ensuring beneficial AI is available to all patients (not just those at elite institutions)

19.6.3 Physician as Learner

8. Stay Current:

  • Medical AI evolves rapidly—yesterday’s evidence may be outdated
  • Engage with the medical AI literature (not just vendor claims)
  • Participate in institutional AI governance and education

9. Teach Others:

  • Educate trainees, colleagues, and patients about AI capabilities and limitations
  • Model appropriate AI use (thoughtful integration, not blind adherence)

19.7 Ethical Frameworks for Institutional AI Governance

Healthcare institutions deploying AI need structured governance to ensure ethical use (Char, Shah, and Magnus 2018).

19.7.1 Core Components of AI Governance

1. AI Ethics Committee:

  • Multidisciplinary: clinicians, ethicists, informaticists, legal counsel, and community representatives
  • Reviews proposed AI deployments for ethical concerns
  • Has authority to approve, modify, or reject AI adoption

2. Equity Impact Assessment:

  • Required before deploying AI
  • Analyzes potential disparate impact on vulnerable populations
  • Identifies mitigation strategies

3. Transparency Requirements (see the sketch after this list):

  • AI systems must be documented: purpose, training data, validation studies, known limitations
  • Performance data (including subgroup performance) made available to clinicians
  • Patients informed about AI use (tiered consent approach)

4. Ongoing Monitoring:

  • Real-world performance tracking (overall and by subgroup)
  • Adverse event reporting mechanisms
  • Regular bias audits

5. Accountability:

  • Clear designation of responsibility (who is accountable when AI fails?)
  • The physician retains ultimate authority and responsibility
  • Vendors held accountable for undisclosed risks or misrepresented performance

6. Sunset Provisions:

  • AI deployment isn’t permanent—reevaluate periodically
  • Decommission AI that performs poorly, perpetuates bias, or becomes outdated
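
The documentation requirement in item 3 can be made machine-readable so that governance reviews, monitoring, and audits all work from the same artifact. The sketch below is one possible shape for such a record, loosely in the “model card” spirit; every field name is an assumption, not a standard.

```python
# A minimal, machine-readable documentation record for a deployed model.
# Field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    name: str
    intended_use: str
    training_data_summary: str
    validation_studies: list = field(default_factory=list)    # citations or study IDs
    known_limitations: list = field(default_factory=list)
    subgroup_performance: dict = field(default_factory=dict)  # e.g., {"group": {"auc": ...}}
    patient_disclosure: str = ""                               # what patients are told

card = ModelCard(
    name="Hypothetical sepsis early-warning model",
    intended_use="Adult inpatients; alerts reviewed by a clinician before any action",
    training_data_summary="2015-2020 EHR data from three academic medical centers",
    known_limitations=["Not validated in pediatric or obstetric populations"],
    subgroup_performance={"overall": {"auc": 0.78}, "Black patients": {"auc": 0.74}},
)
```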

19.8 Conclusion: Toward Ethical Medical AI

Medical AI holds enormous promise: more accurate diagnosis, personalized treatment, equitable access to specialist expertise. But realizing this promise requires more than technical innovation—it demands ethical intentionality.

The history of medical AI thus far includes both successes (IDx-DR enabling diabetic retinopathy screening in underserved communities) and failures (algorithms perpetuating racial bias in resource allocation). The difference lies not in the technology itself, but in the values and priorities of those who develop, validate, deploy, and use it (Obermeyer et al. 2019; Char, Shah, and Magnus 2018).

Ethical medical AI requires:

  1. Centering equity: Diverse training data, subgroup validation, bias mitigation, equitable access
  2. Respecting autonomy: Informed consent, transparency, patient opt-out options
  3. Demonstrating benefit: Prospective validation of clinical outcomes, not just retrospective accuracy
  4. Preventing harm: Rigorous testing, ongoing monitoring, physician oversight, accountability mechanisms
  5. Physician stewardship: Understanding AI, maintaining judgment, protecting patient interests

The goal is not AI-free medicine (that ship has sailed), nor is it uncritical AI adoption. The goal is AI that embodies the values of medicine: commitment to patients above all, respect for human dignity, pursuit of equity, and fierce protection of the vulnerable.

Physicians—by virtue of their clinical expertise, ethical training, and patient advocacy role—are uniquely positioned to ensure AI serves these ends. That responsibility cannot be delegated to algorithms.


19.9 References