Health Care Bias Is Dangerous. But So Are ‘Fairness’ Algorithms

Medical systems disproportionately fail people of color, but a focus on fixing the numbers could lead to worse outcomes.

Mental and physical health are crucial contributors to living happy and fulfilled lives. How we feel impacts the work we perform, the social relationships we forge, and the care we provide for our loved ones. Because the stakes are so high, we often turn to technology to help keep our communities safe. Artificial intelligence is one of the big hopes, and many companies are investing heavily in tech to serve growing health needs across the world. Many promising examples exist: AI can be used to detect cancer, triage patients, and make treatment recommendations. One goal is to use AI to increase access to high-quality health care, especially in places and for people who have historically been shut out.

Yet racially biased medical devices have already caused harm. During the Covid-19 pandemic, pulse oximeters overestimated blood oxygen levels in darker-skinned patients, delaying their treatment. Similarly, lung and skin cancer detection technologies are known to be less accurate for darker-skinned people, meaning they more frequently fail to flag cancers in these patients, delaying access to life-saving care. Patient triage systems regularly underestimate the need for care in minority ethnic patients. One such system, for example, was shown to regularly underestimate the severity of illness in Black patients because it used health care costs as a proxy for illness while failing to account for unequal access to care, and thus unequal costs, across the population. The same bias can also be observed along gender lines: female patients are disproportionately misdiagnosed for heart disease and receive insufficient or incorrect treatment.

Fortunately, many in the AI community are now actively working to redress these kinds of biases. Unfortunately, as our latest research shows, the algorithms they have developed could actually make things worse if put into practice, putting people’s lives at risk.

The majority of algorithms developed to enforce “algorithmic fairness” were built without policy and societal contexts in mind. Most define fairness in simple terms: reducing gaps in performance or outcomes between demographic groups. Successfully enforcing fairness in AI has come to mean satisfying one of these abstract mathematical definitions while preserving as much of the original system’s accuracy as possible.

With these existing algorithms, fairness is typically achieved through two steps: (1) improving performance for worse-performing groups, and (2) degrading performance for better-performing groups. These steps can be distinguished by their underlying motivations.

Imagine that, in the interest of fairness, we want to reduce bias in an AI system used for predicting future risk of lung cancer. Our imaginary system, similar to real-world examples, suffers from a performance gap between Black and white patients. Specifically, the system has lower recall for Black patients, meaning it routinely underestimates their risk of cancer and incorrectly classifies patients who are actually at “high risk” of developing lung cancer in the future as “low risk.”
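
To make the gap concrete, here is a minimal sketch in Python of how such a recall gap is typically measured. The patients, predictions, and numbers below are entirely invented for illustration; they are not drawn from any real system.

```python
# Toy illustration of a per-group recall gap. All data here is hypothetical.

def recall(y_true, y_pred):
    """Share of truly high-risk patients (label 1) that the system flags as high risk."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_positives = sum(y_true)
    return true_positives / actual_positives if actual_positives else 0.0

# 1 = "high risk", 0 = "low risk"
y_true = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1]
group  = ["white"] * 6 + ["Black"] * 6

def group_recall(g):
    idx = [i for i, grp in enumerate(group) if grp == g]
    return recall([y_true[i] for i in idx], [y_pred[i] for i in idx])

print(f"recall, white patients: {group_recall('white'):.2f}")  # 0.75
print(f"recall, Black patients: {group_recall('Black'):.2f}")  # 0.50
print(f"recall gap: {group_recall('white') - group_recall('Black'):.2f}")  # 0.25
```

In a real screening system the gap would be estimated on far larger datasets, but the measurement itself really is this simple: two numbers, one per group, and their difference.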

This worse performance may have many causes. It may stem from the system being trained on data drawn predominantly from white patients, or from health records for Black patients being less accessible or of lower quality. Likewise, it may reflect underlying social inequalities in health care access and expenditures.

Whatever the cause of the performance gap, our motivation for pursuing fairness is to improve the situation of a historically disadvantaged group. In the context of cancer screening, false negatives are much more harmful than false positives; the latter means a patient will have follow-up health checks or scans they did not need, whereas the former means that more future cases of cancer will go undiagnosed and untreated.

One way to improve the situation of Black patients is therefore to improve the system’s recall. As a first step, we may decide to err on the side of caution and tell the system to change its predictions for the cases involving Black patients that it is least confident about. Specifically, we would flip some low-confidence “low risk” cases to “high risk” in order to catch more cases of cancer. This is called “leveling up”: designing systems to purposefully change some of their predictions for the groups currently disadvantaged by those systems, and to follow up with those patients more often (e.g., through more frequent cancer screenings).
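
One simple way to implement this step is to lower the decision threshold for the disadvantaged group, so that borderline “low risk” calls become “high risk.” The sketch below uses its own invented confidence scores, labels, and thresholds; real systems are more sophisticated, but the mechanics are similar.

```python
# "Leveling up" sketch: lower the decision threshold for the worse-served
# group so that its least-confident "low risk" calls flip to "high risk".
# All confidences, labels, and thresholds are hypothetical.

# Each record: (model confidence that the patient is high risk, true label)
# for the group whose recall we want to raise. 1 = high risk, 0 = low risk.
black_patients = [
    (0.80, 1),  # already flagged high risk
    (0.45, 1),  # borderline miss: truly high risk, but "low risk" at a 0.50 threshold
    (0.40, 1),  # borderline miss
    (0.38, 0),  # borderline, but truly low risk
    (0.10, 0),
]

def recall_and_false_positives(patients, threshold):
    preds = [(1 if conf >= threshold else 0, label) for conf, label in patients]
    tp = sum(1 for p, y in preds if p == 1 and y == 1)
    fp = sum(1 for p, y in preds if p == 1 and y == 0)
    positives = sum(y for _, y in patients)
    return tp / positives, fp

print(recall_and_false_positives(black_patients, 0.50))  # recall 0.33, 0 false positives
print(recall_and_false_positives(black_patients, 0.35))  # recall 1.00, 1 false positive
```

Recall for the group rises from one in three to three in three; the price is one additional false positive, which is exactly the accuracy-for-recall trade-off discussed next.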

This change comes at the cost of accuracy; the number of people falsely identified as being at risk of cancer increases, and the system’s overall accuracy declines. However, this trade-off between accuracy and recall is acceptable because failing to diagnose someone with cancer is so harmful. 

By flipping cases to increase recall at the cost of accuracy, we can eventually reach a state where any further changes would come at an unacceptably high loss of accuracy. Where that point lies is ultimately a subjective decision; there is no true “tipping point” between recall and accuracy. We have not necessarily brought performance (or recall) for Black patients up to the same level as white patients, but we have done as much as possible with the current system, the data available, and other constraints to improve the situation of Black patients and reduce the performance gap.

This is where we face a dilemma, and where the narrow focus of modern fairness algorithms on achieving equal performance at all costs creates unintended but unavoidable problems. Though we cannot improve performance for Black patients any further without an unacceptable loss of accuracy, we could also reduce performance for white patients, lowering both their recall and accuracy in the process, so that our system has equal recall rates for both groups. In our example, we would alter the labels of white patients, switching some of the predictions from “high risk” to “low risk.” 

The motivation is mathematical convenience: Our aim is to make two numbers (e.g., recall) as close to equal as possible between two groups (i.e., white and Black patients), solely to satisfy a definition that says a system is fair when these two numbers are equal. 
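
In code, the leveling-down move is just as mechanical as leveling up. The sketch below, again with invented numbers and an assumed target, raises the threshold for the better-performing group until its recall falls to the level already reached for the other group; the gap closes, but only because truly high-risk patients lose their flag.

```python
# "Leveling down" sketch: raise the decision threshold for the
# better-performing group until its recall drops to match the other
# group's. All numbers here are hypothetical.

# Each record: (model confidence that the patient is high risk, true label)
white_patients = [
    (0.90, 1),
    (0.75, 1),
    (0.60, 1),  # loses its "high risk" flag once the threshold rises above 0.60
    (0.30, 0),
]

def recall(patients, threshold):
    tp = sum(1 for conf, y in patients if conf >= threshold and y == 1)
    positives = sum(y for _, y in patients)
    return tp / positives

target = 2 / 3  # the best recall we managed to reach for the other group (assumed)

# Pick the first candidate threshold that drags recall down to the target.
for threshold in [0.50, 0.65, 0.80, 0.95]:
    if recall(white_patients, threshold) <= target:
        break

print(f"threshold raised to {threshold:.2f}; "
      f"white recall falls from {recall(white_patients, 0.50):.2f} "
      f"to {recall(white_patients, threshold):.2f}")
# The recall gap is now closed, but no Black patient is any better off: a truly
# high-risk white patient has simply lost their "high risk" flag and follow-up care.
```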

Clearly, marking a formerly “high risk” patient as “low risk” is extremely harmful to those patients, who would no longer be offered follow-up care and monitoring. Overall accuracy decreases and the frequency of the most harmful type of error increases, all for the sake of reducing the gap in performance. Critically, this reduction in performance is neither necessary for, nor causally linked to, any improvement for the groups with lower performance.

Yet this is exactly what many algorithms that enforce group fairness do, because it is the mathematically optimal solution. This type of degradation, where fairness is achieved by arbitrarily making one or more groups worse off, or by bringing better-performing groups down to the level of the worst-performing group, is called “leveling down.” Wherever it occurs, enforcing fairness through leveling down is a cause for concern.

What we have described here is, in fact, a best-case scenario, in which it is possible to enforce fairness by making simple changes that affect performance for each group. In practice, fairness algorithms may behave much more radically and unpredictably. This survey found that, on average, most algorithms in computer vision improved fairness by harming all groups, for example by decreasing recall and accuracy. Unlike in our hypothetical, where we decreased the harm suffered by one group, leveling down can make everyone directly worse off.

Leveling down runs counter to the objectives of algorithmic fairness and broader equality goals in society: to improve outcomes for historically disadvantaged or marginalized groups. Lowering performance for high-performing groups does not self-evidently benefit worse-performing groups. Moreover, leveling down can harm historically disadvantaged groups directly. The choice to remove a benefit rather than share it with others shows a lack of concern, solidarity, and willingness to take the opportunity to actually fix the problem. It stigmatizes historically disadvantaged groups and solidifies the separateness and social inequality that led to the problem in the first place.

When we build AI systems to make decisions about people's lives, our design decisions encode implicit value judgments about what should be prioritized. Leveling down is a consequence of the choice to measure and redress fairness solely in terms of disparity between groups, while ignoring utility, welfare, priority, and other goods that are central to questions of equality in the real world. It is not the inevitable fate of algorithmic fairness; rather, it is the result of taking the path of least mathematical resistance, and not for any overarching societal, legal, or ethical reasons. 

To move forward we have three options: 

• We can continue to deploy biased systems that ostensibly benefit only one privileged segment of the population while severely harming others. 
• We can define fairness in formalistic mathematical terms, and deploy AI that is less accurate for all groups and actively harmful for some groups. 
• We can take action and achieve fairness through “leveling up.” 

We believe leveling up is the only morally, ethically, and legally acceptable path forward. The challenge for the future of fairness in AI is to create and implement systems that are substantively fair, not only procedurally fair through leveling down. Leveling up is a more complex challenge: It needs to be paired with active steps to root out the real-life causes of biases in AI systems. Technical solutions are often only a Band-Aid for a broken system. Improving access to health care, curating more diverse data sets, and developing tools that specifically target the problems faced by historically disadvantaged communities can all help make substantive fairness a reality.

This is a much more complex challenge than simply tweaking a system to make two numbers equal between groups. It may require not only significant technological and methodological innovation, including redesigning AI systems from the ground up, but also substantial social changes in areas such as health care access and expenditures. 

Difficult though it may be, this refocusing on “fair AI” is essential. AI systems make life-changing decisions. Choices about how they should be fair, and to whom, are too important to treat fairness as a simple mathematical problem to be solved. Yet that is the status quo, and it has resulted in fairness methods that achieve equality through leveling down. Thus far, we have created methods that are mathematically fair, but that cannot and do not demonstrably benefit disadvantaged groups.

This is not enough. Existing tools are treated as a solution to the problem of algorithmic fairness, but thus far they do not deliver on their promise. Their morally murky effects make them less likely to be used and may be slowing down real solutions to these problems. What we need are systems that are fair through leveling up, that help groups with worse performance without arbitrarily harming others. This is the challenge we must now solve. We need AI that is substantively, not just mathematically, fair.

Disclosure: Chris Russell is also an employee at Amazon Web Services. He did not contribute to this op-ed or its underlying research in his capacity as an Amazon employee; both were prepared solely through the Trustworthiness Auditing for AI project at the Oxford Internet Institute.

Update March 3, 2023 11AM Eastern: This article was updated to include an author disclosure and make clearer the hypothetical example of leveling down in healthcare.