Which fairness metric should I use to evaluate my AI model?
A key component of algorithmic accountability is the question of how to measure and monitor fairness. How do you know when a model crosses the threshold from acceptable to unacceptable performance, and when to discontinue its use?
The computer science community has worked to define fairness mathematically and concluded that fairness is subjective: there is no single definition that fits every situation. Instead, it recommends multiple fairness metrics tailored to different scenarios and use cases.
The quiz below asks a series of questions about your use case and directs you to the most relevant fairness metric for evaluating the performance of an AI tool.
Some helpful definitions:
Disparate Treatment
Directly incorporating a person’s attributes into an algorithmic decision-making process such that different subgroups of people are treated differently.
For example, consider an algorithm that determines eligibility for a home loan based on the data provided in a loan application. If the algorithm uses the applicant’s credit score as an input, it is enacting disparate treatment along that dimension.
Disparate Impact
Making decisions about people that impact different population subgroups disproportionately. This usually refers to situations where an algorithmic decision-making process harms or benefits some subgroups more than others.
For example, suppose an algorithm that determines a person’s eligibility for a home loan is more likely to classify them as “ineligible” if their mailing address contains a certain postal code. If black applicants are more likely than white applicants to have mailing addresses with this postal code, then this algorithm may result in disparate impact.
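Disparate impact is often quantified as the ratio of favorable-outcome rates between a protected group and a reference group; in US employment law, a ratio below 0.8 (the “four-fifths rule”) is a common warning sign. The sketch below is a minimal illustration of that calculation, not part of the quiz; the column names, the toy loan data, and the choice of threshold are assumptions made for the example.

```python
import pandas as pd

def disparate_impact_ratio(df, group_col, outcome_col, protected, reference):
    """Ratio of favorable-outcome rates: protected group vs. reference group.

    A ratio well below 1.0 (commonly, below 0.8) suggests the protected group
    receives the favorable outcome less often than the reference group.
    """
    rate_protected = df.loc[df[group_col] == protected, outcome_col].mean()
    rate_reference = df.loc[df[group_col] == reference, outcome_col].mean()
    return rate_protected / rate_reference

# Hypothetical loan-approval data (1 = approved, 0 = denied).
loans = pd.DataFrame({
    "race":     ["black", "black", "black", "black", "white", "white", "white", "white"],
    "approved": [0, 1, 0, 0, 1, 1, 0, 1],
})

ratio = disparate_impact_ratio(loans, "race", "approved", protected="black", reference="white")
print(f"Disparate impact ratio: {ratio:.2f}")  # 0.25 / 0.75 ≈ 0.33, well below 0.8
```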
Try out the quiz for yourself! At the end, you’ll receive a recommendation for a fairness metric relevant to your use case along with a definition and a healthcare example that clarifies pros/cons for the metric. After taking the quiz, you can specify this fairness metric in your contract as part of an auditing framework for evaluating the performance of the procured tool over time.
An example that may help capture some of the difficulties of quantifying fairness:
Consider a breast-cancer screening tool that uses a patient’s medical history to recommend mammograms for high-risk patients. As a product user, you are bothered by the tool’s use of a patient’s racial identity data: will that information end up skewing the predictions in a discriminatory way?
Some motivations for this scenario:
You want to identify all of the potential cancer cases for early intervention while reducing the number of genuinely unnecessary cancer screenings as much as possible. You’re more willing to tolerate cases where patients receive mammograms that return non-cancerous results (false positives) than cases where patients are not recommended mammograms and develop breast cancer without medical treatment (false negatives).
As a racial equity advocate, you especially want to avoid imposing undue stress and regressive financial burden on BIPOC communities.
You worry that removing racial demographic information may lower the accuracy of the screening tool for BIPOC communities and increase the rate of undetected breast cancer in vulnerable communities. What if the demographic information is sensitive, but removing it leads to worse outcomes for minority groups?
You also want to be careful of legal and organizational requirements around discrimination. Part of you feels that ignoring these attributes and treating all patients equally without considering racial identity is the safest option. You also acknowledge that social determinants of health can produce disparate outcomes for patients despite equal treatment — including factors like lack of access to healthy food, the absence of safe exercise spaces, and unconscious racial bias when seeking healthcare.
So what should you do?
You should take the quiz! Along the way, you’ll answer questions about your risk tolerance for false positives vs. false negatives, whether the algorithm should be aware or unaware of sensitive demographic data when making predictions, and whether you prefer the algorithm to prioritize equal treatment or equal health outcomes for patients. For this scenario, the quiz will recommend the Equalized Odds fairness metric, along with an explanation of why it’s useful for your specific case.
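For reference, equalized odds asks that the model’s true positive rate and false positive rate be (approximately) equal across groups, which is why it fits a setting where both missed cancers and unnecessary screenings matter. Below is a minimal sketch of that check using scikit-learn’s confusion matrix; the screening labels, predictions, and group labels are hypothetical and only illustrate the calculation.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def group_rates(y_true, y_pred):
    """Return (true positive rate, false positive rate) for one group."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return tpr, fpr

def equalized_odds_gaps(y_true, y_pred, groups):
    """Largest TPR gap and FPR gap across groups; both are ~0 under equalized odds."""
    tprs, fprs = [], []
    for g in np.unique(groups):
        mask = groups == g
        tpr, fpr = group_rates(y_true[mask], y_pred[mask])
        tprs.append(tpr)
        fprs.append(fpr)
    return max(tprs) - min(tprs), max(fprs) - min(fprs)

# Hypothetical screening outcomes (1 = cancer) and model recommendations (1 = mammogram).
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 1])
groups = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

tpr_gap, fpr_gap = equalized_odds_gaps(y_true, y_pred, groups)
print(f"TPR gap: {tpr_gap:.2f}, FPR gap: {fpr_gap:.2f}")
```

In this toy data the two groups have the same false positive rate but very different true positive rates, so the tool would catch fewer cancers in one group even though it “bothers” both groups equally with unnecessary screenings; that is exactly the kind of gap equalized odds is designed to surface.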