Partnership with the Duke MIDS Program: Evaluating Fairness in Credit ML Models

2nd Order Solutions
9 min read · Jul 16, 2024


Authors: (2OS) Michael Sarkis and Jennifer Kreizenbeck; (Duke MIDS) Genesis Qu, Pragya Rahuvanshi, Yiyu Lin, and Yinting Zhong

2nd Order Solutions (2OS) has partnered with the Duke Master in Interdisciplinary Data Science (MIDS) program for the past two years, engaging second-year MIDS students on capstone projects that explore and investigate real-world problems. During the 2022/2023 academic year, 2OS worked with a team of four MIDS students on an empirical comparison of Explainable Boosting Machines to the more commonly used Gradient Boosting Machines.¹

2OS recently wrapped up its 2023/2024 academic year partnership with the Duke MIDS program, which focused on exploring and assessing fairness methodologies in machine learning (ML) across a variety of open-source packages. The goal of the project was to compare existing, openly available packages for fairness in ML across the following dimensions: metrics, usability, interpretability, bias mitigation, and generalizability. After working through the initial project framework with a set of 2OS experts, the Duke MIDS students framed their project by developing a comparison methodology to identify a recommended fairness pipeline.

“To solve this problem, we iterated through open-source ML packages that address fairness on datasets we know to have biases and compared the packages for their interpretability and efficacy. We identified disparate impact as a key metric to measure how much less likely the underrepresented groups are to receive a positive outcome. Using this metric, we tested the efficacy of individual packages’ methods of mitigating bias. AIF 360 stood out for its variety of mitigation methods and efficacy. Additionally, to control for confounding variables in the data, we proposed matching as a solution to clean data before training. Using AIF 360 and matching, 2OS now has a framework to improve its products by evaluating and reducing bias in its machine learning algorithms.” — Duke MIDS team

¹ This project was performed in two parts. Part one was completed by the MIDS students, and part two was completed during summer 2023 by two 2OS interns. See a previous 2OS blog post on findings from this two-part project.

Fairness in Lending Overview

Fairness in machine learning is the process of detecting, understanding, and rectifying algorithmic bias in a machine learning model or system. Algorithmic bias may occur on the basis of gender, race and ethnicity, disability, or other protected classes.² Since many of our models deal with who receives a specific credit product, it is incredibly important to ensure fairness in the model outcomes across protected class features. The financial services market is highly regulated and has explicit fairness requirements for models used in lending. Through the Equal Credit Opportunity Act (ECOA),³ U.S. law protects consumers by prohibiting unfair and discriminatory practices, including unfair product offers.

² Starting point references: https://arxiv.org/pdf/2206.04101.pdf and https://arxiv.org/pdf/2210.02516.pdf

³ See additional details from the Office of the Comptroller of the Currency (OCC) here: https://www.occ.treas.gov/topics/consumers-and-communities/consumer-protection/fair-lending/index-fair-lending.html.

Packages

Duke MIDS students evaluated open-source packages that quantify algorithmic bias in models. To ensure fair comparison across packages, the students defined fairness as “an equal opportunity to obtain a positive outcome for both the underprivileged and privileged groups.” The evaluation of the packages was based on two publicly available data sets:

  1. Taiwanese Credit Card Dataset
  2. Census Data

Comparison across packages was based on the development of a “biased” model (i.e., a model without any bias mitigation methodologies). This “biased” or “unmitigated” model was assessed across nine open-source Python packages, where disparate impact was evaluated to quantify discrepancies in model predictions. The identified biases were then mitigated using tools and techniques available in the packages, and the performance of each bias mitigation method was assessed against the “unmitigated” model.

The nine Python packages that were assessed and compared:

  • AI Fairness 360 (AIF 360): Developed by IBM, this package includes a comprehensive suite of algorithms categorized as preprocessing (applied before model training), in-processing (applied during model training), and post-processing (applied post-hoc, modifying model outputs and predictions).
  • DALEX: Developed by MI2.AI, this package offers model-agnostic tools to understand model performance, conditional effects of variables, and variable importance; however, it does not include specific tools to identify fairness deficiencies or mitigate bias.
  • Deon: Developed by DrivenData, this package provides an “ethics checklist” via a command-line tool to evaluate biases. However, the checklist’s main goal is not automated bias calculation and mitigation, but rather a concerted effort to engage and prompt data scientists and analysts in discussion.
  • Fairlearn: Developed by Microsoft, this package offers tools to evaluate and visualize fairness metrics to help users understand and address potential biases in their models. It also includes bias mitigation methods that can target bias in individual features or in multiple features at once.
  • fairness-in-ml: This package, with a codebase developed by Xebia, uses adversarial training with generative adversarial networks (GANs) to enforce independence of predictions from sensitive attributes.
  • Responsible AI toolbox: Developed by Microsoft, this package includes a comprehensive collection of tools for understanding AI systems, supporting better decision-making and more responsible development and management of AI models. However, the toolkit does not include tools to mitigate bias.
  • Smclarify: Developed by Amazon, this package works within Amazon’s proprietary Amazon Web Services (AWS) SageMaker suite to detect bias and provide model explainability tools. Its primary focus is model transparency for debugging and improvement; it does not offer direct solutions for bias mitigation.
  • Themis-ML: This package utilizes pandas and scikit-learn to implement various bias mitigation algorithms. The documentation is sparse, which can make the package difficult to use.
  • PiML: Developed by a team at Wells Fargo, this package offers model interpretability methods and fairness tools by integrating the solas-ai package.

While each evaluated Python package had its individual merits and appropriate use cases, the Duke MIDS team identified AIF 360 as the best performing of the evaluated packages. The team’s overall goal was to assess both bias identification and mitigation methods, which AIF 360 provides. Not all assessed packages included mitigation tools, and some only tangentially addressed bias as part of a larger explainable machine learning framework. AIF 360’s comprehensiveness, versatility, and clear documentation made it the top performing package.
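As a minimal sketch of the AIF 360 workflow described above (assuming a pandas DataFrame with hypothetical binary columns “approved” for the label and “married” for the protected attribute, and a hypothetical file name), disparate impact can be measured before and after a preprocessing mitigation such as Reweighing:

```python
# Minimal AIF 360 sketch: measure disparate impact, then apply Reweighing.
# The file name and column names ("approved", "married") are hypothetical placeholders.
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing

df = pd.read_csv("credit_data.csv")  # numeric/binary columns assumed

# Wrap the DataFrame in AIF 360's dataset abstraction
dataset = BinaryLabelDataset(
    df=df,
    label_names=["approved"],
    protected_attribute_names=["married"],
    favorable_label=1,
    unfavorable_label=0,
)

privileged = [{"married": 1}]
unprivileged = [{"married": 0}]

# Disparate impact of the unmitigated data
metric = BinaryLabelDatasetMetric(
    dataset, unprivileged_groups=unprivileged, privileged_groups=privileged
)
print("Disparate impact (before):", metric.disparate_impact())

# Preprocessing mitigation: reweight instances to balance outcomes across groups
rw = Reweighing(unprivileged_groups=unprivileged, privileged_groups=privileged)
dataset_transf = rw.fit_transform(dataset)

metric_transf = BinaryLabelDatasetMetric(
    dataset_transf, unprivileged_groups=unprivileged, privileged_groups=privileged
)
print("Disparate impact (after):", metric_transf.disparate_impact())
```

A mitigated model would then be trained on the transformed dataset (using its instance weights), and its predictions compared back to the unmitigated baseline.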

Evaluation

Fairness was evaluated based on both disparate impact and balanced accuracy, where the former quantifies bias (i.e., the ratio between the proportions of each group receiving the positive outcome) and the latter assesses model performance. Mitigating bias comes at a cost, as bias mitigation methods often detrimentally affect model performance. From a compliance and ethical standpoint, bias must be mitigated; from a practical and business standpoint, model performance should not be significantly degraded. This evaluation therefore weighs the level of mitigation achieved against the impact on model performance.

Disparate impact is highlighted in the Consumer Compliance Handbook from the Board of Governors of the Federal Reserve System.

Disparate Impact = P(Y = 1 | D = unprivileged) / P(Y = 1 | D = privileged)

Equation 1: Disparate impact formula

In Equation 1, Y is the target, where “1” represents a positive outcome. D is the protected attribute class, where “unprivileged” is typically the minority class and “privileged” is the majority class.

Table 1: Example scenario for loan approve/decline, with a protected class for married versus other marital status loan applicants

The model predicts loan approval for 90% of married applicants versus 67% of applicants in other marital status groups, which results in a calculated disparate impact of 0.74. This indicates a positive bias towards married applicants and an adverse impact on applicants in other marital status groups.

Disparate Impact = 0.67 / 0.90 ≈ 0.74

Equation 2: Example disparate impact calculation

A key component when assessing disparate impact is to also consider the relative sizes of each group, to avoid falling into the trap of Simpson’s Paradox⁴ (also known as the amalgamation paradox).

As a toy example, consider the scenario of credit card applications for two card products. Suppose that for Card A, 80% of women are approved for a card, whereas only 60% of men are approved. Meanwhile, for Card B, 43% and 42% of women and men are approved, respectively. This would suggest significant bias in favor of women for Card A, and similar approval rates for Card B; however, the number of women and men applicants has not yet been considered. In this scenario, only 75 women applied for Card A, while there were 500 applications from men. Card B had approximately equal numbers (~300) of applications from each group. When approvals for both cards are combined, the overall approval rates are about 50% for women and 53% for men. Aggregating the data changes the conclusion one would draw about bias in card approval rates.
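A quick back-of-the-envelope check of the numbers above (using the approximate applicant counts stated in the example) shows how the pooled rates reverse the per-card picture:

```python
# Simpson's paradox check using the approximate counts from the example above.
# Each entry: (women_applications, women_approval_rate, men_applications, men_approval_rate)
applications = {
    "Card A": (75, 0.80, 500, 0.60),
    "Card B": (300, 0.43, 300, 0.42),
}

women_approved = women_total = men_approved = men_total = 0
for card, (w_apps, w_rate, m_apps, m_rate) in applications.items():
    women_approved += w_apps * w_rate
    women_total += w_apps
    men_approved += m_apps * m_rate
    men_total += m_apps
    print(f"{card}: women {w_rate:.0%}, men {m_rate:.0%}")

# Pooled rates: women ~50%, men ~53% -- the per-card advantage for women disappears.
print(f"Pooled: women {women_approved / women_total:.0%}, "
      f"men {men_approved / men_total:.0%}")
```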

⁴ Simpson’s paradox is a phenomenon observed in which trends or patterns appear in segmented data but can disappear or reverse when data is aggregated.

Balanced accuracy is calculated as the mean of the sensitivity (i.e., true positive rate) and the specificity (i.e., the true negative rate) of the model output, assuming a categorical response. The goal is to ensure that both the majority and minority classes are evaluated with equal importance.

For example, suppose a logistic regression model predicts whether a customer will be approved or declined for a loan (see Table 2). The balanced accuracy metric provides intuition surrounding the rates of “correct” loan approvals and declines.

Table 2: Example scenario for loan approve/decline

To calculate balanced accuracy:

Equation 3: Example balanced accuracy calculation

The closer the balanced accuracy is to one, the better the model’s classification performance. In this case, the balanced accuracy is 0.79, which would indicate fairly good performance. Balanced accuracy would be assessed across multiple dimensions and segments to ensure adequate model performance.
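As a small illustration (the rates and labels below are hypothetical, not the actual Table 2 values), balanced accuracy can be computed directly from the true positive and true negative rates, or from labels and predictions with scikit-learn:

```python
# Balanced accuracy = mean of sensitivity (TPR) and specificity (TNR).
# The rates and labels below are hypothetical, chosen only to illustrate the calculation.
from sklearn.metrics import balanced_accuracy_score

def balanced_accuracy(tpr: float, tnr: float) -> float:
    """Mean of the true positive rate and true negative rate."""
    return 0.5 * (tpr + tnr)

print(balanced_accuracy(tpr=0.84, tnr=0.74))  # 0.79

# Equivalently, from true labels and predictions:
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(balanced_accuracy_score(y_true, y_pred))  # (2/3 + 2/3) / 2 ≈ 0.67
```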

Practical Implications

The work from the Duke MIDS students provided valuable first-hand experience with many popular fairness Python packages for bias identification. The project also further highlights the complexity and nuance associated with bias identification and mitigation, as there are so many dimensions to measuring fairness that a single metric (or even a handful of metrics) may not be able to provide sufficient coverage.

Comparison across various bias mitigation methods from AIF 360 demonstrates the need to understand their nuances and to assess bias across multiple dimensions. Figure 1 shows the impact of different methodologies on the two metrics, balanced accuracy and disparate impact. While all of these methods can potentially mitigate bias, they do not all work with the same level of efficacy. For example, while the Disparate Impact Remover reduces the disparate impact, it does have some consequences for balanced accuracy. Meanwhile, the reweighting technique maintains the balanced accuracy but shows an increase in disparate impact.

Figure 1: AIF 360 bias mitigation method performance (Source: Duke MIDS)
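One way to produce this kind of comparison (a hedged sketch, not the MIDS team’s exact code: it assumes AIF 360 BinaryLabelDataset objects holding the true labels and each candidate model’s predictions, with the same hypothetical “married” attribute as earlier) is to compute both metrics for every candidate model with AIF 360’s ClassificationMetric:

```python
# Sketch: tabulate balanced accuracy and disparate impact for several candidate models.
from aif360.metrics import ClassificationMetric

def compare_mitigation_methods(dataset_true, predictions_by_method,
                               unprivileged, privileged):
    """Print balanced accuracy and disparate impact for each candidate model.

    predictions_by_method maps a method name (e.g., "unmitigated", "reweighing")
    to a BinaryLabelDataset containing that model's predicted labels for the
    same rows and protected attributes as dataset_true.
    """
    for name, dataset_pred in predictions_by_method.items():
        cm = ClassificationMetric(
            dataset_true,
            dataset_pred,
            unprivileged_groups=unprivileged,
            privileged_groups=privileged,
        )
        balanced_acc = 0.5 * (cm.true_positive_rate() + cm.true_negative_rate())
        print(f"{name}: balanced accuracy = {balanced_acc:.3f}, "
              f"disparate impact = {cm.disparate_impact():.3f}")
```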

Financial institutions have a legal and ethical responsibility to ensure fairness in lending models, which requires both a deep understanding of credit products and an understanding of the second order effects associated with the use of bias and fairness metrics and mitigation methods. While two metrics are noted in this post, there are many others, including conditional metrics based on protected attributes. Depending on the type of model and the protected attributes potentially impacted, a bespoke framework for identifying and mitigating bias is often required.

Do you have questions? We’d love to answer them!

Contact the authors by email.

Interested in 2OS insights?

Check out the 2OS Insights page, where 2OS has shared industry insights through white papers, case studies, and more!


Written by 2nd Order Solutions

A boutique credit advisory firm providing credit risk & data science consulting services from top 10 banks to fintech startups https://2os.com/