Quantifying Uncertainty in Credit and Lending

2nd Order Solutions
10 min read · Aug 6, 2024


Authors: Mikhal Ben-Joseph, Syed Raza, and Aaron McGuire

Considerations for Conformal Prediction in Lending

Conformal prediction (CP) is a statistical method that has been gaining popularity in various machine learning (ML) applications for its versatility in capturing and calibrating uncertainty. This article gives an overview of CP and dives into potential use cases in credit and lending, ultimately demonstrating that despite its flexibility, the method’s true utility is scenario dependent.

Overview of Conformal Prediction

Calibration

To understand the value proposition of conformal prediction, it’s important to first understand calibration. In the categorical sense (the focus of this article), a model is said to be calibrated if its predicted probabilities match the observed probabilities of a phenomenon; that is, the model’s average prediction for a class (or all classes) equals the actual positive rate for that class (or classes) in the real world. Calibration is not always the priority; in some situations it is more important to preserve the raw model outputs than to adjust them post hoc. More often, though, calibration is highly desirable because it reflects the reliability of a model across the output space.

Even models with high predictive power and good performance can lack calibration. For example, below is a simple XGBoost model for predicting default outcomes in the Kaggle Lending Club dataset. The model has a high AUC but isn’t exceptionally well calibrated for defaults. The yellow dashed line in the right-hand chart of Figure 1 demonstrates what perfect calibration would look like: the average prediction value exactly matching the proportion of true positives in the real world.

Figure 1: XGBoost performance and calibration for a credit risk dataset (Source: 2OS Internal)
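For readers who want to reproduce this kind of check, below is a minimal Python sketch using synthetic data as a stand-in for the Lending Club features; the model settings are illustrative, not the exact configuration behind Figure 1.

```python
# Hypothetical sketch (not the exact setup behind Figure 1): fit a default model
# on stand-in data and inspect its calibration with a reliability curve.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.calibration import calibration_curve
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Synthetic, imbalanced data standing in for a prepared Lending Club feature matrix.
X, y = make_classification(n_samples=20_000, n_features=20, n_informative=10,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
model.fit(X_train, y_train)

probs = model.predict_proba(X_test)[:, 1]        # predicted default probability
print("AUC:", roc_auc_score(y_test, probs))      # discriminative power

# Perfect calibration means the observed default rate equals the average
# predicted probability within every bin (the dashed line in Figure 1).
obs_rate, avg_pred = calibration_curve(y_test, probs, n_bins=10)
for p, o in zip(avg_pred, obs_rate):
    print(f"avg prediction {p:.2f} -> observed default rate {o:.2f}")
```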

Conformal prediction (CP) can help address calibration problems not by directly altering the model, but rather by making guarantees about how often the true outcome is captured within its predictive outputs. This article discusses calibration and CP in the categorical (classification) sense, but both are applicable in the regression context as well.

From Point Prediction to Prediction Sets

Most ML models output probability-like values indicating the most likely class for a sample. These values should correctly rank-order uncertainty (a heuristic measure), but they are not necessarily calibrated to the observed probabilities. CP transforms heuristic uncertainty into rigorous uncertainty by calibrating the outputs for each class to a holdout group’s real-world class prevalence.

CP performs this calibration by turning point predictions into prediction sets. We will address the technical details in a subsequent section, but the crucial aspect is that the prediction sets come with a theoretical guarantee of coverage, or correctness. A simplified, plain-English example: in an image classification problem for animals, we no longer guess that the single most likely class for a sample is “dog”; instead, we are 90% sure that across all samples, the prediction set we create for any given sample (such as {“dog,” “wolf,” “hyena”}) contains the true answer. This is remarkably useful information when it is important to know the level of certainty in our predictions.

The theoretical coverage guarantee is made even more attractive by the fact that CP can be applied to practically any model (regression or classification, from logistic regression to neural nets) and has minimal assumptions (merely the exchangeability of data points). Moreover, CP is relatively simple to implement, with a variety of established packages and computationally efficient variations.

The Use Case Constraint

While its model-agnostic and distribution-free nature might make CP seem like a silver bullet for any ML challenge, its utility is limited to certain use cases. For CP to be helpful, a prediction region, rather than a single point prediction, must be a useful output. In other words, if you ultimately need to make your single best guess on the animal in the image classification example above, an output set with three potential species is useless, regardless of how certain you are that the correct species is somewhere within that set.

Note: i.i.d. stands for independent and identically distributed (Source: 2OS Internal)

In binary classification, the use-case limitations are self-evident. With only two potential classes (let’s say, Class A and Class B), the prediction set must be one of the following: {Class A}, {Class B}, {Class A, Class B}, or the empty set {}. The set output is useful only when having a third, uncertain group (the “both options” set) and/or an empty set is helpful. Since binary classification is perhaps the most prevalent mode of prediction in credit and lending models (default/no default, fraud/no fraud, etc.), outputting a prediction region may not lead to as many valuable insights as it would in industries where multi-class classification is more common.

In the next section of the article, we will give a brief technical overview of CP. Then, we will explore two case studies for CP in lending, demonstrating the variance in utility.

How Conformal Prediction Works

(Source: 2OS Internal)

Training

Build a model as usual, holding some of the data out as a calibration dataset. There are a few ways to handle the non-training data, but for simplicity we will use the simple split (inductive) method, meaning the model never sees the calibration data until calibration time, when it is processed in full.
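A minimal sketch of this split, again using synthetic stand-in data and an illustrative XGBoost configuration:

```python
# Split-conformal (inductive) setup: carve out a calibration set the model never
# sees during training. Synthetic data stands in for a real lending dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=20_000, n_features=20, n_informative=10,
                           weights=[0.85, 0.15], random_state=0)

# 60% proper training set, 20% calibration set, 20% test set.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
model.fit(X_train, y_train)          # the model never touches X_cal or X_test
```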

Calibration

Run the calibration data through the model and convert the output score into a non-conformity score. This score describes how well the calibration data “fits in” with the training data, and thus reflects uncertainty in the model’s predictions. Choosing a non-conformity score is an important and potentially creative process; the method used here impacts the type of coverage guarantee in the final output and the size of the prediction sets. In the examples to follow, our non-conformity score will be the simple Least-Ambiguous Set-Valued Classifier (LAC), which is [1 — (model score of the true label)].

Next, we sort the scores and identify the (1 — α) quantile of scores, q, where α (alpha) is analogous to a significance level.
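Continuing the sketch above, the LAC scores and the threshold q might be computed as follows; here we use the standard finite-sample-corrected version of the (1 — α) quantile, ⌈(n+1)(1 — α)⌉/n, which common implementations apply.

```python
# Continuing the sketch: LAC non-conformity scores on the calibration set and the
# quantile q_hat that serves as the certainty threshold.
alpha = 0.10                                       # target miscoverage level
cal_probs = model.predict_proba(X_cal)             # shape (n_cal, n_classes)
n_cal = len(y_cal)

# LAC score = 1 - (model score assigned to the TRUE label of each calibration row)
cal_scores = 1.0 - cal_probs[np.arange(n_cal), y_cal]

# (1 - alpha) empirical quantile with the standard finite-sample correction.
q_level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
q_hat = np.quantile(cal_scores, q_level, method="higher")
```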

Set Prediction

When we run the test data through the model, we translate the output for each class into a non-conformity score; any class whose non-conformity score is less than q is included in that sample’s prediction set.

Figure 2: Non-conformity score histogram for LAC conformal prediction

As seen in the image above, as the value of α decreases, our certainty threshold moves to the right, which translates to the inclusion of higher and higher test non-conformity scores for each class. This makes intuitive sense: the more confident we want to be in our prediction for a new test sample, the more classes we include in its set, even when the sample is increasingly dissimilar (“nonconforming”) to what we saw in the training data for each class. This gives us a higher chance of capturing the true class in each sample’s prediction set.

CP’s theoretical guarantee is that, in the long run, the prediction regions contain the true value with probability at least (1 — α).
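Putting the pieces together, the prediction sets and an empirical coverage check might look like this (still a sketch on the stand-in data):

```python
# Continuing the sketch: build LAC prediction sets on the test data and check
# that empirical coverage lands near the (1 - alpha) target.
test_probs = model.predict_proba(X_test)           # shape (n_test, n_classes)

# A class enters a row's set when its non-conformity score (1 - prob) is below q_hat.
prediction_sets = (1.0 - test_probs) <= q_hat      # boolean matrix (n_test, n_classes)

covered = prediction_sets[np.arange(len(y_test)), y_test]
print(f"Empirical coverage: {covered.mean():.3f} (target: at least {1 - alpha:.2f})")
```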

Case Studies for Conformal Prediction in Lending

Binary Classification

The default risk model in Figure 1 (from the Kaggle Lending Club dataset) represents a classic binary classification problem in lending, with 1 being a default and 0 being non-default. If the purpose of this model were to simply predict whether a borrower will default, applying CP would only obfuscate the binary decisioning process by adding in the “both options” and empty sets.

However, let’s say the lender wants to isolate borrowers who the model puts in the “both options” or empty sets from borrowers who receive a single class designation. The empty set indicates the model didn’t have enough information to classify the borrower into any class at the assigned certainty level, so the lender might request additional applicant information or use alternative data sources. The “both options” set indicates the model predicts that defaulting and not defaulting are both likely outcomes, so the lender might leverage second-look products or manual review.
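A purely illustrative sketch of how that routing could be expressed in code (the treatment names are placeholders, not recommendations):

```python
# Purely illustrative routing based on a borrower's CP prediction set, where
# row_set is a boolean pair (class 0 included, class 1 included), such as one row
# of the prediction_sets matrix from the earlier sketch.
def route_application(row_set) -> str:
    in_no_default, in_default = bool(row_set[0]), bool(row_set[1])
    if in_no_default and in_default:
        return "second-look product / manual review"       # "both options" set
    if not in_no_default and not in_default:
        return "request additional applicant information"  # empty set
    return "standard decisioning"                           # confident single-class set

decisions = [route_application(row) for row in prediction_sets]
```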

In this hypothetical case, we want CP to bring the percent charge off (CO) values closer to 0 and 1 for classes 0 and 1, respectively. The “both” and empty groups should be somewhere in the middle.

Figure 3: CP rank orders CO rates by set as expected

As seen above, CP appropriately ranks the CO rates among the sets and has reasonable sample sizes within each group. As we already know, the model had high predictive capabilities, and as such no samples were assigned to the empty set. However, we do care about the “both options” set and intend to treat it differently. Moreover, we now enjoy some guarantees about the overall reliability of our predictions across all classes, which could help reduce operational or decision-making buffers that we might have otherwise had in place.

It’s important to recognize, though, that if we look under the hood in binary cases, we see that CP is simply identifying thresholds directly from the output of the XGBoost model. It’s not picking up on some intrinsic “uncertainty” factor the original model didn’t already know about. As seen in the chart below, which illustrates the clean cuts on the XGBoost output for each CP set, CP simply finds the output range where the model is already uncertain and places a theoretical guarantee on the calibration.

Figure 4: Relationship between predicted probability output and conformal prediction set classifications (Source: 2OS Internal)
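Continuing the earlier sketch, the code below shows why: with the LAC score, binary set membership reduces to two fixed cut-offs on the model’s predicted default probability.

```python
# Continuing the earlier sketch: with the LAC score in a binary problem, CP set
# membership collapses to two fixed cut-offs on p = predicted P(default).
p_default = test_probs[:, 1]
in_default_set = p_default >= 1.0 - q_hat     # class 1 included when 1 - p <= q_hat
in_no_default_set = p_default <= q_hat        # class 0 included (its LAC score is p)
# The "both options" band is 1 - q_hat <= p <= q_hat, which has positive width
# whenever q_hat > 0.5; outside that band, CP returns a single-class set.
```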

Multiclass Classification

The multiclass case is more illuminating. Let’s take the example of a lender’s collections division that wants to send out proactive educational materials to borrowers at high risk of defaulting within a certain period (using the AmEx multiclass Kaggle dataset for analysis). A borrower’s “treatment” will depend on the predicted proximity of default. Borrowers are thus classified into categories: 1, 2, 3, or 0, representing default within 1–6, 7–12, 13–18, or not within 18 months, respectively.

We use CP after fitting and tuning a regular XGBoost model that outputs a single predicted category for each sample. Below is the number of borrowers in the test group that fall into each set output by CP.

Figure 5: Frequency of test samples in each output group by CP (Source: 2OS Internal)
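For reference, here is a hedged sketch of how such a tally could be produced end to end, using synthetic four-class data as a stand-in for the prepared AmEx features and an illustrative (untuned) XGBoost configuration:

```python
# Hypothetical multiclass tally: count how many test borrowers fall into each
# distinct CP prediction set ({0}, {1}, {1, 2}, ...). Synthetic 4-class data
# stands in for the prepared AmEx dataset.
from collections import Counter
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=30_000, n_features=25, n_informative=12,
                           n_classes=4, n_clusters_per_class=1, random_state=1)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=1)
X_cal, X_te, y_cal, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

clf = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="mlogloss").fit(X_tr, y_tr)

alpha = 0.10
cal_scores = 1.0 - clf.predict_proba(X_cal)[np.arange(len(y_cal)), y_cal]   # LAC scores
q_hat = np.quantile(cal_scores,
                    np.ceil((len(y_cal) + 1) * (1 - alpha)) / len(y_cal),
                    method="higher")

test_sets = (1.0 - clf.predict_proba(X_te)) <= q_hat        # boolean (n_test, 4)
print(Counter(tuple(np.flatnonzero(row)) for row in test_sets))  # set frequencies
```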

Industry Implementation

The ability to create prediction regions that maintain rigorous certainty guarantees is valuable in the right circumstances in credit and lending. However, it requires alignment between technical and executive teams on shared expectations and on an understanding of the method’s capabilities.

There are four key steps to making CP an asset in any lending institution:

1. Find the Right Use Case

Not every ML problem should be solved with CP. Select opportunities where set predictions make sense and add value; a good place to start is with multiclass models (which may be uncommon in a financial institution). Then, prioritize models that would benefit most from having rigorous guarantees of their reliability.

2. Understand and Interpret the Guarantee

Operating under theoretical guarantees (such as CP’s) as if they held every time in real life may leave a system vulnerable to unexpected risk. CP’s coverage guarantee is marginal: it holds on average over many exchangeable samples, not for any individual borrower or any particular finite portfolio, so it does not actually “guarantee” a certain outcome in finite lending cases such as default prediction.

3. Choose the Appropriate Adjustments

There is a wide and growing range of available CP tools and research. Data scientists implementing CP should choose methods that balance computational efficiency, utility of outputs, explainability, and performance for the intended purpose. This requires communicating methodology constraints and tradeoffs adequately with upstream decision makers, especially in hierarchical financial institutions.

4. Evaluate and Test for Coverage

As with any ML tool, ongoing evaluation and refinement are key to continued, successful functionality. Use a slate of metrics with a variety of focus areas to establish reliability and coverage benchmarks over time. Document results continuously, noting any corresponding adjustments made to the base model, and embed the evaluation task within the institution’s other model risk management processes for consistency and regulatory purposes.
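As one concrete starting point, the sketch below computes two of the most common CP monitoring metrics, empirical coverage and average set size (efficiency), on any labeled monitoring sample; the function name is illustrative.

```python
# Simple monitoring metrics for a deployed CP layer, given boolean prediction sets
# (rows x classes) and true labels from a recent, labeled monitoring sample.
import numpy as np

def coverage_and_efficiency(prediction_sets: np.ndarray, y_true: np.ndarray):
    covered = prediction_sets[np.arange(len(y_true)), y_true]
    coverage = covered.mean()                          # should sit near or above 1 - alpha
    avg_set_size = prediction_sets.sum(axis=1).mean()  # smaller sets = more informative
    return coverage, avg_set_size
```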

Conclusion

CP is a cutting-edge topic among academics, and the method’s potential is growing. There are a few exceptional resources already available for use, but the key is finding industry- and company-specific use cases that will benefit most from the power of the method.

For credit and lending, these opportunities will almost always be multi-class classification cases, or binary cases transformed as such. Implementing CP successfully in these cases is not merely a matter of executing a methodology, but rather of developing shared understanding of the guarantee and aligning on risk tolerances as well.

Do you have questions? We’d love to answer them!

The experts at 2nd Order Solutions are ready to help you identify opportunities for improving your model capabilities. If we can be of help, please reach out to the authors:

Interested in 2OS insights?

Check out the 2OS Insights page, where 2OS has shared industry insights through white papers, case studies, and more!


2nd Order Solutions

A boutique credit advisory firm providing credit risk & data science consulting services to clients ranging from top 10 banks to fintech startups. https://2os.com/