Model Degradation in Credit Risk

2nd Order Solutions
Jun 18, 2024


Author: Harry Shi

Over the past few decades, the credit risk industry has undergone a significant transformation driven by advances in machine learning. Notably, the transition from traditional logistic regression to machine learning models such as gradient boosting machines (GBMs) has revolutionized credit risk assessment, enhancing its efficiency, effectiveness, and reliability across all phases of the credit life cycle. In recent years, there has been a growing emphasis on ensuring the ongoing performance of these models, which brings us to today's topic: model degradation.

Why is it important to understand model degradation?

In the credit industry, model degradation refers to the performance decline of models used to predict important events, such as default. While these models may initially demonstrate exceptional performance during development, testing, and early deployment, degradation becomes inevitable over time. It can manifest suddenly due to unforeseen events such as the COVID-19 pandemic, or it can occur gradually as customer behavior or market conditions shift.

Regardless of the pace, overlooking model degradation can have significant consequences, including sub-optimal risk assessments, increased credit losses, and heightened compliance scrutiny. It is therefore crucial for financial institutions and fintech companies to actively monitor and manage model degradation, ensuring the effectiveness and reliability of their credit risk assessment processes.

What are the types of model degradation?

Now that we understand the importance of addressing model degradation in the credit industry, let’s explore the major types of degradation through an example scenario. During the COVID-19 pandemic, an (imaginary) bank, Lockdown Financial (LF), relied on a pre-COVID machine learning credit risk model to assess the creditworthiness of loan applicants as usual. The model examined various factors, such as debt-to-income (DTI) ratio and FICO score, to predict the likelihood of a borrower defaulting on a loan.

However, as the pandemic worsened, the government's relief efforts, such as stimulus checks, expanded unemployment benefits, and forbearance on existing loan payments, began to alter the credit landscape. Customers who would typically be categorized as high risk due to their large existing debts suddenly appeared financially healthier. Their DTI ratios and FICO scores improved, not because their regular incomes had increased or their spending habits had changed, but because of temporary relief measures.

The credit risk model, trained on pre-pandemic data, interpreted the improved financial metrics (lower DTI ratios and higher FICO scores) as indicators of improved creditworthiness. Consequently, the model's scores for potential borrowers began to drift upwards. Encouraged by the model's assessment, LF ramped up its loan originations. However, as the pandemic faded and economic conditions normalized, the impact of these government interventions diminished. Many of the new borrowers, whose financial health was not as robust as their pandemic-era data suggested, began defaulting at a much higher rate than predicted. LF faced a sudden surge in loan defaults, leading to substantial losses.

Data Drift

LF's analysts realized that they were experiencing model degradation in the form of data drift, which occurs when the distribution of the input data used to train a machine learning model changes over time. The government aid during the pandemic shifted the credit profiles of potential loan applicants, inflating their FICO scores and lowering their DTI ratios. The model, trained on pre-pandemic data, was suddenly processing new data that differed significantly from its training data. As a result, it inaccurately predicted lower risk for new loan applications: a model built to represent one distribution of credit profiles could no longer accurately assess the credit risk of a very different set of credit profiles.

While the data drift in this case was primarily due to changes in customer credit profiles influenced by temporary government policies, data drift can also be caused by internal factors such as adjustments in data collection methods, updates to data processing pipelines, changes in customer demographics at the acquisition stage, and variations in product offerings.

Concept Drift

Upon deeper analysis, LF's analysts discovered that the issue was not merely a change in the distribution of the input data. The functional relationship between the input features (debt and financial health metrics) and the output (default probability) no longer held. Previously, high DTI ratios and low FICO scores reliably signaled higher risk. Under pandemic conditions, however, these metrics ceased to be reliable predictors of default, even for applicants with stable debt levels or increased incomes. With incomes boosted and debt payments deferred, the analysts noticed a sign flip when they refitted the model: a high DTI ratio suddenly appeared less risky, likely because borrowers with high DTI ratios received more substantial government aid. This sign flip illustrates concept drift, where the fundamental relationships the model had learned are no longer valid in the new context.

This type of model degradation typically arises from changes that are external to the model and often beyond the organization’s control. These changes can include shifts in consumer behavior, macroeconomic conditions, credit regulatory changes, or unexpected events such as global pandemics or natural disasters.

How do we detect model degradation?

Unsurprisingly, model monitoring is vital for detecting model degradation and addressing potential issues before they affect decisions that rely on the model. From a statistical perspective, drift detection typically involves several stages: retrieving historical and new data, measuring the dissimilarity between the two datasets, calculating a test statistic, and finally performing a hypothesis test to evaluate the statistical significance of any observed change.

While this framework provides a structured approach, drift detection methods in the credit industry often adopt more straightforward strategies that do not always require a full hypothesis test. Instead, industry benchmarks and best practices are frequently used to identify deviations from expected performance, helping ensure timely detection of and response to model degradation in dynamic credit environments.
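To make these stages concrete, here is a minimal sketch that runs a two-sample Kolmogorov-Smirnov test on a single continuous feature using SciPy. It is an illustration, not a production monitoring job; the column name dti, the file paths, and the 0.05 significance level are hypothetical choices.

```python
# A minimal sketch of the drift-detection workflow described above, assuming two
# pandas Series holding the same feature: one from the training era, one recent.
import pandas as pd
from scipy import stats

def detect_feature_drift(historical: pd.Series, recent: pd.Series, alpha: float = 0.05) -> dict:
    """Two-sample Kolmogorov-Smirnov test comparing historical and recent samples."""
    # Stages 1-2: retrieve the two samples and measure their dissimilarity
    statistic, p_value = stats.ks_2samp(historical.dropna(), recent.dropna())
    # Stages 3-4: the KS statistic is the test statistic; the p-value drives the hypothesis test
    return {
        "ks_statistic": statistic,
        "p_value": p_value,
        "drift_detected": p_value < alpha,  # reject H0 ("same distribution") at level alpha
    }

# Hypothetical usage: compare debt-to-income ratios at model build time vs. today
# historical_df = pd.read_parquet("training_snapshot.parquet")
# recent_df = pd.read_parquet("current_month.parquet")
# print(detect_feature_drift(historical_df["dti"], recent_df["dti"]))
```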

Detecting Data Drift

The simplest method to detect data drift between historical and current features is to compare basic statistics of the two datasets, but more advanced statistical methods can provide more robust results. For categorical features, the Chi-Squared test and Fisher's Exact test can be employed to detect data drift, while for continuous features, two-sample tests and divergence measures such as the Kolmogorov-Smirnov test and the Population Stability Index (PSI) are common. PSI quantifies the extent of distribution change between two samples collected at different times and is therefore widely used in credit risk assessment to monitor variable stability. A high PSI value suggests data drift, with thresholds typically varying across institutions.
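For a categorical feature, one simple way to apply the Chi-Squared test is to build a contingency table of category counts from the two samples and pass it to SciPy. The sketch below assumes two pandas Series of the same feature; the example feature (loan purpose) is hypothetical.

```python
# A minimal sketch of a categorical drift check, assuming two pandas Series of the
# same categorical feature (e.g., loan purpose) from the training sample and today.
import pandas as pd
from scipy import stats

def categorical_drift_detected(historical: pd.Series, recent: pd.Series, alpha: float = 0.05) -> bool:
    """Chi-Squared test of homogeneity between historical and recent category counts."""
    counts = pd.DataFrame({
        "historical": historical.value_counts(),
        "recent": recent.value_counts(),
    }).fillna(0)
    # chi2_contingency expects a table of observed counts (rows: samples, columns: categories)
    chi2, p_value, dof, _expected = stats.chi2_contingency(counts.T)
    return p_value < alpha  # True indicates a statistically significant shift
```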

As shown by the illustrative case below, PSI essentially compares the expected distribution (at the time of model building) with the actual distribution (the currently monitored sample). This comparison quantifies the distribution change, as evidenced by the difference between the monitored sample and the model-build sample.

Figure 1: Illustrative case of distribution comparison between expected and actual (Source: 2OS Internal)
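A minimal PSI computation consistent with this comparison could look like the sketch below, where PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%). The decile binning, the 1e-6 floor, and the rule-of-thumb thresholds in the final comment are assumptions for illustration, not a standard mandated by any institution.

```python
# A minimal sketch of a Population Stability Index calculation, assuming the
# "expected" sample comes from model build time and the "actual" sample is the
# current monitored population.
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%)."""
    # Bin edges come from the expected (model-build) sample, e.g., deciles
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    # Clip the actual sample into the historical range so out-of-range values land in the end bins
    actual_pct = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual)
    # Floor the proportions to avoid division by zero or log(0) in sparse bins
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Common rule of thumb (varies by institution): PSI < 0.1 stable, 0.1-0.25 monitor, > 0.25 investigate
```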

Detecting Concept Drift

For detecting concept drift, monitoring model performance metrics over time against a predefined threshold is common practice. Metrics may include ROC/AUC, precision/recall, Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE), Mean Squared Error (MSE), and so on. If concept drift occurs, deterioration in these performance metrics acts as an alert.

For instance, a significant drift would result in increased error metrics such as MAE and MSE. When the model consistently delivers unreliable or less accurate results over time compared to benchmarks, coupled with changes in performance metrics, it usually signals concept drift. It is also worth noting that, in the case of logistic regression, refitting the model to assess how much the coefficients have changed can help detect concept drift.
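To make the metric-tracking idea concrete, here is a minimal sketch that computes AUC by monitoring period and flags periods that fall below the development benchmark. The column names and the 0.05 tolerance are hypothetical choices, not a standard threshold.

```python
# A minimal sketch of performance-based concept drift monitoring, assuming a
# scoring table with realized defaults and a monitoring period column.
import pandas as pd
from sklearn.metrics import roc_auc_score

def flag_concept_drift(scored: pd.DataFrame, baseline_auc: float, tolerance: float = 0.05) -> pd.DataFrame:
    """Compute AUC per monitoring period and flag periods below baseline_auc - tolerance."""
    results = []
    for period, group in scored.groupby("period"):
        auc = roc_auc_score(group["defaulted"], group["model_score"])
        results.append({"period": period, "auc": auc, "alert": auc < baseline_auc - tolerance})
    return pd.DataFrame(results)

# Hypothetical usage with columns: period (e.g., "2021-03"), model_score, defaulted (0/1)
# alerts = flag_concept_drift(scored_loans, baseline_auc=0.78)
```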

Monitoring Portfolio Performance

Beyond model performance metrics, monitoring risk metrics such as delinquency rates, roll rates, charge-off rates, and first-pay default rates, alongside revenue performance metrics like cash-on-cash rates, is crucial for detecting model degradation.

More importantly, the monitoring team should always validate the accuracy and consistency of the model output over time. This involves verifying that the rank ordering¹ of the model's predictions remains stable across different time periods: specifically, the highest-risk decile identified by the model should consistently exhibit the highest realized risk. If this rank ordering deteriorates over time, it could indicate concept drift or other forms of model degradation.

For instance, in the case of model degradation illustrated below, the ranking order of the model output no longer holds true. This deviation from the expected pattern indicates model deterioration and underscores the importance of ongoing monitoring and validation efforts in degradation detection of credit risk models.

¹ You can check out the explanation of rank ordering techniques in our logistic regression blog post.

Figure 2: Illustrative case of model degradation (Source: 2OS Internal)
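A simple rank-ordering check can be scripted by bucketing accounts into score deciles and testing whether realized default rates stay monotonic across them. The sketch below assumes a scores Series and a defaults Series (0/1) sharing the same index; the decile count is a convention, not a requirement.

```python
# A minimal sketch of a rank-ordering check across score deciles.
import pandas as pd

def rank_ordering_holds(scores: pd.Series, defaults: pd.Series, n_buckets: int = 10) -> bool:
    """Return True if realized default rates are monotonic across score deciles."""
    # Assign each account to a score decile (duplicates="drop" guards against tied scores)
    deciles = pd.qcut(scores, q=n_buckets, labels=False, duplicates="drop")
    default_rate_by_decile = defaults.groupby(deciles).mean()
    diffs = default_rate_by_decile.diff().dropna()
    # Monotonic in either direction, depending on whether a higher score means higher or lower risk
    return bool((diffs <= 0).all() or (diffs >= 0).all())
```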

How can effective model governance mitigate model degradation?

There are methods that help prevent model degradation from happening in the first place, and data quality is the priority. The training data should accurately reflect real-world scenarios and go through proper cleaning and preprocessing to account for anomalies such as duplicates, missing values, lagged data, incorrect data, and outliers. Other common methods, such as synthetic data generation, can make machine learning models more resilient to data drift by ensuring a more representative range of training data.

However, despite these preventive measures, model degradation over time is inevitable, and this is where effective model governance becomes crucial. Given the high cost of rebuilding a model in response to any degree of degradation, model governance should clearly categorize all models by priority tier and establish threshold levels for actions such as refitting, refreshing, or completely rebuilding the model, as sketched after the list below.

  • Refitting: retrain the model on new production data.
  • Refreshing: retrain the model with data that is not entirely new.
  • Rebuilding: tear down the model and construct it again with an entirely new set of features or targets, particularly if significant changes have occurred in the business objectives or environment.
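As a purely illustrative sketch of how such thresholds might be wired into a governance policy, the function below maps monitoring signals to one of these actions. The cutoffs and the mapping from signal severity to refresh, refit, or rebuild are hypothetical; real tiers and thresholds are set by each institution's model risk management framework.

```python
# A hypothetical, illustrative mapping from monitoring signals to governance actions.
# The thresholds below are placeholders, not recommended policy values.
def recommended_action(psi: float, auc_drop: float) -> str:
    """Translate drift and performance signals into a governance action tier."""
    if psi > 0.25 or auc_drop > 0.10:
        return "rebuild"   # severe shift: reconstruct with new features or targets
    if psi > 0.10 or auc_drop > 0.05:
        return "refit"     # material shift: retrain the model on new production data
    if auc_drop > 0.02:
        return "refresh"   # mild decay: retrain with data that is not entirely new
    return "monitor"       # no action beyond routine monitoring

# Hypothetical usage:
# print(recommended_action(psi=0.18, auc_drop=0.03))  # -> "refit"
```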

Do you have questions? We’d love to answer them!

Contact the authors by email at:

Interested in 2OS insights?

Check out the 2OS Insights page, where 2OS has shared industry insights through white papers, case studies, and more!

Additional Reads

  • J. Lu, A. Liu, F. Dong, F. Gu, J. Gama and G. Zhang, “Learning under Concept Drift: A Review,” in IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 12, pp. 2346–2363, 1 Dec. 2019, doi: 10.1109/TKDE.2018.2876857.


Written by 2nd Order Solutions

A boutique credit advisory firm providing credit risk and data science consulting services to clients ranging from top 10 banks to fintech startups. https://2os.com/
