# Logistic Regression in the Credit Risk Industry

**Author**: Katie Shewell

**What is logistic regression and why should we care?**

Logistic regression is a widely used model-building algorithm in the credit risk industry. It is a statistical model that predicts the likelihood of an event occurring, such as a customer defaulting on a loan. Although simple in comparison to machine learning (ML) algorithms such as tree-based models or feedforward neural networks, logistic regression’s appeal stems from its high explainability, which can be lost in more complex ML algorithms.

Sometimes simpler is better when regulations are at play, as they are in the credit industry. When an application is rejected, the lender must specify adverse action reasons (the reasons for decline), so it is vital to have a transparent model. Logistic regression makes it easy to pinpoint which variables contributed most to an applicant’s risk score and to interpret the impact of those variables on the prediction. More complex ML algorithms¹ may depend on *post-hoc* explainability methods for decline reasons. In fact, the Consumer Financial Protection Bureau (CFPB) has cautioned that validating the accuracy of such explanations “may not be possible with less interpretable models” when relying on *post-hoc* explanation methods.

¹ Note: there is an area of research on inherently interpretable models, which has the potential to offer the best of both worlds (performance and explainability), but we focus on the tried-and-true logistic regression for this blog post.

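Logistic regression uses log odds to classify a binary outcome (0 or 1, “yes” or “no”, etc.). A linear combination of the input features (a.k.a. the linear regression equation) gives the log odds, and applying the logistic (sigmoid) function, which is the inverse of the logit, forces the resulting value to fall between 0 and 1. This value can be interpreted as the probability of the positive event occurring. The logistic regression equation is given below:

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}$$

where $p$ is the predicted probability of the positive event, $x_1, \dots, x_k$ are the input features, $\beta_0$ is the intercept, and $\beta_1, \dots, \beta_k$ are the feature coefficients.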

Typically, a decision threshold is set to determine whether the observation should be classified as a “yes” or a “no”. In the case of determining risk, the resulting probability is often used as a risk score to segment applicants into different risk buckets. More on this later!
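To make the scoring step concrete, here is a minimal Python sketch that turns a linear combination of features into a probability and then applies a decision threshold. The intercept, coefficients, feature names, and the 0.5 cutoff are all hypothetical values chosen for illustration, not values from a real model.

```python
import math

def sigmoid(z: float) -> float:
    """Map a log-odds value z to a probability in (0, 1), in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(intercept: float, coefs: list, features: list) -> float:
    """Probability of the positive event for one applicant."""
    log_odds = intercept + sum(b * x for b, x in zip(coefs, features))
    return sigmoid(log_odds)

# Hypothetical model: [debt-to-income ratio, years of credit history]
INTERCEPT = -2.0
COEFS = [3.0, -0.5]
THRESHOLD = 0.5  # assumed cutoff; lenders choose this based on risk appetite

applicant = [0.4, 2.0]
risk_score = predict_proba(INTERCEPT, COEFS, applicant)
decision = "decline" if risk_score >= THRESHOLD else "approve"
print(f"risk score = {risk_score:.3f}, decision = {decision}")
```

In practice the same probability would usually be kept as a continuous risk score rather than collapsed straight into a yes/no label.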

**Importance of Monotonicity**

The transparency provided by logistic regression is vital in a highly regulated environment like the credit industry. A common burden lenders face when building machine learning models is that the relationship between a predictor and the target variable must be **monotonic**. This means that as the predictor variable increases, the customer’s predicted risk must move in only one direction, either consistently increasing or consistently decreasing. If the relationship between a feature and the target variable is non-monotonic, this can cause confusion or inconsistency when giving reasons for decline. Simpler models such as logistic regression can easily ensure monotonic relationships.

Let’s consider an example. The graph below shows that an applicant is accepted for a loan if their risk score is below 0.5. The lender must give reasons for why an applicant was denied. In this case, an applicant with a debt-to-income ratio (DTI) of 0.1 or 0.3 is accepted for a loan, while an applicant with a DTI of 0.2 is denied. The lender, therefore, cannot claim that the applicant was denied because their DTI is too high or too low. A model that forces a monotonic relationship would prevent such inconsistencies from occurring.

The benefit of a logistic regression model is that there will always be a monotonic relationship between each predictor and the predicted risk score, holding the other predictors fixed, because the score is an increasing function of a linear combination of the features. For example, with a positive coefficient on debt-to-income ratio, you could claim that when the ratio rises, an applicant’s risk score will also rise. This allows for clean cuts in approve or deny decisions. More complex algorithms can enforce monotonicity, but it is typically at the expense of performance, if done during model-fitting, or interpretation, if done in a *post-hoc* manner.
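A quick way to see the difference is to score a grid of DTI values and check whether the scores only ever move in one direction. The sketch below assumes a hypothetical one-feature logistic model (the intercept and coefficient are made up) and compares it with a non-monotonic pattern like the one in the DTI example.

```python
import math

def risk_score(dti: float, intercept: float = -1.5, dti_coef: float = 4.0) -> float:
    """Hypothetical one-feature logistic model: risk as a function of DTI."""
    return 1.0 / (1.0 + math.exp(-(intercept + dti_coef * dti)))

def is_monotonic(values: list) -> bool:
    """True if the sequence never changes direction (all non-decreasing or all non-increasing)."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

# Score DTI values from 0.0 to 1.0 in steps of 0.05
dti_grid = [i / 20 for i in range(21)]
logistic_scores = [risk_score(d) for d in dti_grid]

# Illustrative scores for DTI = 0.1, 0.2, 0.3 from a non-monotonic model:
# the middle applicant is declined (score >= 0.5), the other two are approved
non_monotonic_scores = [0.40, 0.55, 0.45]

print(is_monotonic(logistic_scores))       # True
print(is_monotonic(non_monotonic_scores))  # False
```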

Although these more advanced algorithms with better predictive power exist, logistic regression still manages to be one of the best algorithms for transparency and compliance. So how do we build one?

**Building a Logistic Regression Model**

**1.** **Choose a target variable**

Although this may sound intuitive, choosing a target variable is often an iterative process. The goal of logistic regression, in this case, is to accurately provide a risk score to each applicant. Therefore, the target variable must be a good indication of a “risky” customer. Examples of target variables could be flags for charge-offs or days-past-due. Only a small percentage of customers in the data may have charged off, so it may be beneficial to consider a more balanced target variable. Choosing a target variable is a balance between data availability and the severity of the action to best represent risk. Hence, it is an iterative process of testing multiple target variables to see which one performs the best.

**2.** **Split data into train and test set**

Randomly split the data into two datasets: a train set, which the algorithm learns from, and a test set, which is used to determine how well the model performs on unseen data.
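As a rough illustration, a random split can be sketched with only the Python standard library. The helper name `random_split`, the 80/20 proportion, the fixed seed, and the toy records are assumptions made for this example.

```python
import random

def random_split(rows: list, test_fraction: float = 0.2, seed: int = 42):
    """Shuffle a copy of the rows and split them into (train, test)."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = rows[:]          # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy application records with a hypothetical charge-off flag
applications = [{"id": i, "charged_off": int(i % 10 == 0)} for i in range(100)]
train, test = random_split(applications)
print(len(train), len(test))
```

With a rare target such as charge-offs, it is worth checking that both sets contain enough positive cases after the split.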

**3.** **Check that assumptions are met and preprocess the data**

Use the train data to check that assumptions for building a logistic regression model are met. Other preprocessing steps include:

- Remove / handle nulls from data
- Remove outliers from data
- Encode categorical features
- Normalize or standardize numerical data
- Apply transformations to variables as needed
- Check for interactions between predictor variables
- Narrow down model variables with feature selection
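A few of the steps above can be sketched in plain Python. This example assumes median imputation for nulls, standardization using statistics learned from the train data only, and one-hot encoding for a categorical feature; the function names and the toy income values are hypothetical.

```python
import statistics

def fit_scaler(train_values: list) -> dict:
    """Learn imputation and scaling parameters from TRAIN values only (None marks a null)."""
    observed = [v for v in train_values if v is not None]
    return {
        "median": statistics.median(observed),
        "mean": statistics.fmean(observed),
        "std": statistics.pstdev(observed) or 1.0,  # guard against a constant column
    }

def transform(values: list, params: dict) -> list:
    """Fill nulls with the train median, then standardize with the train mean and std."""
    filled = [params["median"] if v is None else v for v in values]
    return [(v - params["mean"]) / params["std"] for v in filled]

def one_hot(value: str, categories: list) -> list:
    """Encode one categorical value as a 0/1 indicator vector."""
    return [1 if value == c else 0 for c in categories]

train_income = [40_000, 55_000, None, 70_000, 85_000]
params = fit_scaler(train_income)
scaled_income = transform(train_income, params)

housing = one_hot("rent", ["own", "rent", "mortgage"])
print(scaled_income, housing)
```

The same `params` would then be reused to transform the test data, so that no information from the test set leaks into preprocessing.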

**4.** **Build the model**

First, train the model using the preprocessed train data set. Then, use this model to make predictions on the test data set.
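To show what training involves, here is a small from-scratch sketch that fits a logistic regression with batch gradient descent on the log loss and then scores new observations. This is one possible optimizer used for illustration, not necessarily what a production library would do; the learning rate, epoch count, and toy data are assumptions.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(intercept: float, weights: list, x: list) -> float:
    return sigmoid(intercept + sum(w * xj for w, xj in zip(weights, x)))

def fit_logistic(X: list, y: list, lr: float = 0.1, epochs: int = 2000):
    """Fit intercept and weights by batch gradient descent on the average log loss."""
    n, k = len(X), len(X[0])
    intercept, weights = 0.0, [0.0] * k
    for _ in range(epochs):
        grad_b, grad_w = 0.0, [0.0] * k
        for xi, yi in zip(X, y):
            error = predict_proba(intercept, weights, xi) - yi
            grad_b += error
            for j in range(k):
                grad_w[j] += error * xi[j]
        intercept -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return intercept, weights

# Toy train data: one standardized feature (think DTI); 1 = charged off
X_train = [[-1.5], [-1.0], [-0.5], [0.0], [0.5], [1.0], [1.5], [-0.2], [0.2]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1, 0]

intercept, weights = fit_logistic(X_train, y_train)

# Score "test" observations with the trained model
p_low = predict_proba(intercept, weights, [-1.5])
p_high = predict_proba(intercept, weights, [1.5])
print(weights, p_low, p_high)
```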

**5.** **Validate on test data**

Use the predicted outcomes and actual outcomes from the test data to determine how well the model performed. Some common metrics for comparing models include:

- Accuracy: The proportion of correctly predicted observations
- ROC Curve / AUC: The Receiver Operating Characteristic (ROC) Curve and Area Under the ROC Curve (AUC) assess the model’s ability to discriminate between classes when the decision threshold is varied
- Precision / Recall: Precision is the proportion of true positive predictions among all positive predictions, while recall is the proportion of true positives among all actual positive observations
- F1 Score: There is typically a trade-off between precision and recall. F1 Score gives the harmonic mean of the two and is often used when both metrics are equally important
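The metrics above can be computed directly from actual outcomes and predicted probabilities. The sketch below uses made-up test outcomes and scores, a 0.5 threshold for the threshold-based metrics, and the pairwise-ranking definition of AUC (the share of positive/negative pairs where the positive case gets the higher score, with ties counted as half).

```python
def confusion_counts(y_true: list, y_pred: list):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def threshold_metrics(y_true: list, y_prob: list, threshold: float = 0.5) -> dict:
    """Accuracy, precision, recall and F1 at a given decision threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def auc(y_true: list, y_prob: list) -> float:
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up test outcomes (1 = charged off) and model probabilities
y_test = [0, 0, 1, 1, 0, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

metrics = threshold_metrics(y_test, y_prob)
model_auc = auc(y_test, y_prob)
print(metrics, model_auc)
```

Note that AUC does not depend on the threshold, which is part of why it suits problems where the score is used for ranking rather than for a single yes/no cut.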

**6.** **Make tweaks and repeat**

Model building is an iterative process. Try removing insignificant features or adding more features to see if the model performance improves. There is also the option of model tuning by changing the optimization approach (i.e., the solver) or adding a penalty (L1 or L2 regularization). L1 forces insignificant variable coefficients to zero, essentially acting as feature selection. L2 forces variable coefficients to be small but never exactly zero. These hyperparameter tuning methods can help improve model robustness.
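For readers who want to see where the penalty enters, one common way to write the penalized objectives is shown below. Here $y_i$ is the actual outcome for observation $i$, $p_i$ is its predicted probability, $\beta_1, \dots, \beta_k$ are the feature coefficients, and $\lambda \ge 0$ is the penalty strength. The symbol $\lambda$ and this particular scaling are assumptions for illustration; software packages parameterize the penalty in different ways.

$$\text{L1: } \min_{\beta} \; -\sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] + \lambda \sum_{j=1}^{k} |\beta_j|$$

$$\text{L2: } \min_{\beta} \; -\sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] + \lambda \sum_{j=1}^{k} \beta_j^2$$

The first term is the usual log loss; a larger $\lambda$ puts more weight on keeping the coefficients small. The intercept $\beta_0$ is typically left out of the penalty.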

**7.** **Final decision**

Decide which metric(s) are most important for the problem at hand. For instance, when dealing with credit risk, we often prioritize AUC because this helps show how well a model can rank order risk. Use the chosen metric(s) to decide which model to deploy!

**Understanding Model Output and Rank Ordering**

Now that you have chosen a final model, let’s discuss what we can learn from the output and how the probabilities can be used in the credit risk industry.

**Model Output**

After we’ve finished training the logistic regression model, there are two important parts to the summary output. Next to each predictor in the model you will see both a corresponding coefficient and a *p*-value.

**1. Coefficients**

The coefficients for each feature in the model indicate the impact each feature has on the log odds. If the coefficient is positive, that means an increase in the feature’s value will increase the risk score. A negative coefficient will lower the risk probability. The magnitude of the coefficient is indicative of the feature’s importance in making predictions. One thing to note is that the scale of input features can affect these magnitudes (i.e., variables with larger scales can have coefficients with larger magnitudes). Normalizing features before fitting the model can ensure that the coefficients can be directly compared to determine importance.
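Because coefficients act on the log odds, exponentiating a coefficient gives an odds ratio, which is often easier to explain. The sketch below assumes two hypothetical standardized features with made-up coefficients, so each odds ratio describes the effect of a one-standard-deviation increase.

```python
import math

# Hypothetical coefficients from a model fit on standardized features
coefficients = {
    "dti_std": 0.8,
    "credit_history_years_std": -0.4,
}

# exp(coefficient) = multiplicative change in the odds per one-unit increase
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

# With comparable scales, rank features by absolute coefficient size
ranked = sorted(coefficients, key=lambda name: abs(coefficients[name]), reverse=True)

for name in ranked:
    print(f"{name}: coefficient = {coefficients[name]:+.2f}, odds ratio = {odds_ratios[name]:.3f}")
```

An odds ratio above 1 means the feature raises the odds of the risky outcome, while a value below 1 means it lowers them.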

**2. P-values**

The *p*-values associated with each feature are calculated using the Wald test. If a *p*-value is less than a chosen significance level (traditionally 0.05), we can conclude that the feature is statistically significant given the presence of the other predictors.
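As a rough sketch of the calculation, the Wald statistic divides a coefficient by its standard error, and the two-sided *p*-value comes from the standard normal distribution. The coefficients and standard errors below are made up, and the helper name `wald_p_value` is hypothetical.

```python
import math

def wald_p_value(coef: float, std_error: float) -> float:
    """Two-sided p-value for H0: coefficient = 0, using a standard normal reference."""
    z = coef / std_error
    # For a standard normal, P(|Z| > |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

p_strong = wald_p_value(coef=0.8, std_error=0.25)  # z = 3.2
p_weak = wald_p_value(coef=0.1, std_error=0.20)    # z = 0.5

for label, p in [("strong feature", p_strong), ("weak feature", p_weak)]:
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{label}: p = {p:.4f} ({verdict} at the 0.05 level)")
```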

**Rank Ordering**

Another important validation technique is determining if the model can correctly rank-order risk. This means that the customers deemed the “riskiest”, or who have the largest risk scores, should see the greatest number of charge-offs. The customers who are the least risky should have the lowest number of charge-offs. The figure below helps illustrate this process. The probabilities output from the logistic regression model should be broken into deciles. The lowest 10% are in risk bucket 1, the next lowest 10% are in risk bucket 2, and so on. Split the test data into each risk bucket based on the model output and determine the percent of money charged off in each group. If the model is correctly rank-ordering risk, the bar chart should have a sloped effect. This means that each risk bucket is gradually riskier, and as a result has higher charge-offs than the previous bucket.
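The decile check can be sketched as follows. The data here is synthetic and deliberately constructed so that higher scores have more charge-offs; the function names, equal loan balances, and full-balance losses are all assumptions for illustration.

```python
def bucket_charge_off_rates(scores, balances, charged_off_amounts, n_buckets=10):
    """Sort loans by score, cut them into equal-sized buckets, and return
    the share of money charged off in each bucket (lowest scores first)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    rates = []
    for b in range(n_buckets):
        idx = order[b * n // n_buckets:(b + 1) * n // n_buckets]
        total = sum(balances[i] for i in idx)
        lost = sum(charged_off_amounts[i] for i in idx)
        rates.append(lost / total if total else 0.0)
    return rates

def rank_orders(rates) -> bool:
    """True if every bucket is at least as risky as the one before it."""
    return all(a <= b for a, b in zip(rates, rates[1:]))

# Synthetic test set: 100 loans of 1,000 each, scored 0.00 to 0.99.
# Loan i charges off when i % 10 < i // 10, so bucket b loses b out of 10 loans.
scores = [i / 100 for i in range(100)]
balances = [1_000.0] * 100
charged_off = [1_000.0 if i % 10 < i // 10 else 0.0 for i in range(100)]

rates = bucket_charge_off_rates(scores, balances, charged_off)
print([round(r, 2) for r in rates])
print("rank-orders risk:", rank_orders(rates))
```

Real portfolios are noisier than this, so small reversals between neighbouring buckets are usually judged against sample size rather than treated as automatic failures.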

This technique can also be used for model monitoring to determine if the model needs to be re-fit. When higher charge-off rates occur in the lower risk buckets, that is usually a sign that the model cannot properly rank order risk.

**We have ML algorithms … why do we still care about Logistic Regression?**

There is always a trade-off between building a model with high predictive ability and high explainability. In the credit risk industry, lenders often have no choice other than to choose the more interpretable models. While more advanced algorithms can better capture non-linear relationships, logistic regression is arguably one of the most transparent algorithms for classification.

The linear nature of logistic regression models allows for a straightforward interpretation of the impact each feature has on the final prediction. Alternatives include tree-based models such as random forests or gradient boosting machines (GBMs). These models often perform better but lose some of the inherent interpretability found in logistic regression.

**Do you have questions? We’d love to answer them!**

Contact the author by email at:

**Interested in 2OS insights?**

Check out the 2OS Insights page, where 2OS has shared industry insights through white papers, case studies, and more!