Enhancing Trust in Credit Risk Models: A Comparative Analysis of EBMs and GBMs

Feb 20, 2024

Authors: Congcong Ma*, Solange Umuhoza*, and Alice Liu

*Note: Congcong and Solange were part of the 2OS intern cohort of 2023 and will be returning to 2OS as part of the 2OS new grad cohort of 2024.

Applying cutting-edge machine learning models in the consumer credit space often involves a trade-off between performance and interpretability. Models with better predictive performance tend to be less interpretable, while more interpretable models tend to have comparatively worse predictive performance. Unlike other industry sectors, which typically focus on predictive performance, the credit industry needs machine learning models that perform well in both areas.

Over the past decade, banks and fintechs have advanced credit modeling by transitioning from logistic regression to gradient boosting machines (GBMs) for risk models. This shift brought a large increase in predictive accuracy that logistic regression models couldn’t achieve alone, but, due to the nature of the GBM algorithm, it also reduced the inherent transparency into how the underlying models make decisions. That prompted a push toward post-hoc explainability methods like partial dependence plots (PDPs) and Shapley values, which are applied to model outputs after the model is fit. The explainable boosting machine (EBM) algorithm serves as an interesting challenger to the industry-standard use of GBMs¹, as it promises comparable model performance while being inherently more interpretable.

¹ We used XGBoost or XGB as the GBM algorithm of choice.

What are EBMs?

Developed by Microsoft researchers, EBMs (explainable boosting machines) are “tree-based, cyclic gradient boosting generalized additive models with automatic interaction detection” (Nori et al., 2019). The EBM is designed to provide accurate predictions while maintaining explainability (Lou et al., 2013). Let’s break down what the EBM building blocks are:

Generalized additive model (GAM): an extension of the generalized linear model (GLM) that sums together variables, each transformed by some function, and connects that sum to the response variable through a link function (e.g., logit link² for binary response). The function for each variable represents the functional relationship present in the data. Allowing these functional representations lets more complex relationships between the variables and the response be modeled (Hastie and Tibshirani, 1987).
Tree-based: Regression trees are used as the functional representations of the data, which divide or partition the data into subsets (e.g., through splitting of nodes into branches) as the tree grows.
Cyclic: Trees are grown in a cyclical manner, where variables are rotated through the iterations of regression trees to develop the models. This helps to combat effects of collinearity among the included variables.
Gradient boosting: Trees are built sequentially, one variable per tree, with the residuals (i.e., errors) from previous trees in the cycle used as the response for subsequent trees. The residuals play the role of the loss gradient, so each new tree corrects what the earlier trees got wrong, which improves performance and accuracy.
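
Putting the building blocks together, the model form usually written for an EBM (following Nori et al., 2019 and Lou et al., 2013) can be sketched as below. The symbols are notation chosen here for illustration: g is the link function, β₀ an intercept, f_j the shape function learned for feature j, and f_jk an optional pairwise interaction term.

```latex
g\big(\mathbb{E}[y]\big) = \beta_0 + \sum_{j} f_j(x_j) + \sum_{(j,k)} f_{jk}(x_j, x_k)
```

Because each f_j depends on a single feature, it can be plotted on its own, which is what makes the model readable term by term.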

² The logit link is equal to the log-odds. If p is a probability between 0 and 1, then logit(p) = ln[p / (1 - p)], which maps the probability to the real numbers.
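
As a small illustration of the footnote, the logit and its inverse can be written in a few lines of Python. The function names are chosen here for the example and are not taken from any particular library.

```python
import math

def logit(p: float) -> float:
    """Map a probability in (0, 1) to log-odds on the real line."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def inverse_logit(z: float) -> float:
    """Map log-odds back to a probability (the logistic function)."""
    return 1.0 / (1.0 + math.exp(-z))

print(logit(0.5))                 # 0.0, i.e. even odds
print(inverse_logit(logit(0.2)))  # recovers 0.2 up to floating-point error
```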

How do EBMs work?

GBMs and EBMs both use gradient boosting as part of their core training process, but each applies it differently. GBMs use gradient boosting to sequentially build weak learners (such as small decision trees) that utilize either all or a subset of input features. Each tree is designed to correct the mistakes of the trees before it, which means GBMs focus on optimizing predictive accuracy. Conversely, EBMs aim to strike a better balance between interpretability and predictive accuracy. Gradient boosting in EBMs builds additive models where each feature has an individualized shape function that represents its effect on the model’s output.

Fig 1: Visualization of the EBM process (Source: 2OS Internal)

For each round of boosting in EBMs, the training algorithm will loop through all the features and build small trees around one feature at a time. Each tree will only focus on one feature and use the prediction residuals from the previous tree(s) as the target variable. Like gradient boosting, each tree will try to correct the mistakes from earlier trees. The learning rate, or step size when moving towards minimizing the loss function to “fix” errors from previous trees, is typically set to be small (default is 0.001) so that the ordering of features shouldn’t matter. At the end of training, the global explanation of feature j is given by summing up all the trees built around feature j. Local explanations for observation i can be created using the feature values xᵢⱼ for the jth variable and computing these values based on the global explanations.
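
The loop described above can be sketched in plain Python. This is a simplified illustration, not the InterpretML implementation: it assumes a squared-error regression target, uses one-split stumps instead of small trees, skips binning, bagging, and early stopping, and uses a larger learning rate than the text mentions so the toy example converges in few rounds. All function names are hypothetical.

```python
import random

def fit_stump(xs, residuals):
    """Best one-split regression stump on a single feature (squared error)."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    total = sum(residuals)
    best = None  # (score, threshold, left_mean, right_mean)
    left_sum = 0.0
    for rank in range(1, n):
        prev_i, next_i = order[rank - 1], order[rank]
        left_sum += residuals[prev_i]
        if xs[prev_i] == xs[next_i]:
            continue  # no valid split between equal values
        left_mean = left_sum / rank
        right_mean = (total - left_sum) / (n - rank)
        # Maximizing this score is equivalent to minimizing squared error.
        score = rank * left_mean ** 2 + (n - rank) * right_mean ** 2
        if best is None or score > best[0]:
            best = (score, (xs[prev_i] + xs[next_i]) / 2, left_mean, right_mean)
    if best is None:  # constant feature: fall back to a single leaf
        return (float("inf"), total / n, total / n)
    return best[1:]

def stump_predict(stump, x):
    threshold, left, right = stump
    return left if x <= threshold else right

def fit_cyclic_boosting(X, y, rounds=200, learning_rate=0.05):
    n, d = len(X), len(X[0])
    intercept = sum(y) / n
    pred = [intercept] * n
    shapes = [[] for _ in range(d)]  # shapes[j] holds the stumps built on feature j
    for _ in range(rounds):
        for j in range(d):  # cycle through the features, one stump each
            residuals = [y[i] - pred[i] for i in range(n)]
            col = [row[j] for row in X]
            threshold, left, right = fit_stump(col, residuals)
            stump = (threshold, learning_rate * left, learning_rate * right)
            shapes[j].append(stump)
            for i in range(n):
                pred[i] += stump_predict(stump, col[i])
    return intercept, shapes

def term_contribution(shapes, j, x):
    """Global explanation of feature j evaluated at x: the sum of its stumps."""
    return sum(stump_predict(s, x) for s in shapes[j])

def predict(intercept, shapes, row):
    return intercept + sum(term_contribution(shapes, j, v) for j, v in enumerate(row))

# Toy data: a linear effect on feature 0 and a step effect on feature 1.
random.seed(0)
X = [[random.random(), random.random()] for _ in range(200)]
y = [2.0 * a + (1.0 if b > 0.5 else 0.0) for a, b in X]

intercept, shapes = fit_cyclic_boosting(X, y)
mse = sum((predict(intercept, shapes, r) - t) ** 2 for r, t in zip(X, y)) / len(y)
baseline = sum((t - intercept) ** 2 for t in y) / len(y)
print(f"baseline MSE {baseline:.4f} -> boosted MSE {mse:.4f}")
```

A local explanation for one observation is then just the list of `term_contribution(shapes, j, row[j])` values, which add up (with the intercept) to the prediction.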

Now that we know how EBMs work, what goes into building them?

Like other ML algorithms, EBMs utilize tuned hyperparameters to refine and improve model performance. Part of the empirical analysis of EBMs and their performance was identifying which of the tunable hyperparameters are most impactful.

Among the most important hyperparameters are those that affect model complexity:

Max. bins: adjusts model granularity by setting the maximum number of bins used for input features.
Max. leaves: sets a maximum number of leaves for each tree, which affects the complexity of each tree.
Early stopping rounds: affects the number of boosting rounds, based on improvements to model performance, which the algorithm can use as a means to stop model training early.
Max. rounds: sets a maximum number of boosting rounds, which may be overridden by early stopping.

Other hyperparameters that impact model performance, though to a lesser extent than the four above, include:

Outer bags: wrap a bagging³ process around the entire training algorithm, where n different EBMs are trained on different subsamples (drawn with replacement), which are then averaged across the n mini-EBMs. This can improve model predictive performance.
Inner bags: wrap a bagging process when growing individual trees within an EBM, which can improve model predictive performance.
Interactions: set number of interactions to be detected and included in the model.
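
For readers using the open-source interpret package, the hyperparameters listed above map onto keyword arguments of its ExplainableBoostingClassifier. The sketch below only collects them in a dictionary; the values are placeholders chosen for illustration, not the package defaults and not the settings used in this analysis.

```python
# Keyword names follow interpret's ExplainableBoostingClassifier.
# Values are illustrative placeholders, not defaults or tuned settings.
ebm_params = {
    "max_bins": 256,              # granularity of the per-feature binning
    "max_leaves": 3,              # complexity of each small tree
    "early_stopping_rounds": 50,  # patience before training stops early
    "max_rounds": 5000,           # upper bound on boosting rounds
    "outer_bags": 8,              # number of bagged EBMs to average
    "inner_bags": 0,              # bagging inside each boosting step
    "interactions": 10,           # number of pairwise terms to detect
}

# With interpret installed, these could be passed along as:
#   from interpret.glassbox import ExplainableBoostingClassifier
#   ebm = ExplainableBoostingClassifier(**ebm_params)
#   ebm.fit(X_train, y_train)   # X_train, y_train are assumed to exist
```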

³ “Bagging” stands for bootstrap aggregation, which generates samples with replacement from the data and fits models to the drawn samples.
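
The bagging idea in the footnote can be sketched generically: draw bootstrap samples, fit one model per sample, and average the predictions. The helper names and the toy "model" (which simply predicts the mean target of its sample) are assumptions made for this example.

```python
import random
import statistics

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_model(data, fit, n_bags, seed=0):
    """Fit one model per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    models = [fit(bootstrap_sample(data, rng)) for _ in range(n_bags)]

    def predict(x):
        return sum(m(x) for m in models) / n_bags

    return predict

def fit_mean_model(sample):
    """Toy stand-in for a real learner: predicts the sample's mean target."""
    mean_y = statistics.fmean(y for _, y in sample)
    return lambda x: mean_y

# Toy data: (feature, target) pairs with half of the targets equal to 1.
data = [(i, float(i % 2)) for i in range(100)]
predictor = bagged_model(data, fit_mean_model, n_bags=8)
print(predictor(3))  # close to 0.5, the overall mean target
```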

How do EBMs stack up against GBMs?

Predictive performance:

While XGB models consistently achieve higher AUCs than EBMs do (approximately a 2% increase in performance), we determined that EBMs have more stable performance. This is particularly noticeable when reducing model complexity to achieve more parsimonious models. Thus, EBMs may be an attractive option for users who prefer models with a smaller number of features without compromising on model performance.

Model development efficiency:

XGB models are faster to train than EBMs. On average, in our tests, XGB models took about 14.5 seconds to train, whereas EBMs took around 2 minutes and 8 seconds. However, we determined that hyperparameter tuning plays a much less important role for EBMs, since the default hyperparameters typically perform well, which saves the time and effort of extensive tuning. Once the reduced tuning effort and the inherent interpretability are taken into account, EBMs are much more efficient across the stages of model development, as little additional time is needed to tune hyperparameters or to produce global and local explanations.

Feature importance:

EBMs produced more stable feature importance than XGB models did (i.e., they produced a consistent subset of features across different random seeds). The majority of the top 10 important features in the EBMs remain in the top 10 across different random seeds, while only 70% to 80% do so for similarly fit XGB models.
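
One simple way to quantify this kind of stability is the share of top-k features that appear in the top k for every seed. The sketch below is an assumed metric for illustration and may differ from the exact calculation behind the figures above; the feature names and importance values are made up.

```python
def top_k(importances, k):
    """Names of the k features with the largest importance."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return set(ranked[:k])

def top_k_stability(importances_by_seed, k):
    """Share of top-k features that are in the top k for every seed."""
    tops = [top_k(imp, k) for imp in importances_by_seed]
    common = set.intersection(*tops)
    return len(common) / k

# Hypothetical importances from two fits with different random seeds.
run_a = {"utilization": 0.9, "inquiries": 0.5, "tenure": 0.3, "balance": 0.1}
run_b = {"utilization": 0.8, "tenure": 0.6, "inquiries": 0.2, "balance": 0.1}

print(top_k_stability([run_a, run_b], k=2))  # 0.5: only "utilization" is shared
```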

Monotonicity:

EBM and XGB algorithms enforce monotonicity differently. EBMs apply monotonic constraints to individual features after training by editing the shape of the global explanations using isotonic regression⁴, which can result in a drop in model performance (see Fig 2 below). XGB models, on the other hand, enforce monotonicity during model training by ignoring splits that violate monotonicity, which avoids any post-training drop in model performance (see Fig 3 below). Of course, an XGB model trained with monotonic constraints can still perform worse than an unconstrained base case. Our empirical results suggest that the degradation in model performance, in either case, is likely minimal. However, enforcing monotonicity in a post-hoc manner is less ideal than doing so during model training.

⁴ Isotonic regression is also sometimes known as monotonic regression, which fits a non-decreasing (i.e., monotonic) line to a set of data.
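
To make the footnote concrete, here is a minimal sketch of isotonic regression using the pool-adjacent-violators algorithm. It takes a sequence of values (for example, the per-bin scores of one feature's shape function, which is an assumption about how it would be applied) and returns the closest non-decreasing sequence in the least-squares sense. The function name is hypothetical.

```python
def isotonic_fit(values, weights=None):
    """Pool adjacent violators: closest non-decreasing fit (weighted least squares)."""
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block is [mean, total_weight, number_of_points]
    for v, w in zip(values, weights):
        blocks.append([float(v), float(w), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            w_new = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w_new, w_new, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted

print(isotonic_fit([1, 3, 2, 4]))  # [1.0, 2.5, 2.5, 4.0]
```

The dip from 3 to 2 is replaced by a flat segment at their average, which is why editing a shape function this way can move predictions slightly away from what the model originally learned.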

Fig 3: Example of the effect on PDP when enforcing monotonicity while training a GBM (Source: 2OS Internal)

Global explanations:

While the methodologies for producing the output differ, the resulting plots demonstrate similar trends. The practical impact of the differing methodologies lies in how the model outputs are interpreted. EBMs make predictions based directly on what is depicted in the global explanation plots (barring modifications due to monotonicity constraints). PDPs, a post-hoc interpretability method applied to the trained model, only approximate the feature-target relationship learned by the XGB model.
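
To show why a PDP is an approximation, here is a minimal sketch of how one is computed for an arbitrary black-box model: the feature of interest is forced to each grid value in turn and the predictions are averaged over the data. The toy model and data are made up for illustration.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature] = value  # overwrite only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

# Toy black-box model with an interaction between features 0 and 1.
def model(row):
    return 2.0 * row[0] + row[0] * row[1]

rows = [[0.0, 1.0], [1.0, 3.0]]
print(partial_dependence(model, rows, feature=0, grid=[0.0, 1.0]))  # [0.0, 4.0]
```

The curve averages over the other features, so the interaction with feature 1 is folded into a single line. An EBM's global explanation for a main-effect term, by contrast, is the term the model actually adds to its prediction.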

Local explanations:

EBMs offer local explanations by computing the contribution of individual features to a specific prediction, which provides insight into the factors that influence the outcome for a particular observation. In contrast, Shapley values assign influence to each feature by measuring its marginal contribution to the predicted output across all possible feature combinations. Empirically, we saw only some overlap in the variables with the greatest contributions; the differences can be attributed to the underlying model algorithm (XGB vs. EBM) as well as to how predictions are decomposed. However, EBMs offer local explanations without the additional computation that Shapley values require (exact computation time scales exponentially with the number of variables).
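
The cost difference can be illustrated with a brute-force Shapley computation. The sketch below enumerates every feature subset, so it is only practical for a handful of features. It assumes "absent" features are replaced by a single baseline row, which is one of several common conventions, and it is not how production Shapley libraries are implemented.

```python
from itertools import combinations
from math import factorial

def exact_shapley(predict, x, baseline):
    """Exact Shapley values, with absent features set to their baseline value."""
    d = len(x)

    def value(subset):
        row = [x[j] if j in subset else baseline[j] for j in range(d)]
        return predict(row)

    phi = [0.0] * d
    for j in range(d):
        others = [k for k in range(d) if k != j]
        for size in range(d):
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                s = set(subset)
                phi[j] += weight * (value(s | {j}) - value(s))
    return phi

# Toy additive model: each feature contributes through its own term.
def additive_model(row):
    return 1.0 + 2.0 * row[0] + row[1] ** 2 - row[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = exact_shapley(additive_model, x, baseline)
print([round(v, 6) for v in phi])  # approximately [2.0, 4.0, -3.0]
```

For an additive model like this one, the Shapley values coincide with the per-term contributions relative to the baseline, which an EBM can read off directly without enumerating subsets.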

Conclusion

While XGB is an extraordinarily popular ML algorithm for structured data, 2OS has determined, based on empirical results, that XGB may not necessarily be the best choice, depending on what modeling goals are prioritized and for what purpose. If model performance and overall accuracy are prioritized, then XGB maintains its edge as the algorithm to use. However, if the goal is to explain model behavior more precisely, then EBM is the clear winner.

Do you have questions? We’d love to answer them!

Contact the authors by email at:

Interested in 2OS insights?

Check out the 2OS Insights page, where 2OS has shared industry insights through white papers, case studies, and more!

References

Hastie, Trevor, and Robert Tibshirani. “Generalized additive models: some applications.” Journal of the American Statistical Association, 82(398):371–386 (1987). https://doi.org/10.1080/01621459.1987.10478440.
Nori, Harsha, et al. “InterpretML: A unified framework for machine learning interpretability.” arXiv preprint arXiv:1909.09223 (2019). https://arxiv.org/abs/1909.09223.
Lou, Yin, Rich Caruana, Johannes Gehrke, and Giles Hooker. “Accurate intelligible models with pairwise interactions.” In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 623–631 (2013). https://doi.org/10.1145/2487575.2487579.