Why should you care?
It’s difficult to build good models, regardless of whether you’re a lender building a simple marketing regression or a gradient boosting machine (GBM) to underwrite your customers. Modelers must cycle through a number of complex steps during the build process. One of these steps is cleaning your input data, making your inputs a better fit for the final modeling algorithm. This step is often referred to as variable preprocessing, and it encompasses a wide variety of tasks, including but not limited to:
- Reducing the number of variables in your models,
- Encoding complex data structures into ingestible heuristics for your algorithm, and
- Figuring out how to deal with missing values.
A team of 2OS data scientists (Tosan Johnson*, Alice Liu, Raza Syed, and Aaron McGuire) built an experiment to put the choice of preprocessing techniques on an empirical footing, testing them on a variety of data types and benchmarking their performance against common algorithms. We’ve written this blog post to share some of our findings with a broader audience in a more digestible format. We hope readers find this summary useful for their organizations!
*Note: Tosan has since moved on to a new company.
Why focus on preprocessing?
Even though preprocessing is a major component of the modeling process, there are very few objective standards to guide modelers on how to get the most out of their data. Most organizations rely on the experience of senior data scientists or engineers to choose the techniques that best fit the specific nuances and constraints of their data. This problem appears in organizations of every size. In larger organizations, changes to recommended “best-in-class” methods are often slow to spread through the large population of quantitative analysts. In smaller organizations, teams may not have the resources to evaluate such decisions.
Our goal with this experiment was to identify and explain behaviors observed for a variety of preprocessing methodologies across three distinct categories: feature selection, categorical encoding, and null imputation. We explored the empirical behavior of popular preprocessing methods to provide a deeper understanding of the selected methodologies. To start out, we picked which preprocessing techniques to include in our horse race. We selected:
Once we finalized our list, we applied each technique to a series of data sets, both real-world and synthetic, and used an XGBoost model to make benchmark assessments of each technique and evaluate their relative performances. With these benchmarks, we established an internal baseline that could be used in future modeling projects with our partner banks and card issuers.
We used three basic data generating functions across experiments: linear, generalized additive model with global interactions (GAM global), and a jumpy generalized additive model with local interactions (jumpy GAM local). We’re defining global interactions as interactions that affect the entire domain, whereas local interactions only affect a portion (usually limited by a maximum or indicator function). Our goal in using these functions was to diversify the model complexity in our experiment, in addition to the real-world data set.
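To make the three kinds of data generating functions more concrete, here is a minimal Python sketch. The coefficients, the choice of interaction terms, the indicator thresholds, and the logistic link that produces a binary target are all illustrative assumptions, not the functions used in the experiment (those are described in the full paper). The only detail taken from this post is that continuous features are drawn from normal distributions.

```python
import numpy as np

def make_features(n_rows, n_features, seed=0):
    """Draw continuous features from a standard normal distribution."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(n_rows, n_features))

def linear_signal(X):
    # Purely additive, linear effects (hypothetical coefficients).
    return 1.0 * X[:, 0] - 0.5 * X[:, 1] + 0.25 * X[:, 2]

def gam_global_signal(X):
    # Smooth additive terms plus an interaction that is active everywhere.
    return np.sin(X[:, 0]) + X[:, 1] ** 2 + X[:, 0] * X[:, 2]

def jumpy_gam_local_signal(X):
    # A step ("jump") in one feature plus an interaction that an indicator
    # and a maximum restrict to part of the domain.
    jump = 2.0 * (X[:, 0] > 0.5)
    local = (X[:, 1] > 0) * np.maximum(X[:, 2], 0.0) * X[:, 1]
    return jump + np.abs(X[:, 1]) + local

def to_binary_target(signal, seed=0):
    """Turn a real-valued signal into a 0/1 target (assumed logistic link)."""
    rng = np.random.default_rng(seed)
    prob = 1.0 / (1.0 + np.exp(-signal))
    return (rng.random(signal.shape[0]) < prob).astype(int)

X = make_features(n_rows=1000, n_features=3)
targets = {
    name: to_binary_target(fn(X))
    for name, fn in [
        ("linear", linear_signal),
        ("gam_global", gam_global_signal),
        ("jumpy_gam_local", jumpy_gam_local_signal),
    ]
}
```

In this sketch the product `X[:, 0] * X[:, 2]` changes the response everywhere, while the `local` term is exactly zero wherever `X[:, 1] <= 0`, which is the distinction between global and local interactions drawn above.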
As our core data, we leveraged “Lending Club Loan Data,” a publicly available data set. Lending Club (LC) is a company that offers peer-to-peer personal loans. This data set contains anonymized data on loans issued to customers between 2007 and 2015. It includes fields like credit score, debt-to-income ratio (DTI), term, and months since last delinquency. Loan status is the target variable, which is categorized as “Charged-off” (i.e., the account is considered a loss) or “Current.”
What Did We Find?
At the end of the day, our conclusions depend heavily on the data the experiment had to work with. Even so, the team uncovered several widely applicable generalizations that also applied well to other client projects. These are listed below, sorted by preprocessing methodology.
Feature selection
- For very simple data structures, the choice of feature selection method matters little; it won’t dramatically influence your algorithm’s results.
- Once a data set reaches a high level of complexity, permutation-based feature importance methods tend to produce highly variable model performance, and we recommend avoiding them.
- Among all methods considered, our researchers found that the “gain” XGBoost importance was the most consistent and powerful method.
Want to learn more about the feature selection methods used in this experiment? See this post on feature selection methods!
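One source of variability in permutation-based importance is easy to see in code: each score depends on a random shuffle, so repeated runs give different numbers. The sketch below is a simplified illustration under stated assumptions. It uses made-up synthetic data and an ordinary least-squares model as a stand-in rather than XGBoost, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: only the first two of four features carry signal.
n_rows = 500
X = rng.normal(size=(n_rows, 4))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n_rows)

# Stand-in model: ordinary least squares with an intercept.
design = np.column_stack([np.ones(n_rows), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

def mse(features):
    """Mean squared error of the fitted model on the given feature matrix."""
    pred = np.column_stack([np.ones(len(features)), features]) @ coef
    return float(np.mean((y - pred) ** 2))

def permutation_importance(n_repeats=20):
    """Increase in MSE when one column is shuffled, per feature and repeat."""
    baseline = mse(X)
    scores = np.empty((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            scores[j, r] = mse(X_perm) - baseline
    return scores

scores = permutation_importance()
mean_importance = scores.mean(axis=1)  # average importance per feature
std_importance = scores.std(axis=1)    # spread across repeated shuffles
ranking = np.argsort(-mean_importance) # most important feature first
```

Comparing `std_importance` with `mean_importance` shows how much a single shuffle can move a feature's score. For brevity the sketch scores the model on the same rows it was fit on; in practice you would evaluate on held-out data.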
Categorical encoding
- Frequency encoding methods perform poorly for highly structured data sets but can perform well when there are more complex relationships between categorical variables and other features. We recommend exploring your data to ensure these relationships exist before proceeding with frequency encoding.
- Helmert encoding and OHE hold categorical information in a similar way. Either method is a suitable option for categorical encoding, as they are both comparable in performance.
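For readers less familiar with the three encodings named above, here is a small NumPy sketch of each. The example column is hypothetical, and the Helmert matrix follows the convention of R's `contr.helmert`; other libraries may order or scale the contrasts differently, so treat this as one possible formulation rather than the exact encoders used in the experiment.

```python
import numpy as np

# Hypothetical categorical column (the values are made up for illustration).
values = np.array(["car", "home", "car", "medical", "car", "home"])

# Sorted unique levels and an integer code per row.
levels, codes = np.unique(values, return_inverse=True)
k = len(levels)

# Frequency encoding: replace each level with its share of the rows.
counts = np.bincount(codes, minlength=k)
frequency_encoded = counts[codes] / len(values)

# One-hot encoding (OHE): one 0/1 column per level.
one_hot = np.eye(k, dtype=int)[codes]

def helmert_contrasts(k):
    """Helmert contrast matrix (R contr.helmert convention): k rows, k - 1 columns."""
    contrasts = np.zeros((k, k - 1))
    for j in range(k - 1):
        contrasts[: j + 1, j] = -1.0   # earlier levels get -1
        contrasts[j + 1, j] = j + 1.0  # the compared level gets j + 1
    return contrasts

# Helmert encoding: look up each row's level in the contrast matrix.
helmert_encoded = helmert_contrasts(k)[codes]
```

Both OHE and Helmert encoding give every level its own distinct row pattern, so no category information is lost. Frequency encoding maps each level to a single number, and two levels that occur equally often collapse to the same value.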
Null imputation
- The simplest technique tested (the “missing indicator” technique, where you add a binary indicator that flips to 1 when the variable is missing) performs well across all data sets and is a good candidate as a null imputation method.
- “Single point” imputation methods (i.e., mean, median, etc.) all perform similarly. Although they do not perform as well as indicator techniques, they are still acceptable for simple models.
- Tree imputation had the least consistent performance across data sets and is not recommended.
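The missing indicator and single-point techniques above can be sketched in a few lines of NumPy. The example values are made up, and what to put in the original column once the indicator is added (0.0 here) is an assumption of this sketch rather than a detail from the experiment.

```python
import numpy as np

# Hypothetical feature with missing values, e.g. months since last delinquency.
x = np.array([12.0, np.nan, 3.0, 40.0, np.nan, 7.0])

missing = np.isnan(x)

# Missing indicator: 1 where the value was absent, 0 otherwise.
indicator = missing.astype(int)

# Single-point imputation: fill every gap with one statistic of the observed values.
mean_filled = np.where(missing, np.nanmean(x), x)
median_filled = np.where(missing, np.nanmedian(x), x)

# Indicator technique: keep the indicator next to a filled copy of the feature.
# The fill value (0.0) is an arbitrary placeholder chosen for this sketch.
with_indicator = np.column_stack([np.where(missing, 0.0, x), indicator])
```

In a real pipeline the fill statistic should be computed on the training data only and then reused for validation and scoring data, so that information from held-out rows does not leak into the model.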
While the tangible results from this paper are the empirical observations and recommendations for feature selection, categorical encoding, and null imputation methods, the work also lays the foundation for future investigation in similar areas. This study is a proof of concept for the research currently in progress at 2OS. We work to ensure our clients receive expert recommendations and knowledge, backed by detailed empirical research.
We’re looking to broaden these results in the future by:
- Assuming different underlying distributions (we used normal distributions for all continuous variables),
- Increasing the number of features generated for the synthetic data sets,
- Diversifying the type and number of categorical features,
- Exploring the effect on other machine learning algorithms, and
- Looking deeper into impacts on model interpretability and explainability.
A discussion of our full process (including data generation, a detailed explanation of the preprocessing techniques discussed, and the conclusions our research reached) is available in the full paper on arXiv.org. What we’ve included here is a summary of our observations for each preprocessing category.