In a previous post, we walked through our empirical results comparing a variety of preprocessing methods, showing what happened and which methods worked best. Now we want to go behind the scenes and explain how those preprocessing methods work.
This is the first of three follow-up posts, and it takes a deep dive into feature selection methods. We’ll talk through categorical encoding and null imputation in future posts. All content in these posts was created and written by a team of 2OS data scientists (Tosan Johnson*, Alice Liu, Raza Syed, and Aaron McGuire).
*Note: Tosan has since moved on to a new company.
Why do we care about reducing and selecting features?
Reducing the number of irrelevant input features helps machine learning (ML) models perform well, mitigates overfitting (i.e., fitting noise in the training data rather than the underlying signal), and makes models more robust. If you talk to a statistician or data scientist, you’ll hear them talk about a parsimonious model: one that explains the data with no more features than necessary. This idea is known as Occam’s Razor or the Law of Parsimony, whereby the simplest solution is usually the right one.
Feature selection helps modelers adhere to Occam’s Razor. Removing “noisy” features lets the model rely on the important variables when making predictions. Along with improvements in performance, using a subset of features can also help reduce training time and improve model interpretability.
What are the methods?
The methods reviewed include:
- Pearson correlation coefficient reduction,
- Spearman’s rank correlation coefficient reduction,
- Variable selection based on XGBoost importance [Chen and Guestrin, 2016],
- Regularization via LASSO regression [Tibshirani, 1996],
- Variable selection based on permutation-based feature importance [Breiman, 2001], and
- Recursive feature elimination [Guyon et al., 2002].
The above methods can be further categorized into “pre-build” and “during build” approaches.
Pre-Build Approaches
The methods included in this section are generally used to reduce the original pool of variables prior to building any models at all (i.e., “pre-build”). This would typically be done during the exploratory data analysis (EDA) phase of the model development/build process.
(1) Pearson Correlation Reduction: Uses a two-step approach to remove redundant (i.e., correlated) features from the data. The first step identifies multicollinearity, while the second step uses the correlation between each feature and the target variable to decide which feature to drop. To identify potential multicollinearity, a correlation matrix is created that holds the observed Pearson correlation rₘₙ of feature m with feature n for every pair of features. The Pearson correlation coefficient is the covariance of features m and n (the numerator) divided by the product of the standard deviations of features m and n (the denominator).
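Written out for N observations, the coefficient described above takes the standard form below. The sample means, written here as m-bar and n-bar, and the observation index i are notation assumed for this sketch.

```latex
r_{mn} = \frac{\sum_{i=1}^{N} (m_i - \bar{m})(n_i - \bar{n})}
              {\sqrt{\sum_{i=1}^{N} (m_i - \bar{m})^2}\;\sqrt{\sum_{i=1}^{N} (n_i - \bar{n})^2}}
```

The 1/(N - 1) factors in the sample covariance and the sample standard deviations cancel, which is why they do not appear.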
Any feature pair with a correlation above some user-specified threshold is a candidate for removal. Only one feature from the pair needs to be removed, so the feature in the pair that has the lower correlation with the target variable y is dropped. This process is repeated for all feature pairs.
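The two steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used for the results in the earlier post: the function name `correlation_reduce`, the synthetic feature names, the 0.9 threshold, and the choice to compare absolute correlations are all assumptions made for the example.

```python
import numpy as np

def correlation_reduce(X, y, names, threshold=0.9):
    """Drop one feature from every highly correlated pair (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    # Step 1: pairwise Pearson correlations between features (absolute value, by assumption).
    feature_corr = np.abs(np.corrcoef(X, rowvar=False))
    # Step 2: Pearson correlation of each feature with the target.
    target_corr = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    dropped = set()
    for a in range(n_features):
        for b in range(a + 1, n_features):
            if a in dropped or b in dropped:
                continue  # one member of this pair is already gone
            if feature_corr[a, b] > threshold:
                # Keep the feature that is more strongly related to the target.
                weaker = a if target_corr[a] < target_corr[b] else b
                dropped.add(weaker)
    return [name for j, name in enumerate(names) if j not in dropped]

# Synthetic data: "income_noisy" is a noisy copy of "income"; "age" is independent.
rng = np.random.default_rng(0)
income = rng.normal(size=500)
income_noisy = income + 0.3 * rng.normal(size=500)
age = rng.normal(size=500)
y = 2.0 * income + age + 0.1 * rng.normal(size=500)

X = np.column_stack([income, income_noisy, age])
kept = correlation_reduce(X, y, ["income", "income_noisy", "age"], threshold=0.9)
print(kept)
```

Only one of the two income columns survives, while `age` is untouched because it is not strongly correlated with any other feature.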
(2) Spearman’s Rank Correlation Reduction: Correlation reduction can also be performed using Spearman’s rank correlation coefficients, rather than Pearson correlation coefficients. The formula for Spearman’s rank correlation coefficient is the Pearson correlation coefficient formula applied to ranks: the original values are first converted to ranks, so the rank variables R(mᵢ) and R(nᵢ) are used in place of the original variables mᵢ and nᵢ in the formula for rₘₙ. As with Pearson correlation reduction, any feature pair with a correlation above some specified threshold is considered for elimination, and the feature to drop is chosen based on each feature’s relationship with the target variable y.
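A small check of the "Pearson on ranks" description, assuming SciPy is available. The synthetic variables are made up for illustration: n is a monotone but strongly nonlinear function of m, which is the situation where the two coefficients disagree most.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

rng = np.random.default_rng(0)
m = rng.normal(size=200)
n = np.exp(3 * m)  # monotone in m, but far from linear

# Pearson correlation on the raw values.
pearson_r = np.corrcoef(m, n)[0, 1]

# Spearman correlation computed by hand: Pearson correlation of the ranks R(m), R(n).
rank_r = np.corrcoef(rankdata(m), rankdata(n))[0, 1]

# SciPy's built-in Spearman coefficient, for comparison.
scipy_r, _ = spearmanr(m, n)

print(pearson_r, rank_r, scipy_r)
```

Because the relationship is perfectly monotone, the rank-based coefficient is 1 (up to floating-point error) even though the Pearson coefficient is much lower. A rank-based correlation matrix can then be thresholded in the same way as a Pearson one.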
During Build Approaches
The methods included in this section are applied while models are being built. Some of these methods are model-agnostic, which means they can be used with any type of ML algorithm. Others are model-specific, which means they only work with certain types of ML algorithms.
(1) XGBoost Importance: XGBoost outputs an importance score for each feature, from which the top N most important features can be identified [Chen and Guestrin, 2016]. XGBoost can measure feature importance in several ways; we review two of the most popular below:
- Gain: Roughly, the increase in purity (i.e., how homogeneous the target outputs are) of the child nodes when a feature is used to split a node in a tree. XGBoost computes this as the improvement in its training objective from the split. If the observations in a parent node have balanced classes and the observations in the resulting child nodes have very unbalanced classes, then that split, and therefore that feature, contributes a large gain.
- Weight: The number of times a feature is used to split across all generated trees. The more splits that use a feature, the larger the weight of that feature. Depending on the types of features in the data set, this metric can be misleading. If a feature has few unique values (a binary flag, for example) and the trees are deep, then the number of times it can be used to split is limited, which dampens its weight even if the feature is important. Thus, weight has limitations as a feature selection criterion.
(2) Regularization via LASSO Regression: LASSO (Least Absolute Shrinkage and Selection Operator) regression, also known as L1 regularization, is a linear model whose cost function includes a penalty on the size of its coefficients: the sum of the absolute values of the coefficients, weighted by a tuning parameter λ [Tibshirani, 1996]. The higher λ is, the stronger the constraint on the coefficients, which shrinks them toward 0 in absolute value. L1 regularization is used for feature selection because it shrinks the coefficients of less important (or, often, redundant) features to exactly 0, which effectively removes them from the model. In the cost function, the penalty term is added to the squared error between the observed and predicted response, so a coefficient stays nonzero only if it reduces the error by more than it adds to the penalty.
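As a rough illustration, scikit-learn’s `Lasso` minimizes `(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1`, so its `alpha` parameter plays the role of λ. The synthetic data and the two `alpha` values below are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
# Synthetic target: only columns 0 and 1 matter (assumption for the demo).
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=300)

# The penalty acts on coefficient size, so features should share a common scale.
X_scaled = StandardScaler().fit_transform(X)

selected_by_alpha = {}
for alpha in (0.01, 0.5):
    model = Lasso(alpha=alpha).fit(X_scaled, y)
    # Features whose coefficient was not shrunk to exactly 0 are "selected".
    selected_by_alpha[alpha] = np.flatnonzero(model.coef_)
    print(alpha, selected_by_alpha[alpha], np.round(model.coef_, 2))
```

With the larger penalty, the coefficients of the pure-noise columns are driven to exactly 0, and the surviving coefficients are themselves shrunk relative to the values used to generate the data.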
(3) Permutation-Based Feature Importance: Permutation-based feature importance measures importance by removing the predictive power of individual features and scoring the resulting shift in the model’s performance [Breiman, 2001]. First, a model is fitted to the data set and scored. The performance of this model is stored as the benchmark for later comparisons. Next, a single feature is chosen and its values are randomly shuffled, which breaks its relationship with the target. New predictions are generated from the shuffled data, without refitting the model, and the new performance is measured. The difference between the benchmark score and the new score is the feature importance score: the larger the drop in performance, the more impactful the feature. This process is repeated for every feature in the data set in isolation (only a single feature’s values are shuffled per step).
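The procedure is short enough to write by hand. The sketch below uses a linear regression scored with R² on synthetic data; the helper name `permutation_importance_sketch` and the data are assumptions for illustration. scikit-learn also provides `sklearn.inspection.permutation_importance`, which repeats the shuffle several times and averages the results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def permutation_importance_sketch(model, X, y, rng):
    """Score drop after shuffling each feature in isolation (illustrative sketch)."""
    baseline = model.score(X, y)  # benchmark performance (R^2 here)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        X_shuffled = X.copy()
        # Shuffle one column only, breaking its link with the target.
        X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
        # The model is NOT refitted; it is only re-scored on the shuffled data.
        importances[j] = baseline - model.score(X_shuffled, y)
    return importances

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Synthetic target: column 0 matters most, column 1 a little, column 2 not at all.
y = 4.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * rng.normal(size=500)

model = LinearRegression().fit(X, y)
scores = permutation_importance_sketch(model, X, y, rng)
ranking = np.argsort(scores)[::-1]  # most important feature first
print(np.round(scores, 3), ranking)
```

In practice the scores are often computed on held-out data rather than the training set, so that the importance reflects generalization rather than fit.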
(4) Recursive Feature Elimination: Recursive feature elimination (RFE) is a brute-force wrapper method that eliminates the lowest-performing features from a data set in a stepwise manner [Guyon et al., 2002]. Being a wrapper method, RFE fits an ML model to a data set, scores the importance of each feature, and then removes the features with the lowest contribution. This is an iterative process that continues until only the user-specified number of features remains. The size of the steps (i.e., the number of features removed per iteration) is an important parameter that can have a large effect on training time and performance.
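scikit-learn implements this loop as `sklearn.feature_selection.RFE`. The sketch below is one possible setup: the logistic regression estimator, the synthetic data, and the values of `n_features_to_select` and `step` are assumptions picked for the example.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
# Synthetic target: only columns 0 and 3 carry signal (assumption for the demo).
y = (2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.5 * rng.normal(size=400) > 0).astype(int)

# step=2 removes two features per iteration: 8 -> 6 -> 4 -> 2.
selector = RFE(LogisticRegression(), n_features_to_select=2, step=2)
selector.fit(X, y)

selected = np.flatnonzero(selector.support_)  # indices of the surviving features
print(selected, selector.ranking_)
```

A smaller `step` refits the model more often, which costs more training time but re-evaluates importance after each removal; a larger `step` is faster but coarser.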
What about other feature selection methods?
Of course, this list doesn’t cover all available feature selection methods. However, these are among the most used methods for typical model builds. What other feature selection methods would you like to see 2OS review?
Interested in other insights?
Check out the 2OS Insights page, where 2OS has shared industry insights through white papers, case studies, and more!
- Breiman, Leo. “Random Forests.” Machine Learning 45 (2001): 5–32. https://doi.org/10.1023/A:1010933404324.
- Chen, Tianqi, and Carlos Guestrin. “XGBoost: A Scalable Tree Boosting System.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016): 785–794. https://doi.org/10.1145/2939672.2939785.
- Guyon, Isabelle, et al. “Gene Selection for Cancer Classification Using Support Vector Machines.” Machine Learning 46 (2002): 389–422. https://doi.org/10.1023/A:1012487302797.
- Tibshirani, Robert. “Regression Shrinkage and Selection Via the LASSO.” Journal of the Royal Statistical Society Series B: Statistical Methodology 58.1 (1996): 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x.