We previously shared empirical results comparing a variety of preprocessing techniques, where 2OS looked at how certain preprocessing methods work and which ones seemed (empirically) to work best. We’re continuing the behind-the-scenes look into the more technical details of how those preprocessing methods work.
This is part two of three follow-up posts and focuses on a deep dive into null imputation methods. We previously talked through feature selection in part one and will discuss categorical encoding in the last part of this series in the near future. All content in these posts was created and written by a team of 2OS data scientists (Tosan Johnson*, Alice Liu, Raza Syed, and Aaron McGuire).
*Note: Tosan has since moved to a new company.
What is null imputation and why do we care?
Null or missing values occur when information is absent from a data set. Commonly represented as “NA” or “None”, null values are a prevailing obstacle in machine learning for a few reasons. First, many machine learning algorithms cannot inherently handle missing values and may fail to fit a model if null values are present. Second, the presence of missing values may hurt model performance if those observations contain valuable information that is not present in the rest of the data. Ignoring or deleting observations that have missing values may not be a luxury the analysis can afford (e.g., when data is limited), and doing so can bias the data set by eliminating underlying behavior.
What are the methods?
We will discuss several techniques in this post:
- Mean imputation,
- Median imputation,
- Missing indicator imputation,
- Decile imputation,
- Clustering imputation [Lloyd, 1982], and
- Decision tree imputation [Breiman, 1984].
All the above-mentioned methods are deterministic imputation methods, which means a single imputation model (e.g., a simple summary statistic or a decision tree) is fit and values are imputed as predictions from the model. We don’t include any stochastic (i.e., including noise) imputation methods, which draw each imputed value from an output distribution of the imputation model (e.g., a regression model used to predict imputed values).
To demonstrate these methods, we’ll use a toy data set with two numeric features (missing values are bolded):
(1) Mean Imputation: This is a simple imputation method that replaces missing values for a specific feature with the mean of all non-missing values in the same feature. The toy data set above shows missing values, which will be imputed via mean imputation, resulting in the data below (imputed values are bolded):
While mean imputation is simple and easy to implement, the method does have some drawbacks. If the data is not normally distributed, using the mean to impute missing values may cause a change in the underlying distribution of the data. Additionally, if the percent of data missing is large enough, then mean imputation may result in an underestimation of the feature’s variance.
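The following is a minimal sketch of mean imputation in plain Python, assuming missing values are represented as `None`. The function name `mean_impute` and the feature column are hypothetical (this is not the post's toy data set); the example also shows the variance shrinkage described above.

```python
from statistics import fmean, variance


def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a feature with no observed values")
    fill = fmean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical feature column; None marks a missing value.
feature = [2.0, None, 4.0, 9.0, None]
imputed = mean_impute(feature)
print(imputed)  # [2.0, 5.0, 4.0, 9.0, 5.0]

# Every imputed value sits exactly at the mean, so the sample variance shrinks.
observed = [v for v in feature if v is not None]
print(variance(observed))  # 13.0
print(variance(imputed))   # 6.5
```

In practice a library routine would typically be used instead, for example scikit-learn's `SimpleImputer(strategy="mean")`.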
(2) Median Imputation: Similar to mean imputation, this method replaces all missing values in a feature with the median value of all non-missing values in the same feature. Again, using the toy data set, missing values are imputed via median imputation, resulting in the data below (imputed values are bolded):
Note that the median imputed values differ slightly from the mean imputed values, indicating some skew in the data’s distribution. Median imputation is more robust to non-normally distributed data, addressing one of the issues that mean imputation may face. Median imputation is a simple and fast approach for dealing with null values. However, like mean imputation, if the percent of missing data is high enough, there may be a reduction in the feature’s variance.
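A comparable sketch for median imputation, again in plain Python with `None` marking missing values. The function name and the right-skewed column are hypothetical and chosen only to show how far the median and mean can diverge under skew.

```python
from statistics import fmean, median


def median_impute(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a feature with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical right-skewed feature column with one large value.
feature = [1.0, 2.0, None, 3.0, 50.0]
print(median_impute(feature))  # [1.0, 2.0, 2.5, 3.0, 50.0]

# The mean of the observed values is pulled toward the large value.
print(fmean(v for v in feature if v is not None))  # 14.0
```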
(3) Missing Indicator Imputation: This is a simple technique where a binary feature is created to indicate whether the corresponding feature has a missing value present. The toy data set shows missing values that are imputed via missing indicator imputation, resulting in the data below (note: NAs are replaced by -9999 in addition to the creation of the missing indicator column):
The advantage of the indicator column is the ability to highlight differentiating behavior, which is represented by the presence of the missing value (but this is only true if the missing value is not simply a random occurrence). If too many features have missing values, the dimensionality of the dataset may be dramatically increased due to the addition of missing indicator columns for each affected feature.
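As a rough illustration, the sketch below builds the indicator column and fills missing entries with the -9999 sentinel mentioned above. The function name `add_missing_indicator` and the sample column are hypothetical.

```python
def add_missing_indicator(values, sentinel=-9999):
    """Return (filled_values, indicator).

    indicator[i] is 1 where values[i] was missing (None) and 0 otherwise;
    missing entries in filled_values are replaced by the sentinel.
    """
    indicator = [1 if v is None else 0 for v in values]
    filled = [sentinel if v is None else v for v in values]
    return filled, indicator


# Hypothetical feature column; None marks a missing value.
filled, indicator = add_missing_indicator([3.5, None, 1.2])
print(filled)     # [3.5, -9999, 1.2]
print(indicator)  # [0, 1, 0]
```

Each affected feature gains one extra column this way, which is where the growth in dimensionality comes from.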
(4) Decile Imputation: This is an imputation method that takes advantage of the relationship between the target variable and any feature with missing values. For classification, the goal is to group the observations in the feature into percentile groups (e.g., deciles) and create an additional group for missing values. For each group of observations based on the percentiles (including the missing group), the probability of the target class occurring is calculated. The group with a target probability closest to the target probability of the missing group is chosen, and the median feature value for that group is used as the imputation value for that feature. A similar approach can be taken with numerical target variables by adjusting the statistic used (e.g., mean, median, or mode instead of target class proportion).
The biggest advantage of this approach is that it utilizes the relationship with the target variable to associate missing values with a corresponding segment in the feature. Note that this method assumes missing values have a strong relationship with other segments in the data. If that assumption is not satisfied (e.g., missing values are due to data entry errors), the feature may become biased towards the segment chosen.
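The steps above can be sketched for a binary target as follows. This is an illustration under stated assumptions rather than the authors' implementation: groups are formed by rank into near-equal sizes (so tied feature values may land in different groups), ties in closeness go to the lowest group, and the function name, parameters, and data are all hypothetical.

```python
from statistics import fmean, median


def decile_impute(values, target, n_groups=10):
    """Impute None entries in `values` using the percentile group whose
    target rate is closest to the target rate of the missing group."""
    observed = sorted((v, t) for v, t in zip(values, target) if v is not None)
    missing_targets = [t for v, t in zip(values, target) if v is None]
    if not observed or not missing_targets:
        return list(values)

    # Rank-based groups of near-equal size (assumption: ties are not kept together).
    n_groups = min(n_groups, len(observed))
    n = len(observed)
    groups = [observed[i * n // n_groups:(i + 1) * n // n_groups]
              for i in range(n_groups)]

    # Target rate of the missing group versus each percentile group.
    missing_rate = fmean(missing_targets)
    best = min(groups,
               key=lambda g: abs(fmean([t for _, t in g]) - missing_rate))

    # The median feature value of the closest group is the imputation value.
    fill = median([v for v, _ in best])
    return [fill if v is None else v for v in values]


# Hypothetical data: 16 observed values plus 3 missing; quartiles for brevity.
values = [float(i) for i in range(1, 17)] + [None, None, None]
target = [0, 0, 0, 0,  0, 1, 0, 0,  1, 0, 1, 0,  1, 1, 1, 0,  1, 1, 0]
imputed = decile_impute(values, target, n_groups=4)
print(imputed[-3:])  # [14.5, 14.5, 14.5]
```

Here the missing group has a target rate of about 0.67, which is closest to the top quartile's rate of 0.75, so the median of that quartile (14.5) is used.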
(5) Clustering Imputation: Clustering imputation assigns every observation in the data set to a cluster, with clusters identified using k-means clustering [Lloyd, 1982]. For each feature with missing values, the mean of the non-missing values within each cluster is calculated, and that cluster mean is used as the imputation value for any missing values assigned to the same cluster in that feature. The toy data set from above is modified to include data clusters:
These are imputed via clustering imputation, resulting in the imputed data below (imputed values are bolded):
Clustering imputation advantageously utilizes information outside of the feature the missing value appears in. It assigns a missing value to the appropriate segment by using information from the rest of the data set, and then uses internal information about the feature to assign a specific value. Since this is a technique that uses a model to extract information from the data, it is high on the spectrum of computational cost relative to other methods mentioned in this post.
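A minimal sketch of the idea, with two assumptions that the post does not specify: clusters are fit only on the other features, which are taken to be fully observed, and k-means is initialized deterministically from evenly spaced rows. The function names and data are hypothetical, and the small `kmeans` helper stands in for a library implementation.

```python
from statistics import fmean


def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm on a list of tuples; returns one label per point.
    Centers start at k evenly spaced points (an assumption, for determinism)."""
    centers = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centers[c])))
                  for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centers.append(tuple(fmean(dim) for dim in zip(*members))
                               if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def cluster_impute(rows, col, k=2):
    """Fill None in column `col` with the mean of that column within the
    row's cluster; clusters are fit on the remaining columns."""
    others = [tuple(v for j, v in enumerate(r) if j != col) for r in rows]
    labels = kmeans(others, k)
    filled = [list(r) for r in rows]
    for c in set(labels):
        observed = [r[col] for r, lab in zip(rows, labels)
                    if lab == c and r[col] is not None]
        if not observed:
            continue  # nothing to average in this cluster; leave as missing
        fill = fmean(observed)
        for r, lab in zip(filled, labels):
            if lab == c and r[col] is None:
                r[col] = fill
    return filled


# Hypothetical two-feature data with two well-separated clusters.
rows = [(1.0, 10.0), (1.2, None), (0.8, 12.0),
        (5.0, 50.0), (5.2, None), (4.8, 54.0)]
filled = cluster_impute(rows, col=1, k=2)
print(filled[1], filled[4])  # [1.2, 11.0] [5.2, 52.0]
```

Each missing value takes the mean of its own cluster (11.0 and 52.0) rather than the overall column mean of 31.5, which is the benefit of using information from outside the feature.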
(6) Decision Tree Imputation: This is another technique that uses a model to extract more information out of the data to determine an imputation value. Like decile imputation, the goal is to create unique segments or groups for an individual feature and compare the average target value of each group to the average target value of the group of missing values.
This technique uses a decision tree with a single feature to create the groupings. The decision tree is a classification and regression tree, or CART [Breiman, 1984]. A t-test is performed to determine the similarity between each leaf node’s average target value and the missing group’s average target value. The median feature value of the most similar leaf node, that is, the leaf with the largest p-value based on the t-score (a small p-value indicates a significant difference), is used as the imputation value for that feature, and the process is repeated across all features.
Note that using a decision tree adds the overhead of training a model for every feature.
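The sketch below illustrates the procedure in plain Python under several assumptions. The small single-feature tree splits on variance reduction (proportional to Gini impurity for a 0/1 target) and stands in for a full CART implementation. "Most similar" is read as the leaf with the smallest absolute Welch t-statistic against the missing group; no p-value is computed, because the standard library has no t distribution. All names, parameters, and data are hypothetical.

```python
from math import sqrt
from statistics import fmean, median, pvariance, variance


def grow_leaves(pairs, max_depth=2, min_leaf=2):
    """Split sorted (feature, target) pairs with a single-feature tree and
    return the leaves as lists of pairs."""
    if max_depth == 0 or len(pairs) < 2 * min_leaf:
        return [pairs]

    def cost(i):
        # Size-weighted target variance after splitting just before index i.
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        return len(left) * pvariance(left) + len(right) * pvariance(right)

    # Only split between distinct feature values, keeping min_leaf per side.
    candidates = [i for i in range(min_leaf, len(pairs) - min_leaf + 1)
                  if pairs[i - 1][0] < pairs[i][0]]
    if not candidates:
        return [pairs]
    best = min(candidates, key=cost)
    if cost(best) >= len(pairs) * pvariance([t for _, t in pairs]):
        return [pairs]  # no split improves on the parent node
    return (grow_leaves(pairs[:best], max_depth - 1, min_leaf)
            + grow_leaves(pairs[best:], max_depth - 1, min_leaf))


def welch_t(a, b):
    """Welch's t-statistic; each sample needs at least two points."""
    diff = fmean(a) - fmean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0 if diff == 0 else float("inf")
    return diff / se


def tree_impute(values, target, max_depth=2, min_leaf=2):
    """Fill None with the median feature value of the leaf whose targets
    are least distinguishable from the missing group's targets."""
    pairs = sorted((v, t) for v, t in zip(values, target) if v is not None)
    missing_t = [t for v, t in zip(values, target) if v is None]
    if len(missing_t) < 2 or len(pairs) < 2:
        return list(values)  # too few rows for a t-statistic
    leaves = grow_leaves(pairs, max_depth, min_leaf)
    best = min(leaves,
               key=lambda leaf: abs(welch_t([t for _, t in leaf], missing_t)))
    fill = median([v for v, _ in best])
    return [fill if v is None else v for v in values]


# Hypothetical data: 12 observed rows plus 4 rows with a missing feature.
values = [float(i) for i in range(1, 13)] + [None] * 4
target = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1] + [1, 1, 0, 1]
print(tree_impute(values, target)[-4:])  # [7.0, 7.0, 7.0, 7.0]
```

With this data the tree yields three leaves (feature values 1 to 5, 6 to 8, and 9 to 12). The middle leaf has the target mean closest to the missing group's, so its median of 7.0 is used.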
What about other null imputation methods?
Of course, this list doesn’t cover all available null imputation methods. As mentioned, we only cover deterministic imputation methods; stochastic imputation methods are quite common as well. What other null imputation methods would you like to see 2OS review?
Interested in other insights?
Check out the 2OS Insights page, where 2OS has shared industry insights through white papers, case studies, and more!