Categorical Encoding Methods
Authors: Alice Liu, Syed Raza, and Aaron McGuire
A team of 2OS data scientists (Tosan Johnson*, Alice Liu, Syed Raza, and Aaron McGuire) performed an empirical study (see our full paper on arXiv) on modeling preprocessing techniques, including categorical encoding methods. That study is reviewed in a separate blog post; this post describes in greater detail the categorical encoding methods reviewed in the study. We previously explored feature selection and null imputation methods in separate posts.
*Note: Tosan has since moved on to a new company.
What is categorical encoding and why do we care?
Machine learning (ML) algorithms only accept numerical inputs. For an algorithm to learn the underlying patterns and relationships in the input data, every feature in the data set needs to be represented as one or more numeric variables. Preprocessing via categorical encoding achieves this representation and avoids both logistical and interpretation issues when modeling. Because of this constraint, numerous encoding methods have been developed to translate categorical variables into numeric ones.
There are two different kinds of categorical variables:
- Nominal: have categories that share no intrinsic ordering between them (e.g., colors such as red, blue, and yellow)
- Ordinal: share a clear ordering between each category (e.g., size such as small, medium, large)
All the techniques covered in this post are applied to nominal variables; we do not assess methods that deal strictly with ordinal variables.
What are the methods?
The nominal categorical encoding methods discussed are one-hot encoding, Helmert coding, frequency encoding, and binary encoding.
In addition to the comparison performed by experts at 2OS, Potdar et al. [2017] compared a subset of the methods above.
(1) One-Hot Encoding: One-hot encoding (OHE) converts a categorical feature into N binary variables, where N is the number of categories in the feature. Each new binary column corresponds to one category of the original feature, and a “1” represents the presence of that category in a given row.
A similar approach, known as dummy coding, creates N - 1 columns, where the omitted category is the “base case” or reference level, interpreted as the scenario in which all other categories are 0. Dummy coding is typically used for regression models, whereas OHE is more commonly used for ML models.
An example of OHE applied to a three-category color feature is given below:

| color | color_red | color_blue | color_yellow |
| --- | --- | --- | --- |
| red | 1 | 0 | 0 |
| blue | 0 | 1 | 0 |
| yellow | 0 | 0 | 1 |
OHE is a straightforward technique for handling categorical variables, but it has a notable drawback: for variables with high cardinality (i.e., many categories), OHE becomes memory intensive and leads to high dimensionality within the data.
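As a minimal sketch of how both variants might look in pandas (the toy data and column names are our own, not from the study):

```python
import pandas as pd

# Toy data with one nominal feature, mirroring the color example above.
df = pd.DataFrame({"color": ["red", "blue", "yellow", "blue"]})

# One-hot encoding: one binary column per category (N columns).
ohe = pd.get_dummies(df, columns=["color"], dtype=int)

# Dummy coding: drop one level so the remaining N - 1 columns
# share a base case (the dropped category).
dummy = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)

print(ohe)
print(dummy)
```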
(2) Helmert Coding: Helmert coding compares a specific level of a categorical variable to the mean of the subsequent categories of that variable using contrasts.¹ Each contrast is the mean of the target variable at a given level minus the mean of the target variable over all categories that come after that level.
For our implementation, we used a “reversed” form of Helmert coding in which previous categories are used as the comparison point instead of subsequent categories.
See the table below for an example, using one common scaling of the reversed Helmert contrasts for a three-category feature (each contrast column compares a level to the levels before it):

| color | contrast_1 | contrast_2 |
| --- | --- | --- |
| red | -1 | -1 |
| blue | 1 | -1 |
| yellow | 0 | 2 |
Typically, this method is used for ordinal categorical variables, since the resulting values are relative quantitative differences between categories, but we adapted it for use with nominal variables. The difference in application is that each encoded value of a nominal variable is treated as a magnitude rather than a discrete label.
¹A contrast is a linear combination (weighted sum) of statistics (such as means) [Sundström, 2010].
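For illustration, here is a minimal sketch using the category_encoders package, whose HelmertEncoder applies these reversed contrasts; this is an assumed implementation choice on our part, not necessarily the one used in the study:

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"color": ["red", "blue", "yellow", "blue"]})

# HelmertEncoder applies the "reversed" Helmert contrasts described above,
# comparing each level against the mean of the levels that precede it.
encoder = ce.HelmertEncoder(cols=["color"])
encoded = encoder.fit_transform(df)
print(encoded)
```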
(3) Frequency Encoding: Frequency encoding is a technique that replaces the literal value of a category with the probability of that category occurring within the data set.
For example, if a data set with 100 rows had a categorical variable with three unique categories, where category one has 25 occurrences, category two has 60 occurrences, and category three has 15 occurrences, the newly created “frequency” column would use 0.25, 0.60, and 0.15 as the corresponding values. Likewise, a feature with two unique values occurring 72 and 28 times would be encoded as 0.72 and 0.28.
See the table below for an example based on the 100-row data set above:

| category | occurrences | frequency |
| --- | --- | --- |
| one | 25 | 0.25 |
| two | 60 | 0.60 |
| three | 15 | 0.15 |
A big advantage of this technique is that it is simple and cost efficient to implement while preserving the size of the feature space, as no columns are added to the data beyond the single “frequency” column. One major drawback is that categories with duplicate probabilities receive identical values, so the encoding no longer differentiates those segments of the data.
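A minimal sketch of frequency encoding in pandas, reconstructing the 100-row example above (the column names are illustrative):

```python
import pandas as pd

# Rebuild the 100-row example: 25/60/15 occurrences of three categories.
df = pd.DataFrame({"category": ["one"] * 25 + ["two"] * 60 + ["three"] * 15})

# Relative frequency (empirical probability) of each category.
freq = df["category"].value_counts(normalize=True)

# Map each row's category to its frequency; no extra columns are added
# beyond this single "frequency" column.
df["category_freq"] = df["category"].map(freq)

print(df.drop_duplicates())
```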
(4) Binary Encoding: Binary encoding represents each unique category as binary code spread across several columns in the data set. In application, the categories of the variable are first mapped to unique integers or levels (e.g., 1, 2, or 3 if the variable has three categories); these integers identify the unique binary codes that represent the categories. The translation to binary code uses the least number of columns necessary to represent every category in the feature with 1s and 0s.
See the table below for an example:

| color | integer level | binary_1 | binary_2 |
| --- | --- | --- | --- |
| red | 1 | 0 | 1 |
| blue | 2 | 1 | 0 |
| yellow | 3 | 1 | 1 |
Binary encoding is a memory efficient method for dealing with categorical variables that have high cardinality: it can represent many categories with only a few added columns (on the order of log2 N columns for N categories).
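For illustration, a minimal sketch using the BinaryEncoder from the category_encoders package (an assumed implementation choice; the toy data is our own):

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"color": ["red", "blue", "yellow"]})

# BinaryEncoder first maps each category to an integer, then spreads the
# integer's binary digits across the minimum number of 0/1 columns.
encoder = ce.BinaryEncoder(cols=["color"])
encoded = encoder.fit_transform(df)
print(encoded)
```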
What about other categorical encoding methods?
Of course, this list doesn’t cover all available categorical encoding methods. What other categorical encoding methods would you like to see 2OS review?
Do you have questions? We’d love to answer them!
Contact the authors by email at:
Interested in 2OS insights?
Check out the 2OS Insights page, where 2OS has shared industry insights through white papers, case studies, and more!
References
- Kedar Potdar, Taher S Pardawala, and Chinmay D Pai. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications, 175(4):7–9, 2017.
- Stina Sundström. Coding in multiple regression analysis: A review of popular coding techniques. 2010.