Wednesday, May 10, 2017

Better ways to do One Hot Encoding

While running an ML algorithm on any data, you may have to convert categorical data into numerical data - the reason is that almost all scikit-learn code requires numeric input. Though one may think this is a scikit-learn limitation, that is not really true. Since ML uses math and vectors behind the scenes, the data has to be numerical for most good algorithms.

One of the common ways to convert categorical data to numeric data is One Hot Encoding. This kind of encoding uses indicator variables, where each value of the category gets a column of its own. This can lead to a column explosion, so one must be careful. A lot of times, categories that have a natural order can instead be mapped directly to numerical values, which may be just as helpful; a short example follows.
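For instance, an ordered category can be mapped straight to integers that preserve its order (the column name and the ordering below are made up for illustration):

import pandas as pd

df = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})
# An ordered category can be mapped directly to integers instead of indicator columns
df['size'] = df['size'].map({'small': 0, 'medium': 1, 'large': 2})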

Several methods of One Hot Encoding are in common use. The most prominent and simplest of them uses the get_dummies function in pandas:
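A minimal sketch of such a helper follows; the name oneHotEncode and the selection of columns by object dtype are assumptions for illustration:

import pandas as pd

def oneHotEncode(df):
    # Loop through all categorical (object-typed) columns one by one
    for col in df.select_dtypes(include=['object']).columns:
        # get_dummies creates one numeric indicator column per category value
        dummies = pd.get_dummies(df[col], prefix=col)
        # Replace the original categorical column with the indicator columns
        df = pd.concat([df.drop(col, axis=1), dummies], axis=1)
    return df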
In this function, we loop through all the categorical variables in the pandas dataframe one by one and, for each, use get_dummies to create numeric indicator variable columns, then delete the original categorical column. This is pretty simple, and it is what forums recommend most often. However, there are a few catches with this method in practice:

  1. The categorical variable may have different sets of values in the training, validation and test data. If you run get_dummies separately on each, it may end up assigning the same numeric encoding to different categorical values. When fed into the ML algorithm, this can cause unintended data alteration and wrong results.
  2. The order in which the categorical values are encountered by get_dummies can further contribute to the above issue.
  3. Some categorical values may appear only in the validation and test data and be absent from the training data. This causes more problems: training a model on one set of values and then predicting on another may not make sense, and get_dummies does nothing to flag it. The short example after this list makes these problems concrete.
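A tiny demonstration (with made-up data) shows the first and third problems at once - the two frames produce different indicator columns, in a different order, and 'blue' never appears in training at all:

import pandas as pd

train = pd.DataFrame({'color': ['red', 'green', 'red']})
test = pd.DataFrame({'color': ['green', 'blue', 'blue']})

print(pd.get_dummies(train['color']).columns.tolist())  # ['green', 'red']
print(pd.get_dummies(test['color']).columns.tolist())   # ['blue', 'green']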
The best course of action is to map the categorical values to a fixed set of indicator variables during training and then reuse that same set, unchanged, for validation and test. In addition, if certain categorical values are going to appear in practice only in the validation or test data, we must take that into account; we will leave that specific problem to another post. In this post, let's see how we can fix get_dummies to at least solve the first two problems and alert us of the third.

Python provides a few other alternatives that are a bit more complex to use but, I feel, are totally necessary. One of these uses the LabelEncoder class from scikit-learn. LabelEncoder looks at a categorical variable and creates a transformation which maps its values to integer labels. It does not create any indicator variables, so one may think it is inadequate for our needs. However, LabelEncoder stores the mapping as a fitted model which can be reused later on. Combining LabelEncoder with get_dummies provides the ideal solution:
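A sketch of this combined function is below; the name oneHotEncode2 and the calls match those shown later in the post, while the body (object-dtype column selection, the fallback for unseen values) is reconstructed from the description and should be read as an assumption:

import warnings
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def oneHotEncode2(df, le_dict=None):
    # An absent dictionary means we are fitting on training data; a populated
    # one means we reuse the mappings learned during training.
    train = le_dict is None
    if train:
        le_dict = {}
    for col in df.select_dtypes(include=['object']).columns:
        if train:
            le_dict[col] = LabelEncoder().fit(df[col])
        else:
            unseen = set(df[col]) - set(le_dict[col].classes_)
            if unseen:
                # New values in test data: warn so we can take appropriate action
                warnings.warn('Column %s has values unseen in training: %s'
                              % (col, sorted(unseen)))
                # Fall back to a known class so transform() below does not fail
                df[col] = df[col].where(~df[col].isin(unseen),
                                        le_dict[col].classes_[0])
        codes = le_dict[col].transform(df[col])
        # Build one indicator column per *training* class, so train, validation
        # and test frames always get identical columns in the same order
        dummies = pd.get_dummies(
            pd.Categorical(codes, categories=range(len(le_dict[col].classes_))))
        dummies.columns = ['%s_%s' % (col, c) for c in le_dict[col].classes_]
        dummies.index = df.index
        df = pd.concat([df.drop(col, axis=1), dummies], axis=1)
    return df, le_dict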
So what's up with the dictionary and the train variable in this function? Recall our initial objective - we must use the same mapping for both the training and test data. The Python dictionary serves as the storage area for the mapping created by the LabelEncoder for each column in the dataframe. The call made to this function for the training data looks as follows:

train_data, le_dict = oneHotEncode2(train_data)

Then on the test data, the call is made by passing in the dictionary returned from training:

test_data, _ = oneHotEncode2(test_data, le_dict)

When called with an already created mapping, the function also checks whether it encounters any new values in the test data. If it does, it warns us so that we can go back and take appropriate action.
