Monday, May 8, 2017

Make your test data comply with training data

When using scikit-learn with pandas, you will often get errors if the columns of the test data do not match the columns the model was fitted on. You may have deleted columns that you did not need from the training data when creating the model, or you may have constructed new columns from existing variables (for example, creating 'Age' from 'Date', or combining the effects of multiple variables). In most cases, this is simple to fix by running the test data through the same preprocessing function as the training data. However, when you use one-hot encoding for categorical data, the columns the encoding creates in the training set and the test set may still not match, for the simple reason that some category values may be present only in the training data and others only in the test data. I have included here a Python function that I wrote, which works really well to match the test data with the training data:
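A minimal sketch of this kind of alignment, assuming the one-hot encoding is done with `pd.get_dummies`. The function name `align_test_to_train`, its `fill_value` parameter, and the toy data are hypothetical and only illustrate the idea; this is not necessarily the function referred to above.

```python
import pandas as pd


def align_test_to_train(train_df, test_df, fill_value=0):
    """Return a copy of test_df with exactly the columns of train_df, in the same order.

    Hypothetical helper: columns missing from test_df are added and filled
    with fill_value; columns that exist only in test_df are dropped.
    """
    return test_df.reindex(columns=train_df.columns, fill_value=fill_value)


# Toy data (made up for illustration): 'red' appears only in training,
# 'green' appears only in test.
train = pd.DataFrame({"age": [25, 40, 31], "color": ["red", "blue", "red"]})
test = pd.DataFrame({"age": [52, 19], "color": ["green", "blue"]})

# One-hot encode each set separately; the resulting columns differ.
X_train = pd.get_dummies(train, columns=["color"], dtype=int)
X_test = pd.get_dummies(test, columns=["color"], dtype=int)

# Force the test set onto the training set's column layout.
X_test_aligned = align_test_to_train(X_train, X_test)
print(list(X_test_aligned.columns))
```

Here the `color_red` column that the test set lacks is added as all zeros, and `color_green`, which the model never saw, is dropped. Whether silently dropping unseen categories is acceptable depends on the problem, so it can be worth logging which columns were added or removed.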

