Tukeys Blog: May 2017

Monday, May 15, 2017

Why Logistic Regression is linear even though it uses a non linear function

Logistic Regression uses a non linear activation function - the logistic function:

\[z = \frac{1}{(1 + e^{-y})}\]

where $\textit{y}$ is linear in $\textit{x}$, the input variable.

Note that this is equivalent to:

\[z = \frac{e^{y}}{(1 + e^{y})}\]

So why is logistic regression considered linear and the result used for classification rather than predicting a continuous output? Having more of a computer science background, this was something that did not initially catch my eye. This related post made it quite easy for me to understand logistic regression. Here I provide some key points related to logistic regression and some references from a theoretical perspective to help develop a better understanding:

Logistic function is used to give the probability of the output being in a binary class. Its output is always between 0 and 1 given any value of inputs in any number of dimensions.
If you rearrange the logistic function, the natural log of the odds (the ratios of probabilities of an event being successful and unsuccessful) is the familiar linear regression equation. The reason why logistic regression is considered linear is that we are combining the outputs using a linear function.
The tanh function, which is a mathematical function of the logistic function is a better choice than logistic function since it has steeper gradient. The steeper gradient is better in backprop training. A steeper gradient passes back feedback from output back to input much faster and having a larger impact on weights closer to input nodes making convergence faster.
While logistic regression works very well in binary classification for any number of dimensions, the softmax function is a much better choice for in multi-class classification. The softmax sums up to one in a multi-class situation over all the classes. The logistic function does not have this property.
In a binary classification, using a softmax function is equivalent to using the sigmoid function.
One may ask why use these complicated exponential functions in the output? If we want probability, we can simply use an average - divide each output by the sum of outputs. The problem with this approach is that individual values can become negative even if they add to one. Exponentiating makes everything positive. Also, exponentiation works well for back propagation since it amplifies errors making algorithms converge faster.
The use of logistic function as an activation function inside the network also has an issue - the vanishing gradient problem which makes deep neural nets very hard to train. ReLU, y = max(x, 0), has been a popular choice since it does not alter the gradients as they are propagated back to the input. This nice blog entry provides a great explanation.
ReLU is also only used for hidden layers. Outputs would still be softmax (classification) or linear (regression).

Wednesday, May 10, 2017

Better ways to do One Hot Encoding

While running an ML algorithm on any data, you may have to convert categorical data into numerical data - reason is that a mostly all scikit-learn code requires you to input data which is numeric. Though one may think that it is a scikit limitation, that may not be true. Since ML uses math and vectors behind the scenes, the data has to be numerical for most good algorithms.

One of the common ways to convert categorical data to numeric data is using One Hot Encoding. This kind of encoding uses indicator variables, where each value of the category is replaced by a column of its own. This can lead to column explosion so one must be careful. A lot of times, categories that have an order can be mapped to numerical values that may be helpful as well.

Several methods of One Hot Encoding have been mentioned. Most prominent and simple of them uses the get_dummies function in pandas:

In this function, we loop through all the categorical variables in the pandas dataframe one by one and for each case, we use get_dummies to create indicator variable columns (which are numeric) and then we delete the original categorical column. This is pretty simple and mostly this is what is recommended the most in forums. However, there are a few catches with this method in practice:

The categorical variable may have different sets of values in training, validation and test data. If you run get_dummies, it may assign the same numeric value to different categorical values. When fed into the ML algorithm this can cause unintended data alternation and results.
The order in which the categorical values are encountered by get_dummies may further contribute to the above issue
Some categorical values may only appear in validation and test data and may be absent in training data. This can cause more problems. Training a model on one kind of values and then testing predictions on another may not make sense. The get_dummies does not help identify this problem.

The best course of action is to map the categorical values to a set of indicator variables and then use the same set during test and validation. This should not be changed. In addition, if certain categorical values that are going to be seen in practice in validation or test data, we must take that into account. We will leave this specific problem to another post. In this post, lets see how we can fix get_dummies to at least fix the first two problems and alert us of the third one.

Python provides few other alternatives that are a bit complex to use but I feel are totally necessary. One of these uses the LabelEncoder function. LabelEncoder looks at a categorical variable and creates a transformation which maps the values to integer labels. This does not create any indicator variables so one may think is inadequate for our needs. However, what LabelEncoder does is store the mapping as a model which can be used repeatedly later on. Combining LabelEncoder with get_dummies provides the ideal solution:

So what's up with the dictionary and the train variable in this function? Note our initial objective - we must use the same mapping for both the training and test data. The python dictionary holds important storage area for the mappings created by the LabelEncoder for each column in a dataframe. The call made to this function for training data looks as follows:

train_data, le_dict = oneHotEncode2(train_data)

Then on the test data, the call is made by passing the dictionary returned back from training:

test_data, _ = oneHotEncode2(test_data, le_dict)

The call to an already created transform for encoding also checks to see if it encounters any new values in the test data. If it does, it will warn us and we can go back and take appropriate action.

Monday, May 8, 2017

Hail Seaborn!

The seaborn heatmap perhaps is the best visualization of the correlations in a data set.

Much better than Axes.matshow():

Analyzing Predictions to find bugs

Testing should be one of the most frequent steps in the SDLC for any code, whether it is building a website, powering a smart device or machine learning. Tests of machine learning code may reveal lots of issues - transformations that don't work because the right columns are not operated on or data which fails assumptions of the learning algorithm. The list may be endless and we are well attuned to this process. One of the tests often ignored is comparing the test results during the training period with actual values from the data and then trying to find what went wrong and where. This can often reveal attributes or column values that you may have ignored till now and must play a bigger role.

For example, use the following function to plot test results against the real values. The plot should be a straight line. Any deviations from it are to be analyzed for every column.

Make your test data comply with training data

When using scikit-learn libraries with pandas, you would often get errors if the test data does not have columns which match the model created from the training data set. You may have deleted columns that you did not need from the training data when creating the model or you may have constructed new columns based on existing data variables (for example, creating 'Age' from 'Date' or combining the effects of multiple variables). In most cases, this is simple to achieve by wringing the test data through the same function as the training data. However, when you are using One-Hot-Encoding of categorical data, the columns created as a result of this coding in the training set and the test set may not match for the simple reason that some values for categorical data may only be present in the training data set and others may be present only in the test data set. I have included here a python function that I wrote and works really well to match the test data with the training data: