\[z = \frac{1}{(1 + e^{-y})}\]

where $\textit{y}$ is linear in $\textit{x}$, the input variable.

Note that this is equivalent to:

\[z = \frac{e^{y}}{(1 + e^{y})}\]
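
To see why the two forms agree, multiply the numerator and denominator of the first form by $e^{y}$:

\[z = \frac{1}{1 + e^{-y}} = \frac{e^{y}}{e^{y}(1 + e^{-y})} = \frac{e^{y}}{e^{y} + 1}\]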

So why is logistic regression considered linear, and why is the result used for classification rather than for predicting a continuous output? Coming from more of a computer science background, I did not notice this question at first. This related post made it quite easy for me to understand logistic regression. Here I provide some key points about logistic regression, with some references from a theoretical perspective, to help develop a better understanding:

- The logistic function is used to give the probability of the output belonging to one of two classes. Its output is always strictly between 0 and 1, for any input values in any number of dimensions.
- If you rearrange the logistic function, the natural log of the odds (the ratio of the probability of an event being successful to the probability of it being unsuccessful) equals $\textit{y}$, which is the familiar linear regression equation. Logistic regression is considered linear because the log-odds, and therefore the decision boundary, is a linear function of the inputs.
- The tanh function, which is a rescaled and shifted version of the logistic function ($\tanh(x) = 2\sigma(2x) - 1$, where $\sigma$ is the logistic function), is often a better choice than the logistic function for hidden units since it has a steeper gradient (a maximum slope of 1 versus 0.25). The steeper gradient helps in backprop training: the error signal passed back from the output toward the input shrinks less at each layer, so weights closer to the input nodes receive larger updates and convergence tends to be faster.
- While logistic regression works very well in binary classification for any number of dimensions, the softmax function is a much better choice for multi-class classification. The softmax outputs sum to one over all the classes. Independent logistic outputs, one per class, do not have this property.
- In binary classification, using a softmax function is equivalent to using the sigmoid function: a two-class softmax reduces to the sigmoid of the difference between the two scores.
- One may ask why we use these complicated exponential functions in the output. If we want probabilities, we could simply normalize: divide each output by the sum of the outputs. The problem with this approach is that individual values can be negative even if they add to one. Exponentiating makes everything positive. Exponentiation also amplifies the differences between outputs, which tends to give a stronger error signal in backpropagation and helps algorithms converge faster.
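- Using the logistic function as an activation function inside the network also has an issue: the vanishing gradient problem, which makes deep neural nets very hard to train. The derivative of the logistic function is at most 0.25, so the gradient shrinks each time it passes back through a layer. ReLU, y = max(x, 0), has been a popular choice since its derivative is 1 for positive inputs, so it does not shrink the gradients of active units as they are propagated back to the input. This nice blog entry provides a great explanation.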
- ReLU is also typically used only in hidden layers. The output layer would still be softmax (classification) or linear (regression).
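
Several of the points above can be checked numerically. Here is a minimal sketch in plain Python (standard library only). The helper names (`sigmoid`, `softmax`, `relu`, and the gradient helpers) and the sample values are my own choices for illustration and do not come from any particular library.

```python
import math


def sigmoid(y):
    """Logistic function: maps any real y into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))


def softmax(scores):
    """Softmax of a list of scores; subtracting the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relu(x):
    """ReLU: y = max(x, 0)."""
    return max(x, 0.0)


def sigmoid_grad(y):
    """Derivative of the logistic function: z * (1 - z)."""
    z = sigmoid(y)
    return z * (1.0 - z)


def tanh_grad(y):
    """Derivative of tanh: 1 - tanh(y)^2."""
    return 1.0 - math.tanh(y) ** 2


# Arbitrary sample inputs (an assumption for illustration).
ys = [-5.0, -2.0, -0.5, 0.0, 0.5, 2.0, 5.0]

# 1. The logistic output stays strictly between 0 and 1.
in_unit_interval = all(0.0 < sigmoid(y) < 1.0 for y in ys)

# 2. The log of the odds z / (1 - z) recovers the linear term y.
log_odds_recovers_y = all(
    math.isclose(math.log(sigmoid(y) / (1.0 - sigmoid(y))), y, abs_tol=1e-9)
    for y in ys
)

# 3. tanh is a rescaled, shifted logistic: tanh(y) = 2 * sigmoid(2y) - 1.
tanh_matches = all(
    math.isclose(math.tanh(y), 2.0 * sigmoid(2.0 * y) - 1.0, abs_tol=1e-12)
    for y in ys
)

# 4. A two-class softmax over scores [y, 0] equals sigmoid(y).
binary_softmax_matches = all(
    math.isclose(softmax([y, 0.0])[0], sigmoid(y), rel_tol=1e-9) for y in ys
)

# 5. Dividing by the sum can give negative "probabilities"; softmax cannot.
scores = [2.0, -1.0, 0.5]
naive = [s / sum(scores) for s in scores]
probs = softmax(scores)

# 6. Steepest slopes, both at y = 0: logistic versus tanh.
max_sigmoid_grad = sigmoid_grad(0.0)
max_tanh_grad = tanh_grad(0.0)

print("naive normalization:", naive)
print("softmax:", probs)
print("max slopes (logistic, tanh):", max_sigmoid_grad, max_tanh_grad)
```

The naive normalization here still adds up to one, but its middle entry is negative, which is the problem that exponentiation avoids.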