Logistic Regression Classification

ML Learning Note-2

Posted by Algebra-FUN on July 12, 2020

Logistic Regression Model

Overview

The Logistic Regression Model is a classification ML technique which uses a regression method to solve classification problems. It is a kind of Generalized Linear Model.

Generalized Linear Model

\[h(x)=g(x^T\theta)\]

where the function \(g(z)\) is the generalizing function, which provides the model with non-linear fitting ability.

When \(g(z) = sigmoid(z) = \frac1{1+e^{-z}}\), the model becomes the Logistic Regression Model.
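As a quick illustration, here is a minimal NumPy sketch of the sigmoid function (the helper name and sample inputs are my own choices, not part of the note):

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

# Quick check of the limiting behaviour discussed later in this note
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ≈ [4.54e-05, 0.5, 0.99995]
```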

| Section | Technique | Expression |
| --- | --- | --- |
| Hypothesis Function | Logistic Regression Classification Model (LRCM) | \(h(x)=sigmoid(x^T\theta)\) |
| Cost Function | Maximum Likelihood Estimate (MLE) | \(J(\theta)=-\frac1{m}\sum_{j=1}^m[y^{(j)}\log h(x^{(j)})+(1-y^{(j)})\log(1-h(x^{(j)}))]\) |
| Optimization Method 1 | Batch Gradient Descent (BGD) | |
| Optimization Method 2 | Normal Equation (NE) | |

Notation

| Symbol | Type | Representation |
| --- | --- | --- |
| \(n\) | \(\mathbb N^+\) | the number of features |
| \(m\) | \(\mathbb N^+\) | the number of training samples |
| \(x^{(i)}\) | \(\mathbb R^n\) | the feature vector of the \(i^{th}\) training sample |
| \(x^{(i)}_j\) | \(\mathbb R\) | the value of feature \(j\) in the \(i^{th}\) training sample |
| \(y^{(i)}\) | \(\{0,1\}\) | the target value of the \(i^{th}\) training sample |
| \(\hat{x}\) | \(\mathbb R^{n+1}\) | \([1\space x]^T\) |
| \(y\) | \(\{0,1\}^m\) | the vector of all target values of the training samples |
| \(\theta\) | \(\mathbb R^{n+1}\) | the vector of model weights |
| \(X\) | \(\mathbb R^{m\times(n+1)}\) | the matrix of all input values of the training samples |
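To make the notation concrete, here is a small sketch (the variable names and toy sizes are hypothetical) that builds the design matrix \(X\in\mathbb R^{m\times(n+1)}\) by prepending a column of ones to the raw features:

```python
import numpy as np

m, n = 5, 3                                  # toy sizes: m samples, n features
features = np.random.rand(m, n)              # raw feature matrix, shape (m, n)

X = np.hstack([np.ones((m, 1)), features])   # prepend the bias column of ones -> (m, n+1)
theta = np.zeros(n + 1)                      # weight vector, shape (n+1,)
y = np.random.randint(0, 2, size=m)          # binary targets in {0, 1}, shape (m,)

print(X.shape, theta.shape, y.shape)         # (5, 4) (4,) (5,)
```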

Hypothesis Function

Model Representation

\[h(x)=sigmoid(x^T\theta)=\frac1{1+e^{-x^T\theta}}\]

\[h(X)=sigmoid(X\theta)=\frac1{1+e^{-X\theta}}\]

In the vectorized form, the sigmoid is applied element-wise to \(X\theta\).
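Assuming a NumPy setting like the sketches above, the vectorized hypothesis \(h(X)=sigmoid(X\theta)\) is a single matrix-vector product:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h(X, theta):
    """Vectorized hypothesis h(X) = sigmoid(X @ theta): one probability per training sample."""
    return sigmoid(X @ theta)

# Example with a toy design matrix and zero weights:
X = np.hstack([np.ones((5, 1)), np.random.rand(5, 3)])
theta = np.zeros(4)
print(h(X, theta))   # all 0.5, since theta = 0 gives X @ theta = 0
```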

“logistic function” is a synonym for “sigmoid function”

The graph of the sigmoid function:

From the graph, we can read off the key characteristics of the sigmoid function:

\[sigmoid(t)\in(0,1)\\ \lim_{t\to-\infty}sigmoid(t)=0\\ \lim_{t\to\infty}sigmoid(t)=1\]

Interpretation of Hypothesis output

\(h(x)\) is the estimated probability that \(y=1\) for input \(x\), i.e. \(h(x)=P(y=1|x,\theta)\).

\[P(y=0|x,\theta)+P(y=1|x,\theta)=1\]
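Since \(h(x)\) is interpreted as \(P(y=1|x,\theta)\), a class label can be obtained by thresholding this probability. The 0.5 threshold and the helper below are illustrative choices, not part of the note:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, theta, threshold=0.5):
    """Predict y = 1 whenever the estimated probability P(y=1|x, theta) reaches the threshold."""
    return (sigmoid(X @ theta) >= threshold).astype(int)
```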

Cost Function

We don’t use the Mean Squared Error to measure the loss this time, because it would lead to a non-convex optimization problem for which Gradient Descent cannot guarantee the globally optimal solution.

As a consequence, we use the Maximum Likelihood Estimate to derive the cost function.

\[Cost(h(x),y)= \begin{cases} -\log(h(x)), & y=1\\ -\log(1-h(x)), & y=0 \end{cases}\] \[Cost(h(x),y)=-y\log(h(x))-(1-y)\log(1-h(x))\] \[J(\theta)=-\frac1m\sum_{j=1}^m[y^{(j)}\log h(x^{(j)})+(1-y^{(j)})\log(1-h(x^{(j)}))]\] \[J(\theta)=-\frac1m\left(y^T\log sigmoid(X\theta)+(1-y)^T\log sigmoid(-X\theta)\right)\]

where the vectorized form uses the identity \(1-sigmoid(z)=sigmoid(-z)\).
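Here is a vectorized sketch of the cost function \(J(\theta)\); the small `eps` guard against \(\log 0\) is my own addition:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, eps=1e-12):
    """Cross-entropy cost J(theta) = -(1/m) * [y^T log(h) + (1-y)^T log(1-h)]."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h + eps) + (1 - y) @ np.log(1 - h + eps)) / m
```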

Optimization

Define

\[ls(z)=-\log sigmoid(z)=\log(1+e^{-z})\\ \Rightarrow \frac{\partial}{\partial z}ls(z)=\frac{-e^{-z}}{1+e^{-z}}\]

Then we calculate the derivative of the cost function \(J(\theta)\):

\[J(\theta)=\frac1m\left(y^Tls(X\theta)+(1-y)^Tls(-X\theta)\right)\\ \frac{\partial}{\partial \theta}J(\theta)=\frac1m\frac{\partial}{\partial \theta}\left(y^Tls(X\theta)+(1-y)^Tls(-X\theta)\right)\] \[=\frac1m\left(\frac{\partial\, y^Tls(X\theta)}{\partial \theta}+\frac{\partial\, (1-y)^Tls(-X\theta)}{\partial \theta}\right)\] \[=\frac1m\left(\frac{\partial X\theta}{\partial \theta}\frac{\partial ls(X\theta)}{\partial X\theta}\frac{\partial y^Tls(X\theta)}{\partial ls(X\theta)}+\frac{\partial (-X\theta)}{\partial \theta}\frac{\partial ls(-X\theta)}{\partial (-X\theta)}\frac{\partial (1-y)^Tls(-X\theta)}{\partial ls(-X\theta)}\right)\] \[=\frac1m\left(X^T\left(\frac{-e^{-X\theta}}{1+e^{-X\theta}}\odot y\right)-X^T\left(\frac{-e^{X\theta}}{1+e^{X\theta}}\odot(1-y)\right)\right)\]

where \(\odot\) denotes element-wise multiplication.

Noting that

\[\frac{-e^{-z}}{1+e^{-z}}=sigmoid(z)-1,\qquad \frac{-e^{z}}{1+e^{z}}=-sigmoid(z)\]

we continue to simplify the expression:

\[\frac{\partial}{\partial \theta}J(\theta)=\frac1mX^T(\frac1{1+e^{-X\theta}}-y)\\ =\frac1mX^T(sigmoid(X\theta)-y)\\ =\frac1mX^T(h(X)-y)\]

Here we see, perhaps surprisingly, that the final simplified expression has the same form as the gradient in Linear Regression.
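The simplified gradient translates directly into one line of NumPy (a sketch following the earlier naming):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y):
    """Gradient of J(theta): (1/m) * X^T (sigmoid(X theta) - y)."""
    m = len(y)
    return X.T @ (sigmoid(X @ theta) - y) / m
```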

Optimization Task Representation

| Key | Item |
| --- | --- |
| Parameters | \(\theta\) |
| Function | \(J(\theta)=-\frac1m[y^T\log sigmoid(X\theta)+(1-y)^T\log sigmoid(-X\theta)]\) |
| Goal | \(\min_{\theta} J(\theta)\) |

\[\min_{\theta}\ -\frac1m\left(y^T\log sigmoid(X\theta)+(1-y)^T\log sigmoid(-X\theta)\right)\]

Gradient Descent Algorithms

This part is essentially the same as in Linear Regression: we repeatedly update \(\theta\) in the direction of the negative gradient until convergence.
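Since the gradient has the same form as in Linear Regression, Batch Gradient Descent repeats the update \(\theta:=\theta-\frac{\alpha}{m}X^T(sigmoid(X\theta)-y)\). A minimal sketch (the learning rate \(\alpha\) and iteration count below are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Repeat theta := theta - alpha/m * X^T (sigmoid(X theta) - y)."""
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        theta -= alpha / m * (X.T @ (sigmoid(X @ theta) - y))
    return theta
```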

Practice with Breast-Cancer Dataset