Logistic Regression Classification
ML Learning Note-2
Logistic Regression Model
Overview
The Logistic Regression Model is a classification technique in machine learning which uses a regression-style hypothesis to solve classification problems. It is a kind of Generalized Linear Model.
Generalized Linear Model
\[h(x)=g(x^T\theta)\]where the function \(g(z)\) is the generalizing function that gives the model its non-linear fitting ability.
When \(g(z) = sigmoid(z) = \frac1{1+e^{-z}}\), the model becomes the Logistic Regression Model.
Section | Technique | Expression |
---|---|---|
Hypothesis Function | Logistic Regression Classification Model (LRCM) | \(h(x)=sigmoid(x^T\theta)\) |
Cost Function | Maximum Likelihood Estimate (MLE) | \(J(\theta)=-\frac1{m}\sum_{j=1}^m[y^{(j)}\log h(x^{(j)})+(1-y^{(j)})\log (1-h(x^{(j)}))]\) |
Optimization Method 1 | Batch Gradient Descent (BGD) | |
Optimization Method 2 | Normal Equation (NE) | |
Notation
Symbol | Type | Representation |
---|---|---|
\(n\) | \(\mathbb N^+\) | the number of features |
\(m\) | \(\mathbb N^+\) | the number of training samples |
\(x^{(i)}\) | \(\mathbb R^n\) | the feature vector of the \(i^{th}\) training sample
\(x^{(i)}_j\) | \(\mathbb R\) | the value of feature \(j\) in the \(i^{th}\) training sample
\(y^{(i)}\) | \(\{0,1\}\) | the target value of the \(i^{th}\) training sample |
\(\hat{x}\) | \(\mathbb R^{n+1}\) | \([1\space x]^T\) |
\(y\) | \(\{0,1\}^m\) | the vector of all target values of training samples |
\(\theta\) | \(\mathbb R^{n+1}\) | the vector of model weights
\(X\) | \(\mathbb R^{m\times(n+1)}\) | the matrix of all training inputs, whose \(i^{th}\) row is \((\hat{x}^{(i)})^T\)
Hypothesis Function
Model Representation
\[h(x)=sigmoid(x^T\theta)=\frac1{1+e^{-x^T\theta}}\] \[h(X)=sigmoid(X\theta)=\frac1{1+e^{-X\theta}}\]

“Logistic function” is a synonym for “sigmoid function”.
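To make the vectorized form concrete, here is a minimal NumPy sketch (the library choice, variable names, and toy data are mine, not part of the note): each raw sample \(x^{(i)}\in\mathbb R^n\) becomes a row of \(X\) after prepending a constant \(1\), and \(h(X)=sigmoid(X\theta)\) is computed in one step.

```python
import numpy as np

def sigmoid(z):
    # logistic function, applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(X, theta):
    # h(X) = sigmoid(X @ theta): one probability per training sample
    return sigmoid(X @ theta)

# toy data: m = 4 samples, n = 2 features
raw = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
X = np.hstack([np.ones((raw.shape[0], 1)), raw])  # X is m x (n+1); first column is the bias term
theta = np.zeros(X.shape[1])                      # theta in R^{n+1}
print(hypothesis(X, theta))                       # all 0.5, because X @ theta = 0
```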
The graph of the sigmoid function:

From the graph we can see the main characteristics of the sigmoid function: its output always lies in \((0,1)\), it is monotonically increasing, \(sigmoid(0)=0.5\), and it saturates towards \(1\) as \(z\to+\infty\) and towards \(0\) as \(z\to-\infty\).
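A small NumPy/Matplotlib sketch (the library choice and the sample points are mine) that reproduces this graph and checks the values above:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 200)
plt.plot(z, sigmoid(z))
plt.axhline(0.5, linestyle="--", color="gray")  # sigmoid(0) = 0.5
plt.xlabel("z")
plt.ylabel("sigmoid(z)")
plt.title("Sigmoid (logistic) function")
plt.show()

# numeric check of the saturation behaviour
print(sigmoid(np.array([-10.0, 0.0, 10.0])))    # approx [4.5e-05, 0.5, 0.99995]
```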
Interpretation of Hypothesis output
\(h(x)\) = the estimated probability that \(y=1\) for input \(x\), i.e. \(h(x)=P(y=1|x,\theta)\). Since \(y\) takes only the values \(0\) and \(1\),
\[P(y=0|x,\theta)+P(y=1|x,\theta)=1\]so we typically predict \(y=1\) whenever \(h(x)\ge0.5\) (equivalently, whenever \(x^T\theta\ge0\)) and \(y=0\) otherwise.

Cost Function
We no longer use the Mean Square Error to measure the loss of the output, because with the sigmoid hypothesis it leads to a non-convex optimization problem, for which Gradient Descent is not guaranteed to find the globally optimal solution. As a consequence, we use Maximum Likelihood Estimation to derive the cost function.
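A quick sketch of that derivation: since \(P(y=1|x,\theta)=h(x)\) and \(P(y=0|x,\theta)=1-h(x)\), both cases can be written compactly as \[P(y|x,\theta)=h(x)^y(1-h(x))^{1-y}\]Assuming the \(m\) training samples are independent, the likelihood and log-likelihood of \(\theta\) are \[L(\theta)=\prod_{j=1}^m h(x^{(j)})^{y^{(j)}}(1-h(x^{(j)}))^{1-y^{(j)}}\] \[\log L(\theta)=\sum_{j=1}^m[y^{(j)}\log h(x^{(j)})+(1-y^{(j)})\log(1-h(x^{(j)}))]\]Maximizing \(\log L(\theta)\) is equivalent to minimizing \(J(\theta)=-\frac1m\log L(\theta)\), which is exactly the cost function listed in the table above, and this \(J(\theta)\) is convex in \(\theta\).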
Optimization
First define the shorthand \(ls(z)=\log sigmoid(z)\), applied element-wise, and note that \(1-sigmoid(z)=sigmoid(-z)\), so that \(\log(1-h(X))=ls(-X\theta)\).
Then we calculate the derivative of the cost function \(J(\theta)\):
\[J(\theta)=-\frac1m\big(y^Tls(X\theta)+(1-y)^Tls(-X\theta)\big)\] \[\frac{\partial}{\partial \theta}J(\theta)=-\frac1m\Big(\frac{\partial\, y^Tls(X\theta)}{\partial \theta}+\frac{\partial\, (1-y)^Tls(-X\theta)}{\partial \theta}\Big)\]Using the chain rule together with \(\frac{d}{dz}ls(z)=1-sigmoid(z)\), and writing \(\odot\) for the element-wise product: \[\frac{\partial\, y^Tls(X\theta)}{\partial \theta}=X^T\big[y\odot(1-sigmoid(X\theta))\big]\] \[\frac{\partial\, (1-y)^Tls(-X\theta)}{\partial \theta}=-X^T\big[(1-y)\odot sigmoid(X\theta)\big]\]Substituting these back and simplifying: \[\frac{\partial}{\partial \theta}J(\theta)=-\frac1mX^T\big[y\odot(1-sigmoid(X\theta))-(1-y)\odot sigmoid(X\theta)\big]\\ =-\frac1mX^T\big[y-sigmoid(X\theta)\big]\\ =\frac1mX^T(sigmoid(X\theta)-y)\\ =\frac1mX^T(h(X)-y)\]Here we see, perhaps surprisingly, that the final simplified gradient is exactly the same as Linear Regression's.
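The derivation above maps directly onto a vectorized implementation; here is a minimal NumPy sketch (the function names and the \(\epsilon\) guard against \(\log 0\) are my additions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = -1/m * [ y^T log(h) + (1-y)^T log(1-h) ]
    m = y.size
    h = sigmoid(X @ theta)
    eps = 1e-12  # guard against log(0)
    return -(y @ np.log(h + eps) + (1 - y) @ np.log(1 - h + eps)) / m

def gradient(theta, X, y):
    # dJ/dtheta = 1/m * X^T (h(X) - y)
    m = y.size
    return X.T @ (sigmoid(X @ theta) - y) / m
```

Comparing `gradient` against a finite-difference approximation of `cost` is a quick way to confirm that the simplification above is correct.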
Optimization Task Representation
Key | Item |
---|---|
Parameters | \(\theta\) |
Function | \(J(\theta)=-\frac1m[y^T\log sigmoid(X\theta)+(1-y)^T\log sigmoid(-X\theta)]\) |
Goal | \(\min_{\theta} J(\theta)\) |
Gradient Descent Algorithms
This part works in exactly the same way as for Linear Regression: the gradient has the identical form \(\frac1mX^T(h(X)-y)\), and only the hypothesis \(h\) has changed from \(X\theta\) to \(sigmoid(X\theta)\), as sketched below.
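A minimal batch gradient descent loop under these assumptions (the learning rate \(\alpha\) and the fixed iteration count are arbitrary illustrative choices, not from the note):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_descent(X, y, alpha=0.1, num_iters=1000):
    # repeat: theta := theta - (alpha / m) * X^T (sigmoid(X @ theta) - y)
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        theta -= alpha * grad
    return theta
```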