Regularization

ML Learning Note-3

Posted by Algebra-FUN on July 14, 2020

Regularization

Overfitting

First of all, let us begin with a schematic diagram:

  • Overfitting: the learned hypothesis may fit the training set very well, but fail to generalize to the test set.
  • Underfitting: the learned hypothesis can’t fit even the training set well.

Overcoming Overfitting

  • Reduce the number of features
    • Manually select which features to keep.
    • Use model selection algorithms (e.g., PCA).
  • Regularization
    • Keep all the features, but reduce the magnitude of the parameters \(\theta_j,\ j\in\mathbb{N^+}\).

Regularization Term

Notation

| Symbol | Type | Representation |
| --- | --- | --- |
| \(m\) | \(\mathbb N^+\) | the number of training samples |
| \(\theta\) | \(\mathbb R^{n+1}\) | the vector of weights of the model |
| \(\lambda\) | \(\mathbb R\) | the penalization rate (regularization parameter) |
| \(J_0\) | \(\mathbb R^{n+1} \to \mathbb R\) | the original cost function, without the regularization term |
| \(J\) | \(\mathbb R^{n+1} \to \mathbb R\) | the generated cost function, with the regularization term |

L1 Norm formula

\[J(\theta)=J_0(\theta)+\frac\lambda{2m}\sum_{j=1}^n|\theta_j|\]

L2 Norm formula

\[J(\theta)=J_0(\theta)+\frac\lambda{2m}\sum_{j=1}^n{\theta_j}^2\]
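
As a concrete illustration (not from the original note), here is a minimal NumPy sketch of both regularized costs. `J0` stands for a hypothetical unregularized cost function, and `theta[0]` is treated as the bias, which the sums above exclude.

```python
import numpy as np

def l1_regularized_cost(J0, theta, lam, m):
    """J(theta) = J0(theta) + lambda/(2m) * sum_{j>=1} |theta_j| (L1 norm penalty)."""
    return J0(theta) + lam / (2 * m) * np.sum(np.abs(theta[1:]))

def l2_regularized_cost(J0, theta, lam, m):
    """J(theta) = J0(theta) + lambda/(2m) * sum_{j>=1} theta_j^2 (L2 norm penalty)."""
    return J0(theta) + lam / (2 * m) * np.sum(theta[1:] ** 2)
```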

Derivative of the New Cost Function

Here we use the L2 norm regularization term.

Normal Formula

\[\frac{\partial}{\partial \theta_j}J(\theta)=\frac{\partial}{\partial \theta_j}J_0(\theta)+\frac{\partial}{\partial \theta_j}\left(\frac\lambda{2m}\sum_{k=1}^n{\theta_k}^2\right)\]

\[\frac{\partial}{\partial \theta_j}J(\theta)=\frac{\partial}{\partial \theta_j}J_0(\theta)+\frac\lambda m\theta_j,\qquad j\in\{1,2,\dots,n\}\]

Matrix Formula

\[\frac{\partial}{\partial \theta}J(\theta)=\frac{\partial}{\partial \theta}J_0(\theta)+\frac{\partial}{\partial \theta}\left(\frac\lambda{2m}\,\theta^T \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} \theta \right)\]

\[\frac{\partial}{\partial \theta}J(\theta)=\frac{\partial}{\partial \theta}J_0(\theta)+\frac\lambda m \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} \theta\]
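
A small NumPy sketch of the diagonal matrix \(\mathrm{diag}(0,1,\dots,1)\) and the matrix-form gradient of the regularization term; the names `reg_matrix` and `reg_gradient` are illustrative, not from any library.

```python
import numpy as np

def reg_matrix(n_plus_1):
    """D = diag(0, 1, ..., 1): excludes theta_0 (the bias) from regularization."""
    D = np.eye(n_plus_1)
    D[0, 0] = 0.0
    return D

def reg_gradient(theta, lam, m):
    """(lambda/m) * D @ theta, the gradient of (lambda/(2m)) * theta^T D theta."""
    D = reg_matrix(theta.shape[0])
    return (lam / m) * (D @ theta)
```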

Regularizing Model

Regularized Linear Regression

\[J(\theta)=\frac1{2m}\left[(X\theta-y)^T(X\theta-y)+\lambda\,\theta^T \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} \theta\right]\]

\[\frac{\partial}{\partial \theta}J(\theta)=\frac1m\left[X^T(X\theta-y)+\lambda \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} \theta\right]\]
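
A hedged NumPy sketch of the vectorized cost and gradient above, assuming `X` already contains a leading column of ones for the bias term; the function names are illustrative.

```python
import numpy as np

def cost(theta, X, y, lam):
    """J(theta) = 1/(2m) * [(X@theta - y)^T (X@theta - y) + lam * theta^T D theta]."""
    m = X.shape[0]
    residual = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)           # theta^T D theta with D = diag(0,1,...,1)
    return (residual @ residual + penalty) / (2 * m)

def gradient(theta, X, y, lam):
    """dJ/dtheta = 1/m * [X^T (X@theta - y) + lam * D @ theta]."""
    m = X.shape[0]
    reg = lam * np.concatenate(([0.0], theta[1:]))   # D @ theta: bias term is not penalized
    return (X.T @ (X @ theta - y) + reg) / m
```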

Gradient Descent

\[\theta:=\theta-\alpha\frac{\partial}{\partial \theta}J(\theta)\]

\[\theta_j:=\theta_j\left(1-\alpha\frac\lambda m\right)-\frac\alpha m\sum_{i=1}^m(h(x^{(i)})-y^{(i)})x_j^{(i)},\qquad \forall j\in \{1,2,\dots,n\}\]

(For \(\theta_0\), the update has no shrink factor, since the bias is not regularized.)

You can see the term \(\theta_j(1-\alpha\frac\lambda m)\) in the update rule. The formula indicates that each iteration will:

  1. shrink \(\theta_j\) first;
  2. then update \(\theta_j\) with the derivative term.

We call the factor \(1-\alpha\frac\lambda m\) the **shrink rate**.
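
A minimal gradient descent sketch (under the same assumptions as the linear regression sketch above) that makes this "shrink first, then update" view explicit:

```python
import numpy as np

def gradient_descent(X, y, lam, alpha, n_iters):
    """Regularized gradient descent for linear regression; X includes a bias column of ones."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    shrink = 1 - alpha * lam / m                 # the "shrink rate"
    for _ in range(n_iters):
        grad0 = X.T @ (X @ theta - y) / m        # unregularized gradient at the current theta
        theta[1:] *= shrink                      # 1. shrink theta_j (j >= 1) first
        theta -= alpha * grad0                   # 2. then update with the derivative term
    return theta
```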

As you know, the parameter \(\lambda\) has to be chosen manually, so **“which \(\lambda\) is proper for the model”** is the problem.

  1. If \(\lambda\) is too big: the learned hypothesis will underfit; in fact, it can hardly learn anything.
  2. If \(\lambda\) is too small: the regularization term is approximately equal to 0 and makes no difference.

According to the experience of ML pioneers, a shrink rate

\[1-\alpha\frac\lambda m \in [0.5,1)\]

works well.

Normal Equation

Equation Solution

Let \(\frac{\partial}{\partial \theta}J(\theta)=0\):

\[\frac1m\left[X^T(X\theta-y)+\lambda \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} \theta\right]=0\]

\[\Rightarrow\theta=\left(X^TX+\lambda \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} \right)^{-1} X^Ty\]
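
A short NumPy sketch of this closed-form solution; `np.linalg.solve` is used instead of an explicit inverse, which is numerically preferable but otherwise equivalent.

```python
import numpy as np

def normal_equation(X, y, lam):
    """theta = (X^T X + lam * D)^{-1} X^T y with D = diag(0, 1, ..., 1)."""
    n_plus_1 = X.shape[1]
    D = np.eye(n_plus_1)
    D[0, 0] = 0.0                                # do not regularize the bias term
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)
```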

Invertibility

If \(\lambda >0\), the matrix \((X^TX+\lambda \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} )\) is invertible.

Proof:

For any \(\eta \in \mathbb R^{n+1}\) with \(\eta=\begin{bmatrix}\eta_0,\eta_1,\dots,\eta_n\end{bmatrix}^T\):

\[\eta^T\left(X^TX+\lambda\begin{bmatrix}0 \\& 1\\& & \ddots\\& & & 1\end{bmatrix}\right)\eta =(X\eta)^T(X\eta)+\lambda\sum_{k=1}^n\eta_k^2\geq0\]

When \((X\eta)^T(X\eta)+\lambda\sum_{k=1}^n\eta_k^2=0\), since both terms are non-negative and \(\lambda>0\):

\[\Rightarrow \begin{cases} (X\eta)^T(X\eta)=0\Rightarrow X\eta=0\\ \eta_k=0,\qquad k\in\{1,2,...,n\} \end{cases}\]

Since \(\eta_k=0\) for \(k\in\{1,2,\dots,n\}\):

\[X\eta=\begin{bmatrix} 1 & {x^{(1)}}^T \\ 1 & {x^{(2)}}^T \\ \vdots & \vdots \\1 & {x^{(m)}}^T\end{bmatrix}\begin{bmatrix}\eta_0\\0\\\vdots\\0\end{bmatrix}=\begin{bmatrix}\eta_0\\\eta_0\\\vdots\\\eta_0\end{bmatrix}=0\\\Rightarrow \eta_0=0\]

From what has been discussed above:

\[\forall \eta \in \mathbb R^{n+1}\setminus\{\vec0\},\quad \eta^T\left(X^TX+\lambda\begin{bmatrix}0 \\& 1\\& & \ddots\\& & & 1\end{bmatrix}\right)\eta>0\]

This indicates that \((X^TX+\lambda \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} )\) is a positive definite matrix, so it is invertible.
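
A quick numerical illustration of this claim (not a replacement for the proof): even when \(X^TX\) alone is singular, adding \(\lambda D\) with \(\lambda>0\) yields a matrix whose Cholesky factorization succeeds, which requires positive definiteness. The data here is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 10                                   # fewer samples than features: X^T X alone is singular
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
D = np.eye(n + 1)
D[0, 0] = 0.0                                  # do not regularize the bias term
lam = 0.1

A = X.T @ X + lam * D
L = np.linalg.cholesky(A)                      # succeeds only for a positive definite matrix
print(np.min(np.linalg.eigvalsh(A)) > 0)       # True
```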

Regularized Logistic Regression

pass