Regularization

ML Learning Note-3

Posted by Algebra-FUN on July 14, 2020

Regularization

Overfitting

First of all, let us begin with a schematic diagram:

  • Overfitting: the learned hypothesis may fit the training set very well, but fail to generalize to the test set.
  • Underfitting: the learned hypothesis can’t fit even the training set well.

Overcoming Overfitting

  • Reduce the number of features
    • Manually select which features to keep.
    • Use model selection algorithms (e.g., PCA).
  • Regularization
    • Keep all the features, but reduce the magnitude of the parameters \(\theta_j,\ j\in\mathbb{N^+}\).

Regularization Term

Notation

| Symbol | Type | Representation |
| --- | --- | --- |
| \(m\) | \(\mathbb N^+\) | the number of training samples |
| \(\theta\) | \(\mathbb R^{n+1}\) | the vector of weights of the model |
| \(\lambda\) | \(\mathbb R\) | the penalization rate (regularization parameter) |
| \(J_0\) | \(\mathbb R^{n+1} \to \mathbb R\) | the original cost function, without the regularization term |
| \(J\) | \(\mathbb R^{n+1} \to \mathbb R\) | the generated cost function, with the regularization term |

L1 Norm formula

\[J(\theta)=J_0(\theta)+\frac\lambda{2m}\sum_{j=1}^n|\theta_j|\]

L2 Norm formula

\[J(\theta)=J_0(\theta)+\frac\lambda{2m}\sum_{j=1}^n{\theta_j}^2\]
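
As a concrete illustration (not from the original note), here is a minimal NumPy sketch of both regularized costs. `J0` stands for a hypothetical unregularized cost function, and `theta[0]` is treated as the bias, which the sums above exclude.

```python
import numpy as np

def l1_regularized_cost(J0, theta, lam, m):
    """J(theta) = J0(theta) + lambda/(2m) * sum_{j>=1} |theta_j| (L1 norm penalty)."""
    return J0(theta) + lam / (2 * m) * np.sum(np.abs(theta[1:]))

def l2_regularized_cost(J0, theta, lam, m):
    """J(theta) = J0(theta) + lambda/(2m) * sum_{j>=1} theta_j^2 (L2 norm penalty)."""
    return J0(theta) + lam / (2 * m) * np.sum(theta[1:] ** 2)
```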

Derivative of the New Cost Function

Here we use the L2 norm regularization term.

Normal Formula

\[\frac{\partial}{\partial \theta_j}J(\theta)=\frac{\partial}{\partial \theta_j}J_0(\theta)+\frac{\partial}{\partial \theta_j}\left(\frac\lambda{2m}\sum_{k=1}^n{\theta_k}^2\right)\]

\[\frac{\partial}{\partial \theta_j}J(\theta)=\frac{\partial}{\partial \theta_j}J_0(\theta)+\frac\lambda m\theta_j,\qquad j\in\{1,2,\dots,n\}\]

Matrix Formula

\[\frac{\partial}{\partial \theta}J(\theta)=\frac{\partial}{\partial \theta}J_0(\theta)+\frac{\partial}{\partial \theta}\left(\frac\lambda{2m}\,\theta^T \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} \theta \right)\]

\[\frac{\partial}{\partial \theta}J(\theta)=\frac{\partial}{\partial \theta}J_0(\theta)+\frac\lambda m \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} \theta\]
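
A small NumPy sketch of the diagonal matrix \(\mathrm{diag}(0,1,\dots,1)\) and the matrix-form gradient of the regularization term; the names `reg_matrix` and `reg_gradient` are illustrative, not from any library.

```python
import numpy as np

def reg_matrix(n_plus_1):
    """D = diag(0, 1, ..., 1): excludes theta_0 (the bias) from regularization."""
    D = np.eye(n_plus_1)
    D[0, 0] = 0.0
    return D

def reg_gradient(theta, lam, m):
    """(lambda/m) * D @ theta, the gradient of (lambda/(2m)) * theta^T D theta."""
    D = reg_matrix(theta.shape[0])
    return (lam / m) * (D @ theta)
```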

Regularizing Model

Regularized Linear Regression

\[J(\theta)=\frac1{2m}\left[(X\theta-y)^T(X\theta-y)+\lambda\,\theta^T \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} \theta\right]\]

\[\frac{\partial}{\partial \theta}J(\theta)=\frac1m\left[X^T(X\theta-y)+\lambda \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} \theta\right]\]
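
A hedged NumPy sketch of the vectorized cost and gradient above, assuming `X` already contains a leading column of ones for the bias term; the function names are illustrative.

```python
import numpy as np

def cost(theta, X, y, lam):
    """J(theta) = 1/(2m) * [(X@theta - y)^T (X@theta - y) + lam * theta^T D theta]."""
    m = X.shape[0]
    residual = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)           # theta^T D theta with D = diag(0,1,...,1)
    return (residual @ residual + penalty) / (2 * m)

def gradient(theta, X, y, lam):
    """dJ/dtheta = 1/m * [X^T (X@theta - y) + lam * D @ theta]."""
    m = X.shape[0]
    reg = lam * np.concatenate(([0.0], theta[1:]))   # D @ theta: bias term is not penalized
    return (X.T @ (X @ theta - y) + reg) / m
```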

Gradient Descent

\[\theta:=\theta-\alpha\frac{\partial}{\partial \theta}J(\theta)\]

\[\theta_j:=\theta_j\left(1-\alpha\frac\lambda m\right)-\frac\alpha m\sum_{i=1}^m(h(x^{(i)})-y^{(i)})x_j^{(i)},\qquad \forall j\in \{1,2,\dots,n\}\]

(For \(\theta_0\), the update has no shrink factor, since the bias is not regularized.)

You can see the term \(\theta_j(1-\alpha\frac\lambda m)\) in the update rule. The formula indicates that each iteration will:

  1. shrink \(\theta_j\) first;
  2. then update \(\theta_j\) with the derivative term.

We call the factor \(1-\alpha\frac\lambda m\) the **shrink rate**.
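
A minimal gradient descent sketch (under the same assumptions as the linear regression sketch above) that makes this "shrink first, then update" view explicit:

```python
import numpy as np

def gradient_descent(X, y, lam, alpha, n_iters):
    """Regularized gradient descent for linear regression; X includes a bias column of ones."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    shrink = 1 - alpha * lam / m                 # the "shrink rate"
    for _ in range(n_iters):
        grad0 = X.T @ (X @ theta - y) / m        # unregularized gradient at the current theta
        theta[1:] *= shrink                      # 1. shrink theta_j (j >= 1) first
        theta -= alpha * grad0                   # 2. then update with the derivative term
    return theta
```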

As you know, the parameter \(\lambda\) has to be chosen manually, so **“which \(\lambda\) is proper for the model”** is the problem.

  1. If \(\lambda\) is too big: the learned hypothesis will underfit; in fact, it can hardly learn anything.
  2. If \(\lambda\) is too small: the regularization term is approximately equal to 0 and makes no difference.

According to the experience of ML pioneers, a shrink rate

\[1-\alpha\frac\lambda m \in [0.5,1)\]

works well.

Normal Equation

Equation Solution

Let \(\frac{\partial}{\partial \theta}J(\theta)=0\):

\[\frac1m\left[X^T(X\theta-y)+\lambda \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} \theta\right]=0\]

\[\Rightarrow\theta=\left(X^TX+\lambda \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} \right)^{-1} X^Ty\]
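
A short NumPy sketch of this closed-form solution; `np.linalg.solve` is used instead of an explicit inverse, which is numerically preferable but otherwise equivalent.

```python
import numpy as np

def normal_equation(X, y, lam):
    """theta = (X^T X + lam * D)^{-1} X^T y with D = diag(0, 1, ..., 1)."""
    n_plus_1 = X.shape[1]
    D = np.eye(n_plus_1)
    D[0, 0] = 0.0                                # do not regularize the bias term
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)
```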

Invertibility

If \(\lambda >0\), the matrix \((X^TX+\lambda \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} )\) is invertible.

Proof:

For any \(\eta \in \mathbb R^{n+1}\) with \(\eta=\begin{bmatrix}\eta_0,\eta_1,\dots,\eta_n\end{bmatrix}^T\):

\[\eta^T\left(X^TX+\lambda\begin{bmatrix}0 \\& 1\\& & \ddots\\& & & 1\end{bmatrix}\right)\eta =(X\eta)^T(X\eta)+\lambda\sum_{k=1}^n\eta_k^2\geq0\]

When \((X\eta)^T(X\eta)+\lambda\sum_{k=1}^n\eta_k^2=0\), since both terms are non-negative and \(\lambda>0\):

\[\Rightarrow \begin{cases} (X\eta)^T(X\eta)=0\Rightarrow X\eta=0\\ \eta_k=0,\qquad k\in\{1,2,...,n\} \end{cases}\]

Since \(\eta_k=0\) for \(k\in\{1,2,\dots,n\}\):

\[X\eta=\begin{bmatrix} 1 & {x^{(1)}}^T \\ 1 & {x^{(2)}}^T \\ \vdots & \vdots \\1 & {x^{(m)}}^T\end{bmatrix}\begin{bmatrix}\eta_0\\0\\\vdots\\0\end{bmatrix}=\begin{bmatrix}\eta_0\\\eta_0\\\vdots\\\eta_0\end{bmatrix}=0\\\Rightarrow \eta_0=0\]

From what has been discussed above:

\[\forall \eta \in \mathbb R^{n+1}\setminus\{\vec0\},\quad \eta^T\left(X^TX+\lambda\begin{bmatrix}0 \\& 1\\& & \ddots\\& & & 1\end{bmatrix}\right)\eta>0\]

This indicates that \((X^TX+\lambda \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} )\) is a positive definite matrix, so it is invertible.
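
A quick numerical illustration of this claim (not a replacement for the proof): even when \(X^TX\) alone is singular, adding \(\lambda D\) with \(\lambda>0\) yields a matrix whose Cholesky factorization succeeds, which requires positive definiteness. The data here is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 10                                   # fewer samples than features: X^T X alone is singular
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
D = np.eye(n + 1)
D[0, 0] = 0.0                                  # do not regularize the bias term
lam = 0.1

A = X.T @ X + lam * D
L = np.linalg.cholesky(A)                      # succeeds only for a positive definite matrix
print(np.min(np.linalg.eigvalsh(A)) > 0)       # True
```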

Regularized Logistic Regression

pass