Regularization
Overfitting
First of all, let us begin with a schematic diagram of the two failure modes:
- Overfitting: the learned hypothesis may fit the training set very well but fail to generalize to the test set.
- Underfitting: the learned hypothesis cannot fit even the training set well.
Overcoming the Overfitting
- Reduce the number of features
  - Manually select which features to keep.
  - Use a model selection algorithm (PCA, etc.).
- Regularization
  - Keep all the features, but reduce the magnitude of the parameters \(\theta_j,\ j\in\mathbb{N^+}\).
Regularization Term
Notation
Symbol | Type | Meaning |
---|---|---|
\(m\) | \(\mathbb N^+\) | the number of training samples |
\(n\) | \(\mathbb N^+\) | the number of features |
\(\theta\) | \(\mathbb R^{n+1}\) | the vector of model weights |
\(\lambda\) | \(\mathbb R\) | the regularization parameter (penalization rate) |
\(J_0\) | \(\mathbb R^{n+1} \to \mathbb R\) | the original cost function, without the regularization term |
\(J\) | \(\mathbb R^{n+1} \to \mathbb R\) | the new cost function, with the regularization term |
L1 Norm formula
\[J(\theta)=J_0(\theta)+\frac\lambda{2m}\sum_{j=1}^n|\theta_j|\]
L2 Norm formula
\[J(\theta)=J_0(\theta)+\frac\lambda{2m}\sum_{j=1}^n{\theta_j}^2\]
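For concreteness, here is a minimal NumPy sketch of adding the L2 penalty to an arbitrary base cost; the function and argument names (`l2_regularized_cost`, `original_cost`) are hypothetical, not part of these notes.

```python
import numpy as np

def l2_regularized_cost(theta, original_cost, lam, m):
    """Return J(theta) = J_0(theta) + (lambda / (2m)) * sum_{j>=1} theta_j^2.

    theta         -- weight vector of shape (n + 1,); theta[0] is the bias term
    original_cost -- callable computing the unregularized cost J_0(theta)
    lam           -- regularization parameter lambda
    m             -- number of training samples
    """
    penalty = (lam / (2 * m)) * np.sum(theta[1:] ** 2)   # theta_0 is not penalized
    # For the L1 (lasso) variant, use np.sum(np.abs(theta[1:])) instead.
    return original_cost(theta) + penalty
```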
Derivative of the new cost function
We use the L2 norm regularization term below.
Normal Formula
\[\frac{\partial}{\partial \theta_j}J(\theta)=\frac{\partial}{\partial \theta_j}J_0(\theta)+\frac{\partial}{\partial \theta_j}\Big(\frac\lambda{2m}\sum_{k=1}^n{\theta_k}^2\Big)\] \[\frac{\partial}{\partial \theta_j}J(\theta)=\frac{\partial}{\partial \theta_j}J_0(\theta)+\frac\lambda m\theta_j,\qquad j\in\{1,2,\dots,n\}\]
Matrix Formula
\[\frac{\partial}{\partial \theta}J(\theta)=\frac{\partial}{\partial \theta}J_0(\theta)+\frac{\partial}{\partial \theta}\Big(\frac\lambda{2m}\,\theta^T \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} \theta\Big)\] \[\frac{\partial}{\partial \theta}J(\theta)=\frac{\partial}{\partial \theta}J_0(\theta)+\frac\lambda m \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} \theta\]
Regularizing Model
Regularized Linear Regression
\[J(\theta)=\frac1{2m}\Big[(X\theta-y)^T(X\theta-y)+\lambda\, \theta^T \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} \theta\Big]\] \[\frac{\partial}{\partial \theta}J(\theta)=\frac1m\Big[X^T(X\theta-y)+\lambda \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} \theta\Big]\]
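A minimal NumPy sketch of these two matrix formulas (the helper name and the variables `X`, `y`, `lam` are assumptions for illustration):

```python
import numpy as np

def cost_and_gradient(theta, X, y, lam):
    """Regularized linear regression cost J(theta) and its gradient.

    X   -- design matrix of shape (m, n + 1) whose first column is all ones
    y   -- target vector of shape (m,)
    lam -- regularization parameter lambda
    """
    m = X.shape[0]
    residual = X @ theta - y
    D = np.eye(theta.size)
    D[0, 0] = 0.0                                  # do not penalize the bias theta_0
    cost = (residual @ residual + lam * (theta @ D @ theta)) / (2 * m)
    grad = (X.T @ residual + lam * (D @ theta)) / m
    return cost, grad
```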
Gradient Descent
\[\theta:=\theta-\alpha\frac{\partial}{\partial \theta}J(\theta)\] \[\theta_j:=\theta_j\Big(1-\alpha\frac\lambda m\Big)-\frac\alpha m\sum_{i=1}^m(h(x^{(i)})-y^{(i)})x_j^{(i)},\qquad \forall j\in\{1,2,\dots,n\}\]
(\(\theta_0\) is updated without the shrink factor, since the bias term is not regularized.)
You can see the term \(\theta_j(1-\alpha\frac\lambda m)\) in the update rule. The formula indicates that we:
- first shrink \(\theta_j\);
- then update \(\theta_j\) with the derivative term.
We call the term \(1-\alpha\frac\lambda m\) the shrink rate.
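Written as a plain gradient descent loop, the update could look like the sketch below, reusing the hypothetical `cost_and_gradient` helper above; `alpha` and `num_iters` are illustrative defaults, not values from these notes.

```python
import numpy as np

def gradient_descent(X, y, lam, alpha=0.01, num_iters=1000):
    """Learn theta by repeatedly shrinking and stepping along the negative gradient."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        _, grad = cost_and_gradient(theta, X, y, lam)
        # Equivalent to: shrink theta_j by (1 - alpha * lam / m), then subtract the data term.
        theta = theta - alpha * grad
    return theta
```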
The parameter \(\lambda\) must be chosen manually, so **"which \(\lambda\) is proper for the model"** is the problem.
- \(\lambda\) is too big: the learned hypothesis will underfit. In fact, it can learn almost nothing, because every \(\theta_j\) is pushed toward zero.
- \(\lambda\) is too small: the regularization term is approximately 0, and the model is effectively unregularized.
According to the experience of ML pioneers,
\[1-\alpha\frac\lambda m \in [0.5,1)\ \text{is great}\]
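For example, with illustrative values \(\alpha=0.1\), \(m=100\), \(\lambda=10\) (not taken from these notes), the shrink rate is
\[1-\alpha\frac\lambda m=1-0.1\cdot\frac{10}{100}=0.99\in[0.5,1)\]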
Normal Equation
Solving the Equation
let \(\frac{\partial}{\partial \theta}J(\theta)=0\)
\[\frac1m\Big[X^T(X\theta-y)+\lambda \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} \theta\Big]=0\] \[\Rightarrow\theta=\Big(X^TX+\lambda \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix} \Big)^{-1} X^Ty\]
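A minimal NumPy sketch of this closed-form solution (variable names are assumptions; `np.linalg.solve` is used instead of forming the explicit inverse, which is the usual, numerically safer choice):

```python
import numpy as np

def normal_equation(X, y, lam):
    """Closed-form theta for regularized linear regression."""
    D = np.eye(X.shape[1])
    D[0, 0] = 0.0                                   # leave the bias theta_0 unregularized
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)
```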
Invertibility
If \(\lambda >0\), the matrix \(X^TX+\lambda \begin{bmatrix} 0 \\ & 1\\ & & \ddots\\ & & & 1 \end{bmatrix}\) is invertible.
Proof:
For any \(\eta \in \mathbb R^{n+1}\), \(\eta=\begin{bmatrix}\eta_0,\eta_1,\dots,\eta_n\end{bmatrix}^T\):
\[\eta^T\Big(X^TX+\lambda\begin{bmatrix}0 \\& 1\\& & \ddots\\& & & 1\end{bmatrix}\Big)\eta=(X\eta)^T(X\eta)+\lambda\sum_{k=1}^n\eta_k^2\geq0\]
When \((X\eta)^T(X\eta)+\lambda\sum_{k=1}^n\eta_k^2=0\),
\[\Rightarrow \begin{cases} (X\eta)^T(X\eta)=0\Rightarrow X\eta=0\\ \eta_k=0,\qquad k\in\{1,2,\dots,n\}\quad(\text{using }\lambda>0) \end{cases}\]
Since \(\eta_k=0,\ k\in\{1,2,\dots,n\}\),
\[X\eta=\begin{bmatrix} 1 & {x^{(1)}}^T \\ 1 & {x^{(2)}}^T \\ \vdots & \vdots \\1 & {x^{(m)}}^T\end{bmatrix}\begin{bmatrix}\eta_0\\0\\\vdots\\0\end{bmatrix}=\begin{bmatrix}\eta_0\\\eta_0\\\vdots\\\eta_0\end{bmatrix}=0\Rightarrow \eta_0=0\]
From what has been discussed above:
\[\forall \eta \in \mathbb R^{n+1}\setminus\{\vec0\},\quad \eta^T\Big(X^TX+\lambda\begin{bmatrix}0 \\& 1\\& & \ddots\\& & & 1\end{bmatrix}\Big)\eta>0\]
This indicates that \(X^TX+\lambda\begin{bmatrix}0 \\& 1\\& & \ddots\\& & & 1\end{bmatrix}\) is a positive definite matrix, so it is invertible.
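As a quick numerical sanity check of this claim, the sketch below builds a random design matrix (shapes and values chosen only for illustration) and verifies that all eigenvalues of \(X^TX+\lambda D\) are positive:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 20, 5, 1.0
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # design matrix with a bias column
D = np.eye(n + 1)
D[0, 0] = 0.0
A = X.T @ X + lam * D
print(np.linalg.eigvalsh(A).min() > 0)                     # True: positive definite, hence invertible
```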
Regularized Logistic Regression
pass