ML-01 Ridge and Lasso Regression


Notes


Appendix

Graphical representation of Ridge


Figure: contour plot of the RSS together with the ridge constraint circle in the $(\beta_1, \beta_2)$ plane.

The contour plot shows how the RSS changes with different values of $\beta$; each contour connects points with the same RSS. Naturally, the OLS estimate sits at the center and gives the lowest RSS.

On the other hand, think of the circle as the constraint region induced by the ridge penalty. Ridge penalizes the squared L2 norm, $\lambda\left(\beta_1^2+\beta_2^2\right)$, which in constrained form corresponds to $\beta_1^2+\beta_2^2 \le s$; the budget $s$ (equivalently $\lambda$) controls the size of the circle.

Now there is a tradeoff: OLS wants to minimize the RSS, while larger $\beta$ incurs a higher ridge penalty. The tradeoff can be solved analytically, but graphically the solution must lie where an RSS contour touches the ridge circle, which lowers the RSS as much as possible for the given penalty budget.

If $\lambda$ is very large, the circle shrinks to a tiny region: even small coefficients carry a large penalty, so we accept a higher RSS.
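To make the tradeoff concrete, here is a minimal numerical sketch (my addition, not part of the original notes): it uses the closed-form ridge estimate $(X^\top X + \lambda I)^{-1} X^\top y$ on simulated data and shows that as $\lambda$ grows, the coefficient norm shrinks (a smaller circle) while the training RSS rises.

```python
import numpy as np

# Illustrative sketch: closed-form ridge solutions for increasing lambda.
rng = np.random.default_rng(0)
n, p = 200, 2
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

for lam in [0.0, 1.0, 10.0, 100.0]:
    # Ridge estimate: (X'X + lambda * I)^{-1} X'y (lambda = 0 recovers OLS)
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    rss = np.sum((y - X @ beta_hat) ** 2)
    print(f"lambda={lam:6.1f}  beta={beta_hat.round(3)}  "
          f"||beta||_2={np.linalg.norm(beta_hat):.3f}  RSS={rss:.1f}")
```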

Adding White Noise to the predictors

Suppose we are doing a simple linear regression and we add white noise to the predictor $X$; in other words,

$$Z_i = X_i + \varepsilon_i.$$

Now

$$\tilde{\beta}_1=\frac{\operatorname{Cov}\left(Y_i, Z_i\right)}{\operatorname{Var}\left(Z_i\right)}=\frac{\operatorname{Cov}\left(Y_i, X_i+\varepsilon_i\right)}{\operatorname{Var}\left(X_i+\varepsilon_i\right)}.$$

Since the white noise is independent of both $X$ and $Y$, and using $\beta_1=\operatorname{Cov}\left(Y_i, X_i\right)/\operatorname{Var}\left(X_i\right)$, this becomes

$$\tilde{\beta}_1=\frac{\operatorname{Cov}\left(Y_i, X_i\right)}{\operatorname{Var}\left(X_i\right)+\sigma^2}=\frac{\operatorname{Var}\left(X_i\right)}{\operatorname{Var}\left(X_i\right)+\sigma^2} \times \beta_1.$$

So a $\sigma^2$ appears in the denominator, and the larger it is, the more the original coefficient is shrunk toward zero, which acts like regularization.
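A quick simulation (my illustration, not from the original derivation, assuming standard NumPy) checks this attenuation: the slope from regressing $Y$ on the noisy predictor $Z$ should match $\beta_1 \cdot \operatorname{Var}(X)/(\operatorname{Var}(X)+\sigma^2)$.

```python
import numpy as np

# Regressing Y on the noisy predictor Z = X + eps shrinks the slope
# by roughly Var(X) / (Var(X) + sigma^2).
rng = np.random.default_rng(1)
n = 100_000
beta1, sigma = 2.0, 1.5
X = rng.normal(loc=0.0, scale=2.0, size=n)      # Var(X) = 4
Y = beta1 * X + rng.normal(size=n)              # true regression on X
Z = X + rng.normal(scale=sigma, size=n)         # predictor measured with white noise

slope_noisy = np.cov(Y, Z)[0, 1] / np.var(Z)    # beta_tilde_1 = Cov(Y, Z) / Var(Z)
shrinkage = np.var(X) / (np.var(X) + sigma**2)  # theoretical attenuation factor

print(f"noisy slope {slope_noisy:.3f}  vs  theory {beta1 * shrinkage:.3f}")
```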

Yiming Zhang
Quantitative Researcher Associate, JP Morgan