Suppose $X$ is an $n \times p$ matrix with $n$ observations and $p$ features. Then there are two geometric views of linear regression. The first is the row view in $\mathbb{R}^p$, where each observation is a point.
The second view treats each column (feature) of $X$ as a vector in $\mathbb{R}^n$.
The orthogonal projection is carried out by the hat matrix $H = X(X^TX)^{-1}X^T$, which linearly maps $y$ to $\hat{y} = Hy$ on the hyperplane spanned by the columns of $X$, the features.
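As a quick illustration (not part of the original notes), here is a minimal NumPy sketch of this projection view, using a synthetic design matrix $X$ and response $y$: the hat matrix maps $y$ to $\hat{y}$, and the residual is orthogonal to every feature column.

```python
# Minimal sketch with synthetic data: the hat matrix as an orthogonal projection.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))                 # each row: one observation in R^p
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

# Hat matrix H = X (X^T X)^{-1} X^T projects y onto the column space of X.
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y                               # fitted values: projection of y

# The residual y - y_hat is orthogonal to every column (feature) of X.
print(np.allclose(X.T @ (y - y_hat), 0, atol=1e-8))
```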
Since OLS is already the Best Linear Unbiased Estimator, it attains the minimum variance among all unbiased estimators. To reduce the overall prediction error, we therefore have to sacrifice unbiasedness.
Bias and Variance
Bias: error in the expectation of our estimator. It does not depend on the randomness of our particular training-data realization, only on the flexibility of our function $\hat{f}$. Making $\hat{f}$ more flexible usually decreases bias.
Variance: error from the variance of our estimator around its mean. It does not depend on the true function $f$! It tends to increase with the flexibility of $\hat{f}$, since a change in the data then has a larger effect (see the simulation sketch below).
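A rough simulation (my own illustration, not from the notes) of this trade-off: fit polynomials of increasing degree to many noisy realizations of a fixed true function, then estimate the squared bias and the variance of the fitted curves.

```python
# Bias-variance simulation: flexibility = polynomial degree.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)        # true function f
x_train = np.linspace(0, 1, 30)
x_test = np.linspace(0, 1, 200)

for degree in (1, 3, 9):                   # increasing flexibility
    fits = []
    for _ in range(200):                   # many training-set realizations
        y = f(x_train) + rng.normal(scale=0.3, size=x_train.size)
        coefs = np.polyfit(x_train, y, degree)
        fits.append(np.polyval(coefs, x_test))
    fits = np.array(fits)
    bias2 = np.mean((fits.mean(axis=0) - f(x_test)) ** 2)   # squared bias
    var = np.mean(fits.var(axis=0))                          # variance
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```

The low-degree fit has high bias and low variance; the high-degree fit shows the opposite pattern.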
Ridge Regression
To handle high-dimensional problems and reduce the variance of the model by accepting some bias, we can add a penalty term to the cost function. The ridge cost function $\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2$ punishes larger coefficients more heavily and shrinks them towards zero.
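A small sketch (synthetic data, an arbitrary choice of $\lambda$) of the resulting closed form $\hat{\beta}^{ridge} = (X^TX + \lambda I)^{-1}X^Ty$, compared against OLS:

```python
# Ridge closed form versus OLS on centered data.
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 5
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                       # center the features
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)
y = y - y.mean()

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

lam = 10.0                                   # penalty strength lambda
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# The penalty shrinks the coefficients toward zero, but none become exactly zero.
print(np.linalg.norm(beta_ridge), "<", np.linalg.norm(beta_ols))
```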
A sparse set of coefficients means we only need to measure a few features in the future, which lowers data-collection and computation costs.
Sparsity also tells us about the underlying data: which features are actually useful?
As with ridge regression before, we also need to center and scale the features.
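A sketch assuming scikit-learn is available: standardize the features before fitting the lasso so the penalty treats every coefficient on the same scale. The data, scales, and `alpha` value here are made up for illustration.

```python
# Standardize features, then fit the lasso; many coefficients become exactly zero.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 8)) * np.array([1, 10, 100, 1, 1, 1, 1, 1])  # mixed scales
beta_true = np.array([2.0, 0, 0, -3.0, 0, 0, 1.5, 0])
y = X @ beta_true + rng.normal(scale=0.5, size=100)

model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X, y)

# Sparse solution: coefficients of the irrelevant features are driven to zero.
print(model.named_steps["lasso"].coef_)
```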
Ridge and Lasso Comparison
To understand further why different penalty forms lead to different behaviors, consider the following simplified problem. Take $X$ to be an orthogonal matrix, so that $X^TX = I$; then OLS gives the solution $\hat{\beta}^{OLS} = X^Ty$. The following analysis gives insight into both Ridge and Lasso.
Ridge has a closed-form solution obtained by setting the derivative to zero; in this orthonormal case $\hat{\beta}^{ridge} = \hat{\beta}^{OLS}/(1 + \lambda)$. Compared with the OLS coefficients, the ridge coefficients are uniformly smaller, but they can never be exactly zero.
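A quick numeric check of this shrinkage factor (my own sketch: a matrix with orthonormal columns is built via QR, and $\lambda$ is chosen arbitrarily):

```python
# Orthonormal design: ridge equals OLS shrunk by 1 / (1 + lambda).
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 4
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))   # Q has orthonormal columns: Q^T Q = I
y = Q @ np.array([3.0, -1.0, 0.2, 0.0]) + rng.normal(scale=0.3, size=n)

beta_ols = Q.T @ y                              # OLS solution when X^T X = I
lam = 2.0
beta_ridge = np.linalg.solve(Q.T @ Q + lam * np.eye(p), Q.T @ y)

print(np.allclose(beta_ridge, beta_ols / (1 + lam)))   # True: uniform shrinkage
```

Every coefficient is scaled by the same factor, so small OLS coefficients shrink but never hit zero exactly.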