Review of Linear Regression

Suppose $X$ is $n \times p$, with $n$ observations and $p$ features. There are two geometric views of linear regression. The first is the observation view in $\mathbb{R}^p$, where each observation $(x_{i1}, \ldots, x_{ip}, y_i)$ is a point.


The second view treats each column (feature) of $X$ as a vector in $\mathbb{R}^n$.


The orthogonal projection is performed by the matrix $H$, which linearly maps $Y$ to $\hat{Y}$ in the column space spanned by the features (the columns of $X$).

$$H = X(X^TX)^{-1}X^T, \qquad \hat{Y} = HY, \qquad \hat{\beta} = (X^TX)^{-1}X^TY, \qquad \hat{Y} = X\hat{\beta}$$
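
As a quick numerical check of these identities, here is a small NumPy sketch on simulated data (the dimensions and random data are arbitrary choices for illustration): it builds $H$, computes $\hat{Y}$ both as $HY$ and as $X\hat{\beta}$, and verifies that the residuals are orthogonal to the columns of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix H = X (X^T X)^{-1} X^T
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y   # OLS coefficients
Y_hat = X @ beta_hat                          # fitted values

assert np.allclose(Y_hat, H @ Y)              # the two routes to Y_hat agree
assert np.allclose(X.T @ (Y - Y_hat), 0)      # residuals orthogonal to the column space of X
```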

Statistics and Machine Learning

In classical statistics the focus is more on inference: we care about the variance, significance, and confidence intervals of the $\beta$ coefficients, so problems such as collinearity have to be dealt with. In a pure prediction problem, however, we do not need to worry about this; sometimes even a wrong model can still predict well over a certain range.

Linear Regression in Higher Dimensions

In the homework we ran a simulation: keeping the true model $Y = 4X_1 + \epsilon$ fixed, adding irrelevant predictors causes the model's MSE to increase sharply as the number of useless variables grows. To deal with this we introduce models such as ridge and lasso. The starting point is to sacrifice some bias in exchange for a reduction in variance.

$$E\big[(Y - \hat{\mu}(X))^2\big] = \sigma_x^2 + \operatorname{Bias}\big(\hat{\mu}(X)\big)^2 + \operatorname{Var}\big(\hat{\mu}(X)\big)$$

Since OLS is already the Best Linear Unbiased Estimator, it attains the minimum variance among all unbiased estimators. To reduce the overall prediction error, the only thing we can sacrifice is unbiasedness.
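
As a rough sketch of the simulation described above (the sample sizes, noise level, and use of a held-out test set are my own choices, not necessarily the exact homework setup), fitting OLS with more and more pure-noise predictors makes the held-out MSE climb:

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test = 50, 1000

def test_mse(p_noise):
    """Fit OLS on Y = 4*X1 + eps plus p_noise irrelevant predictors; return test-set MSE."""
    p = 1 + p_noise
    Xtr, Xte = rng.normal(size=(n_train, p)), rng.normal(size=(n_test, p))
    ytr = 4 * Xtr[:, 0] + rng.normal(size=n_train)
    yte = 4 * Xte[:, 0] + rng.normal(size=n_test)
    beta, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)   # OLS fit, no intercept needed here
    return np.mean((yte - Xte @ beta) ** 2)

for p_noise in (0, 5, 10, 20, 40):
    print(p_noise, round(test_mse(p_noise), 2))        # test MSE grows with the number of noise features
```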

Bias and Variance

$$\operatorname{bias}\big(\hat{\mu}(x)\big) = E\big[\hat{\mu}(X) \mid X = x\big] - \mu(x)$$

Bias: error in the expectation of our estimator. It does not depend on the randomness of our particular training-data realization, only on the flexibility of our function. Making $\hat{\mu}$ more flexible usually decreases bias.

$$\operatorname{var}\big(\hat{\mu}(x)\big) = E\Big[\big(\hat{\mu}(X) - E[\hat{\mu}(X) \mid X = x]\big)^2 \,\Big|\, X = x\Big]$$

Variance: error from the variance of our estimator around its mean. It does not depend on the true function $\mu$! It tends to increase with the flexibility of $\hat{\mu}$, since a change in the data then has a larger effect.
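
A minimal simulation sketch of these two quantities (the true function, noise level, and the use of polynomial fits of increasing degree are all illustrative assumptions): repeatedly redraw training sets, record the estimator's prediction at a fixed point $x_0$, and estimate the bias and variance from those predictions.

```python
import numpy as np

rng = np.random.default_rng(5)
mu = lambda x: np.sin(2 * x)        # assumed true regression function
x0, n, reps = 0.8, 30, 2000         # evaluation point, training size, Monte Carlo repetitions

def predictions(degree):
    """Predictions at x0 from polynomial fits of a given degree over repeated training sets."""
    preds = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(-2, 2, size=n)
        y = mu(x) + rng.normal(scale=0.5, size=n)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x0)
    return preds

for degree in (1, 3, 7):
    p = predictions(degree)
    print(degree, round(p.mean() - mu(x0), 3), round(p.var(), 4))
    # bias shrinks and variance grows as the fit becomes more flexible
```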

Ridge Regression

To handle higher dimensions and reduce the variance of the model at the cost of some bias, we can add a penalty term to the cost function. The penalized cost function punishes larger coefficients more heavily and drives the coefficients toward zero.

[Figures: ridge coefficient estimates shrinking as $\lambda$ increases]

The figures above show that ridge shrinks the coefficients, thereby reducing the variance of the model.

$$\hat{\beta}_{\text{ridge}} = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n (y_i - x_i^T\beta)^2 + \lambda \sum_{j=1}^p \beta_j^2 = \arg\min_{\beta \in \mathbb{R}^p} \underbrace{\|y - X\beta\|_2^2}_{\text{Loss}} + \lambda \underbrace{\|\beta\|_2^2}_{\text{Penalty}}$$
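
Ridge also has a closed-form solution, $\hat{\beta}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty$ (a standard result, not derived in these notes). The sketch below, on simulated data with some truly-zero coefficients, shows every coefficient shrinking toward zero as $\lambda$ grows without ever reaching it.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X^T X + lambda I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 1.0])   # includes two truly-zero coefficients
y = X @ true_beta + rng.normal(size=100)

for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, np.round(ridge(X, y, lam), 2))       # all coefficients shrink, none hit exactly 0
```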

Note in particular that as long as $\lambda$ is nonzero, the model's MSE on the training set necessarily gets worse; the gain, if any, shows up in prediction on new data.

Centering and Scaling

In general we do not want this shrinkage effect to act on the intercept. If we pull the intercept out of the $\beta$ vector, the objective function becomes:

$$\hat{\beta}_0, \hat{\beta}_{\text{ridge}} = \arg\min_{\beta_0 \in \mathbb{R},\, \beta \in \mathbb{R}^p} \|y - \beta_0 \mathbf{1} - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$

If the columns of $X$ are centered, then $\hat{\beta}_0 = \bar{y}$ (because the regression line always passes through $(\bar{x}, \bar{y})$). If $Y$ is also centered, we no longer need the intercept term at all. So in practice both $X$ and $Y$ are usually centered by subtracting their means, which also makes the derivations much easier.

More importantly, in ridge all predictors need to be scaled. The reason is that the penalty $\|\beta\|_2^2 = \sum_{j=1}^p \beta_j^2$ is unfair if the variables are not measured in the same units, so we scale the columns of $X$ to have sample variance 1.
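
A small sketch of this preprocessing (the data and the value of $\lambda$ are made up for illustration): center $y$, center and scale the columns of $X$ to unit sample variance, fit ridge without an intercept, then map the coefficients back to the original scale and recover the intercept from the means.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4)) * np.array([1.0, 10.0, 0.1, 5.0])   # columns on very different scales
y = 2 + X @ np.array([1.0, 0.2, 30.0, 0.0]) + rng.normal(size=100)

x_mean, x_sd = X.mean(axis=0), X.std(axis=0, ddof=1)
Xs = (X - x_mean) / x_sd            # centered and scaled predictors
yc = y - y.mean()                   # centered response, so no intercept is needed

lam = 5.0
beta_s = np.linalg.solve(Xs.T @ Xs + lam * np.eye(Xs.shape[1]), Xs.T @ yc)

beta = beta_s / x_sd                          # back to the original units
intercept = y.mean() - x_mean @ beta          # recovered intercept
print(np.round(beta, 3), round(intercept, 3))
```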

[Figure: ridge coefficient paths for truly-zero vs. truly-nonzero coefficients as $\lambda$ increases]

Ridge's drawback is also clear. The figure above shows two groups of coefficients: one whose true values are 0 and one whose true values are nonzero. Most of the ridge penalty falls on the nonzero coefficients: as $\lambda$ gradually increases, the coefficients we actually care about are strongly affected, while the coefficients that should be zero barely move.

Lasso

Lasso adds an $\ell_1$ penalty (the sum of the absolute values of the coefficients) to the cost function.

$$\hat{\beta}_{\text{lasso}} = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^p |\beta_j| = \arg\min_{\beta \in \mathbb{R}^p} \underbrace{\|y - X\beta\|_2^2}_{\text{Loss}} + \lambda \underbrace{\|\beta\|_1}_{\text{Penalty}}$$

The nature of the $\ell_1$ penalty allows lasso to shrink some coefficients exactly to 0, which ridge cannot do.
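
A quick sparsity demonstration using scikit-learn's Lasso (its alpha plays the role of $\lambda$, although its objective rescales the squared loss by $1/(2n)$; the true coefficient vector here is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
true_beta = np.array([4.0, -3.0, 2.0, 0, 0, 0, 0, 0, 0, 0])   # only the first three are nonzero
y = X @ true_beta + rng.normal(size=200)

Xs = StandardScaler().fit_transform(X)          # center and scale, as discussed above
fit = Lasso(alpha=0.5).fit(Xs, y - y.mean())
print(np.round(fit.coef_, 2))                   # the truly-zero coefficients come out exactly 0
```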

[Figures: lasso coefficient paths; the truly-zero coefficients are driven exactly to 0 as $\lambda$ increases]

We can see that the coefficients whose true values are 0 are indeed shrunk to exactly 0, which is the result we want. This situation, where a subset of the coefficients becomes zero, is called sparsity. It brings the following benefits:

  1. Interpretability: understanding $\hat{f}$.
  2. Fewer things to measure in the future, and lower computational cost.
  3. Insight into the underlying data: which features are actually useful?

As with ridge regression, we also need to center and scale the data.

Ridge and Lasso Comparison

To understand further why the different penalty forms lead to different behavior, consider the following simplified problem. Take $X = I$, an orthogonal matrix; then OLS gives the solution $\hat{\beta} = y$. The following analysis gives insight into both ridge and lasso.

$$\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^n (y_i - \beta_i)^2 + \lambda \sum_{i=1}^n \beta_i^2, \qquad \hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^n (y_i - \beta_i)^2 + \lambda \sum_{i=1}^n |\beta_i|$$

Ridge has a closed-form solution obtained by taking derivatives. Compared with the OLS coefficients, the ridge coefficients are smaller in magnitude, but they can never be exactly zero:

$$\hat{\beta}_i = \frac{y_i}{1 + \lambda}$$

For lasso, $\hat{\beta}_i$ comes from minimizing $(y_i - \beta_i)^2 + \lambda|\beta_i|$.

Differentiating this expression gives a piecewise solution: when $y_i \in [-\frac{\lambda}{2}, \frac{\lambda}{2}]$, we get $\hat{\beta}_i = 0$; otherwise $\hat{\beta}_i$ is $y_i$ pulled toward zero by $\frac{\lambda}{2}$, i.e. $\hat{\beta}_i = \operatorname{sign}(y_i)\,(|y_i| - \frac{\lambda}{2})_+$ (soft thresholding). So the smaller the OLS coefficient (here $y_i$), the more likely it is to be squeezed to exactly zero.
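
The two shrinkage rules in this orthogonal setting can be compared directly in code: ridge divides each $y_i$ by $1 + \lambda$, while lasso applies the soft-thresholding rule above. A minimal sketch:

```python
import numpy as np

def ridge_shrink(y, lam):
    """Ridge solution when X = I: proportional shrinkage, never exactly zero."""
    return y / (1 + lam)

def lasso_shrink(y, lam):
    """Lasso solution when X = I: soft thresholding by lambda / 2."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2, 0.0)

y = np.array([-3.0, -0.4, 0.1, 0.8, 5.0])   # the OLS estimates in this setting
lam = 1.0
print(ridge_shrink(y, lam))                 # every entry shrunk, none exactly 0
print(lasso_shrink(y, lam))                 # entries with |y_i| <= lambda/2 become exactly 0
```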