Review of Linear Regression

Suppose $X$ is $n \times p$, with $n$ observations and $p$ features. There are two geometric views of linear regression. The first is the observation view in $\mathbb{R}^p$, where each observation $(x_{i1}, \ldots, x_{ip}, y_i)$ is a point.


The second view treats each column (feature) of $X$ as a vector in $\mathbb{R}^n$.


The orthogonal projection is performed by the matrix $H$, which linearly maps $Y$ to $\hat{Y}$ in the column space spanned by the features (the columns of $X$).

$$H = X(X^TX)^{-1}X^T, \qquad \hat{Y} = HY, \qquad \hat{\beta} = (X^TX)^{-1}X^TY, \qquad \hat{Y} = X\hat{\beta}$$
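
As a quick numerical check of these identities, here is a small NumPy sketch on simulated data (the dimensions and random data are arbitrary choices for illustration): it builds $H$, computes $\hat{Y}$ both as $HY$ and as $X\hat{\beta}$, and verifies that the residuals are orthogonal to the columns of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix H = X (X^T X)^{-1} X^T
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y   # OLS coefficients
Y_hat = X @ beta_hat                          # fitted values

assert np.allclose(Y_hat, H @ Y)              # the two routes to Y_hat agree
assert np.allclose(X.T @ (Y - Y_hat), 0)      # residuals orthogonal to the column space of X
```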

Statistics and Machine Learning

In classical statistics the focus is more on inference: we care about the variance, significance, and confidence intervals of the $\beta$ coefficients, so problems such as collinearity have to be dealt with. In a pure prediction problem, however, we do not need to worry about this; sometimes even a wrong model can still predict well over a certain range.

Linear Regression in Higher Dimensions

In the homework we ran a simulation: keeping the true model $Y = 4X_1 + \epsilon$ fixed, adding irrelevant predictors causes the model's MSE to increase sharply as the number of useless variables grows. To deal with this we introduce models such as ridge and lasso. The starting point is to sacrifice some bias in exchange for a reduction in variance.

$$E\big[(Y - \hat{\mu}(X))^2\big] = \sigma_x^2 + \operatorname{Bias}\big(\hat{\mu}(X)\big)^2 + \operatorname{Var}\big(\hat{\mu}(X)\big)$$

Since OLS is already the Best Linear Unbiased Estimator, it attains the minimum variance among all unbiased estimators. To reduce the overall prediction error, the only thing we can sacrifice is unbiasedness.
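
As a rough sketch of the simulation described above (the sample sizes, noise level, and use of a held-out test set are my own choices, not necessarily the exact homework setup), fitting OLS with more and more pure-noise predictors makes the held-out MSE climb:

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test = 50, 1000

def test_mse(p_noise):
    """Fit OLS on Y = 4*X1 + eps plus p_noise irrelevant predictors; return test-set MSE."""
    p = 1 + p_noise
    Xtr, Xte = rng.normal(size=(n_train, p)), rng.normal(size=(n_test, p))
    ytr = 4 * Xtr[:, 0] + rng.normal(size=n_train)
    yte = 4 * Xte[:, 0] + rng.normal(size=n_test)
    beta, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)   # OLS fit, no intercept needed here
    return np.mean((yte - Xte @ beta) ** 2)

for p_noise in (0, 5, 10, 20, 40):
    print(p_noise, round(test_mse(p_noise), 2))        # test MSE grows with the number of noise features
```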

Bias and Variance

$$\operatorname{bias}\big(\hat{\mu}(x)\big) = E\big[\hat{\mu}(X) \mid X = x\big] - \mu(x)$$

Bias: error in the expectation of our estimator. It does not depend on the randomness of our particular training-data realization, only on the flexibility of our function. Making $\hat{\mu}$ more flexible usually decreases bias.

$$\operatorname{var}\big(\hat{\mu}(x)\big) = E\Big[\big(\hat{\mu}(X) - E[\hat{\mu}(X) \mid X = x]\big)^2 \,\Big|\, X = x\Big]$$

Variance: error from the variance of our estimator around its mean. It does not depend on the true function $\mu$! It tends to increase with the flexibility of $\hat{\mu}$, since a change in the data then has a larger effect.
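
A minimal simulation sketch of these two quantities (the true function, noise level, and the use of polynomial fits of increasing degree are all illustrative assumptions): repeatedly redraw training sets, record the estimator's prediction at a fixed point $x_0$, and estimate the bias and variance from those predictions.

```python
import numpy as np

rng = np.random.default_rng(5)
mu = lambda x: np.sin(2 * x)        # assumed true regression function
x0, n, reps = 0.8, 30, 2000         # evaluation point, training size, Monte Carlo repetitions

def predictions(degree):
    """Predictions at x0 from polynomial fits of a given degree over repeated training sets."""
    preds = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(-2, 2, size=n)
        y = mu(x) + rng.normal(scale=0.5, size=n)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x0)
    return preds

for degree in (1, 3, 7):
    p = predictions(degree)
    print(degree, round(p.mean() - mu(x0), 3), round(p.var(), 4))
    # bias shrinks and variance grows as the fit becomes more flexible
```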

Ridge Regression

To handle higher dimensions and reduce the variance of the model at the cost of some bias, we can add a penalty term to the cost function. The penalized cost function punishes larger coefficients more heavily and drives the coefficients toward zero.

[Figures: ridge coefficient estimates shrinking as $\lambda$ increases]

The figures above show that ridge shrinks the coefficients, thereby reducing the variance of the model.

$$\hat{\beta}_{\text{ridge}} = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n (y_i - x_i^T\beta)^2 + \lambda \sum_{j=1}^p \beta_j^2 = \arg\min_{\beta \in \mathbb{R}^p} \underbrace{\|y - X\beta\|_2^2}_{\text{Loss}} + \lambda \underbrace{\|\beta\|_2^2}_{\text{Penalty}}$$
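
Ridge also has a closed-form solution, $\hat{\beta}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty$ (a standard result, not derived in these notes). The sketch below, on simulated data with some truly-zero coefficients, shows every coefficient shrinking toward zero as $\lambda$ grows without ever reaching it.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X^T X + lambda I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 1.0])   # includes two truly-zero coefficients
y = X @ true_beta + rng.normal(size=100)

for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, np.round(ridge(X, y, lam), 2))       # all coefficients shrink, none hit exactly 0
```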

Note in particular that as long as $\lambda$ is nonzero, the model's MSE on the training set necessarily gets worse; the gain, if any, shows up in prediction on new data.

Centering and Scaling

In general we do not want this shrinkage effect to act on the intercept. If we pull the intercept out of the $\beta$ vector, the objective function becomes:

$$\hat{\beta}_0, \hat{\beta}_{\text{ridge}} = \arg\min_{\beta_0 \in \mathbb{R},\, \beta \in \mathbb{R}^p} \|y - \beta_0 \mathbf{1} - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$

If the columns of $X$ are centered, then $\hat{\beta}_0 = \bar{y}$ (because the regression line always passes through $(\bar{x}, \bar{y})$). If $Y$ is also centered, we no longer need the intercept term at all. So in practice both $X$ and $Y$ are usually centered by subtracting their means, which also makes the derivations much easier.

More importantly, in ridge all predictors need to be scaled. The reason is that the penalty $\|\beta\|_2^2 = \sum_{j=1}^p \beta_j^2$ is unfair if the variables are not measured in the same units, so we scale the columns of $X$ to have sample variance 1.
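
A small sketch of this preprocessing (the data and the value of $\lambda$ are made up for illustration): center $y$, center and scale the columns of $X$ to unit sample variance, fit ridge without an intercept, then map the coefficients back to the original scale and recover the intercept from the means.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4)) * np.array([1.0, 10.0, 0.1, 5.0])   # columns on very different scales
y = 2 + X @ np.array([1.0, 0.2, 30.0, 0.0]) + rng.normal(size=100)

x_mean, x_sd = X.mean(axis=0), X.std(axis=0, ddof=1)
Xs = (X - x_mean) / x_sd            # centered and scaled predictors
yc = y - y.mean()                   # centered response, so no intercept is needed

lam = 5.0
beta_s = np.linalg.solve(Xs.T @ Xs + lam * np.eye(Xs.shape[1]), Xs.T @ yc)

beta = beta_s / x_sd                          # back to the original units
intercept = y.mean() - x_mean @ beta          # recovered intercept
print(np.round(beta, 3), round(intercept, 3))
```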

[Figure: ridge coefficient paths for truly-zero vs. truly-nonzero coefficients as $\lambda$ increases]

Ridge's drawback is also clear. The figure above shows two groups of coefficients: one whose true values are 0 and one whose true values are nonzero. Most of the ridge penalty falls on the nonzero coefficients: as $\lambda$ gradually increases, the coefficients we actually care about are strongly affected, while the coefficients that should be zero barely move.

Lasso

Lasso adds an $\ell_1$ penalty (the sum of the absolute values of the coefficients) to the cost function.

$$\hat{\beta}_{\text{lasso}} = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^p |\beta_j| = \arg\min_{\beta \in \mathbb{R}^p} \underbrace{\|y - X\beta\|_2^2}_{\text{Loss}} + \lambda \underbrace{\|\beta\|_1}_{\text{Penalty}}$$

The nature of the $\ell_1$ penalty allows lasso to shrink some coefficients exactly to 0, which ridge cannot do.
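
A quick sparsity demonstration using scikit-learn's Lasso (its alpha plays the role of $\lambda$, although its objective rescales the squared loss by $1/(2n)$; the true coefficient vector here is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
true_beta = np.array([4.0, -3.0, 2.0, 0, 0, 0, 0, 0, 0, 0])   # only the first three are nonzero
y = X @ true_beta + rng.normal(size=200)

Xs = StandardScaler().fit_transform(X)          # center and scale, as discussed above
fit = Lasso(alpha=0.5).fit(Xs, y - y.mean())
print(np.round(fit.coef_, 2))                   # the truly-zero coefficients come out exactly 0
```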

[Figures: lasso coefficient paths; the truly-zero coefficients are driven exactly to 0 as $\lambda$ increases]

We can see that the coefficients whose true values are 0 are indeed shrunk to exactly 0, which is the result we want. This situation, where a subset of the coefficients becomes zero, is called sparsity. It brings the following benefits:

  1. Interpretability: understanding $\hat{f}$.
  2. Fewer things to measure in the future, and lower computational cost.
  3. Insight into the underlying data: which features are actually useful?

As with ridge regression, we also need to center and scale the data.

Ridge and Lasso Comparison

To understand further why the different penalty forms lead to different behavior, consider the following simplified problem. Take $X = I$, an orthogonal matrix; then OLS gives the solution $\hat{\beta} = y$. The following analysis gives insight into both ridge and lasso.

$$\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^n (y_i - \beta_i)^2 + \lambda \sum_{i=1}^n \beta_i^2, \qquad \hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^n (y_i - \beta_i)^2 + \lambda \sum_{i=1}^n |\beta_i|$$

Ridge has a closed-form solution obtained by taking derivatives. Compared with the OLS coefficients, the ridge coefficients are smaller in magnitude, but they can never be exactly zero:

$$\hat{\beta}_i = \frac{y_i}{1 + \lambda}$$

For lasso, $\hat{\beta}_i$ comes from minimizing $(y_i - \beta_i)^2 + \lambda|\beta_i|$.

Differentiating this expression gives a piecewise solution: when $y_i \in [-\frac{\lambda}{2}, \frac{\lambda}{2}]$, we get $\hat{\beta}_i = 0$; otherwise $\hat{\beta}_i$ is $y_i$ pulled toward zero by $\frac{\lambda}{2}$, i.e. $\hat{\beta}_i = \operatorname{sign}(y_i)\,(|y_i| - \frac{\lambda}{2})_+$ (soft thresholding). So the smaller the OLS coefficient (here $y_i$), the more likely it is to be squeezed to exactly zero.
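
The two shrinkage rules in this orthogonal setting can be compared directly in code: ridge divides each $y_i$ by $1 + \lambda$, while lasso applies the soft-thresholding rule above. A minimal sketch:

```python
import numpy as np

def ridge_shrink(y, lam):
    """Ridge solution when X = I: proportional shrinkage, never exactly zero."""
    return y / (1 + lam)

def lasso_shrink(y, lam):
    """Lasso solution when X = I: soft thresholding by lambda / 2."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2, 0.0)

y = np.array([-3.0, -0.4, 0.1, 0.8, 5.0])   # the OLS estimates in this setting
lam = 1.0
print(ridge_shrink(y, lam))                 # every entry shrunk, none exactly 0
print(lasso_shrink(y, lam))                 # entries with |y_i| <= lambda/2 become exactly 0
```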