Multicollinearity Discussion
This post summarizes a very good reading on this topic; the PDF is linked in the last section.
Linear Algebra Views
Recall that, for multiple linear regression,
$$ \begin{aligned} \widehat{\beta}&=\left(\mathbf{X}^{T} \mathbf{X}\right)^{-1} \mathbf{X}^{T} \mathbf{Y}\\ \operatorname{Var}[\widehat{\beta}]&=\sigma^{2}\left(\mathbf{X}^{T} \mathbf{X}\right)^{-1} \end{aligned} $$

Treat $\mathbf{X}^{T} \mathbf{X}$ as (up to scaling) the covariance matrix of the features. When there is a perfect linear relationship among the features, this matrix does not have full rank, which means it is non-invertible, i.e. singular. If the rank is not full, the determinant is zero, so at least one eigenvalue is also 0.
Think of the determinant as the "magnitude" of the matrix. Since $\det\left(\left(\mathbf{X}^{T} \mathbf{X}\right)^{-1}\right)=1 / \det\left(\mathbf{X}^{T} \mathbf{X}\right)$, even when $\det \left(\mathbf{X}^{T} \mathbf{X} \right)$ is merely close to zero while the matrix is still invertible, the determinant of its inverse becomes enormous, which leads to an explosion of the variance of $\widehat{\beta}$.
Note that $\mathbf{X}$ is the design matrix whose first column is the constant 1.
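To make this concrete, here is a minimal sketch on synthetic data (the variables `x1`, `x2` are hypothetical, not from the paper) showing that the diagonal of $\left(\mathbf{X}^{T} \mathbf{X}\right)^{-1}$, and hence $\operatorname{Var}[\widehat{\beta}]$, blows up as two columns approach perfect collinearity:

```python
import numpy as np

# As x2 gets closer to x1, the diagonal of (X^T X)^{-1} -- and hence
# Var[beta_hat] = sigma^2 (X^T X)^{-1} -- explodes.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
for noise in [1.0, 0.1, 0.01, 0.001]:
    x2 = x1 + rng.normal(scale=noise, size=n)   # x2 -> x1 as noise -> 0
    X = np.column_stack([np.ones(n), x1, x2])   # design matrix, first column all 1s
    inv_gram = np.linalg.inv(X.T @ X)
    print(f"noise={noise:>6}: diag of (X^T X)^-1 = {np.diag(inv_gram).round(4)}")
```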
Diagnostic
- The simplest way to spot multicollinearity is a pair plot, unless there are many features; in that case, using pairwise correlations as a proxy is fine.
- One drawback of the pair plot is that when the multicollinear relationship involves three or more features, it is not easily spotted, for example when $X_3=(X_1+X_2)/2$. See the details in the paper.
- Use $VIF_{i}=1 /\left(1-R_{i}^{2}\right)$, where $R_{i}^{2}$ is the $R^{2}$ obtained by regressing $X_{i}$ on all the other covariates (see the sketch after this list). A threshold of 10 is frequently used but not fully justified.
- Inspect the eigenvalues of $\mathbf{X}^{T} \mathbf{X}$ and check whether any of them are close to 0.
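A minimal sketch of the VIF and eigenvalue checks, assuming scikit-learn is available and using a hypothetical dataset where `x3` is (almost) the average of `x1` and `x2`:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif(df: pd.DataFrame) -> pd.Series:
    """VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing column i on the rest."""
    scores = {}
    for col in df.columns:
        X_others = df.drop(columns=col).to_numpy()
        y = df[col].to_numpy()
        r2 = LinearRegression().fit(X_others, y).score(X_others, y)
        scores[col] = 1.0 / (1.0 - r2)
    return pd.Series(scores)

# Hypothetical data: x3 is (almost) the average of x1 and x2.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
x3 = (x1 + x2) / 2 + rng.normal(scale=0.01, size=200)
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

print(vif(df))  # all three VIFs come out far above the usual threshold of 10
# Eigenvalue check: the smallest eigenvalue of the correlation matrix is near 0.
print(np.linalg.eigvalsh(np.corrcoef(df.to_numpy(), rowvar=False)))
```

Note how the pairwise correlations between `x1` and `x2` are small here, yet the VIFs and the smallest eigenvalue both flag the three-variable relationship that a pair plot would miss.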
Ridge and Lasso
The $l_p$ norm is defined as
$$ \|b\|_{p}=\left(\sum_{i=1}^{p}\left|b_{i}\right|^{p}\right)^{1 / p} $$

- Note that the $l_0$ "norm" counts the number of non-zero coefficients; penalizing it directly is not practical, as there is no easy way to search over all the different combinations of variables.
- Ridge is guaranteed to have a solution: $\widehat{\beta}_{\lambda}=\left(\mathbf{X}^{T} \mathbf{X}+\lambda \mathbf{I}\right)^{-1} \mathbf{X}^{T} \mathbf{Y}$, and the inverse always exists for $\lambda>0$ (see the sketch after this list).
- Ridge and Lasso need the features to be standardized before training, since the penalty treats all coefficients on the same scale.
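Below is a minimal sketch of the ridge closed form on a nearly collinear synthetic design (the data and variable names are illustrative). For simplicity the intercept column is penalized too; in practice the intercept is usually excluded from the penalty and the features are standardized first.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-4, size=n)      # x2 is almost a copy of x1
X = np.column_stack([np.ones(n), x1, x2])     # design matrix with intercept column
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

lam = 1.0
p = X.shape[1]
# OLS: (X^T X)^{-1} X^T y -- ill-conditioned here, coefficients on x1/x2 explode
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
# Ridge: (X^T X + lambda I)^{-1} X^T y -- always well defined for lambda > 0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print("condition number of X^T X:", np.linalg.cond(X.T @ X))
print("OLS coefficients:  ", beta_ols)    # wildly unstable split between x1 and x2
print("Ridge coefficients:", beta_ridge)  # roughly shares the weight 2.0 across x1, x2
```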
Others
For high-dimensional regression, namely $n<p$ (more features than samples), the predictors are forced to be collinear: $\mathbf{X}^{T} \mathbf{X}$ is rank deficient, so OLS has no unique solution, and we have to reduce the dimension first (or fall back on a penalized method such as ridge).
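A quick numerical check of this claim, with arbitrary dimensions chosen just for illustration:

```python
import numpy as np

# With n < p the p x p Gram matrix X^T X has rank at most n, so it is singular.
rng = np.random.default_rng(2)
n, p = 20, 50
X = rng.normal(size=(n, p))
gram = X.T @ X
print(np.linalg.matrix_rank(gram))       # 20, far below p = 50
print(min(np.linalg.eigvalsh(gram)))     # smallest eigenvalue is (numerically) 0
```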