Multicollinearity Discussion
This post summarizes a very good reading on this topic. The PDF is linked in the last section.
Linear Algebra Views
Recall that, for multiple linear regression,
Treat the
Consider the determinant as the magnitude of the matrix, then even if
Please note that the
Diagnostic
- The best way to identify multicollinearity is a pair plot, unless there are too many features to plot. In that case, pairwise correlations are a fine proxy.
- One drawback of the pair plot is that a multicollinear relationship involving three or more features is not so easily spotted. For example, when
. See the details in the paper.
- Use the variance inflation factor, $VIF_i = \frac{1}{1 - R_i^2}$, where $R_i^2$ is the $R^2$ you get by regressing $X_i$ on all the other covariates. A threshold of 10 is frequently used but not fully justified.
- Look at the eigenvalues of the design matrix (through $X^T X$) and check whether any of them are close to 0.
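The last two diagnostics can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's own example: the simulated data, the relationship `x3 ≈ x1 + x2`, and the helper name `vif` are all assumptions made here. It shows a three-feature relationship where the pairwise correlations stay moderate while the variance inflation factors (1 / (1 - R²), compared against the threshold of 10) and the smallest eigenvalue both flag the problem.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (n samples x p features).

    Hypothetical helper written for this sketch, not a library function.
    """
    n, p = X.shape
    out = np.empty(p)
    for i in range(p):
        target = X[:, i]
        # Regress feature i on all the other features plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[i] = 1.0 / (1.0 - r2)
    return out

# Simulated data (an assumption for illustration): x3 is almost x1 + x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + 0.05 * rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# Pairwise correlations: none of them is close to 1 here.
pair_corr = np.corrcoef(X, rowvar=False)

# VIF: every feature is far above the usual threshold of 10.
vifs = vif(X)

# Eigenvalues of the centered Gram matrix, in ascending order:
# the smallest one is close to 0 relative to the largest.
Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(Xc.T @ Xc)

print("pairwise correlations:\n", pair_corr.round(2))
print("VIF:", vifs.round(1))
print("eigenvalues:", eigvals.round(3))
```

Centering before forming the Gram matrix is a design choice in this sketch, so that the intercept does not show up in the eigenvalues.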
Ridge and Lasso
The
- Note that $\|\beta\|_0$ means the number of non-zero coefficients. Using it as a penalty is not realistic, as there is no easy way to search over all the different combinations of variables.
- Ridge can guarantee a solution, as $X^T X + \lambda I$ is positive definite for any $\lambda > 0$, and the inverse always exists.
- Ridge and Lasso need standardization before training.
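The point about Ridge always having a solution can be checked numerically. The following is a small sketch under assumptions made here (simulated data with two exactly collinear features, and an arbitrary penalty `lam = 1.0`), not code from the paper: the Gram matrix of the standardized features is singular, but adding `lam` times the identity shifts every eigenvalue up by `lam`, so the linear system becomes solvable.

```python
import numpy as np

# Simulated data (an assumption for illustration): x2 is exactly 2 * x1.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = 2.0 * x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

# Standardize the features and center the response before fitting,
# so the penalty treats both coefficients on the same scale.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()

# The ordinary least squares Gram matrix is singular:
# its smallest eigenvalue is numerically zero.
gram = Xs.T @ Xs
gram_min_eig = np.linalg.eigvalsh(gram).min()

# Ridge adds lam to every eigenvalue, so the smallest one is about lam.
lam = 1.0  # arbitrary penalty strength chosen for this sketch
ridge_gram = gram + lam * np.eye(X.shape[1])
ridge_min_eig = np.linalg.eigvalsh(ridge_gram).min()

# The ridge system now has a unique solution.
beta_ridge = np.linalg.solve(ridge_gram, Xs.T @ yc)

print("smallest eigenvalue without penalty:", gram_min_eig)
print("smallest eigenvalue with penalty:   ", ridge_min_eig)
print("ridge coefficients:", beta_ridge)
```

Because the two standardized columns are identical, Ridge splits the effect evenly between them, which is the usual behavior of the squared penalty on perfectly collinear features.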
Others
For high-dimensional regression, namely