PCA Tutorial
Eigenvalue decomposition
Consider a data matrix $\mathbf{X}$ with $m$ features and $n$ observations, so its dimension is $m \times n$. Assuming each row (feature) has had its mean subtracted, the covariance matrix can be estimated as:
$$ \mathbf{S}_{\mathbf{X}} \equiv \frac{1}{n-1} \mathbf{X X}^{T} $$
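As a quick numerical sanity check, here is a minimal NumPy sketch of this estimate (the 5×200 Gaussian toy data and all variable names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 200                      # m features, n observations
X = rng.normal(size=(m, n))        # toy data: one feature per row

Xc = X - X.mean(axis=1, keepdims=True)   # subtract each row (feature) mean
S_X = Xc @ Xc.T / (n - 1)                # covariance estimate, shape (m, m)

# np.cov treats rows as variables and uses the same 1/(n-1) normalization
assert np.allclose(S_X, np.cov(X))
```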
Our goal is to transform $\mathbf{X}$ so that the redundancy among features in the resulting covariance matrix $\mathbf{S_Y}$ is reduced. More specifically, we apply a linear transformation to $\mathbf{X}$ by changing the basis:
Goal: Find some orthonormal matrix $\mathbf{P}$ with $\mathbf{Y}=\mathbf{P} \mathbf{X}$ such that $\mathbf{S}_{\mathbf{Y}} \equiv \frac{1}{n-1} \mathbf{Y} \mathbf{Y}^{T}$ is diagonal. The rows of $\mathbf{P}$ turn out to be the eigenvectors of $\mathbf{S}_{\mathbf{X}}$.
Through a short derivation (with $\mathbf{P}$ still to be determined), we find the following:
$$ \begin{aligned} \mathbf{S}_{\mathbf{Y}} &=\frac{1}{n-1} \mathbf{Y} \mathbf{Y}^{T} \\ &=\frac{1}{n-1}(\mathbf{P} \mathbf{X})(\mathbf{P} \mathbf{X})^{T} \\ &=\frac{1}{n-1} \mathbf{P} \mathbf{X} \mathbf{X}^{T} \mathbf{P}^{T} \\ &=\frac{1}{n-1} \mathbf{P}\left(\mathbf{X X}^{T}\right) \mathbf{P}^{T} \\ \mathbf{S}_{\mathbf{Y}} &=\frac{1}{n-1} \mathbf{P} \mathbf{A} \mathbf{P}^{T} \end{aligned} $$

The last step defines $\mathbf{A} \equiv \mathbf{X X}^{T}$. Since $\mathbf{A}$ is symmetric, it has an eigenvalue decomposition $\mathbf{A} = \mathbf{E D E}^{T}$, where $\mathbf{E}$ is an orthogonal matrix whose columns are the eigenvectors of $\mathbf{A}$, and $\mathbf{D}$ is a diagonal matrix with the corresponding eigenvalues on its diagonal.
Thus, $\mathbf{P} = \mathbf{E}^T$ satisfies the goal, making $\mathbf{P A P}^T = \mathbf{D}$. Note that $\mathbf{E}$ is orthogonal (guaranteed by the eigenvalue decomposition of a symmetric matrix), so $\mathbf{E}^{T} = \mathbf{E}^{-1}$ and hence $\mathbf{P}^{T} = \mathbf{P}^{-1}$:
$$ \begin{aligned} \mathbf{S}_{\mathbf{Y}} &=\frac{1}{n-1} \mathbf{P} \mathbf{A} \mathbf{P}^{T} \\ &=\frac{1}{n-1} \mathbf{P}\left(\mathbf{P}^{T} \mathbf{D} \mathbf{P}\right) \mathbf{P}^{T} \\ &=\frac{1}{n-1}\left(\mathbf{P} \mathbf{P}^{T}\right) \mathbf{D}\left(\mathbf{P} \mathbf{P}^{T}\right) \\ &=\frac{1}{n-1}\left(\mathbf{P} \mathbf{P}^{-1}\right) \mathbf{D}\left(\mathbf{P} \mathbf{P}^{-1}\right) \\ \mathbf{S}_{\mathbf{Y}} &=\frac{1}{n-1} \mathbf{D} \end{aligned} $$
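This construction is easy to verify numerically. Below is a minimal, self-contained NumPy sketch (again with made-up Gaussian toy data): `np.linalg.eigh` decomposes the symmetric $\mathbf{A}$, and setting $\mathbf{P} = \mathbf{E}^T$ indeed yields a diagonal $\mathbf{S_Y}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 200                           # m features, n observations
X = rng.normal(size=(m, n))
X = X - X.mean(axis=1, keepdims=True)   # center each feature (row)

A = X @ X.T                             # A = X X^T, symmetric
evals, E = np.linalg.eigh(A)            # A = E D E^T; eigh returns ascending eigenvalues
order = np.argsort(evals)[::-1]         # reorder so eigenvalues are descending
evals, E = evals[order], E[:, order]

P = E.T                                 # rows of P are the eigenvectors of A
Y = P @ X                               # change of basis
S_Y = Y @ Y.T / (n - 1)

# S_Y is diagonal, with the eigenvalues of A divided by (n - 1) on its diagonal
assert np.allclose(S_Y, np.diag(evals) / (n - 1))
```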
Additional Notes
- Each eigenvector is called a principal axis.
- The projections of the data onto the principal axes are called principal components.
- Suppose the eigenvalues are sorted in descending order. The first row vector of $\mathbf{P}$, denoted $\mathbf{P}_1$, is the first principal axis. The first principal component is $\mathbf{P}_1 \mathbf{X}$, the projection of $\mathbf{X}$ onto the first principal axis.
- The first eigenvalue of $\mathbf{S_X}$ is then the variance of the first principal component: $\operatorname{Var}(\mathbf{P}_1 \mathbf{X}) = \frac{1}{n-1} \mathbf{P}_1 \mathbf{X} \mathbf{X}^T \mathbf{P}_1^T = \frac{1}{n-1} \mathbf{P}_1 \mathbf{A} \mathbf{P}_1^T$.
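As a numeric check of the last two notes, continuing the sketch above: projecting onto the first row of `P` gives the first principal component, and its sample variance matches the largest eigenvalue of $\mathbf{S_X}$.

```python
P1 = P[0]                    # first principal axis (first row of P)
pc1 = P1 @ X                 # first principal component, a length-n vector

# Its sample variance equals the largest eigenvalue of S_X = A / (n - 1)
assert np.allclose(pc1.var(ddof=1), evals[0] / (n - 1))
```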