2  Mathematical Foundations of PCA

To fully understand how Principal Component Analysis (PCA) works, it’s essential to have a solid grasp of several mathematical concepts. These include variance, covariance, and the decomposition of matrices through eigenvalues, eigenvectors, and Singular Value Decomposition (SVD).

In this chapter, we’ll introduce and explore these key concepts, building the mathematical foundation required for understanding and implementing PCA.

2.1 Linear Algebra Review

PCA relies heavily on linear algebra. Before diving into the algorithm itself, let’s briefly review some important linear algebra concepts that form the foundation of PCA.

2.1.1 Vectors and Matrices

A vector is a one-dimensional array of numbers, and a matrix is a two-dimensional array of numbers. We denote vectors by bold lowercase letters and matrices by bold uppercase letters.

  • A vector \(\mathbf{x}\) of size \(p\) is written as:

    \[ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix} \]

  • A matrix \(\mathbf{A} \in \mathbb{R}^{n \times p}\), with \(n\) rows (observations) and \(p\) columns (features), is written as:

    \[ \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1p} \\ a_{21} & a_{22} & \dots & a_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \dots & a_{np} \end{bmatrix} \]
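In code, vectors and matrices are typically represented as NumPy arrays; a minimal sketch (the values are arbitrary):

```python
import numpy as np

# A vector x of size p = 3 (one-dimensional array)
x = np.array([1.0, 2.0, 3.0])

# A matrix A with n = 2 rows (observations) and p = 3 columns (features)
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(x.shape)  # (3,)
print(A.shape)  # (2, 3)
```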

2.1.2 Matrix Operations

Some important matrix operations used in PCA include:

  • Matrix Transposition: Flipping a matrix over its diagonal. Given matrix \(\mathbf{A}\), its transpose \(\mathbf{A}^\top\) is defined as:

    \[ \mathbf{A}^\top_{ij} = \mathbf{A}_{ji} \]

  • Matrix Multiplication: For two matrices \(\mathbf{A} \in \mathbb{R}^{n \times p}\) and \(\mathbf{B} \in \mathbb{R}^{p \times m}\), the matrix product \(\mathbf{A} \mathbf{B} \in \mathbb{R}^{n \times m}\) is defined as:

    \[ (\mathbf{A} \mathbf{B})_{ij} = \sum_{k=1}^{p} \mathbf{A}_{ik} \mathbf{B}_{kj} \]
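Both operations map directly onto NumPy; a sketch with small example matrices:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # A in R^{2x3}
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])           # B in R^{3x2}

At = A.T      # transpose: (A^T)_{ij} = A_{ji}, shape (3, 2)
AB = A @ B    # matrix product, shape (2, 2); rows are [4, 5] and [10, 11]

print(At.shape)
print(AB)
```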

2.1.3 Covariance Matrix

The covariance matrix, denoted as \(S^2\), is central to PCA. It captures the pairwise covariances between features (dimensions) of the dataset.

Given a matrix \(\mathbf{A} \in \mathbb{R}^{n \times p}\), where \(n\) is the number of observations and \(p\) is the number of features, we first center the matrix by subtracting the mean of each column (feature). Let \(\mathbf{A}_c\) denote the centered data matrix.

The covariance matrix \(S^2 \in \mathbb{R}^{p \times p}\) is then computed as:

\[ S^2 = \frac{1}{n-1} \mathbf{A}_c^\top \mathbf{A}_c \]

Each entry \(S^2_{ij}\) represents the covariance between feature \(i\) and feature \(j\).

Definition 2.1 (Covariance Matrix) A matrix, denoted \(S^2\), that summarizes the relationships (covariances) between pairs of features in the dataset.

Covariance values:

  • If \(S^2_{ij} > 0\), then features \(i\) and \(j\) tend to increase together.
  • If \(S^2_{ij} < 0\), then features \(i\) and \(j\) tend to move in opposite directions.
  • If \(S^2_{ij} = 0\), then features \(i\) and \(j\) are uncorrelated.
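The centering-then-product recipe above can be sketched directly, assuming a small synthetic data matrix; NumPy's built-in `np.cov` serves as a cross-check:

```python
import numpy as np

# Toy data: n = 5 observations, p = 2 features
A = np.array([[2.1,  8.0],
              [2.5, 10.0],
              [3.6, 12.0],
              [4.0, 14.0],
              [4.8, 16.0]])

n = A.shape[0]
A_c = A - A.mean(axis=0)              # center each column (feature)
S2 = (A_c.T @ A_c) / (n - 1)          # covariance matrix, shape (p, p)

# np.cov with rowvar=False treats columns as features and uses 1/(n-1)
assert np.allclose(S2, np.cov(A, rowvar=False))
print(S2)
```

Note that the result is symmetric: \(S^2_{ij} = S^2_{ji}\), since covariance does not depend on the order of the two features.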

2.2 Eigenvalues and Eigenvectors

2.2.1 Eigenvalue Decomposition

Eigenvalue decomposition plays a key role in PCA, allowing us to break down the covariance matrix \(S^2\) into its principal components.

For a square matrix \(S^2 \in \mathbb{R}^{p \times p}\), the eigenvalue equation is defined as:

\[ S^2 \mathbf{v} = \lambda \mathbf{v} \]

where:

  • \(\mathbf{v}\) is an eigenvector of \(S^2\).
  • \(\lambda\) is the corresponding eigenvalue.

In PCA, the eigenvectors represent the directions (principal components) along which the data varies the most, and the eigenvalues represent the amount of variance in each direction.

2.2.2 Principal Components

The eigenvectors of the covariance matrix \(S^2\) are called principal components. They represent the directions in which the variance of the data is maximized. The corresponding eigenvalues tell us how much variance is captured by each principal component.

When performing PCA, we:

  1. Compute the covariance matrix \(S^2\) from the centered data matrix \(\mathbf{A}_c\).
  2. Perform eigenvalue decomposition on \(S^2\) to find its eigenvectors and eigenvalues.
  3. Sort the eigenvectors by the magnitude of their eigenvalues, and select the top \(k\) eigenvectors to form the reduced feature space.
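The three steps above can be sketched with `np.linalg.eigh`, which is designed for symmetric matrices and returns eigenvalues in ascending order (the data here is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))         # n = 100 observations, p = 3 features

# Step 1: covariance matrix of the centered data
A_c = A - A.mean(axis=0)
S2 = (A_c.T @ A_c) / (A.shape[0] - 1)

# Step 2: eigenvalue decomposition (eigh assumes a symmetric matrix)
eigvals, eigvecs = np.linalg.eigh(S2)

# Step 3: sort descending by eigenvalue, keep the top k eigenvectors
order = np.argsort(eigvals)[::-1]
k = 2
W = eigvecs[:, order[:k]]             # p x k projection matrix
A_reduced = A_c @ W                   # n x k reduced representation

print(A_reduced.shape)  # (100, 2)
```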

Definition 2.2 (Eigenvalues and Eigenvectors) Eigenvalues measure the variance explained by each principal component (eigenvector), while eigenvectors represent the directions of maximum variance in the data.

2.2.3 Example: Eigenvalue Decomposition

Consider the following covariance matrix \(S^2 \in \mathbb{R}^{2 \times 2}\):

\[ S^2 = \begin{bmatrix} 3.5 & 1.5 \\ 1.5 & 3.5 \end{bmatrix} \]

To perform eigenvalue decomposition, we solve the characteristic equation:

\[ \det(S^2 - \lambda \mathbf{I}) = 0 \]

Expanding this, we get:

\[ \det\left(\begin{bmatrix} 3.5 - \lambda & 1.5 \\ 1.5 & 3.5 - \lambda \end{bmatrix}\right) = (3.5 - \lambda)^2 - 1.5^2 = \lambda^2 - 7\lambda + 10 = 0 \]

Solving this gives the eigenvalues \(\lambda_1 = 5\) and \(\lambda_2 = 2\).

To find the eigenvectors, we solve the system \((S^2 - \lambda \mathbf{I}) \mathbf{v} = 0\) for each eigenvalue. For \(\lambda_1 = 5\), the eigenvector is:

\[ \mathbf{v}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \]

And for \(\lambda_2 = 2\), the eigenvector is:

\[ \mathbf{v}_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix} \]

Thus, the matrix \(S^2\) can be decomposed as:

\[ S^2 = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^\top \]

where \(\mathbf{V}\) is the matrix whose columns are the eigenvectors (normalized to unit length, so that \(\mathbf{V}\) is orthogonal) and \(\mathbf{\Lambda}\) is the diagonal matrix of eigenvalues.
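Decompositions like this can be verified numerically; a sketch with `np.linalg.eigh` on a symmetric 2×2 matrix:

```python
import numpy as np

S2 = np.array([[3.5, 1.5],
               [1.5, 3.5]])

eigvals, V = np.linalg.eigh(S2)       # eigenvalues in ascending order
print(eigvals)                        # [2. 5.]

# Reconstruct S2 = V Lambda V^T from its unit-norm eigenvectors
Lam = np.diag(eigvals)
assert np.allclose(S2, V @ Lam @ V.T)
```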

2.3 Singular Value Decomposition (SVD)

Another important decomposition in PCA is the Singular Value Decomposition (SVD), which generalizes eigenvalue decomposition to non-square matrices.

For a matrix \(\mathbf{A} \in \mathbb{R}^{n \times p}\), the SVD is given by:

\[ \mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^\top \]

where:

  • \(\mathbf{U} \in \mathbb{R}^{n \times n}\) is the matrix of left singular vectors.
  • \(\mathbf{\Sigma} \in \mathbb{R}^{n \times p}\) is a rectangular diagonal matrix whose diagonal entries, the singular values, are the square roots of the eigenvalues of \(\mathbf{A}^\top \mathbf{A}\), conventionally sorted in descending order.
  • \(\mathbf{V} \in \mathbb{R}^{p \times p}\) is the matrix of right singular vectors (the principal components).

2.3.1 Intuition Behind SVD

  • The singular values in \(\mathbf{\Sigma}\) represent the amount of variance captured by each of the principal components (right singular vectors).
  • The left singular vectors in \(\mathbf{U}\) describe how the original observations (rows of \(\mathbf{A}\)) relate to each principal component.
  • The right singular vectors in \(\mathbf{V}\) give the directions (axes) along which the data varies the most.

The SVD allows us to decompose the data matrix \(\mathbf{A}\) into three parts that capture its underlying structure. When applying PCA, we often use SVD to project the data onto a lower-dimensional space.

Definition 2.3 (Singular Value Decomposition (SVD)) A matrix factorization technique that expresses a matrix \(\mathbf{A} \in \mathbb{R}^{n \times p}\) as the product of three matrices: \(\mathbf{U}\) (left singular vectors), \(\mathbf{\Sigma}\) (singular values), and \(\mathbf{V}^\top\) (right singular vectors, or principal components).
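In practice, PCA is often implemented via the SVD of the centered data matrix rather than by eigendecomposing \(S^2\); a minimal sketch on random illustrative data:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 4))          # n = 50 observations, p = 4 features
A_c = A - A.mean(axis=0)              # center the columns

# Thin SVD: U is n x p, s holds the singular values (descending), Vt is p x p
U, s, Vt = np.linalg.svd(A_c, full_matrices=False)

# Project onto the top k principal components (the rows of Vt)
k = 2
A_reduced = A_c @ Vt[:k].T            # equivalently U[:, :k] * s[:k]

assert np.allclose(A_reduced, U[:, :k] * s[:k])
print(A_reduced.shape)  # (50, 2)
```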

2.3.2 Example: SVD Computation

Consider the following matrix \(\mathbf{A} \in \mathbb{R}^{3 \times 2}\):

\[ \mathbf{A} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \]

To compute the SVD, we decompose \(\mathbf{A}\) as:

\[ \mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^\top \]

where:

  • \(\mathbf{U}\) contains the left singular vectors.
  • \(\mathbf{\Sigma}\) contains the singular values.
  • \(\mathbf{V}^\top\) contains the right singular vectors (the principal components).

Performing the SVD for this matrix, we obtain:

\[ \mathbf{U} = \begin{bmatrix} \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{3}} \\ \frac{1}{\sqrt{6}} & -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{3}} \\ \frac{2}{\sqrt{6}} & 0 & -\frac{1}{\sqrt{3}} \end{bmatrix}, \quad \mathbf{\Sigma} = \begin{bmatrix} \sqrt{3} & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad \mathbf{V}^\top = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \end{bmatrix} \]

Here the singular values are \(\sigma_1 = \sqrt{3}\) and \(\sigma_2 = 1\), the square roots of the eigenvalues \(3\) and \(1\) of \(\mathbf{A}^\top \mathbf{A} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\). (The signs of each pair of left and right singular vectors are only determined up to a simultaneous flip.)

The singular values in \(\mathbf{\Sigma}\) provide us with information about the variance captured by each principal component. The right singular vectors in \(\mathbf{V}^\top\) are the directions in feature space that capture the most variance.

Thus, SVD offers a powerful way to decompose and analyze the structure of the data matrix, and it is often more numerically stable than eigenvalue decomposition, especially for large or noisy datasets.
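A factorization like this can be checked numerically; a sketch with `np.linalg.svd` on the same 3×2 matrix:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(s)  # singular values, largest first
assert np.allclose(s, [np.sqrt(3.0), 1.0])

# The factors reproduce A (singular-vector signs may differ by a flip)
assert np.allclose(A, U @ np.diag(s) @ Vt)
```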


In the context of PCA, the right singular vectors in \(\mathbf{V}\) are the principal components, and the singular values in \(\mathbf{\Sigma}\) relate to the amount of variance captured by each component.
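This relationship can be made precise: with the \(1/(n-1)\) convention for \(S^2\), each eigenvalue of the covariance matrix equals the corresponding squared singular value of the centered data, \(\lambda_i = \sigma_i^2 / (n-1)\). A numerical sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(30, 3))
n = A.shape[0]
A_c = A - A.mean(axis=0)

# Eigenvalues of the covariance matrix, sorted descending
S2 = (A_c.T @ A_c) / (n - 1)
eigvals = np.linalg.eigh(S2)[0][::-1]

# Squared singular values of the centered data, scaled by n - 1
_, s, _ = np.linalg.svd(A_c, full_matrices=False)
assert np.allclose(eigvals, s**2 / (n - 1))
```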

2.4 Conclusion

In this chapter, we covered the fundamental mathematical concepts required to understand PCA. We explored vectors, matrices, the covariance matrix (denoted \(S^2\)), eigenvalue decomposition, and SVD. These concepts form the backbone of the PCA algorithm, allowing us to reduce dimensionality by finding directions of maximum variance in the data.

In the next chapter, we will discuss the PCA algorithm itself, walking through each step in detail, from data preprocessing to projecting data onto principal components.