5  Properties of PCA

Published

September 22, 2024

In this chapter, we will explore the key properties of Principal Component Analysis (PCA). These properties are essential to understanding how PCA works, how much information is retained after dimensionality reduction, and how the original data can be reconstructed from the principal components.

5.1 Explained Variance

One of the core properties of PCA is its ability to maximize variance in the data. The principal components are ordered such that the first principal component captures the largest possible variance, and each succeeding component captures the maximum remaining variance under the constraint that it is orthogonal to the previous components.

5.1.1 Definition of Explained Variance

The explained variance of the \(i\)-th principal component is the amount of variance in the data that is captured by that component. The total variance in the data is the sum of all the variances along the principal components (the eigenvalues).

Let \(\lambda_i\) represent the eigenvalue corresponding to the \(i\)-th principal component. The explained variance of the \(i\)-th principal component is:

\[ \text{Explained Variance}_i = \lambda_i \]

The total variance in the data is the sum of all eigenvalues:

\[ \text{Total Variance} = \sum_{i=1}^{p} \lambda_i \]

The explained variance ratio for the \(i\)-th principal component is the proportion of the total variance that is captured by that component:

\[ \text{Explained Variance Ratio}_i = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j} \]
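These quantities can be computed directly from the eigenvalues of the sample covariance matrix. A minimal sketch using NumPy and the Iris dataset, checked against scikit-learn's own attributes:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Eigenvalues of the sample covariance matrix (np.cov uses the n-1
# denominator, matching scikit-learn's convention)
cov = np.cov(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]  # sort in descending order

explained_variance_ratio = eigenvalues / eigenvalues.sum()

# Compare with scikit-learn's PCA
pca = PCA().fit(X)
print(np.allclose(eigenvalues, pca.explained_variance_))                      # True
print(np.allclose(explained_variance_ratio, pca.explained_variance_ratio_))  # True
```

Both checks print `True`: the `explained_variance_` attribute holds exactly the eigenvalues \(\lambda_i\), and `explained_variance_ratio_` holds the ratios defined above.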

5.1.2 Scree Plot

A Scree Plot is a visual tool used to show the eigenvalues (or explained variance ratios) in descending order. It helps in determining the number of principal components to retain.

  • The x-axis of the Scree Plot shows the index of each principal component (1, 2, …, \(p\)).
  • The y-axis shows the corresponding eigenvalues or explained variance ratios.

A “knee” in the plot, where the eigenvalues start to level off, indicates that most of the important variance is captured by the components to the left of the knee.

Example 5.1 (Example: Scree Plot) Below is an example Python code for generating a Scree Plot using the Iris dataset and PCA:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load Iris dataset
iris = load_iris()
X = iris.data

# Apply PCA
pca = PCA()
pca.fit(X)

# Plot Scree Plot
plt.figure(figsize=(8,6))
plt.plot(range(1, len(pca.explained_variance_)+1), pca.explained_variance_ratio_, 'o-', linewidth=2, color='blue')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.grid(True)
plt.show()

In this example, the Scree Plot helps you decide how many components to retain by looking for the “elbow” or “knee” in the curve.

5.2 Dimensionality Reduction

PCA is commonly used for dimensionality reduction, where the goal is to reduce the number of features (dimensions) in the dataset while retaining as much information (variance) as possible.

5.2.1 Selecting the Number of Components

To reduce dimensionality, we select the top \(k\) principal components that explain the most variance. The number of components \(k\) is typically chosen based on one of the following criteria:

  1. Explained Variance Threshold: Retain enough components to capture a certain proportion of the total variance, e.g., 90% or 95%.

    \[ \sum_{i=1}^{k} \lambda_i \geq \text{threshold} \times \sum_{j=1}^{p} \lambda_j \]

  2. Scree Plot: Use a Scree Plot to find the “elbow” where the explained variance starts to decrease more slowly. Retain components up to the point where the curve levels off.

  3. Cross-Validation: Use cross-validation to determine how many components yield the best performance in a machine learning model.
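The first criterion can be applied with a cumulative sum of the explained variance ratios; scikit-learn also accepts a float `n_components`, in which case it picks \(k\) for you. A sketch on the Iris dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Criterion 1 by hand: smallest k whose cumulative ratio meets the threshold
ratios = PCA().fit(X).explained_variance_ratio_
cumulative = np.cumsum(ratios)
k = int(np.argmax(cumulative >= 0.95)) + 1
print(k)  # 2

# Equivalent shortcut: a float n_components is treated as a variance threshold
pca = PCA(n_components=0.95).fit(X)
print(pca.n_components_)  # 2
```

For Iris, the first component alone explains about 92% of the variance, so two components are enough to pass the 95% threshold.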

5.2.2 Benefits of Dimensionality Reduction

Dimensionality reduction has several benefits:

  • Simplified Models: By reducing the number of features, we can simplify models and reduce the risk of overfitting.
  • Improved Efficiency: Fewer dimensions lead to faster computations, especially in high-dimensional datasets.
  • Better Visualization: Reducing the data to 2 or 3 dimensions allows for easier visualization of patterns and clusters in the data.

5.3 Orthogonality of Principal Components

An important property of PCA is that the principal components are orthogonal to each other. This means that the directions of the components (the eigenvectors) are mutually perpendicular in the feature space.

5.3.1 Mathematical Explanation

Let \(\mathbf{V}_k \in \mathbb{R}^{p \times k}\) be the matrix whose columns \(\mathbf{v}_1, \ldots, \mathbf{v}_k\) are the top \(k\) eigenvectors (principal components). The orthogonality property ensures that:

\[ \mathbf{v}_i^\top \mathbf{v}_j = 0 \quad \text{for} \quad i \neq j \]

Since each eigenvector also has unit length (\(\mathbf{v}_i^\top \mathbf{v}_i = 1\)), the columns of \(\mathbf{V}_k\) are orthonormal and \(\mathbf{V}_k^\top \mathbf{V}_k = \mathbf{I}_k\).

This means that each principal component captures a distinct, uncorrelated direction of variance in the data, and there is no overlap in the information represented by the different components.
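Orthonormality is easy to verify numerically: in scikit-learn, the rows of the `components_` attribute are the eigenvectors, so their Gram matrix should be the identity. A quick check on the Iris dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA().fit(X)

# Rows of components_ are the eigenvectors v_i; V V^T should be the identity
V = pca.components_
gram = V @ V.T
print(np.allclose(gram, np.eye(V.shape[0])))  # True
```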

5.3.2 Benefits of Orthogonality

The orthogonality of the principal components has several important implications:

  • Uncorrelated Features: The principal components are uncorrelated, which can be useful in regression and classification models where multicollinearity is a concern.
  • Distinct Sources of Variance: Each component represents a unique direction of variance, helping to separate different patterns in the data.
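The uncorrelatedness of the projected features can likewise be checked: the covariance matrix of the PCA scores is diagonal, with the eigenvalues on the diagonal. A sketch on the Iris dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA().fit(X)
scores = pca.transform(X)

# Off-diagonal covariances of the scores vanish (up to floating-point
# noise); the diagonal holds the eigenvalues lambda_i
cov = np.cov(scores, rowvar=False)
print(np.allclose(cov, np.diag(pca.explained_variance_)))  # True
```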

5.4 Reconstruction of Data

Another key property of PCA is that we can use the principal components to reconstruct the original data. However, if we reduce the dimensionality by keeping only the top \(k\) principal components, the reconstructed data will be an approximation of the original.

5.4.1 Data Reconstruction Formula

Let \(\mathbf{X} \in \mathbb{R}^{n \times p}\) be the original centered data matrix. The reduced data matrix \(\mathbf{X}_{\text{proj}} \in \mathbb{R}^{n \times k}\) is obtained by projecting \(\mathbf{X}\) onto the top \(k\) principal components:

\[ \mathbf{X}_{\text{proj}} = \mathbf{X} \mathbf{V}_k \]

The original data can be approximately reconstructed from the reduced data as:

\[ \mathbf{X}_{\text{reconstructed}} = \mathbf{X}_{\text{proj}} \mathbf{V}_k^\top = \mathbf{X} \mathbf{V}_k \mathbf{V}_k^\top \]

Here, \(\mathbf{X}_{\text{reconstructed}}\) is an approximation of \(\mathbf{X}\). The more principal components we retain, the closer the approximation is to the original data.
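Because scikit-learn centers the data internally, the formula above applies to \(\mathbf{X} - \bar{\mathbf{x}}\), and the column means must be added back at the end. A sketch verifying the formula against `inverse_transform` on the Iris dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2).fit(X)

# V_k^T is stored row-wise in components_, the column means in mean_
Vk = pca.components_.T             # shape (p, k)
X_proj = (X - pca.mean_) @ Vk      # project the centered data
X_rec = X_proj @ Vk.T + pca.mean_  # map back and undo the centering

print(np.allclose(X_proj, pca.transform(X)))                             # True
print(np.allclose(X_rec, pca.inverse_transform(pca.transform(X))))       # True
```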

5.4.2 Loss of Information

When reducing the number of principal components, some information (variance) is inevitably lost. The proportion of information lost equals the proportion of the total variance carried by the discarded components:

\[ \text{Information Loss} = 1 - \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{p} \lambda_j} \]

This formula shows the proportion of the total variance that is lost by reducing the dimensionality from \(p\) to \(k\) components.
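With \(k = 2\) on the Iris dataset, the loss can be read directly off the explained variance ratios; a brief sketch (the exact value depends on the dataset):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
ratios = PCA().fit(X).explained_variance_ratio_

# Information loss for k retained components
k = 2
information_loss = 1 - ratios[:k].sum()
print(information_loss)
```

For Iris, the loss with two components is only a few percent, which is why the reconstruction in the next example is so close to the original data.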

Example 5.2 (Example: Data Reconstruction in Python) The following Python code demonstrates how to reconstruct the data from the top 2 principal components using the Iris dataset:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load Iris dataset
iris = load_iris()
X = iris.data

# Apply PCA with 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Reconstruct the data from the 2 principal components
X_reconstructed = pca.inverse_transform(X_pca)

# Print original and reconstructed data for comparison
print("Original Data (first 5 rows):")
print(X[:5])

print("\nReconstructed Data (first 5 rows):")
print(np.round(X_reconstructed[:5], 2))
Original Data (first 5 rows):
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

Reconstructed Data (first 5 rows):
[[5.08 3.52 1.4  0.21]
 [4.75 3.16 1.46 0.24]
 [4.7  3.2  1.31 0.18]
 [4.64 3.06 1.46 0.24]
 [5.07 3.53 1.36 0.2 ]]

In this example, the data is projected onto the top 2 principal components and then reconstructed from these components. You can compare the original data and the reconstructed data to see how much information has been retained.

5.5 Conclusion

In this chapter, we covered several important properties of PCA:

  • Explained variance: The proportion of total variance captured by each principal component.
  • Dimensionality reduction: Reducing the number of features while retaining as much information as possible.
  • Orthogonality of principal components: Ensuring that the components are mutually uncorrelated.
  • Reconstruction of data: Approximating the original data from the principal components.

Understanding these properties is essential for effectively applying PCA to your data, interpreting the results, and determining how many principal components to retain.