Topic 9: Principal Component Analysis¶
Vector Quantization¶
- Recall using k-means for vector quantization:
- Run k-means to find a set of "means" \(w_c\).
- This gives a cluster \(\hat{y}_i\) for each object \(i\).
- Replace features \(x_i\) by mean of cluster: \(\hat{x}_i \approx w_{\hat{y}_i}\).
- This can be viewed as a (really bad) latent-factor model.

Vector Quantization (VQ) as Latent-Factor Model¶
- When \(d=3\), we could write \(x_i\) exactly as:
\[ x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ x_{i3} \end{bmatrix} = z_{i1} \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} + z_{i2} \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} + z_{i3} \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \]
- In this "pointless" latent-factor model we have \(z_i = [x_{i1} \; x_{i2} \; x_{i3}]\).
- If \(x_i\) is in cluster 2 (out of \(k=4\) clusters), VQ approximates \(x_i\) by the mean \(w_2\) of cluster 2:
- So in this example we would have \(z_i = [0 \; 1 \; 0 \; 0]\).
- The "parts" are the means from k-means.
- VQ only uses one part (the "part" from the cluster).
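To make this view concrete, here is a minimal sketch (assuming NumPy and scikit-learn are available; the variable names `Z`, `W`, `y_hat` are mine, chosen to match the notation above) that runs k-means and builds the one-hot \(Z\) so that \(\hat{X} = ZW\) simply copies each cluster mean:

```python
# Minimal sketch: vector quantization as a one-hot latent-factor model.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # n=100 objects, d=3 features
k = 4

km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
W = km.cluster_centers_                # k x d matrix of "parts" (the means w_c)
y_hat = km.labels_                     # cluster assignment for each object

Z = np.zeros((X.shape[0], k))          # one-hot "part weights"
Z[np.arange(X.shape[0]), y_hat] = 1

X_hat = Z @ W                          # VQ approximation: each x_i replaced by its cluster mean
print(np.allclose(X_hat, W[y_hat]))    # True: ZW just picks out one mean per row
```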
Vector Quantization vs. PCA¶
- Viewing vector quantization as a latent-factor model:
\[ X = \begin{bmatrix} -9.0 & -7.3 \\ -10.9 & -9.0 \\ 13.7 & 19.3 \\ 13.8 & 20.4 \\ 12.8 & 20.6 \\ \vdots & \vdots \end{bmatrix} \rightarrow \text{vector quantization} \rightarrow Z = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots \end{bmatrix} \]
- VQ: "parts" \(w_1, w_2, \dots, w_k\).
- Suppose we’re doing supervised learning, and the clusters correspond to the true labels \(y\):
- Classification would be really easy with this "k-means basis" \(Z\).

- But VQ only uses 1 part per example; it is just memorizing \(k\) points in \(x_i\) space.
- What we want is combinations of parts.
- PCA is a generalization that allows continuous \(z\):
- It can have more than 1 non-zero.
- It can use fractional weights and negative weights.
PCA Notation¶
- PCA takes in a matrix \(X\) and an input \(k\), and outputs two matrices: \(Z\) (\(n \times k\)) and \(W\) (\(k \times d\)).
- For row ‘c’ of W, we use the notation \(w_c\).
- Each \(w_c\) is a "part" (also called a "factor" or "principal component").
- For row ‘i’ of Z, we use the notation \(z_i\).
- Each \(z_i\) is a set of "part weights" (or "factor loadings" or "features").
- For column ‘j’ of W, we use the notation \(w^j\).
- \(w^j\) collects index ‘j’ of all \(k\) "parts" (e.g., the value of pixel ‘j’ in each of the different parts, for image data).
- With this notation, we can write our approximation of one \(x_{ij}\) as:
\[ \hat{x}_{ij} \approx z_{i1} w_{1j} + z_{i2} w_{2j} + \cdots + z_{ik} w_{kj} = \sum_{c=1}^k z_{ic} w_{cj} = (w^j)^\top z_i = \langle w^j, z_i \rangle \quad \text{(NEW NOTATION)} \]
- K-means: "take index ‘j’ of closest mean".
- PCA: "\(z_i\) gives weights for index ‘j’ of all means".
- We can write the approximation of the whole vector \(x_i\) as \(\hat{x}_i = W^\top z_i\).
Different Views¶
- PCA approximates each \(x_{ij}\) by the inner product \(\langle w^j, z_i \rangle\).
- PCA approximates each \(x_i\) by the matrix-vector product \(W^\top z_i\).
- PCA approximates the matrix \(X\) by the matrix-matrix product \(ZW\):
\[ X \approx ZW \quad \text{where} \quad X \in \mathbb{R}^{n \times d}, \; Z \in \mathbb{R}^{n \times k}, \; W \in \mathbb{R}^{k \times d} \]
- PCA is also called a "matrix factorization" model.
- Both \(Z\) and \(W\) are variables.
- This can be viewed as a "change of basis" from \(x_i\) to \(z_i\) values.
- The "basis vectors" are the rows of \(W\), the \(w_c\).
- The "coordinates" in the new basis of each \(x_i\) are the \(z_i\).
Covariance & Covariance Matrix¶
Let \(X^A\) and \(X^B\) be random variables with a joint distribution \(P(x^A, x^B)\), where:
- \(\mu^A = \mathbb{E}[x^A] = 0\)
- \(\mu^B = \mathbb{E}[x^B] = 0\)
This zero-mean assumption simplifies the calculations below.
Covariance¶
- Definition of Covariance:
\[ \text{Cov}(X^A, X^B) = \mathbb{E}[(X^A - \mu^A)(X^B - \mu^B)] \]
Given the assumption that \(\mu^A = \mu^B = 0\), this simplifies to:
\[ \text{Cov}(X^A, X^B) = \mathbb{E}[X^A X^B] \]
- Special Case (Variance):
- The variance of \(X^A\), which is the covariance of \(X^A\) with itself, is given by:
\[ \text{Var}(X^A) = \mathbb{E}[(X^A - \mu^A)^2] = \mathbb{E}[(X^A)^2] \]
- Interpretation of Covariance:
- If \(\text{Cov}(X^A, X^B) > 0\), then \(X^A\) and \(X^B\) are positively correlated (as one increases, the other tends to increase).
- If \(\text{Cov}(X^A, X^B) = 0\), then \(X^A\) and \(X^B\) are uncorrelated (no linear relationship).
- If \(\text{Cov}(X^A, X^B) < 0\), then \(X^A\) and \(X^B\) are negatively correlated (as one increases, the other tends to decrease).
Covariance Matrix¶
- Definition:
- For a random vector \(\vec{X} = [X_1, X_2, \dots, X_d]^T\), the covariance matrix \(\Sigma\) (denoted as \(C\) in some notes) is a \(d \times d\) matrix describing the covariance between each pair of components \(X_i\) and \(X_j\).
- Assume a Dataset:
- Suppose we have data \(D = \{\vec{x_1}, \vec{x_2}, \dots, \vec{x_n}\} \subseteq \mathbb{R}^d\), where each sample \(\vec{x_i}\) is a \(d\)-dimensional vector.
- The mean vector \(\vec{\mu}\) of the data is:
\[ \vec{\mu} = \mathbb{E}[\vec{X}] = \frac{1}{n} \sum_{i=1}^n \vec{x_i} \]
- Covariance Matrix Formula:
- The covariance matrix \(\Sigma\) of \(\vec{X}\) is given by:
\[ \Sigma = \text{Cov}(\vec{X}) = \mathbb{E}[(\vec{X} - \vec{\mu})(\vec{X} - \vec{\mu})^T] \]
- For sample data, this can be approximated as:
\[ \Sigma = \frac{1}{n} \sum_{i=1}^n (\vec{x_i} - \vec{\mu})(\vec{x_i} - \vec{\mu})^T \]
- Alternatively, if the data is already centered (i.e., the mean \(\vec{\mu}\) has been subtracted), this simplifies to:
\[ \Sigma = \frac{1}{n} \sum_{i=1}^n \vec{x_i} \vec{x_i}^T \]
- Structure of the Covariance Matrix:
- Each element \(\Sigma_{ij}\) in the matrix represents the covariance between the \(i\)-th and \(j\)-th features:
- \(\Sigma_{ij} = \text{Cov}(X_i, X_j)\).
- Diagonal elements \(\Sigma_{ii}\) represent the variance of each feature \(X_i\): \(\Sigma_{ii} = \text{Var}(X_i)\).
- Off-diagonal elements represent the covariances between different features.
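A small NumPy sketch of these formulas (toy data and variable names are mine), comparing the manual computation against `np.cov`, which uses the \(1/(n-1)\) normalization:

```python
# Sketch: building the covariance matrix from data and checking it against np.cov.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))          # n=200 samples, d=3 features

mu = X.mean(axis=0)
Xc = X - mu                            # centered data

Sigma = (Xc.T @ Xc) / X.shape[0]       # (1/n) sum_i x_i x_i^T, as in the formula above
print(np.diag(Sigma))                  # diagonal: per-feature variances
print(Sigma[0, 1], Sigma[1, 0])        # off-diagonal: covariances (symmetric)

# np.cov uses the 1/(n-1) "sample" normalization, hence the rescaling for comparison.
print(np.allclose(Sigma, np.cov(X, rowvar=False) * (X.shape[0] - 1) / X.shape[0]))
```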
Steps to Construct PCA¶
Note
❗
What ChatGPT said about the steps of PCA:
We will use eigenvalue decomposition (and, optionally, the Singular Value Decomposition), together with a Lagrangian argument for maximizing variance, to derive PCA's dimensionality-reduction and reconstruction steps.
Step 1: Standardize the Data¶
- Center the Data:
- For a data matrix \(X\) with \(n\) samples and \(p\) features, compute the mean for each feature:
\[ \mu_j = \frac{1}{n} \sum_{i=1}^n X_{ij} \]
where \(X_{ij}\) represents the \(j\)-th feature of the \(i\)-th sample.
- Subtract the mean vector \(\mu\) from each data point to center the data:
\[ X_{centered} = X - \mu \]
where \(X_{centered}\) is the matrix of centered data.
- Optional: Standardize:
- If features have different units or variances, you may standardize each feature by dividing by its standard deviation \(\sigma_j\):
\[ X_{standardized} = \frac{X_{centered}}{\sigma} \]
where \(\sigma_j = \sqrt{\frac{1}{n} \sum_{i=1}^n (X_{ij} - \mu_j)^2}\) and the division is applied column-wise.
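A minimal NumPy sketch of this step (toy data and variable names are mine):

```python
# Sketch of Step 1: center (and optionally standardize) the data.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))   # toy data with nonzero mean

mu = X.mean(axis=0)                    # feature-wise means mu_j
X_centered = X - mu                    # each column now has (numerically) zero mean

sigma = X_centered.std(axis=0)         # feature-wise standard deviations sigma_j
X_standardized = X_centered / sigma    # optional: unit variance per feature

print(np.allclose(X_centered.mean(axis=0), 0))      # True
print(np.allclose(X_standardized.std(axis=0), 1))   # True
```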
Step 2: Compute the Covariance Matrix¶
The covariance matrix \(\Sigma\) captures the linear relationships between the features in the data.
- Calculate the Covariance Matrix:
\[ \Sigma = \frac{1}{n-1} X_{centered}^T X_{centered} \]
This is a \(p \times p\) matrix where each element \(\Sigma_{jk}\) represents the covariance between feature \(j\) and feature \(k\).
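A tiny NumPy check of this formula against `np.cov` (illustrative data; names are mine):

```python
# Sketch of Step 2: the p x p covariance matrix of the centered data.
import numpy as np

# X_centered from Step 1 (n x p), rebuilt here so the example is self-contained
rng = np.random.default_rng(4)
X_centered = rng.normal(size=(100, 4))
X_centered -= X_centered.mean(axis=0)

n = X_centered.shape[0]
Sigma = (X_centered.T @ X_centered) / (n - 1)                 # p x p covariance matrix
print(np.allclose(Sigma, np.cov(X_centered, rowvar=False)))   # True
```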
Step 3: Set Up and Solve the Eigenvalue Problem Using the Lagrangian¶
We now maximize the variance along a particular direction \(w\) (the principal component) under the constraint that \(w\) is a unit vector.
- Define the Objective:
- We want to maximize the variance of the data projected onto direction \(w\), which is given by \(w^T \Sigma w\).
Note
Deriving \(w^T \Sigma w\): Variance of Projected Data¶
In PCA, our goal is to project data onto a direction \(w\) that maximizes the variance of the data along that direction. Let's derive the expression \(w^T \Sigma w\) step-by-step, starting from the basics.

### Step 1: Define the Centered Data and Projection
1. Let \(X \in \mathbb{R}^{n \times d}\) represent our dataset, where \(n\) is the number of data points and \(d\) is the number of features.
2. Assume \(X\) is **centered**, so each feature has a mean of zero.
3. Let \(w \in \mathbb{R}^d\) be a unit vector representing the direction we want to project onto. The unit-length condition is \(w^T w = 1\).

### Step 2: Project the Data onto \(w\)
1. The projection of each data point \(x_i \in \mathbb{R}^d\) onto \(w\) is given by the dot product \(x_i^T w\).
2. This projection gives a scalar value for each data point along the direction \(w\), effectively reducing the data to one dimension along \(w\).

### Step 3: Variance of Projected Data
1. Since \(X\) is centered, the projected values \(x_i^T w\) also have mean zero, so their (empirical) variance is:
\[ \text{Var}(Xw) = \frac{1}{n} \sum_{i=1}^n (x_i^T w)^2 \]
2. Writing each term as \((x_i^T w)^2 = w^T x_i x_i^T w\), this becomes:
\[ \text{Var}(Xw) = w^T \left( \frac{1}{n} \sum_{i=1}^n x_i x_i^T \right) w = w^T \left( \frac{1}{n} X^T X \right) w \]

### Step 4: Substituting the Covariance Matrix
1. For centered data, the covariance matrix is \(\Sigma = \frac{1}{n} X^T X\) (or \(\frac{1}{n-1} X^T X\) with the sample normalization of Step 2; the argument is identical).
2. Thus, we can substitute \(\Sigma\) into our expression for the variance:
\[ \text{Var}(Xw) = w^T \Sigma w \]
- Add the constraint that \(w\) has unit length, i.e., \(w^T w = 1\).
- Formulate the Lagrangian:
\[ \mathcal{L}(w, \lambda) = w^T \Sigma w - \lambda (w^T w - 1) \]
Here:
- \(w^T \Sigma w\) represents the variance we wish to maximize.
- \(\lambda\) is a Lagrange multiplier enforcing the unit length constraint.
- Optimize by Differentiating:
- Take the derivative of \(\mathcal{L}\) with respect to \(w\) and set it to zero:
\[ \frac{\partial \mathcal{L}}{\partial w} = 2 \Sigma w - 2 \lambda w = 0 \]
Simplifying, we get:
\[ \Sigma w = \lambda w \]
- This equation is known as the eigenvalue equation, meaning \(w\) must be an eigenvector of \(\Sigma\), and \(\lambda\) is the corresponding eigenvalue.
- Interpretation:
- Eigenvectors \(w\) represent directions of maximum variance (principal components).
- Eigenvalues \(\lambda\) represent the variance captured along each principal component direction.
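A short NumPy sketch of this step (toy data, names mine): compute \(\Sigma\), take its eigendecomposition with `np.linalg.eigh`, and verify \(\Sigma w = \lambda w\) and that the variance along \(w_1\) equals \(\lambda_1\):

```python
# Sketch of Step 3: the directions solving max_w w^T Sigma w s.t. w^T w = 1
# are the eigenvectors of Sigma.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.0, 0.2]])   # correlated features
Xc = X - X.mean(axis=0)
Sigma = (Xc.T @ Xc) / (Xc.shape[0] - 1)

eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh: ascending eigenvalues, orthonormal eigenvectors
order = np.argsort(eigvals)[::-1]          # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

w1 = eigvecs[:, 0]                                # first principal direction
print(np.allclose(Sigma @ w1, eigvals[0] * w1))   # Sigma w = lambda w
print(w1 @ Sigma @ w1, eigvals[0])                # variance along w1 equals lambda_1
```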
Step 4: Singular Value Decomposition (SVD) - Optional (Story behind PCA)¶
- SVD Decomposition of \(X\):
- For the centered data matrix \(X\), perform SVD to decompose it as:
\[ X = U S V^T \]
where:
- \(U\) contains the left singular vectors; \(U S\) gives the coordinates (scores) of the samples along the principal components.
- \(S\) is a diagonal matrix of singular values, whose entries are related to the square roots of the eigenvalues of \(\Sigma\).
- \(V\) contains the right singular vectors; its columns are the principal component directions in the original feature space.
- Connection Between SVD and Eigenvalue Decomposition:
- The squared singular values in \(S\), divided by \(n-1\) (or \(n\), depending on the normalization used for \(\Sigma\)), are the eigenvalues \(\lambda_i\) of the covariance matrix \(\Sigma\).
- The columns of \(V\) are the principal components of \(X\), and \(US = XV\) gives the projected data (see the sketch below).
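A small NumPy sketch (names mine) verifying these relationships between the SVD of the centered data and the eigendecomposition of \(\Sigma\):

```python
# Sketch of Step 4: relating SVD of the centered data to the eigendecomposition of Sigma.
import numpy as np

rng = np.random.default_rng(6)
Xc = rng.normal(size=(200, 4))
Xc -= Xc.mean(axis=0)
n = Xc.shape[0]

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)    # Xc = U S V^T
Sigma = (Xc.T @ Xc) / (n - 1)
eigvals, eigvecs = np.linalg.eigh(Sigma)              # ascending order

print(np.allclose(s**2 / (n - 1), eigvals[::-1]))          # squared singular values / (n-1) = eigenvalues
print(np.allclose(np.abs(Vt[0]), np.abs(eigvecs[:, -1])))  # top direction matches, up to sign
print(np.allclose(Xc @ Vt.T, U * s))                       # projected coordinates: X V = U S
```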
Step 5: Project the Data onto the Principal Components¶
- Select Top \(k\) Eigenvectors:
- Sort the eigenvalues \(\lambda_i\) in descending order, and select the top \(k\) eigenvectors corresponding to the largest eigenvalues.
- These top eigenvectors form the projection matrix \(W\) (here a \(p \times k\) matrix with the eigenvectors as columns, i.e., the transpose of the \(k \times d\) \(W\) convention used earlier in these notes):
\[ W = [v_1 \; v_2 \; \ldots \; v_k] \]
- Project Data into the Principal Component Space:
- Transform the centered data \(X_{centered}\) into the new \(k\)-dimensional space:
\[ X_{reduced} = X_{centered} \cdot W \]
- Each row of \(X_{reduced}\) represents the data point in the principal component space (see the sketch below).
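The sketch below (NumPy, toy data, names mine) carries out this projection for \(k = 2\):

```python
# Sketch of Step 5: project the centered data onto the top-k eigenvectors.
import numpy as np

rng = np.random.default_rng(7)
Xc = rng.normal(size=(150, 5))
Xc -= Xc.mean(axis=0)

Sigma = (Xc.T @ Xc) / (Xc.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]

k = 2
W = eigvecs[:, order[:k]]              # p x k projection matrix [v1 v2]
X_reduced = Xc @ W                     # n x k coordinates in the principal-component space
print(X_reduced.shape)                 # (150, 2)
```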
Step 6: Reconstruction from the Reduced Space¶
After reducing the data dimensionality, we can approximate the original data by reconstructing it from the lower-dimensional representation.
- Reconstruction Formula:
- Given the reduced data \(X_{reduced}\), the reconstruction in the original space is given by:
\[ X_{reconstructed} = X_{reduced} \cdot W^T + \mu \]
- Here, \(X_{reduced} \cdot W^T\) approximates the original centered data, and adding \(\mu\) shifts it back to the original mean.
- Perfect Reconstruction Condition:
- If we use all \(p\) dimensions (no dimensionality reduction), then \(X_{reconstructed} = X\).
- Otherwise, \(X_{reconstructed}\) is an approximation of \(X\) based on the retained principal components.
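A short NumPy sketch of the reconstruction, including the perfect-reconstruction check when all \(p\) components are kept (toy data, names mine):

```python
# Sketch of Step 6: reconstruct from the reduced representation and check the
# perfect-reconstruction case when all p components are kept.
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(120, 4))
mu = X.mean(axis=0)
Xc = X - mu

Sigma = (Xc.T @ Xc) / (Xc.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]

def reconstruct(k):
    W = eigvecs[:, order[:k]]                      # p x k
    X_reduced = Xc @ W
    return X_reduced @ W.T + mu                    # back to the original space

print(np.linalg.norm(X - reconstruct(2)))          # approximation error with k=2
print(np.allclose(X, reconstruct(X.shape[1])))     # True: k=p gives perfect reconstruction
```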
Step 7: PCA Decorrelates Dimensions¶
One of PCA's key properties is that it produces uncorrelated (orthogonal) principal components, meaning each principal component captures an independent direction of variance.
- Decorrelated Dimensions:
- After projecting the data onto the principal components, the covariance matrix of the reduced data \(X_{reduced}\) becomes diagonal. This indicates that the principal components are uncorrelated.
- The covariance matrix of \(X_{reduced}\), denoted \(\Sigma_{X_{reduced}}\), is given by:
\[ \Sigma_{X_{reduced}} = \frac{1}{n} \sum_{i=1}^n x_{reduced,i} \, x_{reduced,i}^T = W^T \Sigma W = \Lambda \]
- Here, \(x_{reduced,i}\) denotes the \(i\)-th row of \(X_{reduced}\) (as a column vector), and \(\Lambda\) is a diagonal matrix with the eigenvalues \(\lambda_1, \lambda_2, \ldots, \lambda_k\) of the original covariance matrix \(\Sigma\) along its diagonal:
\[ \Sigma_{X_{reduced}} = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_k) \]
- Implication:
- Each principal component represents a unique, uncorrelated direction in the data, capturing independent sources of variance.
- This decorrelation is achieved because the principal components (eigenvectors) are orthogonal to each other by construction.
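A small NumPy check of this property on correlated toy data (names mine):

```python
# Sketch of Step 7: the covariance of the projected data is (numerically) diagonal,
# with the eigenvalues of Sigma on the diagonal.
import numpy as np

rng = np.random.default_rng(9)
Xc = rng.normal(size=(400, 3)) @ rng.normal(size=(3, 3))   # correlated features
Xc -= Xc.mean(axis=0)

Sigma = (Xc.T @ Xc) / (Xc.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order]                      # keep all components here

X_reduced = Xc @ W
Sigma_reduced = (X_reduced.T @ X_reduced) / (Xc.shape[0] - 1)

off_diag = Sigma_reduced - np.diag(np.diag(Sigma_reduced))
print(np.allclose(off_diag, 0, atol=1e-10))                 # off-diagonals vanish
print(np.allclose(np.diag(Sigma_reduced), eigvals[order]))  # diagonal = eigenvalues
```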
Step 8: Choosing the Number of Principal Components (Variance Explained)¶
To retain the most important information while reducing dimensionality, we select the smallest number of principal components that capture the majority of the variance.
- Variance Explained by Each Component:
- The eigenvalues \(\lambda_i\) represent the variance along each principal component.
- Cumulative Variance Criterion:
- To decide the number of components \(r\), choose the smallest \(r\) such that the cumulative variance explained is above a threshold (e.g., 95%):
\[ \frac{\sum_{i=1}^r \lambda_i}{\sum_{i=1}^p \lambda_i} \geq 0.95 \]
- Interpretation:
- By selecting the top \(r\) components, we keep the dimensions that retain the majority of the variance, effectively "denoising" the data by removing minor sources of variance.
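A minimal NumPy sketch of the 95% rule on toy data with unequal variances (names mine):

```python
# Sketch of Step 8: pick the smallest r whose cumulative variance ratio reaches 95%.
import numpy as np

rng = np.random.default_rng(10)
Xc = rng.normal(size=(500, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])  # unequal variances
Xc -= Xc.mean(axis=0)

Sigma = (Xc.T @ Xc) / (Xc.shape[0] - 1)
eigvals = np.sort(np.linalg.eigvalsh(Sigma))[::-1]     # eigenvalues, descending

ratio = np.cumsum(eigvals) / eigvals.sum()             # cumulative variance explained
r = int(np.searchsorted(ratio, 0.95) + 1)              # smallest r with ratio >= 0.95
print(ratio)
print("keep r =", r, "components")
```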
Summary of Complete PCA Process¶
- Standardize the Data: Center the data to have zero mean (and optionally scale to unit variance).
- Compute Covariance Matrix: \(\Sigma = \frac{1}{n-1} X_{centered}^T X_{centered}\).
- Solve the Eigenvalue Problem: Use the Lagrangian to derive \(\Sigma w = \lambda w\) and find eigenvalues \(\lambda_i\) and eigenvectors \(w_i\).
- Singular Value Decomposition (SVD): Decompose \(X\) as \(X = U S V^T\) to relate to principal components.
- Project the Data: Use \(X_{reduced} = X_{centered} \cdot W\), where \(W\) contains the top \(k\) eigenvectors.
- Reconstruct the Data: Approximate original data with \(X_{reconstructed} = X_{reduced} \cdot W^T + \mu\).
- Decorrelate Dimensions: The covariance matrix of \(X_{reduced}\) is diagonal, indicating uncorrelated components.
- Select Components by Variance: Choose the smallest \(r\) for which \(\frac{\sum_{i=1}^r \lambda_i}{\sum_{i=1}^p \lambda_i} \geq 0.95\).
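Putting the steps together, here is an end-to-end sketch in NumPy, cross-checked against scikit-learn's `PCA` (this assumes scikit-learn is installed; components can differ by a sign flip, and all data and names are illustrative):

```python
# End-to-end sketch of the summary above, cross-checked against scikit-learn's PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(11)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5)) + rng.normal(size=5)

# Steps 1-5: center, covariance, eigendecomposition, project.
mu = X.mean(axis=0)
Xc = X - mu
Sigma = (Xc.T @ Xc) / (X.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
k = 2
W = eigvecs[:, order[:k]]
X_reduced = Xc @ W

# Cross-check with sklearn (components may differ by a sign).
pca = PCA(n_components=k).fit(X)
print(np.allclose(eigvals[order[:k]], pca.explained_variance_))
print(np.allclose(np.abs(X_reduced), np.abs(pca.transform(X))))
```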
Note
❗
ChatGPT said:
Given a matrix \(A\) of size \(m \times n\), the goal of SVD is to decompose \(A\) into three matrices: \(A = U \Sigma V^T\) where:
- \(U\): \(m \times m\) orthogonal matrix (left singular vectors)
- \(\Sigma\): \(m \times n\) diagonal matrix containing singular values
- \(V^T\): \(n \times n\) orthogonal matrix (right singular vectors)
Steps to Compute SVD:
- Compute \(A^T A\) and \(A A^T\)
- \(A^T A\) is an \(n \times n\) matrix. Solve the eigenvalue problem \((A^T A)v = \lambda v\); the eigenvectors \(v_i\) form the columns of \(V\), and the eigenvalues \(\lambda_i\) are used to compute the singular values.
- \(A A^T\) is an \(m \times m\) matrix. Solve the eigenvalue problem \((A A^T)u = \gamma u\); the eigenvectors \(u_i\) form the columns of \(U\).
- Compute singular values: The singular values \(\sigma_i\) are the square roots of the non-zero eigenvalues of either \(A^T A\) or \(A A^T\): \(\sigma_i = \sqrt{\lambda_i} = \sqrt{\gamma_i}\). The singular values are arranged in descending order on the diagonal of \(\Sigma\):
\[ \Sigma = \begin{bmatrix} \sigma_1 & 0 & \dots & 0 \\ 0 & \sigma_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma_k \end{bmatrix} \]
- Construct the matrices:
- The columns of \(V\) are the eigenvectors of \(A^T A\).
- The columns of \(U\) are the eigenvectors of \(A A^T\).
- The diagonal entries of \(\Sigma\) are the singular values \(\sigma_i\).
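A small NumPy sketch of this recipe (toy matrix, names mine), compared against `np.linalg.svd`:

```python
# Sketch: recover the SVD pieces of A from the eigendecomposition of A^T A,
# and compare with np.linalg.svd.
import numpy as np

rng = np.random.default_rng(12)
A = rng.normal(size=(6, 4))                       # m x n with m > n

# Eigendecomposition of A^T A gives V and the squared singular values.
lam, V = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]
sigma = np.sqrt(np.clip(lam, 0, None))            # singular values (descending)

U_ref, s_ref, Vt_ref = np.linalg.svd(A, full_matrices=False)
print(np.allclose(sigma, s_ref))                  # same singular values
print(np.allclose(np.abs(V), np.abs(Vt_ref.T)))   # same right singular vectors, up to sign

# The corresponding left singular vectors can be obtained as u_i = A v_i / sigma_i.
U = A @ V / sigma
print(np.allclose(np.abs(U), np.abs(U_ref)))      # matches, up to sign
```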
Relation to PCA:
PCA and SVD are closely related. Here’s how:
- Center the data: \(X_{\text{centered}} = X - \mu\) where \(\mu\) is the mean of each feature.
- Compute the covariance matrix: \(C = \frac{1}{m-1} X_{\text{centered}}^T X_{\text{centered}}\)
- Perform eigendecomposition on \(C\): \(C v = \lambda v\)
This is equivalent to performing SVD on the centered data matrix: \(X_{\text{centered}} = U \Sigma V^T\)
- The principal components are the columns of \(V\).
- The eigenvalues of \(C\) correspond to the squared singular values from \(\Sigma\), divided by \(m-1\).
Note
❗
ChatGPT said:
The Singular Value Decomposition (SVD) plays a crucial role in Principal Component Analysis (PCA) by providing a computational method for performing the analysis. Here's how SVD relates to PCA:
1. Covariance Matrix Computation
PCA focuses on finding the eigenvalues and eigenvectors of the covariance matrix, \(\frac{1}{n-1} \mathbf{X}^\top \mathbf{X}\), where \(\mathbf{X}\) is the centered data matrix (with rows as observations and columns as features).
2. SVD Decomposition
By applying SVD to the data matrix \(\mathbf{X}\):
\[ \mathbf{X} = \mathbf{U} \Sigma \mathbf{V}^\top \]
where:
- \(\mathbf{U}\): Contains the left singular vectors.
- \(\Sigma\): Diagonal matrix with singular values.
- \(\mathbf{V}\): Contains the right singular vectors.
3. Relation to Covariance Matrix
Using the SVD, the covariance matrix can be expressed as:
\[ \frac{1}{n-1} \mathbf{X}^\top \mathbf{X} = \frac{1}{n-1} (\mathbf{U} \Sigma \mathbf{V}^\top)^\top (\mathbf{U} \Sigma \mathbf{V}^\top) = \frac{1}{n-1} \mathbf{V} \Sigma^\top \mathbf{U}^\top \mathbf{U} \Sigma \mathbf{V}^\top \]
Simplifying (using \(\mathbf{U}^\top \mathbf{U} = \mathbf{I}\)):
\[ \frac{1}{n-1} \mathbf{X}^\top \mathbf{X} = \mathbf{V} \left( \frac{\Sigma^\top \Sigma}{n-1} \right) \mathbf{V}^\top \]
This shows that:
- The eigenvalues of \(\mathbf{X}^\top \mathbf{X}\) are given by the squared singular values \(\Sigma^2\), so the eigenvalues of the covariance matrix are \(\Sigma^2 / (n-1)\).
- The eigenvectors of \(\mathbf{X}^\top \mathbf{X}\) are the columns of \(\mathbf{V}\), which are the principal components.
4. Dimensionality Reduction
- SVD enables dimensionality reduction by truncating the matrices \(\mathbf{U}, \Sigma, \mathbf{V}\) to retain only the top \(k\) singular values and corresponding singular vectors.
- The truncated SVD approximation of \(\mathbf{X}\) provides the best low-rank approximation, preserving the most variance in the data.
5. Key Insight
The singular values in \(\Sigma\) correspond, up to the \(1/\sqrt{n-1}\) scaling, to the square roots of the eigenvalues of the covariance matrix. Thus, SVD directly reveals the principal components and their importance, bypassing the need to compute the covariance matrix explicitly.
In summary, SVD simplifies and accelerates the PCA computation, making it numerically stable and efficient for large datasets.
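As a final illustration of point 4 (truncation), here is a short NumPy sketch (toy data, names mine) showing how the rank-\(k\) truncated SVD reconstruction error shrinks as \(k\) grows:

```python
# Sketch: truncating the SVD gives a low-rank approximation of the (centered) data,
# and keeping more singular values reduces the reconstruction error.
import numpy as np

rng = np.random.default_rng(13)
X = rng.normal(size=(100, 8)) @ rng.normal(size=(8, 8))
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

for k in (1, 2, 4, 8):
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]      # rank-k truncated SVD
    err = np.linalg.norm(Xc - X_k)
    print(f"k={k}: reconstruction error {err:.3f}")
```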