Principal Component Analysis - A Geometric Approach

Overview

Principal Component Analysis is a neat statistical trick, with a simple bit of linear algebra backing it up.  Imagine you've got a dependent (or response) variable $Y$ and a large number of independent (a.k.a. explanatory) variables $X_1, ..., X_p$.  You also have $n$ measurements of each, giving you a matrix

$$
\begin{align}
X = &
\begin{pmatrix}
x_{11} & x_{12} & ... & x_{1n}\\
\vdots & & & \\
x_{p1} & x_{p2} & ... & x_{pn}\\
\end{pmatrix} \\
= &
\begin{pmatrix}
-\mathbb{x_1}- \\
\vdots \\
-\mathbb{x_p}- \\
\end{pmatrix} \\
\end{align}
$$

You want to build a model explaining how $Y$ depends on the $X_i$, perhaps a linear model like

$$
Y = \beta_0 + \sum_{i=1}^p \beta_i X_i
$$
but first you need to check that the $X_i$ are not strongly correlated with each other, otherwise there is no stable way of setting the $\beta_i$ values and the stats package is likely to produce unreliable output.  In general the $X_i$ are correlated, which presents a problem.  This is the problem that PCA seeks to solve, and it does it by rotating the $X_i$ variables in p-space to produce new variables $X_i'$ which are completely uncorrelated with each other.  In the process it provides a mechanism for reducing the number of explanatory variables, enabling a more parsimonious model.
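
To see why correlated explanatory variables cause trouble, here is a tiny illustrative sketch of my own (made-up data, nothing to do with the photon example later in the post): two nearly identical predictors give wildly uncertain coefficient estimates.

# Two almost identical explanatory variables make the fitted
# coefficients unstable: lm() reports huge standard errors on both.
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd=0.01)   # x2 is almost a copy of x1
y  <- 2*x1 + rnorm(n)
summary(lm(y ~ x1 + x2))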

Variance

From here on let's assume that the $X_i$ have been scaled and centered, as it will make everything that follows simpler.  We can think of each observation as a geometric vector in $p$-dimensional space.  The squared lengths of these vectors can be averaged to give a total variance $\sigma^2$, i.e.
$$
\sigma^2 = \frac{1}{n}\sum_{j=1}^n{\left|
\begin{pmatrix}
x_{1j} \\
\vdots \\
x_{pj}\\
\end{pmatrix}
\right|^2
}
$$
This can be rewritten in terms of the per-variable variances, each of which is 1 after scaling, to get
$$
\begin{align}
\sigma^2 &= \frac{1}{n}\sum_{i,j}{x_{ij}^2} \\
&= \sigma_1^2 + ... + \sigma_p^2 \\
&= p
\end{align}
$$
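
As a quick sanity check of that bookkeeping, here is a minimal R sketch on some made-up data (laid out with observations as rows, i.e. the transpose of the $X$ above):

# Made-up data: n observations of p variables, then scaled and centered.
n  <- 1000
p  <- 4
Xs <- scale(matrix(rnorm(n * p), nrow=n))
# Each column now has variance 1, so the per-variable variances sum to p.
# (var() uses an n-1 denominator rather than the 1/n above, hence exactly p.)
sum(apply(Xs, 2, var))   # = 4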

Covariance

Since the $X_i$ are centered, the requirement that distinct variables $X_i$ and $X_j$ are uncorrelated is the same as saying that the covariance matrix $C = \frac{1}{n}XX^{T}$ is diagonal.  In general this is not the case, but perhaps we can find linear combinations $X_i'$ of the original $X_i$ such that it is!
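
For example, here is what $C$ looks like for some made-up, deliberately correlated data (again with observations as rows, so the matrix products are transposed relative to the formulas); the off-diagonal entries are clearly non-zero:

# Made-up data in which the first two variables share a common component.
n  <- 1000
u  <- rnorm(n)
Xs <- scale(cbind(u + rnorm(n), u + rnorm(n), rnorm(n)))
# The covariance matrix of the scaled, centered data.
C <- t(Xs) %*% Xs / n
round(C, 2)   # roughly 0.5 between the first two variables, ~0 elsewhere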

Rotation

We know that $C$ is symmetric and real, and so it follows from basic linear algebra that it has $p$ linearly independent and orthonormal eigenvectors $\mathbb{e_1}, ...,\mathbb{e_p}$ corresponding to real eigenvalues $\lambda_1, ... \lambda_p$.  I.e. we can find column vectors $\mathbb{e_i}$ and scalars $\lambda_i$ such that
$$
C \mathbb{e_i} = \lambda_i \mathbb{e_i}
$$
from which, since the $\mathbb{e_i}$ are orthonormal, it follows that whenever $i \ne j$
$$
\mathbb{e_j}^TXX^T \mathbb{e_i} = 0
$$
But the LHS above is just the dot product of $\mathbb{x_j}'$ and $\mathbb{x_i}'$ defined by the rule
$$
\mathbb{x_i}' = \mathbb{e_i}^T X
$$
Hey presto! We've defined a rotation of the $X_i$ which results in what we wanted: uncorrelated explanatory variables.  (I'm calling it a rotation because, just like with rotations in 3D space, the matrix defined by the $\mathbb{e_i}$ is inverse to its transpose.)  These transformed variables are referred to as Principal Components.
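
Here's a minimal sketch of that rotation done by hand with R's eigen(), using the same sort of made-up correlated data as in the previous sketch; the covariance matrix of the rotated variables comes out diagonal:

# Made-up correlated data, scaled and centered (observations as rows).
n  <- 1000
u  <- rnorm(n)
Xs <- scale(cbind(u + rnorm(n), u + rnorm(n), rnorm(n)))
C  <- t(Xs) %*% Xs / n          # covariance matrix
eig <- eigen(C)                 # orthonormal eigenvectors e_i, eigenvalues lambda_i
Xrot <- Xs %*% eig$vectors      # the rotated variables x_i'
round(t(Xrot) %*% Xrot / n, 6)  # diagonal, with the lambda_i on the diagonal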

Parsimony

Typically we want to build a model with as few independent variables as possible. If a model with lots of explanatory variables explains most of the variation in the response variable, it could well be just that there are enough degrees of freedom to explain any response!  So it's much better to get rid of a few variables, as long as those remaining can still explain most of the variation.  With the original $X_i$ there's no obvious way to decide which to ditch but, as we shall see, with the transformed $X_i'$ variables the eigenvalues can help us find the subset that is most "important".
 
The original variance $\sigma^2$ is just the average of the squared lengths of the $n$ p-vectors.  Since our transformation merely rotates these p-vectors, their lengths do not change, and consequently
$$
\sigma'^2 = \sigma^2 = p
$$
It turns out that the $\lambda_i$ are the same as the per-principal-component variances, since
$$
\begin{align}
\sigma_i'^2 &= \frac{1}{n}\mathbb{x_i'}\mathbb{x_i'}^T \\
&= \frac{1}{n} \mathbb{e_i}^T X X^T\mathbb{e_i} \\
&=\mathbb{e_i}^T C\mathbb{e_i} \\
&=\lambda_i \\
\end{align}
$$
This tells us that the eigenvalues (variances) must all be non-negative and sum to $p$. Remember, we're looking at exactly the same data, just rotated in p-space.  Thus, if any eigenvalues happen to be near zero, we can ditch the corresponding principal components.  In general we arrange the eigenvalues in decreasing order and name the corresponding principal components PC1, PC2, and so on, and then we ditch the ones at the end that contribute less than 5% of the variance between them.  This is equivalent to projecting the observations onto a lower-dimensional subspace in such a way as to lose the least amount of information.  Very often just two principal components suffice to cover most of the variance, enabling the creation of a simple scatterplot which can help identify modal groups.
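
Continuing the hand-rolled sketch from the Rotation section, the eigenvalues do exactly this bookkeeping for us:

# Continuing the sketch from the Rotation section above.
lambda <- eig$values            # per-PC variances, largest first
sum(lambda)                     # approximately p = 3
cumsum(lambda) / sum(lambda)    # cumulative proportion of variance explained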

Example

The following short R script shows PCA at work for an imagined experiment.  Photons are fired from a torch and the $x,y,z$ distances travelled are measured, along with the time taken $t$.  Obviously $t$ depends wholly on the other components, so let's see if PCA spots that we can reduce these 4 variables to just 3.

# Print real numbers, not scientific notation.
options(scipen=999)

# Produce 1000 x,y,z,t displacement observations for photons emitted by a torch
n <- 1000
x <- rnorm(n, mean=10, sd=1)
y <- rnorm(n, mean=8,  sd=0.5)
z <- rnorm(n, mean=12, sd=1.5)
t <- sqrt(x^2 + y^2 + z^2)   # time of flight, taking the speed of light to be 1

# Combine into dataframe
df <- data.frame(x=x, y=y, z=z, t=t)

# Show how t is not independent of x,y,z
library(car)
scatterplotMatrix(df)

# Do a PCA
pca <- prcomp(df, scale.=TRUE, center=TRUE)
summary(pca)
print(pca$rotation)

The scatterplot matrix that the script generates shows the dependency between the original variables.

The text output shows that the PCA does in fact detect this:

> summary(pca)
Importance of components:
                          PC1    PC2    PC3     PC4
Standard deviation     1.3934 1.0327 0.9950 0.04177
Proportion of Variance 0.4854 0.2666 0.2475 0.00044
Cumulative Proportion  0.4854 0.7520 0.9996 1.00000

The above output shows that half the variation is covered by PC1 on its own (corresponding to an eigenvalue of roughly 2, where $p = 4$).  This is because, to first order, $t$ is just a weighted sum of $x$, $y$ and $z$, so PC1 can capture that shared direction and with it about half the total variance.  The final PC adds almost nothing to the variance, as expected, and can be dropped.
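
You can see the eigenvalues themselves by squaring the standard deviations that prcomp reports:

# The eigenvalues are the squared standard deviations; they sum to p = 4.
pca$sdev^2
sum(pca$sdev^2)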

Finally, R lets you see exactly how each PC is defined in terms of the original variables, and you can check that the columns below form an orthonormal set (a quick way to do this is shown after the output).

> print(pca$rotation)
        PC1         PC2         PC3        PC4
x 0.2746490  0.89112242 -0.08053689 -0.3521116
y 0.1857603 -0.02987745  0.97020834 -0.1526310
z 0.6142378 -0.44980563 -0.22700113 -0.6073362
t 0.7160817  0.05179825 -0.02607824  0.6956033
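
One quick way to do that check is to multiply the rotation matrix by its transpose, which should give the 4x4 identity matrix up to rounding error:

# The columns of pca$rotation are orthonormal, so this is (numerically)
# the identity matrix.
round(t(pca$rotation) %*% pca$rotation, 10)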
