Understanding PCA

Principal component analysis (PCA) is often used as a means of exploratory analysis to uncover relatedness in a data set. Generating PCA plots is as straightforward as running prcomp() on a data set in R. Interpreting PCA results, however, is not as trivial as declaring that clustered data points are similar - what exactly does it mean when we say two data points are similar?

I use PCA as a quality-control type of analysis for gene expression data. An unsupervised clustering analysis is useful for identifying differences within a data set. For example, I would expect data points to cluster by their origin (i.e., gene expression data from kidney cells should cluster together, separately from data from brain cells). What exactly is being plotted in a PCA plot, though? The answer is founded on some concepts in linear algebra and statistics. I've been reading a lot of resources on the derivation of PCA, so I thought I would share some key points here to make it more digestible.

Some statistics

First, we need to define some statistical basics. The sample mean of a variable A can be described by the following equation:

\(\mu_A = \frac{1}{n} (a_1 + \cdots + a_n)\)

for n measurements.

The sample mean describes the centrality of the obtained measurements. The spread of the measurements is described by the sample variance.

\(Var(A) = \frac{1}{n - 1} ((a_1 - \mu_A)^2 + \cdots + (a_n - \mu_A)^2)\)

The key concept in PCA is the covariance between two variables. The covariance of two variables A and B is described by:

\(Cov(A,B) = \frac{1}{n - 1} ((a_1 - \mu_A)(b_1 - \mu_B) + \cdots + (a_n - \mu_A)(b_n - \mu_B))\)
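These three quantities map directly onto base R functions. A quick sketch (the vectors a and b below are made-up numbers, purely for illustration):

```r
# Two made-up variables measured on the same n = 5 subjects
a <- c(4.1, 3.8, 5.2, 4.7, 4.4)
b <- c(2.0, 1.9, 2.8, 2.5, 2.3)

mean(a)    # sample mean of A
var(a)     # sample variance of A (divides by n - 1)
cov(a, b)  # sample covariance of A and B (also divides by n - 1)
```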

Some linear algebra

Now we have to start talking about our data. Let's say we ran a survey of university students on their academic performance. Our sample pool is 20 students, and we collect 3 pieces of information from each student: 1) number of hours spent studying, 2) number of hours spent doing homework, and 3) number of extracurricular hours. This way we have 20 samples of 3-dimensional data. For the ith individual, the measurements can be described by a vector:

\[\mathbf{\overrightarrow{x_i}} = \left[\begin{array} {rrr} 82 \\ 77 \\ 41 \end{array}\right] \]

This vector describes a student who (apparently) studies 82 hours, does homework for 77 hours, and has 41 hours allocated to extracurriculars. Therefore, if we were to look at our entire data set, we would have an m x n matrix where m = 3 (one row for each piece of data per student) and n = 20 (one column for each of the 20 students in our pool) - that is, a matrix with 3 rows and 20 columns.
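As a sketch of how this might look in R (the numbers below are randomly generated placeholders, not real survey data), the full data set could be stored as a 3 x 20 matrix:

```r
set.seed(1)

# Rows = the 3 variables, columns = the 20 students; values are made up for illustration
study    <- round(runif(20, min = 20, max = 100))
homework <- round(runif(20, min = 20, max = 100))
extra    <- round(runif(20, min = 0,  max = 60))

X_full <- rbind(study, homework, extra)
dim(X_full)  # 3 20
```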

Before carrying on, it's useful to recall a result from linear algebra. If A is an n x n symmetric matrix (i.e., \(A = A^T\), where \(A^T\) denotes the transpose), the following relationship holds true:

\(A\overrightarrow{v_i} = \lambda_i\overrightarrow{v_i}\)

where the \(\lambda_i\) are real numbers and the \(\overrightarrow{v_i}\) are mutually orthogonal vectors. The \(\lambda_i\) are called eigenvalues and the \(\overrightarrow{v_i}\) the corresponding eigenvectors. Now we define a matrix called the covariance matrix. Let's pull 3 vectors (students) from our data set:

\[\mathbf{\overrightarrow{x_1}} = \left[\begin{array} {rrr} a_1 \\ a_2 \\ a_3 \end{array}\right] = \left[\begin{array} {rrr} 82 \\ 77 \\ 41 \end{array}\right] \]

\[\mathbf{\overrightarrow{x_2}} = \left[\begin{array} {rrr} b_1 \\ b_2 \\ b_3 \end{array}\right] = \left[\begin{array} {rrr} 81 \\ 89 \\ 42 \end{array}\right] \]

\[\mathbf{\overrightarrow{x_3}} = \left[\begin{array} {rrr} c_1 \\ c_2 \\ c_3 \end{array}\right] = \left[\begin{array} {rrr} 89 \\ 94 \\ 32 \end{array}\right] \]

The sample mean vector - whose jth entry \(\mu_j\) is the mean of the jth variable across the three students - is described as:

\[\mathbf{\overrightarrow{\mu}} = \left[\begin{array} {rrr} \mu_1 \\ \mu_2 \\ \mu_3 \end{array}\right] \]
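In R, with the three example students stored as the columns of a matrix, the sample mean vector is just the row means (a small sketch using the numbers above):

```r
# Columns are the three example students x_1, x_2, x_3; rows are the three variables
X <- cbind(c(82, 77, 41),
           c(81, 89, 42),
           c(89, 94, 32))

mu <- rowMeans(X)  # sample mean vector (mu_1, mu_2, mu_3)
mu                 # approximately 84.00 86.67 38.33
```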

Then the covariance matrix S is defined as the matrix whose i,jth element, \(S_{ij}\), is the covariance between the ith and jth variables. If i = j - that is, on the diagonal entries of the matrix - the entry is the variance of the ith variable alone. In our example,

\(S_{11} = \frac{1}{3-1}((a_1 - \mu_1)^2 + (b_1 - \mu_1)^2 + (c_1 - \mu_1)^2)\)

Notice that this equation corresponds to the equation for sample variance. Indeed, the 1,1 entry of S is the variance of the first variable alone - which, in our example, is the number of hours spent studying.

How about entries where i does not equal j?

\(S_{21} = \frac{1}{3-1}((a_1 - \mu_1)(a_2 - \mu_2) + (b_1 - \mu_1)(b_2 - \mu_2) + (c_1 - \mu_1)(c_2 - \mu_2))\)

See that this equation corresponds to the equation for covariance. Indeed, this describes the covariance between the first and second variables in our data - the number of hours spent studying and the number of hours spent doing homework.
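We can check both entries against R's cov() function, which builds the whole covariance matrix at once. Note that cov() expects observations in rows and variables in columns, so we transpose first (a sketch, reusing the three example students):

```r
X <- cbind(c(82, 77, 41), c(81, 89, 42), c(89, 94, 32))  # same three students as above

S <- cov(t(X))  # cov() wants observations in rows, variables in columns

S[1, 1]  # variance of variable 1 (hours studying): 19
S[2, 1]  # covariance of variables 1 and 2 (studying vs. homework): 24.5
S        # the full 3 x 3 covariance matrix, symmetric as expected
```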

An exercise in linear algebra will show that S is a symmetric matrix. Therefore, according to our theorem from before, there exists a set of eigenvalues and eigenvectors such that

\(S\overrightarrow{v_i} = \lambda_i\overrightarrow{v_i}\)

PCA is built on the fact that for a given covariance matrix S, you can find the eigenvalues \(\lambda_1, \ldots, \lambda_m\), whose sum equals the total variance across all variables, and the corresponding eigenvectors \(\overrightarrow{v_1}, \ldots, \overrightarrow{v_m}\), which are our principal components. If \(\lambda_1 \gg \lambda_2 > \lambda_3\), this indicates that the first principal component \(\overrightarrow{v_1}\) captures a significant portion of the variance in the data set, and this is reflected in the dispersion of the data along \(\overrightarrow{v_1}\). Since the eigenvalues sum to the total variance, each eigenvalue's share of that sum is the proportion of variance explained by its principal component:

\(\frac{\lambda_1}{\lambda_1 + \lambda_2 + \lambda_3}\) ~ proportion of the variance in the data explained by \(\overrightarrow{v_1}\)
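In R, eigen() returns both the eigenvalues and the eigenvectors. A small sketch, continuing with the covariance matrix S of our three example students:

```r
S <- cov(t(cbind(c(82, 77, 41), c(81, 89, 42), c(89, 94, 32))))  # as computed earlier

e <- eigen(S)             # S is symmetric, so the eigenvalues are real
e$values                  # eigenvalues, largest first
e$vectors                 # columns are the corresponding eigenvectors (principal components)
e$values / sum(e$values)  # proportion of total variance explained by each component

# Note: with only 3 students, S has rank at most 2, so the last eigenvalue is essentially zero.
```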

A reduction in dimensionality is possible if, say, the first two eigenvalues \(\lambda_1\) and \(\lambda_2\) account for most of the variance. In this case, we can project the data onto the first two principal components and plot it in the x,y plane.
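This is essentially what prcomp(), mentioned at the start, does (it uses a singular value decomposition rather than eigen(), but the results correspond). A sketch on the same toy matrix - in a real analysis X would of course hold all 20 students as columns:

```r
X <- cbind(c(82, 77, 41), c(81, 89, 42), c(89, 94, 32))

# prcomp() expects observations in rows and variables in columns, hence t(X)
pca <- prcomp(t(X), center = TRUE, scale. = FALSE)

pca$sdev^2    # variances along each PC; these equal the eigenvalues of S
pca$rotation  # loadings; the columns match the eigenvectors (up to sign)
pca$x[, 1:2]  # the data projected onto the first two principal components
summary(pca)  # proportion of variance explained by each component
```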

The eigenvectors themselves give us information about the behaviour of our data. Let's say that for our data set, the calculated eigenvalues and eigenvectors were as follows:

\(\lambda_1 = 2145.42, \lambda_2 = 621.32, \lambda_3 = 7.20\)
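A quick arithmetic check of the proportions, just plugging these numbers in:

```r
lambda <- c(2145.42, 621.32, 7.20)
round(lambda / sum(lambda), 3)  # 0.773 0.224 0.003
```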

The results suggest the first principal component \(\overrightarrow{v_1}\) accounts for \(\frac{2145.42}{2145.42 + 621.32 + 7.20} \approx 0.77\) - 77% of the variance in the data. \(\overrightarrow{v_2}\), meanwhile, accounts for about 22%, and \(\overrightarrow{v_3}\) for only about 0.3%. Now let's say our calculated corresponding eigenvectors are:

\[\mathbf{\overrightarrow{v_1}} = \left[\begin{array} {rrr} 0.72 \\ 0.75 \\ 0.23 \end{array}\right] \]

\[\mathbf{\overrightarrow{v_2}} = \left[\begin{array} {rrr} 0.32 \\ 0.45 \\ -0.32 \end{array}\right] \]

\[\mathbf{\overrightarrow{v_3}} = \left[\begin{array} {rrr} 0.32 \\ -0.42 \\ -0.21 \end{array}\right] \]

Of these, we saw that only \(\overrightarrow{v_1}\) and \(\overrightarrow{v_2}\) are of significant interest. This indicates that in our survey of academic performance, two factors are important. How do we interpret these results?

In \(\overrightarrow{v_1}\), all values in the vector have the same sign, with the first two having the larger relative magnitudes. This suggests that the first factor influencing our data is related to all 3 variables, with more weight on hours spent studying and hours spent doing homework. This factor is thus more affected by a change in these two variables.

In \(\overrightarrow{v_2}\), the third variable has a negative sign. This indicates that along the second factor influencing our data, hours spent on extracurriculars move in the opposite direction to the other two variables.

Concluding remarks

Even though PCA plots are ubiquitous in data analysis and the scientific literature, understanding the fundamentals behind PCA takes some work. Despite the long-winded definitions from statistics and linear algebra, I think the pay-off of gaining a deeper appreciation of PCA - or of any visual representation of large data sets - is worth it. A misinterpretation or a shallow glance at a data visualization can lead to dire consequences - something we can all agree we'd rather avoid!

For extra resources, I'd highly recommend this explanation, which was used in the writing of this post: http://www.math.union.edu/~jaureguj/PCA.pdf, as well as this DataCamp tutorial on PCA: https://www.datacamp.com/community/tutorials/pca-analysis-r and this article: https://towardsdatascience.com/principal-component-analysis-pca-101-using-r-361f4c53a9ff?gi=3da3a1eb4111.