---
title: "Principal Components Analysis"
author: "James H. Steiger"
date: "February 16, 2015"
output: html_document
---
# Introduction
Here are some examples of elementary factor analyses performed by the method of Principal Component Analysis. These examples are taken from the excellent textbook *Exploratory Multivariate Analysis by Example Using R*, by Husson, Le, and Pages.
Recall that Principal Component Analysis (PCA) is a special case of the regression component analysis system discussed in Steiger and Schonemann (1976). In this case, the components are defined to be maximum-variance linear combinations of the variables, subject to the restrictions that (a) the vector of linear weights defining each component is of unit length, (b) the components are orthogonal to each other (i.e., uncorrelated), and (c) each successive component has the largest possible variance, given restrictions (a) and (b) above.
To avoid scaling effects, variables are standardized prior to analysis. Otherwise, variables with large variance would dominate the analysis and distort the results.
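As a quick illustration of this point (a base-R sketch with simulated data; the variable names are mine, not from the text), compare the leading eigenvector of the covariance matrix with that of the correlation matrix when one variable has a much larger variance than the other:

```{r}
set.seed(7)
x1 <- rnorm(50)          # variance near 1
x2 <- 100 * rnorm(50)    # variance near 10,000
X <- cbind(x1, x2)
# Unstandardized analysis: the first eigenvector of the covariance
# matrix points almost entirely along the large-variance variable x2
round(eigen(cov(X))$vectors[, 1], 3)
# Standardized analysis: the correlation matrix treats the variables
# symmetrically (for two variables, the weights are exactly +/- 0.707)
round(eigen(cor(X))$vectors[, 1], 3)
```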
It turns out that these requirements are precisely satisfied when the vectors of linear weights are the eigenvectors of the correlation matrix $R$ corresponding to the eigenvalues $\lambda_j$ of $R$, taken in descending order.
It is easy to show that, in that situation, the variance of the $j$th principal component is equal to the corresponding eigenvalue $\lambda_j$. Moreover, since (a) the trace of a correlation matrix is equal to the sum of its eigenvalues, and is obviously equal to $p$, the number of variables, and (b) the principal components are uncorrelated, it immediately follows that the proportion of the total variance of the $p$ observed variables explained by the $j$th principal component is simply $\lambda_j/p$.
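These facts are easy to check numerically. The following base-R sketch (simulated data, independent of the analyses below) standardizes a data matrix, forms the components from the eigenvectors of $R$, and verifies that the component variances equal the eigenvalues and that the eigenvalues sum to $p$:

```{r}
set.seed(42)
raw <- matrix(rnorm(100 * 5), 100, 5) %*% matrix(runif(25), 5, 5)
X.std <- scale(raw)                  # standardize the variables
e <- eigen(cor(X.std))
p <- ncol(X.std)
pc.scores <- X.std %*% e$vectors     # the principal components
# Variance of the jth component equals the jth eigenvalue
max(abs(apply(pc.scores, 2, var) - e$values))
# Eigenvalues sum to p, so lambda_j / p is the proportion explained
sum(e$values)
round(e$values / p, 3)
```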
Recall that the dual goals of any factor analytic investigation are (a) to learn about the structure of the variables, and (b) to learn about the attributes of the objects (often persons) being studied.
Table 1.1 from Husson, et al. shows some examples of the kind of data that can be analyzed by PCA.
Let's examine some PCAs to see how this is accomplished.
# The Orange Juice Data
Here are some data on orange juices. We're going to use an R package called FactoMineR to illustrate some aspects of PCA.
We'll load in the FactoMineR library and the Oranges dataset.
```{r}
library(FactoMineR)
options(width=70)
orange <- read.csv("orange.csv",row.names=1)
```
The first 7 variables are sensory variables to be factor analyzed. The next 8 variables are *supplementary variables* that we will end up plotting in the same space as the original 7 variables. The idea behind this dual plotting is that we can learn additional information about the components by studying their relationship with variables from a different category than those factor analyzed.
There are several plots that can be constructed with FactoMineR library functions. Let's take a look. We begin by creating the PCA object, but delaying the production of graphs.
```{r}
res.pca <- PCA(orange,quanti.sup=8:15,quali.sup=16:17,graph=FALSE)
```
Generally, one has to decide how many components to retain. Sometimes it will be only one, and sometimes more than two will be required. A number of "rules of thumb" for retaining components have been proposed.
The *Kaiser-Guttman rule* states that components with eigenvalues greater than 1 should be retained. This is based on the notion that, since the sum of the eigenvalues is $p$, an eigenvalue larger than 1 represents an "above average" component.
The *scree test* examines the plot of eigenvalues. Scree is a name for the pile of rocks found at the base of a mountain. Cattell advanced the notion that a clear break should occur between the meaningful components and a long run of components with eigenvalues that "trail off" and are essentially meaningless. The components occurring before the break point are retained.
Another rule is to *stop adding components once the total variance explained exceeds some high value*, like 80\%. (The value varies.)
Use of such rules (which often disagree in their recommendations with real data) can be facilitated by a simple device called a "scree plot," a plot of the eigenvalues against the component number.
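As a sketch of how such rules can be applied programmatically (the function name and the 80\% cutoff are my own choices, and the eigenvalues are illustrative values resembling the orange juice analysis below):

```{r}
# Count components retained under the Kaiser-Guttman rule and under a
# cumulative percent-of-variance rule, given correlation-matrix eigenvalues
n.components <- function(eigenvalues, cutoff = 0.80) {
  p <- length(eigenvalues)
  kaiser <- sum(eigenvalues > 1)
  pct <- which(cumsum(eigenvalues) / p >= cutoff)[1]
  c(kaiser.guttman = kaiser, pct.rule = pct)
}
# Illustrative eigenvalues (they sum to p = 7); both rules retain 2
n.components(c(4.76, 1.33, 0.45, 0.25, 0.12, 0.06, 0.03))
```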
The scree plot generated by Husson using a bar plot strikes me as hard to read. Below is a function I pulled out of my advanced factor functions. I compute the correlation matrix of the first 7 variables, then generate a scree plot. The red dotted line corresponds to an eigenvalue of 1, and so the plot can also be used to implement the Kaiser-Guttman rule.
```{r}
Scree.Plot <- function(R, main = "Scree Plot", sub = NULL) {
  # Plot the eigenvalues of correlation matrix R against component number,
  # with a reference line at 1 for the Kaiser-Guttman rule
  roots <- eigen(R)$values
  x <- 1:dim(R)[1]
  plot(x, roots, type = "b", col = "blue", ylab = "Eigenvalue",
       xlab = "Component Number", main = main, sub = sub)
  abline(h = 1, lty = 2, col = "red")
}
R <- cor(orange[,1:7])
Scree.Plot(R,main ="Scree Plot (Husson orange Data)")
```
In this case, the rules disagree. The Kaiser-Guttman rule suggests retaining 2 components, the scree test 3, and the 80\% rule 2. We'll go with 2.
FactoMineR generates two primary PCA plots, labeled *Individuals factor map* and *Variables factor map*.
The Variables factor map presents the observed variables projected into the plane spanned by the first two principal components. This shows us the structural relationship between the variables and the components, and helps us name the components. The projection of a variable vector onto a component axis allows us to directly read the correlation between the variable and the component.
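That reading of the plot rests on the fact that the coordinate of a variable on a component axis equals its correlation with that component. This is easy to verify in base R (simulated data; a sketch, not part of the FactoMineR output):

```{r}
set.seed(1)
Z <- scale(matrix(rnorm(60 * 4), 60, 4))        # standardized variables
e <- eigen(cor(Z))
comp <- Z %*% e$vectors                         # component scores
loadings <- e$vectors %*% diag(sqrt(e$values))  # V D^(1/2)
# cor(variable, component) reproduces the loading matrix
max(abs(cor(Z, comp) - loadings))
```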
```{r}
plot(res.pca,choix="var",invisible="quanti.sup")
```
In this case, we can see that the first principal component explains about 68\% of the total variation, and the second principal component an additional 19\%. So the first two principal components explain nearly 87\% of the total variance. The proportions corresponding to those percentages are obtained by simply dividing the respective eigenvalues by $p$, the number of variables.
The first component correlates almost perfectly with the variable *Odor Typicality*, while the second component correlates very highly with *Odor Intensity* and *Pulpiness*.
The Individuals factor map is a plot of the Principal Component Scores for individuals on the first two principal components. Superimposed on the plot are mean scores on the components for qualitative (categorical) variables that are included in the PCA command.
We can see that Florida orange juice scores high on component 1, while Other juices score low. And, component 2 seems connected with the Fresh-Ambient distinction.
```{r}
plot(res.pca,choix="ind")
```
# The Decathlon Data
This dataset contains performance data from two decathlon events held in 2004. The Olympics took place in August; the Decastar event took place a month later. Performance data are given for the 10 events, as well as total decathlon points and ranking in the competition.
```{r}
data(decathlon)
Scree.Plot(cor(decathlon[,1:10]),main ="Scree Plot (Husson Decathlon Data)")
```
There are 4 eigenvalues greater than 1, and these four components account for 74.7\% of the total variation, with the first two accounting for about 50\%. We can create a table of eigenvalues by creating the PCA object and then printing the table.
```{r}
res.pca <- PCA(decathlon,quanti.sup=11:12,quali.sup=13,graph=FALSE)
round(res.pca$eig,2)
```
Let's look at the Variables plot. I've superimposed the Points and Rank variables, which are, of course, negatively correlated. Note, then, that Dimension 1 seems to represent Overall Performance. Visualizing these data is rendered a bit problematic by the fact that small values represent better performance for the running events, while large values represent better performance for the jumping and throwing events.
```{r}
plot(res.pca,choix="var")
```
Let's try reflecting all the running events to get a slightly different picture. We confirm that the percentages of variance accounted for remain the same.
```{r}
decathlon2 <- decathlon
decathlon2$`100m` <- -decathlon2$`100m`
decathlon2$`400m` <- -decathlon2$`400m`
decathlon2$`1500m` <- -decathlon2$`1500m`
decathlon2$`110m.hurdle` <- -decathlon2$`110m.hurdle`
res2.pca <- PCA(decathlon2,quanti.sup=11:12,quali.sup=13,graph=FALSE)
round(res2.pca$eig,2)
```
Let's look at the variables plot again.
```{r}
plot(res2.pca,choix="var")
```
Now we get a much more unified picture. Dimension 1 represents overall athletic performance, while Dimension 2 represents Speed vs. Strength.
Plotting individuals on these dimensions allows us to characterize an athlete's overall ability, plus whether he is more a speed or a strength athlete.
```{r}
plot(res2.pca,choix="ind")
```
We see that apparently Drews is unbalanced toward Speed/Endurance, and therefore overall only an average performer, while Karpov is an outstanding performer who is balanced on the Speed/Strength continuum. Showing the means of the Olympics and Decastar points, we see that performance was better at the Olympics.
Below, we show how to calculate principal component loadings from scratch.
```{r}
# Eigendecomposition of the (reflected) correlation matrix
eig <- eigen(cor(decathlon2[,1:10]))
V <- eig$vectors[,1:2]        # weights for the first two components
D <- diag(eig$values[1:2])    # eigenvalues on the diagonal
D.half <- sqrt(D)             # D^(1/2)
F <- V %*% D.half             # loadings: F = V D^(1/2)
rownames(F) <- colnames(decathlon2)[1:10]
F
D
```
Notice that the first column of the factor pattern is reflected. How can you fix that?
```{r}
F[,1] <- -F[,1]
F
```