Tutorial: Basic PCA approaches

This lab focuses on principal component analysis and regression as well as partial least squares regression in R.

Goals:

  • Understand principal component analysis, and use it for visualization

  • Implement principal component regression using pcr() in the pls library

  • Implement partial least squares using plsr() in the pls library

This lab draws from the practice sets at the end of Chapter 6 of James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). "An Introduction to Statistical Learning: with Applications in R."


Principal component analysis: the big picture

Principal component analysis (PCA) is a handy way to decrease the dimensionality of your data. PCA helps you identify the dimensions along which your data varies the most, so you can capture the important variance while also simplifying.

This can be a bit hard to get a physical intuition for, so if you’d like more discussion about how to think about PCA and dimensionality reduction more generally, I’d recommend checking out this video from StatQuest.
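If you'd like to see the idea on something even smaller, here is a minimal simulated sketch (not part of the lab itself): with two strongly correlated variables, the first principal component lines up with the direction the point cloud is stretched along and carries most of the variance.

# Minimal simulated illustration (hypothetical data, not part of the lab)
set.seed(123)
x <- rnorm(100)
y <- x + rnorm(100, sd = 0.3)            # y is strongly correlated with x
toy.pca <- prcomp(cbind(x, y), scale. = TRUE)
toy.pca$rotation                          # PC1 weights x and y about equally
toy.pca$sdev^2 / sum(toy.pca$sdev^2)      # PC1 captures the vast majority of the variance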

Before we move on to PCR and PLS, though, let's walk through a quick example of applying PCA in a simple context. In this example, we'll focus on PCA as a tool for visualization using the iris dataset. This data set reports petal and sepal lengths and widths for three different species of iris flowers.

head(iris)
A data.frame: 6 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <fct>
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

First, let’s convince ourselves that there is some redundant information in this data set. How correlated are the four quantitative variables?

cor(iris[,1:4])
A matrix: 4 × 4 of type dbl
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

It looks like Petal.Width and Petal.Length are highly correlated with each other, and they’re also both correlated with Sepal.Length.

Let's also plot how the three species are distributed with respect to these variables.

require(ggplot2)
ggplot(iris,aes(x=Petal.Length, y = Sepal.Length, col=Species)) + geom_point() + theme_light()
ggplot(iris,aes(x=Petal.Width, y = Sepal.Width, col=Species)) + geom_point() + theme_light()
[Figures: scatter plots of Sepal.Length vs. Petal.Length and Sepal.Width vs. Petal.Width, colored by Species]

There are other combinations of variables we could plot, of course, but this is enough to show that all of these variables are providing some information about how these species differ.

How can we decrease the dimensionality of this data from 4 variables to 2 without losing too much information? We’ll use the prcomp() function to do this.

iris.pca <- prcomp(iris[,1:4],scale. = TRUE) 
# ^^ "scale." just indicates that all variables should be scaled before PCA is applied, so that units don't matter

The first thing we may want to look at is how much variance is explained by each of the components. The sdev attribute gives the square root of the eigenvalue (i.e., the standard deviation) of each component; squaring these values gives the variance along each component, and dividing by their sum turns them into proportions of the total variance.

ggplot(data.frame(pc=1:4,var.exp=iris.pca$sdev^2/sum(iris.pca$sdev^2)),
        aes(x=pc,y=var.exp)) + 
    geom_point() + 
    theme_light()
[Figure: proportion of variance explained by each principal component]

As you can see, the first and second components together explain over 95% of the variance present in the original 4 variables.
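As a quick cross-check, summary() on a prcomp object reports the same information directly:

summary(iris.pca) # the "Cumulative Proportion" row shows how much variance PC1 and PC2 capture together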

Let’s also look at how much each of the variables contributes to each PC.

iris.pca$rotation
A matrix: 4 × 4 of type dbl
                     PC1         PC2        PC3        PC4
Sepal.Length   0.5210659 -0.37741762  0.7195664  0.2612863
Sepal.Width   -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length   0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width    0.5648565 -0.06694199 -0.6342727  0.5235971

PC1 seems to be drawing almost evenly from Sepal.Length, Petal.Length, and Petal.Width - remember that these were quite correlated in the original data. PC1 seems to be picking up on the common variability across those three variables. PC2, on the other hand, is drawing most heavily from Sepal.Width. Let’s also look at the coordinates of each observation in iris in the new coordinate system we’ve generated.

head(iris.pca$x)
A matrix: 6 × 4 of type dbl
        PC1        PC2         PC3          PC4
1 -2.257141 -0.4784238  0.12727962  0.024087508
2 -2.074013  0.6718827  0.23382552  0.102662845
3 -2.356335  0.3407664 -0.04405390  0.028282305
4 -2.291707  0.5953999 -0.09098530 -0.065735340
5 -2.381863 -0.6446757 -0.01568565 -0.035802870
6 -2.068701 -1.4842053 -0.02687825  0.006586116

This matrix gives us the coordinates of every observation (i.e., each flower measured) with respect to each principal component. Instead of “How long is the sepal of flower A?” or “How wide is the petal of flower B?”, we’re now asking things like “Where does flower A fall along principal component 1?”
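If it helps to see the linear algebra, the scores in iris.pca$x are just the centered and scaled data multiplied by the rotation (loadings) matrix. Here's a quick sanity check (a sketch; scale() centers and scales each column the same way prcomp() did here):

# Reproduce the PC scores by hand: scale the data, then project onto the loadings
scores.manual <- scale(iris[,1:4]) %*% iris.pca$rotation
head(scores.manual)                                   # should match head(iris.pca$x)
all.equal(unname(scores.manual), unname(iris.pca$x))  # TRUE, up to floating-point error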

Finally, let’s look at how well we can visualize the species clusters using PC1 and PC2, instead of our original variables.

ggplot(data.frame(PC1=iris.pca$x[,1],PC2=iris.pca$x[,2],Species=iris$Species),
        aes(x=PC1,y=PC2,col=Species)) + 
    geom_point() + 
    theme_light()
[Figure: iris observations plotted on PC1 vs. PC2, colored by Species]

We can see that the first two PCs do a pretty good job of capturing the variance across the four original variables. Compare this plot to the two original scatter plots. Reason about why it looks the way it does, given how the first two PCs drew from the original variables.
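Optionally, base R's biplot() function overlays the variable loadings (as arrows) on the PC1 vs. PC2 score plot, which can help with that reasoning:

biplot(iris.pca, scale = 0) # arrows show how each original variable loads on PC1 and PC2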


Principal component regression

Now that we’ve got a firmer grasp on what PCA is doing generally, let’s try out principal component regression using the pcr() function from the pls library. We’ll use the Hitters dataset for this example.

# install.packages("pls") # Uncomment if not installed
library(pls) # load for the pcr() and plsr() functions
# install.packages("ISLR") # Uncomment if not installed
library(ISLR) # For Hitters dataset
library(tidyverse)
hit.dat <- Hitters %>% drop_na() # get rid of missing values in Hitters dataset.

Let's apply PCR to the Hitters dataset, trying to predict Salary. The scale argument indicates that we want to scale the variables prior to dimensionality reduction, and setting validation to "CV" triggers 10-fold cross-validation to evaluate performance for each number of PCs.

set.seed(2)
pcr.fit=pcr(Salary~., data=hit.dat ,scale=TRUE, validation ="CV")
summary(pcr.fit)
Data: 	X dimension: 263 19 
	Y dimension: 263 1
Fit method: svdpc
Number of components considered: 19

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV             452    351.9    353.2    355.0    352.8    348.4    343.6
adjCV          452    351.6    352.7    354.4    352.1    347.6    342.7
       7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
CV       345.5    347.7    349.6     351.4     352.1     353.5     358.2
adjCV    344.7    346.7    348.5     350.1     350.7     352.0     356.5
       14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
CV        349.7     349.4     339.9     341.6     339.2     339.6
adjCV     348.0     347.7     338.2     339.7     337.2     337.6

TRAINING: % variance explained
        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
X         38.31    60.16    70.84    79.03    84.29    88.63    92.26    94.96
Salary    40.63    41.58    42.17    43.22    44.90    46.48    46.69    46.75
        9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
X         96.28     97.26     97.98     98.65     99.15     99.47     99.75
Salary    46.86     47.76     47.82     47.85     48.10     50.40     50.55
        16 comps  17 comps  18 comps  19 comps
X          99.89     99.97     99.99    100.00
Salary     53.01     53.85     54.61     54.61

The % variance explained section of the output above shows the cumulative variance explained as each additional PC is included, reaching 100% after 19 PCs. This makes sense: there were 19 predictor variables to start with (not counting Salary, our dependent variable), and unless some variables contribute no information at all, you need as many PCs as original variables to explain all of the variance. The first 5 PCs explain about 84% of the variance in the predictors.
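If you'd rather pull these numbers out programmatically than read them off the summary, the pls package's explvar() function returns the percentage of predictor (X) variance explained by each component; a quick sketch:

explvar(pcr.fit)          # % of X variance explained by each component
cumsum(explvar(pcr.fit))  # cumulative % of X variance explained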

The VALIDATION section of the output above shows the cross-validated root mean squared error (RMSEP) for each number of components used (squaring these values gives the more familiar mean squared error). We can also visualize this using the validationplot() function.

validationplot(pcr.fit,val.type="MSEP") # MSEP shows cross-validated mean-squared error as error metric. 
[Figure: validation plot of cross-validated MSEP vs. number of components for the full-data PCR fit]

The lowest cross-validated error occurs at around 16 to 18 PCs, but that doesn't do much better than just 6 PCs.
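Rather than eyeballing the plot, you can pull the cross-validated errors out numerically with the pls package's MSEP() function. This is a sketch: it assumes the returned val array lists the intercept-only model first, hence the subtraction of 1.

cv.mse <- MSEP(pcr.fit, estimate = "CV")   # cross-validated MSE for each model size
which.min(cv.mse$val["CV", 1, ]) - 1       # number of components with the lowest CV error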

Let’s try this again, fitting and cross-validating PCR with a training subset and then testing it on the remaining data.

set.seed(1)
train=sample(1:nrow(hit.dat), nrow(hit.dat)/2) # Identify train observations
test=(-train) # Identify test observations
y.test=hit.dat$Salary[test] # Identify test dependent variable values
x.test=select(hit.dat,-Salary)[test,] # Identify test independent variable values
pcr.fit=pcr(Salary~., data=hit.dat,subset=train,scale=TRUE, validation ="CV") # Run cross-validated PCR fit on training set
validationplot(pcr.fit,val.type="MSEP") # plot CV MSE for num of PCs
[Figure: validation plot of cross-validated MSEP vs. number of components for the training-set PCR fit]

It looks like, for the training set, the best number of PCs is 5. How does that look on the test set?

# calculate prediction error on test set 
pcr.pred=predict(pcr.fit,x.test,ncomp=5)
mean((pcr.pred-y.test)^2)
142811.810376319

Now that we've got a good model selected (with M = 5 principal components), let's fit it to the whole dataset.

pcr.fit = pcr(Salary~., data=hit.dat, scale=TRUE, ncomp=5) # fit PCR to the full data set using our selected number of PCs, M = 5
summary(pcr.fit)
Data: 	X dimension: 263 19 
	Y dimension: 263 1
Fit method: svdpc
Number of components considered: 5
TRAINING: % variance explained
        1 comps  2 comps  3 comps  4 comps  5 comps
X         38.31    60.16    70.84    79.03    84.29
Salary    40.63    41.58    42.17    43.22    44.90

Partial least squares

The function for performing partial least squares is plsr(), also in the pls package. Let’s apply cross-validated PLS to the Hitters training dataset, again trying to predict Salary.

set.seed(1)
pls.fit <- plsr(Salary~.,data = hit.dat, subset=train, scale=TRUE, validation="CV")
summary(pls.fit)
Data: 	X dimension: 131 19 
	Y dimension: 131 1
Fit method: kernelpls
Number of components considered: 19

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV           428.3    325.5    329.9    328.8    339.0    338.9    340.1
adjCV        428.3    325.0    328.2    327.2    336.6    336.1    336.6
       7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
CV       339.0    347.1    346.4     343.4     341.5     345.4     356.4
adjCV    336.2    343.4    342.8     340.2     338.3     341.8     351.1
       14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
CV        348.4     349.1     350.0     344.2     344.5     345.0
adjCV     344.2     345.0     345.9     340.4     340.6     341.1

TRAINING: % variance explained
        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
X         39.13    48.80    60.09    75.07    78.58    81.12    88.21    90.71
Salary    46.36    50.72    52.23    53.03    54.07    54.77    55.05    55.66
        9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
X         93.17     96.05     97.08     97.61     97.97     98.70     99.12
Salary    55.95     56.12     56.47     56.68     57.37     57.76     58.08
        16 comps  17 comps  18 comps  19 comps
X          99.61     99.70     99.95    100.00
Salary     58.17     58.49     58.56     58.62

Interestingly, the lowest cross-validated RMSEP occurs when only one partial least squares direction is used. Let's look at how a model with only one PLS direction performs on the test data.

pls.pred = predict(pls.fit,x.test,ncomp=1)
mean((pls.pred - y.test)^2)
151995.259555806

This is higher than the test error we got using 5 principal components in the PCR example above, but still comparable. Now let’s look at PLS for the full data set, using M = 1.

pls.fit = plsr(Salary~.,data=hit.dat,scale=TRUE,ncomp=1)
summary(pls.fit)
Data: 	X dimension: 263 19 
	Y dimension: 263 1
Fit method: kernelpls
Number of components considered: 1
TRAINING: % variance explained
        1 comps
X         38.08
Salary    43.05

Note that the amount of variance in Salary explained by the one-component PLS fit, 43.05%, is similar to that explained by four principal components, 43.22%, in the PCR example. This is because, while PCR is only trying to maximize variance explained in the predictor variables, PLS is trying to maximize variance explained in both the predictors and the response.
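If you want to see this contrast directly, one quick check (a sketch using the pls package's R2() helper) is to fit one-component versions of both models on the full data and compare how much of the variance in Salary each first component captures:

pcr.1 <- pcr(Salary~., data=hit.dat, scale=TRUE, ncomp=1)
pls.1 <- plsr(Salary~., data=hit.dat, scale=TRUE, ncomp=1)
R2(pcr.1, estimate="train") # first PCR component: chosen to explain predictor variance only
R2(pls.1, estimate="train") # first PLS component: chosen with Salary in mind, so it explains more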

Notebook authored by Patience Stevens and edited by Amy Sentis.