Exercise 18: Principal component methods
This homework assignment is designed to build your intuition for how principal component approaches reveal high-dimensional statistical relationships.
Like earlier homework, you will need to download the unrestricted_trimmed_1_7_2020_10_50_44.csv file from the Homework/hcp_data folder in the class GitHub repository.
This data is a portion of the Human Connectome Project database. It provides cognitive task measures and brain morphology measurements from 1206 participants. The full description of each variable is provided in the HCP_S1200_DataDictionary_April_20_2018.csv file in the Homework/hcp_data folder in the class GitHub repository.
1. Loading data (1 point)
We are going to look for low-dimensional relationships between brain volume measures and performance on the Flanker task.
First, we will need to load the pls, tidyverse, and ggplot2 libraries for this assignment.
# WRITE YOUR CODE HERE
Use the read.csv function to load data from the unrestricted_trimmed_1_7_2020_10_50_44.csv file in the hcp_data folder.
(a) Using the tidyverse tools, make a new dataframe d1 that only includes the Flanker Task performance (Flanker_Unadj) and all freesurfer volume measures for the right and left hemispheres together. Remove both rows with missing values (NAs) and any columns that consist of only zeros.
Hint: Look up the ends_with function to select only the variables that end with “_Vol”.
Use the head function to look at the first few rows of each data frame.
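As a rough illustration of the hint, the sketch below applies ends_with to a small made-up data frame. The data frame toy and its column names (Score, A_Vol, B_Vol, A_Area) are hypothetical and are not part of the HCP data; this only shows the general pattern, assuming dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical toy data frame; the column names are made up for illustration
toy <- data.frame(
  Score  = c(10, 12, NA, 15),
  A_Vol  = c(1.0, 2.0, 3.0, 4.0),
  B_Vol  = c(0, 0, 0, 0),
  A_Area = c(5, 6, 7, 8)
)

toy_vol <- toy %>%
  select(Score, ends_with("_Vol")) %>%  # keep Score plus columns ending in "_Vol"
  na.omit() %>%                         # drop rows with any missing value
  select(where(~ any(.x != 0)))         # drop columns that are all zeros

head(toy_vol)
```

Here the all-zero column is dropped after the rows with missing values, so a column that is zero only in the remaining rows is also removed.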
# WRITE YOUR CODE HERE
2. Correlational structure (4 points)
(a) Take a look at the correlation between all of the freesurfer volume measures (“FS_”) using the cor function. Create a new variable called fs_cor that is the correlation matrix for only the freesurfer volumes.
# WRITE YOUR CODE HERE
(b) Load (and install locally if needed) the reshape2 library in order to use the melt function on the new fs_cor object. Use head to show the new, melted fs_cor object.
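To see what melt does to a correlation matrix, here is a minimal sketch on a built-in dataset rather than the HCP data. The object names toy_cor and toy_long are made up for this example.

```r
library(reshape2)

# Toy 3 x 3 correlation matrix from a built-in dataset (not the HCP data)
toy_cor <- cor(mtcars[, c("mpg", "hp", "wt")])

# melt() turns the matrix into long format: one row per (row, column) pair
toy_long <- melt(toy_cor)
head(toy_long)
```

The long format has one column for each matrix index and one for the correlation value, which is the layout ggplot2 expects for a tile plot.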
# WRITE YOUR CODE HERE
Plot the correlation as a heatmap using ggplot2.
Hint: use the scale_fill_gradient2 function to scale the colors between red and blue, capping the values at -1 and +1.
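One possible way to set up such a color scale is sketched below on a tiny hand-built table. The data frame toy_long and its values are hypothetical, and the choice of white as the midpoint color is an assumption, not a requirement of the assignment.

```r
library(ggplot2)

# Toy long-format correlation table, built by hand (hypothetical values)
toy_long <- data.frame(
  Var1  = c("a", "a", "b", "b"),
  Var2  = c("a", "b", "a", "b"),
  value = c(1.0, -0.4, -0.4, 1.0)
)

# geom_tile() draws one cell per row; limits = c(-1, 1) caps the color scale
p <- ggplot(toy_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1))
print(p)
```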
# WRITE YOUR CODE HERE
What patterns do you see in the correlations?
Write your response here *
3. Principal component analysis (3 points)
Let’s see how many principal components explain at least 95% of the variance in the data.
(a) Create a new object called fs_d.pca using the princomp function (do not forget to scale the data).
# WRITE YOUR CODE HERE
(b) Calculate the cumulative variance explained (not the variance explained by each component individually, as in the tutorial) across the principal components and plot the results using ggplot.
# WRITE YOUR CODE HERE
(c) Determine exactly how many principal components explain at least 95% of the variance.
Hint: Look up the which function.
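The general pattern behind this hint can be sketched on a built-in dataset (USArrests, not the HCP data). The names toy_pca, var_explained, cum_var, and n_comp are made up for illustration.

```r
# Toy example on a built-in dataset (not the HCP data)
toy_pca <- princomp(scale(USArrests))

# Proportion of variance per component, then the running total
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cum_var <- cumsum(var_explained)

# which() returns the indices where the condition holds; the first one
# is the smallest number of components that reaches the threshold
n_comp <- which(cum_var >= 0.95)[1]
n_comp
```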
# WRITE YOUR CODE HERE
What does this tell you about the underlying dimensionality of the brain volume measures?
Write your response here *
4. Associating with Flanker task performance (4 points)
(a) Now apply PCR to the d1 object you created at the beginning (which includes the Flanker task scores) to find how freesurfer volumes predict Flanker task performance. Set the random seed to “2”. Use cross-validation as the validation type and don’t forget to scale your data. Show the summary of the model fit.
Hint: If you receive an error applying the “scale=TRUE” flag, then you likely still have columns of all zeros somewhere in your data table.
# WRITE YOUR CODE HERE
(b) Use the validationplot function to evaluate the bias-variance tradeoff using the cross-validated mean squared error for each component.
# WRITE YOUR CODE HERE
(c) Extract the MSEP values of the cross-validated fit (“CV” not “adjCV”) from the model object using the MSEP function. Create a new array of these values. Use the drop function to remove the singleton dimension (the original array is 2x1x53 and we want a 2x53 object). Find the minimum value of the first row (the “CV”).
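For intuition about drop, the sketch below uses a small made-up array with a singleton middle dimension. The names toy and flat and the 2x1x5 shape are hypothetical stand-ins for the larger array described above.

```r
# Toy array with a singleton middle dimension, shaped 2 x 1 x 5
toy <- array(1:10, dim = c(2, 1, 5))
dim(toy)

# drop() removes every dimension of length 1, leaving a 2 x 5 matrix
flat <- drop(toy)
dim(flat)

min(flat[1, ])        # smallest value in the first row
which.min(flat[1, ])  # position of that value within the row
```

Note that which.min returns a position in the row, so if the first column corresponds to a model with zero components, the position and the component count differ by one.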
# WRITE YOUR CODE HERE
What does this plot tell you about how many components best explain variance in Flanker task performance?
Write your response here *
5. Reflection (2 points)
Compare the number of components that explain variance in X alone (the brain volumes) to the number of components that explain performance in the Flanker task. What does the difference in these two numbers tell you about how variation in brain volumes relates to task performance?
Write your response here
DUE: 5pm EST, April 15, 2024
IMPORTANT Did you collaborate with anyone on this assignment? If so, list their names here.
Someone’s Name