Exercise 12: Cross validation


In this exercise, we’ll practice implementing cross validation techniques, including leave-one-out and k-fold cross validation. We’ll use the PimaIndiansDiabetes2 practice dataset, which has medical data on a group of Pima Native American women, including whether or not they have diabetes. This dataset is part of the mlbench package. We’ll be using each person’s medical history to predict whether or not they have been diagnosed with diabetes.

1. Data (1 pt)


Load the tidyverse, boot, and mlbench packages (you may need to install boot and mlbench).

Load the PimaIndiansDiabetes2 dataset using the data() function. Drop the insulin column (it just has a lot of missing data) and then drop NAs from the rest of the dataset. Save your updated dataset to a new variable name. Finally, print the dimensions of your new dataset, and look at the first few lines of data.

# INSERT CODE HERE
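
If you're unsure where to start, here is one possible sketch. It assumes you save the cleaned data as dat (the name used in the loop scaffold in Part 2); any name works as long as you use it consistently.

library(tidyverse)
library(boot)
library(mlbench)

data(PimaIndiansDiabetes2)

# drop the insulin column, then drop rows with any remaining NAs
dat <- PimaIndiansDiabetes2 %>%
    select(-insulin) %>%
    drop_na()

dim(dat)
head(dat)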

(Note that in medical contexts, pedigree refers to a system of measuring family history of a condition. So here, higher numbers mean a greater family history of diabetes. You can read more about this dataset in the mlbench documentation by running ?PimaIndiansDiabetes2.)

2. Leave-one-out Cross Validation (4 pts)

In the tutorial, we learned how to perform leave-one-out cross validation (LOOCV) using the cv.glm function from the boot package. But we can also do this manually using predict(), as we have in the past.

Let’s predict diabetes, a dichotomous outcome, using all the other variables in our modified dataset.

First, fit a logistic regression model using all of the observations except the very first one. Then use your fitted model to predict whether your holdout case is positive or negative for diabetes. Remember that logistic regression predictions are on the log-odds scale: if a predicted value is positive, the probability of the outcome is greater than 50%; if it is negative, the probability of the outcome is less than 50%.

Compare your result to the actual response in row one above. Did your model correctly classify this observation?

# INSERT CODE HERE
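
One possible sketch, again assuming the cleaned data frame is named dat. Note that predict() with type = "link" returns predictions on the log-odds scale.

# fit the model on every observation except the first
fit1 <- glm(diabetes ~ ., data = dat[-1, ], family = binomial)

# predict the held-out first observation on the log-odds scale
log_odds1 <- predict(fit1, newdata = dat[1, ], type = "link")

# positive log-odds -> probability > 50% -> classify as "pos"
ifelse(log_odds1 > 0, "pos", "neg")

# compare to the actual response in row one
dat$diabetes[1]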

So we just calculated a single iteration of LOOCV: we used the other 531 rows of our data to fit a model and predict the outcome of the first row.

Below, use a for loop to repeat this procedure for every row of your dataset. You will need to:

  • Create a data frame results with two columns: one named actual, which holds the true classification for each observation, and one named predicted, which should be filled with NAs. This is where you’ll store the output of your loop.

  • Create a loop that runs through each row of your data, pulls that observation out, trains your model on the remaining data, and then tests the fitted model on your test observation.

  • Store your model predictions (“pos” or “neg” – not the log-odds) in the predicted column of your results data frame.

After you run your loop, print the first few lines of results. (If you get stuck, a filled-in sketch of the loop follows the scaffold below.)

# Initialize `results` data frame
# INSERT CODE HERE

# for loop
for (i in 1:nrow(dat)){ # don't forget to change this to your dataset's name
    # separate individual observation `i` from the rest of your data
    # INSERT CODE HERE
    
    # train your model
    # INSERT CODE HERE
    
    # test model on hold out observation
    # INSERT CODE HERE
    
    # classify model prediction as "pos" or "neg" and add to `results`
    # INSERT CODE HERE
    
}
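
Here is one way to fill in the scaffold, assuming your data frame is named dat (the helper names train, test, fit, and log_odds are just illustrative):

# initialize `results`: actual classifications plus a column of NAs
results <- data.frame(actual = dat$diabetes,
                      predicted = NA)

for (i in 1:nrow(dat)){
    # separate individual observation `i` from the rest of the data
    test  <- dat[i, ]
    train <- dat[-i, ]
    
    # train the model on the remaining observations
    fit <- glm(diabetes ~ ., data = train, family = binomial)
    
    # test the fitted model on the held-out observation (log-odds scale)
    log_odds <- predict(fit, newdata = test, type = "link")
    
    # classify the prediction as "pos" or "neg" and store it
    results$predicted[i] <- ifelse(log_odds > 0, "pos", "neg")
}

head(results)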

Now, calculate the overall error of your model. What proportion of cases were incorrectly classified?

# INSERT CODE HERE
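
With results structured as above, the error is just the proportion of rows where the prediction and the true classification disagree, e.g.:

# proportion of observations classified incorrectly
mean(results$actual != results$predicted)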

3. Compare to cv.glm (3 pts)

Now, let’s compare this result to the cv.glm function. With the tutorial as a guide, use cv.glm to run LOOCV on the data with the same model (i.e., still using all of the variables to predict diabetes diagnosis).

Note that, because this is a classification problem and not a regression problem like in the tutorial, we need to adjust the cost argument of cv.glm. We can read more about this in the docs:

#?cv.glm

Here, we see cost is defined as:

“A function of two vector arguments specifying the cost function for the cross-validation. The first argument to cost should correspond to the observed responses and the second argument should correspond to the predicted or fitted responses from the generalized linear model.”

In the example code (scroll to the bottom of the docs), we see that the appropriate cost function for binary classification is

cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)

Here, r is the vector of observed responses (technically “pos” and “neg”, but R treats these as 1 and 0 under the hood), and pi is the vector of probabilities (not log-odds) fit by the model. Thus, this boils down to our error: the proportion of observations that were incorrectly classified. You will need to include this code below.

# INSERT CODE HERE
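
A sketch of what this might look like. Note that cv.glm needs a model fit on the full dataset, and it defaults to LOOCV (K = n) when K is not specified; fit_all is an illustrative name.

# cost function for binary classification, from the cv.glm docs
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)

# fit the logistic regression on the full dataset
fit_all <- glm(diabetes ~ ., data = dat, family = binomial)

# LOOCV; delta[1] holds the raw cross-validated error estimate
loocv <- cv.glm(dat, fit_all, cost = cost)
loocv$delta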

How do your results compare to your manual LOOCV above?

  • Write response here

4. Adjusting K and Reflection (2 pts)

Recall that LOOCV has some drawbacks. In particular, it has quite high variance, which can lead to poor performance on new test data. We can reduce this variance by holding out more than one observation per fold, i.e., by using k-fold cross validation with K smaller than n.

Below, re-run your cross validation using cv.glm with K set to 3, 5, 10, and 15.

set.seed(1)
# INSERT CODE BELOW

# K = 3


# K = 5


# K = 10


# K = 15
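
One way to fill in the scaffold above, reusing cost and fit_all from Part 3 (delta[1] holds the raw k-fold error estimate):

set.seed(1)

# K = 3
cv.glm(dat, fit_all, cost = cost, K = 3)$delta[1]

# K = 5
cv.glm(dat, fit_all, cost = cost, K = 5)$delta[1]

# K = 10
cv.glm(dat, fit_all, cost = cost, K = 10)$delta[1]

# K = 15
cv.glm(dat, fit_all, cost = cost, K = 15)$delta[1]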

Reflection

How do your errors compare to your LOOCV error above? How do they change as K increases?

  • Write response here

If you change the random seed above, you’ll get slightly different errors. If you were to do the same with your LOOCV above, would you expect to get different results each time? Why or why not?

  • Write response here

DUE: 5pm March 25, 2024

IMPORTANT: Did you collaborate with anyone on this assignment? If so, list their names here.

Someone’s Name