Exercise 11: The beauty of kNN#

In this exercise, you’ll gain practice working with kNN. We’ll use the diamonds dataset, which ships with ggplot2 and provides information on the quality and price of nearly 54,000 diamonds.

1. Data, Plotting, and Train/Test Sets (2 pts)#


  • Load the class and tidyverse packages.

  • Assign the diamonds data set to a simpler name. Then, create a new variable price_bin that splits the price variable into a binary variable, where 1 indicates that the diamond costs more than the mean price, and 0 indicates that it does not. Set price_bin to be a factor. (Hint: use the if_else() function.)

  • Select just the carat, depth, table, x, y, and your new price_bin variables

  • Print the first few lines of the data set

  • Print the dimensions of the data set

# INSERT CODE HERE
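One possible way to complete this step is sketched below (the short name `d` is our own choice, not prescribed by the assignment):

```r
# Load packages (class for knn(), tidyverse for wrangling and ggplot2's diamonds)
library(class)
library(tidyverse)

# Assign diamonds a shorter name, binarize price, and keep only the needed columns
d <- diamonds %>%
  mutate(price_bin = factor(if_else(price > mean(price), 1, 0))) %>%
  select(carat, depth, table, x, y, price_bin)

head(d)  # first few rows
dim(d)   # number of rows and columns
```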

Plot#

Create a scatterplot of the relationship between carat and depth, and use the color aesthetic mapping to differentiate between diamonds that cost above versus below the mean price.

# INSERT CODE HERE
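A sketch of this plot, assuming the cleaned data frame from the previous step is stored as `d`:

```r
# Scatterplot of carat vs depth, colored by price class
# (alpha is our addition, just to reduce overplotting with ~54,000 points)
p <- ggplot(d, aes(x = carat, y = depth, color = price_bin)) +
  geom_point(alpha = 0.2)
p
```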

Based on the above scatterplot, how do you think kNN will perform using only these two variables to predict the price class? Which variable, carat or depth, gives us more information about which price class the diamond will belong to?

  • Write response here

Test vs Train#

Before we run kNN on these data, we need to set aside a portion of the observations as our test set. Below, randomly divide the data such that 30% are allotted to the test set and the rest are allotted to the train set. Print the first few lines of each set, and print the dimensions of each set to double check your division of the data.

set.seed(2023)

# INSERT CODE HERE
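One way to make the split, assuming the data frame is named `d` (the names `test_idx`, `train`, and `test` are our own choices):

```r
# Randomly pick 30% of row indices for the test set; the rest form the train set
test_idx <- sample(nrow(d), size = floor(0.3 * nrow(d)))
test  <- d[test_idx, ]
train <- d[-test_idx, ]

head(train)
head(test)
dim(train)
dim(test)
```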

2. kNN (3 pts)#


Now, use the knn() function from the class library to predict price_bin from carat and depth. Set k = 3.

Hint: Review the format required for the arguments of knn()

set.seed(2023)
# INSERT CODE HERE
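A sketch of the call, assuming the `train`/`test` split from above (knn() takes the numeric predictor tables for train and test, plus the training labels as `cl`):

```r
# Fit kNN with k = 3 on carat and depth only
knn_fit <- knn(train = train %>% select(carat, depth),
               test  = test %>% select(carat, depth),
               cl    = train$price_bin,
               k     = 3)
head(knn_fit)  # a factor of predicted classes, one per test row
```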

Now, output a confusion matrix and calculate the test error to evaluate model performance.

# INSERT CODE HERE
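One way to evaluate the fit, assuming the predictions are stored as `knn_fit`:

```r
# Confusion matrix: predicted classes against the true test labels
conf <- table(predicted = knn_fit, actual = test$price_bin)
conf

# Test error: proportion of test observations misclassified
mean(knn_fit != test$price_bin)
</imports>
```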

How did your model perform?

  • Write your response here

Let’s try to improve our model by adding all of the other variables in our data set as predictors. Rerun your knn() below, keeping k = 3. Again, output a confusion matrix and error rate for your updated model fit.

set.seed(2023)
# INSERT CODE HERE
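A sketch of the expanded model, reusing the earlier split (dropping only the outcome column so every other variable serves as a predictor):

```r
# Same call as before, but with all columns except price_bin as predictors
knn_all <- knn(train = train %>% select(-price_bin),
               test  = test %>% select(-price_bin),
               cl    = train$price_bin,
               k     = 3)

table(predicted = knn_all, actual = test$price_bin)
mean(knn_all != test$price_bin)  # updated test error
```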

Did your model predictions improve?

  • Write your response here

3. for loop (3 pts)#


So adding additional predictors didn’t shift our error much. Let’s see if adjusting k has a larger impact on model accuracy.

Using your initial model above with just carat and depth, write a for loop that fits the same model 30 times, once for each value of k from 1 to 30.

Output a data frame that has k and the overall error as columns.

The structure of the output data frame and for loop are provided for you below. Note that your loop will take a minute or two to run because there are so many observations in the dataset. It may be helpful while you are writing and testing your loop to run it on a subset of the data with only a handful of rows.

# this is provided
# setting up empty table to store for loop output
output <- data.frame(k = 1:30,
                     error = rep(NA, 30))
head(output)

for (k in 1:30) {
    knn_fits <- # your knn function here

    # overall error
    conf_df <- # data frame of test predictions versus actual test labels
    output$error[k] <- # calculate error from conf_df and add to your output data frame
}
head(output)
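For reference, a filled-in version of the skeleton might look like the following (it reuses the `train`/`test` split from earlier and, as noted above, can take a minute or two to run on the full data):

```r
# Refit the carat + depth model for each k from 1 to 30, storing the test error
for (k in 1:30) {
    knn_fits <- knn(train = train %>% select(carat, depth),
                    test  = test %>% select(carat, depth),
                    cl    = train$price_bin,
                    k     = k)

    # overall error
    conf_df <- data.frame(pred = knn_fits, actual = test$price_bin)
    output$error[k] <- mean(conf_df$pred != conf_df$actual)
}
head(output)
```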

Create a line plot of your output object using ggplot. Add a (non-linear) geom_smooth layer.

# INSERT CODE HERE
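A sketch of the plot, assuming the loop results are stored in `output`:

```r
# Line plot of test error against k, with a non-linear loess smoother on top
p <- ggplot(output, aes(x = k, y = error)) +
  geom_line() +
  geom_smooth(method = "loess", se = FALSE)
p
```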

Interpret your plot. What would you select as the best value of k? How much does this improve your test error?

  • Write your response here

4. Standardizing predictors (2 pts)#


Because kNN is based on distances between points, it is very sensitive to the scale of your variables. Looking at our predictors, we can see that carat and depth differ in scale by orders of magnitude. Maybe we can improve our fit even more by addressing this!

Below, use the scale() function to standardize your predictors. (Note that you don’t need to standardize price_bin.)

Then, run your model a final time with your standardized predictors (just carat and depth still). Set k to the optimal value you determined in your plot above. Output the confusion matrix and error rate again.

set.seed(2023)
#INSERT CODE HERE
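One possible sketch, reusing `d` and the earlier `test_idx` split. Here `best_k` is a placeholder value only; substitute whatever k your plot above suggested:

```r
# Standardize the two predictors (scale() centers and divides by the SD),
# then split with the same test indices used earlier
preds_scaled <- scale(d %>% select(carat, depth))

best_k <- 15  # placeholder: replace with the optimal k from your plot

knn_scaled <- knn(train = preds_scaled[-test_idx, ],
                  test  = preds_scaled[test_idx, ],
                  cl    = train$price_bin,
                  k     = best_k)

table(predicted = knn_scaled, actual = test$price_bin)
mean(knn_scaled != test$price_bin)  # final test error
```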

What impact did rescaling the data have on your error rate?

  • Write response here

DUE: 5pm March 20, 2024

IMPORTANT Did you collaborate with anyone on this assignment? If so, list their names here.

Someone’s Name