Exercise 11: The beauty of kNN#
In this exercise, you’ll gain practice working with kNN. We’ll use the diamonds dataset, which comes as part of ggplot2. This dataset provides information on the quality and price of nearly 54,000 diamonds.
1. Data, Plotting, and Train/Test Sets (2 pts)#
Load the class and tidyverse packages.
Assign the diamonds data set to a simpler name. Then, create a new variable price_bin that splits the price variable into a binary variable, where 1 indicates that the diamond costs more than the mean price, and 0 indicates that the diamond costs less than the mean price. Set price_bin to be a factor. (Hint: use the if_else() function.)
Select just the carat, depth, table, x, y, and your new price_bin variables.
Print the first few lines of the data set.
Print the dimensions of the data set.
# INSERT CODE HERE
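As a rough sketch of one way these steps could fit together (not the only valid answer): the short name dia is a hypothetical choice, and dplyr and ggplot2 are loaded directly here, although library(tidyverse) would attach both.

```r
library(dplyr)    # mutate(), if_else(), select(); part of the tidyverse
library(ggplot2)  # ships the diamonds data; part of the tidyverse

# `dia` is a hypothetical short name for the diamonds data
dia <- diamonds %>%
  mutate(
    # 1 = price above the mean, 0 = otherwise; stored as a factor
    price_bin = factor(if_else(price > mean(price), 1, 0))
  ) %>%
  select(carat, depth, table, x, y, price_bin)

head(dia)  # first few rows
dim(dia)   # number of rows, number of columns
```

Wrapping if_else() in factor() matters later, because knn() expects the class labels as a factor.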
Plot#
Create a scatterplot of the relationship between carat and depth, and use the color aesthetic mapping to differentiate between diamonds that cost above versus below the mean price.
# INSERT CODE HERE
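A minimal sketch of such a plot, assuming the data are prepared as described above. The object names dia and p, the alpha value, and the legend title are illustrative choices, not requirements of the exercise.

```r
library(dplyr)
library(ggplot2)

# rebuild the binary price variable so this block runs on its own
dia <- diamonds %>%
  mutate(price_bin = factor(if_else(price > mean(price), 1, 0)))

p <- ggplot(dia, aes(x = carat, y = depth, color = price_bin)) +
  geom_point(alpha = 0.3) +          # transparency helps with heavy overplotting
  labs(color = "Above mean price")   # legend title (illustrative)
p
```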
Based on the above scatterplot, how do you think kNN will perform using only these two variables to predict price class? Which variable, carat or depth, gives us the most information about which price class the diamond will belong to?
Write response here
Test vs Train#
Before we run kNN on these data, we need to set aside a portion of the observations as our test set. Below, randomly divide the data such that 30% are allotted to the test set and the rest are allotted to the train set. Print the first few lines of each set, and print the dimensions of each set to double check your division of the data.
set.seed(2023)
# INSERT CODE HERE
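One possible way to do the split is to sample 30% of the row indices and use them to pull out the test rows. The names dia, test_idx, train, and test are assumptions made for this sketch.

```r
library(dplyr)
library(ggplot2)

set.seed(2023)
dia <- diamonds %>%
  mutate(price_bin = factor(if_else(price > mean(price), 1, 0))) %>%
  select(carat, depth, table, x, y, price_bin)

# draw 30% of the row numbers at random for the test set
test_idx <- sample(nrow(dia), size = round(0.3 * nrow(dia)))
test  <- dia[test_idx, ]
train <- dia[-test_idx, ]   # everything that was not sampled

head(train)
head(test)
dim(train)
dim(test)
```

Because the indices are sampled without replacement and the train set is defined by negative indexing, no row can land in both sets.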
2: KNN (3 points)#
Now, use the knn() function from the class library to predict price_bin from carat and depth. Set k = 3.
Hint: Review the format required for the arguments of knn().
set.seed(2023)
# INSERT CODE HERE
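For reference, knn() takes the training predictors, the test predictors, and the training labels as three separate arguments (train, test, cl), plus k. The sketch below shows that call shape under the assumption that the data were split as described earlier; dia, train, test, and pred are hypothetical names.

```r
library(class)    # provides knn()
library(dplyr)
library(ggplot2)  # provides the diamonds data

set.seed(2023)
dia <- diamonds %>%
  mutate(price_bin = factor(if_else(price > mean(price), 1, 0))) %>%
  select(carat, depth, price_bin)

test_idx <- sample(nrow(dia), size = round(0.3 * nrow(dia)))
train <- dia[-test_idx, ]
test  <- dia[test_idx, ]

pred <- knn(train = train[, c("carat", "depth")],  # predictors only
            test  = test[, c("carat", "depth")],   # same columns, same order
            cl    = train$price_bin,               # training labels as a factor vector
            k     = 3)
head(pred)
```

Note that cl is passed with $ so that it is a plain factor vector. Subsetting a tibble with [ , "price_bin"] returns a one-column tibble instead, which knn() will reject.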
Now, output a confusion matrix and calculate the test error to evaluate model performance.
# INSERT CODE HERE
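The mechanics of a confusion matrix and an error rate can be illustrated with a tiny example. The two vectors below are made-up toy values standing in for real predictions and labels; they are not results from the diamonds data.

```r
# toy vectors for illustration only (NOT diamonds results)
actual    <- factor(c(0, 0, 1, 1, 1, 0, 1, 0))
predicted <- factor(c(0, 1, 1, 1, 0, 0, 1, 0))

# confusion matrix: rows are predictions, columns are true classes
conf_mat <- table(predicted, actual)
conf_mat

# test error = share of predictions that disagree with the truth
test_error <- mean(predicted != actual)
test_error  # 0.25 for these toy vectors (2 of 8 wrong)
```

The same error can be read off the matrix as 1 - sum(diag(conf_mat)) / sum(conf_mat).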
How did your model perform?
Write your response here
Let’s try to improve our model by adding all of the other variables in our data set as predictors. Rerun your knn() below, keeping k = 3. Again, output a confusion matrix and error rate for your updated model fit.
set.seed(2023)
# INSERT CODE HERE
Did your model predictions improve?
Write your response here
3: for loop (3 points)#
So adding additional predictors didn’t shift our error much. Let’s see if adjusting k has a larger impact on model accuracy.
Using your initial model above with just carat and depth, run a for loop that runs the same model 30 times, for k = 1:30.
Output a data frame that has k and the overall error as columns.
The structure of the output data frame and for loop are provided for you below. Note that your loop will take a minute or two to run because there are so many observations in the dataset. It may be helpful while you are writing and testing your loop to run it on a subset of the data with only a handful of rows.
# this is provided
# setting up empty table to store for loop output
output <- data.frame(k = 1:30,
                     error = rep(NA, 30))
head(output)

for (k in 1:30) {
  knn_fits <- # your knn function here

  # overall error
  conf_df <- # data frame of test predictions versus actual test
  output$error[k] <- # calculate error from conf_df and add to your output dataframe
}
head(output)
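A hedged sketch of how the loop might be filled in, run on a random subset of 2,000 rows so that it finishes quickly while testing. The subset size and the names dia, train, test, and test_idx are assumptions for this illustration, and errors from a subset will differ from those on the full data.

```r
library(class)
library(dplyr)
library(ggplot2)

set.seed(2023)
dia <- diamonds %>%
  mutate(price_bin = factor(if_else(price > mean(price), 1, 0))) %>%
  select(carat, depth, price_bin) %>%
  slice_sample(n = 2000)   # small subset, only for quick testing

test_idx <- sample(nrow(dia), size = round(0.3 * nrow(dia)))
train <- dia[-test_idx, ]
test  <- dia[test_idx, ]

output <- data.frame(k = 1:30, error = rep(NA, 30))

for (k in 1:30) {
  knn_fits <- knn(train = train[, c("carat", "depth")],
                  test  = test[, c("carat", "depth")],
                  cl    = train$price_bin,
                  k     = k)

  # pair each prediction with the true class
  conf_df <- data.frame(predicted = knn_fits, actual = test$price_bin)

  # proportion of test rows that were misclassified
  output$error[k] <- mean(conf_df$predicted != conf_df$actual)
}
head(output)
```

Once the loop behaves as expected on the subset, removing the slice_sample() step runs it on all observations.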
Create a line plot of your output object using ggplot. Add a (non-linear) geom_smooth layer.
# INSERT CODE HERE
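The plot layers could look roughly like the sketch below. So that the block runs on its own, the error column is filled with random placeholder numbers; these are not real kNN results and say nothing about which k is best.

```r
library(ggplot2)

set.seed(2023)
# placeholder errors purely to make the plot runnable; NOT real results
output <- data.frame(k = 1:30,
                     error = runif(30, min = 0.05, max = 0.10))

p <- ggplot(output, aes(x = k, y = error)) +
  geom_line() +
  # loess is a local, non-linear smoother
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(x = "k", y = "Test error")
p
```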
Interpret your plot. What would you select as the best value of k? How much does this improve your test error?
Write your response here
4: Standardizing predictors (2 points)#
Because kNN is based on distances between points, it is very sensitive to the scale of your variables. Looking at our predictor variables, we can see that carat and depth are orders of magnitude different in terms of scales. Maybe we can improve our fit even more by addressing this!
Below, use the scale() function to standardize your predictors. (Note that you don’t need to standardize price_bin.)
Then, run your model a final time with your standardized predictors (just carat and depth still). Set k to the optimal value you determined in your plot above. Output the confusion matrix and error rate again.
set.seed(2023)
# INSERT CODE HERE
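As an illustration of what scale() does, the sketch below standardizes the two predictors. It makes one particular design choice that the exercise does not prescribe: the test set is rescaled with the training set's means and standard deviations. Scaling all rows before splitting is a simpler alternative. The names dia, train_x, and test_x are hypothetical.

```r
library(ggplot2)  # only needed for the diamonds data

set.seed(2023)
dia <- diamonds[, c("carat", "depth")]
test_idx <- sample(nrow(dia), size = round(0.3 * nrow(dia)))

# scale() centers each column at 0 and divides by its standard deviation
train_x <- scale(dia[-test_idx, ])

# reuse the training means and sds so both sets share one scale
test_x <- scale(dia[test_idx, ],
                center = attr(train_x, "scaled:center"),
                scale  = attr(train_x, "scaled:scale"))

colMeans(train_x)      # approximately 0 for each column
apply(train_x, 2, sd)  # 1 for each column
```

The resulting matrices can be passed straight to knn() as its train and test arguments, with the unscaled price_bin factor as cl.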
What impact did rescaling the data have on your error rate?
Write response here
DUE: 5pm March 20, 2024
IMPORTANT Did you collaborate with anyone on this assignment? If so, list their names here.
Someone’s Name