# Exercise 9: Classification
This homework assignment is designed to give you practice with classification models. We’ll try to predict which words are more likely to be responded to correctly during a lexical decision task, based on their length and frequency.
We will be using data from the English Lexicon Project again. However, this time we will use response correctness as our dependent variable. Load LexicalData_withIncorrect.csv, which includes incorrect trials as well as correct ones, and also Items.csv. Both can be found in the Homework/lexDat folder in the class GitHub repository.
This data is a subset of the English Lexicon Project database. It provides response correctness and reaction times (in milliseconds) of many subjects as they are presented with letter strings and asked to decide, as quickly and as accurately as possible, whether the letter string is a word or not. The Items.csv file provides characteristics of the words used, namely frequency (how common is this word?) and length (how many letters?).
Data courtesy of Balota, D.A., Yap, M.J., Cortese, M.J., Hutchison, K.A., Kessler, B., Loftis, B., Neely, J.H., Nelson, D.L., Simpson, G.B., & Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39, 445-459.
## 1. Loading and formatting the data (1 point)
Load in data from the LexicalData_withIncorrect.csv and Items.csv files. Use `left_join` to add the word characteristics `Length` and `Log_Freq_HAL` from Items to the lexical data, and use `drop_na()` to get rid of any observations with missing values. Then use `head()` to look at the first few rows of the data.

Note: We're just working with `Correct` in this homework, so there's no need to worry about reformatting reaction times.
```r
# WRITE YOUR CODE HERE
```
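In case it helps to see the shape of one possible approach, here is a minimal sketch. It assumes the CSV files sit in `Homework/lexDat/` relative to your working directory, and that the trial file names the letter string `D_Word` while Items calls it `Word` (check your own column names and adjust); `lex_dat` and `items` are just placeholder names.

```r
library(tidyverse)

# read in the trial-level data and the word characteristics
# (placeholder names; adjust the paths to wherever you keep the repo)
lex_dat <- read_csv("Homework/lexDat/LexicalData_withIncorrect.csv")
items   <- read_csv("Homework/lexDat/Items.csv")

# join Length and Log_Freq_HAL onto the trial data, then drop missing rows;
# the join keys here are an assumption -- check your columns
lex_dat <- lex_dat %>%
  left_join(dplyr::select(items, Word, Length, Log_Freq_HAL),
            by = c("D_Word" = "Word")) %>%
  drop_na()

head(lex_dat)
```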
## 2. Visualizing the data (1 point)
First, we’ll try to visualize whether trials that are responded to correctly versus incorrectly differ from each other in terms of word length and log frequency. The code is included below, so that this homework doesn’t get too cumbersome. All you have to do is change the name of the data set, run the code, and write some observations about the output.
```r
require(tidyverse) # load the tidyverse package, if you haven't yet

fdata$Correct <- as.factor(fdata$Correct) # so that R knows that Correct is categorical, not numeric

# plot the Correct / Incorrect clusters
ggplot(fdata, aes(x = round(Log_Freq_HAL, 1), y = Length, col = Correct)) +
  geom_point(position = "jitter", alpha = 0.5) +
  theme_light()
```
What do you observe about the “Correct” and “Incorrect” clusters?
Write your response here
## 3. Logistic Regression: Fitting the model (2 points)
Fit a logistic regression model to the data using `Length`, `Log_Freq_HAL`, and their interaction to predict `Correct`. Use `glm()` to fit the model, and look at its output using `summary()`.
```r
# WRITE YOUR CODE HERE
```
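A sketch of the model call, assuming the joined data frame from question 1 is named `lex_dat` (a placeholder): in R's formula syntax, `Length * Log_Freq_HAL` expands to both main effects plus their interaction, and `family = binomial` makes `glm()` fit a logistic regression.

```r
# logistic regression of Correct on Length, Log_Freq_HAL, and their interaction
correct_glm <- glm(Correct ~ Length * Log_Freq_HAL,
                   data = lex_dat, family = binomial)

summary(correct_glm)
```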
What can you conclude from this output? (a brief gist is fine)
Write your response here
## 4. Interpreting predictions from the model (3 points)
Finally, look at how well this logistic regression model does at predicting correctness. Use `predict()` and a threshold of 0.5 to generate predicted `Correct` values for each trial, then output a confusion matrix and overall accuracy for these predictions.
Hint: see the Classifiers tutorial.
```r
# WRITE YOUR CODE HERE
```
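One way to do this, sketched under the assumption that the fitted model from question 3 is stored in `correct_glm` (a placeholder name): `predict()` with `type = "response"` returns predicted probabilities, which the 0.5 threshold turns into 0/1 predictions.

```r
# predicted probability of a correct response for every trial
pred_prob <- predict(correct_glm, type = "response")

# apply the 0.5 threshold to turn probabilities into 0/1 predictions
pred_correct <- ifelse(pred_prob > 0.5, 1, 0)

# recover the actual values as 0/1 numbers (works whether Correct is
# still numeric or was turned into a factor in question 2)
actual <- as.numeric(as.character(lex_dat$Correct))

# confusion matrix: rows are predictions, columns are actual values
table(predicted = pred_correct, actual = actual)

# overall accuracy: proportion of trials classified correctly
mean(pred_correct == actual)
```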
Did the model do well at predicting lexical decision correctness? Why or why not?
Write your response here
## 5. QDA (3 points)
Load in the `MASS` library and fit a QDA model to the data set. The predictors are still `Length`, `Log_Freq_HAL`, and their interaction, just like in the logistic regression model you just ran, and the dependent variable is still `Correct`.
Hint: see the Classifiers tutorial.
```r
# WRITE YOUR CODE HERE
```
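A sketch, again assuming the joined data frame is named `lex_dat`: `qda()` from MASS accepts the same formula syntax as `glm()`, so the interaction term can be specified the same way.

```r
library(MASS) # provides qda(); note that MASS masks dplyr's select()

# QDA with the same predictors: main effects plus their interaction
correct_qda <- qda(Correct ~ Length * Log_Freq_HAL, data = lex_dat)
correct_qda
```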
Now look at how well the predicted `Correct` values compare with the actual `Correct` values for the whole data set. Output a confusion matrix and overall prediction accuracy.
```r
# WRITE YOUR CODE HERE
```
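A sketch, assuming the QDA fit from above is stored in `correct_qda` (a placeholder name): calling `predict()` on a `qda` fit without new data returns predictions for the training set, with the predicted labels in the `$class` element.

```r
# predicted labels for every trial in the training data
qda_class <- predict(correct_qda)$class

# confusion matrix and overall accuracy, as in question 4
table(predicted = qda_class, actual = lex_dat$Correct)
mean(qda_class == lex_dat$Correct)
```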
How does QDA prediction performance differ from that of logistic regression?
Write your response here
DUE: 5pm EST, March 11, 2024
**IMPORTANT:** Did you collaborate with anyone on this assignment? If so, list their names here.
Someone’s Name