Exercise 3: Data objects

These exercises are designed to get you comfortable extracting information from data objects.

We’ll work with the Credit dataset, which comes with the ISLR package in R. This is a simulated dataset that provides credit and demographic information on 400 hypothetical customers.


1. Load packages, data, model (1 pt)

Install and load ISLR below.

#install.packages("ISLR")
library(ISLR)

Take a look at the first few rows of the Credit dataset.

# INSERT CODE HERE

We can see that we have a nice tidy data frame here. Each column is a separate variable and each row is a different observation (in this case, simulated customers).

The code below fits a linear model to predict credit card balance from the card limit and the card owner’s credit rating, age, gender, and student status. This model is saved as the cred_lm model object. The summary() function extracts important summary information from the model object so we can interpret the results.

cred_lm <- lm(Balance ~ Limit + Rating + Age + Gender + Student, data = Credit)
summary(cred_lm)

2. Replicating summary outputs (5 pts)

Let’s see if we can replicate some of the values included in the summary() output.

Let’s start with the residual standard error, aka sigma. We can see above that this is 195.9 for this model. You can directly extract sigma as follows:

sigma(cred_lm)

In lm, sigma is calculated as

\[ \sqrt{\frac{SSE}{n-p}} \]

where SSE is the sum of squared errors (the squared residuals, summed over all observations), n is the number of observations, and p is the number of parameters estimated (hint: this includes the intercept). The denominator, n - p, is therefore the residual degrees of freedom.
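For reference, under the usual least-squares definitions, SSE can be written out explicitly. The symbols y_i (observed value), \hat{y}_i (fitted value), and e_i (residual) are notation introduced here for illustration and do not appear elsewhere in this exercise:

\[ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2 \]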

Below, use what you’ve learned about extracting information from model objects to calculate the SSE and extract n and p.

Hint: remember that R is really good at vectorized operations, meaning it easily applies the same operation individually to each element of a given vector.
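As a quick reminder of what vectorized means, here is a minimal sketch on a made-up toy vector (x is hypothetical and has nothing to do with the Credit data). Arithmetic operators act on every element at once, and a reduction such as sum() then collapses the result to a single number.

```r
# Toy vector, made up for illustration (not from the Credit data)
x <- c(1, 2, 3, 4)

# Arithmetic is applied element by element, with no explicit loop
x_squared  <- x^2           # 1 4 9 16
x_centered <- x - mean(x)   # -1.5 -0.5 0.5 1.5

# A reduction such as sum() collapses the vector to one number
total <- sum(x_squared)     # 30
```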

# INSERT CODE HERE
# Calculate SSE


# Extract n


# Extract p

Now, combine your work above to write a function that takes any fitted linear model and returns the residual standard error. Then test your function on the cred_lm model object. Compare your answer to sigma extracted directly from the model object.
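If the shape of such a function is unclear, the sketch below shows the general pattern on a different statistic. The helper mean_abs_resid() and the model toy_lm are hypothetical names introduced here, fitted to R's built-in mtcars data rather than Credit. The function deliberately computes the mean absolute residual, not the residual standard error.

```r
# Hypothetical helper: mean absolute residual of any fitted lm object.
# This is NOT the residual standard error; it only shows the pattern.
mean_abs_resid <- function(model) {
  stopifnot(inherits(model, "lm"))  # fail early on non-lm input
  mean(abs(residuals(model)))       # residuals() extracts the residual vector
}

# Toy model on the built-in mtcars data (an assumption for illustration)
toy_lm <- lm(mpg ~ wt + hp, data = mtcars)
mean_abs_resid(toy_lm)
```

Because the function only relies on generic extractors, it should work on any model fitted with lm(), which is the property the exercise asks for.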

# INSERT CODE HERE
# Test and compare results. 

your_function(cred_lm) # Replace with your own function name
sigma(cred_lm)

3. Summary table and indexing (4 pts)

Let’s say we wanted to extract the entire coefficient table provided to us by the summary() function above, maybe for use in a publication. You might expect this to be pulled by:

cred_lm$coefficients

But as we saw in the tutorial, this pulls just the variable name and estimate, and not the standard error, t-statistic, or p-value. You could try to find where all this information is stored in the cred_lm object using the str() function.

str(cred_lm)

But you actually won’t find it in there! That’s because the coefficient table is a component of the object returned by summary(), not of the model object itself. That’s right, summary() creates its own object that you can pull further information from.
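To see that summary() really returns an object of its own, here is a minimal sketch on a toy model fitted to the built-in mtcars data. The names toy_lm and toy_summary are hypothetical and used only for illustration. Components of the summary object are accessed with $, just like components of the model object.

```r
# Toy model on built-in data (not the Credit model)
toy_lm <- lm(mpg ~ wt + hp, data = mtcars)

# Store the summary instead of only printing it
toy_summary <- summary(toy_lm)

class(toy_summary)         # "summary.lm", a different class from "lm"
names(toy_summary)         # lists the components stored in the summary
toy_summary$r.squared      # one component, pulled with $
toy_summary$adj.r.squared  # another component
```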

Knowing this, pull the coefficient table from the summary() object.

# INSERT CODE HERE

Maybe we are not interested in including the t-statistic in our final table. Pull just the estimate, SE, and p-value from the summary() object.
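As a refresher on indexing, the sketch below uses a made-up matrix m with hypothetical row and column names. It assumes the table you are working with behaves like an ordinary matrix with dimnames. Columns can be selected by name, and rows can be dropped with a negative index or a logical condition on the row names.

```r
# Toy matrix with named rows and columns (values are arbitrary)
m <- matrix(1:12, nrow = 3,
            dimnames = list(c("r1", "r2", "r3"), c("a", "b", "c", "d")))

# Keep a subset of columns by name (all rows)
keep_cols <- m[, c("a", "b", "d")]

# Drop the first row by position (all columns)
drop_first <- m[-1, ]

# Drop a row by name and select columns in a single step
both <- m[rownames(m) != "r1", c("a", "b", "d")]
```

Selecting by name rather than by position tends to be more robust if the order of rows or columns changes.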

# INSERT CODE HERE

Now, pull the table again but drop the (Intercept) term. (Don’t save and alter your table above – practice pulling the same table, minus the intercept term, directly from the summary.)

# INSERT CODE HERE

That’s all for Exercise 3! When you are finished, save the notebook as Exercise3.ipynb, push it to your class GitHub repository, and send the instructors a link to your notebook via Canvas. You can send messages via Canvas by clicking “Inbox” on the left and then pressing the icon with a pencil inside a square.

DUE: 5pm EST, Feb 7, 2024

IMPORTANT: Did you collaborate with anyone on this assignment? If so, list their names here.

Someone’s Name