Exercise 16: Model selection#

This homework is designed to give you practice implementing model selection techniques, including best subset selection and forward/backward stepwise selection.

You won’t need to load any data for this homework; we will simulate our own.


1. Best subset selection (4 points)#

In this question, we will first generate simulated data, and then use it to perform best subset selection.

a) Use rnorm() to generate a dataset including a predictor \(X\) of length \(n = 100\) and a noise vector \(\epsilon\) of length \(n = 100\). Generate data for a response variable \(Y\) of length \(n = 100\) according to the model

\(Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon\)

where \(\beta_0\), \(\beta_1\), \(\beta_2\), and \(\beta_3\) are constants of your choice.

# WRITE YOUR CODE HERE
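A minimal sketch of what this simulation might look like — the seed and the coefficient values (\(\beta_0 = 1\), \(\beta_1 = 2\), \(\beta_2 = -3\), \(\beta_3 = 0.5\)) are arbitrary choices, not required values:

```r
set.seed(1)               # arbitrary seed, for reproducibility
n <- 100
x <- rnorm(n)             # predictor X
eps <- rnorm(n)           # noise vector epsilon
# Arbitrary coefficient choices
b0 <- 1; b1 <- 2; b2 <- -3; b3 <- 0.5
y <- b0 + b1 * x + b2 * x^2 + b3 * x^3 + eps
```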

b) Use regsubsets() to perform best subset selection to determine the best model that contains the predictor variables \(X\), \(X^2\), … , \(X^{10}\). Print the model summary.

# WRITE YOUR CODE HERE
# Note: if your model summary doesn't show up when viewing your notebook on github, copy and paste the output below.
# Otherwise you can either delete this cell or leave it blank.

```
# paste here

```
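One possible sketch, assuming the leaps package is installed; the data here are re-simulated with the same arbitrary seed and coefficients used in part a so the block runs on its own:

```r
library(leaps)            # provides regsubsets()

set.seed(1)               # re-create the part (a) data (arbitrary choices)
x <- rnorm(100); eps <- rnorm(100)
y <- 1 + 2 * x - 3 * x^2 + 0.5 * x^3 + eps

# Candidate predictors X, X^2, ..., X^10
df <- data.frame(y = y, poly(x, 10, raw = TRUE))
fit_full <- regsubsets(y ~ ., data = df, nvmax = 10)
summary(fit_full)
```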

c) Plot Mallows’ Cp, the Bayesian information criterion (BIC), and the adjusted coefficient of determination (adjusted \(R^2\)) for each model tested. Which model is best? Report the coefficients of the best model.

# WRITE YOUR CODE HERE
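A sketch of how the three criteria might be plotted and the winning model’s coefficients extracted; the simulation setup repeats the arbitrary seed and coefficients assumed in the earlier parts:

```r
library(leaps)
set.seed(1)                                   # re-create the part (a) data
x <- rnorm(100); eps <- rnorm(100)
y <- 1 + 2 * x - 3 * x^2 + 0.5 * x^3 + eps
df <- data.frame(y = y, poly(x, 10, raw = TRUE))
fit_full <- regsubsets(y ~ ., data = df, nvmax = 10)

reg_sum <- summary(fit_full)
par(mfrow = c(1, 3))
plot(reg_sum$cp, xlab = "Number of variables", ylab = "Mallows' Cp", type = "b")
points(which.min(reg_sum$cp), min(reg_sum$cp), col = "red", pch = 19)
plot(reg_sum$bic, xlab = "Number of variables", ylab = "BIC", type = "b")
points(which.min(reg_sum$bic), min(reg_sum$bic), col = "red", pch = 19)
plot(reg_sum$adjr2, xlab = "Number of variables", ylab = "Adjusted R^2", type = "b")
points(which.max(reg_sum$adjr2), max(reg_sum$adjr2), col = "red", pch = 19)

# Coefficients of, e.g., the model that minimizes BIC
coef(fit_full, id = which.min(reg_sum$bic))
```

Note that Cp and BIC are minimized, while adjusted \(R^2\) is maximized, which is why `which.min()` vs. `which.max()` differ above.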

Write your response here


2. Forward and backward stepwise selection (3 points)#

Using the same simulated data from question 1, use forward stepwise selection and backward stepwise selection to determine the best model. Again, for both model selection methods, plot Mallows’ Cp, the Bayesian information criterion (BIC), and the adjusted \(R^2\) for each model tested. Report the coefficients of the best model.

a) Forward stepwise selection:

# WRITE YOUR CODE HERE
# Note: if your model summary doesn't show up when viewing your notebook on github, copy and paste the output below.
# Otherwise you can either delete this cell or leave it blank.

```
# paste here

```
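A sketch of forward stepwise selection via the `method` argument of `regsubsets()`, re-simulating the question 1 data under the same arbitrary assumptions:

```r
library(leaps)
set.seed(1)                                   # same simulated data as question 1
x <- rnorm(100); eps <- rnorm(100)
y <- 1 + 2 * x - 3 * x^2 + 0.5 * x^3 + eps
df <- data.frame(y = y, poly(x, 10, raw = TRUE))

fit_fwd <- regsubsets(y ~ ., data = df, nvmax = 10, method = "forward")
fwd_sum <- summary(fit_fwd)
# fwd_sum$cp, fwd_sum$bic, and fwd_sum$adjr2 can then be plotted as in question 1c
```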

b) Backward stepwise selection:

# WRITE YOUR CODE HERE
# Note: if your model summary doesn't show up when viewing your notebook on github, copy and paste the output below.
# Otherwise you can either delete this cell or leave it blank.

```
# paste here

```
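The backward pass is identical except for the `method` argument; again a self-contained sketch under the same arbitrary simulation choices:

```r
library(leaps)
set.seed(1)                                   # same simulated data as question 1
x <- rnorm(100); eps <- rnorm(100)
y <- 1 + 2 * x - 3 * x^2 + 0.5 * x^3 + eps
df <- data.frame(y = y, poly(x, 10, raw = TRUE))

fit_bwd <- regsubsets(y ~ ., data = df, nvmax = 10, method = "backward")
bwd_sum <- summary(fit_bwd)
# bwd_sum$cp, bwd_sum$bic, and bwd_sum$adjr2 can then be plotted as in question 1c
```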

c) Compare your results from parts a and b with those of question 1.

Write your response here


3. Training and test error (3 points)#

This question will explore the relationship between training and test error and the number of features included in a model. We will again use a simulated dataset.

a) Simulate a dataset with \(p = 20\) features and \(n = 1,000\) observations. Generate data for a response variable \(Y\) according to the model

\(Y = X\beta + \epsilon\)

where \(\beta\) is random with some elements that are exactly zero.

Split your simulated dataset into a training set containing \(n=100\) observations and a test set containing \(n=900\) observations.

# WRITE YOUR CODE HERE
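One way this simulation and split might look — the seed and the number of zeroed coefficients (8 here) are arbitrary choices:

```r
set.seed(2)                         # arbitrary seed
n <- 1000; p <- 20
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
beta <- rnorm(p)
beta[sample(p, 8)] <- 0             # make some true coefficients exactly zero
eps <- rnorm(n)
y <- drop(X %*% beta + eps)

train <- sample(n, 100)             # 100 training observations, 900 for testing
x_train <- X[train, ];  y_train <- y[train]
x_test  <- X[-train, ]; y_test  <- y[-train]
```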

b) Perform best subset selection on the training set, and plot the associated training and test set MSE for the best model of each size.

# WRITE YOUR CODE HERE
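A sketch of the size-by-size MSE computation; it re-simulates the part a data under the same arbitrary assumptions, and uses `model.matrix()` to build prediction matrices since `regsubsets` objects have no `predict` method:

```r
library(leaps)
set.seed(2)                                   # re-create the part (a) data
n <- 1000; p <- 20
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
beta <- rnorm(p); beta[sample(p, 8)] <- 0
y <- drop(X %*% beta + rnorm(n))
train <- sample(n, 100)
df_train <- data.frame(y = y[train],  X[train, ])
df_test  <- data.frame(y = y[-train], X[-train, ])

fit <- regsubsets(y ~ ., data = df_train, nvmax = p)
train_mat <- model.matrix(y ~ ., data = df_train)
test_mat  <- model.matrix(y ~ ., data = df_test)

train_mse <- test_mse <- rep(NA, p)
for (k in 1:p) {
  b <- coef(fit, id = k)                      # best model of size k
  train_mse[k] <- mean((df_train$y - train_mat[, names(b)] %*% b)^2)
  test_mse[k]  <- mean((df_test$y  - test_mat[,  names(b)] %*% b)^2)
}

plot(1:p, train_mse, type = "b", col = "blue",
     xlab = "Model size", ylab = "MSE",
     ylim = range(c(train_mse, test_mse)))
lines(1:p, test_mse, type = "b", col = "red")
legend("topright", legend = c("Training MSE", "Test MSE"),
       col = c("blue", "red"), lty = 1)
```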

c) Identify the best model according to the test MSE. How does this model compare to the true model used to simulate the dataset?

Note: If the identified model contains only an intercept or all of the features, then generate a new dataset (i.e., repeat part a with a different random seed) until the test set MSE is minimized at an intermediate model size.

# WRITE YOUR CODE HERE
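A short sketch of pulling out the test-MSE-minimizing model and lining its coefficients up against the true nonzero entries of \(\beta\); the simulation setup repeats the same arbitrary choices assumed in parts a and b:

```r
library(leaps)
set.seed(2)                                   # re-create the part (a) data
n <- 1000; p <- 20
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
beta <- rnorm(p); beta[sample(p, 8)] <- 0
y <- drop(X %*% beta + rnorm(n))
train <- sample(n, 100)
df_train <- data.frame(y = y[train],  X[train, ])
df_test  <- data.frame(y = y[-train], X[-train, ])
fit <- regsubsets(y ~ ., data = df_train, nvmax = p)
test_mat <- model.matrix(y ~ ., data = df_test)
test_mse <- sapply(1:p, function(k) {
  b <- coef(fit, id = k)
  mean((df_test$y - test_mat[, names(b)] %*% b)^2)
})

best_k <- which.min(test_mse)                 # best size by test MSE
coef(fit, id = best_k)                        # estimated coefficients
setNames(beta, paste0("X", 1:p))[beta != 0]   # true nonzero coefficients
```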

Write your response here

DUE: 5pm EST, April 8, 2024

IMPORTANT Did you collaborate with anyone on this assignment? If so, list their names here.

Someone’s Name