Tutorial: Data as Objects and Tidy Data

Tutorial: Data as Objects and Tidy Data#

Goals:#

Learn about Tidy Data, long data, and wide data
Learn how to interact with model objects in R, including extracting coefficients

Tidy Data#

When you are working with data frames, it is best practice to keep your data tidy. Tidy data is a very specific term in which the following are true:

Columns are variables
Rows are individual observations
Observation for each variable is stored in the corresponding cell

There are two versions of the tidy data format - long and wide. To illustrate the difference, imagine a repeated measures dataset, where you gather four observations for each subject at four different time points. In the long data format, each row represents an observation at a specific timepoint for a specific subject. Each subject has multiple rows of observations. The columns will be subject, timepoint, and observation. For example:

long_data <- data.frame(subject = rep(1:3, each=4),  # rep repeats a vector n times
                        timepoint = rep(1:4, times=3), # same as previous
                        observation = round(rnorm(12), 2)) #rnorm simulates a normal distribution
long_data # print out long data table

A data.frame: 12 × 3
subject	timepoint	observation
<int>	<int>	<dbl>
1	1	-1.09
1	2	0.10
1	3	0.84
1	4	1.09
2	1	-0.81
2	2	-0.13
2	3	0.76
2	4	-0.44
3	1	2.09
3	2	-1.62
3	3	0.95
3	4	-0.09

Alternatively, the data can be organized in a wide format, in which each row is a single subject, the four different timepoints are represented as different columns, and the observation for each timepoint is stored at a cell within that column. Like this (don’t worry about the code for now, it is explained in the next tutorial):

library(tidyverse) #We'll learn about this package in the next tutorial!

long_data  %>% # This is a pipe. We will learn about this later
    mutate(timepoint = paste0('time',timepoint))  %>%  # mutate makes a new column
    spread(timepoint, observation) # spread pushes an observation across multiple columns

A data.frame: 3 × 5
subject	time1	time2	time3	time4
<int>	<dbl>	<dbl>	<dbl>	<dbl>
1	-1.09	0.10	0.84	1.09
2	-0.81	-0.13	0.76	-0.44
3	2.09	-1.62	0.95	-0.09

Interacting with model objects in R#

Whenever you run a regression or any other type of model, the resulting model is saved in an object with properties you can access. This is handy for fitting a model once, and then referencing different aspects of that model’s performance and fit. Let’s see an example:

# generate data where y is a function of two variables, x1 and x2
x1 <- seq(-4,4,0.5) #generate data ranging from -4 to 4 in steps of 0.5
x2 <- seq(-4,4,0.5)

#specify a sigmoidal relationship between x1 and y
#and a linear relationship between x2 and y
dat <- expand.grid(x1=x1, x2=x2) %>%
    mutate(y = 1/(1+exp(-x1)) + 0.1*x2)

head(dat)

A data.frame: 6 × 3
	x1	x2	y
	<dbl>	<dbl>	<dbl>
1	-4.0	-4	-0.3820138
2	-3.5	-4	-0.3706878
3	-3.0	-4	-0.3525741
4	-2.5	-4	-0.3241418
5	-2.0	-4	-0.2807971
6	-1.5	-4	-0.2175745

Note that expand.grid just creates a data frame with all possible pairs of the input variables. So here, each value of x1 was paired with an x2 value of -4, then with an x2 value of -3.5, etc…

We’ll learn more about the pipe (%>%) and mutate in the next tutorial!

Here is how y depends on x1 and x2:

# setting figure size
options(repr.plot.width=12, repr.plot.height=8)

dat  %>%
    gather(var, value, x1, x2) %>% # make it long format - only x1 or x2 in each row
    ggplot(aes(value, y)) + # we will learn aboug ggplot in a later tutorial
    stat_summary(geom='point',fun=mean) +
    stat_summary(geom='line',fun=mean) +
    facet_wrap(~var) # optimally wraps figures as needed

../_images/70f9b71921cb130aa56957e84e0b934481be1be34f0c2257a2a0e3c443e1f8be.png

Now, let’s add some noise and try to fit a regression model

#create new y_obs variable with normally distributed noise
dat <- mutate(dat, y_obs = y + rnorm(length(y), mean=0, sd=0.1))

#fit a linear model to find the regression coefficients
model_fit <- lm(y_obs ~ x1 + x2, data=dat)

The resulting model is saved in the model_fit variable. This is the “model object”, and you can do a number of things with it. First, the str() command, which is a general function for printing the structure of r objects, will tell you what you can find in the model object. As you can see below, it is a list of different variables, which you can access by typing the name of the model, followed by $variable.

str(model_fit) #print the structure of the model object

List of 12
 $ coefficients : Named num [1:3] 0.4996 0.1507 0.0999
  ..- attr(*, "names")= chr [1:3] "(Intercept)" "x1" "x2"
 $ residuals    : Named num [1:289] 0.0767 0.1405 0.0611 -0.0984 -0.0678 ...
  ..- attr(*, "names")= chr [1:289] "1" "2" "3" "4" ...
 $ effects      : Named num [1:289] -8.4925 6.2743 4.1608 -0.1134 -0.0814 ...
  ..- attr(*, "names")= chr [1:289] "(Intercept)" "x1" "x2" "" ...
 $ rank         : int 3
 $ fitted.values: Named num [1:289] -0.503 -0.427 -0.352 -0.277 -0.201 ...
  ..- attr(*, "names")= chr [1:289] "1" "2" "3" "4" ...
 $ assign       : int [1:3] 0 1 2
 $ qr           :List of 5
  ..$ qr   : num [1:289, 1:3] -17 0.0588 0.0588 0.0588 0.0588 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:289] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:3] "(Intercept)" "x1" "x2"
  .. ..- attr(*, "assign")= int [1:3] 0 1 2
  ..$ qraux: num [1:3] 1.06 1.08 1.09
  ..$ pivot: int [1:3] 1 2 3
  ..$ tol  : num 1e-07
  ..$ rank : int 3
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 286
 $ xlevels      : Named list()
 $ call         : language lm(formula = y_obs ~ x1 + x2, data = dat)
 $ terms        :Classes 'terms', 'formula'  language y_obs ~ x1 + x2
  .. ..- attr(*, "variables")= language list(y_obs, x1, x2)
  .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:3] "y_obs" "x1" "x2"
  .. .. .. ..$ : chr [1:2] "x1" "x2"
  .. ..- attr(*, "term.labels")= chr [1:2] "x1" "x2"
  .. ..- attr(*, "order")= int [1:2] 1 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(y_obs, x1, x2)
  .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:3] "y_obs" "x1" "x2"
 $ model        :'data.frame':	289 obs. of  3 variables:
  ..$ y_obs: num [1:289] -0.426 -0.287 -0.291 -0.375 -0.269 ...
  ..$ x1   : num [1:289] -4 -3.5 -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 ...
  ..$ x2   : num [1:289] -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language y_obs ~ x1 + x2
  .. .. ..- attr(*, "variables")= language list(y_obs, x1, x2)
  .. .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:3] "y_obs" "x1" "x2"
  .. .. .. .. ..$ : chr [1:2] "x1" "x2"
  .. .. ..- attr(*, "term.labels")= chr [1:2] "x1" "x2"
  .. .. ..- attr(*, "order")= int [1:2] 1 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(y_obs, x1, x2)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:3] "y_obs" "x1" "x2"
 - attr(*, "class")= chr "lm"

For example, if we want to extract the estimated regression coefficients, we can type this, which returns a named vector.

model_fit$coefficients

(Intercept): 0.499559922079091
x1: 0.150675486478537
x2: 0.0999191218958173

If we run multiple models, we might want to store these values in a data.frame, rather than a named vector like this:

#store the coefficients in a variable
coef <- model_fit$coefficients

#save the name/value of the coefficients in a dataframe
data.frame(name = names(coef), value = coef)

A data.frame: 3 × 2
	name	value
	<chr>	<dbl>
(Intercept)	(Intercept)	0.49955992
x1	x1	0.15067549
x2	x2	0.09991912

You can also extract the prediction of the model. They are stored under $fitted.values, or you can get them by typing fitted(model_fit), which returns the same values. Then, we can plot the observed values against the predicted values:

#extract model estimate of y for each x and store in df as fitted
dat$fitted <- model_fit$fitted.values

head(dat) #we see "fitted" column added

A data.frame: 6 × 5
	x1	x2	y	y_obs	fitted
	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
1	-4.0	-4	-0.3820138	-0.4261378	-0.5028185
2	-3.5	-4	-0.3706878	-0.2869985	-0.4274808
3	-3.0	-4	-0.3525741	-0.2910417	-0.3521430
4	-2.5	-4	-0.3241418	-0.3752198	-0.2768053
5	-2.0	-4	-0.2807971	-0.2692938	-0.2014675
6	-1.5	-4	-0.2175745	-0.2070880	-0.1261298

#plot fitted against the observed data
ggplot(dat, aes(fitted, y_obs)) +
    stat_summary(geom='point',fun=mean)

../_images/d011e77675af2853e39e38f860fb8210795bb4c4c99981a551f1e69f8f4d88ec.png

Given your fitted model, you can also easily generate predictions of y given new values of x using the predict function:

new_data <- data.frame(x1=-5, x2=10) #new x-values to give to the model
predict(model_fit, newdata = new_data) #predict the y value for a given x

1: 0.745373708644577

That’s all for now! We’ll dive into some more advanced functions for data manipulation in R in the next tutorial.

Notebook authored by Ven Popov and edited by Krista Bond, Charles Wu, Patience Stevens, Amy Sentis, and Fiona Horner.

Tutorial: Data as Objects and Tidy Data

Contents

Tutorial: Data as Objects and Tidy Data#

Goals:#

Tidy Data#

Interacting with model objects in R#