Tutorial: Introduction to R, functions, and good coding habits#

Goals:#

  • Learn some basic R commands

  • Learn how to write a function in R

  • Learn best practices for writing and maintaining code

This tutorial draws from Software Carpentry: Programming with R section 2, the R tidyverse style guide, and the Good Research Code Handbook.


Basic R Commands#

Arithmetic#

In R, you can do basic arithmetic just as in any other programming language, e.g.,

6 + 2
10 - 3 * 4
5^3
8
-2
125

Data structures#

Vectors#

Vectors are ordered collections of values of a single type, such as numbers or strings.

a <- c(1,2,3)
b <- c(4,5,6)
print(paste("mean of vector a:",mean(a)))
print(paste("sum of vector b:",sum(b)))
[1] "mean of vector a: 2"
[1] "sum of vector b: 15"
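One detail worth illustrating is that every element of an R vector has the same type. The sketch below uses hypothetical variable names (numbers, mixed) that are not part of the tutorial's running example; it shows that combining numbers and strings with c() coerces everything to character.

```r
# Illustrative names only; not part of the tutorial's running example.
numbers <- c(1, 2, 3)
mixed <- c(1, "two", 3)   # the numbers are coerced to strings

class(numbers)  # "numeric"
class(mixed)    # "character"
```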

You can do element-wise operations on two vectors easily:

a*b
  1. 4
  2. 10
  3. 18

Data frames#

Data frames can contain data of mixed types such as numbers and strings. Usually, each column is a different variable (e.g., Age, Test Score), and items within each column are of the same type.

c <- c("one", "two", "three")
data <- data.frame(avar = a,bvar = b,cvar = c)
data
A data.frame: 3 × 3
avar   bvar   cvar
<dbl>  <dbl>  <chr>
1      4      one
2      5      two
3      6      three

In the data frame printed above, you can see that both variables avar and bvar are type <dbl> (short for “double”, another term for numeric) while variable cvar is type <chr>, or “character”.
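If you want to check column types without reading them off the printed table, one option is sketched below. It rebuilds the same data frame so that it runs on its own; the name column_types is hypothetical, and stringsAsFactors = FALSE is set explicitly on the assumption that you may be running a version of R older than 4.0, where strings would otherwise become factors.

```r
# Rebuild the example data frame so this block is self-contained.
data <- data.frame(
  avar = c(1, 2, 3),
  bvar = c(4, 5, 6),
  cvar = c("one", "two", "three"),
  stringsAsFactors = FALSE
)

# class of each column, returned as a named character vector
column_types <- sapply(data, class)
column_types

# str() prints a compact summary of the whole data frame
str(data)
```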

Indexing#

In a vector, you can index data by its position (starting at index 1):

a[2]
2

In a data frame, you can index data by its column and row number using data[row, col]:

data[2,1]
2

You can index a particular column in a data frame by its variable name using $:

data$bvar
  1. 4
  2. 5
  3. 6

Or, you can access a particular column or row by its position in the data frame:

print("Column 3")
data[,3]
[1] "Column 3"
  1. 'one'
  2. 'two'
  3. 'three'
print("Row 1")
data[1,]
[1] "Row 1"
A data.frame: 1 × 3
   avar   bvar   cvar
   <dbl>  <dbl>  <chr>
1  1      4      one

Or with square brackets and the row or column name:

data[,"cvar"]
  1. 'one'
  2. 'two'
  3. 'three'
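The indexing forms above can also be combined. The sketch below is a small extension rather than part of the original notebook; the object names (subset_rows_cols, large_b, b_values) are hypothetical. It selects several rows and columns at once, filters rows with a logical condition, and extracts a column with double square brackets.

```r
# Rebuild the example data frame so this block is self-contained.
data <- data.frame(
  avar = c(1, 2, 3),
  bvar = c(4, 5, 6),
  cvar = c("one", "two", "three"),
  stringsAsFactors = FALSE
)

# rows 1 and 2, columns avar and cvar (the result is still a data frame)
subset_rows_cols <- data[1:2, c("avar", "cvar")]

# a logical condition keeps only the rows where bvar is greater than 4
large_b <- data[data$bvar > 4, ]

# double square brackets extract a single column as a vector
b_values <- data[["bvar"]]
```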

How to write a function in R#

In the lecture and readings, you went over how to construct a testable hypothesis of the form:

\(Y=f(X)\)

This also describes the basic form of a function. A function takes an input set of variables (\(X\)), performs a basic operation on them (\(f\)), and generates an output or set of outputs (\(Y\)). What you will learn here is how to implement that with code.

Functions are useful for eliminating redundant or repetitive code, as well as for separating discrete tasks. Function definitions contain certain components:

  • function name

  • input parameters

  • function operation

  • return statement

  • return parameters

The example function below converts temperatures in Fahrenheit to temperatures in Celsius. Note the formatting structure of the function:

  • the first line contains a descriptive function name, fahrenheit_to_celsius, and the input parameter, temp_F

  • the body of the function performs the temperature conversion, the function operation, and is contained within curly braces

  • the last line contains the return statement and the return parameter, temp_C

Aside: R does not require a formal return statement. The value of the last expression evaluated in the function body is returned automatically. However, especially while you are learning, it is best to write the return statement explicitly.

The functional form of the conversion from \(temp_F\) to \(temp_C\) is: \(temp_C = \frac{(temp_F-32)*5}{9}\). Here we will just translate that into R form.

fahrenheit_to_celsius <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}

When you execute the cell above, R will build the function fahrenheit_to_celsius and it will be available in your global environment.

To then call the function, simply invoke the function name with the required input parameters, just like you would any pre-defined function. For example:

# freezing point of water
fahrenheit_to_celsius(32)

# boiling point of water
fahrenheit_to_celsius(212)
0
100
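To make the earlier aside about return statements concrete, here is a minimal sketch of the same conversion written without return(). The function name fahrenheit_to_celsius_implicit is hypothetical and is used only so that it does not overwrite the tutorial's function. Because R arithmetic is vectorized, the same function also accepts a vector of temperatures.

```r
# Same conversion as fahrenheit_to_celsius, but relying on R returning
# the value of the last evaluated expression.
fahrenheit_to_celsius_implicit <- function(temp_F) {
  (temp_F - 32) * 5 / 9
}

# a single temperature
boiling_C <- fahrenheit_to_celsius_implicit(212)

# a vector of temperatures is converted element by element
both_C <- fahrenheit_to_celsius_implicit(c(32, 212))
```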

There are two ways that functions in R can receive input parameters. Arguments can be passed by name in the function call itself, e.g., function_name(variable = value). Alternatively, arguments can be passed by position, in which case they are matched to the parameters in the function definition from left to right. Finally, a parameter can be omitted from the call if the function definition sets a default value for it.

Let’s illustrate with the following example:

input_1 <- 20
my_sum <- function(input_1, input_2 = 10) { #input_2 is given a default value of 10
  output <- input_1 + input_2
  return(output)
}

Now, let’s try running our function a few different ways. Note the differing behavior depending on the form of the input parameters:

my_sum(2)
12
my_sum(3, 4)
7
my_sum(input_1 = 1, 3)
4
my_sum(input_2 = 3)
Error in my_sum(input_2 = 3): argument "input_1" is missing, with no default
Traceback:

1. my_sum(input_2 = 3)

Why do we receive an error on the last function call? Looking at the error message is informative. The function definition doesn’t contain a default value for input_1, and since the only argument supplied is named input_2, the function call doesn’t provide a value to use for input_1. Thus, the addition operation in the function body cannot be performed.
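Because addition gives the same result in either order, my_sum cannot show which argument ended up where. The sketch below uses a hypothetical function, my_difference, with the same parameter names to make the matching visible: named arguments are matched first, in any order, and any remaining unnamed arguments fill the remaining parameters from left to right.

```r
# Hypothetical helper with the same signature as my_sum, but using
# subtraction so that the order of the arguments matters.
my_difference <- function(input_1, input_2 = 10) {
  output <- input_1 - input_2
  return(output)
}

by_position <- my_difference(3, 1)                     # input_1 = 3, input_2 = 1
by_name <- my_difference(input_2 = 3, input_1 = 1)     # names override order
mixed_call <- my_difference(input_2 = 3, 1)            # unnamed 1 fills input_1
default_used <- my_difference(12)                      # input_2 falls back to 10
```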


Good research code practices#

The Good Research Code Handbook is an excellent resource written by Patrick Mineault, a software engineer at Google with a PhD in computational neuroscience. The handbook is particularly useful for grad students and postdocs whose research incorporates lots of programming. It contains helpful guidance on organizing code so that it is clear, easy to understand, and works reliably.

Let’s walk through some gems together:

Keep code consistent#

“Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.” - The tidyverse style guide

The R tidyverse style guide, derived from Google’s original R style guide, provides a set of consistent (if somewhat arbitrary) rules that facilitate writing clear code.

This tutorial will only focus on some basic rules, but as always, there are more resources available that provide more detailed information.

File names should be meaningful and limited to letters, numbers, dashes and underscores. Avoid using special characters. Some examples of good and bad file names:

# good
fit_models.R
utility_functions.R

# bad
fit models.R
foo.r
stuff.r

It is also often helpful to use a numerical prefix if a set of files are designed to be run in a specific order.

Similarly, object names (variables and functions) should be meaningful and limited to lowercase letters and numbers, with underscores used to separate words. This style is also known as snake case. For example:

# good
day_one
day_1

# bad
dayOne    #camel case
dayone

(Note: some people use camel case (dayOne) because it is easier to write, but snake case is generally preferred because it’s easier to read.)

Note that certain object names should be avoided, such as those that overlap with common functions, e.g., mean.
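The sketch below illustrates why reusing a name like mean is risky. It is deliberately bad code, and the variable names masked_result and restored_result are hypothetical: once an object called mean is defined in your workspace, calls to mean() find your object first instead of the built-in function.

```r
# Deliberately bad: reuse the name of a base function for our own object.
mean <- function(x) "not the mean"
masked_result <- mean(c(1, 2, 3))    # calls our definition, not base::mean

# Removing our object makes the base function visible again.
rm(mean)
restored_result <- mean(c(1, 2, 3))
```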

Long lines: code should be limited to 80 characters per line. There are specific ways to handle longer lines that call or define functions.

If a function call doesn’t fit on a single line, separate the input parameters on the subsequent lines. For example:

# good
do_something_very_complicated(
  something = "that",
  requires = many,
  arguments = "some of which may be long"
)

# bad
do_something_very_complicated("that", requires, many, arguments,
                              "some of which may be long"
                              )

If a function definition doesn’t fit on a single line, separate the input parameters on subsequent lines, indented to match the opening bracket of the function, as shown below:

long_function_name <- function(a = "a long argument",
                               b = "another argument",
                               c = "another long argument") {
  # As usual code is indented by two spaces.
}

Assignment should use the left arrow convention, <-, instead of =. On the surface, it may seem that these two symbols operate the same way, but they actually don’t and are not interchangeable. Run the code blocks below to see if you can figure out how they differ.

mean(x = 1:10)
5.5
x
Error in eval(expr, envir, enclos): object 'x' not found
Traceback:
mean(x <- 1:10)
5.5
x
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10

There are also places where the assignment arrow should not be used, such as when naming arguments in a function call, where = is required. Reserving <- for assignment therefore helps with code readability. Fun fact: you can also assign objects in the other direction, e.g., value -> name, but this is rarely used.

Since the arrow is somewhat annoying to type, handy shortcuts exist:

Mac: Option + -

PC: Alt + -

Logical variables and vectors should be defined using TRUE and FALSE instead of the shortened T and F forms.
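One reason for this rule, sketched below with hypothetical variable names (t_is_true, assign_error): T and F are ordinary variables that can be overwritten, whereas TRUE and FALSE are reserved words that cannot. The first half is deliberately bad code.

```r
# T is just a variable that holds TRUE by default, so it can be overwritten.
T <- FALSE
t_is_true <- isTRUE(T)    # FALSE after the reassignment
rm(T)                     # remove our copy so that T means TRUE again

# TRUE is a reserved word; trying to assign to it raises an error,
# which tryCatch() turns into the string "error" here.
assign_error <- tryCatch(
  eval(parse(text = "TRUE <- FALSE")),
  error = function(e) "error"
)
```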

Keep Jupyter notebooks tidy#

There are additional considerations to keep in mind when working with Jupyter notebooks. Since they are particularly useful for literate programming (combining code, graphics and text), they provide lots of flexibility and don’t necessitate a rigid, linear flow through the notebook cells. It therefore requires even more discipline to maintain notebooks and the code within them.

One of the most helpful tips for keeping a notebook tidy is to ensure that your notebook runs from top to bottom. Restarting the notebook kernel and running through all the cells before committing to git is a good habit to get into and can save time in the future.

Thinking more broadly, using Jupyter notebooks for everything is not best practice. Developing code and analysis pipelines inside an integrated development environment (IDE), such as RStudio for R, is helpful for efficiency and for developing good coding habits.

Delete dead code#

[Image: FINAL.doc]

Code that gets developed over time, as is often the case in research, can accumulate lots of no-longer-necessary components. This can lead to the phenomenon of dead code, code that never gets called or run.

“You know who hates dead code? You, in three months.” - Patrick Mineault.

Not only can dead code create problems, it can also be an enormous waste of time to wade through. It is good to develop a habit of cleaning up code projects as you work, but particularly when wrapping up a project or putting it on the back burner. Remember that when using git to version control your projects, it is always possible to go back in time and recover deleted code or old versions if necessary.

Use pure functions#

Pure functions follow a particular structure: all inputs come in through the input parameters, and all outputs come back through the return statement. For instance, using our sum function from earlier in the tutorial:

my_sum <- function(input_1, input_2) {
  output <- input_1 + input_2
  return(output)
}

A pure function can also be thought of as a static black box. Something goes into the box, operations occur, and a result is output. In this case, both the inputs and the black box itself remain unchanged.
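For contrast, here is a minimal sketch of an impure function next to a pure one. All names (running_total, add_to_total, add_values) are hypothetical. The impure version reads and changes a variable outside itself, so the same input gives different outputs on successive calls; the pure version always gives the same output for the same inputs.

```r
# Impure: depends on, and modifies, a variable outside the function.
running_total <- 0
add_to_total <- function(value) {
  running_total <<- running_total + value   # side effect on outer variable
  return(running_total)
}
first_call <- add_to_total(5)
second_call <- add_to_total(5)   # same input, different output

# Pure: everything it needs arrives as arguments, and the only result
# is the returned value.
add_values <- function(total, value) {
  return(total + value)
}
pure_first <- add_values(0, 5)
pure_second <- add_values(0, 5)  # same input, same output
```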

If you are building a very complex function, it is best practice to break it into smaller functions that each do a clearly defined step in the larger function. This will greatly help with debugging and troubleshooting. These can then be combined into a “master” function:

small_fun1 <- function(a, b) {
  # does step 1
  # returns step 1
}

small_fun2 <- function(a, b) {
  # does step 2
  # returns step 2
}

big_fun <- function(a, b) {
  small_fun1(a, b)
  small_fun2(a, b)
}
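As a concrete, runnable version of this pattern, the sketch below reuses fahrenheit_to_celsius from earlier and adds two hypothetical functions, celsius_to_kelvin and fahrenheit_to_kelvin. Each small function does one step, and the larger function passes the result of the first step into the second.

```r
# Step 1: the conversion defined earlier in the tutorial.
fahrenheit_to_celsius <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}

# Step 2 (hypothetical): Celsius to Kelvin.
celsius_to_kelvin <- function(temp_C) {
  temp_K <- temp_C + 273.15
  return(temp_K)
}

# Combined function: feeds the output of step 1 into step 2.
fahrenheit_to_kelvin <- function(temp_F) {
  temp_C <- fahrenheit_to_celsius(temp_F)
  temp_K <- celsius_to_kelvin(temp_C)
  return(temp_K)
}

freezing_K <- fahrenheit_to_kelvin(32)
```

If the combined result looks wrong, each step can be called and checked on its own, which is the debugging benefit described above.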

Bad code: things not to do#

Sometimes it can be equally helpful to know what not to do. Here’s a brief list of some common pitfalls to avoid in your code:

  • mysterious object names that don’t provide helpful info about their function

  • magic numbers, or hard-coded values without explanation

  • redundant or duplicated code

  • large functions that do too much and become unwieldy (e.g., mixing IO and computation)

  • too many nested ifs and for loops
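To illustrate the magic numbers item, here is one possible way to rewrite the temperature conversion with named constants. The function name fahrenheit_to_celsius_named and the constant names are hypothetical; the arithmetic is the same as in fahrenheit_to_celsius.

```r
fahrenheit_to_celsius_named <- function(temp_F) {
  # Named constants explain what the numbers in the formula mean.
  freezing_point_F <- 32            # freezing point of water in Fahrenheit
  celsius_per_fahrenheit <- 5 / 9   # size of one Fahrenheit degree in Celsius

  temp_C <- (temp_F - freezing_point_F) * celsius_per_fahrenheit
  return(temp_C)
}

boiling_C <- fahrenheit_to_celsius_named(212)
```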

Documenting#

Most of you are probably familiar with comments within code as a way of documenting. However, this is only one type of documentation, albeit a very useful form. Documentation can really be thought of as any meta-information that you write about the code. Therefore, all of the following can be considered documentation, and indeed, all are helpful to incorporate into your own practice:

  • comments (both single-line and multi-line)

  • docstrings located at the top of functions that describe function operations, inputs and outputs

  • README.md documents in GitHub (and elsewhere)

  • usage documentation

  • Jupyter notebooks as tutorials for using code/pipelines
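As an example of the first two items, the sketch below documents the earlier conversion function with a docstring and an inline comment. The #' comment style follows the roxygen2 convention, which is an assumption here since the tutorial does not name a docstring format; plain # comments at the top of the function would serve the same purpose.

```r
#' Convert a temperature from Fahrenheit to Celsius.
#'
#' @param temp_F Numeric vector of temperatures in degrees Fahrenheit.
#' @return Numeric vector of the same length, in degrees Celsius.
#' @examples
#' fahrenheit_to_celsius(212)
fahrenheit_to_celsius <- function(temp_F) {
  # subtract the Fahrenheit freezing point, then rescale to Celsius degrees
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}
```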

Notebook authored by Amy Sentis and Fiona Horner.