Exercise 5: Using ggplot

This homework assignment is designed to get you comfortable working with ggplot for generating data visualizations.

We will be using the gapminder dataset. It contains information about population, life expectancy and per capita GDP by country over time.


1. Color, plot type and layers (6 points)

Install and load the gapminder dataset. Look at the first few rows of the data frame.

# INSERT CODE HERE

Now, let’s create a basic scatterplot using ggplot2 that shows how life expectancy has changed over time.

# INSERT CODE HERE

We can add another layer of detail by using color to indicate continent. Modify the code from the previous question to do so.

What trends can you identify in the data?

# INSERT CODE HERE

Write your response here.

A scatterplot probably isn’t the best type of plot for visualizing change over time. Instead, modify the code from the previous question to create a line plot.

# INSERT CODE HERE

Hmm, this plot looks a bit weird and unexpected, making it difficult for us to easily interpret the data trends. What is causing this?

Write your response here.

Now, let’s try to separate the data by country, plotting one line for each country. Modify the code from the previous plot to do so.
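As a hint, here is a minimal sketch of the group aesthetic on a different dataset. It assumes R's built-in Orange data (not part of this assignment), in which each tree is measured at several ages. Mapping group to Tree tells geom_line() which rows belong to the same line.

```r
library(ggplot2)

# Orange ships with base R: one row per (Tree, age) measurement.
# Without `group`, geom_line() would connect all rows into a single path;
# with it, each tree gets its own line.
p_grouped <- ggplot(data = Orange,
                    mapping = aes(x = age, y = circumference, group = Tree)) +
  geom_line()

p_grouped
```

The same idea should carry over to any data frame that has one identifier column per repeated-measures unit.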

# INSERT CODE HERE

But what if we want to visualize both lines and points on our graph? There are two primary ways to do this, and both take advantage of how layers work in ggplot: each layer is drawn on top of the previous layer.

  • Method 1: Plot black points on top of the colorful lines

  • Method 2: Plot black points underneath the colorful lines

Modify the code from the previous question to create two plots, one for each method.

Hint: to control the color of the points, think about where the color aesthetic should be set: in the global plot options or in a specific layer.
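To illustrate the hint, here is a rough sketch on a different dataset. It again assumes R's built-in Orange data rather than gapminder, and the object names are hypothetical. Color is mapped only inside geom_line(), so geom_point() does not inherit it and falls back to the default black. The order in which the layers are added decides which one is drawn on top.

```r
library(ggplot2)

# Shared base: no color in the global aesthetics.
base <- ggplot(data = Orange,
               mapping = aes(x = age, y = circumference, group = Tree))

# Points are added last, so they are drawn on top of the lines.
p_points_on_top <- base +
  geom_line(mapping = aes(color = Tree)) +
  geom_point()

# Points are added first, so the lines are drawn over them.
p_points_below <- base +
  geom_point() +
  geom_line(mapping = aes(color = Tree))

p_points_on_top
p_points_below
```

If color were moved into the global aes() instead, both layers would inherit it and the points would be colored as well.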

# INSERT CODE HERE

# method 1
# INSERT CODE HERE

# method 2
# Aside: colorful points

2. Adding statistics (4 points)

ggplot2 allows easy overlay of statistical models on top of the data.

The graph below shows the relationship between life expectancy and GDP per capita:

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
[Figure: scatterplot of life expectancy (lifeExp) against GDP per capita (gdpPercap)]

However, the data points are squished close together on the left side of the graph, so it’s hard to see the actual relationship we’re interested in.

To fix this, we can change the scale of the x-axis using the scale functions. We can also make the data points transparent using the alpha argument. This is helpful when there is a large amount of clustered data.
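As a sketch of both ideas on a different dataset, the example below assumes the diamonds data that ships with ggplot2 (not part of this assignment). It uses scale_x_log10() as one possible scale function and sets a fixed alpha on the point layer. The value 0.1 is an arbitrary choice for illustration.

```r
library(ggplot2)

# diamonds has ~54,000 rows, so points overlap heavily at small carat values.
p_log <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +  # constant transparency, set outside aes()
  scale_x_log10()            # log10-transform the x-axis

p_log
```

Because alpha is set outside aes(), it applies the same transparency to every point instead of being mapped to a variable.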

Modify the code above to incorporate these two changes (scale and transparency).

# INSERT CODE HERE

Already we can more easily visualize the trend in the data.

Next, let’s overlay statistics by fitting a simple relationship to the data. Modify the code from the previous question by adding a geom_smooth layer.
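For reference, here is a minimal sketch of a geom_smooth layer on a different dataset. It assumes R's built-in mtcars data, and method = "lm" (a straight-line fit) is just one choice of smoother.

```r
library(ggplot2)

# Scatterplot of car weight against fuel economy, with a linear fit on top.
p_smooth <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm")  # fitted line plus a shaded confidence band

p_smooth
```

Leaving out the method argument lets ggplot2 pick a default smoother based on the size of the data.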

# INSERT CODE HERE

Again, we can add a layer of detail by introducing separate colors for each continent. We can also create individual trendlines for each continent, instead of only one trendline.

Modify the code from the previous question to add these elements.
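One way to get a separate trendline per group is sketched below on a different dataset. It assumes R's built-in mtcars data, with the number of cylinders standing in for the grouping variable. Because color is mapped in the global aes(), both layers inherit it, and geom_smooth() fits one line per color group.

```r
library(ggplot2)

# factor(cyl) turns the numeric cylinder count into a discrete grouping.
p_by_group <- ggplot(data = mtcars,
                     mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # one fit per group, no band

p_by_group
```

Mapping color only inside geom_point() would instead keep colored points with a single overall trendline.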

# INSERT CODE HERE

When you are finished, save the notebook as Exercise5.ipynb, push it to your class GitHub repository and send the instructors a link to your notebook via Canvas.

DUE: 5pm EST, Feb 14, 2024

IMPORTANT: Did you collaborate with anyone on this assignment? If so, list their names here.

Someone’s Name