Exercise 5: Using ggplot
This homework assignment is designed to get you comfortable working with ggplot for generating data visualizations.
We will be using the gapminder dataset. It contains information about population, life expectancy and per capita GDP by country over time.
1. Color, plot type and layers (6 points)
Install and load the gapminder dataset. Look at the first few rows of the data frame.
# INSERT CODE HERE
Now, let’s create a basic scatterplot using ggplot2 that shows how life expectancy has changed over time.
# INSERT CODE HERE
We can add another layer of detail by using color to indicate continent. Modify the code from the previous question to do so.
What trends can you identify in the data?
# INSERT CODE HERE
Write your response here.
A scatterplot probably isn’t the best type of plot for visualizing change over time. Instead, modify the code from the previous question to create a line plot.
# INSERT CODE HERE
Hmm, this plot looks a bit weird and unexpected, making it difficult for us to easily interpret the data trends. What is causing this?
Write your response here.
Now, let’s try to separate the data by country, plotting one line for each country. Modify the code from the previous plot to do so.
# INSERT CODE HERE
But what if we want to visualize both lines and points on our graph? There are two primary ways to do this, and both take advantage of what we know about layers in ggplot: each layer is drawn on top of the previous layer.
Method 1: Plot black points on top of the colorful lines
Method 2: Plot black points underneath the colorful lines
Modify the code from the previous question to create two plots, one for each method.
Hint: to control the color of the points, think about where the aesthetic color attribute should be located e.g., in the global plot options or in a specific layer.
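To make the hint concrete, here is a minimal sketch on a different dataset so the exercise itself is left to you. It assumes the built-in `Orange` data frame (columns `Tree`, `age`, `circumference`) as a stand-in for gapminder; the object names `p_global` and `p_layer` are made up for illustration.

```r
library(ggplot2)

# Orange is a built-in R dataset (growth of five orange trees), used here
# only as a stand-in for gapminder.

# Color mapped in the global aes(): every layer inherits it,
# so both the lines and the points are colored by tree.
p_global <- ggplot(Orange, aes(x = age, y = circumference, color = Tree)) +
  geom_line() +
  geom_point()

# Color mapped only inside geom_line(): the points do not inherit it
# and fall back to the default black. geom_point() is added last,
# so the points are drawn on top of the lines.
p_layer <- ggplot(Orange, aes(x = age, y = circumference)) +
  geom_line(aes(color = Tree)) +
  geom_point()

p_global
p_layer
```

Swapping the order of the two geom layers in `p_layer` would draw the points first and the lines over them.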
# INSERT CODE HERE
# method 1
# INSERT CODE HERE
# method 2
# Aside: colorful points
2. Adding statistics (4 points)
ggplot2 allows easy overlay of statistical models on top of the data.
The graph below shows the relationship between life expectancy and GDP per capita:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
However, the data points are squished close together on the left side of the graph, so it’s hard to see the actual relationship we’re interested in.
To fix this, we can change the scale of the x-axis using one of the scale functions. We can also make the data points transparent by setting alpha, which is helpful when a large amount of data is clustered together.
Modify the code above to incorporate these two changes (scale and transparency).
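As an illustration of both ideas on other data, the sketch below uses the `diamonds` dataset that ships with ggplot2. It transforms the y-axis rather than the x-axis, and the alpha value of 0.1 is an arbitrary choice, so treat it as a pattern to adapt and not as the answer.

```r
library(ggplot2)

# diamonds ships with ggplot2; prices are heavily right-skewed,
# so most points crowd the bottom of a linear y-axis.
p_diamonds <- ggplot(diamonds, aes(x = carat, y = price)) +
  # alpha is set as a fixed value (outside aes), so every point is
  # equally transparent and dense regions appear darker.
  geom_point(alpha = 0.1) +
  # Plot price on a log10 scale to spread out the crowded low end.
  scale_y_log10()

p_diamonds
```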
# INSERT CODE HERE
Already we can more easily visualize the trend in the data.
Next, let’s overlay statistics by fitting a simple relationship to the data. Modify the code from the previous question by adding a geom_smooth layer.
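For reference, a smoothing layer is added like any other layer. The sketch below assumes the built-in `mtcars` data frame and a straight-line fit (`method = "lm"`); both are illustrative choices, and other smoothing methods may suit your data better.

```r
library(ggplot2)

# mtcars is a built-in R dataset: car weight (wt) versus fuel economy (mpg).
p_smooth <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Fit a linear model to the data and draw the fitted line;
  # the shaded band is the confidence interval ggplot2 adds by default.
  geom_smooth(method = "lm")

p_smooth
```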
# INSERT CODE HERE
Again, we can add a layer of detail by introducing separate colors for each continent. We can also create individual trendlines for each continent, instead of only one trendline.
Modify the code from the previous question to add these elements.
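Whether you get one trendline or several depends on where the color aesthetic is mapped. The sketch below shows both cases on the built-in `mtcars` data, treating the number of cylinders as the grouping variable; the names `p_one_line` and `p_per_group` are hypothetical.

```r
library(ggplot2)

# Color mapped only in the point layer: geom_smooth() sees no grouping,
# so it fits a single trendline through all of the data.
p_one_line <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl))) +
  geom_smooth(method = "lm")

# Color mapped globally: geom_smooth() inherits the grouping,
# so it fits one trendline per cylinder group.
p_per_group <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm")

p_one_line
p_per_group
```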
# INSERT CODE HERE
When you are finished, save the notebook as Exercise5.ipynb, push it to your class GitHub repository and send the instructors a link to your notebook via Canvas.
DUE: 5pm EST, Feb 14, 2024
IMPORTANT: Did you collaborate with anyone on this assignment? If so, list their names here.
Someone’s Name