Getting Started with Data Visualizations in R (Part 2)


In our last post on Getting Started with Data Visualizations in R, we went over how to start using ggplot2 in R. We learned how to set up a basic scatterplot and how to change the colors of the points via a variety of methods. We also learned how to update our plot for categorical variables, and how to add labels, change fonts, and alter the legend.

Today, we're continuing with Part 2 of this tutorial! In this post, we'll look at some other common plots and their variations. We'll also learn how to arrange multiple plots as panels in a single figure. Finally, we'll go over how to save our plots.

If you're new to R, I have a tutorial on getting started with coding in R in a two-part series (Part 1 and Part 2). This series will get you up to speed on installing and using R and RStudio so you can follow along with this post!

Load data and the ggplot2 package

Let's get started by loading the ggplot2 library and the airquality dataset again! Since methods for handling missing data are beyond the scope of this tutorial, we'll omit the observations with NA values again for now.

library(ggplot2)
data("airquality")
airquality.complete <- na.omit(airquality)

Before we move away from the scatterplot, there's a variation that we might occasionally find very useful. One way to highlight a trend in the data is by adding a line through the points on the scatterplot.

How to overlay a line on a scatterplot

Let's see how to do this! First, we'll initialize a basic plot and add some points to it by adding a geom_point() layer again. We'll also use the theme_minimal() layer again to adjust the overall style of our plot.

p <- ggplot(data=airquality.complete, aes(Temp, Ozone)) + 
  geom_point() +
  theme_minimal()
p
Basic scatter plot of airquality dataset

We can add a smooth line that runs through the points in our scatterplot by adding a geom_smooth() layer. There are a number of ways to find a smooth line that runs through the points. For more details on the different methods available in the geom_smooth() layer, we can refer to the documentation for geom_smooth().

How to fit a LOESS line through the points

In our case, we have fewer than 1,000 observations so if we don't specify which method we want to use, geom_smooth() will default to fitting a locally estimated scatterplot smoothing (LOESS) line.

p + geom_smooth()
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Add LOESS line to the basic scatterplot of airquality dataset

The gray band around the line here shows a confidence interval around the smoothed line. We worked a bit with confidence intervals in our post on uncertainty quantification with the Central Limit Theorem. If you want to read a little more about confidence intervals, please refer to the explanation and corresponding R code at the bottom of that post.

If we don't want to plot the confidence interval, we can specify se=FALSE inside the geom_smooth() layer. We can also specify the color of the line by specifying a value for colour inside the geom_smooth() layer.

p + geom_smooth(se=FALSE, colour="orange")
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Add LOESS line to scatterplot without the confidence interval
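
To get a feel for what the smoothed line represents, here is a minimal sketch that fits a LOESS curve directly with base R's loess() function and evaluates it at a few temperatures. It assumes the default loess() settings are close to what geom_smooth() uses for a dataset of this size, and the temperatures 60, 75, and 90 are arbitrary illustrative choices rather than anything from the original analysis.

```r
# Fit a LOESS curve to the complete airquality observations (base R only).
data("airquality")
aq <- na.omit(airquality)
loess_fit <- loess(Ozone ~ Temp, data = aq)

# Evaluate the smoothed curve at a few illustrative temperatures (assumed values).
new_temps <- data.frame(Temp = c(60, 75, 90))
loess_pred <- predict(loess_fit, newdata = new_temps)
print(loess_pred)
```

Each printed value is the height of the smoothed curve at that temperature, which is roughly what the line in the plot traces out.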

How to fit a least squares line through the points

If we want to change the method of the line, we can do that by specifying method inside the geom_smooth() layer. The geom_smooth() documentation in ggplot2 shows the different choices for these methods.

For example, let's say we want to fit a straight line through the data. This is often referred to as a least squares fit line. We can do that by specifying method="lm" inside the geom_smooth() layer. Here, lm stands for linear model and is also the name of the function, lm(), that fits a straight line through the data.

p + geom_smooth(method="lm", colour="purple")
#> `geom_smooth()` using formula 'y ~ x'
Add least squares line to scatterplot
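
If we want the numbers behind this straight line, one option is to fit the same kind of model ourselves with lm(). The sketch below is only an illustration of that idea: it regresses Ozone on Temp and prints the intercept and slope. The object names aq, ls_fit, and ls_coefs are made up for this example.

```r
# Fit a straight line (least squares) to Ozone as a function of Temp.
data("airquality")
aq <- na.omit(airquality)
ls_fit <- lm(Ozone ~ Temp, data = aq)

# coef() returns a named vector holding the intercept and the slope for Temp.
ls_coefs <- coef(ls_fit)
print(ls_coefs)
```

The slope tells us how much the fitted ozone level changes for each one-degree increase in temperature.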

How to plot histograms

Another common plot type is the histogram. We can plot histograms by adding a geom_histogram() layer. A histogram groups the values of a variable into bins and shows how many observations fall into each bin, which is great when we want to see how often different values occur.

For example, we can use a histogram to see how many days exhibit varying ozone levels (in parts per billion, or ppb). Below, we see that there were very few days for which the ozone level exceeded 100 ppb. In fact, the majority of the days had ozone levels less than 75 ppb.

ggplot(airquality.complete, aes(x=Ozone)) + 
  geom_histogram() +
  theme_minimal()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Basic histogram of ozone levels from airquality dataset

Notice that when we initiated this plot, we only specified x=Ozone in the aes() mapping. This is because histograms depict counts, or frequencies, on the secondary axis. If we had specified y=Ozone instead, our histogram orientation would be flipped so that the counts are depicted along the x-axis.
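
The observations we read off the histogram (few days above 100 ppb, most days below 75 ppb) can also be checked numerically. Here is a small base R sketch of that check; the variable names are just illustrative.

```r
# Count how many complete observations fall above or below the thresholds
# mentioned in the text.
data("airquality")
aq <- na.omit(airquality)

days_over_100 <- sum(aq$Ozone > 100)    # number of days above 100 ppb
share_under_75 <- mean(aq$Ozone < 75)   # proportion of days below 75 ppb

print(days_over_100)
print(share_under_75)
```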

How to change the number of bins in the histogram

The default number of bins in geom_histogram() is 30. We can change that by specifying a different number of bins via bins inside the geom_histogram() layer.

ggplot(airquality.complete, aes(x=Ozone)) + 
  geom_histogram(bins=10) +
  theme_minimal()
Histogram of ozone levels in airquality dataset with 10 bins
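
The message printed by the basic histogram also mentions binwidth. As a hedged sketch of that alternative, we can fix the width of each bin instead of the number of bins. The value of 20 ppb below is an arbitrary choice for illustration, not a recommendation.

```r
library(ggplot2)
data("airquality")
aq <- na.omit(airquality)

# binwidth = 20 is an assumed, illustrative value: each bin spans 20 ppb.
p_binwidth <- ggplot(aq, aes(x = Ozone)) +
  geom_histogram(binwidth = 20) +
  theme_minimal()
p_binwidth
```

Specifying binwidth can be easier to reason about than bins when the units of the variable are meaningful, as they are here.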

How to change the colors of the histogram

We can change the colors of the bins and their outlines by specifying fill and colour inside the geom_histogram() layer. The fill input dictates the color of the bin itself while the colour input dictates the color of the bin borders.

ggplot(airquality.complete, aes(x=Ozone)) + 
  geom_histogram(bins=10, colour="#c0cc00", fill="#f50a60") +
  theme_minimal()
Histogram of ozone levels in airquality dataset with colored bins and border

In the figure above, we've specified HEX values for the colors from this webpage showing HEX values for popular web colors. Previously, we went over many options and resources for changing colors in Part 1 of this tutorial. Those methods are generally applicable to visualizations made with ggplot2 so we can apply those methods (as appropriate for a particular variable type) to these plots as well.

If we want to see other elements we can alter when plotting histograms, we can find those in the documentation for geom_histogram() in ggplot2.

How to plot boxplots

Next, let's see how to make boxplots! You might have also heard these referred to as box-and-whisker plots.

Boxplots are great when we want to see the distribution of values in a continuous variable. We briefly talked about continuous and categorical variables in Part 1 of this tutorial when we talked about coloring based on values in the Month variable.

If we look at our data with the head() function, we see that Ozone, Solar.R, Wind, and Temp should be continuous while Month should be a factor. This is because the numbers in Month are placeholders for the months May through September. Meanwhile, the Day variable takes on discrete values since it can only take on integer values from 1 through 31.

head(airquality.complete)
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 7    23     299  8.6   65     5   7
#> 8    19      99 13.8   59     5   8

Let's say we want to see the distribution of values in the Ozone variable. We can plot a boxplot in ggplot2 by adding a geom_boxplot() layer. Below, we've specified y=Ozone because we wanted to plot a vertical boxplot. If we want to plot a horizontal boxplot, we could specify x=Ozone instead.

ggplot(airquality.complete, aes(y=Ozone)) +
  geom_boxplot() + 
  theme_minimal()
Single boxplot of ozone level in airquality dataset

How to interpret a boxplot

The default value for the middle line in the box is the median, or the 50th percentile, of the Ozone variable. The bottom of the box shows the first quartile, or the 25th percentile, of the Ozone variable. This is also referred to as the lower hinge. The top of the box shows the third quartile, or the 75th percentile. This is also referred to as the upper hinge.

The lines extending from the box are sometimes referred to as whiskers. The positions of the whiskers depend on the interquartile range (IQR). This is the difference between the third and first quartiles. The top (or upper) whisker extends until the largest value in Ozone that is within 1.5 \(\times\) IQR above the upper hinge. The bottom (or lower) whisker extends until the smallest value in Ozone that is within 1.5 \(\times\) IQR below the lower hinge.
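
To make these definitions concrete, here is a minimal base R sketch that computes the same quantities by hand. It assumes the hinges are taken from quantile() with its default settings; the variable names are illustrative only.

```r
# Compute the pieces of a boxplot for Ozone by hand (base R only).
data("airquality")
ozone <- na.omit(airquality)$Ozone

q1 <- quantile(ozone, 0.25, names = FALSE)   # lower hinge
q3 <- quantile(ozone, 0.75, names = FALSE)   # upper hinge
iqr <- q3 - q1                               # interquartile range

# Whiskers stop at the most extreme observations within 1.5 * IQR of the hinges.
upper_whisker <- max(ozone[ozone <= q3 + 1.5 * iqr])
lower_whisker <- min(ozone[ozone >= q1 - 1.5 * iqr])

# Observations beyond the whiskers are drawn as individual points.
outliers <- ozone[ozone > upper_whisker | ozone < lower_whisker]

print(c(lower_whisker, q1, median(ozone), q3, upper_whisker))
print(outliers)
```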

How to plot a boxplot for each level in a factor variable

If we want to plot a boxplot of the Ozone data for each month in airquality.complete, we can do that by specifying x=as.factor(Month). Just as in Part 1 of this tutorial, we have to tell R that the Month variable is a factor; otherwise, it will treat it as a continuous variable.

ggplot(airquality.complete, aes(x=as.factor(Month), y=Ozone)) +
  geom_boxplot() + 
  theme_minimal()
Boxplots of ozone levels in airquality dataset by month

How to update level names in a factor variable

Let's update our Month variable so that it is a factor and let's also recode the values in Month so that they show the actual month names. We can do both of those things simultaneously with the factor() function in R.

airquality.complete$Month <- factor(airquality.complete$Month, 
                                    labels=c("May", "Jun", "Jul", "Aug", "Sep"))

Now if we look at our data again using the head() function, we'll see that the entries for Month have been updated to the month names!

head(airquality.complete)
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67   May   1
#> 2    36     118  8.0   72   May   2
#> 3    12     149 12.6   74   May   3
#> 4    18     313 11.5   62   May   4
#> 7    23     299  8.6   65   May   7
#> 8    19      99 13.8   59   May   8
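
One detail that may be worth knowing: when we only pass labels, factor() pairs them with the sorted unique values in the data. A slightly more explicit variation, sketched below on a small made-up vector, also passes levels so the pairing between numbers and names is visible in the code. The vector month_numbers is hypothetical and deliberately unsorted.

```r
# Hypothetical month numbers, deliberately out of order.
month_numbers <- c(9, 5, 7, 5, 6)

# levels says which values to expect; labels gives the name for each level.
month_factor <- factor(month_numbers,
                       levels = 5:9,
                       labels = c("May", "Jun", "Jul", "Aug", "Sep"))
print(month_factor)
```

Because the levels are listed explicitly, "Aug" is still a valid level even though no 8 appears in this small vector.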

Now when we remake our plot, the month names will show up along the x-axis. Let's also use what we learned from the last post and rename the axes labels!

ggplot(airquality.complete, aes(x=Month, y=Ozone)) +
  geom_boxplot() + 
  theme_minimal() +
  xlab("Month") +
  ylab("Ozone (in ppb)")
Boxplots of ozone level in airquality dataset by month as factor and labels

How to change the fill colors of the boxplots

Finally, we can add some colors to these boxplots just as we did in Part 1 of this tutorial. To do this, we need to specify the variable we want to use to determine the fill of the boxplots. In this case, we want to fill in the color of the boxplots based on the Month variable so we will specify fill=Month.

ggplot(airquality.complete, aes(x=Month, y=Ozone, fill=Month)) +
  geom_boxplot() + 
  theme_minimal() +
  xlab("Month") +
  ylab("Ozone (in ppb)")
Boxplots of ozone levels by months with default fill colors

If we want to specify a different color palette, we can do that by adding a scale_fill_brewer() layer and specifying the palette choice within that layer. Below, we are again using the Set3 color brewer palette.

ggplot(airquality.complete, aes(x=Month, y=Ozone, fill=Month)) +
  geom_boxplot() + 
  theme_minimal() +
  xlab("Month") +
  ylab("Ozone (in ppb)") + 
  scale_fill_brewer(palette="Set3")
Boxplots of ozone levels by month with Set3 fill colors

Notice that we specified the fill rather than the colour here. By contrast, when we were changing the colors of the scatterplot points, we specified the colour. This is because ggplot2 uses colour for the color of lines and points, and uses fill for the interior color of shapes such as boxes and bars.

How to change the border colors of the boxplots

If we additionally specify colour, we can change the color of the lines on our boxplots. If we don't specify a color brewer palette for the colors via a scale_color_brewer() layer, the line colors will be the default colors for factors.

ggplot(airquality.complete, aes(x=Month, y=Ozone, fill=Month, color=Month)) +
  geom_boxplot() + 
  theme_minimal() +
  xlab("Month") +
  ylab("Ozone (in ppb)") + 
  scale_fill_brewer(palette="Set3")
Boxplots of ozone levels in airquality dataset with Set3 fill and default border colors

We can change these colors by adding a scale_color_brewer() layer. If we don't want to fill in the color of the boxplots, we can omit the fill=Month mapping from the base plot and the scale_fill_brewer() layer.

ggplot(airquality.complete, aes(x=Month, y=Ozone, color=Month)) +
  geom_boxplot() + 
  theme_minimal() +
  xlab("Month") +
  ylab("Ozone (in ppb)") + 
  scale_color_brewer(palette="Dark2")
Boxplots of ozone levels in airquality dataset by month with only border colors

For more variations on boxplots, we can refer to the documentation on geom_boxplot() in ggplot2.

How to plot bar plots

Another useful plot is the bar plot, also called a bar chart in ggplot2. To illustrate the usefulness of this plot, we'll use medal counts from the 2022 Winter Olympics! I saved the Wikipedia medal table for the 2022 Winter Olympics in a text file so that we can load the data into R.

How to load your own data into R

We'll use the read.csv() function in R to load this data into our workspace.

olympics <- read.csv("https://thebitwise.org/olympic_medals_2022/", 
                     sep="\t", header=FALSE)
colnames(olympics) <- c("country", "gold", "silver", "bronze", "total")
head(olympics)
#>         country gold silver bronze total
#> 1        Norway   16      8     13    37
#> 2       Germany   12     10      5    27
#> 3         China    9      4      2    15
#> 4 United States    8     10      7    25
#> 5        Sweden    8      5      5    18
#> 6   Netherlands    8      5      4    17

A quick look at our olympics data with the head() function shows us that the data contains the number of gold, silver, bronze, and total medal counts for each country. We can plot the total medal count by country by adding a geom_bar() layer to a base ggplot plot.
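
A short aside on the loading step: since the file is tab-separated, read.delim() is a possible shortcut, because it uses a tab as its default separator. The sketch below compares the two approaches on two rows copied from the table above, supplied as inline text rather than read from the URL.

```r
# Two tab-separated rows copied from the medal table shown above.
medal_text <- "Norway\t16\t8\t13\t37\nGermany\t12\t10\t5\t27"

# read.csv() with sep = "\t" and read.delim() should parse the text the same way.
via_csv <- read.csv(text = medal_text, sep = "\t", header = FALSE)
via_delim <- read.delim(text = medal_text, header = FALSE)
print(via_delim)
```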

How to plot a basic bar plot and rotate axis tick labels

The default behavior in geom_bar() is similar to the histogram: it counts the number of rows for each value on the x-axis. To get a bar plot that uses the actual values in a variable, we insert stat="identity" inside the geom_bar() layer. Alternatively, we can use geom_col() instead.

In the bar plot below, we've rotated the country names by 45 degrees on the x-axis so that they don't overlap with each other.

ggplot(olympics, aes(x=country, y=total)) + 
  geom_bar(stat="identity") + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust=1, hjust=1))
Barplot of medal counts in 2022 Winter Olympics by country

Notice that we used both the theme_minimal() layer to adjust the overall plot style and the theme() layer to rotate the x-axis tick labels. If we had applied the theme_minimal() layer after the theme() layer, the settings within theme_minimal() would have overwritten our text rotations. In this case, order makes a difference!
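
By default, the countries appear in alphabetical order along the x-axis. If we would rather sort the bars by medal count, one possible approach is to wrap the x variable in reorder(). The sketch below uses only the first six rows of the medal table shown earlier, typed in by hand, so the data frame medals is a small stand-in for the full olympics data.

```r
library(ggplot2)

# First six rows of the medal table shown earlier in this post.
medals <- data.frame(
  country = c("Norway", "Germany", "China", "United States", "Sweden", "Netherlands"),
  total   = c(37, 27, 15, 25, 18, 17)
)

# reorder(country, -total) sorts the factor levels by decreasing total.
ordered_levels <- levels(reorder(medals$country, -medals$total))
print(ordered_levels)

p_ordered <- ggplot(medals, aes(x = reorder(country, -total), y = total)) +
  geom_col() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  xlab("country")
p_ordered
```

The xlab() layer is there because the default axis title would otherwise show the full reorder() expression.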

How to plot a stacked bar plot

This plot is interesting but maybe we want to depict more than just the total medal count. If we want to show the breakdown of the medal counts by their colors, we need to first reshape our medal color variables into a single variable to input into the ggplot() function.

We can do this via the melt() function in the reshape2 package. To use this function, we first install the reshape2 package with the following.

install.packages("reshape2")

We then load it and use the melt() function to combine the gold, silver, and bronze variables into a single variable by country. In the code snippet below, we use all but the total column in the olympics dataset. Since this is the 5th column, we can exclude it with olympics[,-5]. If we included total, our bar plot would double count each medal since the gold, silver, and bronze counts add up to the total count.

How to reshape data for a stacked bar plot

The melt() function keeps the name of the id variable (country here) and automatically names the two new columns variable and value. For more descriptive names, we'll rename all three columns with the colnames() function and the <- assignment in R.

library(reshape2)
olympics.long <- melt(olympics[,-5], id="country")
colnames(olympics.long) <- c("Country", "Medal", "Count")
head(olympics.long)
#>         Country Medal Count
#> 1        Norway  gold    16
#> 2       Germany  gold    12
#> 3         China  gold     9
#> 4 United States  gold     8
#> 5        Sweden  gold     8
#> 6   Netherlands  gold     8
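
To see what this reshaping does without any extra packages, here is a rough base R sketch that builds the same kind of long table by hand for two countries. It is only meant to illustrate the wide-to-long idea, not to replace melt(); the data frame medals holds two rows copied from the table above.

```r
# Two rows copied from the medal table, in the original wide layout.
medals <- data.frame(
  country = c("Norway", "Germany"),
  gold = c(16, 12), silver = c(8, 10), bronze = c(13, 5), total = c(37, 27)
)

# Stack the three medal columns on top of each other, leaving out total.
medal_types <- c("gold", "silver", "bronze")
medals_long <- data.frame(
  Country = rep(medals$country, times = length(medal_types)),
  Medal   = rep(medal_types, each = nrow(medals)),
  Count   = unlist(medals[medal_types], use.names = FALSE)
)
print(medals_long)
```

Each country now appears once per medal type, and the counts still add up to the original totals.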

Now we're ready to make our more informative barplot! In the figure below, we now have the total medal count by country with bar colors based on the number of gold, silver, and bronze medals!

ggplot(olympics.long, aes(x=Country, y=Count)) + 
  geom_bar(stat="identity", aes(fill=Medal)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust=1, hjust=1))
Barplots of medal counts in 2022 Winter Olympics by country and medal type

How to manually change the colors in a stacked bar plot

Although this is an improvement, it is a little confusing that gold, silver, and bronze are depicted by other colors. To remedy this, we also want to adjust the fill colors so that the colors on the plot match the medal color names.

To do that, we'll use something we learned when we went over how to manually change colors in Part 1 of this tutorial. We'll add a scale_fill_manual() layer with HEX values for the colors we want for gold, silver, and bronze. We'll store those HEX values in olympic_colors.

olympic_colors <- c("#d4af37", "#c0c0c0", "#897129")
ggplot(olympics.long, aes(x=Country, y=Count)) + 
  geom_bar(stat="identity", aes(fill=Medal)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust=1, hjust=1),
        legend.position="top") +
  scale_fill_manual(values=olympic_colors) +
  ylab("Number of Olympic Medals in 2022")
Barplot of medal counts in 2022 Winter Olympics by country and medal type with colors fixed

In the plot above, we've also relocated the legend to above the plot. For more details on adjustments we can make to bar plots, we can refer to the documentation for geom_bar() in ggplot2.
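
One thing to keep in mind is that an unnamed vector of colors is matched to the factor levels by position. A slightly more defensive variation, sketched below, names each color so the match is done by medal type instead. The data frame medals_long is a small hand-typed stand-in with a character Medal column, where the default level order would be alphabetical.

```r
library(ggplot2)

# Small stand-in data; Medal is a character column here, so its default
# ordering is alphabetical (bronze, gold, silver).
medals_long <- data.frame(
  Country = rep(c("Norway", "Germany"), times = 3),
  Medal   = rep(c("gold", "silver", "bronze"), each = 2),
  Count   = c(16, 12, 8, 10, 13, 5)
)

# Naming the colors ties each HEX value to a medal type rather than a position.
named_colors <- c(gold = "#d4af37", silver = "#c0c0c0", bronze = "#897129")

p_named <- ggplot(medals_long, aes(x = Country, y = Count, fill = Medal)) +
  geom_col() +
  theme_minimal() +
  scale_fill_manual(values = named_colors)
p_named
```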

How to plot line graphs

There's one more very common plot type that we might find useful. This one is a line graph, or line chart, and is very common when depicting data that changes with time. To see this, we'll return to our airquality.complete data, where we can plot a line graph with Day on the x-axis and Ozone on the y-axis. We'll do this by adding a geom_line() layer.

ggplot(airquality.complete, aes(x=Day, y=Ozone)) +
  geom_line()
Confusing line graph

The figure above is confusing and doesn't seem to make much sense. That's because our airquality.complete data can potentially contain an Ozone level for each Day in each of the five months in Month!

How to plot multiple lines in a single plot

Connecting all these points does not make much sense so let's split these out by Month. We can do this by specifying group=Month inside the aes() mapping. To make these lines easier to see, we'll additionally color them by month via colour=Month inside the aes() mapping.

ggplot(airquality.complete, aes(x=Day, y=Ozone, group=Month, colour=Month)) +
  geom_line()
Line graph of ozone levels by day with a separate line for each month

This new plot shows the Ozone levels for each Day, with a separate line for each Month. The five lines for the five months overlap quite a bit, however, so it's still a little difficult to distinguish them.

To depict this information even more clearly, we can split out the line graphs by Month onto separate panels.

How to plot multiple plots in one plot with facets

To split the line graphs by Month onto separate panels, we can add a facet_grid() layer. Inside this layer, we specify .~Month to indicate that we want one column of panels for each level of Month, so the panels sit side by side. The formula reads rows~columns, so if we additionally wanted to split the panels by a second factor variable, we could do that by specifying the variable name in place of . in .~Month.

ggplot(airquality.complete, aes(x=Day, y=Ozone, group=Month, colour=Month)) +
  geom_line() + 
  facet_grid(.~Month) + 
  theme_bw()
Facet grid of ozone level by day with separate panel for each month

Since the facets are automatically labeled with the month names, our legend for the Month colors is now redundant. As we did in Part 1, we can remove the legend by adding a theme(legend.position = "none") layer after the theme_bw() layer.

We can also switch to plotting points instead of lines by swapping to the geom_point() layer. Finally, we can swap the facet orientation to see what the plot looks like when we split the panels along the other axis.

ggplot(airquality.complete, aes(x=Temp, y=Ozone, group=Month, colour=Month)) +
  geom_point() + 
  facet_grid(Month~.) +
  theme_bw() +
  theme(legend.position = "none")
Facet grid with scatterplot of ozone levels by temperature with separate panel for each month

How to save plots to files

We covered a lot of different plot types in this post! Let's wrap up by looking at how to save these plots to files. There are a number of ways to do this. For now, we'll use the ggsave() function from the ggplot2 package.

A quick look at the ggsave() documentation via ?ggsave shows us that at a minimum, we need to specify a file name for the saved plot. If we don't specify a plot object, such as p above, ggsave() will default to saving the last plot we made.

ggsave("myplot.png")

Other parameters to alter when saving plots as files

The ggsave documentation also shows us that we can specify different file types including eps, ps, pdf, jpeg, tiff, and png via the device input. Additionally, we can also specify the width and height of our exported file.

If we want to save to a particular directory, we can specify that with the path input. If we don't specify path, ggsave() will default to saving the plot in the current working directory. We can locate our current working directory by typing getwd() into the R console. Finally, if we plan to use the saved plot for publication, we can also specify the resolution we need with the dpi input.
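
Putting a few of these inputs together, a call might look something like the sketch below. The file name, the dimensions, and the use of a temporary directory are all assumptions made for illustration. We save to PDF here; for raster formats such as png, the dpi input would additionally control the resolution.

```r
library(ggplot2)
data("airquality")
aq <- na.omit(airquality)

# Rebuild a simple scatterplot so that this example is self-contained.
p_save <- ggplot(aq, aes(Temp, Ozone)) +
  geom_point() +
  theme_minimal()

# File name, size, and directory are illustrative assumptions.
out_dir <- tempdir()
ggsave("ozone_vs_temp.pdf", plot = p_save, path = out_dir,
       width = 6, height = 4, units = "in")

# Confirm where the file ended up.
saved_file <- file.path(out_dir, "ozone_vs_temp.pdf")
file.exists(saved_file)
```

Passing the plot object explicitly avoids relying on whichever plot happened to be drawn last.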

Great job!

That was a lot of material we just covered! We learned how to overlay smooth lines on scatterplots. We also learned how to plot histograms, boxplots, bar plots, and line graphs. Finally, we went over facets and how we can use them to plot multiple plots in a single plot, and how to save our plots to files.

Now that we're pretty comfortable with the basics of making data visualizations in R with ggplot2, we'll move on to exploratory data analysis! In our next post, we'll practice and explore the things we learned in this tutorial with a code lab.