Getting Started with Data Visualizations in R (Part 2)
In our last post on Getting Started with Data Visualizations in R, we went over how to start using
ggplot2 in R. We learned how to set up a basic scatterplot and how to change the colors of the points via a variety of methods. We also learned how to update our plot for categorical variables, and how to add labels, change fonts, and alter the legend.
Today, we're continuing with Part 2 of this tutorial! In this post, we'll look at some other common plots and their variations. We'll also learn how to plot multiple plots in one plot. Finally, we'll go over how to save our plots.
If you're new to R, I have a tutorial on getting started with coding in R in a two-part series (Part 1 and Part 2). This series will get you up to speed on installing and using R and RStudio so you can follow along with this post!
Load data and the ggplot2 library
Let's get started by loading the
ggplot2 library and the
airquality dataset again! Since methods for handling missing data are beyond the scope of this tutorial, we'll omit the observations with
NA values again for now.
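As a minimal sketch of this setup step, assuming the incomplete rows are dropped with na.omit() (Part 1 may have used a different approach), the code below loads ggplot2 and builds the airquality.complete data frame used throughout this post.

```r
# Load the plotting library
library(ggplot2)

# airquality ships with base R; drop every row that contains an NA
# (assumption: na.omit() is how the complete-case data was created)
airquality.complete <- na.omit(airquality)

# Check that no missing values remain
sum(is.na(airquality.complete))
```

The final line should report zero missing values if the rows were dropped as intended.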
Before we move away from the scatterplot, there's a variation that we might occasionally find very useful. One way to highlight a trend in the data is by adding a line through the points on the scatterplot.
How to overlay a line on a scatterplot
Let's see how to do this! First, we'll initialize a basic plot and add some points to it by adding a
geom_point() layer again. We'll also use the
theme_minimal() layer again to adjust the overall style of our plot.
p <- ggplot(data=airquality.complete, aes(Temp, Ozone)) + geom_point() + theme_minimal()
p
We can add a smooth line that runs through the points in our scatterplot by adding a
geom_smooth() layer. There are a number of ways to find a smooth line that runs through the points. For more details on the different methods available in the
geom_smooth() layer, we can refer to the documentation for geom_smooth().
How to fit a LOESS line through the points
p + geom_smooth()
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The gray band around the line here shows a confidence interval around the smoothed line. We worked a bit with confidence intervals in our post on uncertainty quantification with the Central Limit Theorem. If you want to read a little more about confidence intervals, please refer to the explanation and corresponding R code at the bottom of that post.
If we don't want to plot the confidence interval, we can specify
se=FALSE inside the
geom_smooth() layer. We can also specify the color of the line by specifying a value for
colour inside the geom_smooth() layer.
p + geom_smooth(se=FALSE, colour="orange")
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
How to fit a least squares line through the points
If we want to change the method of the line, we can do that by specifying
method inside the
geom_smooth() layer. The geom_smooth() documentation in ggplot2 shows the different choices for these methods.
For example, let's say we want to fit a straight line through the data. This is often referred to as a least squares fit line. We can do that by specifying
method=lm inside the
geom_smooth() layer. Here,
lm stands for linear model and is also the name of the function,
lm(), that fits a straight line through the data.
p + geom_smooth(method="lm", colour="purple")
#> `geom_smooth()` using formula 'y ~ x'
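To see the straight line behind this plot as numbers, we can call lm() directly. This is only a sketch, and it assumes airquality.complete was built with na.omit(); the object name fit is a placeholder introduced here for illustration.

```r
# Assumption: complete cases were obtained with na.omit()
airquality.complete <- na.omit(airquality)

# Fit Ozone as a linear function of Temp using least squares
fit <- lm(Ozone ~ Temp, data=airquality.complete)

# Intercept and slope of the fitted straight line
coef(fit)
```

The slope for Temp tells us how much the fitted Ozone level changes for each one-degree increase in temperature.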
How to plot histograms
Another common plot type is the histogram. We can plot histograms by adding a
geom_histogram() layer. This is great when we want to depict count data, such as how many times a particular event occurs.
For example, we can use a histogram to see how many days exhibit varying ozone levels (in parts per billion, or ppb). Below, we see that there were very few days for which the ozone level exceeded 100 ppb. In fact, the majority of the days had ozone levels less than 75 ppb.
ggplot(airquality.complete, aes(x=Ozone)) + geom_histogram() + theme_minimal()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Notice that when we initiated this plot, we only specified
x=Ozone in the
aes() mapping. This is because histograms depict counts, or frequencies, on the secondary axis. If we had specified
y=Ozone instead, our histogram orientation would be flipped so that the counts are depicted along the x-axis.
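Here is a short sketch of that flipped version. It assumes a reasonably recent version of ggplot2 (3.3.0 or later, which infers the orientation from the mapped axis) and that airquality.complete was built with na.omit(). The name p.flipped is a placeholder for this example.

```r
library(ggplot2)

# Assumption: complete cases were obtained with na.omit()
airquality.complete <- na.omit(airquality)

# Mapping Ozone to y puts the ozone bins on the y-axis
# and the counts on the x-axis
p.flipped <- ggplot(airquality.complete, aes(y=Ozone)) + geom_histogram() + theme_minimal()
p.flipped
```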
How to change the number of bins in the histogram
ggplot(airquality.complete, aes(x=Ozone)) + geom_histogram(bins=10) + theme_minimal()
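As the earlier message from stat_bin() hints, we could also set the width of each bin instead of the number of bins. The sketch below uses a binwidth of 10 ppb, which is an arbitrary value chosen for illustration, and assumes airquality.complete was built with na.omit().

```r
library(ggplot2)

# Assumption: complete cases were obtained with na.omit()
airquality.complete <- na.omit(airquality)

# Each bin now spans 10 ppb of ozone (an illustrative choice)
p.binwidth <- ggplot(airquality.complete, aes(x=Ozone)) + geom_histogram(binwidth=10) + theme_minimal()
p.binwidth
```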
How to change the colors of the histogram
We can change the colors of the bins and their outlines by specifying
fill and colour inside the
geom_histogram() layer. The
fill input dictates the color of the bin itself while the
colour input dictates the color of the bin borders.
ggplot(airquality.complete, aes(x=Ozone)) + geom_histogram(bins=10, colour="#c0cc00", fill="#f50a60") + theme_minimal()
In the figure above, we've specified HEX values for the colors from this webpage showing HEX values for popular web colors. Previously, we went over many options and resources for changing colors in Part 1 of this tutorial. Those methods are generally applicable to visualizations made with
ggplot2 so we can apply those methods (as appropriate for a particular variable type) to these plots as well.
If we want to see other elements we can alter when plotting histograms, we can find those in the documentation for geom_histogram() in ggplot2.
How to plot boxplots
Next, let's see how to make boxplots! You might have also heard these referred to as box-and-whisker plots.
Boxplots are great when we want to see the distribution of values in a continuous variable. We briefly talked about continuous and categorical variables in Part 1 of this tutorial when we talked about coloring points based on the values in a variable.
If we look at our data with the
head() function, we see that
Temp should be continuous while
Month should be a factor. This is because the numbers in
Month are placeholders for the months May through September. Meanwhile, the
Day variable takes on discrete values since it can only take on integer values from 1 through 31.
head(airquality.complete)
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 7    23     299  8.6   65     5   7
#> 8    19      99 13.8   59     5   8
Let's say we want to see the distribution of values in the
Ozone variable. We can plot a boxplot in
ggplot2 by adding a
geom_boxplot() layer. Below, we've specified
y=Ozone because we wanted to plot a vertical boxplot. If we want to plot a horizontal boxplot, we could specify x=Ozone instead.
ggplot(airquality.complete, aes(y=Ozone)) + geom_boxplot() + theme_minimal()
How to interpret a boxplot
The default value for the middle line in the box is the median, or the 50th percentile, of the
Ozone variable. The bottom of the box shows the first quartile, or the 25th percentile, of the
Ozone variable. This is also referred to as the lower hinge. The top of the box shows the third quartile, or the 75th percentile. This is also referred to as the upper hinge.
The lines extending from the box are sometimes referred to as whiskers. The positions of the whiskers depend on the interquartile range (IQR). This is the difference between the third and first quartiles. The top (or upper) whisker extends until the largest value in
Ozone that is within 1.5 \(\times\) IQR above the upper hinge. The bottom (or lower) whisker extends until the smallest value in
Ozone that is within 1.5 \(\times\) IQR below the lower hinge.
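To make these definitions concrete, here is a small sketch that computes the same quantities by hand. It assumes airquality.complete was built with na.omit() and uses R's default quantile() method, so the values should be close to what the boxplot displays but are not guaranteed to match it in every edge case. All variable names below are placeholders for illustration.

```r
# Assumption: complete cases were obtained with na.omit()
airquality.complete <- na.omit(airquality)
ozone <- airquality.complete$Ozone

# Hinges and median
q1  <- unname(quantile(ozone, 0.25))  # lower hinge (25th percentile)
med <- median(ozone)                  # middle line (50th percentile)
q3  <- unname(quantile(ozone, 0.75))  # upper hinge (75th percentile)

# Interquartile range: third quartile minus first quartile
iqr <- q3 - q1

# Whiskers: the most extreme observations within 1.5 x IQR of the hinges
upper.whisker <- max(ozone[ozone <= q3 + 1.5 * iqr])
lower.whisker <- min(ozone[ozone >= q1 - 1.5 * iqr])

# Observations beyond the whiskers are drawn as individual points
outliers <- ozone[ozone > upper.whisker | ozone < lower.whisker]

c(lower.whisker=lower.whisker, q1=q1, median=med, q3=q3, upper.whisker=upper.whisker)
```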
How to plot a boxplot for each level in a factor variable
If we want to plot a boxplot of the
Ozone data for each month in
airquality.complete, we can do that by specifying
x=as.factor(Month). Just as in Part 1 of this tutorial, we have to tell R that the
Month variable is a factor; otherwise, it will treat it as a continuous variable.
ggplot(airquality.complete, aes(x=as.factor(Month), y=Ozone)) + geom_boxplot() + theme_minimal()
How to update level names in a factor variable
Let's update our
Month variable so that it is a factor and let's also recode the values in
Month so that they show the actual month names. We can do both of those things simultaneously with the
factor() function in R.
airquality.complete$Month <- factor(airquality.complete$Month, labels=c("May", "Jun", "Jul", "Aug", "Sep"))
Now if we look at our data again using the
head() function, we'll see that the entries for
Month have been updated to the month names!
head(airquality.complete)
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67   May   1
#> 2    36     118  8.0   72   May   2
#> 3    12     149 12.6   74   May   3
#> 4    18     313 11.5   62   May   4
#> 7    23     299  8.6   65   May   7
#> 8    19      99 13.8   59   May   8
Now when we remake our plot, the month names will show up along the x-axis. Let's also use what we learned from the last post and rename the axis labels!
ggplot(airquality.complete, aes(x=Month, y=Ozone)) + geom_boxplot() + theme_minimal() + xlab("Month") + ylab("Ozone (in ppb)")
How to change the fill colors of the boxplots
Finally, we can add some colors to these boxplots just as we did in Part 1 of this tutorial. To do this, we need to specify the variable we want to use to determine the fill of the boxplots. In this case, we want to fill in the color of the boxplots based on the
Month variable, so we will specify fill=Month in the aes() mapping.
ggplot(airquality.complete, aes(x=Month, y=Ozone, fill=Month)) + geom_boxplot() + theme_minimal() + xlab("Month") + ylab("Ozone (in ppb)")
If we want to specify a different color palette, we can do that by adding a
scale_fill_brewer() layer and specifying the palette choice within that layer. Below, we are again using the
Set3 color brewer palette.
ggplot(airquality.complete, aes(x=Month, y=Ozone, fill=Month)) + geom_boxplot() + theme_minimal() + xlab("Month") + ylab("Ozone (in ppb)") + scale_fill_brewer(palette="Set3")
Notice that we specified the fill rather than the colour here. By contrast, when we were changing the colors of the scatterplot points, we specified the colour. This is because ggplot2 uses
colour to denote the color of lines and points, and uses fill to denote the color when we fill in something like a box.
How to change the border colors of the boxplots
If we additionally specify
colour, we can change the color of the lines on our boxplots. If we don't specify a color brewer palette for the colors via a
scale_color_brewer() layer, the line colors will be the default colors for factors.
ggplot(airquality.complete, aes(x=Month, y=Ozone, fill=Month, color=Month)) + geom_boxplot() + theme_minimal() + xlab("Month") + ylab("Ozone (in ppb)") + scale_fill_brewer(palette="Set3")
We can change these colors by adding a
scale_color_brewer() layer. If we don't want to fill in the color of the boxplots, we can omit the
fill=Month mapping from the base plot and the scale_fill_brewer() layer.
ggplot(airquality.complete, aes(x=Month, y=Ozone, color=Month)) + geom_boxplot() + theme_minimal() + xlab("Month") + ylab("Ozone (in ppb)") + scale_color_brewer(palette="Dark2")
For more variations on boxplots, we can refer to the documentation on geom_boxplot() in ggplot2.
How to plot bar plots
Another useful plot we want to make is the bar plot, also called a bar chart in
ggplot2. To illustrate the usefulness of this plot, we'll use medal counts from the 2022 Winter Olympics! I saved the Wikipedia medal table for the 2022 Winter Olympics in a text file so that we can load the data into R.
How to load your own data into R
We'll use the
read.csv() function in R to load this data into our workspace.
olympics <- read.csv("https://thebitwise.org/olympic_medals_2022/", sep="\t", header=FALSE)
colnames(olympics) <- c("country", "gold", "silver", "bronze", "total")
head(olympics)
#>         country gold silver bronze total
#> 1        Norway   16      8     13    37
#> 2       Germany   12     10      5    27
#> 3         China    9      4      2    15
#> 4 United States    8     10      7    25
#> 5        Sweden    8      5      5    18
#> 6   Netherlands    8      5      4    17
A quick look at our
olympics data with the
head() function shows us that the data contains the number of gold, silver, bronze, and total medal counts for each country. We can plot the total medal count by country by adding a
geom_bar() layer to a base ggplot() call.
How to plot a basic bar plot and rotate axis tick labels
The default behavior in
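geom_bar() is similar to the histogram: it counts the number of rows in each category. To get a bar plot that uses the actual values in a variable, we insert
stat="identity" inside the
geom_bar() layer. Alternatively, we can also use the geom_col() layer, which uses the values as they are.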
In the bar plot below, we've rotated the country names by 45 degrees on the x-axis so that they don't overlap with each other.
ggplot(olympics, aes(x=country, y=total)) + geom_bar(stat="identity") + theme_minimal() + theme(axis.text.x = element_text(angle = 45, vjust=1, hjust=1))
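For comparison, here is a sketch of the same kind of plot made with geom_col(), which plots the values in the data as they are without needing stat="identity". To keep the example self-contained, it uses a small hand-typed data frame, olympics.small, containing only the six rows shown by head() above; that name and the reduced data are assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical subset: only the six rows printed by head(olympics)
olympics.small <- data.frame(
  country=c("Norway", "Germany", "China", "United States", "Sweden", "Netherlands"),
  total=c(37, 27, 15, 25, 18, 17)
)

# geom_col() uses the y values directly as the bar heights
p.col <- ggplot(olympics.small, aes(x=country, y=total)) +
  geom_col() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust=1, hjust=1))
p.col
```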
Notice that we used both the
theme_minimal() layer to adjust the overall plot style and the
theme() layer to rotate the x-axis tick labels. If we had applied the
theme_minimal() layer after the
theme() layer, the settings within
theme_minimal() would have overwritten our text rotations. In this case, order makes a difference!
How to plot a stacked bar plot
This plot is interesting, but maybe we want to depict more than just the total medal count. If we want to show the breakdown of the medal counts by their colors, we need to first reshape our medal color variables into a single variable that we can map to the fill of the bars.
We can do this via the
melt() function in the
reshape2 package. To use this function, we first install the
reshape2 package with the following.
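The install command itself is not shown here, so the line below is a sketch of the usual way to install a package from CRAN. The requireNamespace() check is an addition that skips the download when reshape2 is already installed.

```r
# Install reshape2 from CRAN only if it is not already available
if (!requireNamespace("reshape2", quietly=TRUE)) {
  install.packages("reshape2")
}
```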
We then load it and use the
melt() function to combine the
gold, silver, and bronze variables into a single variable by
country. In the code snippet below, we use all but the
total column in the
olympics dataset. Since this is the 5th column, we can exclude it with
olympics[,-5]. If we included
total, our bar plot would double count each medal since the
gold, silver, and bronze counts add up to the total.
How to reshape data for a stacked bar plot
The melt() function keeps the name of the id variable, country, and automatically names the two new variables variable and
value. For more descriptive names, we'll also rename these variables with the
colnames() function and the
<- assignment in R.
library(reshape2)
olympics.long <- melt(olympics[,-5], id="country")
colnames(olympics.long) <- c("Country", "Medal", "Count")
head(olympics.long)
#>         Country Medal Count
#> 1        Norway  gold    16
#> 2       Germany  gold    12
#> 3         China  gold     9
#> 4 United States  gold     8
#> 5        Sweden  gold     8
#> 6   Netherlands  gold     8
Now we're ready to make our more informative barplot! In the figure below, we now have the total medal count by country with bar colors based on the number of gold, silver, and bronze medals!
ggplot(olympics.long, aes(x=Country, y=Count)) + geom_bar(stat="identity", aes(fill=Medal)) + theme_minimal() + theme(axis.text.x = element_text(angle = 45, vjust=1, hjust=1))
How to manually change the colors in a stacked bar plot
Although this is an improvement, it is a little confusing that gold, silver, and
bronze are depicted by other colors. To remedy this, we also want to adjust the fill colors so that the colors on the plot match the medal color names.
To do that, we'll use something we learned when we went over how to manually change colors in Part 1 of this tutorial. We'll add a
scale_fill_manual() layer with HEX values for the colors we want for gold, silver, and
bronze. We'll store those HEX values in olympic_colors.
olympic_colors <- c("#d4af37", "#c0c0c0", "#897129")
ggplot(olympics.long, aes(x=Country, y=Count)) + geom_bar(stat="identity", aes(fill=Medal)) + theme_minimal() + theme(axis.text.x = element_text(angle = 45, vjust=1, hjust=1), legend.position="top") + scale_fill_manual(values=olympic_colors) + ylab("Number of Olympic Medals in 2022")
In the plot above, we've also relocated the legend to above the plot. For more details on adjustments we can make to bar plots, we can refer to the documentation for geom_bar() in ggplot2.
How to plot line graphs
There's one more very common plot type that we might find useful. This one is a line graph, or line chart, and is very common when depicting data that changes with time. To see this, we'll return to our
airquality.complete data, where we can plot a line graph with
Day on the x-axis and
Ozone on the y-axis. We'll do this by adding a geom_line() layer.
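The code for this first attempt is not shown here, so the following is a sketch of what it could look like using geom_line(). It assumes airquality.complete was built with na.omit(), and p.line is a placeholder name.

```r
library(ggplot2)

# Assumption: complete cases were obtained with na.omit()
airquality.complete <- na.omit(airquality)

# A single line connecting every observation, ordered by Day
p.line <- ggplot(airquality.complete, aes(x=Day, y=Ozone)) + geom_line() + theme_minimal()
p.line
```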
The figure above is confusing and doesn't seem to make much sense. That's because our
airquality.complete data can potentially contain an
Ozone level for each
Day in each of the five months in Month.
How to plot multiple lines in a single plot
Connecting all these points does not make much sense so let's split these out by
Month. We can do this by specifying
group=Month inside the
aes() mapping. To make these lines easier to see, we'll additionally color them by month via
colour=Month inside the aes() mapping.
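A sketch of this grouped version follows. To keep it self-contained, it rebuilds airquality.complete with na.omit() (an assumption) and recodes Month as a factor in the same way as earlier in this post. The name p.lines is a placeholder.

```r
library(ggplot2)

# Assumption: complete cases were obtained with na.omit()
airquality.complete <- na.omit(airquality)

# Recode Month as a factor with month names, as done earlier in this post
airquality.complete$Month <- factor(airquality.complete$Month,
                                    labels=c("May", "Jun", "Jul", "Aug", "Sep"))

# One line per month, with each line colored by its month
p.lines <- ggplot(airquality.complete, aes(x=Day, y=Ozone, group=Month, colour=Month)) +
  geom_line() +
  theme_minimal()
p.lines
```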
This new plot shows the
Ozone levels for each
Day, with a separate line for each
Month. The five lines for the five months overlap quite a bit, however, so it's still a little difficult to distinguish them.
To depict this information even more clearly, we can split out the line graphs by
Month onto separate panels.
How to plot multiple plots in one plot with facets
To split the line graphs by
Month onto separate panels, we can add a
facet_grid() layer. Inside this layer, we specify
.~Month to indicate that we want to split the panels based on
Month into columns, with the panels placed side by side along the x-axis. If we additionally wanted to split the panels by a second factor variable, we could do that by specifying the variable name in place of the . on the left side of the formula.
ggplot(airquality.complete, aes(x=Day, y=Ozone, group=Month, colour=Month)) + geom_line() + facet_grid(.~Month) + theme_bw()
Since the facets are automatically labeled with the month names, our legend for the
Month colors is now redundant. As we did in Part 1, we can remove the legend by adding a
theme(legend.position = "none") layer after the theme_bw() layer.
We can also switch to plotting points instead of lines by swapping to the
geom_point() layer. Finally, we can swap the facet orientation to see what the plot looks like when we split the panels along the other axis.
ggplot(airquality.complete, aes(x=Temp, y=Ozone, group=Month, colour=Month)) + geom_point() + facet_grid(Month~.) + theme_bw() + theme(legend.position = "none")
How to save plots to files
We covered a lot of different plot types in this post! Let's wrap up by looking at how to save these plots to files. There are a number of ways to do this. For now, we'll use the
ggsave() function from the
ggplot2 package. A quick look at the
ggsave() documentation via
?ggsave shows us that at a minimum, we need to specify a file name for the saved plot. If we don't specify a plot object, such as p,
ggsave() will default to saving the last plot we made.
Other parameters to alter when saving plots as files
The ggsave() documentation also shows us that we can specify different file types including eps, ps, pdf, jpeg, tiff, and png via the
device input. Additionally, we can also specify the width and
height of our exported file.
If we want to save to a particular directory, we can specify that with the
path input. If we don't specify path,
ggsave() will default to saving the plot in the current working directory. We can locate our current working directory by typing
getwd() into the R console. Finally, if we plan to use the saved plot for publication, we can also specify the resolution we need with the dpi input.
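Putting these inputs together, here is a sketch of a ggsave() call. The file name, the 6 by 4 inch size, and the temporary output directory are all arbitrary choices made for illustration, and airquality.complete is assumed to have been built with na.omit().

```r
library(ggplot2)

# Assumption: complete cases were obtained with na.omit()
airquality.complete <- na.omit(airquality)

# The scatterplot from the start of this post
p <- ggplot(data=airquality.complete, aes(Temp, Ozone)) + geom_point() + theme_minimal()

# Save into a temporary directory (illustrative; use any directory you like)
out.dir <- tempdir()
ggsave(filename="ozone_vs_temp.png", plot=p, device="png", path=out.dir,
       width=6, height=4, units="in", dpi=300)

# Confirm that the file was written
file.exists(file.path(out.dir, "ozone_vs_temp.png"))
```

If we leave out plot=p, ggsave() saves the last plot that was displayed instead.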
That was a lot of material we just covered! We learned how to overlay smooth lines on scatterplots. We also learned how to plot histograms, boxplots, bar plots, and line graphs. Finally, we went over facets and how we can use them to plot multiple plots in a single plot, and how to save our plots to files.
Now that we're pretty comfortable with the basics of making data visualizations in R with
ggplot2, we'll move on to exploratory data analysis! In our next post, we'll practice and explore the things we learned in this tutorial with a code lab!