In our last two posts, we went over how to start making data visualizations in R with ggplot2 (Part 1 and Part 2). Now that we've finished that series, let's work on a Code Lab featuring exploratory data analysis! Today, we'll be exploring patterns in urban bike share usage with bike sharing data from Capital Bikeshare in Washington D.C.!
If you're new to R, I have a tutorial on getting started with coding in R in a two-part series here and here. This series will get you up to speed on installing and using R and RStudio so you can follow along with this post.
Bike share data
Let's start by taking a quick look at the documentation for our data! The documentation shows us that the data consists of hourly and daily bike rental data from the Capital Bikeshare system from 2011 to 2012. We'll focus on just the hourly data in this Code Lab.
Ideally, we'd prefer to use more recent data, and many bike sharing programs do release raw data to the public. However, these releases typically don't include weather data. Since it would take us some time to scrape our own weather data, we'll use this dataset, which someone has already merged with weather data for us.
The Attribute Information section shows us the names of the variables in this data and what they contain. For each hour in a day between 2011 and 2012, we have day, seasonal, and holiday information. We also have weather data for that hour, information on how many bikes were rented in that hour, and how many of those rentals were from casual or registered bike users.
Once we unzip the file, we'll see a file named hour.csv. Let's copy that file to our working directory so we can work with it more easily. To find our current working directory, we can type the following into the R console.
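The command for this is most likely base R's getwd(). Here's a minimal sketch:

```r
# Print the folder that R currently treats as the working directory.
getwd()
```

Whatever path this prints is where hour.csv should go.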
Don't worry if you're having trouble moving this file to your working directory in R! I'll also have it linked below in case you want to access it that way.
Reading our data into R
Now we're ready to load our data into R! If your hour.csv file is now in your working directory, you can load it in as follows. We'll call this file hourTemp because we'll be making some modifications to it.
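Here's a small sketch of what that load step might look like. So that it runs anywhere, it writes a tiny stand-in file first; the names toy_path and toy_hour and the two toy rows are made up for illustration, and the commented line at the end assumes hour.csv really is in your working directory.

```r
# Write a tiny stand-in for hour.csv to a temporary file (toy rows, not the real data).
toy_path <- tempfile(fileext = ".csv")
writeLines(c("instant,dteday,hr,cnt",
             "1,2011-01-01,0,16",
             "2,2011-01-01,1,40"), toy_path)

# read.csv() turns the file into a data frame, using the first line as column names.
toy_hour <- read.csv(toy_path)
toy_hour

# With the real file in the working directory, the equivalent call would be:
# hourTemp <- read.csv("hour.csv")
```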
Let's take a quick look at our data with the dim() and head() functions. The dim() function tells us the dimension of our data. In this case, it will tell us the number of rows and columns we have in our data. The head() function shows us the first few rows of our data.
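To get a feel for these two functions before running them on the real data, here's a quick sketch on a made-up data frame (toy is a hypothetical stand-in, so its dimensions won't match hourTemp's):

```r
# Toy data frame with 6 rows and 2 columns.
toy <- data.frame(hr = 0:5, cnt = c(16, 40, 32, 13, 1, 1))

dim(toy)       # number of rows, then number of columns
#> [1] 6 2

head(toy, 3)   # first 3 rows; head(toy) alone would show up to the first 6
```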
From the dim() output, we see that there are 17,379 rows, or observations, in this data and 17 columns, or variables. From the head() output, we can see that the variable names match the ones we saw under the Attribute Information section on the documentation page.
Each row in hourTemp gives us weather, day, and holiday information for a particular hour of a day during the 2011 to 2012 period. We also get information on the total number of bike rentals during that hour (in the cnt column), as well as how many of those rentals came from casual or registered bike users.
Preparing the data to look at differences between casual and registered users
Since we have data on both casual and registered bike users, let's look at how patterns in bike share usage vary between these two groups! In order to plot bike usage for these two groups separately, we'll split each row in our data into two rows: one for casual users and one for registered users. We'll additionally make a new variable called user to indicate whether the row contains data on casual or registered users.
Making a new dataset with separate rows for casual and registered users
To do this, we'll make new datasets for casual and registered users. We'll call these hour_casual and hour_registered. Both hour_casual and hour_registered will have the same weather and holiday information contained in columns 3 through 14 of hourTemp so we'll extract those columns first.
We can access any set of rows or columns in our data using brackets [rows, columns] immediately following the dataset name. For example, hourTemp[1,] returns the first row in hourTemp. Similarly, hourTemp[,1] returns the first column in hourTemp. In this case, since we want to take every row of columns 3 through 14, we'll use hourTemp[,3:14].
We can create a new dataset with the <- assignment operator. The first line in the code snippet below tells R that we want to make a new object named hour_casual using hourTemp[,3:14].
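Here's a sketch of bracket indexing and assignment on a small made-up data frame. The names toy and toy_sub are hypothetical; with the real data, the same pattern would use hourTemp[,3:14].

```r
# Toy stand-in for hourTemp (the real data has 17 columns).
toy <- data.frame(id = 1:3,
                  season = c(1, 1, 2),
                  hr = c(0, 1, 2),
                  casual = c(3, 8, 5))

toy[1, ]   # first row, all columns
toy[, 1]   # first column, all rows

# Every row of columns 2 through 3, stored as a new object.
toy_sub <- toy[, 2:3]
names(toy_sub)
#> [1] "season" "hr"
```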
Next, we'll make new count variables in hour_casual and hour_registered using the casual and registered count data from the original hourTemp dataset.
We can make new variables for a dataset by using $ and the new variable name immediately following the dataset name. For example, we can read the first line in the code snippet below as follows: "Make a new variable named count in hour_casual using the data in hourTemp$casual."
We'll also make new user variables to indicate whether the counts for that row come from casual or registered users.
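Putting these steps together on a made-up stand-in for hourTemp might look like the sketch below. The toy values are invented, and the final rbind() step is an assumption: the later sections work with a single dataset named hour, and stacking the two pieces is one simple way to get there.

```r
# Toy stand-in for hourTemp: two shared columns plus the two count columns.
toy <- data.frame(hr = c(0, 1),
                  weathersit = c(1, 2),
                  casual = c(3, 8),
                  registered = c(13, 32))

# Both new datasets start from the same shared columns.
hour_casual <- toy[, 1:2]
hour_registered <- toy[, 1:2]

# New count variables pulled from the original columns.
hour_casual$count <- toy$casual
hour_registered$count <- toy$registered

# New user variables marking which group each row describes.
hour_casual$user <- "casual"
hour_registered$user <- "registered"

# Assumption: stack the two pieces into one long dataset called hour.
hour <- rbind(hour_casual, hour_registered)
hour
```

Each original row now appears twice in hour: once with the casual count and once with the registered count.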
Next, let's relabel some of the weather and season variables so that they're easier to read. The weathersit variable details the weather situation during the hour. From the documentation, it looks like a 1 means that the skies were relatively free of clouds so let's relabel that variable as Clear.
To do that, we'll first use the which() function and the == comparison operator to identify the entries in weathersit that are equal to 1. To read more about the comparison operators, we can type ?"==" into the R console to pull up their documentation.
This operator will return TRUE if the objects on the left and right of == are exactly equal. It will return FALSE if they are not. If the objects on the left and right are columns from datasets, it will apply the comparison element-wise. Below are two small examples to illustrate how this operator works.
```r
1:5 == c(1, 2, 3, 4, 5)   # all equal
#> [1] TRUE TRUE TRUE TRUE TRUE

1:5 == c(1, 2, 3, 3, 5)   # not all equal
#> [1]  TRUE  TRUE  TRUE FALSE  TRUE
```
If we wrap the == comparison in the which() function, we'll get the positions of all the elements that returned TRUE in the comparison. For example, the code snippet below will give us the positions of all the entries in weathersit that have a value equal to 1 (stored in weather_1). Then we can use weather_1[1:20] to look at the first 20 of these positions.
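Here's a small sketch of which() in action, followed by the relabeling step it makes possible. The weathersit vector here is a made-up toy, not the real column.

```r
weathersit <- c(1, 2, 1, 3, 1)   # toy values

# Positions where the comparison is TRUE.
weather_1 <- which(weathersit == 1)
weather_1
#> [1] 1 3 5

# Relabel those positions. Assigning text converts the whole vector to character.
weathersit[weather_1] <- "Clear"
weathersit
#> [1] "Clear" "2"     "Clear" "3"     "Clear"
```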
Now that we know how to relabel entries, let's relabel the 2 responses as Cloudy. Then the 3 and 4 responses indicate varying levels of stormy weather so let's relabel anything that contains either a 3 or a 4 as Stormy.
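One way to sketch these two relabeling steps is shown below on a toy vector. Using %in% to match "either a 3 or a 4" is my choice for illustration; two separate == comparisons would work just as well.

```r
# Toy values as they might look after the 1s have already been relabeled.
weathersit <- c("Clear", "2", "3", "4", "2")

# Relabel the 2 responses.
weathersit[which(weathersit == 2)] <- "Cloudy"

# %in% is TRUE wherever an entry matches any of the values on its right.
weathersit[which(weathersit %in% c(3, 4))] <- "Stormy"

weathersit
#> [1] "Clear"  "Cloudy" "Stormy" "Stormy" "Cloudy"
```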
Finally, R doesn't know the natural order of these labels, so it would sort them alphabetically by default. Let's order the responses within these variables so that when we plot the data, the labels will follow a logical ordering. We can do that using the factor() function with the levels argument.
The factor() function tells R that this is a categorical variable. The levels argument tells R the ordering that we want for the entries within the factor variable.
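To see the difference the levels argument makes, here's a sketch on a made-up vector of season names. The vector and the particular ordering are assumptions chosen for illustration.

```r
season <- c("Winter", "Spring", "Fall", "Summer", "Spring")   # toy values

# Without levels, factor() sorts the categories alphabetically.
levels(factor(season))
#> [1] "Fall"   "Spring" "Summer" "Winter"

# With levels, we choose the order ourselves.
season_ordered <- factor(season, levels = c("Spring", "Summer", "Fall", "Winter"))
levels(season_ordered)
#> [1] "Spring" "Summer" "Fall"   "Winter"
```

Plots then show the categories in the order given by the levels.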
The mnth, hr, holiday, weekday, and workingday variables are also categorical variables but they contain numeric values so R will default to treating them as continuous variables. Rather than specifying them as factors when we plot, let's go ahead and specify that in the dataset so that R knows to treat them as factors.
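Here's one possible sketch of that conversion on a toy data frame. It converts several columns at once with lapply(); converting them one at a time (for example, toy$mnth <- factor(toy$mnth)) gives the same result. The names toy and cat_vars are hypothetical.

```r
toy <- data.frame(mnth = c(1, 2, 12),
                  hr = c(0, 13, 23),
                  cnt = c(16, 94, 49))

# Columns we want R to treat as categorical.
cat_vars <- c("mnth", "hr")
toy[cat_vars] <- lapply(toy[cat_vars], factor)

# mnth and hr are now factors, while cnt is still numeric.
sapply(toy, class)
```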
Now let's look at some overall trends in seasonality, weather situation, and user type based on the time of day. There are many ways we can do that. We'll start by making boxplots of hourly bike rentals by season, weather situation, and user type. We can make separate boxplots for each combination of season and weather situation with a facet_grid() layer.
There's a lot going on in these plots because there are 24 hours along the x-axis of every subplot. Additionally, there are two boxplots (one for each of the casual and registered users) for each of those hours! To reduce the number of hour labels appearing on the x-axis, I've used the scale_x_discrete() layer with the breaks input to label only every third hour.
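A plot along these lines could be sketched as follows. The toy data frame and its counts are invented so that the snippet runs on its own, and the exact aesthetics in the original figure may differ.

```r
library(ggplot2)

# Toy data: 3 made-up observations per hour, user type, season, and weather situation.
toy <- expand.grid(hr = 0:23,
                   user = c("casual", "registered"),
                   season = c("Spring", "Summer"),
                   weathersit = c("Clear", "Stormy"),
                   rep = 1:3)
toy$count <- (toy$hr * 7 + toy$rep * 11) %% 50
toy$hr <- factor(toy$hr)

p <- ggplot(toy, aes(x = hr, y = count, fill = user)) +
  geom_boxplot() +
  # One subplot per combination of weather situation (rows) and season (columns).
  facet_grid(weathersit ~ season) +
  # Label only every third hour on the x-axis.
  scale_x_discrete(breaks = as.character(seq(0, 23, by = 3)))

p
```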
A few trends stand out right away! First, we can see that there are definitely more bike rentals when the weather is clear than when it is stormy.
Second, registered bike users appear to have two periods of peak usage during the day. The first is in the morning between 6 am and 9 am, and the second is in the evening between 4 pm and 7 pm. We can see this more clearly when looking at the same plots for just the registered users.
From these plots, it looks like most registered users are using the bikes for some form of commuting. By contrast, casual users seem to rent bikes during midday and evening hours.
Usage trends by month
Next, let's look at some usage trends by month! Since the mnth variable currently contains numbers 1 through 12, let's relabel them so that they show the actual month names. We'll do that using the factor() function with the labels input.
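As a sketch of how the labels input works, here's a toy example. It uses R's built-in month.abb vector of abbreviated month names, which is my assumption; the original post may have used full names instead.

```r
mnth <- c(1:12, 6, 12)   # toy month numbers

# levels lists the values to expect; labels gives the name shown for each one, in the same order.
mnth_named <- factor(mnth, levels = 1:12, labels = month.abb)

levels(mnth_named)[1:3]
#> [1] "Jan" "Feb" "Mar"

as.character(mnth_named[13])
#> [1] "Jun"
```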
From this plot, we see that there are a lot more hourly bike rentals from registered users than there are from casual ones. We also see that registered bike users seem to rent bikes throughout the year. However, their usage is slightly lower in cloudy and stormy weather and winter months.
Meanwhile, casual users don't rent bikes very frequently during the winter months. We also see that most of their bike rentals occur between March and October.
In the previous figures, we were plotting the number of hourly bike rentals on the y-axis. If we want to see the aggregate number of bike rentals by month, we can also do that using a bar plot. Here, we have to remember to use the stat="identity" input when using the geom_bar() layer. Alternatively, we can also use a geom_col() layer.
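The two options can be sketched side by side on invented data. Both layers stack the rows that share an x-axis position by default, which is what turns hourly rows into a monthly total here.

```r
library(ggplot2)

# Toy data: two made-up hourly rows per month and user type.
toy <- data.frame(
  mnth = factor(rep(c("Jan", "Feb"), each = 4), levels = c("Jan", "Feb")),
  user = rep(c("casual", "registered"), times = 4),
  count = c(3, 13, 8, 32, 5, 27, 10, 40)
)

# Option 1: geom_bar() needs stat = "identity" to use the count values as bar heights.
p_bar <- ggplot(toy, aes(x = mnth, y = count)) +
  geom_bar(stat = "identity") +
  facet_grid(. ~ user)

# Option 2: geom_col() uses the y values as bar heights without extra inputs.
p_col <- ggplot(toy, aes(x = mnth, y = count)) +
  geom_col() +
  facet_grid(. ~ user)
```

Without stat="identity", geom_bar() tries to count rows itself and complains when it is also given a y aesthetic.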
These bar plots depicting the aggregate bike rentals by month on the y-axis make the differences in number of rentals between weather situations much more apparent. Unsurprisingly, there are very few bike rentals during stormy weather. However, there is also a noticeable drop in bike rentals between clear and cloudy weather. Moreover, this trend holds even for registered bike users, who seem to be renting bikes for their commute.
In the figure above, we added a scale_y_continuous() layer with a labels input to avoid the default scientific notation for the tick labels on the y-axis.
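Here's one way that layer might look. The labels input accepts a function that turns the axis break values into text; plain_labels below is a hypothetical helper written for this sketch, not necessarily what the original figure used.

```r
library(ggplot2)

# Hypothetical helper: format numbers with commas and without scientific notation.
plain_labels <- function(x) {
  format(x, big.mark = ",", scientific = FALSE, trim = TRUE)
}

# Toy monthly totals, large enough that scientific notation could appear.
toy <- data.frame(mnth = factor(c("Jan", "Feb"), levels = c("Jan", "Feb")),
                  count = c(150000, 1200000))

p <- ggplot(toy, aes(x = mnth, y = count)) +
  geom_col() +
  scale_y_continuous(labels = plain_labels)

plain_labels(1200000)
#> [1] "1,200,000"
```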
Usage trends by month and working day
One variable we haven't looked at yet is the workingday variable. This variable is coded as 0 if the day was a weekend or holiday, and 1 otherwise. Let's relabel this variable to make it easier to read. Again, we'll use the factor() function with the labels input.
```r
# Relabel 0 as "Weekend or Holiday" and 1 as "Work Day".
hour$workingday <- factor(hour$workingday,
                          labels = c("Weekend or Holiday", "Work Day"))
```
Let's see what the data looks like when we plot trends for each month by working day and weather situation for the different user types!
This plot makes it apparent that most of the bike rentals from casual users happen during weekends or holidays. It also shows us that registered users rent bikes throughout the year but they rent a little more frequently on working days. Again, we see that there are far fewer bike rentals during stormy weather and winter months.
Usage trends by day of week
We just looked at some trends by working day status but we can also look at those trends by day of week. Since the weekday variable currently contains numbers 0 through 6, let's relabel them so that they show the actual days of the week. Again, we'll do that using the factor() function with the labels input.
The trends in these plots mirror the patterns we saw in the plots of month by working day status. We again see that registered users seem to rent bikes throughout the year. Also, with the exception of the winter months and stormy weather, they seem to rent slightly more frequently between Monday and Friday.
Similarly, we also see that casual users tend to rent bikes more frequently on the weekends, particularly in the spring and summer months.
Usage trends by season
We've used facets throughout this Code Lab to look at different slices of the data. Sometimes, it's also useful to subset the data. This means that we look at just the observations in the data that satisfy particular criteria. For example, we can look at slices of the data for just certain seasons.
Let's try this out and subset hour for the different seasons.
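Here's a sketch of two common ways to subset, shown on a toy version of hour. The names hour_spring and hour_winter are hypothetical, as are the toy values.

```r
# Toy stand-in for hour.
hour <- data.frame(season = c("Spring", "Winter", "Spring", "Fall"),
                   count = c(13, 5, 40, 22))

# Option 1: which() picks the matching row positions, and brackets keep those rows.
hour_spring <- hour[which(hour$season == "Spring"), ]

# Option 2: subset() does the same thing with a condition on the column name.
hour_winter <- subset(hour, season == "Winter")

hour_spring$count
#> [1] 13 40
```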
We haven't looked at the atemp variable yet. This variable reports the normalized value of what the temperature feels like in Celsius. Since the variable has been normalized, its values run from 0 to 1. We can see this by looking at the variable summary.
```r
summary(hour$atemp)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.0000  0.3333  0.4848  0.4758  0.6212  1.0000 
```
This means that we'll interpret atemp values closer to 0 as cold, and atemp values closer to 1 as hot.
Let's see what we find when we plot a scatterplot of hourly bike rentals by time of day, working day status, and user type for just the spring months. We'll color the points by the normalized feeling temperature so that darker points indicate colder feeling temperatures.
This plot shows us that, with some exceptions, registered users rent bikes like casual users during weekends and holidays! Not only does the time of day pattern look very similar, but the hourly counts also look very similar during those days.
In the figure above, we added a geom_jitter() layer to allow the points to move a little along the x-axis. Otherwise, the points would stack up in a straight line above each hour. However, we set the height input to 0 in the geom_jitter() layer so that we don't move the points in the vertical direction and alter the hourly bike counts.
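A jittered scatterplot of this kind could be sketched as follows. The toy data and the width value of 0.3 are invented for illustration; the key piece is height = 0, which leaves the counts untouched.

```r
library(ggplot2)

# Toy data: one made-up count and feeling temperature per hour, user type, and day type.
toy <- expand.grid(hr = 0:23,
                   user = c("casual", "registered"),
                   workingday = c("Weekend or Holiday", "Work Day"))
toy$count <- (toy$hr * 13) %% 40 + 5
toy$atemp <- (toy$hr %% 12) / 12

set.seed(1)   # makes the random horizontal jitter reproducible
p <- ggplot(toy, aes(x = hr, y = count, color = atemp)) +
  # Spread points sideways a little, but never move them vertically.
  geom_jitter(width = 0.3, height = 0) +
  facet_grid(workingday ~ user)

p
```

With the default continuous color scale, lower atemp values are drawn in darker blue.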
Let's see what we find when we plot the same thing for just winter months!
This plot shows us that during the winter months, registered users exhibit similar time of day usage as casual users during weekends and holidays since we see the same single peak pattern. However, registered bike users exhibit much higher hourly counts in colder temperatures (higher counts with dark blue dots). We also see that compared with registered users, casual users rent bikes very infrequently during working days in colder temperatures.
In this post, we saw how we can use data visualizations to explore patterns in our data. We also got a lot of practice with using ggplot2, adding layers to modify plot attributes, and using facets to look at different slices of our data. We also got some more experience using external data sources, and we practiced cleaning real data to prepare them for downstream analyses (such as data visualizations). Finally, we got a lot of practice applying conditional operations with the which() function and subsetting data. Great job!