Introduction

This is an exploration of bicycle-sharing data in the city of Seattle, WA (USA) from October 2014-August 2016. I hope to eventually combine this data with other forms of ride-sharing and transportation in the city, but this will be the first step.

Time to get started!

Loading Necessary Packages

# For data manipulation and tidying
library(dplyr)
library(lubridate)
library(tidyr)
library(tidyverse)

# For mapping
library(ggmap)
library(mapproj)

# For data visualizations
library(ggplot2)

# For modeling and machine learning
library(caret)

Importing Data

All of the data can be downloaded from the bicycle-sharing service “Pronto!”’s website or from Kaggle. This project contains 3 data sets and I’ll import and inspect each data file independently.

station <- read.csv(file = "CopyOfstation.csv", header = TRUE, 
    stringsAsFactors = FALSE)

trip <- read.csv(file = "CopyOftrip.csv", header = TRUE, 
    stringsAsFactors = FALSE)

weather <- read.csv(file = "CopyOfweather.csv", header = TRUE, 
    stringsAsFactors = FALSE)

Okay, let’s take a look at each of these data files.

Data Structures and Variables

Station

Observations: 58
Variables: 9
$ station_id        <chr> "BT-01", "BT-03", "BT-04", "BT-05", "CBD-03"...
$ name              <chr> "3rd Ave & Broad St", "2nd Ave & Vine St", "...
$ lat               <dbl> 47.61842, 47.61583, 47.61609, 47.61311, 47.6...
$ long              <dbl> -122.3510, -122.3486, -122.3411, -122.3442, ...
$ install_date      <chr> "10/13/2014", "10/13/2014", "10/13/2014", "1...
$ install_dockcount <int> 18, 16, 16, 14, 20, 18, 20, 20, 20, 18, 16, ...
$ modification_date <chr> "", "", "", "", "", "", "", "11/9/2015", "",...
$ current_dockcount <int> 18, 16, 16, 14, 20, 18, 20, 18, 20, 18, 0, 1...
$ decommission_date <chr> "", "", "", "", "", "", "", "", "", "", "8/9...

Looks like this dataset is dealing with 9 variables:

  • Station ID : The individual ID number for a bike station
  • Name : The name of that station ID, also appears to be the rough location of the station
  • Latitude : The latitude of the station
  • Longitude : The longitude of the station
  • Install Date : When that particular station was installed (in MM/DD/YYYY format)
  • Install Dock Count : Number of docks (bike positions) available at each station on installation day
  • Modification Date : When a particular station was modified (in MM/DD/YYYY format)
  • Current Dock Count : Number of docks (bike positions) available at each station on August 31, 2016
  • Decommission Date : The date that a particular station was put out of service (in MM/DD/YYYY format)

Trip

Observations: 50,793
Variables: 12
$ trip_id           <int> 431, 432, 433, 434, 435, 436, 437, 438, 439,...
$ starttime         <chr> "10/13/2014 10:31", "10/13/2014 10:32", "10/...
$ stoptime          <chr> "10/13/2014 10:48", "10/13/2014 10:48", "10/...
$ bikeid            <chr> "SEA00298", "SEA00195", "SEA00486", "SEA0033...
$ tripduration      <dbl> 985.935, 926.375, 883.831, 865.937, 923.923,...
$ from_station_name <chr> "2nd Ave & Spring St", "2nd Ave & Spring St"...
$ to_station_name   <chr> "Occidental Park / Occidental Ave S & S Wash...
$ from_station_id   <chr> "CBD-06", "CBD-06", "CBD-06", "CBD-06", "CBD...
$ to_station_id     <chr> "PS-04", "PS-04", "PS-04", "PS-04", "PS-04",...
$ usertype          <chr> "Member", "Member", "Member", "Member", "Mem...
$ gender            <chr> "Male", "Male", "Female", "Female", "Male", ...
$ birthyear         <int> 1960, 1970, 1988, 1977, 1971, 1974, 1978, 19...

This dataset appears to contain 12 variables:

  • Trip ID : An identification number assigned to each trip (from one bike station to another)
  • Start Time : The time and date that a bike was borrowed from a station (in MM/DD/YYYY HH:MM format)
  • Stop Time : The time and date that a bike was returned to a station (in MM/DD/YYYY HH:MM format)
  • Bike ID : The identification number for a specific bike
  • Trip Duration : Time of trip (measured in seconds)
  • From Station Name : The name of the station where the bike was borrowed from
  • To Station Name : The name of the station where the bike was returned to
  • From Station ID : The ID number of the station where the bike was borrowed from
  • To Station ID : The ID number of the station where the bike was returned to
  • User Type : Indicates whether the user was a “Member” (i.e., someone with a monthly or annual membership to Pronto!) or a “Short-Term Pass Holder” (i.e., someone who purchased a 24 hour or 3 day pass)
  • Gender : The gender of the rider (if known)
  • Birth Year : The year that the rider was born

Weather

Observations: 689
Variables: 21
$ Date                       <chr> "10/13/2014", "10/14/2014", "10/15/...
$ Max_Temperature_F          <int> 71, 63, 62, 71, 64, 68, 73, 66, 64,...
$ Mean_Temperature_F         <int> 62, 59, 58, 61, 60, 64, 64, 60, 58,...
$ Min_TemperatureF           <int> 54, 55, 54, 52, 57, 59, 55, 55, 55,...
$ Max_Dew_Point_F            <int> 55, 52, 53, 49, 55, 59, 57, 57, 52,...
$ MeanDew_Point_F            <int> 51, 51, 50, 46, 51, 57, 55, 54, 49,...
$ Min_Dewpoint_F             <int> 46, 50, 46, 42, 41, 55, 53, 50, 46,...
$ Max_Humidity               <int> 87, 88, 87, 83, 87, 90, 94, 90, 87,...
$ Mean_Humidity              <int> 68, 78, 77, 61, 72, 83, 74, 78, 70,...
$ Min_Humidity               <int> 46, 63, 67, 36, 46, 68, 52, 67, 58,...
$ Max_Sea_Level_Pressure_In  <dbl> 30.03, 29.84, 29.98, 30.03, 29.83, ...
$ Mean_Sea_Level_Pressure_In <dbl> 29.79, 29.75, 29.71, 29.95, 29.78, ...
$ Min_Sea_Level_Pressure_In  <dbl> 29.65, 29.54, 29.51, 29.81, 29.73, ...
$ Max_Visibility_Miles       <int> 10, 10, 10, 10, 10, 10, 10, 10, 10,...
$ Mean_Visibility_Miles      <int> 10, 9, 9, 10, 10, 8, 10, 10, 10, 6,...
$ Min_Visibility_Miles       <int> 4, 3, 3, 10, 6, 2, 6, 5, 6, 2, 10, ...
$ Max_Wind_Speed_MPH         <int> 13, 10, 18, 9, 8, 10, 10, 12, 15, 1...
$ Mean_Wind_Speed_MPH        <int> 4, 5, 7, 4, 3, 4, 3, 5, 8, 8, 9, 4,...
$ Max_Gust_Speed_MPH         <chr> "21", "17", "25", "-", "-", "-", "1...
$ Precipitation_In           <dbl> 0.00, 0.11, 0.45, 0.00, 0.14, 0.31,...
$ Events                     <chr> "Rain", "Rain", "Rain", "Rain", "Ra...

This dataset represents quite a bit of weather data in 21 variables.

  • Date : The date in MM/DD/YYYY format
  • Max Temperature F : The maximum temperature that day (in degrees F)
  • Mean Temperature F : The average temperature that day (in degrees F)
  • Min Temperature F : The minimum temperature that day (in degrees F)
  • Max Dew Point F : The maximum dew point (in degrees F)
  • Mean Dew Point F : The average dew point (in degrees F)
  • Min Dew Point F : The minimum dew point (in degrees F)
  • Max Humidity : The maximum humidity (in %)
  • Mean Humidity : The average humidity (in %)
  • Min Humidity : The minimum humidity (in %)
  • Maximum Sea Level Pressure : The maximum atmospheric pressure at sea level (in inches of mercury)
  • Mean Sea Level Pressure : The average atmospheric pressure at sea level (in inches of mercury)
  • Min Sea Level Pressure : The minimum atmospheric pressure at sea level (in inches of mercury)
  • Max Visibility Miles : The maximum visibility (in miles)
  • Mean Visibility Miles : The average visibility (in miles)
  • Min Visibility Miles : The minimum visibility (in miles)
  • Max Wind Speed MPH : The maximum sustained wind speed (in miles per hour)
  • Mean Wind Speed MPH : The average sustained wind speed (in miles per hour)
  • Max Gust Speed MPH : The maximum gust wind speed (in miles per hour)
  • Precipitation : The amount of precipitation (measured in inches)
  • Events : Weather events that occurred that day (e.g., rain, fog, snow, thunderstorm etc.)

Data Visualizations

Exploring the Stations Dataset

Since the “Stations” dataset was the first one I imported, let’s start with a little exploration there. First of all, how many unique stations are we dealing with?

station %>% summarise(n_distinct(station_id))
  n_distinct(station_id)
1                     58

Wow! 58 different stations! Let’s take a quick peek at Figure 1 to see where they are located.

station_locs <- station %>% group_by(station_id) %>% select(1:4, 
    -2)
# Load the correct map
mymap <- get_map(location = "Seattle", maptype = "roadmap", zoom = 12)

# Plot a single point for each Station ID
ggmap(mymap) + geom_point(aes(x = long, y = lat), data = station_locs, 
    alpha = 0.7, color = "darkred", size = 2)
Station Locations

Figure 1: Station Locations

So it looks like all of the stations are located near the Lower Queen Anne, Belltown, International District, Capitol Hill and University of Washington areas. Let’s take a more zoomed-in look with Figure 2

Zoomed-In Station Locations

Figure 2: Zoomed-In Station Locations

Great! So the locations are pretty well clustered. I wonder what order they were added in.

Station Installations

First, let’s convert those character-string date objects to actual dates using the lubridate package.

station$install_date <- mdy(station$install_date)

# How many times were new stations installed?
station %>% summarise(n_distinct(install_date))
  n_distinct(install_date)
1                        9
# How many stations were installed on each date?
station %>% group_by(install_date) %>% summarise(count = n()) %>% 
    arrange(install_date)
# A tibble: 9 x 2
  install_date count
        <date> <int>
1   2014-10-13    50
2   2015-05-22     1
3   2015-06-12     1
4   2015-07-27     1
5   2015-09-15     1
6   2015-10-29     1
7   2016-03-18     1
8   2016-07-03     1
9   2016-08-09     1

It looks like the vast majority (86%) of the stations were added on opening day. Let’s see where those original ones were and where the rest were added.

So they added more stations throughout the district that they serve, instead of adding several new stations to a single neighborhood all at once. Good to know.

Now, I wonder how many bikes can be parked at each station (as of August 31,2016)? See Figure 3 to find out.

How Many Bikes Can Be Parked At Each Station?

Figure 3: How Many Bikes Can Be Parked At Each Station?

Well that’s weird, some of the stations have a dock count of 0. I’m assuming they didn’t start that way. Let’s calculate the change in dock count from station installation to August 31, 2016 and plot it on a map.

Change in Number of Bike Docks Per Station

Any stations with no change in number of docks are not shown below in Figure 4.

Change in Number of Bike Docks Per Station

Figure 4: Change in Number of Bike Docks Per Station

Wow! Looks like quite a few stations took away bike docks and none gained any. Perhaps those stations weren’t being used very frequently. We’ll have to look at that a bit later

Current Station Size

I’m going to take one quick look at the current size of each station before moving on to the next dataset. Note: I did not include any stations that were closed as of August 31, 2016 in this map. See Figure 5 below to look at the current size of each station.

ggmap(mymap) + geom_point(aes(x = long, y = lat, size = install_dockcount), data = station, 
    alpha = 0.7, color = "darkred")
Current Station Size

Figure 5: Current Station Size

So it looks like the biggest stations tend to be on the outskirts of the rest. Where there are several stations in close proximity, there tend to be fewer bike docks at each station. That makes sense, logically speaking. If you go to a station and there is no bike to rent, you can easily go to another nearby, assuming there is another nearby. In areas where the stations are more secluded, it’s more important that there be bikes and open spaces readily available for users.

Alright, I’m feeling good about exploring this dataset. Time to check out the trip dataset!

Exploring the Trips Dataset

It’s been a while since we’ve looked at the trip dataset, so let’s take another peek at it here.

glimpse(trip)
Observations: 50,793
Variables: 12
$ trip_id           <int> 431, 432, 433, 434, 435, 436, 437, 438, 439,...
$ starttime         <chr> "10/13/2014 10:31", "10/13/2014 10:32", "10/...
$ stoptime          <chr> "10/13/2014 10:48", "10/13/2014 10:48", "10/...
$ bikeid            <chr> "SEA00298", "SEA00195", "SEA00486", "SEA0033...
$ tripduration      <dbl> 985.935, 926.375, 883.831, 865.937, 923.923,...
$ from_station_name <chr> "2nd Ave & Spring St", "2nd Ave & Spring St"...
$ to_station_name   <chr> "Occidental Park / Occidental Ave S & S Wash...
$ from_station_id   <chr> "CBD-06", "CBD-06", "CBD-06", "CBD-06", "CBD...
$ to_station_id     <chr> "PS-04", "PS-04", "PS-04", "PS-04", "PS-04",...
$ usertype          <chr> "Member", "Member", "Member", "Member", "Mem...
$ gender            <chr> "Male", "Male", "Female", "Female", "Male", ...
$ birthyear         <int> 1960, 1970, 1988, 1977, 1971, 1974, 1978, 19...

Great, so there are quite a few things that we can potentially look at using this dataset by itself. Let’s start with the number of trips per day since Pronto! began opening bike stations. To do that, we need to recode our start date/times as POSIXct objects. We’ll use the lubridate package for this.

# Make the start and stop dates into POSIXct objects
trip_2 <- trip %>% mutate(start_dt = mdy_hm(starttime), stop_dt = mdy_hm(stoptime))

# Recode the dates
trip_2 <- trip_2 %>% mutate(start_date = paste(month(start_dt), 
    day(start_dt), year(start_dt), sep = "/"))
trip_2$start_date <- mdy(trip_2$start_date)

trip_2 <- trip_2 %>% mutate(stop_date = paste(month(stop_dt), 
    day(stop_dt), year(stop_dt), sep = "/"))
trip_2$stop_date <- mdy(trip_2$stop_date)

Great! Figure 6 is a nice visualization of the number of rides per day.

Number of Rides Per Day

Figure 6: Number of Rides Per Day

Hmm, grouping by day is a little noisy. Perhaps we should try by month?

Plotting Trips Per Month (By Season)

First, we need to create a “Year-Month” variable

start_date_ym <- trip_2 %>% mutate(ym = paste(year(start_date), 
    month(start_date), sep = "/"), Season = ifelse(ym %in% c("2014/10", "2014/11"), "Fall", ifelse(ym %in% c("2014/12", "2015/1", "2015/2"), "Winter", ifelse(ym %in% c("2015/3", "2015/4", "2015/5"), "Spring", "Summer"))))

Now plot. I think I’ll plot this by month but color it by season (where December, January, and February are “winter”, March, April, and May are “spring”, June, July, August are “summer”, and September, October, November are “autumn”). See Figure 7 below.

Trips Per Month (By Season)

Figure 7: Trips Per Month (By Season)

Well that intuitively makes sense. The number of trips taken per month increases in the spring, reaches a maximum in the summer, declines through the fall, remains fairly stable in the winter and then repeats.

Average Trip Duration

Great! I wonder how the average trip duration fluctuates over this time period. See Figure 8 below.

# Convert Trip Duration from Seconds to Minutes
Trip_Duration_Month <- start_date_ym %>% mutate(trip_duration_min = tripduration/60) %>% 
    group_by(ym) %>% select(ym, trip_duration_min, Season) %>% summarise(Avg = mean(trip_duration_min), 
    sd = sd(trip_duration_min)) %>% mutate(se = sd/sqrt(n()),Season = ifelse(ym %in% c("2014/10", "2014/11"), "Fall", ifelse(ym %in% c("2014/12", "2015/1", "2015/2"), "Winter", ifelse(ym %in% c("2015/3", "2015/4", "2015/5"), "Spring", "Summer"))))
ggplot(Trip_Duration_Month,aes(x = ym, y = Avg, color = Season)) + geom_point() + geom_line(aes(group = 1)) + labs(x = "Date", y = "Average Trip Duration (In Minutes)") + theme_bw() + geom_errorbar(aes(ymin = Avg - se, ymax = Avg + se))
Average Trip Duration (By Season)

Figure 8: Average Trip Duration (By Season)

There’s surprisingly not a huge range in trip durations here.

The little bit of variation here makes logical sense. Longer trips were being taken in the spring and summer months rather than the fall and winter. It’s also notable that the spring and summer of 2016 may have shown fewer trips than the previous year, show a slight increase in average trip length.

Number of Trips by Day of Week

I wonder if people are using this service to commute to/from work. Let’s look at the number of trips by day of the week.

First, let’s create a Day of the Week variable.

start_date_ym$wd <- wday(start_date_ym$start_date, label = TRUE)

Now to plot the total number of trips by day of the week.

Ok, so there are definitely more trips during the week than on the weekends. I wonder if this varies by season too.

start_date_ym %>%
  group_by(wd, Season) %>%
  summarize(N = n()) %>%
ggplot(aes(x = wd, y = N, color = Season, group = Season)) + geom_point() + geom_line() + labs(x = "Day of the Week", y = "Number of trips") + theme_bw()
Number of trips by day of week

Figure 9: Number of trips by day of week

So it looks like on Figure 9 usage is relatively consistent across seasons, at least as far as the number of trips are concerned.

Number of Trips by Time of Day

How about time of day? Are people using these around commuting times during the week and later on weekends?

start_date_ym %>%
  group_by(Hour=hour(start_dt), Season, wd) %>%
  summarize(N=n()) %>%
  ggplot(aes(x= Hour,y = N, color = Season, group = Season)) +
  geom_point() +
  geom_line() +
  labs(x="Hour of Day",y="Number of Trips") +
  facet_wrap(~wd)
Number of trips by time of day

Figure 10: Number of trips by time of day

Wow, looks like on Figure 10, regardless of the season, people are commuting to/from work using this service (there’s a spike between 8 and 10 AM and another between 4 and 7 PM Monday through Friday). But the weekends seem to be popular between 10 AM and 10 PM.

Number of Trips by Member Type

I wonder if different types of members (those who have a membership vs. those that bought a 24 hour or 3 day pass) vary in the number of trips they take.

If I were to guess, I’d think the short-term passes would be ideal for tourists or people looking for a quick weekend trip, whereas members may be more likely to continue using the service year-round. Let’s check out my assumptions by plotting, once again colored by season on Figure 11

start_date_ym %>%
  filter(usertype != "") %>%
  group_by(ym, Season, usertype) %>%
  summarize(N = n()) %>%
  ggplot(aes(x = ym, y = N, colour = Season, group = usertype)) +
  geom_point(aes(shape = usertype), size = 3) +
  geom_line() +
  theme_bw() +
  labs(y = "Total number of trips per month", x = "")
Number of trips by member type

Figure 11: Number of trips by member type

Surprisingly (to me, at least), different types of users seem to follow similar patterns of usage. Spring and Summer are definitely the most popular times for anyone to ride a bike in the Seattle area.

Trip Duration by Member Type

While it may seem that the trip duration shouldn’t vary widely by member type, a quick look at Pronto!’s pricing structure may make you reconsider that assumption. You see, while you have to purchase either an annual membership ($85/year), a 24-Hour Pass ($8) or a 3-Day Pass ($16) there is still a cap on the duration of your trip. For members, any ride under 45 minutes is free, but any ride going over 45 minutes will incur a fee of $2 for every additional 30 minutes. For short-term users, any ride under 30 minutes is free, but going over that time limit would cost you an additional $2 for the first 30 minutes and $5 for each additional 30 minutes after that!

Let’s see in Figure 12 if these time limits cause differing behaviors in our users.

Trip Duration by member type

Figure 12: Trip Duration by member type

Ok, so our members are pretty good about making sure that they return their bike before they incur extra charges, but the short-term pass holders frequently go over their time limit. I wonder how the cost of a trip varies for members and pass holders. Let’s try to calculate the cost of a trip.

start_date_ym <- start_date_ym %>% mutate(tripduration_min = tripduration/60, cost = ifelse(usertype == "Member" & 
    tripduration_min <= 45, 0, ifelse(usertype == "Member" & tripduration_min > 
    45 & tripduration_min <= 75, 2, ifelse(usertype == "Member" & 
    tripduration_min > 75, (2 + 2 * ceiling((tripduration_min - 75)/30)), 
    ifelse(usertype == "Short-Term Pass Holder" & tripduration_min <= 
        30, 0, ifelse(usertype == "Short-Term Pass Holder" & 
        tripduration_min > 30 & tripduration_min < 60, 2, ifelse(usertype == 
        "Short-Term Pass Holder" & tripduration_min > 60, (2 + 
        5 * ceiling((tripduration_min - 60)/30)), NA)))))))

That was a complicated nested if/else statement! Let’s see how much in Figure 13 these folks are paying in additional fees!

Additional Fees

Figure 13: Additional Fees

Looks like short-term pass holders (who are already paying a higher price per day of biking), are also paying lots of extra fees. This could be because they are unfamiliar with the pricing structure and don’t realize they need to return their bike to a station within 30 minutes without getting charged. It is also possible that short-term users may be tourists who don’t know their way around as easily, and thus can’t find their way to a station within the time limit.

Member Demographics

We only seem to have age and gender information about people who have an annual Pronto! membership, so we can at least take a look at what types of people use this service.

Let’s look first at Figure 14 for the members ages.

Member Ages

Figure 14: Member Ages

My first instinct here is to say “Wow! There’s a lot of mid-20 to 30-somethings that use this service!” But this figure (and these data) may be a little misleading. You see, we don’t have any sort of Rider ID number, meaning we can’t take “individual activity level” into account. So we can’t tell if the tallest spike is because 6 very athletic 27-year-olds went on 1,000 trips each, or if 100 people went on 60 trips each, or if there were 6,000 27-year-olds who each only used the service once.

The same problem would arise if we looked at gender, so I’m just going to move beyond demographics.

Exploring the Weather Dataset

Now that I’ve visualized all that I can think of in terms of the “trips” dataset, it’s time to take a brief look at the weather dataset.

Let’s get a quick reminder of what we’re looking at here.

glimpse(weather)
Observations: 689
Variables: 21
$ Date                       <chr> "10/13/2014", "10/14/2014", "10/15/...
$ Max_Temperature_F          <int> 71, 63, 62, 71, 64, 68, 73, 66, 64,...
$ Mean_Temperature_F         <int> 62, 59, 58, 61, 60, 64, 64, 60, 58,...
$ Min_TemperatureF           <int> 54, 55, 54, 52, 57, 59, 55, 55, 55,...
$ Max_Dew_Point_F            <int> 55, 52, 53, 49, 55, 59, 57, 57, 52,...
$ MeanDew_Point_F            <int> 51, 51, 50, 46, 51, 57, 55, 54, 49,...
$ Min_Dewpoint_F             <int> 46, 50, 46, 42, 41, 55, 53, 50, 46,...
$ Max_Humidity               <int> 87, 88, 87, 83, 87, 90, 94, 90, 87,...
$ Mean_Humidity              <int> 68, 78, 77, 61, 72, 83, 74, 78, 70,...
$ Min_Humidity               <int> 46, 63, 67, 36, 46, 68, 52, 67, 58,...
$ Max_Sea_Level_Pressure_In  <dbl> 30.03, 29.84, 29.98, 30.03, 29.83, ...
$ Mean_Sea_Level_Pressure_In <dbl> 29.79, 29.75, 29.71, 29.95, 29.78, ...
$ Min_Sea_Level_Pressure_In  <dbl> 29.65, 29.54, 29.51, 29.81, 29.73, ...
$ Max_Visibility_Miles       <int> 10, 10, 10, 10, 10, 10, 10, 10, 10,...
$ Mean_Visibility_Miles      <int> 10, 9, 9, 10, 10, 8, 10, 10, 10, 6,...
$ Min_Visibility_Miles       <int> 4, 3, 3, 10, 6, 2, 6, 5, 6, 2, 10, ...
$ Max_Wind_Speed_MPH         <int> 13, 10, 18, 9, 8, 10, 10, 12, 15, 1...
$ Mean_Wind_Speed_MPH        <int> 4, 5, 7, 4, 3, 4, 3, 5, 8, 8, 9, 4,...
$ Max_Gust_Speed_MPH         <chr> "21", "17", "25", "-", "-", "-", "1...
$ Precipitation_In           <dbl> 0.00, 0.11, 0.45, 0.00, 0.14, 0.31,...
$ Events                     <chr> "Rain", "Rain", "Rain", "Rain", "Ra...

Great, let’s change the Date variable to a POSIXct object, and make the “Events” variable factors.

# Adjusting the Date Variable
weather$Date <- mdy(weather$Date)

# Adjusting the Events Variable
weather$Events <- as.factor(weather$Events)

Great. Now how many weather events are there?

levels(weather$Events)
 [1] ""                    "Fog"                 "Fog , Rain"         
 [4] "Fog-Rain"            "Rain"                "Rain , Snow"        
 [7] "Rain , Thunderstorm" "Rain-Snow"           "Rain-Thunderstorm"  
[10] "Snow"               

Wow! So mostly combinations of rain…

Let’s combine a few of these things that seem to represent the same event.

weather$Events <- gsub("Fog , Rain|Fog-Rain", "Fog-Rain", weather$Events)
weather$Events <- gsub("Rain , Snow|Rain-Snow", "Rain-Snow", 
    weather$Events)
weather$Events <- gsub("Rain , Thunderstorm|Rain-Thunderstorm", 
    "Rain-TS", weather$Events)

weather$Events <- as.factor(weather$Events)

Where else does this dataset need to be cleaned up? Let’s look for any missing values.

summary(weather)
      Date            Max_Temperature_F Mean_Temperature_F
 Min.   :2014-10-13   Min.   :39.00     Min.   :33.00     
 1st Qu.:2015-04-03   1st Qu.:55.00     1st Qu.:48.00     
 Median :2015-09-22   Median :63.00     Median :56.00     
 Mean   :2015-09-22   Mean   :64.03     Mean   :56.58     
 3rd Qu.:2016-03-12   3rd Qu.:73.00     3rd Qu.:65.00     
 Max.   :2016-08-31   Max.   :98.00     Max.   :83.00     
                                        NA's   :1         
 Min_TemperatureF Max_Dew_Point_F MeanDew_Point_F Min_Dewpoint_F 
 Min.   :23.00    Min.   :10.00   Min.   : 4.00   Min.   : 1.00  
 1st Qu.:43.00    1st Qu.:44.00   1st Qu.:41.00   1st Qu.:36.00  
 Median :50.00    Median :50.00   Median :46.00   Median :42.00  
 Mean   :49.45    Mean   :48.57   Mean   :45.02   Mean   :40.87  
 3rd Qu.:57.00    3rd Qu.:54.00   3rd Qu.:51.00   3rd Qu.:47.00  
 Max.   :70.00    Max.   :77.00   Max.   :59.00   Max.   :57.00  
                                                                 
  Max_Humidity    Mean_Humidity    Min_Humidity  
 Min.   : 40.00   Min.   :24.00   Min.   :15.00  
 1st Qu.: 78.00   1st Qu.:60.00   1st Qu.:38.00  
 Median : 86.00   Median :70.00   Median :50.00  
 Mean   : 84.54   Mean   :68.51   Mean   :49.97  
 3rd Qu.: 90.00   3rd Qu.:79.00   3rd Qu.:63.00  
 Max.   :100.00   Max.   :95.00   Max.   :87.00  
                                                 
 Max_Sea_Level_Pressure_In Mean_Sea_Level_Pressure_In
 Min.   :29.47             Min.   :29.31             
 1st Qu.:30.01             1st Qu.:29.93             
 Median :30.12             Median :30.04             
 Mean   :30.12             Mean   :30.03             
 3rd Qu.:30.24             3rd Qu.:30.16             
 Max.   :30.86             Max.   :30.81             
                                                     
 Min_Sea_Level_Pressure_In Max_Visibility_Miles Mean_Visibility_Miles
 Min.   :29.14             Min.   : 3.00        Min.   : 1.00        
 1st Qu.:29.84             1st Qu.:10.00        1st Qu.: 9.00        
 Median :29.96             Median :10.00        Median :10.00        
 Mean   :29.94             Mean   : 9.99        Mean   : 9.43        
 3rd Qu.:30.08             3rd Qu.:10.00        3rd Qu.:10.00        
 Max.   :30.75             Max.   :10.00        Max.   :10.00        
                                                                     
 Min_Visibility_Miles Max_Wind_Speed_MPH Mean_Wind_Speed_MPH
 Min.   : 0.000       Min.   : 4.00      Min.   : 0.000     
 1st Qu.: 4.000       1st Qu.: 8.00      1st Qu.: 3.000     
 Median : 9.000       Median :10.00      Median : 4.000     
 Mean   : 7.245       Mean   :11.09      Mean   : 4.631     
 3rd Qu.:10.000       3rd Qu.:13.00      3rd Qu.: 6.000     
 Max.   :10.000       Max.   :30.00      Max.   :23.000     
                                                            
 Max_Gust_Speed_MPH Precipitation_In       Events   
 Length:689         Min.   :0.0000            :361  
 Class :character   1st Qu.:0.0000   Fog      : 16  
 Mode  :character   Median :0.0000   Fog-Rain : 13  
                    Mean   :0.1051   Rain     :287  
                    3rd Qu.:0.0900   Rain-Snow:  3  
                    Max.   :2.2000   Rain-TS  :  7  
                                     Snow     :  2  

Ok, so we have one NA for “Mean_Temperature_F”, “Max_Gust_Speed_MPH” seems to be represented as a character vector because it has “-” representing NA values, and we have 361 unlabelled Events.

Max Gust Speed should be the easiest one to fix, so we’ll start there.

weather$Max_Gust_Speed_MPH <- gsub("-", 0, weather$Max_Gust_Speed_MPH)

weather$Max_Gust_Speed_MPH <- as.numeric(weather$Max_Gust_Speed_MPH)

Great! We changed any absent values for Maximum Gust Speed to 0 MPH and changed the variable type to a number. Uh oh, looks like there are still 185 NA values for Max Gust Speed. That’s a lot to try to replace. I would normally suggest generating a model that could try to predict those values based on other known values, but for now, we’ll just leave it alone.

Since there is only one missing Mean Temperature, it seems the easiest way to fill in the hole is to look up what the average temperature was that day. Note: I certainly would not recommend this if it were any more than one missing value

weather[which(is.na(weather$Mean_Temperature_F)), 1]
[1] "2016-02-14"

Ok, so we’re looking for the Mean Temperature on February 14, 2016 in the zipcode 98101 (according to dataset documentation). Looks like the mean temperature that day was 50 degrees F.

Time to substitute in that value.

weather[490, "Mean_Temperature_F"] <- 50

Perfect. Now what to do with the unlabelled “Event” categories. The dataset “ReadMe” file from Pronto! doesn’t include any information about this weather dataset. The only thing I can think to do is refer to the Event as “Other”.

weather$Events <- gsub("^$", "Other", weather$Events)
weather$Events <- as.factor(weather$Events)

Ok, we’re in good shape. Now to do a few quick visualizations.

Temperature

Minimum

Mean

Maximum

Events

Combining Weather and Trips Dataset

Good, so we can now see some parts of the weather data. Let’s combine the weather data with our trip data. Let’s try a left join from the dplyr package.

# Make a copy of the data frame
trip_3 <- trip_2

# Change column name in trip_3 to match weather dataset
trip_3$Date <- trip_3$start_date

# Left join the trip and weather dataframes by date.
trip_weather <- left_join(trip_3, weather, by = "Date")

Mean Temperature vs. Number of Trips

Ok. Now let’s see how the number of trips per day is influenced by weather in Figure 15 (mean temperature, rounded to the nearest 5 degrees F)

trip_weather %>% 
  mutate(mean_rounded = 5*round(Mean_Temperature_F/5)) %>%
  group_by(mean_rounded) %>%
  summarize(N = n()) %>%
  ggplot(aes(x = mean_rounded , y = N)) +
  geom_line() + 
  theme_bw() + 
  labs(x = "Temperature (Rounded to Nearest 5 degrees F", y = "Number of Trips")
Mean Temperature vs. Number of Trips

Figure 15: Mean Temperature vs. Number of Trips

So, as expected, there are more trips when the weather is mild but not too warm (over 70F) or too cold (below 50F). However, this figure may be influenced by the overall number of days that exhibited each mean temperature. Let’s try to standardize that in Figure 16.

t1 <- trip_weather %>% 
  mutate(rounded = 5*round(Mean_Temperature_F/5)) %>%
  group_by(rounded) %>%
  summarize(n = n(), Days = n_distinct(Date)) %>% 
  mutate(Ratio = n/Days)

  ggplot(data = t1, aes(x = rounded , y = Ratio)) +
  geom_line() + 
  theme_bw() + 
  labs(x = "Temperature (Rounded to Nearest 5 degrees F)", y = "Number of Trips/Number of Days at Temperature")
Standardization of Temperature and Number of Trips

Figure 16: Standardization of Temperature and Number of Trips

So when we standardize our measurements, correcting for the number of days that actually reached each temperature, we see a steady increase in the number of trips until around 75F where the trend levels off. People are more likely to ride a bike when it’s warm outside.

Precipitation vs. Number of Trips

If you’ve ever heard of Seattle, you probably hear that it rains all the time there. Let’s see if that has an impact on the number of trips taken in a day.

We’ll start with a figure standardized for number of days at a precipitation level, rounded to the nearest 0.2 inches.

t2 <- trip_weather %>% 
  mutate(rounded = 0.2*round(Precipitation_In/0.2)) %>%
  group_by(rounded) %>%
  summarize(n = n(), Days = n_distinct(Date)) %>% 
  mutate(Ratio2 = n/Days)

  ggplot(data = t2, aes(x = rounded , y = Ratio2)) +
  geom_line() + 
  theme_bw() + 
  labs(x = "Precipitation (Rounded to Nearest 0.2 Inches)", y = "Number of Trips/Number of Days at Precipitaton Level")
Precipitation vs. Number of Trips

Figure 17: Precipitation vs. Number of Trips

Looks like in Figure 17 that even Seattleites have a limit when it comes to riding a bike in the rain. The more it rained, the fewer trips were taken per day.

Conclusions

So what did we learn from all of this? In the nearly 2 years since Pronto! opened in Seattle:

  • 236,065 bike trips were taken using this service
  • More trips occur in the spring and summer than winter/autumn
  • More trips occur during warm/dry weather
  • People tend to ride downhill more frequently than uphill
  • Pronto! bikes are used for work commutes during the week and more leisurely use on weekends
  • Short-Term Pass Holders are more likely to incur extra charges due to surpassing their time limit

Suggestions for Pronto!

  • Give users bonuses for bringing bikes back to a station on the top of the hill
  • Hold discounts in fall/winter
  • Find a way to alert short-term users that their time limit will be ending soon, and where the nearest station is to them at that time
  • Consider a 3rd membership option: “Commuter”. This may allow users to take bikes between 7-10 AM and 4-7 PM for free, but operate under a different time limit schedule during other times of day.

As always, I appreciate any and all feedback from my work and appreciate you taking the time to see what I’ve done. Thanks!