1 Exploring Categorical Data

1.1 Contingency table review

The dataset has been loaded into your workspace as comics.

  • Type the name of the dataset to look at the rows and columns of the dataset.
  • View the levels() that the align variable can take.
  • View the levels() that the gender variable can take.
  • Create a contingency table of the same two variables.
# Print the first rows of the data
# comics (huge amount to load so I am making this line a comment)

# Check levels of align
levels(comics$align)
## [1] "Bad"                "Good"               "Neutral"           
## [4] "Reformed Criminals"
# Check the levels of gender
levels(comics$gender)
## [1] "Female" "Male"   "Other"
# Create a 2-way contingency table
table(comics$align, comics$gender)
##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0

1.2 Dropping levels

The contingency table from the last exercise revealed that there are some levels that have very low counts. To simplify the analysis, it often helps to drop such levels.

In R, this requires two steps: first filtering out any rows with the levels that have very low counts, then removing these levels from the factor variable with droplevels(). This is because the droplevels() function would keep levels that have just 1 or 2 counts; it only drops levels that don’t exist in a dataset.

The contingency table from the last exercise is available in your workspace as tab. Load the dplyr package. Print tab to find out which level of align has the fewest total entries. *Use filter() to filter out all rows of comics with that level, then drop the unused level with droplevels(). Save the simplifed dataset over the old one as comics.

tab <- table(comics$align, comics$gender)

# Print tab
tab
##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0
# Remove align level
comics <- comics %>%
  filter(align != "Reformed Criminals") %>%
  droplevels()

1.3 Side-by-side barcharts

While a contingency table represents the counts numerically, it’s often more useful to represent them graphically.

Here you’ll construct two side-by-side barcharts of the comics data. This shows that there can often be two or more options for presenting the same data. Passing the argument position = "dodge" to geom_bar() says that you want a side-by-side (i.e. not stacked) barchart.

  • Load the ggplot2 package.
  • Create a side-by-side barchart with align on the x-axis and gender as the fill aesthetic.
  • Create another side-by-side barchart with gender on the x-axis and align as the fill aesthetic. Rotate the axis labels 90 degrees to help readability.
# Create side-by-side barchart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) + geom_bar(position = "dodge")

# Create side-by-side barchart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) + geom_bar(position = "dodge") + theme(axis.text.x = element_text(angle = 90))

1.4 Conditional proportions

The following code generates tables of joint and conditional proportions, respectively:

tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3) # Print fewer digits
prop.table(tab)     # Joint proportions
##          
##             Female     Male    Other
##   Bad     0.082210 0.395160 0.001672
##   Good    0.130135 0.251333 0.000888
##   Neutral 0.043692 0.094021 0.000888
prop.table(tab, 2)  # Conditional on columns
##          
##           Female  Male Other
##   Bad      0.321 0.534 0.485
##   Good     0.508 0.339 0.258
##   Neutral  0.171 0.127 0.258

Go ahead and run it in the console. Approximately what proportion of all female characters are good? 51%

1.5 Counts vs. proportions

Bar charts can tell dramatically different stories depending on whether they represent counts or proportions and, if proportions, what the proportions are conditioned on. To demonstrate this difference, you’ll construct two barcharts in this exercise: one of counts and one of proportions.

  • Create a stacked barchart of gender counts with align on the x-axis.
  • Create a stacked barchart of gender proportions with align on the x-axis by setting the position argument to geom_bar() equal to "fill".
# Plot of gender by align
ggplot(comics, aes(x = align, fill = gender)) + geom_bar()

# Plot proportion of gender, conditional on align
ggplot(comics, aes(x = align, fill = gender)) + geom_bar(position = "fill")

Excellent work! By adding position = "fill" to geom_bar(), you are saying you want the bars to fill the entire height of the plotting window, thus displaying proportions and not raw counts.

1.6 Marginal barchart

If you are interested in the distribution of alignment of all superheroes, it makes sense to construct a barchart for just that single variable.

You can improve the interpretability of the plot, though, by implementing some sensible ordering. Superheroes that are "Neutral" show an alignment between "Good" and "Bad", so it makes sense to put that bar in the middle.

  • Reorder the levels of align using the factor() function so that printing them reads “Bad”, “Neutral”, then “Good”.
  • Create a barchart of counts of the align variable.
# Change the order of the levels in align
comics$align <- factor(comics$align, levels = c("Bad", "Neutral", "Good"))

# Create plot of align
ggplot(comics, aes(x = align)) + geom_bar()

1.7 Conditional barchart

Now, if you want to break down the distribution of alignment based on gender, you’re looking for conditional distributions.

You could make these by creating multiple filtered datasets (one for each gender) or by faceting the plot of alignment based on gender. As a point of comparison, we’ve provided your plot of the marginal distribution of alignment from the last exercise.

  • Create a barchart of align faceted by gender.
# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) + 
  geom_bar() +
  facet_wrap(~ gender)

1.8 Improve piechart

The piechart is a very common way to represent the distribution of a single categorical variable, but they can be more difficult to interpret than barcharts.

This is a piechart of a dataset called pies that contains the favorite pie flavors of 98 people. Improve the representation of these data by constructing a barchart that is ordered in descending order of count.

  • Use the code provided to reorder the levels of flavor so that they’re in descending order by count.
  • Create a barchart of flavor and orient the labels vertically so that they’re easier to read. The default coloring may look drab by comparison, so change the fill of the bars to "chartreuse".
# Put levels of flavor in decending order
#lev <- c("apple", "key lime", "boston creme", "blueberry", "cherry", "pumpkin", "strawberry")
#pies$flavor <- factor(pies$flavor, levels = lev)

# Create barchart of flavor
#ggplot(pies, aes(x = flavor)) + 
#geom_bar(fill = "chartreuse") + theme(axis.text.x = element_text(angle = 90))

# Alternative solution to finding levels
# lev <- unlist(select(arrange(cnt, desc(n)), flavor))

2 Numerical Summaries

2.1 Faceted histogram

In this chapter, you’ll be working with the cars dataset, which records characteristics on all of the new models of cars for sale in the US in a certain year. You will investigate the distribution of mileage across a categorial variable, but before you get there, you’ll want to familiarize yourself with the dataset.

2.1.1 Instructions

The cars dataset has been loaded in your workspace.

  • Load the ggplot2 package.
  • View the size of the data and the variable types using str().
  • Plot a histogram of city_mpg faceted by suv, a logical variable indicating whether the car is an suv or not.
library(ggplot2)

str(cars)
## 'data.frame':    428 obs. of  19 variables:
##  $ name       : Factor w/ 425 levels "Acura 3.5 RL 4dr",..: 66 67 68 69 70 114 115 133 129 130 ...
##  $ sports_car : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ suv        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ wagon      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ minivan    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ pickup     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ all_wheel  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ rear_wheel : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ msrp       : int  11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
##  $ dealer_cost: int  10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
##  $ eng_size   : num  1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
##  $ ncyl       : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ horsepwr   : int  103 103 140 140 140 132 132 130 110 130 ...
##  $ city_mpg   : int  28 28 26 26 26 29 29 26 27 26 ...
##  $ hwy_mpg    : int  34 34 37 37 37 36 36 33 36 33 ...
##  $ weight     : int  2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
##  $ wheel_base : int  98 98 104 104 104 105 105 103 103 103 ...
##  $ length     : int  167 153 183 183 183 174 174 168 168 168 ...
##  $ width      : int  66 66 69 68 69 67 67 67 67 67 ...
ggplot(cars, aes(x = city_mpg)) +
  geom_histogram() +
  facet_wrap(~suv)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 14 rows containing non-finite values (stat_bin).

2.2 Boxplots and density plots

The mileage of a car tends to be associated with the size of its engine (as measured by the number of cylinders). To explore the relationship between these two variables, you could stick to using histograms, but in this exercise you’ll try your hand at two alternatives: the box plot and the density plot.

2.2.1 Instructions

A quick look at unique(cars$ncyl) shows that there are more possible levels of ncyl than you might think. Here, restrict your attention to the most common levels.

  • Filter cars to include only cars with 4, 6, or 8 cylinders and save the result as common_cyl. The %in% operator may prove useful here.
  • Create side-by-side box plots of city_mpg separated out by ncyl.
  • Create overlaid density plots of city_mpg colored by ncyl.
library(dplyr)
library(ggplot2)
  # Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4, 6, 8))

# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
  geom_boxplot()
## Warning: Removed 11 rows containing non-finite values (stat_boxplot).

# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
  geom_density(alpha = .3)
## Warning: Removed 11 rows containing non-finite values (stat_density).

2.3 Compare distribution via plots

Which of the following interpretations of the plot is not valid? Answer: The variability in mileage of 8 cylinder cars is similar ot the variability in mileage of 4 cylinder cars.

2.4 Marginal and conditional histograms

Now, turn your attention to a new variable: horsepwr. The goal is to get a sense of the marginal distribution of this variable and then compare it to the distribution of horsepower conditional on the price of the car being less than $25,000.

You’ll be making two plots using the “data pipeline” paradigm, where you start with the raw data and end with the plot.

2.4.1 Instructions

  • Create a histogram of the distribution of horsepwr across all cars and add an appropriate title. Start by piping in the raw dataset.
  • Create a second histogram of the distribution of horsepower, but only for those cars that have an msrp less than $25,000. Keep the limits of the x-axis so that they’re similar to that of the first plot, and add a descriptive title.
# Create hist of horsepwr
cars %>%
  ggplot(aes(x = horsepwr)) +
  geom_histogram() +
  ggtitle("Distribution of horsepower for all cars")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Create hist of horsepwr for affordable cars
cars %>%
  filter(msrp < 25000) %>%
  ggplot(aes(x = horsepwr)) +
  geom_histogram() +
  xlim(c(90, 550)) +
  ggtitle("Distribution of horsepower for affordable cars")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

2.5 Marginal and conditional histograms interpretation

Observe the two histograms in the plotting window and decide which of the following is a valid interpretation. Answer: The highest horsepower car in the less expensive range has just under 250 horsepower

2.6 Three binwidths

Before you take these plots for granted, it’s a good idea to see how things change when you alter the binwidth. The binwidth determines how smooth your distribution will appear: the smaller the binwidth, the more jagged your distribution becomes. It’s good practice to consider several binwidths in order to detect different types of structure in your data.

2.6.1 Instructions

Create the following three plots, adding a title to each to indicate the binwidth used:

  • A histogram of horsepower (i.e. horsepwr) with a binwidth of 3.
  • A second histogram of horsepower with a binwidth of 30.
  • A third histogram of horsepower with a binwidth of 60.
# Create hist of horsepwr with binwidth of 3
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 3) +
  ggtitle("binwidth = 3")

# Create hist of horsepwr with binwidth of 30
cars %>% 
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 30) +
  ggtitle("binwidth = 30")

# Create hist of horsepwr with binwidth of 60
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 60) +
  ggtitle("binwidth = 60")

2.7 Three binwidths interpretations

What feature is present in Plot A that’s not found in B or C) Answer: There is a tendency for cars to have horsepower right at 200 or 300 horsepower

2.8 Box plots for outliers

In addition to indicating the center and spread of a distribution, a box plot provides a graphical means to detect outliers. You can apply this method to the msrp column (manufacturer’s suggested retail price) to detect if there are unusually expensive or cheap cars.

2.8.1 Instructions

Construct a box plot of msrp. Exclude the largest 3-5 outliers by filtering the rows to retain cars less than $100,000. Save this reduced dataset as cars_no_out. *Construct a similar box plot of msrp using this reduced dataset. Compare the two plots.

# Construct box plot of msrp
cars %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()

# Exclude outliers from data
cars_no_out <- cars %>%
  filter(msrp < 100000)
  
# Construct box plot of msrp using the reduced dataset
cars_no_out %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()

2.9 Plot selection

Consider two other columns in the cars dataset: city_mpg and width. Which is the most appropriate plot for displaying the important features of their distributions? Remember, both density plots and box plots display the central tendency and spread of the data, but the box plot is more robust to outliers.

2.9.1 Instructions

Use density plots or box plots to construct the following visualizations. For each variable, try both plots and submit the one that is better at capturing the important structure.

Display the distribution of city_mpg. Display the distribution of width.

# Create plot of city_mpg
cars %>%
  ggplot(aes(x = 1, y = city_mpg)) +
  geom_boxplot()
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).

# Create plot of width
cars %>% 
  ggplot(aes(x = width)) +
  geom_density()
## Warning: Removed 28 rows containing non-finite values (stat_density).

2.10 3 variable plot

Faceting is a valuable technique for looking at several conditional distributions at the same time. If the faceted distributions are laid out in a grid, you can consider the association between a variable and two others, one on the rows of the grid and the other on the columns.

2.10.1 Instructions

common_cyl, which you created to contain only cars with 4, 6, or 8 cylinders, is available in your workspace.

Create a histogram of hwy_mpg faceted on both ncyl and suv. Add a title to your plot to indicate what variables are being faceted on.

# Facet hists using hwy mileage and ncyl
common_cyl %>% 
  ggplot(aes(x = hwy_mpg)) +
  geom_histogram() +
  facet_grid(ncyl ~ suv) +
  ggtitle("Mileage by suv and ncyl")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 11 rows containing non-finite values (stat_bin).

2.11 Interpret 3 var plot

Which of the following interpretations of the plot is valid? Answer: Across both SUVs and non-SUVs, mileage tends to decrease as the number of cylinders increase

3 Numerical Summaries

Now that we’ve looked at exploring categorical and numerical data, you’ll learn some useful statistics for describing distributions of data. The choice of measure for center can have a dramatic impact on what we consider to be a typical observation, so it is important that you consider the shape of the distribution before deciding on the measure.

3.1 Choice Of Centre Measure

Which set of measures of central tendency would be worst for describing the two distributions shown here? Answer: Mean, Mode

3.2 Calculate Center Measures

Throughout this chapter, you will use data from gapminder, which tracks demographic data in countries of the world over time. To learn more about it, you can bring up the help file with ?gapminder.

For this exercise, focus on how the life expectancy differs from continent to continent. This requires that you conduct your analysis not at the country level, but aggregated up to the continent level. This is made possible by the one-two punch of group_by() and summarize(), a very powerful syntax for carrying out the same analysis on different subsets of the full dataset.

3.2.1 Instructions

Create a dataset called gap2007 that contains only data from the year 2007. Using gap2007, calculate the mean and median life expectancy for each continent. Don’t worry about naming the new columns produced by summarize(). Confirm the trends that you see in the medians by generating side-by-side box plots of life expectancy for each continent.

library(gapminder)
# Create dataset of 2007 data
gap2007 <- filter(gapminder, gapminder$year == "2007")

# Compute groupwise mean and median lifeExp
gap2007 %>%
  group_by(continent) %>%
  summarize(mean(lifeExp),
            median(lifeExp))
## # A tibble: 5 x 3
##   continent `mean(lifeExp)` `median(lifeExp)`
##      <fctr>           <dbl>             <dbl>
## 1    Africa            54.8              52.9
## 2  Americas            73.6              72.9
## 3      Asia            70.7              72.4
## 4    Europe            77.6              78.6
## 5   Oceania            80.7              80.7
# Generate box plots of lifeExp for each continent
gap2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot()

3.3 Choice of Spread Measure

3.3.1 Instructions

The choice of measure for spread can dramatically impact how variable we consider our data to be, so it is important that you consider the shape of the distribution before deciding on the measure.

Which set of measures of spread would be worst for describing the two distributions shown here? Answer: Variance, Range

3.4 Calculate Spread Measures

Let’s extend the powerful group_by() and summarize() syntax to measures of spread. If you’re unsure whether you’re working with symmetric or skewed distributions, it’s a good idea to consider a robust measure like IQR in addition to the usual measures of variance or standard deviation.

3.4.1 Instructions

The gap2007 dataset that you created in an earlier exercise is available in your workspace.

For each continent in gap2007, summarize life expectancies using the sd(), the IQR(), and the count of countries, n(). No need to name the new columns produced here. The n() function within your summarize() call does not take any arguments. Graphically compare the spread of these distributions by constructing overlaid density plots of life expectancy broken down by continent.

# Compute groupwise measures of spread
gap2007 %>%
  group_by(continent) %>%
  summarize(sd(lifeExp),
            IQR(lifeExp),
            n())
## # A tibble: 5 x 4
##   continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
##      <fctr>         <dbl>          <dbl> <int>
## 1    Africa         9.631         11.610    52
## 2  Americas         4.441          4.632    25
## 3      Asia         7.964         10.152    33
## 4    Europe         2.980          4.782    30
## 5   Oceania         0.729          0.516     2
# Generate overlaid density plots
gap2007 %>%
  ggplot(aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.3)

3.5 Choose measures for center and spread

Consider the density plots shown here. What are the most appropriate measures to describe their centers and spreads? In this exercise, you’ll select the measures and then calculate them.

3.5.1 Intstructions

Using the shapes of the density plots, calculate the most appropriate measures of center and spread for the following:

The distribution of life expectancy in the countries of the Americas. Note you’ll need to apply a filter here. The distribution of country populations across the entire gap2007 dataset.

# Compute stats for lifeExp in Americas
gap2007 %>%
  filter(continent == "Americas") %>%
  summarize(mean(lifeExp),
            sd(lifeExp))
## # A tibble: 1 x 2
##   `mean(lifeExp)` `sd(lifeExp)`
##             <dbl>         <dbl>
## 1            73.6          4.44
# Compute stats for population
gap2007 %>%
  summarize(median(pop),
            IQR(pop))
## # A tibble: 1 x 2
##   `median(pop)` `IQR(pop)`
##           <dbl>      <dbl>
## 1      10517531   26702008

3.6 Describe the Shape

To build some familiarity with distributions of different shapes, consider the four that are plotted here. Which of the following options does the best job of describing their shape in terms of modality and skew/symmetry? Answer: unimodal left-skewed; B: unimodal symmetric; C: unimodal right-skewed, D: bimodal symmetric.

3.7 Transformations

Highly skewed distributions can make it very difficult to learn anything from a visualization. Transformations can be helpful in revealing the more subtle structure.

Here you’ll focus on the population variable, which exhibits strong right skew, and transform it with the natural logarithm function (log() in R).

3.7.1 Instructions

Using the gap2007 data:

Create a density plot of the population variable. Mutate a new column called log_pop that is the natural log of the population and save it back into gap2007. Create a density plot of your transformed variable.

# Create density plot of old variable
gap2007 %>%
  ggplot(aes(x = pop)) +
  geom_density()

# Transform the skewed pop variable
gap2007 <- gap2007 %>%
  mutate(log_pop = log(pop))

# Create density plot of new variable
gap2007 %>%
  ggplot(aes(x = log_pop)) +
  geom_density()

3.8 Identify Outliers

Consider the distribution, shown here, of the life expectancies of the countries in Asia. The box plot identifies one clear outlier: a country with a notably low life expectancy. Do you have a guess as to which country this might be? Test your guess in the console using either min() or filter(), then proceed to building a plot with that country removed.

3.8.1 Instructions

gap2007 is still available in your workspace.

Apply a filter so that it only contains observations from Asia, then create a new variable called is_outlier that is TRUE for countries with life expectancy less than 50. Assign the result to gap_asia. Filter gap_asia to remove all outliers, then create another box plot of the remaining life expectancies.

# Filter for Asia, add column indicating outliers
gap_asia <- gap2007 %>%
  filter(continent == "Asia") %>%
  mutate(is_outlier = lifeExp < 50)

# Remove outliers, create box plot of lifeExp
gap_asia %>%
  filter(!is_outlier) %>%
  ggplot(aes(x = 1, y = lifeExp)) +
  geom_boxplot()

4 Case Study

4.1 Spam and num_char

Is there an association between spam and the length of an email? You could imagine a story either way:

  • Spam is more likely to be a short message tempting me to click on a link, or
  • My normal email is likely shorter since I exchange brief emails with my friends all the time.

Here, you’ll use the email dataset to settle that question. Begin by bringing up the help file and learning about all the variables with ?email.

As you explore the association between spam and the length of an email, use this opportunity to try out linking a dplyr chain with the layers in a ggplot2 object.

4.1.1 Instructions

Using the email dataset

  • Load the packages ggplot2, dplyr, and openintro.
  • Compute appropriate measures of the center and spread of num_char for both spam and not-spam using group_by() and summarize(). No need to name the new columns created by summarize().
  • Construct side-by-side box plots to visualize the association between the same two variables. It will be useful to mutate() a new column containing a log-transformed version of num_char.
# Load packages

library(ggplot2)
library(dplyr)
library(openintro)
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following object is masked _by_ '.GlobalEnv':
## 
##     cars
## The following object is masked from 'package:ggplot2':
## 
##     diamonds
## The following objects are masked from 'package:datasets':
## 
##     cars, trees
# Compute summary statistics
email %>%
  group_by(spam) %>%
  summarize(median(num_char),
            IQR(num_char))
## # A tibble: 2 x 3
##    spam `median(num_char)` `IQR(num_char)`
##   <dbl>              <dbl>           <dbl>
## 1     0               6.83           13.58
## 2     1               1.05            2.82
# Create plot
email %>%
  mutate(log_num_char = log(num_char)) %>%
  ggplot(aes(x = spam, y = log_num_char)) +
  geom_boxplot()
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

4.2 Spam and !!!

Let’s look at a more obvious indicator of spam: exclamation marks. exclaim_mess contains the number of exclamation marks in each message. Using summary statistics and visualization, see if there is a relationship between this variable and whether or not a message is spam.

Experiment with different types of plots until you find one that is the most informative. Recall that you’ve seen:

  • Side-by-side box plots
  • Faceted histograms
  • Overlaid density plots

4.2.1 instuctions

The email dataset is still available in your workspace.

  • Calculate appropriate measures of the center and spread of exclaim_mess for both spam and not-spam using group_by() and summarize().
  • Construct an appropriate plot to visualize the association between the same two variables, adding in a log-transformation step if necessary.
  • If you decide to use a log transformation, remember that log(0) is -Inf in R, which isn’t a very useful value! You can get around this by adding a small number (like .01) to the quantity inside the log() function. This way, your value is never zero. This small shift to the right won’t affect your results.
# Compute center and spread for exclaim_mess by spam
email %>%
  group_by(spam) %>%
  summarize(median(exclaim_mess),
            IQR(exclaim_mess))
## # A tibble: 2 x 3
##    spam `median(exclaim_mess)` `IQR(exclaim_mess)`
##   <dbl>                  <dbl>               <dbl>
## 1     0                      1                   5
## 2     1                      0                   1
# Create plot for spam and exclaim_mess
email %>%
  mutate(log_exclaim_mess = log(exclaim_mess + .01)) %>%
  ggplot(aes(x = log_exclaim_mess)) +
  geom_histogram() +
  facet_wrap(~ spam)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Alternative plot: side-by-side box plots
email %>%
  mutate(log_exclaim_mess = log(exclaim_mess + .01)) %>%
  ggplot(aes(x = 1, y = log_exclaim_mess)) +
  geom_boxplot() +
  facet_wrap(~ spam)

# Alternative plot: Overlaid density plots
email %>%
  mutate(log_exclaim_mess = log(exclaim_mess + .01)) %>%
  ggplot(aes(x = log_exclaim_mess, fill = spam)) +
  geom_density(alpha = 0.3)

4.3 Spam and !!! interpretations

Which interpretation of these faceted histograms is not correct Answer: There are more cases of spam in this dataset than not-spam.

4.4 Collapsing levels

If it was difficult to work with the heavy skew of exclaim_mess, the number of images attached to each email (image) poses even more of a challenge. Run the following code at the console to get a sense of its distribution:

table(email$image)

Recall that this tabulates the number of cases in each category (so there were 3811 emails with 0 images, for example). Given the very low counts at the higher number of images, let’s collapse image into a categorical variable that indicates whether or not the email had at least one image. In this exercise, you’ll create this new variable and explore its association with spam.

4.4.1 instructions

Starting with email, form a continuous chain that links together the following tasks:

  • Create a new variable called has_image that is TRUE where the number of images is greater than zero and FALSE otherwise.
  • Create an appropriate plot with email to visualize the relationship between has_image and spam.
# Create plot of proportion of spam by image
email %>%
  mutate(has_image = image > 0) %>%
  ggplot(aes(x = has_image, fill = spam)) +
  geom_bar(position = "fill")

4.5 Image and spam interpretation

Which of the following interpretations of the plot is valid? Answer: An email without an image is more likely to be not-spam than spam.

4.6 Data Integrity

In the process of exploring a dataset, you’ll sometimes come across something that will lead you to question how the data were compiled. For example, the variable num_char contains the number of characters in the email, in thousands, so it could take decimal values, but it certainly shouldn’t take negative values.

You can formulate a test to ensure this variable is behaving as we expect:

email$num_char < 0

If you run this code at the console, you’ll get a long vector of logical values indicating for each case in the dataset whether that condition is TRUE. Here, the first 1000 values all appear to be FALSE. To verify that all of the cases indeed have non-negative values for num_char, we can take the sum of this vector:

sum(email$num_char < 0)

This is a handy shortcut. When you do arithmetic on logical values, R treats TRUE as 1 and FALSE as 0. Since the sum over the whole vector is zero, you learn that every case in the dataset took a value of FALSE in the test. That is, the num_char column is behaving as we expect andtaking only non-negative values.

4.6.1 Instructions

Consider the variables image and attach. You can read about them with ?email, but the help file is ambiguous: do attached images count as attached files in this dataset?

Design a simple test to determine if images count as attached files. This involves creating a logical condition to compare the values of the two variables, then using sum() to assess every case in the dataset. Recall that the logical operators are < for less than, <= for less than or equal to, > for greater than, >= for greater than or equal to, and == for equal to.

# Test if images count as attachments
sum(email$image> email$attach)
## [1] 0

4.7 Answering questions with chains

When you have a specific question about a dataset, you can find your way to an answer by carefully constructing the appropriate chain of R code. For example, consider the following question:

"Within non-spam emails, is the typical length of emails shorter for those that were sent to multiple people?"

This can be answered with the following chain:

email %>% filter(spam == “not-spam”) %>% group_by(to_multiple) %>% summarize(median(num_char))

The code makes it clear that you are using num_char to measure the length of an email and median() as the measure of what is typical. If you run this code, you’ll learn that the answer to the question is “yes”: the typical length of non-spam sent to multiple people is a bit lower than those sent to only one person.

This chain concluded with summary statistics, but others might end in a plot; it all depends on the question that you’re trying to answer

4.7.1 Instructions

Build a chain to answer each of the following questions, both about the variable dollar.

  • For emails containing the word “dollar”, does the typical spam email contain a greater number of occurrences of the word than the typical non-spam email? Create a summary statistic that answers this question.
  • If you encounter an email with greater than 10 occurrences of the word “dollar”, is it more likely to be spam or not-spam? Create a barchart that answers this question.
# Question 1
email %>%
  filter(dollar>0) %>%
  group_by(spam) %>%
  summarize(median(dollar))
## # A tibble: 2 x 2
##    spam `median(dollar)`
##   <dbl>            <dbl>
## 1     0                4
## 2     1                2
# Question 2
email %>%
  filter(dollar>10) %>%
  ggplot(aes(x = spam)) +
  geom_bar()

4.8 What’s in a number?

Turn your attention to the variable called number. Read more about it by pulling up the help file with ?email.

To explore the association between this variable and spam, select and construct an informative plot. For illustrating relationships between categorical variables, you’ve seen

  • Faceted barcharts
  • Side-by-side barcharts
  • Stacked and normalized barcharts.

Let’s practice constructing a faceted barchart.

4.8.1 Instructions

  • Reorder the levels of number so that they preserve the natural ordering of "none", then "small", then "big".
  • Construct a faceted barchart of the association between number and spam.
# Reorder levels
email$number <- factor(email$number, levels = c("none", "small", "big"))

# Construct plot of number
ggplot(email, aes(x = number)) +
  geom_bar() +
  facet_wrap(~ spam)

4.9 What’s in a number interpretation

4.10 Which of the following interpretations of the plot is not valid? Answer: Given that an email contains no number, it is more likely to be spam.