4 Tidyverse

The first package in R that we will talk about is actually a suite of packages that share a common syntax, called the tidyverse. Its authors call it a dialect of the R language, which is kinda fun!

There are eight core packages in the tidyverse.

# You can install individual packages in the tidyverse like this
install.packages("dplyr")

# But I like to use the tidyverse package that allows you to easily install and 
#   load the tidyverse
install.packages("tidyverse")

# Then the package must be attached so that we have access to the functions
library(tidyverse) # Note that you don't need the quotation marks here

# Print out the tidyverse packages
tidyverse::tidyverse_packages()
#  [1] "broom"         "cli"           "crayon"        "dbplyr"       
#  [5] "dplyr"         "dtplyr"        "forcats"       "googledrive"  
#  [9] "googlesheets4" "ggplot2"       "haven"         "hms"          
# [13] "httr"          "jsonlite"      "lubridate"     "magrittr"     
# [17] "modelr"        "pillar"        "purrr"         "readr"        
# [21] "readxl"        "reprex"        "rlang"         "rstudioapi"   
# [25] "rvest"         "stringr"       "tibble"        "tidyr"        
# [29] "xml2"          "tidyverse"

There are a bunch of tidyverse packages, but only the core eight are attached by the library(tidyverse) function call. Every other package must be loaded individually (e.g. library(broom)).
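As a small illustration of loading a non-core package individually, the sketch below attaches broom and uses it on a linear model. The model (mpg ~ wt on the built-in mtcars data) and the object names fit and fit_tidy are assumptions chosen only for this example.

```r
# broom is installed with the tidyverse but is not attached by library(tidyverse)
library(broom)

# Fit a simple linear model on the built-in mtcars data (illustrative only)
fit <- lm(mpg ~ wt, data = mtcars)

# tidy() returns the model coefficients as a tibble, one row per model term
fit_tidy <- tidy(fit)
fit_tidy
```

If you only need one function from a package, you can also skip library() and call it with the package prefix, as in broom::tidy(fit).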

Each of these packages has a set of functions that it comes with.

# List several random functions from the dplyr package
ls("package:dplyr")[sample(1:length(ls("package:dplyr")), size = 20)]
#  [1] "mutate_each_"          "group_map"             "tbl_nongroup_vars"    
#  [4] "quo_name"              "distinct_at"           "transmute"            
#  [7] "summarize_if"          "dense_rank"            "select_"              
# [10] "collapse"              "failwith"              "select_var"           
# [13] "mutate_at"             "do_"                   "compare_tbls2"        
# [16] "first"                 "as_data_frame"         "contains"             
# [19] "db_explain"            "group_by_drop_default"

For fun, let’s break down how I am getting the random functions from each package. This is a good example of how you can break something down into its smallest components to understand what it is doing.
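One way to break the one-liner into steps is sketched below. The intermediate names (dplyr_functions, n_functions, random_positions, random_functions) are made up for this illustration, and the result changes on every run because sample() is random.

```r
library(dplyr)

# Step 1: ls() lists the names of the objects (mostly functions) in an attached package
dplyr_functions <- ls("package:dplyr")

# Step 2: count how many names there are
n_functions <- length(dplyr_functions)

# Step 3: draw 20 random positions between 1 and n_functions
random_positions <- sample(1:n_functions, size = 20)

# Step 4: use those positions to subset the vector of names
random_functions <- dplyr_functions[random_positions]
random_functions

# A shorter equivalent: sample() can draw from the names directly
sample(dplyr_functions, size = 20)
```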

# Get the help page for a specific package
help(package = dplyr)

You can also navigate to this page from the Help panel in RStudio.

4.1 `tidyverse`-specific resources

tidyverse website – poke around this website and you can stumble on some good stuff
The tidy tools manifesto – who doesn’t love a good manifesto? My favorite part of the tidyverse is the final principle, which is: 4. Design for humans. “Programs must be written for people to read, and only incidentally for machines to execute.” — Hal Abelson
The tidyverse style guide – “Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.” Honestly, I think that joke vastly underestimates how important it is to have good coding style. You can actually read “butitsuremakesthingseasiertoread” pretty easily because you are an expert reader – you’ve been at it every day for years, maybe decades – coding, not so much. I don’t think I can overstate how important it is to write visually pleasing code. It is good practice for your future self as much as it is for your collaborators.

4.1.1 Cheat sheets⁶

4.1.1.1 Core tidyverse

readr – import your data. Need to import data? Cool kids = read_csv().
tidyr – tidy your data. Helps you tidy your data quick. What does tidy data mean? Keep reading…
tibble – a new data.frame – doesn’t have a cheat sheet, just works in the shadows.
dplyr – wrangle your data. Need to get a mean? Find a standard deviation? Look no further.
ggplot2 – plot your data – you already know ggplot.
stringr – manipulate strings. Have a bunch of strings for some reason? Use stringr.
purrr – functional programming. Wanna conserve energy like a cat? Replace for() with map().
forcats – manipulate factors. Using a bunch of factors? Reorder them with forcats.
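To make the purrr entry above concrete, here is a minimal sketch of replacing a for() loop with a map function. The list x and the names means_loop and means_map are made up for this example. map_dbl() is the variant of map() that returns a numeric vector instead of a list.

```r
library(purrr)

# A made-up list of numeric vectors
x <- list(a = 1:5, b = c(2, 4, 6), c = 10)

# With a for() loop: pre-allocate the result, then fill it in one element at a time
means_loop <- numeric(length(x))
for (i in seq_along(x)) {
  means_loop[i] <- mean(x[[i]])
}

# With purrr: apply mean() to every element and collect the results
means_map <- map_dbl(x, mean)

means_loop
means_map
```

The two results hold the same values; the map_dbl() version also keeps the names of the list elements.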

4.1.1.2 Other useful tidyverse packages

broom – clean model output – technically a subpackage of tidymodels, a cousin of the tidyverse
rvest – web scraping – mining data from a website
modelr – modelling – support for modelling data in the tidyverse

4.2 Tidy data

The tidyverse gets its name from the type of data that it is designed to interact with – tidy data. So let’s quickly define tidy data.

Every column is a variable.
Every row is an observation.
Every cell is a single value.

Yikes. That’s abstract…

4.3 Messy data

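Here is an example of some messy data. Each column name (gill, adductor, mantle) is really a value of a single variable, tissue, so the tissue information is stored in the column headers instead of in a column of its own.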

# Make up some random data
mdh_df <- tibble(gill = rnorm(15, mean = 12, sd = 2),
                 adductor = rnorm(15, mean = 18, sd = 2.5),
                 mantle = rnorm(15, mean = 6, sd = 3))

# Print the data
mdh_df
# # A tibble: 15 × 3
#     gill adductor mantle
#    <dbl>    <dbl>  <dbl>
#  1 11.7      17.4  4.90 
#  2  9.06     19.7  2.87 
#  3 11.0      19.4  7.71 
#  4 12.8      16.3  5.59 
#  5 14.7      16.2 13.2  
#  6 11.8      18.9  5.88 
#  7 12.8      19.9  8.07 
#  8 11.9      17.7  6.08 
#  9  9.25     20.2  3.77 
# 10 11.2      19.0  6.57 
# 11 11.2      16.5  0.585
# 12 11.9      18.9 10.4  
# 13 14.2      15.2  6.46 
# 14 13.5      21.6 12.5  
# 15 11.7      23.0  7.43

4.4 Let’s tidy it

mdh_df <- mdh_df %>%
  pivot_longer(everything(),
               names_to = "tissue",
               values_to = "iu_gfw")

mdh_df
# # A tibble: 45 × 2
#    tissue   iu_gfw
#    <chr>     <dbl>
#  1 gill      11.7 
#  2 adductor  17.4 
#  3 mantle     4.90
#  4 gill       9.06
#  5 adductor  19.7 
#  6 mantle     2.87
#  7 gill      11.0 
#  8 adductor  19.4 
#  9 mantle     7.71
# 10 gill      12.8 
# # … with 35 more rows

BAM! That is tidy data.

1. Every column is a variable – tissue and enzyme activity.
2. Every row is an observation – enzyme activity in I.U./g f.w.
3. Every cell is a single value.

We will talk about tidy data more when we get to our own data sets later in the workshop, but for now, let’s continue to talk about the functions from the core tidyverse.

4.5 Tidy `iris` challenge

Take the built-in iris data set and convert it to tidy data format.

# First 6 rows of the iris data set
iris %>%
  head()
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

# Functionally equivalent to this
head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa
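One possible solution is sketched below, assuming that "tidy" here means one measurement per row. The column names measurement and value_cm and the object name iris_tidy are made up for this example.

```r
library(dplyr)
library(tidyr)

# Gather the four measurement columns into name/value pairs, keeping Species as is
iris_tidy <- iris %>%
  pivot_longer(cols = -Species,
               names_to = "measurement",
               values_to = "value_cm")

iris_tidy
```

Each of the 150 flowers contributes four rows, one per measurement, so the result has 600 rows and 3 columns.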


4.6 `dplyr` functions

dplyr has a few core functions that help you organize and analyse your data. The descriptions below are taken from the dplyr vignette.

4.6.1 Rows

filter() chooses rows based on column values.
slice() chooses rows based on location.
arrange() changes the order of the rows.

4.6.2 Columns

select() changes whether or not a column is included.
rename() changes the name of columns.
mutate() changes the values of columns and creates new columns.
relocate() changes the order of the columns.

4.6.3 Groups of rows

summarise() collapses a group into a single row.
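The sketch below tries most of these verbs on the starwars data set that ships with dplyr. The choice of droids and the names droids, height_cm, height_m and droid_summary are assumptions made for illustration. Each step is saved back to the same object so that every verb can be read on its own.

```r
library(dplyr)

# Rows: keep only the droids, then sort them from tallest to shortest
droids <- filter(starwars, species == "Droid")
droids <- arrange(droids, desc(height))

# Columns: keep three columns, rename one, add height in metres, and reorder
droids <- select(droids, name, height, mass)
droids <- rename(droids, height_cm = height)
droids <- mutate(droids, height_m = height_cm / 100)
droids <- relocate(droids, height_m, .after = name)
droids

# Groups of rows: collapse all of the droids into a single summary row
droid_summary <- summarise(droids, mass_avg = mean(mass, na.rm = TRUE))
droid_summary
```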

4.7 Pipe `%>%` operator

The %>% operator is called a pipe. It works by taking what is on the left of the pipe and inserting it as the first argument of the function on the right.

For example, from the magrittr page:

x %>% f is equivalent to f(x)
x %>% f(y) is equivalent to f(x, y)
x %>% f %>% g %>% h is equivalent to h(g(f(x)))

Sometimes for functions with multiple arguments it is necessary to pipe the left-hand-side into other arguments of the function beyond the first argument. In that case you can use the argument placeholder ., as follows:

x %>% f(y, .) is equivalent to f(y, x)
x %>% f(y, z = .) is equivalent to f(y, z = x)
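The equivalences above can be checked with real functions. This is a minimal sketch using base R functions and a made-up vector x; identical() returns TRUE when the two sides give the same result.

```r
library(magrittr)

x <- c(1, 4, 9)

# x %>% f is equivalent to f(x)
identical(x %>% sqrt(), sqrt(x))

# x %>% f(y) is equivalent to f(x, y)
identical(x %>% paste("cm"), paste(x, "cm"))

# x %>% f %>% g is equivalent to g(f(x))
identical(x %>% sqrt() %>% sum(), sum(sqrt(x)))

# x %>% f(y, .) is equivalent to f(y, x)
identical(x %>% paste("cm", .), paste("cm", x))

# x %>% f(y, z = .) is equivalent to f(y, z = x)
identical(2 %>% seq(1, 9, by = .), seq(1, 9, by = 2))
```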

4.7.1 A fun example from R for data science

There are a few ways to write multi-step code.

Nested it would look like this:

# This is almost unreadable
bop(scoop(hop(little_bunny(), through = forest), up = field_mice), on = head)

With saving intermediate variables it would look like this:

# This is slightly more readable, but has a bunch of meaningless intermediates
foo_foo <- little_bunny()
foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)

And finally, with the %>% operator it would look like this:

# This is more readable and avoids needless intermediates
foo_foo %>%
  hop(through = forest) %>%
  scoop(up = field_mice) %>%
  bop(on = head)
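The bunny functions above are not real, so that code cannot be run. Here is a runnable sketch of the same three styles, using base R functions on a made-up vector x. All object names are chosen for illustration.

```r
library(magrittr)

x <- c(1, 4, 9, 16)

# Nested: read from the inside out
nested <- round(mean(sqrt(x)), digits = 1)

# Intermediate variables: read top to bottom, but with extra objects to keep track of
x_sqrt <- sqrt(x)
x_mean <- mean(x_sqrt)
intermediate <- round(x_mean, digits = 1)

# Piped: read top to bottom with no extra objects
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(digits = 1)

# All three styles give the same answer
c(nested, intermediate, piped)
```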

We will practice piping together in a moment.

For the examples that follow we will use the starwars data set, which comes with dplyr.

4.8 `arrange`

# Arrange the tibble by height
starwars %>%
  arrange(height)
# # A tibble: 87 × 14
#    name     height  mass hair_color skin_color eye_color birth_year sex   gender
#    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#  1 Yoda         66    17 white      green      brown            896 male  mascu…
#  2 Ratts T…     79    15 none       grey, blue unknown           NA male  mascu…
#  3 Wicket …     88    20 brown      brown      brown              8 male  mascu…
#  4 Dud Bolt     94    45 none       blue, grey yellow            NA male  mascu…
#  5 R2-D2        96    32 <NA>       white, bl… red               33 none  mascu…
#  6 R4-P17       96    NA none       silver, r… red, blue         NA none  femin…
#  7 R5-D4        97    32 <NA>       white, red red               NA none  mascu…
#  8 Sebulba     112    40 none       grey, red  orange            NA male  mascu…
#  9 Gasgano     122    NA none       white, bl… black             NA male  mascu…
# 10 Watto       137    NA black      blue, grey yellow            NA male  mascu…
# # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
# #   films <list>, vehicles <list>, starships <list>

By default, arrange() sorts height in ascending order, from low to high. You can change that behavior by wrapping the variable you want to arrange by in the desc() function, like below:

# Arrange the tibble by height descending
starwars %>%
  arrange(desc(height))
# # A tibble: 87 × 14
#    name     height  mass hair_color skin_color eye_color birth_year sex   gender
#    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#  1 Yarael …    264    NA none       white      yellow          NA   male  mascu…
#  2 Tarfful     234   136 brown      brown      blue            NA   male  mascu…
#  3 Lama Su     229    88 none       grey       black           NA   male  mascu…
#  4 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
#  5 Roos Ta…    224    82 none       grey       orange          NA   male  mascu…
#  6 Grievous    216   159 none       brown, wh… green, y…       NA   male  mascu…
#  7 Taun We     213    NA none       grey       black           NA   fema… femin…
#  8 Rugor N…    206    NA none       green      orange          NA   male  mascu…
#  9 Tion Me…    206    80 none       grey       black           NA   male  mascu…
# 10 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
# # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
# #   films <list>, vehicles <list>, starships <list>

You can combine tidyverse functions with base R functions, like below:

# The head function will print out the top n individuals
starwars %>%
  arrange(desc(height)) %>%
  head(n = 20)
# # A tibble: 20 × 14
#    name     height  mass hair_color skin_color eye_color birth_year sex   gender
#    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#  1 Yarael …    264    NA none       white      yellow          NA   male  mascu…
#  2 Tarfful     234   136 brown      brown      blue            NA   male  mascu…
#  3 Lama Su     229    88 none       grey       black           NA   male  mascu…
#  4 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
#  5 Roos Ta…    224    82 none       grey       orange          NA   male  mascu…
#  6 Grievous    216   159 none       brown, wh… green, y…       NA   male  mascu…
#  7 Taun We     213    NA none       grey       black           NA   fema… femin…
#  8 Rugor N…    206    NA none       green      orange          NA   male  mascu…
#  9 Tion Me…    206    80 none       grey       black           NA   male  mascu…
# 10 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
# 11 IG-88       200   140 none       metal      red             15   none  mascu…
# 12 Ki-Adi-…    198    82 white      pale       yellow          92   male  mascu…
# 13 Dexter …    198   102 none       brown      yellow          NA   male  mascu…
# 14 Jar Jar…    196    66 none       orange     orange          52   male  mascu…
# 15 Kit Fis…    196    87 none       green      black           NA   male  mascu…
# 16 Mas Ame…    196    NA none       blue       blue            NA   male  mascu…
# 17 Qui-Gon…    193    89 brown      fair       blue            92   male  mascu…
# 18 Dooku       193    80 white      fair       brown          102   male  mascu…
# 19 Wat Tam…    193    48 none       green, gr… unknown         NA   male  mascu…
# 20 Nute Gu…    191    90 none       mottled g… red             NA   male  mascu…
# # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
# #   vehicles <list>, starships <list>

There is also a tidyverse way to do that, which is a bit shorter:

# slice_max will order by a variable and take the top n; ties are kept by
#   default, which is why 22 rows are returned here instead of 20
starwars %>%
  slice_max(order_by = height, n = 20)
# # A tibble: 22 × 14
#    name     height  mass hair_color skin_color eye_color birth_year sex   gender
#    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#  1 Yarael …    264    NA none       white      yellow          NA   male  mascu…
#  2 Tarfful     234   136 brown      brown      blue            NA   male  mascu…
#  3 Lama Su     229    88 none       grey       black           NA   male  mascu…
#  4 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
#  5 Roos Ta…    224    82 none       grey       orange          NA   male  mascu…
#  6 Grievous    216   159 none       brown, wh… green, y…       NA   male  mascu…
#  7 Taun We     213    NA none       grey       black           NA   fema… femin…
#  8 Rugor N…    206    NA none       green      orange          NA   male  mascu…
#  9 Tion Me…    206    80 none       grey       black           NA   male  mascu…
# 10 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
# # … with 12 more rows, and 5 more variables: homeworld <chr>, species <chr>,
# #   films <list>, vehicles <list>, starships <list>
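The output above has 22 rows rather than 20 because slice_max() keeps tied values by default, and several characters share the cut-off height. If you want exactly n rows, a minimal sketch is to set with_ties = FALSE. The object name top_20 is made up for this example.

```r
library(dplyr)

# Drop ties at the cut-off so that exactly n rows are returned
top_20 <- starwars %>%
  slice_max(order_by = height, n = 20, with_ties = FALSE)

nrow(top_20)
```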

4.9 `summarise`

There are a few ways to get the average height of the Star Wars characters.

In base R:

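starwars$height
#  [1] 172 167  96 202 150 178 165  97 183 182 188 180 228 180 173 175 170 180  66
# [20] 170 183 200 190 177 175 180 150  NA  88 160 193 191 170 196 224 206 183 137
# [39] 112 183 163 175 180 178  94 122 163 188 198 196 171 184 188 264 188 196 185
# [58] 157 183 183 170 166 165 193 191 183 168 198 229 213 167  79  96 193 191 178
# [77] 216 234 188 178 206  NA  NA  NA  NA  NA 165

# mean() returns NA because some heights are missing
mean(starwars$height)
# [1] NA

# Remove the missing values before taking the mean
mean(starwars$height, na.rm = TRUE)
# [1] 174.358

In the tidyverse: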

starwars %>%
  summarise(height_avg = mean(height, na.rm = TRUE))
# # A tibble: 1 × 1
#   height_avg
#        <dbl>
# 1       174.

4.10 `group_by`

Let’s say you wanted to know the average and standard deviation of enzyme activity for each tissue in the tidy mdh_df from earlier.

mdh_df %>%
  group_by(tissue) %>%
  summarise(iu_gfw_avg = mean(iu_gfw),
            iu_gfw_sd = sd(iu_gfw))
# # A tibble: 3 × 3
#   tissue   iu_gfw_avg iu_gfw_sd
#   <chr>         <dbl>     <dbl>
# 1 adductor      18.7       2.14
# 2 gill          11.9       1.57
# 3 mantle         6.80      3.37
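The same pattern works on the starwars data, for example to get the average height of each sex. This is a minimal sketch; the names height_by_sex, height_avg and n are chosen for illustration, and n() simply counts the rows in each group.

```r
library(dplyr)

height_by_sex <- starwars %>%
  group_by(sex) %>%
  summarise(height_avg = mean(height, na.rm = TRUE),
            n = n())

height_by_sex
```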

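Warning: you must be careful about the order when reusing variable names. summarise() evaluates its arguments in order, so in the example below iu_gfw has already been overwritten with the group mean (a single value) by the time sd(iu_gfw) runs, and the standard deviation of a single value is NA.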

# Bad order
mdh_df %>%
  group_by(tissue) %>%
  summarise(iu_gfw = mean(iu_gfw),
            sd = sd(iu_gfw))
# # A tibble: 3 × 3
#   tissue   iu_gfw    sd
#   <chr>     <dbl> <dbl>
# 1 adductor  18.7     NA
# 2 gill      11.9     NA
# 3 mantle     6.80    NA
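Two ways around this are sketched below: calculate the standard deviation before the name is overwritten, or give the summaries new names. The data are regenerated so that the block runs on its own, and set.seed(1) is an arbitrary choice, so the values will not match the output above. The names sd_first and new_names are made up for this example.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Recreate the tidy example data (random, so values differ from the text)
set.seed(1)
mdh_df <- tibble(gill = rnorm(15, mean = 12, sd = 2),
                 adductor = rnorm(15, mean = 18, sd = 2.5),
                 mantle = rnorm(15, mean = 6, sd = 3)) %>%
  pivot_longer(everything(),
               names_to = "tissue",
               values_to = "iu_gfw")

# Option 1: calculate sd() first, while iu_gfw still holds the raw values
sd_first <- mdh_df %>%
  group_by(tissue) %>%
  summarise(sd = sd(iu_gfw),
            iu_gfw = mean(iu_gfw))

# Option 2: avoid reusing the name at all
new_names <- mdh_df %>%
  group_by(tissue) %>%
  summarise(iu_gfw_avg = mean(iu_gfw),
            iu_gfw_sd = sd(iu_gfw))

sd_first
new_names
```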

4.11 `iris` group challenge

Use what you have just learned about the tidyverse to calculate the group averages for the iris data.
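One possible solution is sketched below, assuming that the groups are the three species. The summary column names and the object name iris_avg are made up for this example.

```r
library(dplyr)

# Average each measurement within each species
iris_avg <- iris %>%
  group_by(Species) %>%
  summarise(sepal_length_avg = mean(Sepal.Length),
            sepal_width_avg = mean(Sepal.Width),
            petal_length_avg = mean(Petal.Length),
            petal_width_avg = mean(Petal.Width))

iris_avg
```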


⁶ Warning: I have downloaded these cheat sheets and saved them for quick access, so they may not be the most current versions.
