4 Tidyverse
The first package in R that we will talk about is actually a suite of packages that share a common syntax, called the tidyverse. They actually call it a dialect of the R language, which is kinda fun!
There are eight core packages in the tidyverse.
# You can install individual packages in the tidyverse like this
install.packages("dplyr")
# But I like to use the tidyverse package that allows you to easily install and
# load the tidyverse
install.packages("tidyverse")
# Then the package must be attached so that we have access to the functions
library(tidyverse) # Note that you don't need the quotation marks here
# Print out the tidyverse packages
tidyverse::tidyverse_packages()
# [1] "broom" "cli" "crayon" "dbplyr"
# [5] "dplyr" "dtplyr" "forcats" "googledrive"
# [9] "googlesheets4" "ggplot2" "haven" "hms"
# [13] "httr" "jsonlite" "lubridate" "magrittr"
# [17] "modelr" "pillar" "purrr" "readr"
# [21] "readxl" "reprex" "rlang" "rstudioapi"
# [25] "rvest" "stringr" "tibble" "tidyr"
# [29] "xml2" "tidyverse"
There are a bunch of tidyverse packages, but only the core eight are loaded by the library(tidyverse) call. Every other package must be loaded individually (e.g. library(broom)).
Each of these packages has a set of functions that it comes with.
# List several random functions from the dplyr package
ls("package:dplyr")[sample(1:length(ls("package:dplyr")), size = 20)]
# [1] "mutate_each_" "group_map" "tbl_nongroup_vars"
# [4] "quo_name" "distinct_at" "transmute"
# [7] "summarize_if" "dense_rank" "select_"
# [10] "collapse" "failwith" "select_var"
# [13] "mutate_at" "do_" "compare_tbls2"
# [16] "first" "as_data_frame" "contains"
# [19] "db_explain" "group_by_drop_default"
For fun, let’s break down how I am getting the random functions from each package. This is a good example of how you can break something down into its smallest components to understand what it is doing.
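The one-liner above can be unpacked into one step per line. This is a sketch of that breakdown; the intermediate names (dplyr_functions, n_functions, random_positions) are made up here for illustration, and the names you get will differ from run to run because the sampling is random.

```r
library(dplyr)

# Step 1: ls() on an attached package returns the names of everything it exports
dplyr_functions <- ls("package:dplyr")

# Step 2: count how many names there are
n_functions <- length(dplyr_functions)

# Step 3: draw 20 random positions between 1 and n_functions (no repeats)
random_positions <- sample(1:n_functions, size = 20)

# Step 4: use those positions to pull out the matching names
dplyr_functions[random_positions]

# A shorter equivalent: sample() can draw from the names directly
sample(dplyr_functions, size = 20)
```

Running each line on its own and printing the result is usually the quickest way to see what a nested expression is doing.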
# Get the help page for a specific package
help(package = dplyr)
You can also navigate to this page with the Help panel in RStudio.
4.1 tidyverse-specific resources
- tidyverse website – poke around this website and you can stumble on some good stuff
- The tidy tools manifesto – who doesn’t love a good manifesto? My favorite part of the tidyverse is the final principle, which is: 4. Design for humans. “Programs must be written for people to read, and only incidentally for machines to execute.” — Hal Abelson
- The tidyverse style guide – “Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.” Honestly, I think that joke vastly underestimates how important it is to have good coding style. You can actually read “butitsuremakesthingseasiertoread” pretty easily because you are an expert reader (you’ve been at it every day for years, maybe decades). Coding, not so much. I don’t think I can overstate how important it is to write visually pleasing code. It is good practice for your future self as much as it is for your collaborators.
4.1.1 Cheat sheets [6]
4.1.1.1 Core tidyverse
- readr – import your data. Need to import data? Cool kids use read_csv().
- tidyr – tidy your data. Helps you tidy your data quick. What does tidy data mean? Keep reading…
- tibble – a new data.frame. Doesn’t have a cheat sheet, just works in the shadows.
- dplyr – wrangle your data. Need to get a mean? Find a standard deviation? Look no further.
- ggplot2 – plot your data. You already know ggplot.
- stringr – manipulate strings. Have a bunch of strings for some reason? Use stringr.
- purrr – functional programming. Wanna conserve energy like a cat? Replace for() with map().
- forcats – manipulate factors. Using a bunch of factors? Reorder them with forcats.
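To make the "replace for() with map()" idea concrete, here is a small sketch. The list called values is made-up example data, and map_dbl() is the purrr variant of map() that returns a numeric vector instead of a list.

```r
library(purrr)

# Made-up example data: a list of numeric vectors
values <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# With a for loop: create an empty result, then fill it one element at a time
means_loop <- numeric(length(values))
for (i in seq_along(values)) {
  means_loop[i] <- mean(values[[i]])
}

# With purrr: map_dbl() applies mean() to each element and collects the results
means_map <- map_dbl(values, mean)
```

Both versions compute the same three means; the map_dbl() version also keeps the names a, b, and c.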
4.2 Tidy data
The tidyverse gets its name from the type of data that it is designed to interact with – tidy data. So let’s quickly define tidy data.
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
Yikes. That’s abstract…
4.3 Messy data
An example of some messy data.
# Make up some random data
mdh_df <- tibble(gill = rnorm(15, mean = 12, sd = 2),
                 adductor = rnorm(15, mean = 18, sd = 2.5),
                 mantle = rnorm(15, mean = 6, sd = 3))
# Print the data
mdh_df
# # A tibble: 15 × 3
# gill adductor mantle
# <dbl> <dbl> <dbl>
# 1 11.7 17.4 4.90
# 2 9.06 19.7 2.87
# 3 11.0 19.4 7.71
# 4 12.8 16.3 5.59
# 5 14.7 16.2 13.2
# 6 11.8 18.9 5.88
# 7 12.8 19.9 8.07
# 8 11.9 17.7 6.08
# 9 9.25 20.2 3.77
# 10 11.2 19.0 6.57
# 11 11.2 16.5 0.585
# 12 11.9 18.9 10.4
# 13 14.2 15.2 6.46
# 14 13.5 21.6 12.5
# 15 11.7 23.0 7.43
4.4 Let’s tidy it
mdh_df <- mdh_df %>%
  pivot_longer(everything(),
               names_to = "tissue",
               values_to = "iu_gfw")
mdh_df
# # A tibble: 45 × 2
# tissue iu_gfw
# <chr> <dbl>
# 1 gill 11.7
# 2 adductor 17.4
# 3 mantle 4.90
# 4 gill 9.06
# 5 adductor 19.7
# 6 mantle 2.87
# 7 gill 11.0
# 8 adductor 19.4
# 9 mantle 7.71
# 10 gill 12.8
# # … with 35 more rows
BAM! That is tidy data.
1. Every column is a variable – tissue and enzyme activity.
2. Every row is an observation – enzyme activity in I.U./g f.w.
3. Every cell is a single value.
We will talk about tidy data more when we get to our own data sets later in the workshop, but for now, let’s continue to talk about the functions from the core tidyverse.
4.5 Tidy iris challenge
Take the built-in iris data set and convert it to tidy data format.
# First 6 rows of the iris data set
iris %>%
  head()
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
# Functionally equivalent to this
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
Click for Tidy iris solution.
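One possible solution is sketched below; it is not necessarily the one intended by the workshop. It assumes that each of the four measurements should become its own row, and it adds a flower_id column (a name made up here) so that measurements from the same flower can still be matched up. The column names measurement and cm are also choices made for this sketch.

```r
library(dplyr)
library(tidyr)

iris_tidy <- iris %>%
  # Give every flower an id so its four measurements stay linked
  mutate(flower_id = row_number()) %>%
  # Stack the four measurement columns into name/value pairs
  pivot_longer(cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
               names_to = "measurement",
               values_to = "cm")

iris_tidy
```

The 150 flowers with four measurements each become 600 rows, one per measurement.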
4.6 dplyr functions
dplyr has a few core functions that help organize and analyse your data. The descriptions below are taken from the dplyr vignette.
4.6.1 Rows
- filter() chooses rows based on column values.
- slice() chooses rows based on location.
- arrange() changes the order of the rows.
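A short sketch of the three row functions, using the starwars data set that ships with dplyr. The conditions chosen here (droids, the first five rows, sorting by mass) are only examples.

```r
library(dplyr)

# filter(): keep rows whose column values meet a condition
droids <- starwars %>%
  filter(species == "Droid")

# slice(): keep rows by position, here the first five rows
first_five <- starwars %>%
  slice(1:5)

# arrange(): reorder the rows, here from lightest to heaviest
# (rows with a missing mass are placed at the end)
by_mass <- starwars %>%
  arrange(mass)
```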
4.7 Pipe %>% operator
The %>% operator is called a pipe. It works by taking what is on the left of the pipe and inserting it as the first argument of the function on the right.
For example, from the magrittr page:
- x %>% f is equivalent to f(x)
- x %>% f(y) is equivalent to f(x, y)
- x %>% f %>% g %>% h is equivalent to h(g(f(x)))
Sometimes, for functions with multiple arguments, it is necessary to pipe the left-hand side into an argument of the function other than the first. In that case you can use the argument placeholder ., as follows:
- x %>% f(y, .) is equivalent to f(y, x)
- x %>% f(y, z = .) is equivalent to f(y, z = x)
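The equivalences above are easier to trust once you see them with real functions. This is a small sketch using base R functions in place of f, g, and h; the vector x and the result names are made up for illustration.

```r
library(magrittr)

x <- c(4, 9, 16)

# x %>% f is equivalent to f(x)
roots <- x %>% sqrt()                 # same as sqrt(x)

# x %>% f(y) is equivalent to f(x, y)
first_two <- x %>% head(2)            # same as head(x, 2)

# x %>% f %>% g is equivalent to g(f(x))
total <- x %>% sqrt() %>% sum()       # same as sum(sqrt(x))

# x %>% f(y, .) is equivalent to f(y, x)
dashed <- "a, b, c" %>% gsub(", ", "-", .)   # same as gsub(", ", "-", "a, b, c")

# x %>% f(y, z = .) is equivalent to f(y, z = x)
fit <- mtcars %>% lm(mpg ~ wt, data = .)     # same as lm(mpg ~ wt, data = mtcars)
```

The last line is a common reason to reach for the placeholder: lm() takes the formula first and the data second, so the data cannot go in as the first argument.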
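4.7.1 A fun example from R for Data Science
There are a few ways to write code that takes several steps.
Nested, it would look like this:
# This is almost unreadable
# (little_bunny(), hop(), scoop(), and bop() are made-up functions, so this will not run)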
bop(scoop(hop(little_bunny(), through = forest), up = field_mice), on = head)
With saving intermediate variables it would look like this:
# This is slightly more readable, but has a bunch of meaningless intermediates
foo_foo <- little_bunny()
foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)
And finally, with the %>% operator it would look like this:
# This is more readable and avoids needless intermediates
foo_foo %>%
  hop(through = forest) %>%
  scoop(up = field_mice) %>%
  bop(on = head)
We will practice piping together in a moment, using the starwars data set that ships with dplyr.
4.8 arrange
# Arrange the tibble by height
starwars %>%
  arrange(height)
# # A tibble: 87 × 14
# name height mass hair_color skin_color eye_color birth_year sex gender
# <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
# 1 Yoda 66 17 white green brown 896 male mascu…
# 2 Ratts T… 79 15 none grey, blue unknown NA male mascu…
# 3 Wicket … 88 20 brown brown brown 8 male mascu…
# 4 Dud Bolt 94 45 none blue, grey yellow NA male mascu…
# 5 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
# 6 R4-P17 96 NA none silver, r… red, blue NA none femin…
# 7 R5-D4 97 32 <NA> white, red red NA none mascu…
# 8 Sebulba 112 40 none grey, red orange NA male mascu…
# 9 Gasgano 122 NA none white, bl… black NA male mascu…
# 10 Watto 137 NA black blue, grey yellow NA male mascu…
# # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
# # films <list>, vehicles <list>, starships <list>
By default, arrange() sorts height in ascending order, from low to high. You can change that behavior by wrapping the variable you want to arrange by in the desc() function, like below:
# Arrange the tibble by height descending
starwars %>%
  arrange(desc(height))
# # A tibble: 87 × 14
# name height mass hair_color skin_color eye_color birth_year sex gender
# <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
# 1 Yarael … 264 NA none white yellow NA male mascu…
# 2 Tarfful 234 136 brown brown blue NA male mascu…
# 3 Lama Su 229 88 none grey black NA male mascu…
# 4 Chewbac… 228 112 brown unknown blue 200 male mascu…
# 5 Roos Ta… 224 82 none grey orange NA male mascu…
# 6 Grievous 216 159 none brown, wh… green, y… NA male mascu…
# 7 Taun We 213 NA none grey black NA fema… femin…
# 8 Rugor N… 206 NA none green orange NA male mascu…
# 9 Tion Me… 206 80 none grey black NA male mascu…
# 10 Darth V… 202 136 none white yellow 41.9 male mascu…
# # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
# # films <list>, vehicles <list>, starships <list>
You can combine tidyverse functions with base R functions, like below:
# The head function will print out the top n number of individuals
starwars %>%
  arrange(desc(height)) %>%
  head(n = 20)
# # A tibble: 20 × 14
# name height mass hair_color skin_color eye_color birth_year sex gender
# <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
# 1 Yarael … 264 NA none white yellow NA male mascu…
# 2 Tarfful 234 136 brown brown blue NA male mascu…
# 3 Lama Su 229 88 none grey black NA male mascu…
# 4 Chewbac… 228 112 brown unknown blue 200 male mascu…
# 5 Roos Ta… 224 82 none grey orange NA male mascu…
# 6 Grievous 216 159 none brown, wh… green, y… NA male mascu…
# 7 Taun We 213 NA none grey black NA fema… femin…
# 8 Rugor N… 206 NA none green orange NA male mascu…
# 9 Tion Me… 206 80 none grey black NA male mascu…
# 10 Darth V… 202 136 none white yellow 41.9 male mascu…
# 11 IG-88 200 140 none metal red 15 none mascu…
# 12 Ki-Adi-… 198 82 white pale yellow 92 male mascu…
# 13 Dexter … 198 102 none brown yellow NA male mascu…
# 14 Jar Jar… 196 66 none orange orange 52 male mascu…
# 15 Kit Fis… 196 87 none green black NA male mascu…
# 16 Mas Ame… 196 NA none blue blue NA male mascu…
# 17 Qui-Gon… 193 89 brown fair blue 92 male mascu…
# 18 Dooku 193 80 white fair brown 102 male mascu…
# 19 Wat Tam… 193 48 none green, gr… unknown NA male mascu…
# 20 Nute Gu… 191 90 none mottled g… red NA male mascu…
# # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
# # vehicles <list>, starships <list>
There is also a tidyverse way to do that, which is a bit shorter:
# slice_max will order by a variable and take the top n
starwars %>%
  slice_max(order_by = height, n = 20)
# # A tibble: 22 × 14
# name height mass hair_color skin_color eye_color birth_year sex gender
# <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
# 1 Yarael … 264 NA none white yellow NA male mascu…
# 2 Tarfful 234 136 brown brown blue NA male mascu…
# 3 Lama Su 229 88 none grey black NA male mascu…
# 4 Chewbac… 228 112 brown unknown blue 200 male mascu…
# 5 Roos Ta… 224 82 none grey orange NA male mascu…
# 6 Grievous 216 159 none brown, wh… green, y… NA male mascu…
# 7 Taun We 213 NA none grey black NA fema… femin…
# 8 Rugor N… 206 NA none green orange NA male mascu…
# 9 Tion Me… 206 80 none grey black NA male mascu…
# 10 Darth V… 202 136 none white yellow 41.9 male mascu…
# # … with 12 more rows, and 5 more variables: homeworld <chr>, species <chr>,
# # films <list>, vehicles <list>, starships <list>
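Notice that slice_max() returned 22 rows even though n = 20. That is because slice_max() keeps ties by default (with_ties = TRUE), and several characters share the 20th tallest height. If you want exactly n rows, you can switch that off, as in this sketch (the result names are made up for illustration):

```r
library(dplyr)

# Default behaviour: rows tied with the 20th tallest are all kept
tallest_with_ties <- starwars %>%
  slice_max(order_by = height, n = 20)

# with_ties = FALSE returns exactly 20 rows, dropping the extra tied rows
tallest_exact <- starwars %>%
  slice_max(order_by = height, n = 20, with_ties = FALSE)

nrow(tallest_with_ties)
nrow(tallest_exact)
```

Which behaviour you want depends on the question: keeping ties is fairer to the data, while dropping them gives a predictable row count.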
4.9 summarise
There are a few ways to get the average height of the Star Wars characters.
In base R:
# Pull out the height column as a vector
starwars$height
# [1] 172 167 96 202 150 178 165 97 183 182 188 180 228 180 173 175 170 180 66
# [20] 170 183 200 190 177 175 180 150 NA 88 160 193 191 170 196 224 206 183 137
# [39] 112 183 163 175 180 178 94 122 163 188 198 196 171 184 188 264 188 196 185
# [58] 157 183 183 170 166 165 193 191 183 168 198 229 213 167 79 96 193 191 178
# [77] 216 234 188 178 206 NA NA NA NA NA 165
# mean() returns NA if any value is missing
mean(starwars$height)
# [1] NA
# Drop the missing values before averaging
mean(starwars$height, na.rm = TRUE)
# [1] 174.358
With dplyr:
starwars %>%
  summarise(height_avg = mean(height, na.rm = TRUE))
# # A tibble: 1 × 1
# height_avg
# <dbl>
# 1 174.
4.10 group_by
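Let’s say you wanted a summary for each group instead of for the whole data set, for example the average enzyme activity of each tissue in the mdh_df data from earlier.
mdh_df %>%
  group_by(tissue) %>%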
summarise(iu_gfw_avg = mean(iu_gfw),
iu_gfw_sd = sd(iu_gfw))
# # A tibble: 3 × 3
# tissue iu_gfw_avg iu_gfw_sd
# <chr> <dbl> <dbl>
# 1 adductor 18.7 2.14
# 2 gill 11.9 1.57
# 3 mantle 6.80 3.37
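The same pattern works on the starwars data, for example to get the average height of each sex. This is a sketch; the column names height_avg and n are choices made here, and n() simply counts the rows in each group.

```r
library(dplyr)

height_by_sex <- starwars %>%
  group_by(sex) %>%
  summarise(height_avg = mean(height, na.rm = TRUE),  # ignore missing heights
            n = n())                                  # number of characters per group

height_by_sex
```

Characters with a missing sex form their own NA group rather than being dropped.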
Warning: you must be careful about the order when reusing variable names.
# Bad order
mdh_df %>%
  group_by(tissue) %>%
summarise(iu_gfw = mean(iu_gfw),
sd = sd(iu_gfw))
# # A tibble: 3 × 3
# tissue iu_gfw sd
# <chr> <dbl> <dbl>
# 1 adductor 18.7 NA
# 2 gill 11.9 NA
# 3 mantle 6.80 NA
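The NA values appear because summarise() evaluates its arguments in order, and each one can see the columns created before it. Once iu_gfw has been overwritten with the group mean, sd(iu_gfw) is computed on that single number, and the standard deviation of one value is NA. The sketch below rebuilds a version of mdh_df (with an arbitrary seed, so the numbers will not match the output above) and shows one way around the problem: compute the standard deviation first.

```r
library(dplyr)
library(tidyr)

set.seed(42)  # arbitrary seed, only so this example is reproducible
mdh_df <- tibble(gill = rnorm(15, mean = 12, sd = 2),
                 adductor = rnorm(15, mean = 18, sd = 2.5),
                 mantle = rnorm(15, mean = 6, sd = 3)) %>%
  pivot_longer(everything(), names_to = "tissue", values_to = "iu_gfw")

# Bad order: sd() sees the already-averaged iu_gfw (one value per group)
bad <- mdh_df %>%
  group_by(tissue) %>%
  summarise(iu_gfw = mean(iu_gfw),
            sd = sd(iu_gfw))

# Better order: compute sd while iu_gfw still holds the raw values
good <- mdh_df %>%
  group_by(tissue) %>%
  summarise(sd = sd(iu_gfw),
            iu_gfw = mean(iu_gfw))
```

Giving the summary columns new names, as in the earlier iu_gfw_avg and iu_gfw_sd example, avoids the issue altogether.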
4.11 iris group challenge
Use what you have just learned about the tidyverse to calculate the group averages for the iris data.
Click for iris summary stats solution.
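One possible solution is sketched below; it is not necessarily the one intended by the workshop. It assumes the groups are the three species and uses across(), a dplyr helper not covered above, to apply mean() to every measurement column at once.

```r
library(dplyr)

iris_avg <- iris %>%
  group_by(Species) %>%
  # across(everything(), mean) averages every non-grouping column
  summarise(across(everything(), mean))

iris_avg
```

You could also write out each column by hand, for example summarise(sepal_length_avg = mean(Sepal.Length)), which is longer but uses only the functions introduced in this chapter.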
[6] Warning: I have downloaded these cheat sheets and saved them for my quick access, but they may not be the most current versions of the cheat sheets.