Introduction
This is a quick introduction to the tidyverse
for lab meeting. And because I am not the authority on these things – here are a bunch of links that you may find more useful than following my ramblings.
Links
Books & webpages
- tidyverse website – poke around this website and you can stumble on some good stuff
- R for data science – a wonderfully thorough and useful book that emphasizes the tidyverse
- R for Graduate Students – very accessible introduction to R & the tidyverse
- The tidy tools manifesto – who doesn’t love a good manifesto? 1
- The tidyverse style guide – “Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.” 2
Cheat sheets 3
Core tidyverse
readr
– import your data. Need to import data? Cool kids =read_csv()
. 4tidyr
– tidy your data. Helps you tidy your data quick. What does tidy data mean? Keep reading…tibble
– a new data.frame – doesn’t have a cheat sheet, just works in the shadows.dplyr
– wrangle your data. Need to get a mean? Find a standard deviation? Look no further.ggplot2
– plot your data – you already know ggplot.stringr
– maipulate strings. Have a bunch of strings for some reason? Usestringr
.purrr
– functional programming. Wanna conserve energy like a cat? Replacefor()
withmap()
.forcats
– manipulate factors– Using a bunch of factors? Reorder them withforcats
.
Other useful tidyverse packages
broom
– clean model output – technically a subpackage oftidymodels
a cousin of the tidyverservest
– web scraping – mining data from a websitemodelr
– modelling – support for modelling data in the tidyverse
What is the tidyverse
?
A suite of inter-related packages with a common grammar and syntax.
# Install the tidyverse
install.packages("tidyverse")
# Install
install.packages("broom")
# Load library
require(tidyverse)
require(broom)
Tidy data
The tidyverse gets its name from the type of data that it is designed to interact with – tidy data. So let’s quickly define tidy data.
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
Yikes. That’s abstract…
Messy data
An example of some messy data.
# Make up some random data
mdh_df <- tibble(gill = rnorm(15, mean = 12, sd = 2),
adductor = rnorm(15, mean = 18, sd = 2.5),
mantle = rnorm(15, mean = 6, sd = 3))
# Print the data
mdh_df
# A tibble: 15 x 3
gill adductor mantle
<dbl> <dbl> <dbl>
1 12.3 18.4 10.4
2 10.1 23.2 4.37
3 11.8 17.8 9.54
4 11.5 18.0 5.63
5 11.0 18.8 1.55
6 13.5 21.0 5.01
7 11.6 14.5 6.65
8 8.71 18.7 4.95
9 14.0 14.1 10.1
10 14.9 20.4 -0.159
11 12.5 19.6 1.37
12 11.8 15.6 12.5
13 12.9 16.6 8.57
14 12.8 20.4 6.56
15 11.8 18.8 5.19
Let’s tidy it
mdh_df <- mdh_df %>%
pivot_longer(everything(),
names_to = "tissue",
values_to = "iu_gfw")
mdh_df
# A tibble: 45 x 2
tissue iu_gfw
<chr> <dbl>
1 gill 12.3
2 adductor 18.4
3 mantle 10.4
4 gill 10.1
5 adductor 23.2
6 mantle 4.37
7 gill 11.8
8 adductor 17.8
9 mantle 9.54
10 gill 11.5
# … with 35 more rows
BAM! That is tidy data. 1. Every column is a variable – tissue and enzyme activity. 2. Every row is an observation – enzyme activity in I.U./g f.w.. 3. Every cell is a single value.
Importing data
Why use readr
?
I find it a lot easier to define the data structure as you import data. And it creates a tibble
instead of data.frame
.
# Create a .csv file to import
write_csv(iris, "iris.csv")
# Try read.csv
d.f <- read.csv("iris.csv")
str(d.f)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : chr "setosa" "setosa" "setosa" "setosa" ...
# Try read_csv
read_csv("iris.csv")
# A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# … with 140 more rows
# Set col_type as you import data -- allows you to define level order too
d_f <- read_csv("iris.csv",
col_types = list(Species = col_factor(c("versicolor",
"setosa",
"virginica"))))
d_f
# A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# … with 140 more rows
str(d_f)
tibble [150 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "versicolor","setosa",..: 2 2 2 2 2 2 2 2 2 2 ...
- attr(*, "spec")=
.. cols(
.. Sepal.Length = col_double(),
.. Sepal.Width = col_double(),
.. Petal.Length = col_double(),
.. Petal.Width = col_double(),
.. Species = col_factor(levels = c("versicolor", "setosa", "virginica"), ordered = FALSE, include_na = FALSE)
.. )
glimpse(d_f)
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4…
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1…
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0…
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, …
Wrangling data
Intro to dplyr
dplyr
has a few core functions that are built to work on
Taken from the dplyr
vignette.
Rows
filter()
chooses rows based on column values.slice()
chooses rows based on location.arrange()
changes the order of the rows.
Columns
select()
changes whether or not a column is included.rename()
changes the name of columns.mutate()
changes the values of columns and creates new columns.relocate()
changes the order of the columns.
Groups of rows
summarise()
collapses a group into a single row.
Pipe %>%
From magrittr
.
x %>% f
is equivalent tof(x)
x %>% f(y)
is equivalent tof(x, y)
x %>% f %>% g %>% h
is equivalent toh(g(f(x)))
The argument placeholder
x %>% f(y, .)
is equivalent tof(y, x)
x %>% f(y, z = .)
is equivalent tof(y, z = x)
A fun example from R for data science
foo_foo <- little_bunny()
foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)
foo_foo %>%
hop(through = forest) %>%
scoop(up = field_mice) %>%
bop(on = head)
arrange
mdh_df %>%
arrange(tissue, iu_gfw)
# A tibble: 45 x 2
tissue iu_gfw
<chr> <dbl>
1 adductor 14.1
2 adductor 14.5
3 adductor 15.6
4 adductor 16.6
5 adductor 17.8
6 adductor 18.0
7 adductor 18.4
8 adductor 18.7
9 adductor 18.8
10 adductor 18.8
# … with 35 more rows
mdh_df %>%
arrange(desc(tissue), desc(iu_gfw))
# A tibble: 45 x 2
tissue iu_gfw
<chr> <dbl>
1 mantle 12.5
2 mantle 10.4
3 mantle 10.1
4 mantle 9.54
5 mantle 8.57
6 mantle 6.65
7 mantle 6.56
8 mantle 5.63
9 mantle 5.19
10 mantle 5.01
# … with 35 more rows
summarise
mdh_df %>%
summarise(iu_gfw_avg = mean(iu_gfw))
# A tibble: 1 x 1
iu_gfw_avg
<dbl>
1 12.2
group_by
mdh_df %>%
group_by(tissue) %>%
summarise(iu_gfw_avg = mean(iu_gfw),
iu_gfw_sd = sd(iu_gfw))
# A tibble: 3 x 3
tissue iu_gfw_avg iu_gfw_sd
<chr> <dbl> <dbl>
1 adductor 18.4 2.48
2 gill 12.1 1.52
3 mantle 6.15 3.61
Warning that you must be careful about the order when reusing variable names.
# Bad order
mdh_df %>%
group_by(tissue) %>%
summarise(iu_gfw = mean(iu_gfw),
sd = sd(iu_gfw))
# A tibble: 3 x 3
tissue iu_gfw sd
<chr> <dbl> <dbl>
1 adductor 18.4 NA
2 gill 12.1 NA
3 mantle 6.15 NA
# This order works because it collapses the data into a mean last
mdh_df %>%
group_by(tissue) %>%
summarise(sd = sd(iu_gfw),
iu_gfw = mean(iu_gfw))
# A tibble: 3 x 3
tissue sd iu_gfw
<chr> <dbl> <dbl>
1 adductor 2.48 18.4
2 gill 1.52 12.1
3 mantle 3.61 6.15
Print out a pretty table using kableExtra
.
mdh_df %>%
group_by(tissue) %>%
summarise(iu_gfw_avg = mean(iu_gfw),
iu_gfw_sd = sd(iu_gfw)) %>%
mutate(`I.U. / g f.w.` = paste(round(iu_gfw_avg, 2),
"±",
round(iu_gfw_sd, 2))) %>%
select(tissue, `I.U. / g f.w.`) %>%
rename("Tissue" = "tissue",
`MDH Activity (I.U. / g f.w.)` = `I.U. / g f.w.`) %>%
mutate(Tissue = str_to_sentence(Tissue)) %>%
kableExtra::kable()
Tissue | MDH Activity (I.U. / g f.w.) |
---|---|
Adductor | 18.39 ± 2.48 |
Gill | 12.09 ± 1.52 |
Mantle | 6.15 ± 3.61 |
nest
ChickWeight %>%
glimpse()
Rows: 578
Columns: 4
$ weight <dbl> 42, 51, 59, 64, 76, 93, 106, 125, 149, 171, 199, 205, 40, 49, …
$ Time <dbl> 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 21, 0, 2, 4, 6, 8, 10, …
$ Chick <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ Diet <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
ChickWeight %>%
group_by(Chick, Diet) %>%
nest()
# A tibble: 50 x 3
# Groups: Chick, Diet [50]
Chick Diet data
<ord> <fct> <list>
1 1 1 <tibble [12 × 2]>
2 2 1 <tibble [12 × 2]>
3 3 1 <tibble [12 × 2]>
4 4 1 <tibble [12 × 2]>
5 5 1 <tibble [12 × 2]>
6 6 1 <tibble [12 × 2]>
7 7 1 <tibble [12 × 2]>
8 8 1 <tibble [11 × 2]>
9 9 1 <tibble [12 × 2]>
10 10 1 <tibble [12 × 2]>
# … with 40 more rows
ChickWeight_nest <- ChickWeight %>%
group_by(Chick, Diet) %>%
nest()
ChickWeight_nest$data[1:2]
[[1]]
# A tibble: 12 x 2
weight Time
<dbl> <dbl>
1 42 0
2 51 2
3 59 4
4 64 6
5 76 8
6 93 10
7 106 12
8 125 14
9 149 16
10 171 18
11 199 20
12 205 21
[[2]]
# A tibble: 12 x 2
weight Time
<dbl> <dbl>
1 40 0
2 49 2
3 58 4
4 72 6
5 84 8
6 103 10
7 122 12
8 138 14
9 162 16
10 187 18
11 209 20
12 215 21
broom
CHeck out the broom
vignette.
[And the broom
and dplyr
vignette] (https://cran.r-project.org/web/packages/broom/vignettes/broom_and_dplyr.html).
tidy
: constructs a tibble that summarizes the model’s statistical findings. This includes coefficients and p-values for each term in a regression, per-cluster information in clustering applications, or per-test information for multtest functions.
glance
: construct a concise one-row summary of the model. This typically contains values such as R^2, adjusted R^2, and residual standard error that are computed once for the entire model.
ChickWeight %>%
group_by(Chick, Diet) %>%
nest() %>%
mutate(
fit = map(data, ~ lm(weight ~ Time, data = .x)),
tidied = map(fit, tidy),
glanced = map(fit, glance)
) %>%
unnest(tidied)
# A tibble: 100 x 10
# Groups: Chick, Diet [50]
Chick Diet data fit term estimate std.error statistic p.value glanced
<ord> <fct> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl> <list>
1 1 1 <tibb… <lm> (Int… 24.5 6.73 3.64 4.56e- 3 <tibbl…
2 1 1 <tibb… <lm> Time 7.99 0.524 15.3 2.97e- 8 <tibbl…
3 2 1 <tibb… <lm> (Int… 24.7 4.93 5.01 5.26e- 4 <tibbl…
4 2 1 <tibb… <lm> Time 8.72 0.384 22.7 6.15e-10 <tibbl…
5 3 1 <tibb… <lm> (Int… 23.2 5.08 4.56 1.04e- 3 <tibbl…
6 3 1 <tibb… <lm> Time 8.49 0.396 21.5 1.08e- 9 <tibbl…
7 4 1 <tibb… <lm> (Int… 32.9 4.01 8.21 9.42e- 6 <tibbl…
8 4 1 <tibb… <lm> Time 6.09 0.312 19.5 2.70e- 9 <tibbl…
9 5 1 <tibb… <lm> (Int… 16.9 7.56 2.24 4.93e- 2 <tibbl…
10 5 1 <tibb… <lm> Time 10.1 0.588 17.1 9.88e- 9 <tibbl…
# … with 90 more rows
ChickWeight %>%
ggplot() +
geom_line(aes(x = Time,
y = weight,
color = Chick)) +
facet_wrap(~ Diet)
Integrative example
# Import data -----
# ADH activity
adh_df <- read_csv("https://raw.githubusercontent.com/tsoleary/projects/master/ADH/data/biovision/adh_abs_11022020.csv")
# NADH std curve
nadh_df <- read_csv("https://raw.githubusercontent.com/tsoleary/projects/master/ADH/data/biovision/NADH_std_11022020.csv")
NADH Std Curve
nadh_df
# A tibble: 12 x 7
NADH temp `0_min` `30_min` `60_min` `90_min` `120_min`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 28 0.0481 0.0483 0.048 0.0482 0.048
2 2 28 0.167 0.168 0.168 0.168 0.167
3 4 28 0.299 0.305 0.303 0.302 0.301
4 6 28 0.441 0.447 0.446 0.445 0.444
5 8 28 0.569 0.581 0.578 0.577 0.575
6 10 28 0.688 0.711 0.706 0.705 0.703
7 0 28 0.0485 0.0483 0.0483 0.0484 0.0483
8 2 28 0.170 0.172 0.172 0.172 0.171
9 4 28 0.303 0.305 0.305 0.304 0.303
10 6 28 0.432 0.434 0.442 0.440 0.439
11 8 28 0.575 0.582 0.580 0.582 0.579
12 10 28 0.708 0.723 0.721 0.719 0.717
nadh_df <- nadh_df %>%
pivot_longer(cols = contains("min"),
names_to = "mins",
values_to = "abs") %>%
mutate(mins = as.numeric(str_remove_all(mins, "_min")))
nadh_df
# A tibble: 60 x 4
NADH temp mins abs
<dbl> <dbl> <dbl> <dbl>
1 0 28 0 0.0481
2 0 28 30 0.0483
3 0 28 60 0.048
4 0 28 90 0.0482
5 0 28 120 0.048
6 2 28 0 0.167
7 2 28 30 0.168
8 2 28 60 0.168
9 2 28 90 0.168
10 2 28 120 0.167
# … with 50 more rows
Standard curve linear regression
# Run linear model
std_lm <- lm(abs ~ NADH, nadh_df)
summary(std_lm)
Call:
lm(formula = abs ~ NADH, data = nadh_df)
Residuals:
Min 1Q Median 3Q Max
-0.0203362 -0.0049828 -0.0005845 0.0065411 0.0143638
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0412305 0.0015183 27.16 <2e-16 ***
NADH 0.0667506 0.0002507 266.22 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.006634 on 58 degrees of freedom
Multiple R-squared: 0.9992, Adjusted R-squared: 0.9992
F-statistic: 7.087e+04 on 1 and 58 DF, p-value: < 2.2e-16
# Save intercept & slope values
int <- as.numeric(std_lm$coefficients[1])
slope <- as.numeric(std_lm$coefficients[2])
# Save values in a named vector
std_curve_lm <- c(int = int,
slope = slope)
std_curve_lm
int slope
0.04123048 0.06675057
Absorbance data
# ADH activity absorbance data
adh_df
# A tibble: 32 x 10
EtOH temp sample_vol n_flies genotype `0_min` `30_min` `60_min` `90_min`
<dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 0. 28 25 1 adh-f 0.0555 0.109 0.180 0.282
2 6.25e0 28 25 1 adh-f 0.0603 0.143 0.246 0.369
3 1.25e1 28 25 1 adh-f 0.0659 0.195 0.347 0.517
4 2.50e1 28 25 1 adh-f 0.0699 0.211 0.360 0.526
5 5.00e1 28 25 1 adh-f 0.0732 0.242 0.415 0.595
6 1.00e2 28 25 1 adh-f 0.0798 0.307 0.536 0.773
7 2.00e2 28 25 1 adh-f 0.084 0.354 0.630 0.915
8 1.00e3 28 25 1 adh-f 0.0868 0.375 0.666 0.970
9 0. 28 25 1 adh-f 0.0528 0.0716 0.097 0.139
10 6.25e0 28 25 1 adh-f 0.0581 0.114 0.172 0.236
# … with 22 more rows, and 1 more variable: `120_min` <dbl>
Tidy the data
adh_df <- adh_df %>%
pivot_longer(cols = contains("min"),
names_to = "mins",
values_to = "abs") %>%
mutate(mins = as.numeric(str_remove_all(mins, "_min")))
adh_df
# A tibble: 160 x 7
EtOH temp sample_vol n_flies genotype mins abs
<dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 0 28 25 1 adh-f 0 0.0555
2 0 28 25 1 adh-f 30 0.109
3 0 28 25 1 adh-f 60 0.180
4 0 28 25 1 adh-f 90 0.282
5 0 28 25 1 adh-f 120 0.391
6 6.25 28 25 1 adh-f 0 0.0603
7 6.25 28 25 1 adh-f 30 0.143
8 6.25 28 25 1 adh-f 60 0.246
9 6.25 28 25 1 adh-f 90 0.369
10 6.25 28 25 1 adh-f 120 0.504
# … with 150 more rows
# Add temp data with some noise
x <- adh_df %>%
mutate(temp = 23,
abs = abs*rnorm(160, mean = 0.9, sd = 0.02))
adh_df <- bind_rows(adh_df, x)
Plot the standard curve
ggplot(data = nadh_df,
aes(x = NADH,
y = abs)) +
geom_smooth(method = 'lm',
formula = y ~ x,
se = FALSE) +
geom_point() +
ggpmisc::stat_poly_eq(formula = y ~ x,
aes(label = paste(..eq.label.., ..rr.label..,
sep = "*\"; \"*")),
parse = TRUE,
rr.digits = 5,
label.x = 0.85,
label.y = "top") +
theme_classic() +
labs(y = "Absorbance @ 450 nm",
x = "NADH (nmol)")
Calculate enzyme activity
adh_df <- adh_df %>%
group_by(temp, EtOH) %>%
mutate(nmol_NADH = (abs - std_curve_lm["int"]) / std_curve_lm["slope"],
deltaAbs = abs - abs[mins == 0],
deltaNADH = nmol_NADH - nmol_NADH[mins == 0],
deltaTime = mins - 0) %>%
filter(deltaTime != 0) %>%
mutate(dilution_factor = 150/sample_vol,
mU_mL = (deltaNADH/(deltaTime*(sample_vol/1000)))*dilution_factor) %>%
ungroup(EtOH) %>%
mutate(mU_mL = mU_mL - mU_mL[EtOH == 0])
Calculating Km
adh_df %>%
filter(mins <= 60) %>%
group_by(temp, EtOH) %>%
nest() %>%
mutate(
fit = map(data, ~ lm(abs ~ mins, data = .x)),
tidied = map(fit, tidy),
glanced = map(fit, glance))
# A tibble: 16 x 6
# Groups: EtOH, temp [16]
EtOH temp data fit tidied glanced
<dbl> <dbl> <list> <list> <list> <list>
1 0 28 <tibble [8 × 11]> <lm> <tibble [2 × 5]> <tibble [1 × 12]>
2 6.25 28 <tibble [8 × 11]> <lm> <tibble [2 × 5]> <tibble [1 × 12]>
3 12.5 28 <tibble [8 × 11]> <lm> <tibble [2 × 5]> <tibble [1 × 12]>
4 25 28 <tibble [8 × 11]> <lm> <tibble [2 × 5]> <tibble [1 × 12]>
5 50 28 <tibble [8 × 11]> <lm> <tibble [2 × 5]> <tibble [1 × 12]>
6 100 28 <tibble [8 × 11]> <lm> <tibble [2 × 5]> <tibble [1 × 12]>
7 200 28 <tibble [8 × 11]> <lm> <tibble [2 × 5]> <tibble [1 × 12]>
8 1000 28 <tibble [8 × 11]> <lm> <tibble [2 × 5]> <tibble [1 × 12]>
9 0 23 <tibble [8 × 11]> <lm> <tibble [2 × 5]> <tibble [1 × 12]>
10 6.25 23 <tibble [8 × 11]> <lm> <tibble [2 × 5]> <tibble [1 × 12]>
11 12.5 23 <tibble [8 × 11]> <lm> <tibble [2 × 5]> <tibble [1 × 12]>
12 25 23 <tibble [8 × 11]> <lm> <tibble [2 × 5]> <tibble [1 × 12]>
13 50 23 <tibble [8 × 11]> <lm> <tibble [2 × 5]> <tibble [1 × 12]>
14 100 23 <tibble [8 × 11]> <lm> <tibble [2 × 5]> <tibble [1 × 12]>
15 200 23 <tibble [8 × 11]> <lm> <tibble [2 × 5]> <tibble [1 × 12]>
16 1000 23 <tibble [8 × 11]> <lm> <tibble [2 × 5]> <tibble [1 × 12]>
adh_velo <- adh_df %>%
filter(mins <= 60) %>%
group_by(temp, EtOH) %>%
nest() %>%
mutate(
fit = map(data, ~ lm(abs ~ mins, data = .x)),
tidied = map(fit, tidy),
glanced = map(fit, glance)
) %>%
unnest(tidied) %>%
filter(term == "mins") %>%
select(temp, EtOH, estimate) %>%
rename(velocity = estimate) %>%
filter(EtOH != 0) %>%
arrange(temp, EtOH)
adh_mm <- adh_velo %>%
ungroup(EtOH) %>%
filter(EtOH != 0) %>%
group_by(temp) %>%
nest() %>%
mutate(
fit = map(data, ~ nls(formula =
velocity ~ SSmicmen(EtOH, Vmax, Km), data = .),
data = .x),
tidied = map(fit, tidy)
) %>%
unnest(tidied)
adh_mm
# A tibble: 4 x 8
# Groups: temp [2]
temp data fit term estimate std.error statistic p.value
<dbl> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
1 23 <tibble [7 × 2]> <nls> Vmax 0.00519 0.000364 14.3 0.0000306
2 23 <tibble [7 × 2]> <nls> Km 30.2 7.74 3.91 0.0113
3 28 <tibble [7 × 2]> <nls> Vmax 0.00583 0.000291 20.1 0.00000569
4 28 <tibble [7 × 2]> <nls> Km 29.5 5.39 5.47 0.00278
Plot
adh_velo %>%
mutate(temp = as.factor(temp)) %>%
group_by(temp) %>%
ggplot(aes(x = EtOH,
y = velocity)) +
geom_smooth(method = "nls",
aes(group = temp),
formula = y ~ SSmicmen(x, Vmax, Km),
color = "grey50",
data = adh_velo,
se = FALSE,
show.legend = FALSE) +
geom_point(aes(color = temp)) +
theme_classic()
Plotting data
ggplot2
Check out the ggplot2
book.
ggplot2
creates things in a specific order of layers.
glimpse(starwars)
Rows: 87
Columns: 14
$ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia O…
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, …
$ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", …
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", …
$ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue"…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0,…
$ sex <chr> "male", "none", "none", "male", "female", "male", "female"…
$ gender <chr> "masculine", "masculine", "masculine", "masculine", "femin…
$ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "…
$ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Hum…
$ films <list> [<"The Empire Strikes Back", "Revenge of the Sith", "Retu…
$ vehicles <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "I…
$ starships <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1…
starwars %>%
ggplot(aes(x = sex, y = height)) +
geom_point() +
geom_boxplot() +
theme_classic()
starwars %>%
ggplot(aes(x = sex, y = height)) +
geom_boxplot() +
geom_point() +
theme_classic()
starwars %>%
ggplot(aes(x = sex, y = height)) +
geom_boxplot() +
geom_jitter(width = 0.2) +
theme_classic()
starwars %>%
ggplot(aes(x = sex, y = height)) +
geom_hline(aes(yintercept = 188),
linetype = 2,
color = "grey20") +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.1,
size = 2,
alpha = 0.8,
shape = 21,
color = "grey20",
fill = "grey50") +
scale_y_continuous(breaks = c(100, 150, 188, 200, 250),
labels = c("100", "150", "Thomas", "200", "250"),
name = "Height") +
theme_classic()
starwars %>%
filter(sex %in% c("male", "female")) %>%
ggplot(aes(x = sex, y = height)) +
geom_violin(fill = "azure2",
color = "grey20") +
geom_boxplot(fill = "azure2",
width = 0.2,
color = "grey20") +
theme_classic()
starwars %>%
filter(birth_year < 200) %>%
ggplot() +
geom_histogram(aes(x = birth_year),
binwidth = 10)
starwars %>%
filter(birth_year < 200) %>%
ggplot() +
geom_histogram(aes(x = birth_year),
color = "grey20",
fill = "grey50",
binwidth = 10) +
theme_classic()
starwars %>%
filter(birth_year < 200) %>%
ggplot() +
geom_histogram(aes(x = birth_year),
color = "grey20",
fill = "grey50",
binwidth = 10) +
geom_density(aes(x = birth_year,
y = 10 * ..count..),
color = "grey20") +
ylab("count") +
theme_classic()
starwars %>%
filter(mass < 1000) %>%
mutate(species_h = ifelse(species == "Human", "Human", "Other")) %>%
ggplot() +
geom_point(aes(x = height,
y = mass,
fill = species_h),
size = 5,
alpha = 0.8,
color = "grey50",
shape = 21) +
theme_classic()
starwars %>%
filter(mass < 1000) %>%
mutate(species_h = ifelse(species == "Human", "Human", "Other")) %>%
arrange(desc(species_h)) %>%
ggplot() +
geom_point(aes(x = height,
y = mass,
fill = species_h),
size = 5,
alpha = 0.8,
color = "grey50",
shape = 21) +
theme_classic()
starwars_h <- starwars %>%
filter(mass < 1000) %>%
mutate(species_h = ifelse(species == "Human", "Human", "Other")) %>%
arrange(desc(species_h))
p <- ggplot() +
geom_point(data = starwars_h %>%
filter(species_h != "Human"),
aes(x = height,
y = mass,
fill = species_h),
size = 1,
alpha = 0.8,
color = "grey50",
shape = 21) +
geom_point(data = starwars_h %>%
filter(species_h == "Human"),
aes(x = height,
y = mass,
fill = species_h),
size = 3,
alpha = 0.8,
color = "grey50",
shape = 21) +
scale_fill_manual(values = c("#d66666", "#222222")) +
theme_classic()
p
plotly::ggplotly(p, text = "text")
friends <- c("Luke Skywalker",
"Leia Organa",
"Han Solo",
"Chewbacca",
"C-3PO",
"R2-D2",
"Obi-Wan Kenobi")
ggplot() +
geom_point(data = starwars_h,
aes(x = height,
y = mass,
fill = species_h),
size = 5,
alpha = 0.8,
color = "grey50",
shape = 21) +
ggrepel::geom_label_repel(data = starwars_h %>%
filter(name %in% friends),
aes(x = height,
y = mass,
label = name),
min.segment.length = 0.01) +
theme_classic()
These are some examples of plots that I have made in R.
Creating functions
Functions are extremely useful when you are doing repetive tasks. And greating
Using snippets can help
For example when I type mfun_snip
and hit tab
, the following code prints out on my script.
# ------------------------------------------------------------------------------
# Function: function_name
# Description: description
# Inputs: input_description
# Outputs: output_description
function_name <- function(args) {
# Function body
print("Hello world!")
}
# End function -----------------------------------------------------------------
Which is made with the following code inserted into the snippets file. This webpage walks through the process of adding snippets.
snippet mfun_snip
# ------------------------------------------------------------------------------
# Function: ${1:function_name}
# Description: ${2:description}
# Inputs: ${3:input_description}
# Outputs: ${4:output_description}
${1:function_name} <- function(args) {
# Function body
print("Hello world!")
}
# End function -----------------------------------------------------------------
Examples of functions
# ------------------------------------------------------------------------------
# Function: read_tidy_spec_csv
# Description: uses read_csv and tidys the data
# Inputs: .csv file
# Outputs: data.frame
require(tidyverse)
read_tidy_spec_csv <- function(file) {
read_csv(file) %>%
pivot_longer(-c("cycle", "time", "temp"),
names_to = "conc_well",
values_to = "abs") %>%
mutate(conc = as.numeric(str_replace_all(conc_well, "_.", ""))) %>%
mutate(file = str_remove(file, ".csv")) %>%
separate(file, into = c("enzyme", "temp_group", "date"), sep = "_")
}
# End function -----------------------------------------------------------------
# ------------------------------------------------------------------------------
# Function: calc_Km_Vm
# Description: Calculate Vmax and Km for Michaelis–Menten curve
# Inputs: tibble with velocity estimates for each conc and temp_group
# Outputs: tibble with estimates for Vmax and Km
require(tidyverse)
require(broom)
calc_Km_Vm <- function(data) {
data %>%
filter(conc != 0 & velocity > 0) %>%
group_by(enzyme, temp_group) %>%
nest() %>%
mutate(
fit = map(data, ~ nls(formula =
velocity ~ SSmicmen(conc, Vmax, Km), data = .),
data = .x),
tidied = map(fit, tidy)
) %>%
unnest(tidied) %>%
mutate(r.squared = map(fit, Rsq)) %>%
unnest(r.squared)
}
# End function -----------------------------------------------------------------
# ------------------------------------------------------------------------------
# Function: v_test
# Description: calculate v from Hunter B. Fraser PNAS 2020 paper
# Inputs:
# var_par - variance of parents
# var_p1 - variance of parent 1
# var_p2 - variance of parent 2
# n_p1 - number of parent 1 individuals
# n_p2 - number of parent 2 individuals
# var_f2 - variance of f2 hybrids
# H2 - broad sense heritibility
# c
# Outputs: v-test value
v_test <- function(var_par, var_p1, var_p2, n_p1, n_p2, var_f2, H2, c) {
# Calculate the value of v
(var_par - var_p1/(4 * n_p1) - var_p2/(4 * n_p2)) / (var_f2 * H2 * c)
}
# End function -----------------------------------------------------------------
My favorite part of the tidyverse is the final principle, which is: 4. Design for humans. “Programs must be written for people to read, and only incidentally for machines to execute.” — Hal Abelson↩
Honestly I think that joke underestimates how important good coding style is. You can actually read “butitsuremakesthingseasiertoread” pretty easily because you are an expert reader – you’ve been at it everyday for decades – coding, probably not so much. I don’t think can overstate how important I think it is to write visually pleasing code.↩
Warning: I have downloaded these cheat sheets and saved them for my quick access, but they may not be the most current version of the cheat sheet.↩
Ever had a column that is actually a integer import into R as a character?
readr
can easily define column types as you import, among other things.↩