1 UNIX tutorial

1.1 What is the command-line?

The command line (aka terminal or shell) is a way of interacting with your local computer or a remote server by typing commands or scripts, without using a graphical user interface (GUI).

1.2 Why do I want to be doing this?

  • It’s quite easy to copy, move, edit, and search within thousands of files in multiple directories with some simple command-line code.
  • It would take forever to do this by dragging/dropping with a mouse.
  • Allows you to work with very large data files without uncompressing them fully, or loading the entire file’s contents into memory

1.3 Connect to the server

  • The first step is to open a terminal shell on your local computer. For MacOS users, this is called “Terminal”.

  • We’ll connect to our remote server running Unix using the secure shell (ssh) protocol. Our server’s name is pbio381 and we can connect to it using our UVM netid username and password (as long as we’re on-campus)

ip0af52fbf:papers tsoleary$ ssh tsoleary@pbio381.uvm.edu
tsoleary@pbio381.uvm.edu's password: 
Last login: Tue Jan 31 10:51:15 2017 from ip040027.uvm.edu
[tsoleary@pbio381 ~]$ 
  • The log-in screen tells us some basic info on when we last logged in, and then gives us our current location in the filesystem (~) followed by the $ prompt, that tells us the computer is ready for our next command.

    • NOTE: The tilde (~) is short-hand for your home directory in UNIX. This is your own personal little corner of the computer’s hard drive space, and is the location that you should use to create folders and input/output/results files that are specific to your own work. No one has access to any files stored in your home directory but you.
  • To see the full path to your current directory, use the pwd command:

[tsoleary@pbio381 ~]$ pwd
/users/s/r/tsoleary
[tsoleary@pbio381 ~]$ 
  • The path shows the full directory address up from the “root” of the file structure, which is the most basal level (appealing to all you phylogeneticists here…). The root is symbolized as “/” and each subdirectory is separated by an additional “/”. So, the full path to my working directory on the server is /users/s/r/tsoleary/

  • Let’s make a new folder (aka, directory) using the mkdir command. Let’s name this folder “mydata”

[tsoleary@pbio381 ~]$ mkdir mydata
  • We can then use the ll command to show the current contents of any folders and files in our current location:
[tsoleary@pbio381 ~]$ ll
total 0
drwxr-xr-x. 6 tsoleary users 82 Jan 31 17:21 archive
drwxr-xr-x. 2 tsoleary users  6 Jan 31 17:21 mydata
drwxr-xr-x. 2 tsoleary users  6 Jan 31 17:15 scripts
[tsoleary@pbio381 ~]$ 
  • You’ll notice that I’ve got some extra folders in my output from previous work, whereas you will probably only see the “mydata” folder you just made.
  • NOTE: Each row shows a file or a folder (in this case, these are all folders), displaying (from right to left) its name, when it was last edited, its size, who it belongs to, and who has permission to read (r), write (w), and execute (x) it. More on permissions later…
  • Try making your own folder named “scripts” and then use the ll command to list the folders again
  • We can change our current location within the directory structure using the cd command. Let’s use cd to move inside the mydata/ directory and ll to list its contents:
[tsoleary@pbio381 ~]$ cd mydata/
[tsoleary@pbio381 mydata]$ ll
total 0
[tsoleary@pbio381 mydata]$ 
  • Hah — nothing in there yet! Let’s go get some data!
    • We’ve placed the text file containing all the metadata information on the seastar sampling under a shared space on the server. The path to this shared space is:
      • /data/. Try using cd to navigate over to this location, then use ll to show its contents. You should see something like this:
drwxr-xr-x.  5 root root       73 Jan 31 17:35 archive
drwxrwxr-x.  2 root pb381adm   40 Nov 30  2015 packages
drwxrwxr-x. 33 root pb381adm 4096 Nov 30  2015 popgen
drwxrwxr-x.  3 root pb381adm   42 Jan 30 09:08 project_data
drwxrwxr-x.  2 root pb381adm    6 Oct  2  2015 scripts
drwxr-xr-x. 18 root root     4096 Sep  2  2015 users
[tsoleary@pbio381 data]$ 
  • Now, cd into the folder called “project_data” and ll. Do you see this?
[tsoleary@pbio381 data]$ cd project_data/
[tsoleary@pbio381 project_data]$ ll
total 8
drwxr-xr-x. 12 tsoleary users 4096 Jan 30 09:06 archive
-rw-r--r--.  1 tsoleary users 1255 Jan 30 09:08 ssw_samples.txt
[tsoleary@pbio381 project_data]$ 
  • The file called ssw_samples.txt is the one with the seastar metadata. We don’t want to open and make changes to this file in the shared space, because we don’t want to have our edits affect the rest of the group. So, let’s first make a copy of this file over to our home directory and put it inside the “mydata” folder. Use the cp command, followed by the filename, and the path to your destination (remember the ~ signals your home directory, and each subdirectory is then separated by a /):
[tsoleary@pbio381 project_data]$ cp ssw_samples.txt ~/mydata/
  • cd back to your ~/mydata/ directory and look inside. You should see your file…
[tsoleary@pbio381 project_data]$ cd ~/mydata/
[tsoleary@pbio381 mydata]$ ll
total 4
-rw-r--r--. 1 tsoleary users 1255 Jan 31 17:42 ssw_samples.txt
[tsoleary@pbio381 mydata]$ 
  • Let’s take a peek at this file with the head command, which prints the first 10 lines to screen.
[tsoleary@pbio381 mydata]$ head ssw_samples.txt 
Individual  Trajectory  Location    Day3    Day6    Day9    Day12   Day15
10  HH  INT 10_5-08_H   10_5-11_H   10_5-14_H   10_5-17_H   10_5-20_H
24  HH  INT 24_5-08_H   24_5-11_H   24_5-14_H   24_5-17_H   24_5-20_H
27  HH  INT 27_5-08_H   27_5-11_H   27_5-14_H   27_5-17_H   27_5-20_H
08  HS  INT 08_5-08_H   08_5-11_S   08_5-14_S   08_5-17_S   08_5-20_S
09  HS  INT 09_5-08_H       09_5-14_S   09_5-17_S   09_5-20_S
15  HS  INT 15_5-08_H   15_5-11_H   15_5-14_H   15_5-17_S   15_5-20_S
19  HS  INT     19_5-11_H   19_5-14_H   19_5-17_H   19_5-20_S
20  HS  INT 20_5-08_H   20_5-11_H   20_5-14_H   20_5-17_H   20_5-20_S
03  SS  INT 03_5-08_S   03_5-11_S
  • The tail command provides similar functionality, but prints just the last lines in the file. These features may not seem like a big deal right now, but when you’re dealing with files that are 20 Gb compressed, and feature hundreds of millions of lines of data, you and your computer will be happy to have tools to peek inside without having to open the whole file!
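For example, to peek at just the last 5 lines of the metadata file:

[tsoleary@pbio381 mydata]$ tail -n 5 ssw_samples.txt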
  • What if we want to extract just the rows of data that correspond to Healthy (HH) individuals? We can use the search tool grep to search for a target query. Any line matching our search string will be printed to screen.
[tsoleary@pbio381 mydata]$ grep 'HH' ssw_samples.txt 
10  HH  INT 10_5-08_H   10_5-11_H   10_5-14_H   10_5-17_H   10_5-20_H
24  HH  INT 24_5-08_H   24_5-11_H   24_5-14_H   24_5-17_H   24_5-20_H
27  HH  INT 27_5-08_H   27_5-11_H   27_5-14_H   27_5-17_H   27_5-20_H
31  HH  SUB 31_6-12_H   31_6-15_H   31_6-18_H   31_6-21_H   31_6-24_H
32  HH  SUB 32_6-12_H   32_6-15_H   32_6-18_H   32_6-21_H   
33  HH  SUB 33_6-12_H   33_6-15_H   33_6-18_H   33_6-21_H   33_6-24_H
34  HH  SUB 34_6-12_H   34_6-15_H   34_6-18_H   34_6-21_H   34_6-24_H
35  HH  SUB 35_6-12_H   35_6-15_H   35_6-18_H   35_6-21_H   
[tsoleary@pbio381 mydata]$
  • What if instead of printing it to screen, we want to save the output of our search to a new file? This is easy, just use the “>” symbol to redirect the results of any command to an output file with your choice of name.
[tsoleary@pbio381 mydata]$ grep 'HH' ssw_samples.txt >ssw_HHonly.txt
[tsoleary@pbio381 mydata]$ ll
total 8
-rw-r--r--. 1 tsoleary users  462 Jan 31 20:46 ssw_HHonly.txt
-rwxrwxr-x. 1 tsoleary users 1255 Jan 31 17:42 ssw_samples.txt
[tsoleary@pbio381 mydata]$ 
  • We can do the same routine for the “SS” samples. Here’s a trick: when you’re doing a task similar to a previous command, hit the up arrow on your keyboard at the $ prompt, and it will recall the last command you issued. Then you just have to switch the HH’s for SS’s.
[tsoleary@pbio381 mydata]$ grep 'SS' ssw_samples.txt >ssw_SSonly.txt
[tsoleary@pbio381 mydata]$ ll
total 12
-rw-r--r--. 1 tsoleary users  462 Jan 31 20:46 ssw_HHonly.txt
-rwxrwxr-x. 1 tsoleary users 1255 Jan 31 17:42 ssw_samples.txt
-rw-r--r--. 1 tsoleary users  342 Jan 31 20:48 ssw_SSonly.txt
[tsoleary@pbio381 mydata]$ 
  • grep is a useful search tool and has many additional features for sorting and output of the results. These kinds of search patterns are called “regular expressions”, or “regexp”, and are one of the most powerful tools for working with large text files. If you want to learn more about grep and its regexp capabilities, you can look at the man page or manual. In fact, every UNIX command-line program has a built-in man page that you can call up to help you. Just type man and then the program name and it will give you the manual (small excerpt shown below).
[tsoleary@pbio381 mydata]$ man grep


GREP(1)                            General Commands Manual                           GREP(1)

NAME
       grep, egrep, fgrep - print lines matching a pattern

SYNOPSIS
       grep [OPTIONS] PATTERN [FILE...]
       grep [OPTIONS] [-e PATTERN | -f FILE] [FILE...]

DESCRIPTION
       grep searches the named input FILEs (or standard input if no files are named, or if a
       single hyphen-minus (-) is given as file name) for lines containing a  match  to  the
       given PATTERN.  By default, grep prints the matching lines.

       In  addition,  two variant programs egrep and fgrep are available.  egrep is the same
       as grep -E.  fgrep is the same as grep -F.  Direct  invocation  as  either  egrep  or
       fgrep  is  deprecated,  but is provided to allow historical applications that rely on
       them to run unmodified.

OPTIONS
   Generic Program Information
       --help Print a usage message briefly summarizing these command-line options  and  the
              bug-reporting address, then exit.

       -V, --version
              Print  the version number of grep to the standard output stream.  This version
              number should be included in all bug reports (see below).

   Matcher Selection
       -E, --extended-regexp
              Interpret PATTERN as an extended regular expression (ERE, see below).  (-E  is
              specified by POSIX.)

       -F, --fixed-strings, --fixed-regexp
              Interpret  PATTERN  as  a list of fixed strings, separated by newlines, any of
              which is to be matched.  (-F is  specified  by  POSIX,  --fixed-regexp  is  an
              obsoleted alias, please do not use it in new scripts.)

       -G, --basic-regexp
              Interpret PATTERN as a basic regular expression (BRE, see below).  This is the
              default.

       -P, --perl-regexp
              Interpret PATTERN as a Perl regular expression.  This is  highly  experimental
              and grep -P may warn of unimplemented features.
  • One of the most useful aspects of UNIX is the ability to take the output from one command and use it as standard input (termed ‘stdin’) into another command without having to store the intermediate files. Such a workflow is called “piping”, and makes use of the pipe character (|) located above the return key to feed data between programs.
    • Example: Say we wanted to know how many samples come from the Intertidal. We can use grep to do the search, and pipe the results to the command wc which will tally up the number of lines, words, and characters in the file…voila!
[tsoleary@pbio381 mydata]$ grep 'INT' ssw_samples.txt | wc
     16     106     762
[tsoleary@pbio381 mydata]$ 
  • Looks like 16 INT samples in the original data. See how quick it was to get a line count on this match, without actually opening a file or printing/saving the outputs?
  • Now, what if we want to move the files we created with just individuals of a particular disease status? There’s a way to do this quickly using the wildcard character *. With the wildcard, the * takes the place of any character, and in fact any number of characters.

  • Make a new directory called sample_by_disease/ inside the mydata/ folder
  • Move all files that contain the word “only” into the new directory using the mv command.

[tsoleary@pbio381 mydata]$ mkdir sample_by_disease/
[tsoleary@pbio381 mydata]$ ll
total 12
drwxr-xr-x. 2 tsoleary users   10 Jan 31 21:12 sample_by_disease
-rw-r--r--. 1 tsoleary users  462 Jan 31 20:46 ssw_HHonly.txt
-rwxrwxr-x. 1 tsoleary users 1255 Jan 31 17:42 ssw_samples.txt
-rw-r--r--. 1 tsoleary users  342 Jan 31 20:48 ssw_SSonly.txt
[tsoleary@pbio381 mydata]$ mv *only* sample_by_disease/
[tsoleary@pbio381 mydata]$ ll
total 4
drwxr-xr-x. 2 tsoleary users   60 Jan 31 21:12 sample_by_disease
-rwxrwxr-x. 1 tsoleary users 1255 Jan 31 17:42 ssw_samples.txt
[tsoleary@pbio381 mydata]$ cd sample_by_disease/
[tsoleary@pbio381 sample_by_disease]$ ll
total 8
-rw-r--r--. 1 tsoleary users 462 Jan 31 20:46 ssw_HHonly.txt
-rw-r--r--. 1 tsoleary users 342 Jan 31 20:48 ssw_SSonly.txt
[tsoleary@pbio381 sample_by_disease]$ 
  • OK, what about when we have files we don’t want anymore? How do we clean up our workspace? You can remove files and folders with the rm command. However, in its default mode, UNIX will not ask if you really mean it before getting rid of it forever(!), so this can be dangerous if you’re not paying attention.
  • As an example, let’s use our grep command to pull out the seastar samples that started healthy and then became sick. But perhaps we later decide we’re not going to work with those samples, so we use rm to delete that file:
[tsoleary@pbio381 mydata]$ ll
total 8
drwxr-xr-x. 2 tsoleary users   60 Jan 31 21:12 sample_by_disease
-rw-r--r--. 1 tsoleary users  282 Feb  1 05:35 ssw_HSonly.txt
-rwxrwxr-x. 1 tsoleary users 1255 Jan 31 17:42 ssw_samples.txt
[tsoleary@pbio381 mydata]$ rm ssw_HSonly.txt 
[tsoleary@pbio381 mydata]$ ll
total 4
drwxr-xr-x. 2 tsoleary users   60 Jan 31 21:12 sample_by_disease
-rwxrwxr-x. 1 tsoleary users 1255 Jan 31 17:42 ssw_samples.txt
[tsoleary@pbio381 mydata]$
  • Gone! Forever! If that worries you, you can change your personal settings so that the server asks you to confirm deletion before it acts. To do this, we’ll need to follow a couple of new steps:

1.4 Confirm deletion after rm

  1. cd to your home directory (~/)
  2. list all the files, including “hidden” ones that aren’t usually shown. To do this, use ll -a.
  3. Look for a file called “.bashrc” — this contains your settings for how you interact with the server when you log in.
  4. We’re going to open this file and edit it to add a setting to request that rm confirms deletion with us. To edit text files on the fly in UNIX, you can use the built-in text editor, “vim”: vim .bashrc
  5. You should see something that looks like this:
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi
# Uncomment the following line if you don't like systemctl's auto-paging feature:
# export SYSTEMD_PAGER=
# User specific aliases and functions
  6. Use your arrow keys to move your cursor down to the last line, below “# User specific aliases and functions” — this is where we’re going to insert our new function.

  7. By default, vim is in command (normal) mode when it opens files. To go into edit mode, press your “i” key (for “insert”). You are now able to make changes to the file.

  8. Add the following text on a new line directly below the “# User specific…” line:

    alias rm='rm -i'

  9. Your file should now look like this:

# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi
# Uncomment the following line if you don't like systemctl's auto-paging feature:
# export SYSTEMD_PAGER=
# User specific aliases and functions

alias rm='rm -i'
  10. You’re now ready to get out of edit mode (hit the escape key), save your changes (type :w), and exit vim (type :q).

  11. These changes won’t take effect until you log out (type exit to log out of the server). But from now on, every time you log in, the server will remember that you want a reminder before deleting any of your work.
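If you’d rather not log out and back in, you can also apply the new setting to your current session by re-reading your .bashrc:

[tsoleary@pbio381 ~]$ source ~/.bashrc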

1.5 Review what we’ve learned so far

  • Logging in to the server: ssh netid@pbio381.uvm.edu
  • Finding what directory you’re in: pwd
  • Listing files in your current directory, or changing to a new directory: ll, cd
  • Making a new folder: mkdir foldername
  • Location of shared space, data, and programs on our class server:
[tsoleary@pbio381 ~]$ cd /data/
[tsoleary@pbio381 data]$ ll
total 8
drwxr-xr-x.  5 root root       73 Jan 31 17:35 archive
drwxrwxr-x.  2 root pb381adm   40 Nov 30  2015 packages
drwxrwxr-x. 33 root pb381adm 4096 Nov 30  2015 popgen
drwxrwxr-x.  3 root pb381adm   42 Jan 30 09:08 project_data
drwxrwxr-x.  2 root pb381adm    6 Oct  2  2015 scripts
drwxr-xr-x. 18 root root     4096 Sep  2  2015 users
[tsoleary@pbio381 data]$ 
  • Copying or moving files from one location to another: cp filename destinationpath/ or mv filename destinationpath/
  • Peeking into the first or last few lines of a file: head filename, tail filename
  • Searching within a file for a match: grep 'search string' filename
  • Outputing the results of a command to a new file: grep 'search string' filename >outputfilename
  • Using wildcards to work on multiple files at the same time: mv *.txt ~/newfolder
  • Using the “pipe” to send the output of one command to the input of another: grep 'INT' filename | wc
  • Removing files or folders: rm
  • Editing text files on the server: vim filename

2 FastQC

2.1 Learning Objectives

  • To get background on the ecology of Red spruce (Picea rubens), and the experimental design of the exome capture data
  • To understand the general work flow or “pipeline” for processing and analyzing the exome capture sequence data
  • To visualize and interpret Illumina data quality (what is a fastq file; what are Phred scores?).
  • To learn how to make/write a bash script, and how to use bash commands to process files in batches
  • To trim the reads based on base quality scores
  • To start mapping (a.k.a. aligning) each set of cleaned reads to a reference genome

2.2 Red spruce, Picea rubens

  • Red spruce is a coniferous tree that plays a prominent role in montane communities throughout the Appalachians.

  • Thrives in the cool, moist climates of the high elevation mountains of the Appalachians and northward along the coastal areas of Atlantic Canada.

One region where populations are particularly vulnerable to climate change is the low-latitude trailing edge of the range, from Maryland to Tennessee, where:

  • populations are highly fragmented and isolated on mountaintops.
  • these “island” populations are remnants of spruce forests that covered the southern U.S. when glaciers extended as far south as Long Island, NY.
  • as the climate warmed at the end of the Pleistocene (~20K years ago), red spruce retreated upward in elevation to these mountaintop refugia, where they are now highly isolated from other such stands and from the core of the range further north.

With funding from the National Science Foundation, the Keller Lab is studying the genetic basis of climate adaptation across the distribution of P. rubens.

Our main goals are to:

  1. characterize the genetic diversity and population structure across the range

  2. identify regions of the genome that show evidence of positive selection in response to climate gradients

  3. map the genetic basis of climate adaptive phenotypes

We hope to use this information to inform areas of the range most likely to experience climate maladaptation, and to help guide mitigation strategies.

2.3 Experimental Design

In 2017, we collected seeds and needle tissue from 340 mother trees at 65 populations spread throughout the range.

  • Whole genomic DNA was extracted from needles to use for exome capture sequencing.
  • Sample size in the edge region = 110 mother trees from 23 populations.
  • Exome capture was designed based on transcriptomes from multiple tissues and developmental stages in the related species, white spruce (P. glauca). Bait design used 2 transcriptomes previously assembled by Rigault et al. (2011) and Yeaman et al. (2014).
  • A total of 80,000 120-bp probes were designed, including 75,732 probes within or overlapping exomic regions, and an additional 4,268 probes in intergenic regions.
  • Each probe was required to represent a single blast hit to the P. glauca reference genome of at least 90 bp and 85% identity, covering 38,570 unigenes.
  • Libraries were made by random mechanical shearing of DNA (250 ng - 1 ug) to an average size of 400 bp, followed by an end-repair reaction, ligation of an adenine residue to the 3’-end of the blunt-end fragments to allow the ligation of barcoded adapters, and PCR amplification of the library. SureSelect probes (Agilent Technologies: Santa Clara, CA) were used for solution-based targeted enrichment of pools of 16 libraries, following the SureSelectxt Target Enrichment System for Illumina Paired-End Multiplexed Sequencing Library protocol.
  • Libraries were sequenced on a single run of an Illumina HiSeq X to generate paired-end 150-bp reads.

2.4 The pipeline

  • Visualize, Clean, Visualize
    • Visualize the quality of raw data (Program: FastQC)
    • Clean raw data (Program: Trimmomatic)
    • Visualize the quality of cleaned data (Program: FastQC)
  • Calculate #’s of cleaned, high quality reads going into mapping
  • Map (a.k.a. Align) cleaned reads from each sample to the reference assembly to generate sequence alignment files (Program: bwa, Input: .fastq, Output: .sam).
  • Remove PCR duplicates identified during mapping, and calculate alignment statistics (% of reads mapping successfully, mapping quality scores, average depth of coverage per individual)
  • We’ll then use the results of our mapping next week to start estimating diversity and population structure.
  • Visualize, Clean, and Visualize again

Whenever you get a new batch of NGS data, the first step is to look at the quality of the data coming off the sequencer and see if we notice any problems with base quality, sequence length, PCR duplicates, or adapter contamination.

2.4.1 .fastq files

A fastq file is the standard sequence data format for NGS. It contains the sequence of the read itself, the corresponding quality scores for each base, and some meta-data about the read.

The files are big (typically many Gb compressed), so we can’t open them completely. Instead, we can peek inside the file using head. But since these files are compressed (note the .gz ending in the filenames), we want them to stay compressed while we peek. UNIX has a utility for that called zcat. This lets us look at the .gz file without uncompressing it all the way.

The fastq files are in this path: /data/project_data/RS_ExomeSeq/fastq/edge_fastq

cd /data/project_data/RS_ExomeSeq/fastq/edge_fastq
zcat AB_05_R1_fastq.gz | head -n 4

@GWNJ-0842:368:GW1809211440:2:1101:17168:1907 1:N:0:NGAAGAGA+NTTCGCCT
GATGGGATTAGAGCCCCTGAAGGCTGATAGAACTTGAGTTTCACAGGCTCATTGCATTGAAGTGGCATTTGTGTGAATGCAGAGGAGGTACATAGGTCCTCGAGAATAAAAGAGATGTTGCTCCTCACCAAAATCAGTACAGATTATTTT
+
A<A-F<AFJFJFJA7FJJJJFFJJJJJJ<AJ-FJJ7-A-FJAJJ-JJJA7A7AFJ<FF--<FF7-AJJFJFJA-<A-FAJ<AJJ<JJF--<A-7F-777-FA77---7AJ-JF-FJF-A--AJF-7FJFF77F-A--7<-F--77<JFF<

Note: zcat lets us open a .gz (gzipped) file; we then “pipe” | this output from zcat to the head command and print just the top 4 lines (-n 4).
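Since each read takes up exactly 4 lines (see the format description below), the same trick lets us count how many reads are in a compressed file without uncompressing it; just divide the resulting line count by 4:

zcat AB_05_R1_fastq.gz | wc -l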

The .fastq file format has 4 lines for each read:

Line 1: Always begins with ‘@’ and then information about the read
Line 2: The actual DNA sequence
Line 3: Always begins with a ‘+’ and sometimes the same info as in line 1
Line 4: A string of characters which represent the quality scores; always has same number of characters as line 2

Here’s a useful reference for understanding Quality (Phred) scores. If P is the probability that a base call is an error, then:

\[P = 10^{-Q/10}\]

\[Q = -10 \log_{10}(P)\]
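For example, a base with Q = 20 has a 1 in 100 chance of being a miscall, and Q = 30 a 1 in 1000 chance:

\[Q = 20 \;\Rightarrow\; P = 10^{-20/10} = 0.01\]

\[Q = 30 \;\Rightarrow\; P = 10^{-30/10} = 0.001\]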

2.5 Visualize using FastQC

We’re going to use the program FastQC (already installed on our server). FastQC looks at the quality collectively across all reads in a sample.

First, let’s make a new dir within myresults to hold the outputs

mkdir ~/<myrepo>/myresults/fastqc

Then, the basic FastQC command is like this:

fastqc FILENAME.fastq.gz -o outputdirectory/

This will generate an .html output file for each input file you’ve run.

But, we want to be clever and process multiple files (i.e., ALL files from our population) without having to manually submit each one. We can do this by writing a bash script that contains a loop.

The basic syntax of a bash loop is like this:

for file in myfiles
do
  command 1 -options ${file}
  command 2 -options ${file}
done

Note the use of variable substitution using ${}. We define the word file in the for loop as the variable of interest, and then call each iteration of it using ${file}. For example, we could use the wildcard character (*) in a loop to call all files that include the population code “AB” and then pass those filenames in a loop to fastqc. Something like:

for file in AB*fastq.gz

do

 fastqc ${file} -o ~/<myrepo>/myresults/fastqc

done

Let’s write the above into a script using the Vim text editor at the command line. Type vim to get into the editor, then type “i” to enter INSERT mode. You can then type your script (remember to make necessary changes to the population code and output path). Lastly, to save the file and quit Vim, hit the ESCAPE key to get out of INSERT mode, followed by :wq fastqc.sh
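For reference, here’s a minimal sketch of what fastqc.sh could look like, assuming your assigned population code is AB and the raw files are in the edge_fastq directory shown above (edit the population code and the output path to match your own):

#!/bin/bash

# run FastQC on every raw fastq.gz file for my population
cd /data/project_data/RS_ExomeSeq/fastq/edge_fastq

for file in AB*fastq.gz
do
  fastqc ${file} -o ~/<myrepo>/myresults/fastqc
done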

Back at the command line, you should be able to ll and see your new script!

You may find that you need to change the permission on your script to make it executable. Do this using chmod u+x, which changes the permissions to give the user (you) permission to execute (x). Then give it a run!

chmod u+x fastqc.sh    # makes the script "executable" by the "user"
./fastqc.sh            # executes the script

It’ll take just a couple of minutes per fastq file. Once you’ve got results, let’s look at them by pushing your html files up to your Github. Remember,

git pull
git add --all .
git commit -m "note"
git push

Once synced to Github, use Github desktop to pull them down to your laptop where you can open them with a browser.

How does the quality look?

3 Trimmomatic

We’ll use the Trimmomatic program to clean the reads for each file. The program is already installed on our server.

We’ve provided an example script in the /data/scripts/ directory this time, because Trimmomatic is a Java-based program and thus a bit more particular in its call.

  • Copy the bash script over to your ~/myrepo/myscripts directory.
  • Open and edit the bash script using the program vim.
  • Edit the file so that you’re trimming the fastq files for the population assigned to you.
  • Change the permissions on your script to make it executable, then run it! (examples below)

cp /data/scripts/trim_loop.sh  ~/myrepo/myscripts/ # copies the script to your home scripts dir
vim trim_loop.sh    # open the script with vim to edit

This time we use variable substitution to take the name of the R1 read in each pair, define the name of the second read in the pair (R2), and create a basename that only contains the “pop_ind” part of the name, e.g., AB_05:

R2=${R1/_R1_fastq.gz/_R2_fastq.gz}   # defines the name for the second read in the pair (R2) based on knowing the R1 name (the file names are identical except for the R1 vs. R2 designation)
f=${R1/_R1_fastq.gz/}   # creates a new variable from R1 that has the "_R1_fastq.gz" stripped off
name=`basename ${f}`   # calls the handy "basename" function to define a new variable containing only the very last part of the filename while stripping off all the path information.  This gets us the "AB_05" bit we want.

Here’s how it should look (replace AB with your population name):

3.1 trim.sh


#!/bin/bash   
 
cd /data/project_data/RS_ExomeSeq/fastq/edge_fastq  

for R1 in AB*R1_fastq.gz  

do 
 
    R2=${R1/_R1_fastq.gz/_R2_fastq.gz}
    f=${R1/_R1_fastq.gz/}
    name=`basename ${f}`

    java -classpath /data/popgen/Trimmomatic-0.33/trimmomatic-0.33.jar org.usadellab.trimmomatic.TrimmomaticPE \
        -threads 1 \
        -phred33 \
         "$R1" \
         "$R2" \
         /data/project_data/RS_ExomeSeq/fastq/edge_fastq/pairedcleanreads/${name}_R1.cl.pd.fq \
         /data/project_data/RS_ExomeSeq/fastq/edge_fastq/unpairedcleanreads/${name}_R1.cl.un.fq \
         /data/project_data/RS_ExomeSeq/fastq/edge_fastq/pairedcleanreads/${name}_R2.cl.pd.fq \
         /data/project_data/RS_ExomeSeq/fastq/edge_fastq/unpairedcleanreads/${name}_R2.cl.un.fq \
        ILLUMINACLIP:/data/popgen/Trimmomatic-0.33/adapters/TruSeq3-PE.fa:2:30:10 \
        LEADING:20 \
        TRAILING:20 \
        SLIDINGWINDOW:6:20 \
        MINLEN:35 
 
done 

Trimmomatic performs the cleaning steps in the order they are presented. It’s recommended to clip adapter early in the process and clean for length at the end.

The steps and options are from the Trimmomatic website:

ILLUMINACLIP: Cut adapter and other illumina-specific sequences from the read.
LEADING: Cut bases off the start of a read, if below a threshold quality.
TRAILING: Cut bases off the end of a read, if below a threshold quality.
SLIDINGWINDOW: Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold.
MINLEN: Drop the read if it is below a specified length.

3.2 Visualize again using FastQC

  • Check the quality of one of your cleaned files using fastqc again
  • Redo the same general visualization steps on the cleaned files
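For example, to check a single cleaned file (assuming AB_05 was one of the samples you trimmed and you’re using the same fastqc output directory as before):

fastqc /data/project_data/RS_ExomeSeq/fastq/edge_fastq/pairedcleanreads/AB_05_R1.cl.pd.fq -o ~/<myrepo>/myresults/fastqc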

4 Mapping

Now that we have cleaned and trimmed read pairs, we’re ready to map them against the reference genome.

4.1 Obtain reference genome

We’ll be using a reduced reference genome based on selecting only those scaffolds of the full genome reference that contain at least one bait. We’ve placed it on our server here: /data/project_data/RS_ExomeSeq/ReferenceGenomes/Pabies1.0-genome_reduced.fa

The reference genome is based on Norway spruce (P. abies) and is available from congenie.org.

4.2 bwa for mapping reads

We’ll use the program bwa, which is a very efficient and very well vetted read mapper. Lots of others exist and can be useful to explore for future datasets. We tried several, and for our exome data, bwa seems to be the best.

We are going to write a bash script together that calls the R1 and R2 reads for each individual in our population, and uses the bwa-mem algorithm to map reads to the reference genome. The resulting output will be a .sam file alignment. The basic bwa command we’ll use is:

bwa mem -t 1 -M -a ${ref} ${forward} ${reverse} > ${output}/BWA/${name}.sam

# where:

-t # the number of threads, or computer cpus to use (in this case, just 1)
-M # labels a read with a special flag if its mapping is split across >1 contig
-a # keeps alignments involving unpaired reads
${ref} # specifies the path and filename for the reference genome
${forward} # specifies the path and filename for the cleaned and trimmed R1 reads 
${reverse} # specifies the path and filename for the cleaned and trimmed R2 reads 
>${output}/BWA/${name}.sam  # directs the .sam file to be saved into a directory called BWA
# Other bwa options detailed here: bwa manual page

4.2.1 mapping.sh

#! /bin/bash

# this script is to run the read mapping using the bwa program

# -t 1 is that we are using only one thread
# -M is a special flag if its mapping is split across more than one contig
# this just allows it to map across contigs
# ${ref} variable ref that we will define as the reference genome
# then it will redirect the output file .sam to the output dir we made
# > redirects the output to save it to a file, otherwise bash would just vomit
# the words on the screen

ref="/data/project_data/RS_ExomeSeq/ReferenceGenomes/Pabies1.0-genome_reduced.fa"

# a loop to map each individual within my population
# ${input} is the path to the cleaned and paired reads (defined in the wrapper script)

for forward in ${input}*_R1.cl.pd.fq

do
  # define the reverse read (R2) based on the name of the forward read (R1)
  # then strip the "_R1.cl.pd.fq" suffix off the forward read name so that only
  # the base part of the name remains
  # the slash inside ${} works like find-and-replace: the text between the first
  # and second slash is what to find, and the text after the second slash is what
  # to replace it with
  reverse=${forward/_R1.cl.pd.fq/_R2.cl.pd.fq}
  f=${forward/_R1.cl.pd.fq/}
  # basename is a standard command that strips off the path, leaving just the
  # file name; we use it for naming the rest of the files
  name=`basename ${f}`
  bwa mem -t 1 -M ${ref} ${forward} ${reverse} > ${output}/BWA/${name}.sam
done

4.3 samtools and sambamba

Our last steps (with basic syntax) are to:

  1. Convert our .sam files to the more efficient binary version .bam and sort them

  • sambamba-0.7.1-linux-static view -S --format=bam file.sam -o file.bam

  2. Get rid of any PCR duplicate sequences (why are these a problem?) and re-sort after removing duplicates.

  • sambamba-0.7.1-linux-static markdup -r -t 1 file.bam file.rmdup.bam
  • samtools sort file.rmdup.bam -o file.sorted.rmdup.bam

  3. Get some stats on how well the mapping worked

  • samtools flagstat file.sorted.rmdup.bam | awk 'NR>=5&&NR<=13 {print $1}' | column -x
  • samtools depth file.sorted.rmdup.bam | awk '{sum+=$3} END {print sum/NR}'

For this, we’ll use a combination of two new programs: samtools and sambamba. Samtools was written by Heng Li, the same person who wrote bwa, and is a powerful tool for manipulating sam/bam files. Sambamba is derived from samtools, and has been re-coded to increase efficiency (speed). We’ll use them both at different steps.

I’ve put a bash script with the commands to run steps 1-3 above. It’s located here:

/data/scripts/process_bam.sh

  • Make a copy of this over to your home directory
  • Use vim to edit the paths and population
  • Save and quit vim, type :wq

4.4 process_bam.sh

#! /bin/bash
# this is where our output sam files are going to get converted into binary
# format (bam)

# then we are going to sort the bam files, remove PCR duplicates, and index them

# first convert sam to bam and then sort

for f in ${output}/BWA/${mypop}*.sam

do

  out=${f/.sam/}
  # this converts it to the bam
  sambamba-0.7.1-linux-static view -S --format=bam ${f} -o ${out}.bam
  # now we need to sort the bam files so that downstream tools can quickly access reads by position
  samtools sort ${out}.bam -o ${out}.sorted.bam

done

# next remove the pcr duplicates from our bam files

for file in ${output}/BWA/${mypop}*.sorted.bam

do

  f=${file/.sorted.bam/}
  # this marks duplicates when reads have the exact same coordinates; because the
  # library was randomly fragmented, reads that start at identical positions are
  # most likely PCR duplicates rather than independent fragments
  # -r removes all dups except for one
  # no -o flag is needed for the output
  # just calling the command on the server without any of the options gives you the help information
  sambamba-0.7.1-linux-static markdup -r -t 1 ${file} ${f}.sorted.rmdup.bam

done

# finish by indexing the files

for file in ${output}/BWA/${mypop}*sorted.rmdup.bam

do

  samtools index ${file}

done

4.5 Create a wrapper

Last step, we’re going to put these scripts all together into a “wrapper” that will execute each of them one after the other, and work for us while we’re off getting a coffee or sleeping. :) I’ll show you how to code this together in class.

4.5.1 mypipeline.sh

#! /bin/bash

# we will use this as a wrapper to run our different scripts

# path to my repo on the server
myrepo="/users/t/s/tsoleary/eco_genomics"

# define the population
mypop="BFA"

# directory with our cleaned and paired reads
input="/data/project_data/RS_ExomeSeq/fastq/edge_fastq/pairedcleanreads/${mypop}"

# define an output directory in the common space not your own repo
output="/data/project_data/RS_ExomeSeq/mapping"


# call the mapping.sh and process_bam.sh scripts; you could also add the stats script, but for the sake of time we will do that later

source ./mapping.sh

source ./process_bam.sh

4.6 Using screen

Once your wrapper script is ready, you’re going to want to start a screen. The screen command initiates a new shell window that won’t interrupt or stop your work if you close your computer, log off the server, or leave the UVM network. Anytime you’re running long jobs, you definitely want to use screen.

Using it is easy. Just type screen followed by Enter. It will take you to a new empty terminal. You can then start your wrapper bash script and see that it starts running. Once everything looks good, you have to detach from the screen by typing Ctrl-A + Ctrl-D. If you don’t do this, you’ll lose your work!

When you’re ready to check back on the progress of your program, you can recover your screen by typing screen -r. That’ll re-attach you back to your program!
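A typical session looks something like this (assuming your wrapper is called mypipeline.sh):

screen                 # start a new screen session (opens a fresh, empty terminal)
bash mypipeline.sh     # launch the wrapper inside the screen
# ...detach with Ctrl-A + Ctrl-D; the job keeps running on the server...
screen -r              # later, re-attach to check on progress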

4.7 Next day mapping

  • Review our progress on mapping
  • Calculate mapping statistics to assess quality of the result
  • Visualize sequence alignment files
  • Introduce use of genotype-likelihoods for analyzing diversity in low coverage sequences
  • Use the ‘ANGSD’ program to calculate diversity stats, Fsts, and PCA

4.8 Sequence AlignMent (SAM) files

/data/project_data/RS_ExomeSeq/mapping/BWA/
  • First, try looking at a SAM file using head and tail.
tail -n 100 FILENAME.sam

A SAM file is a tab delimited text file that stores information about the alignment of reads in a FASTQ file to a reference genome or transcriptome. For each read in a FASTQ file, there’s a line in the SAM file that includes

  • the read (a.k.a. query) name,
  • a FLAG (number with information about mapping success and orientation and whether the read is the left or right read),
  • the reference sequence name to which the read mapped
  • the leftmost position in the reference where the read mapped
  • the mapping quality (Phred-scaled)
  • a CIGAR string that gives alignment information (how many bases Match (M), where there’s an Insertion (I) or Deletion (D))
  • an ‘=’, mate position, inferred insert size (columns 7,8,9),
  • the query sequence and Phred-scaled quality from the FASTQ file (columns 10 and 11),
  • then lots of good information in TAGS at the end, if the read mapped, including whether it is a unique read (XT:A:U), the number of best hits (X0:i:1), the number of suboptimal hits (X1:i:0).

The left (R1) and right (R2) reads alternate through the file. SAM files usually have a header section with general information where each line starts with the ‘@’ symbol. SAM and BAM files contain the same information; SAM is human readable and BAM is in binary code and therefore has a smaller file size.
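Because BAM is binary, you can’t read it with head directly; pipe it through samtools view instead. For example, to peek at the first two alignments of the AB_05 file used later in this tutorial:

samtools view /data/project_data/RS_ExomeSeq/mapping/BWA/AB_05.sorted.rmdup.bam | head -n 2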

More documentation on the SAM file format can be found at https://en.wikipedia.org/wiki/SAM_(file_format) or in the official format specification.

4.9 How well reads mapped to the reference

  • We can use the program samtools, written by Heng Li, the same person who wrote bwa. It is a powerful tool for manipulating sam/bam files.

  • The command flagstat gets us some basic info on how well the mapping worked:

  • samtools flagstat FILENAME.sam

4.9.1 bam_stats.sh


#! /bin/bash

# set repo

myrepo="/users/t/s/tsoleary/eco_genomics"

mypop="BFA"

output="/data/project_data/RS_ExomeSeq/mapping"

# this echo is creating a header for the file and saving it on the flagstats.txt files
echo "Num.reads R1 R2 Paired MateMapped Singletons MateMappedDiffChr" > ${myrepo}/myresults/${mypop}.flagstats.txt

for file in ${output}/BWA/${mypop}*sorted.rmdup.bam
  do
    f=${file/.sorted.rmdup.bam/}
    name=`basename ${f}`
    # the double >> makes it so that it appends it and doesn't just rewrite the file every time
    echo ${name} >> ${myrepo}/myresults/${mypop}.names.txt
    # NR is awk's row number; rows 6 through 12 have the meaningful results in them
    samtools flagstat ${file} | awk 'NR>=6&&NR<=12 {print $1}' | column -x
  done >> ${myrepo}/myresults/${mypop}.flagstats.txt

# calculate the depth of coverage from our bam files
for file in ${output}/BWA/${mypop}*sorted.rmdup.bam
  do
    samtools depth ${file} | awk '{sum+=$3} END {print sum/NR}'
  done >> ${myrepo}/myresults/${mypop}.coverage.txt
  • We’ll also use the awk tool to help format the output.

We’ll use the samtools flagstat command to get read counts after mapping, and the depth command to get perhaps the most important statistic in read mapping: the depth of coverage, or how many reads cover each mapped position, on average.

Put the following into a loop for each individual for which you’ve generated a .sorted.rmdup.bam file:

  • samtools flagstat file.sorted.rmdup.bam | awk 'NR>=6&&NR<=13 {print $1}' | column -x >> ${myrepo}/myresults/${mypop}.flagstats.txt

  • samtools depth file.sorted.rmdup.bam | awk '{sum+=$3} END {print sum/NR}' >> ${myrepo}/myresults/${mypop}.coverage.txt

While that’s running, we can take a look at one of our alignment files (sam or bam) using an integrated viewer in samtools called tview. To use it, simply call the program and command, followed by the sam/bam file you want to view and the path to the reference genome. For example:

samtools tview /data/project_data/RS_ExomeSeq/mapping/BWA/AB_05.sorted.rmdup.bam /data/project_data/RS_ExomeSeq/ReferenceGenomes/Pabies1.0-genome_reduced.fa

4.10 Calling Genotypes

Inference of population genetics from the aligned sequence data. Should we call genotypes?

Many of the papers you’ll read that do popgen on NGS data have a SNP calling step that results in a specific genotype being called for each SNP site for each individual. For example,

SNP  Ind1  Ind2
1    CC    CT
2    AG    AA
3    GT    TT

But how do we know that Ind1 is homozygous at SNP-1 (CC) – couldn’t it be CT and we just didn’t have enough coverage to observe the second allele?

The basic problem is that read data are counts that produce a binomial (actually multinomial) distribution of allele calls at a given site, and if you have few reads, you might by chance not observe the true genotype. So, what’s the right thing to do?

As with almost anything in statistics, the right thing to do is not throw away that uncertainty, but instead incorporate it into your analysis. That’s what we’re going to do…
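As a quick back-of-the-envelope example: if an individual is truly heterozygous, each read independently samples one of its two alleles with probability 0.5, so the chance that all n reads happen to show the same allele (making the site look homozygous) is

\[P(\text{all } n \text{ reads show the same allele} \mid \text{heterozygote}) = 2 \times 0.5^{n}\]

which is 0.5 with 2 reads and still about 0.06 with 5 reads.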

Genotype-free population genetics using genotype likelihoods

A growing movement in popgen analysis of NGS data is embracing the use of genotype likelihoods to calculate stats based on each individual having a likelihood (probability) of being each genotype.

A genotype likelihood (GL) is essentially the probability of observing the sequencing data (reads containing a particular base), given the genotype of the individual at that site.

These probabilities are modeled explicitly in the calculation of population diversity stats like pi, Tajima’s D, Fst, PCA, etc…; thus we are not throwing out any precious data, but also making fewer assumptions about the true (unknown) genotype at each locus.

We’re going to use this approach with the program ‘ANGSD’, which stands for ‘Analysis of Next Generation Sequence Data’

This approach was pioneered by Rasmus Nielsen, published originally in Korneliussen et al. 2014.

ANGSD has a user’s manual (it’s a work in progress…)

The basic work flow of ANGSD looks like this:

Create a list of bam files for the samples you want to analyze Estimate genotype likelihoods (GL’s) and allele frequencies after filtering to minimize noise Use GL’s to: - estimate the site frequency spectrum (SFS) - estimate nucleotide diversities (Watterson’s theta, pi, Tajima’s D, …) - estimate Fst between all populations, or pairwise between sets of populations - perform a genetic PCA based on estimation of the genetic covariance matrix (this is done on the entire set of Edge ind’s)

4.11 ANGSD_mypop.sh

myrepo="/users/s/r/srkeller/Ecological_Genomics/Spring_2020"

mkdir ${myrepo}/myresults/ANGSD

output="${myrepo}/myresults/ANGSD"

mypop="AB"

ls /data/project_data/RS_ExomeSeq/mapping/BWA/${mypop}_*sorted.rm*.bam >${output}/${mypop}_bam.list

Check your output bamlist to see it was made properly.

  2. Estimate your GL’s and allele freqs after filtering for depth, base and mapping quality, etc. First, define the reference genome:

REF="/data/project_data/RS_ExomeSeq/ReferenceGenomes/Pabies1.0-genome_reduced.fa"

4.12 Estimating GL’s and allele freqs

Estimating GL’s and allele frequencies for all sites with ANGSD

ANGSD -b ${output}/${mypop}_bam.list \
-ref ${REF} -anc ${REF} \
-out ${output}/${mypop}_allsites \
-nThreads 1 \
-remove_bads 1 \
-C 50 \
-baq 1 \
-minMapQ 20 \
-minQ 20 \
-setMinDepth 3 \
-minInd 2 \
-setMinDepthInd 1 \
-setMaxDepthInd 17 \
-skipTriallelic 1 \
-GL 1 \
-doCounts 1 \
-doMajorMinor 1 \
-doMaf 1 \
-doSaf 1 \
-doHWE 1
# -SNP_pval 1e-6

What do all these options mean?

-nThreads 1 # how many cpus to use – be conservative
-remove_bads 1  # remove reads flagged as ‘bad’ by samtools
-C 50   # enforce downgrading of map quality if contains excessive mismatches
-baq 1  # estimates base alignment qualities for bases around indels
-minMapQ 20 # threshold for minimum read mapping quality (Phred)
-minQ 20    # threshold for minimum base quality (Phred)
-setMinDepth 3  # min read depth across ALL individuals to keep a site
-minInd 2   # min number of individuals to keep a site
-setMinDepthInd 1   # min read depth for an individual to keep a site
-setMaxDepthInd 17  # max read depth for an individual to keep a site
-skipTriallelic 1   # don’t use sites with >2 alleles
-GL 1   # estimate GL’s using the Samtools formula
-doCounts 1 # output allele counts for each site
-doMajorMinor 1 # fix major and minor alleles the same across all samples
-doMaf 1 # calculate minor allele frequency
-doSaf 1 # output allele frequencies for each site
-doHWE 1 # calculate obs. v. exp. heterozygosity
-SNP_pval 1e-6 # Keep only sites highly likely to be polymorphic (SNPs)
NOTES

  • If you want to restrict the estimation of the genotype likelihoods to a particular set of sites you’re interested in, add the option -sites ./selected_sites.txt (a tab-delimited file with the position of the site in column 1 and the chromosome in column 2), or use -rf ./selected_chromosome.chrs (if listing just the unique “chromosomes” or contigs you want to analyze).
  • For some popgen stats you want to estimate only the polymorphic sites; for this you should include the -SNP_pval 1e-6 option to eliminate monomorphic sites when calculating your GL’s.
  • There are good reasons to do it BOTH ways, with and without the -SNP_pval 1e-6 option. Keeping the monomorphic sites in is essential for getting proper estimates of nucleotide diversity and Tajima’s D, but other analyses such as PCA or GWAS want only the SNPs.

3a. Estimate the SFS for your pop:

realSFS ${output}/${mypop}_allsites.saf.idx -maxIter 1000 -tole 1e-6 -P 1 > ${output}/${mypop}_allsites.sfs

3b. Once you have the SFS, you can estimate the theta diversity stats:

4.13 Estimating thetas

Estimating thetas for all sites, using the SFS from above as a prior to estimate the GL’s

ANGSD -b ${output}/${mypop}_bam.list \
-ref ${REF} -anc ${REF} \
-out ${output}/${mypop} \
-nThreads 1 \
-remove_bads 1 \
-C 50 \
-baq 1 \
-minMapQ 20 \
-minQ 20 \
-minInd 2 \
-setMinDepthInd 1 \
-setMaxDepthInd 17 \
-setMinDepth 3 \
-skipTriallelic 1 \
-GL 1 \
-doCounts 1 \
-doMajorMinor 1 \
-pest ${output}/${mypop}_allsites.sfs \
-doSaf 1 \
-doThetas 1

thetaStat do_stat ${output}/${mypop}_allsites.thetas.idx

The first column of the results file (${mypop}.thetas.idx.pestPG) is formatted a bit funny and we don’t really need it. We can use the cut command to get rid of it:

cut -f2- ${mypop}.thetas.idx.pestPG > ${mypop}.thetas

This is now ready to bring into R to look at the mean and variability in nucleotide diversity for our pop. How does it compare to others?
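A minimal sketch of how you might summarize these values in R, assuming the standard thetaStat column names (tW, tP, Tajima, nSites) and a file named AB.thetas for population AB; check names() against your own file if the columns differ:

# read in the per-window theta estimates output by thetaStat do_stat
theta <- read.table("AB.thetas", header = TRUE)

# tW and tP are summed over sites, so divide by nSites to get per-site estimates
theta$tW_site <- theta$tW / theta$nSites
theta$tP_site <- theta$tP / theta$nSites

summary(theta$tP_site)   # mean and spread of per-site nucleotide diversity (pi)
summary(theta$Tajima)    # Tajima's D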

3c. We can calculate Fst between any pair of populations by comparing their SFS to each other. For this, we’ll need to estimate the SFS for pairs of populations; we can each contribute to the overall analysis by looking at how our focal pop is divergent from the others in the edge region.

We first want to get a list of the unique population codes that doesn’t include our focal population (it doesn’t make sense to calculate an Fst of a population with itself…):

cat /data/project_data/RS_ExomeSeq/metadata/RS_Exome_metadata.txt | grep -w "E" | cut -f1 | uniq | grep -v ${mypop} > ~/yourrepo/mydata/edgepops_no${mypop}.txt

Next, set up a loop to calculate Fst between your population and each other pop (coded in the loop as the variable “POP2”):

for POP2 in `cat ${myrepo}/mydata/edgepops_no${mypop}.txt`
do
    realSFS ${output}/${mypop}_allsites.saf.idx ${output}/${POP2}_allsites.saf.idx -P 1 > ${output}/${mypop}_${POP2}_allsites.sfs
    realSFS fst index ${output}/${mypop}_allsites.saf.idx ${output}/${POP2}_allsites.saf.idx -sfs ${output}/${mypop}_${POP2}_allsites.sfs -fstout ${output}/${mypop}_${POP2}_allsites -whichFst 1
    realSFS fst print ${output}/${mypop}_${POP2}_allsites.fst.idx > ${output}/${mypop}_${POP2}_allsites.fst
done

4.14 Ancestry relationships

3d. If we have time (and the previous analysis I started yesterday finally finishes running!), we can look at the ancestry relationships among all the edge individuals using a genetic Principal Component Analysis. ANGSD has a routine for that too – PCAngsd. First, we estimate the GL’s for the entire sample, keeping just the polymorphic sites, and outputting the GL’s in “Beagle” format (file.beagle.gz).

Make a list of all the sorted and PCR dup removed bam files:

ls /data/project_data/RS_ExomeSeq/mapping/BWA/*sorted.rm*.bam >${myrepo}/myresults/EDGE_bam.list

4.15 Run ANGSD to estimate the GL’s

ANGSD -b ${myrepo}/myresults/EDGE_bam.list \
-ref ${REF} -anc ${REF} -out ${myrepo}/myresults/EDGE_poly \
-nThreads 10 \
-remove_bads 1 \
-C 50 \
-baq 1 \
-minMapQ 20 \
-minQ 20 \
-setMinDepth 3 \
-minInd 2 \
-setMinDepthInd 1 \
-setMaxDepthInd 17 \
-skipTriallelic 1 \
-GL 1 \
-doCounts 1 \
-doMajorMinor 1 \
-doMaf 1 \
-doGlf 2 \
-SNP_pval 1e-6

The resulting GL file EDGE_poly.beagle.gz can be used as input to PCAngsd to estimate the covariance matrix:

4.16 PCA - Estimating the covariance matrix with pcangsd

python /data/popgen/pcangsd/pcangsd.py -beagle ${myrepo}/myresults/EDGE_poly.beagle.gz -o ${myrepo}/myresults/EDGE_poly_covmatrix -threads 1

Once you have that, you’ll want to send all your outputs up to Github and then back down to your local machine. We can use R to find the eigenvectors of the genetic covariance matrix and plot it:

4.17 Vizualizing in R

setwd("path to your repo")
# read in the genetic covariance matrix estimated by pcangsd
covmat <- read.table("myresults/EDGE_poly_covmatrix")

# eigen-decompose the covariance matrix, then plot individuals on PC1 vs PC2
PCA <- eigen(covmat)
plot(PCA$vectors[,1], PCA$vectors[,2], xlab = "PC1", ylab = "PC2")

5 Diversity & Pheno~Geno

5.1 Learning Objectives

  • Appreciate the difference between the unfolded and the folded SFS (which should we use?)
  • Calculate diversity stats for our focal pops (SFS, Num sites, Frequency of SNPs, theta-W, pi, Tajima’s D)
  • Visualize results in R and share to google drive
  • Introduce Genome-Wide Association Studies (GWAS) in ANGSD using genotype probabilities
  • Do GWAS on seedling heights
  • The unfolded vs. folded SFS
  • The big difference here is whether we are confident in the ancestral state of each variable site (SNP) in our dataset

If we know the ancestral state, then the best info is contained in the unfolded spectrum, which shows the frequency histogram of how many derived loci are rare vs. common. Bins in the unfolded spectrum go from 0 to 2N – why? When you don’t know the ancestral state confidently, you can make the SFS based on the minor allele (the less frequent allele; always < 0.5 in the population). Bins in the folded spectrum go from 0 to 1N – why? Essentially, the folded spectrum wraps the SFS around such that high frequency “derived” alleles are put in the small bins (low minor allele freq).

5.2 Calculate SFS and diversity stats

In your myscripts folder, let’s revise the script ANGSD_mypop.sh to work on the folded SFS

The first part will be identical to last time, except:

  1. Change the -out name to -out ${output}/${mypop}_outFold
  2. Take out the HWE test (not necessary to run again) and replace it with -fold 1

For info on using the folded spectrum, see the ANGSD manual page for Theta stats under “folded”.

The above command generated the site allele frequencies, which we can now use to estimate the folded SFS for your pop:

realSFS ${output}/${mypop}_outFold.saf.idx -maxIter 1000 -tole 1e-6 -P 1 > ${output}/${mypop}_outFold.sfs

Once you have the SFS, you can estimate the theta diversity stats. We do this by running ANGSD again, as above, but this time we include the -pest flag, which uses the SFS we estimated previously as a prior on the allele frequency estimation. We also include the -doThetas flag, and of course -fold 1 to tell ANGSD we want the diversities based on the folded SFS.

So, we’re copying our previous set of ANGSD commands to make a new entry and adding:

-pest ${output}/${mypop}_outFold.sfs \
-doSaf 1 \
-doThetas 1 \
-fold 1

thetaStat do_stat ${output}/${mypop}_outFold.thetas.idx

The first column of the results file (${mypop}.thetas.idx.pestPG) is formatted a bit funny and we don’t really need it. We can use the cut command to get rid of it if we want to, or just ignore it.

cut -f2- ${mypop}.thetas.idx.pestPG > ${mypop}.thetas

This is now ready to bring into R (transfer via git) to look at the mean per-site nucleotide diversity for your focal population. How does it compare to other populations? Also bring in your SFS for plotting, and to calculate the SNP frequency in your population.

  • Share your population-specific stats with the group.

  • Share your population stats by pasting your values into this googledoc.

5.3 Genome-Wide Association Studies (GWAS)

The last type of analysis we’ll look at is GWAS. Recall that the purpose of GWAS is to ‘scan’ the genome SNP by SNP and test for an association between the genotypes at that locus (e.g., AA, AC, CC) and some response variable of interest. Often this is a phenotypic trait that has been carefully measured so as to reduce the amount of environmental (non-genetic) contribution.

We’ll use GWAS to map seedling height growth data in common gardens that my group planted in April-May of 2019 at three sites: Burlington VT, Frostburg MD, and Bent Creek, NC

Each family is represented by 5 seedlings per site, randomized within each block

Let’s start with a GWAS that maps phenotype~genotype associations in the MD garden, since that is close to the “local” climate environment for many of these edge populations. I’ve generated best linear unbiased predictors (BLUPs) for each family at the MD site based on an initial ANOVA that removed effects of block and initial height at planting. The values are here:

/data/project_data/RS_ExomeSeq/ANGSD/blups/Ht_EDGE_MD.blups

The ANGSD command we’ll use is called doAsso and there’s some good guidance on the manual page for the different implementations of this analysis. We’ll use the hybrid approach doAsso 5, which tests every SNP for association, but then follows up with only the significant SNPs to estimate the effect size (slope of the phenotype~genotype relationship).

The statistical approach is a generalized linear model of the form:

y ~ u + beta*SNP + covariates + error

where,

  • y = the response variable (in this case the BLUPs for height, 1 for each family)
  • u = the intercept
  • beta = the coefficient (or slope) relating the change in the response variable to the SNP genotype
  • SNP = an individual SNP locus, as a factor with 3 levels (e.g., AA, AC, CC)
  • covariates are 1 or more variables that are needed to help control the type 1 error rate (false positives). Typically one uses PC axes or Admixture ancestry coefficients for this (remember the Norway spruce paper?). We can use the 1st two PC scores from our PCA, even though the structure seemed very minimal. You can find the scores here:
[srkeller@pbio381 PCA]$ pwd
/data/project_data/RS_ExomeSeq/ANGSD/PCA
[srkeller@pbio381 PCA]$ ll EDGE_PC*
-rw-r--r--. 1 srkeller users 4342 Feb 18 21:01 EDGE_PC12.txt
-rw-r--r--. 1 srkeller users 2193 Feb 18 18:10 EDGE_PC1.txt
-rw-r--r--. 1 srkeller users 2149 Feb 18 18:10 EDGE_PC2.txt
  • error = the error variance

ANGSD uses genotype probabilities as input and tests each SNP for association with a Likelihood Ratio Test (LRT) which tests the likelihood of there being an association / likelihood of NO association. The LRT is ~ chisq distributed with 1 d.f.

The basic set up is:

  • Call ANGSD with a list of all the bam files and apply the normal filtering as above to reduce noise (poor mapping or read coverage)
  • Ask ANGSD to estimate genotype probabilities to pass to doAsso
  • Tell ANGSD the name of your trait file and covariates file if you are using one.
  • Other options as needed to minimize type 1 error rates – see guidance at bottom of ANGSD manual page
  • Bring results into R for plotting and outlier detection
  • Let’s define our directories first, then layer in the main body of the script
blups="/data/project_data/RS_ExomeSeq/ANGSD/blups"

OUT="/data/project_data/RS_ExomeSeq/ANGSD"

To speed things up in class a bit, we’re going to break the total number of contigs into 20 equal-sized chunks of 1280 contigs each. That way, each student runs their GWAS on a set of 1280 contigs, and then we pool our results together at the end. No need to do this if you’re working on your own in the future…this is just a convenience for class time.

contigs="/data/project_data/RS_ExomeSeq/ANGSD/contig_splits"

mycontigs="xaa"

And then the main script:

ANGSD -b ${OUT}/EDGE_bam.list \
-ref ${REF} \
-out ${OUT}/GWAS/Ht_EDGE_MD_PC12_${mycontigs} \
-remove_bads 1 \
-C 50 \
-baq 1 \
-minMapQ 20 \
-minQ 20 \
-setMinDepth 3 \
-minInd 2 \
-setMinDepthInd 1 \
-setMaxDepthInd 17 \
-skipTriallelic 1 \
-GL 1 \
-doPost 1 \
-doCounts 1 \
-doMajorMinor 1 \
-doMaf 1 \
-SNP_pval 1e-6 \
-yQuant ${blups}/Ht_EDGE_MD.blups \
-doAsso 5 \
-rf ${contigs}/${mycontigs} \
-cov ${OUT}/PCA/EDGE_PC12.txt

Couple of new options here:

-yQuant ${blups}/Ht_EDGE_MD.blups – this is your response variable; no header
-doAsso 5 – this is the model you’re choosing to run. See the ANGSD manual under doAsso
-rf ${contigs}/${mycontigs} – this means you want the GWAS to run just on a set of regions (r) defined in an external file (f)
-cov ${OUT}/PCA/EDGE_PC12.txt – these are the covariates; as many columns as you like; no header

Give it some time to run (should take 20 min or so), then we’ll bring the results into R to plot and summarize.
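When the run finishes, a minimal import sketch might look like this (assuming you’ve transferred the doAsso output file, e.g. Ht_EDGE_MD_PC12_xaa.lrt0.gz, to your machine; the exact column names can vary a bit by ANGSD version, so check the header first):

# read the association results (the file has a header line)
gwas <- read.table(gzfile("Ht_EDGE_MD_PC12_xaa.lrt0.gz"), header = TRUE)

# drop untested sites (ANGSD typically codes these with a negative LRT, e.g. -999)
gwas <- gwas[gwas$LRT >= 0, ]

# convert the likelihood ratio test statistic to a p-value (chi-square, 1 d.f.)
gwas$p <- pchisq(gwas$LRT, df = 1, lower.tail = FALSE)

# quick look at the p-value distribution
hist(gwas$p, breaks = 50, main = "GWAS p-values", xlab = "p-value")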

Is there evidence for genetic associations with early seedling height growth? What do the beta estimates suggest about the direction of the effect (more often increasing, decreasing)? Are SNPs that have a positive effect on height in the MD garden the same as those in NC? For the SNPs that have the strongest effect on height, are they polymorphic in your focal pop? NOTE: You can do the same analysis replacing height with climate data as the response variable. This is known as Gene-Environment Association (GEA), and is a common way to test for regions of the genome that confer local adaptation to climate (or other aspects of the environment). I’ve placed two relevant climate variables for red spruce in the blups directory:

[srkeller@pbio381 blups]$ pwd
/data/project_data/RS_ExomeSeq/ANGSD/blups
[srkeller@pbio381 blups]$ ll
total 20
-rw-r--r--. 1 srkeller users  432 Feb 18 20:41 bio18.blups
-rw-r--r--. 1 srkeller users  432 Feb 18 20:37 bio5.blups

These variables are derived from the bioclim climatic variables available here.

Bio5 = 10 × (Max temp of the warmest month)
Bio18 = 10 × (Precip of the warmest quarter)

6 Transcriptomics

6.1 Learning Objectives

  • Review Red Spruce ecology and biogeography and the transcriptomics experimental design.
  • Articulate the questions we can address and hypotheses we can test with this experimental design.
  • Understand the general work flow or “pipeline” for processing and analyzing RNAseq data.
  • Review how to make/write a bash script and how to write a script to process files in batches.
  • Visualize and interpret Illumina data quality (Run FastQC on raw and cleaned reads).
  • Start mapping reads and quantifying abundance simultaneously using Salmon.
  • Import quant.sf files generated by Salmon into DESeq2 for analysis and visualization.
  • Add to your growing list of bioinformatics tricks (take notes!)

6.2 Red spruce stress transcriptomics experiment

Red spruce is a montane coniferous tree that inhabits cool moist habitats in the northeast. Occasionally it also inhabits less ideal sites that are warmer and drier, particularly driven by microsite differences in aspect and substrate (rocky scree slopes exposed to direct sun). The purpose of this experiment was to sample red spruce genetic variation from sites that were polarized into cool & wet vs. hot & dry based on historic climate (Bio18 = precip of warmest quarter; Bio5 = temp of warmest month), and to assess the gene expression responses of individuals from these habitats in response to experimental treatments of heat and heat+drought.

6.3 Experimental Design

  • Ten maternal families total; sample labels are “POP_FAM”
    • ASC_06, BRU_05, ESC_01, XBM_07, NOR_02, CAM_02, JAY_02, KAN_04, LOL_02, MMF_13

Two Source Climates (SourceClim):

  • HotDry (5 fams): ASC_06, BRU_05, ESC_01, XBM_07, NOR_02
  • CoolWet (5 fams): CAM_02, JAY_02, KAN_04, LOL_02, MMF_13

Experimental Treatments (Trt):

  • Control: watered every day, 16:8 L:D photoperiod at 23C:17C temps
  • Heat: 16:8 L:D photoperiod at 35C:26C temps (50% increase in day and night temps over controls)
  • Heat+Drought: Heat plus complete water withholding

Three time periods (Day):

  • Harvested tissues on Days 0, 5, and 10

Extracted RNA from whole seedlings (root, stem, needle tissue)

Aimed for 5 biological reps per Trt x SourceClim x Day combo, but day 5 had fewer RNA extractions that worked

Realized sample replication after sequencing: N=76

Trt SourceClim Day Nreps
Control CoolWet 0 5
Control CoolWet 5 3
Control CoolWet 10 5
Control HotDry 0 5
Control HotDry 5 3
Control HotDry 10 5
Heat CoolWet 0 5
Heat CoolWet 5 3
Heat CoolWet 10 5
Heat HotDry 0 5
Heat HotDry 5 2
Heat HotDry 10 5
HeatDry CoolWet 0 5
HeatDry CoolWet 5 1
HeatDry CoolWet 10 5
HeatDry HotDry 0 5
HeatDry HotDry 5 4
HeatDry HotDry 10 5

6.4 Library prep and sequencing

  • Samples were quantified for RNA concentration and quality on the Bioanalyzer.
  • Samples >1 ng/ul were sent to Cornell for 3’ tag sequencing.
  • Library prep followed the LexoGen protocol and sequencing was on 1 lane of a NextSeq500 (1x86 bp reads).
  • Samples were demultiplexed and named according to the convention: POP_FAM_TRT_DAY

What questions can we ask/address with this experimental design, with these data? How do we translate these questions to models in DESeq2?

  1. Do families from different source climates differ for their gene expression?

  2. Is there a transcriptome wide response to heat stress? Does this change when the additional stress of drought is added?

  3. Is there a significant interaction between source climate and stress treatment, such that families from hot/dry climates have a unique expression response to heat or heat+drought compared to families from cool/wet climates?

  4. Which specific genes are involved in the above responses, and do they reveal functional enrichment of particular pathways?

  5. Do DE genes or pathways show evidence of positive selection (compare DE genes with popgen signatures of sel’n)?

  6. Can we use association mapping to identify eQTL associated with DE responses?

6.5 Data Processing Pipeline

FastQC on raw reads –> Trimmomatic (done!) –> FastQC on cleaned reads

Reference transcriptome:

/data/project_data/RS_RNASeq/ReferenceTranscriptome/Pabies1.0-all-cds.fna.gz

Downloaded from Congenie.org

  • 66,632 unigenes, consisting of 26,437 high-confidence gene models, 32,150 medium-confidence gene models, and 8,045 low-confidence gene models
  • Use Salmon to simultaneously map reads and quantify abundance.
  • Import the data into DESeq2 in R for data normalization, visualization, and statistical tests for differential gene expression.
  • Choose samples to visualize for quality (FastQC) and to map (Salmon)
  • Ignoring day for the moment, there are 10 PopulationByFamily groups (POP_XX) and three treatment groups (C, H, and D), so 30 groups. If you each take two of these pop by treatment groups to process, we’ll have all the samples covered. cd into the /data/project_data/RS_RNASeq/fastq directory to view the files (see the sketch below). Let’s write on the board each of our chosen groups.
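For example (using one hypothetical group; substitute your own), you can list the fastq files that belong to your group like this:

cd /data/project_data/RS_RNASeq/fastq/
ls CAM_02_H_*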

6.6 FastQC

  • Remember how… Where do you start?
  • Clean the reads with Trimmomatic
  • \(\checkmark\) Already done - Here’s the script:
#!/bin/bash

cd /data/project_data/RS_RNASeq/fastq/

########## Trimmomatic for single end reads

for R1 in *R1.fastq.gz  

do 
    echo "starting sample ${R1}"
    f=${R1/_R1.fastq.gz/}
    name=`basename ${f}`

    java -classpath /data/popgen/Trimmomatic-0.33/trimmomatic-0.33.jar org.usadellab.trimmomatic.TrimmomaticSE \
        -threads 1 \
        -phred33 \
         "$R1" \
         /data/project_data/RS_RNASeq/fastq/cleanreads/${name}_R1.cl.fq \
        ILLUMINACLIP:/data/popgen/Trimmomatic-0.33/adapters/TruSeq3-SE.fa:2:30:10 \
        LEADING:20 \
        TRAILING:20 \
        SLIDINGWINDOW:6:20 \
        HEADCROP:12 \
        MINLEN:35 
     echo "sample ${R1} done"
done 

Run FastQC on the cleaned reads to confirm improvement. What do we look for in this second run?
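A minimal sketch (substitute your own group for CAM_02_H, and the output folder ~/fastqc_clean is just an example location in your home directory):

# run FastQC on the cleaned reads and write the reports to a folder you own
mkdir -p ~/fastqc_clean
cd /data/project_data/RS_RNASeq/fastq/cleanreads/
for f in CAM_02_H_*.cl.fq
do
  fastqc -o ~/fastqc_clean/ ${f}
done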

7 Salmon

Salmon to quantify transcript abundance. Mapping and counting the reads.

First step: Index the reference transcriptome. This only needs to be done once and has been done already, but here’s the code:

cd /data/project_data/RS_RNASeq/ReferenceTranscriptome/
salmon index -t Pabies1.0-all-cds.fna.gz -i Pabies_cds_index --decoys decoys.txt -k 31

Note that the -k flag sets the minimum acceptable length for a valid match between query (read) and the reference. This is a parameter that the creators of Salmon suggest can be made “slightly smaller” if the mapping rate is lower than expected. We may want to explore reducing -k.
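If we wanted to try a smaller k, the index would simply be rebuilt; for example (a sketch; the index name and k = 25 are arbitrary choices):

salmon index -t Pabies1.0-all-cds.fna.gz -i Pabies_cds_index_k25 --decoys decoys.txt -k 25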

Second step: Start quantification! Let’s see if we can write a for loop together to start the mapping based on the Salmon tutorial

salmon quant -i transcripts_index -l <LIBTYPE> -r reads.fq --validateMappings -o transcripts_quant

The descriptions of all of the options can be found on the Salmon github page and by using the command salmon quant -h.

7.1 salmon.sh

salmon.sh

#!/bin/bash

# run salmon to do the mapping and the counting of the cleanreads

# salmon quant -i transcripts_index -l <LIBTYPE> -r reads.fq --validateMappings -o transcripts_quant

cd /data/project_data/RS_RNASeq/fastq/cleanreads/

for file in CAM_02_H_*.cl.fq

do

  salmon quant \
  -i /data/project_data/RS_RNASeq/ReferenceTranscriptome/Pabies_cds_index \
  -l A \
  -r ${file} \
  --validateMappings \
  -p 1 \
  --seqBias \
  -o /data/project_data/RS_RNASeq/salmon/allmapping/${file}

done

7.2 Explore mapping rate

For each sample mapped, you now have a directory with several output files including a log of the run. In that log, the mapping rate (% of reads mapped with sufficient quality) is reported. We can view the contents of the file using cat. We can also use grep (i.e., regular expressions) to pull out the mapping rate for all the samples. Though there’s probably a more elegant solution, here is one:

grep -r --include \*.log -e 'Mapping rate'

How could we save this output?
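One simple option (a sketch, run from the directory containing the per-sample salmon output folders) is to redirect the grep results to a file:

grep -r --include \*.log -e 'Mapping rate' > mapping_rates.txt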

What is our mapping rate? Is this good/enough? What factors could affect mapping rate?

We ended up using the index file called Pabies_cds_index, which contains all of the coding sequences for P. abies, not just the high confidence coding sequences. This improved the percent mapping from ~ 4% to ~ 25%.

7.3 R import!

7.3.1 Combine individual quant.sf files

Combine individual quant.sf files from their respective directories into a counts data matrix with all 76 samples in one table

To do this, we’ll use the R package tximport on our class server. Here’s a tximport tutorial by the creators of the two programs, Salmon and DESeq2. Unless we decide to use multiple rounds of mapping (changing references, etc), only one person needs to make this compiled matrix.

Here’s an example of how it can be done. First start R on the server by typing capital R and enter. Exciting! Right? Load the libraries we need (already installed).

library(tximportData)
library(tximport)

#locate the directory containing the files. 
dir <- "/data/project_data/RS_RNASeq/salmon/"
list.files(dir)

# read in table with sample ids
samples <- read.table("/data/project_data/RS_RNASeq/salmon/RS_samples.txt", header=TRUE)

# now point to quant files
all_files <- file.path(dir, samples$sample, "quant.sf")
names(all_files) <- samples$sample

# if we had a tx2gene table linking transcripts to genes, we would use:
#txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
# without tx2gene, run at the transcript level (txOut = TRUE)
txi <- tximport(all_files, type = "salmon", txOut=TRUE)  
names(txi)

head(txi$counts)

countsMatrix <- txi$counts
dim(countsMatrix)
#[1] 66069    76

# To write out
write.table(countsMatrix, file = "RS_countsMatrix.txt", col.names = T, row.names = T, quote = F)

7.4 Move counts data matrix to local

Move this counts data matrix to your individual machines using your preferred ftp program (Fetch, WinSCP, scp). Also move the RS_samples.txt file to your machine. This file is a table that associates each of our samples with their conditions (climate, treatment, day, population).

Note that in the code above, we are using the counts data that are not normalized to length because we are working with 3’ RNAseq data. This is informed by a cautionary note from the creators on working with 3’ tagged RNA-seq: “If you have 3’ tagged RNA-seq data, then correcting the counts for gene length will induce a bias in your analysis, because the counts do not have length bias. Instead of using the default full-transcript-length pipeline, we recommend to use the original counts, e.g. txi$counts as a counts matrix, e.g. providing to DESeqDataSetFromMatrix or to the edgeR or limma functions without calculating an offset and without using countsFromAbundance.”

Important considerations: How was this transcriptome assembled? Should we collapse transcripts to genes?

To do this, the rownames of the sampleTable (RS_samples) need to align with the colnames of txi$counts; that is, the sample names must be the row names of the sample/factor table and the column names of the data matrix (with rows as genes and columns as samples).
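A quick sanity check before building the DESeq2 object (a sketch; sampleTable is whatever name you gave your sample/factor table):

# sample names should match and be in the same order in both objects
all(rownames(sampleTable) == colnames(txi$counts))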

dds <- DESeqDataSetFromTximport(txi, colData = sampleTable, design = ~ climate + day + treatment)

Let’s get R set up for Transcriptomics, data import and visualization! Here is a list of some of the libraries we will need:

setwd("~/github/2020_Ecological_Genomics")

library(DESeq2)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
library(ggpubr)
library(wesanderson)


countsTable <- read.table("RS_countsMatrix.txt", header=TRUE, row.names=1)

conds <- read.delim("RS_samples.txt", header=TRUE, stringsAsFactors = TRUE, row.names=1, colClasses=c('factor', 'factor', 'factor', 'factor'))

dds <- DESeqDataSetFromMatrix(countData = countsTableRound, colData = conds, design = ~ day + climate + treatment)

8 DESeq2 in R

8.1 Objectives

  • Recap on Mapping of 3’ RNA-seq data to transcriptome and assembly of data matrix
  • Import data matrix and sample information into R and DESeq2
  • Normalize, visualize and analyze expression data using DESeq2.

8.2 Troubleshooting mapping rate

Recall that two weeks ago you all wrote for loops to map your set of cleaned fastq files to the reference transcriptome. We discovered that we had low mapping rates, ~2%! We did some troubleshooting in class. We further hypothesized that many of our reads were not mapping because the reference we had selected included only the coding region. In working with 3’ RNAseq data, much of our sequencing effort is likely to be in the 3’ UTR (untranslated region).

  • We concatenated the reference sequences for the coding (“cds”) and the 3’ UTR (“2kb downstream”) for each gene.
  • We then mapped to this new reference using salmon (as you had done before).
  • Our mapping rate improved dramatically, ranging from 40-70% of reads mapping across samples, mean of 52%.

8.3 Assembling counts matrix

As described in the last tutorial, after all samples have been mapped to the reference, the next step is to assemble the counts matrix using tximport in R on the server to be able to move from read mapping with Salmon to differential gene expression (DGE) analysis with DESeq2. See the previous tutorial for code to assemble the counts matrix.

8.4 Import counts matrix and metadata

Now we will work in R on our individual machines, each of us working with the complete data set (n=76, not just a subset of samples).

# Transcriptomics in R

# Set your working directory
setwd(here::here("myresults/transcript_counts"))

# Packages 
library(DESeq2)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
library(ggpubr)
library(wesanderson)
library(vsn)
library(pheatmap)

# Import data ------------------------------------------------------------------
countsTable <- read.table("RS_cds2kb_countsMatrix.txt", 
                          header=TRUE, 
                          row.names=1)
# head(countsTable)
# dim(countsTable)
# Need to round because DESeq wants only integers
countsTableRound <- round(countsTable)
# head(countsTableRound)

# Import the samples description table 
# links each sample to factors of the experimental design
# Need the colClasses otherwise imports "day" as numeric which DESeq doesn't do
conds <- read.delim("RS_samples.txt", 
                    header = TRUE, 
                    stringsAsFactors = TRUE, 
                    row.names = 1, 
                    colClasses = c('factor', 'factor', 'factor', 'factor'))
head(conds)
##                                  pop treatment day climate
## ASC_06_C_0_GAGTCC_R1.cl.fq    ASC_06         C   0      HD
## ASC_06_C_10_ACGTCT_R1.cl.fq   ASC_06         C  10      HD
## ASC_06_C_5_GTGTAG_R1.cl.fq    ASC_06         C   5      HD
## ASC_06_D_H_0_CGCAAC_R1.cl.fq  ASC_06         D   0      HD
## ASC_06_D_H_10_CAGCGT_R1.cl.fq ASC_06         D  10      HD
## ASC_06_H_0_AATAGC_R1.cl.fq    ASC_06         H   0      HD
dim(conds)
## [1] 76  4

8.5 Summary statistics and visualization

# Read summary statistics and visualization ------------------------------------
# colSums(countsTableRound)
# mean(colSums(countsTableRound))
# bar plot to vizualize the sum of reads
barplot(colSums(countsTableRound), 
        las = 3, 
        cex.names = 0.5,
        names.arg = substring(colnames(countsTableRound), 1, 13))
abline(h = mean(colSums(countsTableRound)), col = "blue", lwd = 2)

# # Average number of counts per gene
# rowSums(countsTableRound)
# apply(countsTableRound, 2, mean)
# Create a DESeq object --------------------------------------------------------
# define the experimental design here with the tilde

dds <- DESeqDataSetFromMatrix(countData = countsTableRound,
                              colData = conds,
                              design = ~ climate + day + treatment)

# Filter out genes with few reads
# 76 was chosen to create an average of 1 read per sample
dds <- dds[rowSums(counts(dds)) > 76]

# Run the DESeq model to test for differential gene expression -----------------
# DESeq does these three things all in this function
# 1) estimate size factors (per sample) 
# 2) estimate dispersion (per gene)
# 3) run negative binomial glm
dds <- DESeq(dds)

# List the results you've generated
resultsNames(dds)
## [1] "Intercept"        "climate_HD_vs_CW" "day_10_vs_0"      "day_5_vs_0"      
## [5] "treatment_D_vs_C" "treatment_H_vs_C"

8.6 Specific contrasts

# Order and list and summarize results from specific contrasts -----------------
# Set your adjusted p-value cutoff
# Can make summary tables of the number of genes differentially expressed 
# up/down-regulated) for each contrast

res_treatCD <- results(dds, contrast = c("treatment", "C", "D"), alpha = 0.05)
res_treatCD <- res_treatCD[order(res_treatCD$padj), ]
# summary(res_treatCD)
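The same pattern works for any other factor in the design; for example (a sketch), source climate of origin (the HD and CW levels come from the resultsNames output above):

# contrast hot/dry vs. cool/wet source climates
res_climate <- results(dds, contrast = c("climate", "HD", "CW"), alpha = 0.05)
res_climate <- res_climate[order(res_climate$padj), ]
summary(res_climate)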

8.7 MA Plot

# Data visualization -----------------------------------------------------------
# MA plot
ggmaplot(res_treatCD, 
         main = expression("Hot & Dry" %->% "Control"),
         fdr = 0.01, fc = 0, size = 1,
         palette = c("#B31B21", "#1465AC", "#A6A6A680"),
         legend = "top", 
         top = 0,
         # genenames = as.vector(res_treatCD$gene),
         # select.top.method = "padj",
         # font.label = c("bold", 11), 
         # label.rectangle = TRUE,
         font.legend = "bold",
         font.main = "bold",
         ggtheme = ggplot2::theme_classic()) + 
  theme(legend.text = element_text(size = 12))

8.8 PCA

# PCA
vsd <- vst(dds, blind = FALSE, nsub = 10000)
data <- plotPCA(vsd, 
                ntop = 10000,
                intgroup = c("climate", "treatment", "day"), 
                returnData = TRUE)
percentVar <- round(100 * attr(data, "percentVar"))

# reordering the factors
data$treatment <- factor(data$treatment, 
                         levels = c("C", "H", "D"), 
                         labels = c("C", "H", "D"))
data$day <- factor(data$day, 
                   levels = c("0", "5", "10"), 
                   labels = c("0","5","10"))

ggplot(data, aes(PC1, PC2, color = climate, shape=treatment)) +
  geom_point(size = 4, alpha = 0.85) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  theme_minimal()

ggplot(data) +
  geom_histogram(aes(x = PC1, fill = day), 
                 bins = 20, color = "grey50", position = "dodge") +
  theme_classic()

8.9 Specific gene counts

# Counts of specific top gene! 
# important validation that the normalization and model are working
d <- plotCounts(dds, 
                gene = "MA_10426407g0030", 
                intgroup = (c("treatment","climate", "day")), 
                returnData = TRUE)

ggplot(d, aes(x = treatment, 
                   y = count, 
                   shape = climate, 
                   color = day)) + 
  geom_jitter(width = 0.3, size = 3) +
  scale_x_discrete(limits = c("C", "H", "D")) +
  theme_classic() + 
  theme(text = element_text(size = 20), 
        panel.grid.major = element_line(colour = "grey"))

8.10 Heatmap

# Heatmap of top 20 genes sorted by pvalue
topgenes <- head(rownames(res_treatCD), 20)
mat <- assay(vsd)[topgenes, ]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(dds)[, c("treatment", "climate", "day")])
pheatmap(mat, 
         annotation_col = df, 
         show_colnames = FALSE)

9 Epigenetics

9.1 Learning objectives

  • Background on the copepod selection experimental design
  • Think about the hypotheses we can address with this experiment
  • Cover how bisulfite sequencing data differs from regular data and not to panic when you see your fastqc output
  • Begin mapping using Bismark

9.2 Copepod selection experiment

Acartia tonsa is a calanoid copepod with a worldwide distribution. It inhabits estuaries and coastal waters and is typically the most abundant zooplankton present. Because of their huge population sizes and global distribution, they play an important role in ocean biogeochemical cycling and ecosystem functioning. For example, they’re an important food base for many fishes. Given their broad distribution in dynamic habitats (estuaries), they can tolerate and live in a broad range of salinities, freshwater to seawater, and temperatures, among other stressors.

We know that the world is rapidly changing due to human activities and it is important to understand how species will respond to these changes. We were interested in understanding how A. tonsa could respond to stressful environments that will likely be caused by climate change. Can they adapt to rapid changes in temperature and acidity? How might epigenetic responses interact with adaptation to enable resilience?

A. tonsa is a great system to ask these questions for a number of reasons. First, their generation time is only ~10-20 days and they’re tiny! Only 1-2 mm. This means we can easily execute multi-generational studies with thousands of individuals. Second, because they can tolerate such variable environments, they should have lots of plasticity to respond to environmental fluctuations. Finally, their large population sizes mean that genetic diversity should be high, giving them high adaptive potential. This means that if any species can respond to rapid environmental change, it is likely A. tonsa.

9.3 Experimental design

We collected A. tonsa from the wild, common gardened them for three generations, then split them into four treatments with four replicates each and about 3,000-5,000 individuals per replicate. We left them to evolve in these conditions for 25 generations.

Samples were collected at generation F0 and F25 to quantify methylation frequencies using reduced representation bisulfite sequencing (RRBS). Below are the sample IDs:

Sample_ID Treatment treatment_abbrev generation sample_rep
AA_F00_1 Control: ambient temp, ambient CO2 AA F00 1
AA_F00_2 Control: ambient temp, ambient CO2 AA F00 2
AA_F00_3 Control: ambient temp, ambient CO2 AA F00 3
AA_F00_4 Control: ambient temp, ambient CO2 AA F00 4
AA_F25_1 Control: ambient temp, ambient CO2 AA F25 1
AA_F25_2 Control: ambient temp, ambient CO2 AA F25 2
AA_F25_3 Control: ambient temp, ambient CO2 AA F25 3
AA_F25_4 Control: ambient temp, ambient CO2 AA F25 4
AH_F25_1 ambient temp, high CO2 AH F25 1
AH_F25_2 ambient temp, high CO2 AH F25 2
AH_F25_3 ambient temp, high CO2 AH F25 3
AH_F25_4 ambient temp, high CO2 AH F25 4
HA_F25_1 high temp, ambient CO2 HA F25 1
HA_F25_2 high temp, ambient CO2 HA F25 2
HA_F25_3 high temp, ambient CO2 HA F25 3
HA_F25_4 high temp, ambient CO2 HA F25 4
HH_F25_1 high temp, high CO2 HH F25 1
HH_F25_2 high temp, high CO2 HH F25 2
HH_F25_3 high temp, high CO2 HH F25 3
HH_F25_4 high temp, high CO2 HH F25 4

9.4 RRBS

Reduced representation bisulfite sequencing. Focuses on CpG parts of the genome. Similar restriction enzyme process to RAD sequencing.

Following the adapter ligation, we bisulfite convert all unmethylated C’s to T’s.

Before starting we also spiked in a small amount of DNA from E. coli that we know wasn’t methylated. Using this, we can calculate downstream how efficient our bisulfite conversion was.

9.5 Hypotheses and questions

9.6 Pipeline

  • Visualize, clean, visualize
    • You won’t have to do this step
    • Visualize quality of raw data with fastqc
    • Clean raw data with Trimmomatic
    • Visualize quality of cleaned data with fastqc
  • Align to Acartia tonsa reference genome (Bismark)
  • Align lambda DNA to check for conversion efficiency (Done for you)
  • Extract methylation calls
  • Process and filter calls
  • Summary figures (PCA, clustering, etc)
  • Test for differential methylation (Methylkit)

9.7 Bisulfite sequence data visualization

Take a look at the fastqc files:

What do you notice that is different from fastqc that you’ve seen previously?

  • The base content is different: all the non-methylated cytosines have been converted to thymines, so there is very low cytosine content and higher thymine content than expected (and the reverse read shows the complementary base pattern).

Why do we see these differences?

  • Because of the bisulfite conversion of unmethylated C’s to T’s

9.8 Align with Bismark

Why is this different from typical DNA alignment?

  • Two separate alignments for the forward and the reverse genome, because of the bisulfite sequencing replacement
  • You are also generally less likely to get unique alignments because there are effectively only three bases, so there is less space to be unique.

What do we need to do to the genome?

  • The genome needs to be converted from C to T (and the complement for reverse reads) due to the bisulfite sequencing
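In practice this conversion is done once with Bismark’s genome preparation tool (presumably this has already been run for our server genome); the command looks like:

# build the bisulfite-converted (C->T and G->A) alignment indexes for the reference genome
bismark_genome_preparation /data/copepods/tonsa_genome/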

Sample assignments for mapping:

Person Sample
Bertrand B AA_F00_1
Erika B AA_F00_2
Kerria B AA_F00_3
Ben C AA_F00_4
Brendan AA_F25_1
Chege H. AA_F25_2
Alison H. AA_F25_3
Sandra N. AA_F25_4
Thomas O. AH_F25_1
Csenge P. AH_F25_2
Zoe P. AH_F25_3
Anoob P. AH_F25_4
Shervin R. HA_F25_1
Jorge R. HA_F25_2
Lily S. HA_F25_3
Baxter W. HA_F25_4
Katelynn W. HH_F25_1
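Once you know your sample, a paired-end Bismark alignment looks roughly like this (a sketch; the fastq file names and output directory are placeholders, so adjust to the actual file names on the server):

# paired-end bisulfite alignment against the prepared A. tonsa genome
bismark --bowtie2 --multicore 2 \
    --genome /data/copepods/tonsa_genome/ \
    -1 AA_F00_1_1.fq.gz \
    -2 AA_F00_1_2.fq.gz \
    -o ~/tonsa_epigenetics/analysis/aligned/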

9.9 Extract methylation calls and look for bias

Mapping report from Bismark

10 Epigenetics Analysis

10.1 Learning objectives

  1. Summarize, visualize, and filter methylation data
  2. Identify differential methylation
  3. Link to genes

10.2 Check the mapping rate

Look at mapping rate:

10.3 Extract methylation calls

bismark_methylation_extractor --bedGraph --scaffolds --gzip \
    --cytosine_report --comprehensive \
    --no_header \
    --parallel 6 \
    --output ~/tonsa_epigenetics/analysis/methylation_extract/ \
    --genome_folder /data/copepods/tonsa_genome/ \
    *pe.bam

Link to methylation extraction report

To ignore first two bases:

bismark_methylation_extractor --bedGraph --scaffolds --gzip \
    --cytosine_report --comprehensive \
    --no_header \
    --parallel 6 \
    --ignore 2 --ignore_r2 2 \
    --output ~/tonsa_epigenetics/analysis/methylation_extract/ \
    --genome_folder /data/copepods/tonsa_genome/ \
    *pe.bam

Link to methylation extraction report ignoring bias

Column 1: Chromosome
Column 2: Start position
Column 3: End position
Column 4: Methylation percentage
Column 5: Number of methylated C's
Column 6: Number of unmethylated C's

We can look at sites with some data using the following:

zcat HH_F25_4_1_bismark_bt2_pe.bismark.cov.gz | awk '$4 > 5 && $5 > 10' | head

10.4 Analyze with methylkit

library(methylKit)
library(tidyverse)
library(ggplot2)
library(pheatmap)

# Read in the raw methylation calls with methylKit -----------------------------

# Set directory
dir <- here::here("myresults/epigenetics")

# Read in the sample IDs
samples <- read.table(paste0(dir, "/", "sample_id.txt"), 
                      header = FALSE)
# Point to coverage files
files <- file.path(dir, samples$V1)
all(file.exists(files))
## [1] TRUE
# Convert to list
file.list <- as.list(files)
# Get the names only for naming our samples
nmlist <- as.list(gsub("_1_bismark_bt2_pe.bismark.cov.gz", "", samples$V1))

# use methRead to read in the coverage files
myobj <- methRead(location = file.list,
                  sample.id = nmlist,
                  assembly = "atonsa",
                  dbtype = "tabix",
                  context = "CpG",
                  resolution = "base",
                  mincov = 20,
                  treatment = c(0, 0, 0, 0,
                                1, 1, 1, 1,
                                2, 2, 2, 2,
                                3, 3, 3, 3,
                                4, 4, 4, 4),
                  pipeline = "bismarkCoverage",
                  dbdir = dir)

10.4.1 Visualize coverage and filter

# Visualize coverage and filter ------------------------------------------------

# We can look at the coverage for individual samples with getCoverageStats()
getCoverageStats(myobj[[1]], plot = TRUE)

# Plot all of our samples at once to compare.
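# One way to do this (a sketch): loop over the methylRawList and print a
# coverage histogram for each of the 20 samples in turn
for (i in 1:length(myobj)) {
  getCoverageStats(myobj[[i]], plot = TRUE)
}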


# Filter samples by depth
filtered.myobj <- filterByCoverage(myobj, 
                                   lo.count = 20,
                                   lo.perc = NULL,
                                   hi.count = NULL,
                                   hi.perc = 97.5,
                                   dbdir = dir)

10.4.2 Merge samples

# Merge samples ----------------------------------------------------------------
# Merge all the samples. We will require sites to be present in each sample
# Note this takes a couple of hours to run
meth <- unite(filtered.myobj, mc.cores = 3, suffix = "united", dbdir = dir)

10.4.3 Load merged samples

# Load a previously created database -------------------------------------------
# So Reid created this merged database and we are going to move on from here
meth <- methylKit:::readMethylBaseDB(
  dbpath = paste0(dir, "/methylBase_united.txt.bgz"),
  dbtype = "tabix",
  sample.id =   unlist(nmlist),
  assembly = "atonsa", # this is just a string. no actual database
  context = "CpG",
  resolution = "base",
  treatment = c(0,0,0,0,
                1,1,1,1,
                2,2,2,2,
                3,3,3,3,
                4,4,4,4),
  destrand = FALSE)

10.4.4 Compare across groups

# Methylation statistics across samples ----------------------------------------

# percMethylation() calculates the percent methylation for each site and sample
pm <- percMethylation(meth)

#plot methylation histograms
ggplot(gather(as.data.frame(pm)), aes(value)) + 
  geom_histogram(bins = 10, color = "black", fill = "grey") + 
  facet_wrap(~ key, ncol = 4)

# calculate and plot mean methylation
sp.means <- colMeans(pm)

p.df <- data.frame(sample = names(sp.means),
                   group = substr(names(sp.means), 1, 6),
                   methylation = sp.means)

ggplot(p.df, aes(x = group, y = methylation, color = group)) + 
  stat_summary(color = "black") + 
  geom_jitter(width = 0.1, size = 3) +
  theme_classic()

10.4.5 Summarize variation

# Summarize variation ----------------------------------------------------------

# Sample clustering
clusterSamples(meth, dist = "correlation", method = "ward.D", plot = TRUE)

## 
## Call:
## hclust(d = d, method = HCLUST.METHODS[hclust.method])
## 
## Cluster method   : ward.D 
## Distance         : pearson 
## Number of objects: 20
# PCA
PCASamples(meth, screeplot = TRUE)

PCASamples(meth, screeplot = FALSE)

10.4.6 Differential methylation

# Find differentially methylated sites between two groups ----------------------

# Subset groups
meth_sub <- reorganize(meth,
                       sample.ids = c("AA_F00_1","AA_F00_2","AA_F00_3", 
                                      "AA_F00_4","HH_F25_1","HH_F25_2",
                                      "HH_F25_3","HH_F25_4"),
                       treatment = c(0,0,0,0,1,1,1,1),
                       save.db = FALSE)

# Calculate differential methylation
myDiff <- calculateDiffMeth(meth_sub,
                            overdispersion = "MN",
                            mc.cores = 1,
                            suffix = "AA_HH",
                            adjust = "qvalue",
                            test = "Chisq")

Where MN corrects for overdispersion:

  • Fit a logistic regression to methylation values where explanatory variable is the treatment (case or control) and we compare the fit of the model with explanatory variable vs the null model (no explanatory variable) and ask if the fit is better using a Chisq test.

  • The methylation proportions are weighted by their coverage, as in a typical logistic regression. Note that in theory you could enter these as a two-column successes/failures data frame, which is common in logistic regressions (see the sketch after this list).

  • Use the overdispersion correction: a Chisq test without it finds more true positives but also many more false positives; a good compromise is overdispersion with Chisq, which slightly reduces the true positives but greatly reduces the false positive rate.
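As a conceptual illustration (not methylKit code), the per-site test is similar to fitting a logistic regression on the two-column successes/failures form and comparing it to the null model with a Chisq test; the counts below are made-up numbers for a single CpG site:

# hypothetical methylated (numCs) and unmethylated (numTs) counts for one site,
# four AA (control) and four HH (high temp + high CO2) samples
numCs <- c(45, 50, 38, 42, 12, 15, 10, 18)
numTs <- c( 5, 10, 12,  8, 38, 35, 40, 32)
trt   <- factor(c(rep("AA", 4), rep("HH", 4)))

fit  <- glm(cbind(numCs, numTs) ~ trt, family = binomial)  # model with treatment
null <- glm(cbind(numCs, numTs) ~ 1,   family = binomial)  # null model
anova(null, fit, test = "Chisq")                           # Chisq comparison of the two fits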

# Get all differentially methylated bases
myDiff <- getMethylDiff(myDiff, qvalue = 0.05, difference = 10)

# We can visualize the changes in methylation frequencies quickly.
hist(getData(myDiff)$meth.diff)

# Hyper-methylated bases
hyper <- getMethylDiff(myDiff, difference = 10, qvalue = 0.05, type = "hyper")

# Hypo-methylated bases
hypo <- getMethylDiff(myDiff, difference = 10, qvalue = 0.05, type = "hypo")

10.4.7 Plot differential methylation

# Plots of differentially methylated groups ------------------------------------
# Heatmaps -----
pm <- percMethylation(meth_sub)
# make a dataframe with snp id's, methylation, etc.
sig.in <- as.numeric(row.names(myDiff))
pm.sig <- pm[sig.in, ]

# add snp, chr, start, stop
din <- getData(myDiff)[,1:3]
df.out <- cbind(paste(getData(myDiff)$chr, 
                      getData(myDiff)$start, 
                      sep= ":"), 
                din, pm.sig)
colnames(df.out) <- c("snp", colnames(din), colnames(df.out[5:ncol(df.out)]))
df.out <- (cbind(df.out,getData(myDiff)[, 5:7]))

pheatmap(pm.sig, show_rownames = FALSE)

# You can also normalize to one of the groups
ctrmean <- rowMeans(pm.sig[,1:4])
h.norm <- (pm.sig-ctrmean)
pheatmap(h.norm, show_rownames = FALSE)

10.4.8 Specific genes & SNPs

# Methylation of specific gene or snp ------------------------------------------
df.plot <- df.out[,c(1,5:12)] %>% pivot_longer(-snp, values_to = "methylation")
df.plot$group <- substr(df.plot$name,1,2)
head(df.plot)
## # A tibble: 6 x 4
##   snp            name     methylation group
##   <fct>          <chr>          <dbl> <chr>
## 1 LS041600.1:376 AA_F00_1        84.4 AA   
## 2 LS041600.1:376 AA_F00_2        84.0 AA   
## 3 LS041600.1:376 AA_F00_3        84.3 AA   
## 4 LS041600.1:376 AA_F00_4        83.6 AA   
## 5 LS041600.1:376 HH_F25_1        76.7 HH   
## 6 LS041600.1:376 HH_F25_2        77.5 HH
# Looking at snp LS049205.1:248
# if you choose a different snp, you can create different plots.

df.plot %>% filter(snp=="LS049205.1:248") %>% 
            ggplot(., aes(x=group, y=methylation, color=group, fill=group)) +
              stat_summary(fun.data = "mean_se", size = 2) +
              geom_jitter(width = 0.1, size=3, pch=21, color="black")

# Write bed file for intersection with genome annotation
setwd(dir)
write.table(file = "diffmeth.bed",
          data.frame(chr = df.out$chr, 
                     start = df.out$start, 
                     end = df.out$end),
          row.names = FALSE, 
          col.names = FALSE, 
          quote = FALSE, 
          sep = "\t")