The command line (aka terminal or shell) is a way of interacting with your local computer or a remote server using typed commands or scripts, without a graphical user interface (GUI).
The first step is to open a terminal shell on your local computer. For MacOS users, this is called “Terminal”.
We’ll connect to our remote server running Unix using the secure shell (ssh) protocol. Our server’s name is pbio381, and we can connect to it using our UVM netid username and password (as long as we’re on-campus).
ip0af52fbf:papers tsoleary$ ssh tsoleary@pbio381.uvm.edu
tsoleary@pbio381.uvm.edu's password:
Last login: Tue Jan 31 10:51:15 2017 from ip040027.uvm.edu
[tsoleary@pbio381 ~]$
The log-in screen tells us some basic info on when we last logged in, and then gives us our current location in the filesystem (~) followed by the $ prompt, that tells us the computer is ready for our next command.
To see the full path to your current directory, use the pwd command:
[tsoleary@pbio381 ~]$ pwd
/users/s/r/tsoleary
[tsoleary@pbio381 ~]$
The path shows the full directory address up from the “root” of the file structure, which is the most basal level (appealing to all you phylogeneticists here…). The root is symbolized as “/” and each subdirectory is separated by an additional “/”. So, the full path to my working directory on the server is /users/s/r/tsoleary/
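For example, once you know a full path you can jump straight to it, or use the ~ shortcut to get back to your home directory (we'll meet the cd command properly in a moment; this is just to illustrate how absolute paths and ~ are used):

```bash
cd /users/s/r/tsoleary   # jump to a directory using its absolute path from the root
pwd                      # confirm where you are
cd ~                     # the ~ shortcut always takes you back to your home directory
```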
Let’s make a new folder (aka, directory) using the mkdir command. Let’s name this folder “mydata”
[tsoleary@pbio381 ~]$ mkdir mydata
[tsoleary@pbio381 ~]$ ll
total 0
drwxr-xr-x. 6 tsoleary users 82 Jan 31 17:21 archive
drwxr-xr-x. 2 tsoleary users 6 Jan 31 17:21 mydata
drwxr-xr-x. 2 tsoleary users 6 Jan 31 17:15 scripts
[tsoleary@pbio381 ~]$
Notice the new mydata folder in the listing above (we used the ll command to list the folders again). To move into a directory, we use the cd command. Let’s use cd to move inside the mydata/ directory and ll to list its contents:

[tsoleary@pbio381 ~]$ cd mydata/
[tsoleary@pbio381 mydata]$ ll
total 0
[tsoleary@pbio381 mydata]$
Now let’s take a look at the shared class space on the server, which lives in the /data/ directory. Use cd to navigate there and ll to list its contents:

[tsoleary@pbio381 mydata]$ cd /data/
[tsoleary@pbio381 data]$ ll
total 8
drwxr-xr-x. 5 root root 73 Jan 31 17:35 archive
drwxrwxr-x. 2 root pb381adm 40 Nov 30 2015 packages
drwxrwxr-x. 33 root pb381adm 4096 Nov 30 2015 popgen
drwxrwxr-x. 3 root pb381adm 42 Jan 30 09:08 project_data
drwxrwxr-x. 2 root pb381adm 6 Oct 2 2015 scripts
drwxr-xr-x. 18 root root 4096 Sep 2 2015 users
[tsoleary@pbio381 data]$
[tsoleary@pbio381 data]$ cd project_data/
[tsoleary@pbio381 project_data]$ ll
total 8
drwxr-xr-x. 12 tsoleary users 4096 Jan 30 09:06 archive
-rw-r--r--. 1 tsoleary users 1255 Jan 30 09:08 ssw_samples.txt
[tsoleary@pbio381 project_data]$
The file ssw_samples.txt is the one with the seastar metadata. We don’t want to open and make changes to this file in the shared space, because we don’t want to have our edits affect the rest of the group. So, let’s first make a copy of this file over to our home directory and put it inside the “mydata” folder. Use the cp command, followed by the filename and the path to your destination (remember the ~ signals your home directory, and each subdirectory is then separated by a /):

[tsoleary@pbio381 project_data]$ cp ssw_samples.txt ~/mydata/
[tsoleary@pbio381 project_data]$ cd ~/mydata/
[tsoleary@pbio381 mydata]$ ll
total 4
-rw-r--r--. 1 tsoleary users 1255 Jan 31 17:42 ssw_samples.txt
[tsoleary@pbio381 mydata]$
[tsoleary@pbio381 mydata]$ head ssw_samples.txt
Individual Trajectory Location Day3 Day6 Day9 Day12 Day15
10 HH INT 10_5-08_H 10_5-11_H 10_5-14_H 10_5-17_H 10_5-20_H
24 HH INT 24_5-08_H 24_5-11_H 24_5-14_H 24_5-17_H 24_5-20_H
27 HH INT 27_5-08_H 27_5-11_H 27_5-14_H 27_5-17_H 27_5-20_H
08 HS INT 08_5-08_H 08_5-11_S 08_5-14_S 08_5-17_S 08_5-20_S
09 HS INT 09_5-08_H 09_5-14_S 09_5-17_S 09_5-20_S
15 HS INT 15_5-08_H 15_5-11_H 15_5-14_H 15_5-17_S 15_5-20_S
19 HS INT 19_5-11_H 19_5-14_H 19_5-17_H 19_5-20_S
20 HS INT 20_5-08_H 20_5-11_H 20_5-14_H 20_5-17_H 20_5-20_S
03 SS INT 03_5-08_S 03_5-11_S
[tsoleary@pbio381 mydata]$ grep 'HH' ssw_samples.txt
10 HH INT 10_5-08_H 10_5-11_H 10_5-14_H 10_5-17_H 10_5-20_H
24 HH INT 24_5-08_H 24_5-11_H 24_5-14_H 24_5-17_H 24_5-20_H
27 HH INT 27_5-08_H 27_5-11_H 27_5-14_H 27_5-17_H 27_5-20_H
31 HH SUB 31_6-12_H 31_6-15_H 31_6-18_H 31_6-21_H 31_6-24_H
32 HH SUB 32_6-12_H 32_6-15_H 32_6-18_H 32_6-21_H
33 HH SUB 33_6-12_H 33_6-15_H 33_6-18_H 33_6-21_H 33_6-24_H
34 HH SUB 34_6-12_H 34_6-15_H 34_6-18_H 34_6-21_H 34_6-24_H
35 HH SUB 35_6-12_H 35_6-15_H 35_6-18_H 35_6-21_H
[tsoleary@pbio381 mydata]$
[tsoleary@pbio381 mydata]$ grep 'HH' ssw_samples.txt >ssw_HHonly.txt
[tsoleary@pbio381 mydata]$ ll
total 8
-rw-r--r--. 1 tsoleary users 462 Jan 31 20:46 ssw_HHonly.txt
-rwxrwxr-x. 1 tsoleary users 1255 Jan 31 17:42 ssw_samples.txt
[tsoleary@pbio381 mydata]$
[tsoleary@pbio381 mydata]$ grep 'SS' ssw_samples.txt >ssw_SSonly.txt
[tsoleary@pbio381 mydata]$ ll
total 12
-rw-r--r--. 1 tsoleary users 462 Jan 31 20:46 ssw_HHonly.txt
-rwxrwxr-x. 1 tsoleary users 1255 Jan 31 17:42 ssw_samples.txt
-rw-r--r--. 1 tsoleary users 342 Jan 31 20:48 ssw_SSonly.txt
[tsoleary@pbio381 mydata]$
grep is a useful search tool and has many additional features for sorting and outputting the results. These kinds of search patterns are called “regular expressions”, or “regexp”, and are one of the most powerful tools for working with large text files. If you want to learn more about grep and its regexp capabilities, you can look at the man page, or manual. In fact, every UNIX command-line program has a built-in man page that you can call up to help you. Just type man and then the program name, and it will give you the manual (small excerpt shown below).

[tsoleary@pbio381 mydata]$ man grep
GREP(1) General Commands Manual GREP(1)
NAME
grep, egrep, fgrep - print lines matching a pattern
SYNOPSIS
grep [OPTIONS] PATTERN [FILE...]
grep [OPTIONS] [-e PATTERN | -f FILE] [FILE...]
DESCRIPTION
grep searches the named input FILEs (or standard input if no files are named, or if a
single hyphen-minus (-) is given as file name) for lines containing a match to the
given PATTERN. By default, grep prints the matching lines.
In addition, two variant programs egrep and fgrep are available. egrep is the same
as grep -E. fgrep is the same as grep -F. Direct invocation as either egrep or
fgrep is deprecated, but is provided to allow historical applications that rely on
them to run unmodified.
OPTIONS
Generic Program Information
--help Print a usage message briefly summarizing these command-line options and the
bug-reporting address, then exit.
-V, --version
Print the version number of grep to the standard output stream. This version
number should be included in all bug reports (see below).
Matcher Selection
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is
specified by POSIX.)
-F, --fixed-strings, --fixed-regexp
Interpret PATTERN as a list of fixed strings, separated by newlines, any of
which is to be matched. (-F is specified by POSIX, --fixed-regexp is an
obsoleted alias, please do not use it in new scripts.)
-G, --basic-regexp
Interpret PATTERN as a basic regular expression (BRE, see below). This is the
default.
-P, --perl-regexp
Interpret PATTERN as a Perl regular expression. This is highly experimental
and grep -P may warn of unimplemented features.
We can also combine commands by “piping” (|) the output of one command into another. For example, what if we want to know how many samples in the file come from the intertidal (INT)? We can use grep to do the search, and pipe the results to the command wc, which will tally up the number of lines, words, and characters in the output…voila!

[tsoleary@pbio381 mydata]$ grep 'INT' ssw_samples.txt | wc
16 106 762
[tsoleary@pbio381 mydata]$
Now, what if we want to move the files we created that contain just the individuals of a particular disease status? There’s a way to do this quickly using the wildcard character *. The wildcard * takes the place of any character, and in fact any number of characters.

1. Make a new directory called sample_by_disease/ inside the mydata/ folder.
2. Move all files that contain the word “only” into the new directory using the mv command.
[tsoleary@pbio381 mydata]$ mkdir sample_by_disease/
[tsoleary@pbio381 mydata]$ ll
total 12
drwxr-xr-x. 2 tsoleary users 10 Jan 31 21:12 sample_by_disease
-rw-r--r--. 1 tsoleary users 462 Jan 31 20:46 ssw_HHonly.txt
-rwxrwxr-x. 1 tsoleary users 1255 Jan 31 17:42 ssw_samples.txt
-rw-r--r--. 1 tsoleary users 342 Jan 31 20:48 ssw_SSonly.txt
[tsoleary@pbio381 mydata]$ mv *only* sample_by_disease/
[tsoleary@pbio381 mydata]$ ll
total 4
drwxr-xr-x. 2 tsoleary users 60 Jan 31 21:12 sample_by_disease
-rwxrwxr-x. 1 tsoleary users 1255 Jan 31 17:42 ssw_samples.txt
[tsoleary@pbio381 mydata]$ cd sample_by_disease/
[tsoleary@pbio381 sample_by_disease]$ ll
total 8
-rw-r--r--. 1 tsoleary users 462 Jan 31 20:46 ssw_HHonly.txt
-rw-r--r--. 1 tsoleary users 342 Jan 31 20:48 ssw_SSonly.txt
[tsoleary@pbio381 sample_by_disease]$
To delete a file, use the rm command. However, in its default mode, UNIX will not ask if you really mean it before getting rid of it forever(!), so this can be dangerous if you’re not paying attention. For example, say we used the grep command to pull out the seastar samples that started healthy and then became sick (HS). But perhaps we later decide we’re not going to work with those samples, so we use rm to delete that file:

[tsoleary@pbio381 mydata]$ ll
total 8
drwxr-xr-x. 2 tsoleary users 60 Jan 31 21:12 sample_by_disease
-rw-r--r--. 1 tsoleary users 282 Feb 1 05:35 ssw_HSonly.txt
-rwxrwxr-x. 1 tsoleary users 1255 Jan 31 17:42 ssw_samples.txt
[tsoleary@pbio381 mydata]$ rm ssw_HSonly.txt
[tsoleary@pbio381 mydata]$ ll
total 4
drwxr-xr-x. 2 tsoleary users 60 Jan 31 21:12 sample_by_disease
-rwxrwxr-x. 1 tsoleary users 1255 Jan 31 17:42 ssw_samples.txt
[tsoleary@pbio381 mydata]$
Luckily, we can customize how rm behaves for our account so that it asks before deleting. To do this:
1. cd to your home directory (~/)
2. List all files, including hidden ones, with ll -a.
3. Edit the hidden file .bashrc so that rm confirms deletion with us. To edit text files on the fly in UNIX, you can use the built-in text editor, “vim”: vim .bashrc
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
# Uncomment the following line if you don't like systemctl's auto-paging feature:
# export SYSTEMD_PAGER=
# User specific aliases and functions
Use your arrow keys to move your cursor down to the last line, below “# User specific aliases and functions”. This is where we’re going to insert our new function.
By default, vim is in read-only mode when it opens files. To go into edit mode, press the “i” key (for “insert”). You are now able to make changes to the file.
Add the following text on a new line directly below the “# User specific…” line:
alias rm='rm -i'
Your file should now look like this:
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
# Uncomment the following line if you don't like systemctl's auto-paging feature:
# export SYSTEMD_PAGER=
# User specific aliases and functions
alias rm='rm -i'
You’re now ready to get out of edit mode (hit the escape key), save your changes (type :w), and exit vim (type :q).
These changes won’t take effect until you log out (type exit
to log out of the server). But from now on, every time you log in, the server will remember that you want a reminder before deleting any of your work.
To review, here are the basic commands we covered today:

- Log in to the server: ssh netid@pbio381.uvm.edu
- Show the path to your current directory: pwd
- List a directory’s contents and move around: ll, cd
- Make a new folder: mkdir foldername
- Navigate to the shared class space and list its contents:

[tsoleary@pbio381 ~]$ cd /data/
[tsoleary@pbio381 data]$ ll
total 8
drwxr-xr-x. 5 root root 73 Jan 31 17:35 archive
drwxrwxr-x. 2 root pb381adm 40 Nov 30 2015 packages
drwxrwxr-x. 33 root pb381adm 4096 Nov 30 2015 popgen
drwxrwxr-x. 3 root pb381adm 42 Jan 30 09:08 project_data
drwxrwxr-x. 2 root pb381adm 6 Oct 2 2015 scripts
drwxr-xr-x. 18 root root 4096 Sep 2 2015 users
[tsoleary@pbio381 data]$

- Copy or move files: cp filename destinationpath/ or mv filename destinationpath/
- Peek at the beginning or end of a file: head filename, tail filename
- Search for matching lines within a file: grep 'search string' filename
- Redirect search results to a new file: grep 'search string' filename >outputfilename
- Use wildcards to act on many files at once: mv *.txt ~/newfolder
- Pipe the output of one command into another: grep 'INT' filename | wc
- Delete files (carefully!): rm
- Edit text files: vim filename
- Handy UNIX cheat sheet for helping to remember some of these commonly used commands (and others)
- Here’s another useful UNIX cheatsheet
Red spruce is a coniferous tree that plays a prominent role in montane communities throughout the Appalachians.
It thrives in the cool, moist climates of the high-elevation mountains of the Appalachians and northward along the coastal areas of Atlantic Canada.
One region where populations are particularly vulnerable to climate change is the low-latitude trailing edge of the range, from Maryland to Tennessee, where:
- populations are highly fragmented and isolated on mountaintops.
- These “island” populations are remnants of spruce forests that covered the southern U.S. when glaciers extended as far south as Long Island, NY.
- As the climate warmed at the end of the Pleistocene (~20K years ago), red spruce retreated upward in elevation to these mountaintop refugia, where they are now highly isolated from other such stands and from the core of the range further north.
With funding from the National Science Foundation, the Keller Lab is studying the genetic basis of climate adaptation across the distribution of P. rubens.
Our main goals are to:
characterize the genetic diversity and population structure across the range
identify regions of the genome that show evidence of positive selection in response to climate gradients
map the genetic basis of climate adaptive phenotypes
We hope to use this information to inform areas of the range most likely to experience climate maladaptation, and to help guide mitigation strategies.
In 2017, we collected seeds and needle tissue from 340 mother trees at 65 populations spread throughout the range.
- Extracted whole genomic DNA from needles to use for exome capture sequencing.
- Sample size in the edge region = 110 mother trees from 23 populations.
- Exome capture was designed based on transcriptomes from multiple tissues and developmental stages in the related species, white spruce (P. glauca). Bait design used 2 transcriptomes previously assembled by Rigault et al. (2011) and Yeaman et al. (2014).
- A total of 80,000 120-bp probes were designed, including 75,732 probes within or overlapping exomic regions, and an additional 4,268 probes in intergenic regions.
- Each probe was required to represent a single blast hit to the P. glauca reference genome of at least 90 bp and 85% identity, covering 38,570 unigenes.
- Libraries were made by random mechanical shearing of DNA (250 ng - 1 ug) to an average size of 400 bp, followed by an end-repair reaction, ligation of an adenine residue to the 3’-end of the blunt-end fragments to allow the ligation of barcoded adapters, and PCR amplification of the library. SureSelect probes (Agilent Technologies: Santa Clara, CA) were used for solution-based targeted enrichment of pools of 16 libraries, following the SureSelectxt Target Enrichment System for Illumina Paired-End Multiplexed Sequencing Library protocol.
- Libraries were sequenced on a single run of an Illumina HiSeq X to generate paired-end 150-bp reads.
Whenever you get a new batch of NGS data, the first step is to look at the quality of the data coming off the sequencer and see if we notice any problems with base quality, sequence length, PCR duplicates, or adapter contamination.
The data come as .fastq files. A fastq file is the standard sequence data format for NGS. It contains the sequence of the read itself, the corresponding quality scores for each base, and some meta-data about the read.
The files are big (typically many Gb compressed), so we can’t open them completely. Instead, we can peek inside the file using head. But since these files are compressed (note the .gz ending in the filenames), and we want them to stay compressed while we peek, bash has a solution called zcat. This lets us look at the .gz file without uncompressing it all the way.
The fastq files are in this path: /data/project_data/RS_ExomeSeq/fastq/edge_fastq
cd /data/project_data/RS_ExomeSeq/fastq/edge_fastq
zcat AB_05_R1_fastq.gz | head -n 4
@GWNJ-0842:368:GW1809211440:2:1101:17168:1907 1:N:0:NGAAGAGA+NTTCGCCT
GATGGGATTAGAGCCCCTGAAGGCTGATAGAACTTGAGTTTCACAGGCTCATTGCATTGAAGTGGCATTTGTGTGAATGCAGAGGAGGTACATAGGTCCTCGAGAATAAAAGAGATGTTGCTCCTCACCAAAATCAGTACAGATTATTTT
+
A<A-F<AFJFJFJA7FJJJJFFJJJJJJ<AJ-FJJ7-A-FJAJJ-JJJA7A7AFJ<FF--<FF7-AJJFJFJA-<A-FAJ<AJJ<JJF--<A-7F-777-FA77---7AJ-JF-FJF-A--AJF-7FJFF77F-A--7<-F--77<JFF<
Note: zcat lets us open a .gz (gzipped) file; we then “pipe” (|) this output from zcat to the head command and print just the top 4 lines (-n 4).
The .fastq file format has 4 lines for each read:

Line | Description |
---|---|
1 | Always begins with ‘@’ and then information about the read |
2 | The actual DNA sequence |
3 | Always begins with a ‘+’ and sometimes the same info as in line 1 |
4 | A string of characters representing the quality scores; always has the same number of characters as line 2 |
Here’s a useful reference for understanding Quality (Phred) scores. If P is the probability that a base call is an error, then:
\[P = 10^{-Q/10}\]
\[Q = -10 \log_{10}(P)\]
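For instance, plugging a couple of common values into the formula above:

\[Q = 20 \Rightarrow P = 10^{-20/10} = 0.01 \quad (\text{a 1 in 100 chance the base call is wrong})\]
\[Q = 30 \Rightarrow P = 10^{-30/10} = 0.001 \quad (\text{a 1 in 1000 chance})\]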
We’re going to use the program FastQC (already installed on our server). FastQC looks at the quality collectively across all reads in a sample.
First, let’s make a new dir within myresults to hold the outputs
mkdir ~/<myrepo>/myresults/fastqc
Then, the basic FastQC command is like this:
fastqc FILENAME.fastq.gz -o outputdirectory/
This will generate an .html output file for each input file you’ve run.
But, we want to be clever and process multiple files (i.e., ALL files from our population) without having to manually submit each one. We can do this by writing a bash script that contains a loop.
The basic syntax of a bash loop is like this:
for file in myfiles
do
command 1 -options ${file}
command 2 -options ${file}
done
Note the use of variable assignment using ${}. We define the word file in the for loop as the variable of interest, and then call the iterations of it using ${file}. For example, we could use the wildcard character (*) in a loop to call all files that include the population code “AB” and then pass those filenames in a loop to fastqc. Something like:
for file in AB*fastq.gz
do
fastqc ${file} -o ~/<myrepo>/myresults/fastqc
done
Let’s write the above into a script using the Vim text editor at the command line. Type vim to get into the editor, then type “i” to enter INSERT mode. You can then type your script (remember to make necessary changes to the population code and output path). Lastly, to save the file and quit Vim, hit the ESCAPE key to get out of INSERT mode, followed by :wq fastqc.sh
Back at the command line, you should be able to ll and see your new script!
You may find that you need to change the permission on your script to make it executable. Do this using chmod u+x, which changes the permissions to give the user (you) permission to execute (x). Then give it a run!
chmod u+x fastqc.sh # makes the script "executable" by the "user"
./fastqc.sh # executes the script
It’ll take just a couple of minutes per fastq file. Once you’ve got results, let’s look at them by pushing your html files up to your Github. Remember,
git pull
git add --all .
git commit -m "note"
git push
Once synced to Github, use Github desktop to pull them down to your laptop where you can open them with a browser.
How does the quality look?
We’ll use the Trimmomatic program to clean the reads for each file. The program is already installed on our server.
We’ve provided an example script in the /data/scripts/ directory this time, because the program is java-based and thus a bit more particular in its call.
1. Copy the bash script over to your ~/myrepo/myscripts directory.
2. Open and edit the bash script using the program vim. Edit the file so that you’re trimming the fastq files for the population assigned to you.
3. Change the permissions on your script to make it executable, then run it! (examples below)
cp /data/scripts/trim_loop.sh ~/myrepo/myscripts/ # copies the script to your home scripts dir
vim trim_loop.sh # open the script with vim to edit
This time we use variable substitution to call the name of the R1 read in each pair, define the name for the second read in the pair (R2), and create a basename that only contains the “pop_ind” part of the name, i.e. AB_05:
R2=${R1/_R1_fastq.gz/_R2_fastq.gz} # defines the name for the second read in the pair (R2) based on knowing the R1 name (the file names are identical except for the R1 vs. R2 designation)
f=${R1/_R1_fastq.gz/} # creates a new variable from R1 that has the "_R1_fastq.gz" stripped off
name=`basename ${f}` # calls the handy "basename" function to define a new variable containing only the very last part of the filename while stripping off all the path information. This gets us the "AB_05" bit we want.
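If the substitution syntax is new to you, here’s a quick sketch you can paste into the terminal to see what each variable ends up holding (using a made-up R1 filename):

```bash
R1="AB_05_R1_fastq.gz"              # example forward-read filename
R2=${R1/_R1_fastq.gz/_R2_fastq.gz}  # -> AB_05_R2_fastq.gz
f=${R1/_R1_fastq.gz/}               # -> AB_05 (the _R1_fastq.gz part stripped off)
name=`basename ${f}`                # -> AB_05 (also strips any leading path)
echo ${R1} ${R2} ${name}
```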
Here’s how it should look (replace AB with your population name):
trim.sh
#!/bin/bash
cd /data/project_data/RS_ExomeSeq/fastq/edge_fastq
for R1 in AB*R1_fastq.gz
do
R2=${R1/_R1_fastq.gz/_R2_fastq.gz}
f=${R1/_R1_fastq.gz/}
name=`basename ${f}`
java -classpath /data/popgen/Trimmomatic-0.33/trimmomatic-0.33.jar org.usadellab.trimmomatic.TrimmomaticPE \
-threads 1 \
-phred33 \
"$R1" \
"$R2" \
/data/project_data/RS_ExomeSeq/fastq/edge_fastq/pairedcleanreads/${name}_R1.cl.pd.fq \
/data/project_data/RS_ExomeSeq/fastq/edge_fastq/unpairedcleanreads/${name}_R1.cl.un.fq \
/data/project_data/RS_ExomeSeq/fastq/edge_fastq/pairedcleanreads/${name}_R2.cl.pd.fq \
/data/project_data/RS_ExomeSeq/fastq/edge_fastq/unpairedcleanreads/${name}_R2.cl.un.fq \
ILLUMINACLIP:/data/popgen/Trimmomatic-0.33/adapters/TruSeq3-PE.fa:2:30:10 \
LEADING:20 \
TRAILING:20 \
SLIDINGWINDOW:6:20 \
MINLEN:35
done
Trimmomatic performs the cleaning steps in the order they are presented. It’s recommended to clip adapters early in the process and clean for length at the end.
The steps and options are from the Trimmomatic website:
The steps and options are from the Trimmomatic website:
- ILLUMINACLIP: Cut adapter and other Illumina-specific sequences from the read.
- LEADING: Cut bases off the start of a read, if below a threshold quality.
- TRAILING: Cut bases off the end of a read, if below a threshold quality.
- SLIDINGWINDOW: Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold.
- MINLEN: Drop the read if it is below a specified length.
Now that we have cleaned and trimmed read pairs, we’re ready to map them against the reference genome.
We’ll be using a reduced reference genome based on selecting only those scaffolds of the full genome reference that contain at least one bait. We’ve placed it on our server here: /data/project_data/RS_ExomeSeq/ReferenceGenomes/Pabies1.0-genome_reduced.fa
The reference genome is based on Norway spruce (P. abies) and is available from congenie.org.
We’ll use the program bwa, which is a very efficient and very well-vetted read mapper. Lots of others exist and can be useful to explore for future datasets. We tried several, and for our exome data, bwa seems to be the best.
We are going to write a bash script together that calls the R1 and R2 reads for each individual in our population, and uses the bwa-mem algorithm to map reads to the reference genome. The resulting output will be a .sam
file alignment. The basic bwa
command we’ll use is:
bwa mem -t 1 -M -a ${ref} ${forward} ${reverse} > ${output}/BWA/${name}.sam
# where:
-t # the number of threads, or computer cpus to use (in this case, just 1)
-M # labels a read with a special flag if its mapping is split across >1 contig
-a # keeps alignments involving unpaired reads
${ref} # specifies the path and filename for the reference genome
${forward} # specifies the path and filename for the cleaned and trimmed R1 reads
${reverse} # specifies the path and filename for the cleaned and trimmed R2 reads
>${output}/BWA/${name}.sam # directs the .sam file to be saved into a directory called BWA
# Other bwa options detailed here: bwa manual page
mapping.sh
#! /bin/bash
# this script is to run the read mapping using the bwa program
# -t 1 is that we are using only one thread
# -M is a special flag if its mapping is split across more than one contig
# this just allows it to map across contigs
# ${ref} variable ref that we will define as the reference genome
# then it will redirect the output file .sam to the output dir we made
# > redirects the output to save it to a file, otherwise bash would just vomit
# the words on the screen
ref="/data/project_data/RS_ExomeSeq/ReferenceGenomes/Pabies1.0-genome_reduced.fa"
# a loop to map each individual within my population
# ${input} is the path to the cleaned and paired reads for my population (defined in the wrapper script below)
for forward in ${input}*_R1.cl.pd.fq
do
# define the reverse read based on that forward read
# then define the forward read as the whole text
# and replace it with nothing so that you only have the base part of the name
# having the slash inside the {} means basically find and replace... so the
# first part of the slash is interpreted as the find and the second part is
# what to replace it with
reverse=${forward/_R1.cl.pd.fq/_R2.cl.pd.fq}
f=${forward/_R1.cl.pd.fq/}
# basename is a function in bash that creates the basename that will be used
# for naming the rest of the files
name=`basename ${f}`
bwa mem -t 1 -M ${ref} ${forward} ${reverse} > ${output}/BWA/${name}.sam
done
samtools and sambamba

Our last steps (with basic syntax) are to:

1. Convert our .sam files to the more efficient binary version .bam and sort them:

sambamba-0.7.1-linux-static view -S --format=bam file.sam -o file.bam

2. Get rid of any PCR duplicate sequences (why are these a problem?) and re-sort after removing duplicates:

sambamba-0.7.1-linux-static markdup -r -t 1 file.bam file.rmdup.bam
samtools sort file.rmdup.bam -o file.sorted.rmdup.bam

3. Get some quick stats on how well the mapping worked and the depth of coverage:

samtools flagstat file.sorted.rmdup.bam | awk 'NR>=5&&NR<=13 {print $1}' | column -x
samtools depth file.sorted.rmdup.bam | awk '{sum+=$3} END {print sum/NR}'
For this, we’ll use a combination of two new programs: samtools and sambamba. Samtools was written by Heng Li, the same person who wrote bwa, and is a powerful tool for manipulating sam/bam files. Sambamba is derived from samtools, and has been re-coded to increase efficiency (speed). We’ll use them both at different steps.
I’ve put a bash script with the commands to run steps 1-3 above. It’s located here:
/data/scripts/process_bam.sh
process_bam.sh
#! /bin/bash
# this is where our output sam files are going to get converted into binary
# format (bam)
# then we are going to sort the bam files, remove PCR duplicates, and index them
# first convert sam to bam and then sort
for f in ${output}/BWA/${mypop}*.sam
do
out=${f/.sam/}
# this converts it to the bam
sambamba-0.7.1-linux-static view -S --format=bam ${f} -o ${out}.bam
# now we need to sort the bam files by position so that reads can be found quickly downstream
samtools sort ${out}.bam -o ${out}.sorted.bam
done
# next remove the pcr duplicates from our bam files
for file in ${output}/BWA/${mypop}*.sorted.bam
do
f=${file/.sorted.bam/}
# this is going to mark duplicates if they have the exact same coordinates; because fragments are sheared randomly, reads that start and end at exactly the same positions are most likely PCR duplicates
# -r removes all dups except for one
# no dash o -o needed for the output
# just calling the command on the server without any of the options gives you the help information
sambamba-0.7.1-linux-static markdup -r -t 1 ${file} ${f}.sorted.rmdup.bam
done
# finish by indexing the files
for file in ${output}/BWA/${mypop}*sorted.rmdup.bam
do
samtools index ${file}
done
Last step: we’re going to put these scripts all together into a “wrapper” that will execute each of them one after the other, and work for us while we’re off getting a coffee or sleeping. :) I’ll show you how to code this together in class.
mypipeline.sh
#! /bin/bash
# we will use this as a wrapper to run our different scripts
# path to my repo on the server
myrepo="/users/t/s/tsoleary/eco_genomics"
# define the population
mypop="BFA"
# directory to our cleaned and paired reads
input="/data/project_data/RS_ExomeSeq/fastq/edge_fastq/pairedcleanreads/${mypop}"
# define an output directory in the common space not your own repo
output="/data/project_data/RS_ExomeSeq/mapping"
# call the mapping.sh and process_bam.sh files, you could also add the stats but for the sake of time we will do that later
source ./mapping.sh
source ./process_bam.sh
Once your wrapper script is ready, you’re going to want to start a screen. The screen command initiates a new shell window that won’t interrupt or stop your work if you close your computer, log off the server, or leave the UVM network. Anytime you’re running long jobs, you definitely want to use screen.

Using it is easy. Just type screen followed by Enter. It will take you to a new, empty terminal. You can then start your wrapper bash script and see that it starts running. Once everything looks good, you have to detach from the screen by typing Ctrl-A + Ctrl-D. If you don’t do this, you’ll lose your work!
When you’re ready to check back on the progress of your program, you can recover your screen by typing screen -r. That’ll re-attach you back to your program!
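Putting that together, a typical screen session for this pipeline looks something like this (a sketch; the script name matches the wrapper above):

```bash
screen               # start a new screen session (opens a fresh, empty shell)
bash mypipeline.sh   # kick off the wrapper inside the screen
# detach with Ctrl-A then Ctrl-D, then log off and let it run
screen -r            # later: re-attach to check on progress
```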
Your .sam alignment files are being written to /data/project_data/RS_ExomeSeq/mapping/BWA/. Take a peek at one of them, for example with: tail -n 100 FILENAME.sam

A SAM file is a tab-delimited text file that stores information about the alignment of reads in a FASTQ file to a reference genome or transcriptome. For each read in a FASTQ file, there’s a line in the SAM file that includes the read (query) name, a bitwise FLAG describing the mapping, the name of the reference contig it mapped to, the leftmost mapping position, the mapping quality, a CIGAR string describing the alignment, and the read sequence and its base quality scores.
The left (R1) and right (R2) reads alternate through the file. SAM files usually have a header section with general information where each line starts with the ‘@’ symbol. SAM and BAM files contain the same information; SAM is human readable and BAM is in binary code and therefore has a smaller file size.
The official Sequence AlignMent (SAM) file documentation can be found [here](https://en.wikipedia.org/wiki/SAM_(file_format)) or, more officially, in the SAM format specification.
We can use the program samtools, written by Heng Li, the same person who wrote bwa. It is a powerful tool for manipulating sam/bam files.
The command flagstat
gets us some basic info on how well the mapping worked:
samtools flagstat FILENAME.sam
bam_stats.sh
#! /bin/bash
# set repo
myrepo="/users/t/s/tsoleary/eco_genomics"
mypop="BFA"
output="/data/project_data/RS_ExomeSeq/mapping"
# this echo is creating a header for the file and saving it on the flagstats.txt files
echo "Num.reads R1 R2 Paired MateMapped Singletons MateMappedDiffChr" > ${myrepo}/myresults/${mypop}.flagstats.txt
for file in ${output}/BWA/${mypop}*sorted.rmdup.bam
do
f=${file/.sorted.rmdup.bam/}
name=`basename ${f}`
# the double >> makes it so that it appends it and doesn't just rewrite the file every time
echo ${name} >> ${myrepo}/myresults/${mypop}.names.txt
# NR is the number of rows, row 6 through 12 have the meaningful results in them
samtools flagstat ${file} | awk 'NR>=6&&NR<=12 {print $1}' | column -x
done >> ${myrepo}/myresults/${mypop}.flagstats.txt
# calculate the depth of coverage from our bam files
for file in ${output}/BWA/${mypop}*sorted.rmdup.bam
do
samtools depth ${file} | awk '{sum+=$3} END {print sum/NR}'
done >> ${myrepo}/myresults/${mypop}.coverage.txt
We’ll also use the awk tool to help format the output. We’ll use the samtools flagstat command to get read counts after mapping, and the depth command to get perhaps the most important statistic in read mapping: the depth of coverage, or how many reads cover each mapped position, on average.
Put the following into a loop for each individual for which you’ve generated a .sorted.rmdup.bam file:
samtools flagstat file.sorted.rmdup.bam | awk 'NR>=6&&NR<=13 {print $1}' | column -x >> ${myrepo}/myresults/${mypop}.flagstats.txt
samtools depth file.sorted.rmdup.bam | awk '{sum+=$3} END {print sum/NR}' >> ${myrepo}/myresults/${mypop}.coverage.txt
While that’s running, we can take a look at one of our alignment files (sam or bam) using an integrated viewer in samtools called tview. To use it, simply call the program and command, followed by the sam/bam file you want to view and the path to the reference genome. For example:
samtools tview /data/project_data/RS_ExomeSeq/mapping/BWA/AB_05.sorted.rmdup.bam /data/project_data/RS_ExomeSeq/ReferenceGenomes/Pabies1.0-genome_reduced.fa
Inference of population genetics from the aligned sequence data. Should we call genotypes?
Many of the papers you’ll read that do popgen on NGS data have a SNP calling step that results in a specific genotype being called for each SNP site for each individual. For example,
SNP | Ind1 | Ind 2 |
---|---|---|
1 | CC | CT |
2 | AG | AA |
3 | GT | TT |
But how do we know that Ind1 is homozygous at SNP-1 (CC)? Couldn’t it be CT and we just didn’t have enough coverage to observe the second allele?
The basic problem is that read data are counts that produce a binomial (actually multinomial) distribution of allele calls at a given site, and if you have few reads, you might by chance not observe the true genotype. So, what’s the right thing to do?
As with almost anything in statistics, the right thing to do is not throw away that uncertainty, but instead incorporate it into your analysis. That’s what we’re going to do…
Genotype-free population genetics using genotype likelihoods

A growing movement in popgen analysis of NGS data is embracing the use of genotype likelihoods to calculate stats based on each individual having a likelihood (probability) of being each genotype.
A genotype likelihood (GL) is essentially the probability of observing the sequencing data (reads containing a particular base), given the genotype of the individual at that site.
These probabilities are modeled explicitly in the calculation of population diversity stats like pi, Tajima’s D, Fst, PCA, etc., thus not throwing out any precious data, but also making fewer assumptions about the true (unknown) genotype at each locus.
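To make this concrete, here is one common simplified formulation (the samtools model behind ANGSD’s -GL 1 is a refinement of this idea, so treat this as a sketch rather than the exact formula). For a site with observed read bases \(b_1, \dots, b_n\), per-base error probabilities \(e_i\) (from the Phred scores), and a candidate diploid genotype \(G = A_1A_2\):

\[P(\text{reads} \mid G) = \prod_{i=1}^{n}\left[\tfrac{1}{2}P(b_i \mid A_1) + \tfrac{1}{2}P(b_i \mid A_2)\right], \qquad P(b_i \mid A) = \begin{cases} 1 - e_i & \text{if } b_i = A \\ e_i/3 & \text{if } b_i \neq A \end{cases}\]

With only a few reads, several genotypes can have similar likelihoods; that uncertainty is exactly what gets carried forward into the downstream estimates instead of being discarded.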
We’re going to use this approach with the program ‘ANGSD’, which stands for ‘Analysis of Next Generation Sequence Data’
This approach was pioneered by Rasmus Nielsen, published originally in Korneliussen et al. 2014.
ANGSD has a user’s manual (it’s a work in progress…)
The basic work flow of ANGSD looks like this:
1. Create a list of bam files for the samples you want to analyze
2. Estimate genotype likelihoods (GL’s) and allele frequencies after filtering to minimize noise
3. Use GL’s to:
- estimate the site frequency spectrum (SFS)
- estimate nucleotide diversities (Watterson’s theta, pi, Tajima’s D, …)
- estimate Fst between all populations, or pairwise between sets of populations
- perform a genetic PCA based on estimation of the genetic covariance matrix (this is done on the entire set of Edge ind’s)
myrepo="/users/s/r/srkeller/Ecological_Genomics/Spring_2020"
mkdir ${myrepo}/myresults/ANGSD
output="${myrepo}/myresults/ANGSD"
mypop="AB"
ls /data/project_data/RS_ExomeSeq/mapping/BWA/${mypop}_*sorted.rm*.bam >${output}/${mypop}_bam.list
Check your output bamlist to see it was made properly.
Estimating GL’s and allele frequencies for all sites with ANGSD
ANGSD -b ${output}/${mypop}_bam.list \
-ref ${REF} -anc ${REF} \
-out ${output}/${mypop}_allsites \
-nThreads 1 \
-remove_bads 1 \
-C 50 \
-baq 1 \
-minMapQ 20 \
-minQ 20 \
-setMinDepth 3 \
-minInd 2 \
-setMinDepthInd 1 \
-setMaxDepthInd 17 \
-skipTriallelic 1 \
-GL 1 \
-doCounts 1 \
-doMajorMinor 1 \
-doMaf 1 \
-doSaf 1 \
-doHWE 1 \
# -SNP_pval 1e-6
What do all these options mean?
-nThreads 1 # how many cpus to use – be conservative
-remove_bads 1 # remove reads flagged as ‘bad’ by samtools
-C 50 # enforce downgrading of map quality if contains excessive mismatches
-baq 1 # estimates base alignment qualities for bases around indels
-minMapQ 20 # threshold for minimum read mapping quality (Phred)
-minQ 20 # threshold for minimum base quality (Phred)
-setMinDepth 3 # min read depth across ALL individuals to keep a site
-minInd 2 # min number of individuals to keep a site
-setMinDepthInd 1 # min read depth for an individual to keep a site
-setMaxDepthInd 17 # max read depth for an individual to keep a site
-skipTriallelic 1 # don’t use sites with >2 alleles
-GL 1 # estimate GL’s using the Samtools formula
-doCounts 1 # output allele counts for each site
-doMajorMinor 1 # fix major and minor alleles the same across all samples
-doMaf 1 # calculate minor allele frequency
-doSaf 1 # output allele frequencies for each site
-doHWE 1 # calculate obs. v. exp. heterozygosity
-SNP_pval 1e-6 # Keep only site highly likely to be polymorphic (SNPs)
NOTES
- If you want to restrict the estimation of the genotype likelihoods to a particular set of sites you’re interested in, add the option -sites ./selected_sites.txt (a tab-delimited file with the position of the site in column 1 and the chromosome in column 2), or use -rf ./selected_chromosome.chrs (if listing just the unique “chromosomes” or contigs you want to analyze).
- For some popgen stats you want to estimate only the polymorphic sites; for this you should include the -SNP_pval 1e-6 option to eliminate monomorphic sites when calculating your GL’s.
- There are good reasons to do it BOTH ways, with and without the -SNP_pval 1e-6 option. Keeping the monomorphic sites in is essential for getting proper estimates of nucleotide diversity and Tajima’s D. But other analyses such as PCA or GWAS want only the SNPs.

3a. Estimate the SFS for your pop:
realSFS ${output}/${mypop}_allsites.saf.idx -maxIter 1000 -tole 1e-6 -P 1 > ${output}/${mypop}_allsites.sfs
3b. Once you have the SFS, you can estimate the theta diversity stats:
Estimating thetas for all sites, using the SFS from above as a prior to estimate the GL’s
ANGSD -b ${output}/${mypop}_bam.list \
-ref ${REF} -anc ${REF} \
-out ${output}/${mypop} \
-nThreads 1 \
-remove_bads 1 \
-C 50 \
-baq 1 \
-minMapQ 20 \
-minQ 20 \
-minInd 2 \
-setMinDepthInd 1 \
-setMaxDepthInd 17 \
-setMinDepth 3 \
-skipTriallelic 1 \
-GL 1 \
-doCounts 1 \
-doMajorMinor 1 \
-pest ${output}/${mypop}_allsites.sfs \
-doSaf 1 \
-doThetas 1
thetaStat do_stat ${output}/${mypop}_allsites.thetas.idx
The first column of the results file (${mypop}.thetas.idx.pestPG) is formatted a bit funny and we don’t really need it. We can use the cut command to get rid of it:
cut -f2- ${mypop}.thetas.idx.pestPG > ${mypop}.thetas

This is now ready to bring into R to look at the mean and variability in nucleotide diversity for our pop. How does it compare to others?
3c. We can calculate Fst between any pair of populations by comparing their SFS to each other. For this, we’ll need to estimate the SFS for pairs of populations; we can each contribute to the overall analysis by looking at how our focal pop is divergent from the others in the edge region.

We first want to get a list of the unique population codes that doesn’t include our focal population (it doesn’t make sense to calculate an Fst of a population with itself…):

cat /data/project_data/RS_ExomeSeq/metadata/RS_Exome_metadata.txt | grep -w "E" | cut -f1 | uniq | grep -v ${mypop} > ~/yourrepo/mydata/edgepops_no${mypop}.txt
Next, set up a loop to calculate Fst between your population and each other pop (coded in the loop as the variable “POP2”):
for POP2 in `cat ${myrepo}/mydata/edgepops_no${mypop}.txt`
do
realSFS ${output}/${mypop}_allsites.saf.idx ${output}/${POP2}_allsites.saf.idx -P 1 > ${output}/${mypop}_${POP2}_allsites.sfs
realSFS fst index ${output}/${mypop}_allsites.saf.idx ${output}/${POP2}_allsites.saf.idx -sfs ${output}/${mypop}_${POP2}_allsites.sfs -fstout ${output}/${mypop}_${POP2}_allsites -whichFst 1
realSFS fst print ${output}/${mypop}_${POP2}_allsites.fst.idx > ${output}/${mypop}_${POP2}_allsites.fst
done
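If you also want a single genome-wide Fst value for each pair (rather than the per-site estimates printed above), realSFS has a stats subcommand that summarizes the .fst.idx files; a small optional addition after the loop:

```bash
# optional: print the global (unweighted and weighted) Fst for each pair
for POP2 in `cat ${myrepo}/mydata/edgepops_no${mypop}.txt`
do
    realSFS fst stats ${output}/${mypop}_${POP2}_allsites.fst.idx
done
```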
3d. If we have time (and the previous analysis I started yesterday finally finishes running!), we can look at the ancestry relationships among all the edge individuals using a genetic Principal Component Analysis. ANGSD has a routine for that too: PCAngsd. First, we estimate the GL’s for the entire sample, keeping just the polymorphic sites, and outputting the GL’s in “Beagle” format (file.beagle.gz).
Make a list of all the sorted and PCR dup removed bam files:
ls /data/project_data/RS_ExomeSeq/mapping/BWA/*sorted.rm*.bam >${myrepo}/myresults/EDGE_bam.list
ANGSD -b ${myrepo}/myresults/EDGE_bam.list \
-ref ${REF} -anc ${REF} -out ${myrepo}/myresults/EDGE_poly \
-nThreads 10 \
-remove_bads 1 \
-C 50 \
-baq 1 \
-minMapQ 20 \
-minQ 20 \
-setMinDepth 3 \
-minInd 2 \
-setMinDepthInd 1 \
-setMaxDepthInd 17 \
-skipTriallelic 1 \
-GL 1 \
-doCounts 1 \
-doMajorMinor 1 \
-doMaf 1 \
-doGlf 2 \
-SNP_pval 1e-6
The resulting GL file EDGE_poly.beagle.gz can be used as input to PCAngsd to estimate the covariance matrix:
python /data/popgen/pcangsd/pcangsd.py -beagle ${myrepo}/myresults/EDGE_poly.beagle.gz -o ${myrepo}/myresults/EDGE_poly_covmatrix -threads 1
Once you have that, you’ll want to send all your outputs up to Github and then back down to your local machine. We can use R to find the eigenvectors of the genetic covariance matrix and plot it:
setwd("path to your repo")
covmat <- read.table("myresults/myresults/EDGE_poly_covmatrix")
PCA <- eigen(covmat)
plot(PCA[,1],PCA[,2])
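A rough follow-up check: the eigenvalues tell you how much of the genetic covariance each PC explains.

```r
# proportion of variance explained by the first few PCs
pve <- PCA$values / sum(PCA$values)
round(pve[1:5], 3)
```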
- If we know the ancestral state, then the best info is contained in the unfolded spectrum, which shows the frequency histogram of how many derived loci are rare vs. common. Bins in the unfolded spectrum go from 0 to 2N (why?).
- When you don’t know the ancestral state confidently, you can make the SFS based on the minor allele (the less frequent allele; always < 0.5 in the population). Bins in the folded spectrum go from 0 to 1N (why?).
- Essentially, the folded spectrum wraps the SFS around such that high-frequency “derived” alleles are put in the small bins (low minor allele freq).
In your myscripts folder, let’s revise the script ANGSD_mypop.sh to work on the folded SFS. The first part will be identical to last time, except:
1. Change the -out name to -out ${output}/${mypop}_outFold
2. Take out the HWE test (not necessary to run again) and replace it with -fold 1
For info on using the folded spectrum, see the ANGSD manual page for Theta stats under “folded”.
The above command generates the site allele frequencies, which we can now use to estimate the folded SFS for your pop:

realSFS ${output}/${mypop}_outFold.saf.idx -maxIter 1000 -tole 1e-6 -P 1 > ${output}/${mypop}_outFold.sfs

Once you have the SFS, you can estimate the theta diversity stats. We do this by running ANGSD again, as above, but this time we include the -pest flag, which uses the SFS we estimated previously as a prior on the allele frequency estimation. We also include the -doThetas flag, and of course -fold 1 to tell ANGSD we want the diversities based on the folded SFS.
So, we’re copying our previous set of ANGSD commands to make a new entry and adding:
-pest ${output}/${mypop}_outFold.sfs \
-doSaf 1 \
-doThetas 1 \
-fold 1
thetaStat do_stat ${output}/${mypop}_outFold.thetas.idx
The first column of the results file (${mypop}.thetas.idx.pestPG
) is formatted a bit funny and we don’t really need it. We can use the cut command to get rid of it if we want to, or just ignore it.
cut -f2- ${mypop}.thetas.idx.pestPG > ${mypop}.thetas
This is now ready to bring into R (transfer via git) to look at the mean per-site nucleotide diversity for your focal population. How does it compare to other populations? Also bring in your SFS for plotting, and to calculate the SNP frequency in your population.
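Here’s a minimal sketch of what that might look like in R, assuming the cut-down thetas file keeps header columns named tW, tP, Tajima, and nSites (check the header of your own file first), and that the SFS file is a single line of expected site counts per frequency bin:

```r
# per-site nucleotide diversity and Tajima's D for your focal pop
theta <- read.table("AB.thetas", header = TRUE)   # replace AB with your pop code
theta$tP_persite <- theta$tP / theta$nSites       # pairwise diversity (pi) per site
theta$tW_persite <- theta$tW / theta$nSites       # Watterson's theta per site
summary(theta[, c("tP_persite", "tW_persite", "Tajima")])

# the folded SFS: one line of expected counts per minor-allele-frequency bin
sfs <- scan("AB_outFold.sfs")
barplot(sfs[-1], names.arg = seq_along(sfs[-1]),  # drop the monomorphic (0) class
        xlab = "Minor allele count", ylab = "Number of sites")
```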
Share your population-specific stats with the group by pasting your values into this googledoc.
The last type of analysis we’ll look at is GWAS. Recall that the purpose of GWAS is to ‘scan’ the genome SNP by SNP and test for an association between the genotypes at that locus (e.g., AA, AC, CC) and some response variable of interest. Often this is a phenotypic trait that has been carefully measured so as to reduce the amount of environmental (non-genetic) contribution.
We’ll use GWAS to map seedling height growth data in common gardens that my group planted in April-May of 2019 at three sites: Burlington VT, Frostburg MD, and Bent Creek, NC
Each family is represented by 5 seedlings per site, randomized within each block
Let’s start with a GWAS that maps phenotype~genotype associations in the MD garden, since that is close to the “local” climate environment for many of these edge populations. I’ve generated best linear unbiased predictors (BLUPs) for each family at the MD site based on an initial ANOVA that removed effects of block and initial height at planting. The values are here:
/data/project_data/RS_ExomeSeq/ANGSD/blups/Ht_EDGE_MD.blups
The ANGSD command we’ll use is called doAsso, and there’s some good guidance on the manual page for the different implementations of this analysis. We’ll use the hybrid approach doAsso 5, which tests every SNP for association, but then follows up with only the significant SNPs to estimate the effect size (slope of the phenotype~genotype relationship).
The statistical approach is a generalized linear model of the form:
y ~ u + beta*SNP + covariates + error
where,
y = the response variable (here, the height growth BLUPs for each family)
u = the intercept
beta = the effect size of the SNP on the response
SNP = the genotype at that locus (supplied to ANGSD as genotype probabilities)
covariates = PC1 and PC2 scores from our genetic PCA, to control for population structure
error = the error variance

The PC covariate files are located here:

[srkeller@pbio381 PCA]$ pwd
/data/project_data/RS_ExomeSeq/ANGSD/PCA
[srkeller@pbio381 PCA]$ ll EDGE_PC*
-rw-r--r--. 1 srkeller users 4342 Feb 18 21:01 EDGE_PC12.txt
-rw-r--r--. 1 srkeller users 2193 Feb 18 18:10 EDGE_PC1.txt
-rw-r--r--. 1 srkeller users 2149 Feb 18 18:10 EDGE_PC2.txt
ANGSD uses genotype probabilities as input and tests each SNP for association with a Likelihood Ratio Test (LRT) which tests the likelihood of there being an association / likelihood of NO association. The LRT is ~ chisq distributed with 1 d.f.
The basic set up is:
blups="/data/project_data/RS_ExomeSeq/ANGSD/blups"
OUT="/data/project_data/RS_ExomeSeq/ANGSD"
To speed things up in class a bit, we’re going to break the total number of contigs into 20 equal size chunks of 1280. That way, each student runs their GWAS on a set of 1280 contigs, and then we pool our result together at the end. No need to do this if you’re working on your own in the future…this is just a convenience for class time.
contigs="/data/project_data/RS_ExomeSeq/ANGSD/contig_splits"
mycontigs="xaa"
And then the main script:
ANGSD -b ${OUT}/EDGE_bam.list \
-ref ${REF} \
-out ${OUT}/GWAS/Ht_EDGE_MD_PC12_${mycontigs} \
-remove_bads 1 \
-C 50 \
-baq 1 \
-minMapQ 20 \
-minQ 20 \
-setMinDepth 3 \
-minInd 2 \
-setMinDepthInd 1 \
-setMaxDepthInd 17 \
-skipTriallelic 1 \
-GL 1 \
-doPost 1 \
-doCounts 1 \
-doMajorMinor 1 \
-doMaf 1 \
-SNP_pval 1e-6 \
-yQuant ${blups}/Ht_EDGE_MD.blups \
-doAsso 5 \
-rf ${contigs}/${mycontigs} \
-cov ${OUT}/PCA/EDGE_PC12.txt
Couple of new options here:
-yQuant ${blups}/Ht_EDGE_MD.blups : this is your response variable; no header
-doAsso 5 : this is the model you’re choosing to run. See the ANGSD manual under doAsso
-rf ${contigs}/${mycontigs} : this means you want the GWAS to run just on a set of regions (r) defined in an external file (f)
-cov ${OUT}/PCA/EDGE_PC12.txt : these are the covariates; as many columns as you like; no header
Give it some time to run (should take 20 min or so), then we’ll bring the results into R to plot and summarize.
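As a sketch of the kind of summary we’ll do (assuming the doAsso output is the gzipped table named after your -out prefix with the .lrt0.gz suffix, and that it contains columns including Chromosome, Position, and LRT; sites that fail filters get an LRT of -999, so drop those first):

```r
# read one chunk of the GWAS output and plot -log10 p-values
gwas <- read.table(gzfile("Ht_EDGE_MD_PC12_xaa.lrt0.gz"), header = TRUE)
gwas <- gwas[gwas$LRT >= 0, ]                              # remove filtered sites (LRT = -999)
gwas$pval <- pchisq(gwas$LRT, df = 1, lower.tail = FALSE)  # LRT ~ chi-square with 1 d.f.
plot(-log10(gwas$pval), pch = 19, cex = 0.5,
     xlab = "SNP (ordered along contigs)", ylab = "-log10(p)")
```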
- Is there evidence for genetic associations with early seedling height growth?
- What do the beta estimates suggest about the direction of the effect (more often increasing, or decreasing)?
- Are SNPs that have a positive effect on height in the MD garden the same as those in NC?
- For the SNPs that have the strongest effect on height, are they polymorphic in your focal pop?

NOTE: You can do the same analysis replacing climate data for height as the response variable. This is known as Gene-Environment Association (GEA), and is a common way to test for regions of the genome that confer local adaptation to climate (or other aspects of the environment). I’ve placed two relevant climate variables for red spruce in the blups directory:
[srkeller@pbio381 blups]$ pwd
/data/project_data/RS_ExomeSeq/ANGSD/blups
[srkeller@pbio381 blups]$ ll
total 20
-rw-r--r--. 1 srkeller users 432 Feb 18 20:41 bio18.blups
-rw-r--r--. 1 srkeller users 432 Feb 18 20:37 bio5.blups
These variables are derived from the bioclim climatic variables available here.
Bio5 = 10 * (Max temp of the warmest month)
Bio18 = 10 * (Precip of the warmest quarter)
Red spruce is a montane coniferous tree that inhabits cool, moist habitats in the northeast. Occasionally it also occupies less ideal sites that are warmer and drier, particularly driven by microsite differences in aspect and substrate (rocky scree slopes exposed to direct sun). The purpose of this experiment was to sample red spruce genetic variation from sites that were polarized into cool & wet vs. hot & dry based on historic climate (Bio18, precip of the warmest quarter, and Bio5, temp of the warmest month), and to assess the gene expression responses of individuals from these habitats to experimental treatments of heat and heat+drought.
Two Source Climates (SourceClim): CoolWet and HotDry
Experimental Treatments (Trt): Control, Heat, and HeatDry (heat + drought)
Three time periods (Day): 0, 5, and 10
Extracted RNA from whole seedlings (root, stem, needle tissue)
Aimed for 5 biological reps per Trt x SourceClim x Day combo, but day5 had few RNA extractions that worked
Realized sample replication after sequencing: N=76
Trt | SourceClim | Day | Nreps |
---|---|---|---|
Control | CoolWet | 0 | 5 |
Control | CoolWet | 5 | 3 |
Control | CoolWet | 10 | 5 |
Control | HotDry | 0 | 5 |
Control | HotDry | 5 | 3 |
Control | HotDry | 10 | 5 |
Heat | CoolWet | 0 | 5 |
Heat | CoolWet | 5 | 3 |
Heat | CoolWet | 10 | 5 |
Heat | HotDry | 0 | 5 |
Heat | HotDry | 5 | 2 |
Heat | HotDry | 10 | 5 |
HeatDry | CoolWet | 0 | 5 |
HeatDry | CoolWet | 5 | 1 |
HeatDry | CoolWet | 10 | 5 |
HeatDry | HotDry | 0 | 5 |
HeatDry | HotDry | 5 | 4 |
HeatDry | HotDry | 10 | 5 |
- Samples were quantified for RNA concentration and quality on the Bioanalyzer
- Samples >1 ng/ul were sent to Cornell for 3’ tag sequencing
- Library prep followed the LexoGen protocol and sequencing was on 1 lane of a NextSeq500 (1x86 bp reads)
- Samples were demultiplexed and named according to the convention: POP_FAM_TRT_DAY
What questions can we ask/address with this experimental design, with these data? How do we translate these questions to models in DESeq2?
Do families from different source climates differ for their gene expression?
Is there a transcriptome wide response to heat stress? Does this change when the additional stress of drought is added?
Is there a significant interaction between source climate and stress treatment, such that families from hot/dry climates have a unique expression response to heat or heat+drought compared to families from cool/wet climates?
Which specific genes are involved in the above responses, and do they reveal functional enrichment of particular pathways?
Do DE genes or pathways show evidence of positive selection (compare DE genes with popgen signatures of sel’n)?
Can we use association mapping to identify eQTL associated with DE responses?
FastQC on raw reads –> Trimmomatic (done!) –> FastQC on cleaned reads
Reference transcriptome:
/data/project_data/RS_RNASeq/ReferenceTranscriptome/Pabies1.0-all-cds.fna.gz
Downloaded from Congenie.org
#!/bin/bash
cd /data/project_data/RS_RNASeq/fastq/
########## Trimmomatic for single end reads
for R1 in *R1.fastq.gz
do
echo "starting sample ${R1}"
f=${R1/_R1.fastq.gz/}
name=`basename ${f}`
java -classpath /data/popgen/Trimmomatic-0.33/trimmomatic-0.33.jar org.usadellab.trimmomatic.TrimmomaticSE \
-threads 1 \
-phred33 \
"$R1" \
/data/project_data/RS_RNASeq/fastq/cleanreads/${name}_R1.cl.fq \
ILLUMINACLIP:/data/popgen/Trimmomatic-0.33/adapters/TruSeq3-SE.fa:2:30:10 \
LEADING:20 \
TRAILING:20 \
SLIDINGWINDOW:6:20 \
HEADCROP:12 \
MINLEN:35
echo "sample ${R1} done"
done
Run FastQC on the clean reads to confirm improvement. What do we look for in this second run?
Next we’ll use Salmon to quantify transcript abundance, i.e., to map and count the reads.
First step: Index the reference transcriptome. This only needs to be done once and has been done already, but here’s the code:
cd /data/project_data/RS_RNASeq/ReferenceTranscriptome/
salmon index -t Pabies1.0-all-cds.fna.gz -i Pabies_cds_index --decoys decoys.txt -k 31
Note that the -k flag sets the minimum acceptable length for a valid match between query (read) and the reference. This is a parameter that the creators of Salmon suggest can be made “slightly smaller” if the mapping rate is lower than expected. We may want to explore reducing -k.
Second step: Start quantification! Let’s see if we can write a for loop together to start the mapping based on the Salmon tutorial
salmon quant -i transcripts_index -l <LIBTYPE> -r reads.fq --validateMappings -o transcripts_quant
The descriptions of all of the options can be found on the Salmon github page and by using the command salmon quant -h.
salmon.sh
#!/bin/bash
# run salmon to do the mapping and the counting of the cleanreads
# salmon quant -i transcripts_index -l <LIBTYPE> -r reads.fq --validateMappings -o transcripts_quant
cd /data/project_data/RS_RNASeq/fastq/cleanreads/
for file in CAM_02_H_*.cl.fq
do
salmon quant \
-i /data/project_data/RS_RNASeq/ReferenceTranscriptome/Pabies_cds_index \
-l A \
-r ${file} \
--validateMappings \
-p 1 \
--seqBias \
-o /data/project_data/RS_RNASeq/salmon/allmapping/${file}
done
For each sample mapped, you now have a directory with several output files including a log of the run. In that log, the mapping rate (% of reads mapped with sufficient quality) is reported. We can view the contents of the file using cat. We can also use grep (i.e., regular expressions) to pull out the mapping rate for all the samples. Though there’s probably a more elegant solution, here is one:
grep -r --include \*.log -e 'Mapping rate'
How could we save this output?
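One easy answer: redirect the grep results into a file in your results folder (the path here is just a suggestion):

```bash
grep -r --include \*.log -e 'Mapping rate' > ~/myrepo/myresults/salmon_mapping_rates.txt
```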
What is our mapping rate? Is this good/enough? What factors could affect mapping rate?
We ended up using the index file called Pabies_cds_index, which contains all of the coding sequences for P. abies, not just the high-confidence coding sequences. This improved the percent mapping from ~4% to ~25%.
Next, we combine the individual quant.sf files from their respective directories into a counts data matrix with all 76 samples in one table.
To do this, we’ll use the R
package tximport
on our class server. Here’s a tximport
tutorial by the creators of the two programs, Salmon and DESeq2. Unless we decide to use multiple rounds of mapping (changing references, etc), only one person needs to make this compiled matrix.
Here’s an example of how it can be done. First start R
on the server by typing capital R
and enter. Exciting! Right? Load the libraries we need (already installed).
library(tximportData)
library(tximport)
#locate the directory containing the files.
dir <- "/data/project_data/RS_RNASeq/salmon/"
list.files(dir)
# read in table with sample ids
samples <- read.table("/data/project_data/RS_RNASeq/salmon/RS_samples.txt", header=TRUE)
# now point to quant files
all_files <- file.path(dir, samples$sample, "quant.sf")
names(all_files) <- samples$sample
# what would be used if linked transcripts to genes
#txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
# to be able to run without tx2gene
txi <- tximport(all_files, type = "salmon", txOut=TRUE)
names(txi)
head(txi$counts)
countsMatrix <- txi$counts
dim(countsMatrix)
#[1] 66069 76
# To write out
write.table(countsMatrix, file = "RS_countsMatrix.txt", col.names = T, row.names = T, quote = F)
Move this counts data matrix to your individual machines using your preferred file transfer method (Fetch, WinSCP, scp). Also move the RS_samples.txt file to your machine. This file is a table that associates each of our samples with their conditions (climate, treatment, day, population).
Note that in the code above, we are using the counts data that are not normalized to length because we are working with 3’ RNAseq data. This is informed by a cautionary note from the creators on working with 3’ tagged RNA-seq: “If you have 3’ tagged RNA-seq data, then correcting the counts for gene length will induce a bias in your analysis, because the counts do not have length bias. Instead of using the default full-transcript-length pipeline, we recommend to use the original counts, e.g. txi$counts
as a counts matrix, e.g. providing to DESeqDataSetFromMatrix or to the edgeR
or limma
functions without calculating an offset and without using countsFromAbundance.”
Important considerations: How was this transcriptome assembled? Should we collapse transcripts to genes?
To do this, the rownames of the sampleTable (RS_samples) need to align with the colnames of txi$counts; that is, the sample names need to be the row names in the sample/factor table and the column names of the data matrix (with rows as genes and columns as samples).
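A quick sanity check before building the object (a sketch; sampleTable here is whatever object holds your sample/factor table):

```r
# should return TRUE: sample names line up between the factor table and the counts matrix
all(rownames(sampleTable) == colnames(txi$counts))
```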
dds <- DESeqDataSetFromTximport(txi, colData = sampleTable, design = ~ climate + day + treatment) # design here mirrors the one we use below; adjust to the factors you want to test
Let’s get R set up for Transcriptomics, data import, and visualization! Here is a list of some of the libraries we will need:
setwd("~/github/2020_Ecological_Genomics")
library(DESeq2)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
library(ggpubr)
library(wesanderson)
countsTable <- read.table("RS_countsMatrix.txt", header=TRUE, row.names=1)
countsTableRound <- round(countsTable) # DESeq2 wants integer counts
conds <- read.delim("RS_samples.txt", header=TRUE, stringsAsFactors = TRUE, row.names=1, colClasses=c('factor', 'factor', 'factor', 'factor'))
dds <- DESeqDataSetFromMatrix(countData = countsTableRound, colData = conds, design = ~ day + climate + treatment)
Recall that two weeks ago you all wrote for loops to map your set of cleaned fastq files to the reference transcriptome. We discovered that we had low mapping rates, ~2%! We did some troubleshooting in class. We further hypothesized that many of our reads were not mapping because the reference we had selected included only the coding region. In working with 3’ RNAseq data, much of our sequencing effort is likely to be in the 3’ UTR (untranslated region).
As described in the last tutorial, after all samples have been mapped to the reference, the next step is to assemble the counts matrix using tximport in R on the server, so that we can move from read mapping with Salmon to differential gene expression (DGE) analysis with DESeq2. See the previous tutorial for the code to assemble the counts matrix.
Now we will work in R on our individual machines, each of us working with the complete data set (n=76, not just a subset of samples).
# Transcriptomics in R
# Set your working directory
setwd(here::here("myresults/transcript_counts"))
# Packages
library(DESeq2)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
library(ggpubr)
library(wesanderson)
library(vsn)
library(pheatmap)
# Import data ------------------------------------------------------------------
countsTable <- read.table("RS_cds2kb_countsMatrix.txt",
header=TRUE,
row.names=1)
# head(countsTable)
# dim(countsTable)
# Need to round because DESeq wants only integers
countsTableRound <- round(countsTable)
# head(countsTableRound)
# Import the samples description table
# links each sample to factors of the experimental design
# Need colClasses, otherwise "day" is imported as numeric instead of a factor
conds <- read.delim("RS_samples.txt",
header = TRUE,
stringsAsFactors = TRUE,
row.names = 1,
colClasses = c('factor', 'factor', 'factor', 'factor'))
head(conds)
## pop treatment day climate
## ASC_06_C_0_GAGTCC_R1.cl.fq ASC_06 C 0 HD
## ASC_06_C_10_ACGTCT_R1.cl.fq ASC_06 C 10 HD
## ASC_06_C_5_GTGTAG_R1.cl.fq ASC_06 C 5 HD
## ASC_06_D_H_0_CGCAAC_R1.cl.fq ASC_06 D 0 HD
## ASC_06_D_H_10_CAGCGT_R1.cl.fq ASC_06 D 10 HD
## ASC_06_H_0_AATAGC_R1.cl.fq ASC_06 H 0 HD
dim(conds)
## [1] 76 4
# Read summary statistics and visualization ------------------------------------
# colSums(countsTableRound)
# mean(colSums(countsTableRound))
# bar plot to visualize the sum of reads per sample
barplot(colSums(countsTableRound),
las = 3,
cex.names = 0.5,
names.arg = substring(colnames(countsTableRound), 1, 13))
abline(h = mean(colSums(countsTableRound)), col = "blue", lwd = 2)
# Total counts per gene and average counts per sample
# rowSums(countsTableRound)
# apply(countsTableRound, 2, mean)
# Create a DESeq object --------------------------------------------------------
# define the experimental design here with the tilde
dds <- DESeqDataSetFromMatrix(countData = countsTableRound,
colData = conds,
design = ~ climate + day + treatment)
# Filter out genes with few reads
# 76 was chosen to create an average of 1 read per sample
dds <- dds[rowSums(counts(dds)) > 76]
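# (76 samples x an average of 1 read per sample = 76 reads total for a gene)
# Optional check of how many genes remain after filtering:
# dim(dds)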
# Run the DESeq model to test for differential gene expression -----------------
# DESeq does these three things all in this function
# 1) estimate size factors (per sample)
# 2) estimate dispersion (per gene)
# 3) run negative binomial glm
dds <- DESeq(dds)
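# Equivalently, the three steps can be run one at a time (a sketch of what
# DESeq() wraps; not an extra required step):
# dds <- estimateSizeFactors(dds)
# dds <- estimateDispersions(dds)
# dds <- nbinomWaldTest(dds)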
# List the results you've generated
resultsNames(dds)
## [1] "Intercept" "climate_HD_vs_CW" "day_10_vs_0" "day_5_vs_0"
## [5] "treatment_D_vs_C" "treatment_H_vs_C"
# Order, list, and summarize results from specific contrasts -------------------
# Set your adjusted p-value cutoff
# You can make summary tables of the number of genes differentially expressed
# (up/down-regulated) for each contrast
res_treatCD <- results(dds, contrast = c("treatment", "C", "D"), alpha = 0.05)
res_treatCD <- res_treatCD[order(res_treatCD$padj), ]
# summary(res_treatCD)
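# The same pattern works for the other contrasts, e.g. (a sketch):
# res_treatCH <- results(dds, contrast = c("treatment", "C", "H"), alpha = 0.05)
# summary(res_treatCH)
# res_day10v0 <- results(dds, contrast = c("day", "10", "0"), alpha = 0.05)
# summary(res_day10v0)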
# Data visualization -----------------------------------------------------------
# MA plot
ggmaplot(res_treatCD,
main = expression("Hot & Dry" %->% "Control"),
fdr = 0.01, fc = 0, size = 1,
palette = c("#B31B21", "#1465AC", "#A6A6A680"),
legend = "top",
top = 0,
# genenames = as.vector(res_treatCD$gene),
# select.top.method = "padj",
# font.label = c("bold", 11),
# label.rectangle = TRUE,
font.legend = "bold",
font.main = "bold",
ggtheme = ggplot2::theme_classic()) +
theme(legend.text = element_text(size = 12))
# PCA
vsd <- vst(dds, blind = FALSE, nsub = 10000)
data <- plotPCA(vsd,
ntop = 10000,
intgroup = c("climate", "treatment", "day"),
returnData = TRUE)
percentVar <- round(100 * attr(data, "percentVar"))
# reordering the factors
data$treatment <- factor(data$treatment,
levels = c("C", "H", "D"),
labels = c("C", "H", "D"))
data$day <- factor(data$day,
levels = c("0", "5", "10"),
labels = c("0","5","10"))
ggplot(data, aes(PC1, PC2, color = climate, shape=treatment)) +
geom_point(size = 4, alpha = 0.85) +
xlab(paste0("PC1: ", percentVar[1], "% variance")) +
ylab(paste0("PC2: ", percentVar[2], "% variance")) +
theme_minimal()
ggplot(data) +
geom_histogram(aes(x = PC1, fill = day),
bins = 20, color = "grey50", position = "dodge") +
theme_classic()
# Counts of a specific top gene!
# an important validation that the normalization and model are working
d <- plotCounts(dds,
gene = "MA_10426407g0030",
intgroup = (c("treatment","climate", "day")),
returnData = TRUE)
ggplot(d, aes(x = treatment,
y = count,
shape = climate,
color = day)) +
geom_jitter(width = 0.3, size = 3) +
scale_x_discrete(limits = c("C", "H", "D")) +
theme_classic() +
theme(text = element_text(size = 20),
panel.grid.major = element_line(colour = "grey"))
# Heatmap of top 20 genes sorted by pvalue
topgenes <- head(rownames(res_treatCD), 20)
mat <- assay(vsd)[topgenes, ]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(dds)[, c("treatment", "climate", "day")])
pheatmap(mat,
annotation_col = df,
show_colnames = FALSE)
Acartia tonsa is a calanoid copepod with a worldwide distribution. It inhabits estuaries and coastal waters and is typically the most abundant zooplankton present. Because of their huge population sizes and global distribution, they play an important role in ocean biogeochemical cycling and ecosystem functioning. For example, they’re an important food base for many fishes. Given their broad distribution in dynamic habitats (estuaries), they can tolerate and live in a broad range of salinities, from freshwater to seawater, and temperatures, among other stressors.
We know that the world is rapidly changing due to human activities and it is important to understand how species will respond to these changes. We were interested in understanding how A. tonsa could respond to stressful environments that will likely be caused by climate change. Can they adapt to rapid changes in temperature and acidity? How might epigenetic responses interact with adaptation to enable resilience?
A. tonsa is a great system to ask these questions for a number of reasons. First, their generation time is only ~10-20 days and they’re tiny! Only 1-2 mm. This means we can easily execute multi-generational studies with thousands of individuals. Second, because they can tolerate such variable environments, they should have lots of plasticity to respond to environmental fluctuations. Finally, their large population sizes mean that genetic diversity should be high, giving them high adaptive potential. This means that if any species can respond to rapid environmental change, it is likely A. tonsa.
We collected A. tonsa from the wild, common gardened them for three generations, then split them into four treatments with four replicates each and about 3,000-5,000 individuals per replicate. We left them to evolve in these conditions for 25 generations.
Samples were collected at generations F0 and F25 to quantify methylation frequencies using reduced representation bisulfite sequencing (RRBS). Below are the sample IDs:
Sample_ID | Treatment | treatment_abbrev | generation | sample_rep |
---|---|---|---|---|
AA_F00_1 | Control: ambient temp, ambient CO2 | AA | F00 | 1 |
AA_F00_2 | Control: ambient temp, ambient CO2 | AA | F00 | 2 |
AA_F00_3 | Control: ambient temp, ambient CO2 | AA | F00 | 3 |
AA_F00_4 | Control: ambient temp, ambient CO2 | AA | F00 | 4 |
AA_F25_1 | Control: ambient temp, ambient CO2 | AA | F25 | 1 |
AA_F25_2 | Control: ambient temp, ambient CO2 | AA | F25 | 2 |
AA_F25_3 | Control: ambient temp, ambient CO2 | AA | F25 | 3 |
AA_F25_4 | Control: ambient temp, ambient CO2 | AA | F25 | 4 |
AH_F25_1 | ambient temp, high CO2 | AH | F25 | 1 |
AH_F25_2 | ambient temp, high CO2 | AH | F25 | 2 |
AH_F25_3 | ambient temp, high CO2 | AH | F25 | 3 |
AH_F25_4 | ambient temp, high CO2 | AH | F25 | 4 |
HA_F25_1 | high temp, ambient CO2 | HA | F25 | 1 |
HA_F25_2 | high temp, ambient CO2 | HA | F25 | 2 |
HA_F25_3 | high temp, ambient CO2 | HA | F25 | 3 |
HA_F25_4 | high temp, ambient CO2 | HA | F25 | 4 |
HH_F25_1 | high temp, high CO2 | HH | F25 | 1 |
HH_F25_2 | high temp, high CO2 | HH | F25 | 2 |
HH_F25_3 | high temp, high CO2 | HH | F25 | 3 |
HH_F25_4 | high temp, high CO2 | HH | F25 | 4 |
Reduced representation bisulfite sequencing (RRBS) focuses on the CpG-rich parts of the genome, using a restriction enzyme digestion similar to RAD sequencing.
Following the adapter ligation, we bisulfite convert all unmethylated C’s to T’s.
Before starting we also spiked in a small amount of DNA from E. coli that we know wasn’t methylated. Using this, we can calculate downstream how efficient our bisulfite conversion was.
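As a toy illustration (hypothetical sequence, not part of the pipeline), here is what the conversion does at the read level and why bisulfite reads no longer match the reference directly:
ref <- "ACGTCCGA"                  # reference sequence
# unmethylated C's read as T's after conversion and PCR
converted <- gsub("C", "T", ref)   # "ATGTTTGA" if no C was methylated
# a methylated C is protected and still reads as C; comparing the mapped
# read back to the reference is how methylation is called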
Take a look at the fastqc files:
What do you notice that is different from fastqc that you’ve seen previously?
Why do we see these differences?
Why is this different from typical DNA alignment?
What do we need to do to the genome?
Sample assignments for mapping:
Person | Sample |
---|---|
Bertrand B | AA_F00_1 |
Erika B | AA_F00_2 |
Kerria B | AA_F00_3 |
Ben C | AA_F00_4 |
Brendan | AA_F25_1 |
Chege H. | AA_F25_2 |
Alison H. | AA_F25_3 |
Sandra N. | AA_F25_4 |
Thomas O. | AH_F25_1 |
Csenge P. | AH_F25_2 |
Zoe P. | AH_F25_3 |
Anoob P. | AH_F25_4 |
Shervin R. | HA_F25_1 |
Jorge R. | HA_F25_2 |
Lily S. | HA_F25_3 |
Baxter W. | HA_F25_4 |
Katelynn W. | HH_F25_1 |
Mapping report from Bismark
Look at mapping rate:
bismark_methylation_extractor --bedGraph --scaffolds --gzip \
--cytosine_report --comprehensive \
--no_header \
--parallel 6 \
--output ~/tonsa_epigenetics/analysis/methylation_extract/ \
--genome_folder /data/copepods/tonsa_genome/ \
*pe.bam
Link to methylation extraction report
To re-run the extraction, ignoring the first two bases of each read:
bismark_methylation_extractor --bedGraph --scaffolds --gzip \
--cytosine_report --comprehensive \
--no_header \
--parallel 6 \
--ignore 2 --ignore_r2 2 \
--output ~/tonsa_epigenetics/analysis/methylation_extract/ \
--genome_folder /data/copepods/tonsa_genome/ \
*pe.bam
Link to methylation extraction report ignoring bias
The resulting coverage (.cov) file has six columns:
Column 1: Chromosome
Column 2: Start position
Column 3: End position
Column 4: Methylation percentage
Column 5: Number of methylated C's
Column 6: Number of unmethylated C's
We can look at sites with some data (here, methylation above 5% and more than 10 methylated C's) using the following:
zcat HH_F25_4_1_bismark_bt2_pe.bismark.cov.gz | awk '$4 > 5 && $5 > 10' | head
library(methylKit)
library(tidyverse)
library(ggplot2)
library(pheatmap)
# Read in the raw methylation calls with methylKit -----------------------------
# Set directory
dir <- here::here("myresults/epigenetics")
# Read in the sample IDs
samples <- read.table(paste0(dir, "/", "sample_id.txt"),
header = FALSE)
# Point to coverage files
files <- file.path(dir, samples$V1)
all(file.exists(files))
## [1] TRUE
# Convert to list
file.list <- as.list(files)
# Get the names only for naming our samples
nmlist <- as.list(gsub("_1_bismark_bt2_pe.bismark.cov.gz", "", samples$V1))
# use methRead to read in the coverage files
myobj <- methRead(location = file.list,
sample.id = nmlist,
assembly = "atonsa",
dbtype = "tabix",
context = "CpG",
resolution = "base",
mincov = 20,
treatment = c(0, 0, 0, 0,
1, 1, 1, 1,
2, 2, 2, 2,
3, 3, 3, 3,
4, 4, 4, 4),
pipeline = "bismarkCoverage",
dbdir = dir)
# Visualize coverage and filter ------------------------------------------------
# We can look at the coverage for individual samples with getCoverageStats()
getCoverageStats(myobj[[1]], plot = TRUE)
# Plot all of our samples at once to compare.
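# One possible way (a sketch): loop over the list and plot each sample
# par(mfrow = c(4, 5))   # 20 samples on one page
# for (i in seq_along(myobj)) {
#   getCoverageStats(myobj[[i]], plot = TRUE)
# }
# par(mfrow = c(1, 1))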
# Filter samples by depth
filtered.myobj <- filterByCoverage(myobj,
lo.count = 20,
lo.perc = NULL,
hi.count = NULL,
hi.perc = 97.5,
dbdir = dir)
# Merge samples ----------------------------------------------------------------
# Merge all the samples. We will require sites to be present in each sample
# Note this takes a couple of hours to run
meth <- unite(filtered.myobj, mc.cores = 3, suffix = "united", dbdir = dir)
# Load a previously created database -------------------------------------------
# So Reid created this merged database and we are going to move on from here
meth <- methylKit:::readMethylBaseDB(
dbpath = paste0(dir, "/methylBase_united.txt.bgz"),
dbtype = "tabix",
sample.id = unlist(nmlist),
assembly = "atonsa", # this is just a string. no actual database
context = "CpG",
resolution = "base",
treatment = c(0,0,0,0,
1,1,1,1,
2,2,2,2,
3,3,3,3,
4,4,4,4),
destrand = FALSE)
# Methylation statistics across samples ----------------------------------------
# percMethylation() calculates the percent methylation for each site and sample
pm <- percMethylation(meth)
#plot methylation histograms
ggplot(gather(as.data.frame(pm)), aes(value)) +
geom_histogram(bins = 10, color = "black", fill = "grey") +
facet_wrap(~ key, ncol = 4)
# calculate and plot mean methylation
sp.means <- colMeans(pm)
p.df <- data.frame(sample = names(sp.means),
group = substr(names(sp.means), 1, 6),
methylation = sp.means)
ggplot(p.df, aes(x = group, y = methylation, color = group)) +
stat_summary(color = "black") +
geom_jitter(width = 0.1, size = 3) +
theme_classic()
# Summarize variation ----------------------------------------------------------
# Sample clustering
clusterSamples(meth, dist = "correlation", method = "ward.D", plot = TRUE)
##
## Call:
## hclust(d = d, method = HCLUST.METHODS[hclust.method])
##
## Cluster method : ward.D
## Distance : pearson
## Number of objects: 20
# PCA
PCASamples(meth, screeplot = TRUE)
PCASamples(meth, screeplot = FALSE)
# Find differentially methylated sites between two groups ----------------------
# Subset groups
meth_sub <- reorganize(meth,
sample.ids = c("AA_F00_1","AA_F00_2","AA_F00_3",
"AA_F00_4","HH_F25_1","HH_F25_2",
"HH_F25_3","HH_F25_4"),
treatment = c(0,0,0,0,1,1,1,1),
save.db = FALSE)
# Calculate differential methylation
myDiff <- calculateDiffMeth(meth_sub,
overdispersion = "MN",
mc.cores = 1,
suffix = "AA_HH",
adjust = "qvalue",
test = "Chisq")
Here “MN” corrects for overdispersion.
calculateDiffMeth() fits a logistic regression to the methylation values at each site, with treatment (case or control) as the explanatory variable, then compares the fit of that model to the null model (no explanatory variable) and asks whether the fit is significantly better using a Chi-squared test.
The methylation proportions are weighted by their coverage, as in a typical logistic regression. In principle you could also enter the data as a two-column success/failure data frame, which is common in logistic regressions.
Why use the overdispersion correction? A Chi-squared test without it finds more true positives but many more false positives. Overdispersion correction with the Chi-squared test is a good compromise: it reduces the true positives somewhat, but greatly reduces the false positive rate.
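As a rough sketch of that idea in base R (toy counts for a single CpG site; this is not methylKit’s actual implementation):
# hypothetical methylated / unmethylated C counts for 4 AA and 4 HH samples
methylated   <- c(45, 50, 40, 48, 20, 25, 18, 22)
unmethylated <- c( 5, 10, 12,  8, 30, 28, 35, 30)
treatment    <- factor(rep(c("AA", "HH"), each = 4))
# logistic regression in success/failure form, weighted by coverage
full <- glm(cbind(methylated, unmethylated) ~ treatment, family = binomial)
null <- glm(cbind(methylated, unmethylated) ~ 1, family = binomial)
anova(null, full, test = "Chisq")  # is the treatment model a better fit?
# (an overdispersion correction would use something like family = quasibinomial)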
# Get all differentially methylated bases
myDiff <- getMethylDiff(myDiff, qvalue = 0.05, difference = 10)
# We can visualize the changes in methylation frequencies quickly.
hist(getData(myDiff)$meth.diff)
# Hyper-methylated bases
hyper <- getMethylDiff(myDiff, difference = 10, qvalue = 0.05, type = "hyper")
# Hypo-methylated bases
hypo <- getMethylDiff(myDiff, difference = 10, qvalue = 0.05, type = "hypo")
# Plots of differentially methylated groups ------------------------------------
# Heatmaps -----
pm <- percMethylation(meth_sub)
# make a dataframe with snp id's, methylation, etc.
sig.in <- as.numeric(row.names(myDiff))
pm.sig <- pm[sig.in, ]
# add snp, chr, start, stop
din <- getData(myDiff)[,1:3]
df.out <- cbind(paste(getData(myDiff)$chr,
getData(myDiff)$start,
sep= ":"),
din, pm.sig)
colnames(df.out) <- c("snp", colnames(din), colnames(df.out[5:ncol(df.out)]))
df.out <- (cbind(df.out,getData(myDiff)[, 5:7]))
pheatmap(pm.sig, show_rownames = FALSE)
# You can also normalize to one of the groups
ctrmean <- rowMeans(pm.sig[,1:4])
h.norm <- (pm.sig-ctrmean)
pheatmap(h.norm, show_rownames = FALSE)
# Methylation of specific gene or snp ------------------------------------------
df.plot <- df.out[,c(1,5:12)] %>% pivot_longer(-snp, values_to = "methylation")
df.plot$group <- substr(df.plot$name,1,2)
head(df.plot)
## # A tibble: 6 x 4
## snp name methylation group
## <fct> <chr> <dbl> <chr>
## 1 LS041600.1:376 AA_F00_1 84.4 AA
## 2 LS041600.1:376 AA_F00_2 84.0 AA
## 3 LS041600.1:376 AA_F00_3 84.3 AA
## 4 LS041600.1:376 AA_F00_4 83.6 AA
## 5 LS041600.1:376 HH_F25_1 76.7 HH
## 6 LS041600.1:376 HH_F25_2 77.5 HH
# Looking at snp LS049205.1:248
# if you choose a different snp, you can create different plots.
df.plot %>% filter(snp=="LS049205.1:248") %>%
ggplot(., aes(x=group, y=methylation, color=group, fill=group)) +
stat_summary(fun.data = "mean_se", size = 2) +
geom_jitter(width = 0.1, size=3, pch=21, color="black")
# Write bed file for intersection with genome annotation
setwd(dir)
write.table(file = "diffmeth.bed",
data.frame(chr = df.out$chr,
start = df.out$start,
end = df.out$end),
row.names = FALSE,
col.names = FALSE,
quote = FALSE,
sep = "\t")
/data/popgen/bedtools2/bin/bedtools closest -a diffmeth.bed \
-b /data/project_data/epigenetics/GCA_900241095.1_Aton1.0_genomic.fa_annotation_table.bed \
-D b | \
awk '!($10=="-")' > hits.bed
The annotation file here has the following fields: ScaffoldName, FromPosition, ToPosition, Sense, TranscriptName, TranscriptPath, GeneAccession, GeneName, GeneAltNames, GenePfam, GeneGo, CellularComponent, MolecularFunction, BiologicalProcess.
Note that the hits.bed file will paste the diffmeth.bed columns before the annotation table, so the first 3 columns are from diffmeth.bed and the next 8 are from the annotation table.
# count up number of hits
cat hits.bed | wc -l
# count the number of hits per unique gene name
cat hits.bed | cut -f 8 | sort | uniq -c
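If you prefer to do the counting in R, something like this gives the same information (a sketch; it assumes the tab-separated layout described above, with the gene name in column 8):
hits <- read.table("hits.bed", sep = "\t", header = FALSE, stringsAsFactors = FALSE)
nrow(hits)                               # number of hits
sort(table(hits$V8), decreasing = TRUE)  # hits per unique gene name
length(unique(hits$V8))                  # number of unique gene names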