Guide for creation of reference genomes in Linux with dDocent and the Stacks package.
A reference genome (aka reference assembly) is a representative example of the set of genes in an idealized organism of a species and is typically used as a guide for developing new genomes. Its creation often comprises assembly of reference contigs: sets of overlapping DNA segments that together form a consensus region of DNA.
This guide describes the process of using genetic sequence data to plot reference contig data, generate a fasta file of complete restriction-site associated DNA (RAD) fragments, and output the optimal interval of cutoff values to capture maximum diversity and accuracy.
Requirements
- Linux OS (I used Windows Subsystem for Linux (WSL) in Windows 10)
- dDocent (a bash wrapper to QC, assemble, map, and call SNPs from almost any kind of RAD sequencing)
- with conda installed in your Linux terminal:
# create a working directory
mkdir ddocentdir
cd ddocentdir
# add the bioconda channel
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
# install dDocent into a dedicated environment and activate it
conda create -n ddocent_env ddocent
source activate ddocent_env
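If the install worked, the dDocent executable should now be on your PATH; a quick way to check:
# optional sanity check
which dDocent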
Test Dataset
The SimRAD dataset contains DNA sequences from double digest restriction-site associated DNA (ddRAD). Download the dataset, unzip it, and view the contents:
curl -L -o data.zip https://www.dropbox.com/s/t09xjuudev4de72/data.zip?dl=0
unzip data.zip
ls -l
Create list of just barcodes
cut -f2 SimRAD.barcodes > barcodes
head barcodes
Install Stacks package with process_radtags
Download Stacks from http://catchenlab.life.illinois.edu/stacks/ to ddocentdir. Untar and install Stacks:
tar xfvz stacks-2.xx.tar.gz
cd stacks-2.xx
./configure --prefix=/home/joe/ddocentdir
make
sudo make install
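Because Stacks was configured with a custom --prefix, its binaries (including process_radtags) land in the bin directory under that prefix. If that directory is not already on your PATH, add it (adjust the path to match your own prefix):
# make the Stacks binaries findable; path matches the example prefix above
export PATH=/home/joe/ddocentdir/bin:$PATH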
Run process_radtags
process_radtags -1 SimRAD_R1.fastq.gz -2 SimRAD_R2.fastq.gz -b barcodes -e ecoRI --renz_2 mspI -r -i gzfastq
The option -e specifies the 5' restriction site and --renz_2 specifies the 3' restriction site. -i states the format of the input sequences. The -r option tells the program to rescue cut sites and barcodes that have up to 1-2 mutations in them.
When the program completes, ddocentdir should have several files that look like: sample_AAGAGG.1.fq.gz, sample_AAGAGG.2.fq.gz, sample_AAGAGG.rem.1.fq.gz, and sample_AAGAGG.rem.2.fq.gz.
The .rem.*.fq.gz files would normally contain reads that failed process_radtags (bad barcode, ambiguous cut sites), but this is simulated data with none of those bad reads, so we can delete them.
rm *rem*
The individual files are currently named only by barcode sequence. They can be renamed to an easier convention with the following bash script, which reads the sample names into one array and all barcodes into a second array, gets the length of both arrays, then iterates over them to rename the samples.
curl -L -O /~https://github.com/jpuritz/dDocent/raw/master/Rename_for_dDocent.sh
bash Rename_for_dDocent.sh SimRAD.barcodes
ls *.fq.gz
There should now be 40 individually labeled .F.fq.gz files and 40 .R.fq.gz files: twenty from PopA and twenty from PopB.
Assemble RAD data
Create a set of unique reads with counts for each individual
ls *.F.fq.gz > namelist
sed -i'' -e 's/.F.fq.gz//g' namelist
AWK1='BEGIN{P=1}{if(P==1||P==2){gsub(/^[@]/,">");print}; if(P==4)P=0; P++}'
AWK2='!/>/'
AWK3='!/NNN/'
PERLT='while (<>) {chomp; $z{$_}++;} while(($k,$v) = each(%z)) {print "$v\t$k\n";}'
cat namelist | parallel --no-notice -j 8 "zcat {}.F.fq.gz | mawk '$AWK1' | mawk '$AWK2' > {}.forward"
cat namelist | parallel --no-notice -j 8 "zcat {}.R.fq.gz | mawk '$AWK1' | mawk '$AWK2' > {}.reverse"
cat namelist | parallel --no-notice -j 8 "paste -d '-' {}.forward {}.reverse | mawk '$AWK3' | sed 's/-/NNNNNNNNNN/' | perl -e '$PERLT' > {}.uniq.seqs"
The first four lines set shell variables holding bits of AWK and Perl code, to make parallelization with GNU parallel easier. The first line after the variables creates a set of forward reads for each individual by using mawk (a faster implementation of awk) to sort through the fastq file and strip away the quality scores. The second line does the same for the PE reads. Lastly, the final line concatenates the forward and PE reads together (with 10 Ns between them), then finds the unique reads within that individual and counts the occurrences (coverage).
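To see what AWK1 and AWK2 are doing, here is a minimal sketch on a single made-up fastq record: AWK1 prints only the header and sequence lines of each 4-line record, rewriting the leading @ as >, and AWK2 then filters out the headers, leaving the bare sequence.
# illustrative only; the read name and sequence are hypothetical
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' | mawk "$AWK1" | mawk "$AWK2"
# output: ACGTACGT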
Sequences with very low coverage within an individual are likely to be sequencing errors, so for assembly we can eliminate reads with low copy numbers to remove non-informative data.
Sum up the number of unique reads at each within-individual coverage level in our data set: use mawk to query the first column, select data at or above each copy number from 2 to 20, and print the counts to a file.
cat *.uniq.seqs > uniq.seqs
for i in {2..20};
do
echo $i >> pfile
done
cat pfile | parallel --no-notice "echo -n {}xxx && mawk -v x={} '\$1 >= x' uniq.seqs | wc -l" | mawk '{gsub("xxx","\t",$0); print;}'| sort -g > uniqseq.data
Plot to terminal with gnuplot
gnuplot << \EOF
set terminal dumb size 120, 30
set autoscale
set xrange [2:20]
unset label
set title "Number of Unique Sequences with More than X Coverage (Counted within individuals)"
set xlabel "Coverage"
set ylabel "Number of Unique Sequences"
plot 'uniqseq.data' with lines notitle
pause -1
EOF
Choose a cutoff value that captures as much of the diversity of the data as possible while simultaneously eliminating sequences that are likely errors. Try 4.
parallel --no-notice mawk -v x=4 \''$1 >= x'\' ::: *.uniq.seqs | cut -f2 | perl -e 'while (<>) {chomp; $z{$_}++;} while (($k,$v) = each(%z)) {print "$v\t$k\n";}' > uniqCperindv
wc -l uniqCperindv
The data has been reduced to 7598 sequences, but it can be reduced further. Restrict the data by the number of different individuals a sequence appears in.
for ((i = 2; i <= 10; i++));
do
echo $i >> ufile
done
cat ufile | parallel --no-notice "echo -n {}xxx && mawk -v x={} '\$1 >= x' uniqCperindv | wc -l" | mawk '{gsub("xxx","\t",$0); print;}'| sort -g > uniqseq.peri.data
Plot the data
gnuplot << \EOF
set terminal dumb size 120, 30
set autoscale
unset label
set title "Number of Unique Sequences present in more than X Individuals"
set xlabel "Number of Individuals"
set ylabel "Number of Unique Sequences"
plot 'uniqseq.peri.data' with lines notitle
pause -1
EOF
Another cutoff value is chosen to capture maximum diversity in the data, while eliminating sequences with little value on the population scale. Try 4.
mawk -v x=4 '$1 >= x' uniqCperindv > uniq.k.4.c.4.seqs
wc -l uniq.k.4.c.4.seqs
The data has been reduced to 3840 sequences!
Convert the sequences back to fasta format. This reads the totaluniqseq file line by line and adds a sequence header of >Contig_X.
cut -f2 uniq.k.4.c.4.seqs > totaluniqseq
mawk '{c= c + 1; print ">Contig_" c "\n" $1}' totaluniqseq > uniq.fasta
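A quick sanity check: each record in uniq.fasta should be a >Contig_X header followed by its sequence, and the number of headers should match the sequence count from above.
head -4 uniq.fasta
# should match the 3840 sequences counted earlier
mawk '/>/' uniq.fasta | wc -l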
At this point, dDocent also checks for reads that have a substantial amount of Illumina adapter in them. Our data is simulated and does not contain adapter, so we'll skip that step for the time being.
Assemble Reference Contigs
Extract the forward reads by using sed to replace the 10-N separator with a tab character and then using cut to split out the forward reads.
sed -e 's/NNNNNNNNNN/\t/g' uniq.fasta | cut -f1 > uniq.F.fasta
Cluster all forward reads by 80% similarity.
cd-hit-est -i uniq.F.fasta -o xxx -c 0.8 -T 0 -M 0 -g 1
Convert the output of CD-HIT to match the output of the first phase of rainbow.
mawk '{if ($1 ~ /Cl/) clus = clus + 1; else print $3 "\t" clus}' xxx.clstr | sed 's/[>Contig_,...]//g' | sort -g -k1 > sort.contig.cluster.ids
paste sort.contig.cluster.ids totaluniqseq > contig.cluster.totaluniqseq
sort -k2,2 -g contig.cluster.totaluniqseq | sed -e 's/NNNNNNNNNN/\t/g' > rcluster
The output should follow a simple text format of:
Read_ID Cluster_ID Forward_Read Reverse_Read
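You can peek at the first records to confirm the layout (the IDs and sequences shown will be specific to your run):
head -2 rcluster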
Get the exact number of clusters
cut -f2 rcluster | uniq | wc -l
In this case, the number of clusters is 1000.
Split clusters into smaller clusters that represent significant variants
The first clustering step found RAD loci; now split the loci into alleles.
rainbow div -i rcluster -o rbdiv.out
The output of the div process is similar to the previous output, except that the second column is now the new divided cluster ID (numbered sequentially) and a column holding the original cluster ID has been added to the end of the file. The parameter -f controls the minimum frequency of an allele necessary to divide it into its own cluster. Since this data comes from multiple individuals, we raise it from the default of 0.2.
rainbow div -i rcluster -o rbdiv.out -f 0.5 -K 10
A parameter of interest to add here is -r, the minimum number of reads to assemble. The default is 5, which works well when assembling reads from a single individual. However, we are assembling a reduced data set, so there may be only one copy of a locus; therefore, it's more appropriate to use a cutoff of 2.
rainbow merge -o rbasm.out -a -i rbdiv.out -r 2
The rbasm output lists optimal and suboptimal contigs. Previous versions of dDocent used rainbow's included perl scripts to retrieve optimal contigs. However, as of version 2.0, dDocent uses customized AWK code to extract optimal contigs for RAD sequencing.
cat rbasm.out <(echo "E") |sed 's/[0-9]*:[0-9]*://g' | mawk ' {
if (NR == 1) e=$2;
else if ($1 ~/E/ && lenp > len1) {c=c+1; print ">dDocent_Contig_" e "\n" seq2 "NNNNNNNNNN" seq1; seq1=0; seq2=0;lenp=0;e=$2;fclus=0;len1=0;freqp=0;lenf=0}
else if ($1 ~/E/ && lenp <= len1) {c=c+1; print ">dDocent_Contig_" e "\n" seq1; seq1=0; seq2=0;lenp=0;e=$2;fclus=0;len1=0;freqp=0;lenf=0}
else if ($1 ~/C/) clus=$2;
else if ($1 ~/L/) len=$2;
else if ($1 ~/S/) seq=$2;
else if ($1 ~/N/) freq=$2;
else if ($1 ~/R/ && $0 ~/0/ && $0 !~/1/ && len > lenf) {seq1 = seq; fclus=clus;lenf=len}
else if ($1 ~/R/ && $0 ~/0/ && $0 ~/1/) {seq1 = seq; fclus=clus; len1=len}
else if ($1 ~/R/ && $0 !~/0/ && freq > freqp && len >= lenp || $1 ~/R/ && $0 !~/0/ && freq == freqp && len > lenp) {seq2 = seq; lenp = len; freqp=freq}
}' > rainbow.fasta
First, the script looks at all the contigs assembled for a cluster. If any contig contains both forward and PE reads, that contig is output as optimal. If no overlapping contig exists (the usual case for most RAD data sets), then the PE contig with the most assembled reads (the most common) is output joined to the forward read contig by a 10-N spacer. If two contigs have an equal number of reads, the longer contig is output.
Align and cluster data by sequence similarity with cd-hit.
cd-hit-est -i rainbow.fasta -o referenceRC.fasta -M 0 -T 0 -c 0.9
The -M and -T flags instruct the program on memory usage (-M) and number of threads (-T); setting either to 0 uses all available. The parameter of real significance is -c, which sets the percentage of sequence similarity used to group contigs. The code above uses 90%. Try using 95%, 85%, 80%, and 99%. Since this is simulated data, we know the real number of contigs: 1000. By choosing cutoffs of 4 and 4, we are able to recover the real number of contigs no matter the similarity cutoff.
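One way to try those similarity values without retyping the command each time is a small loop; the ref.$c.fasta output names here are just illustrative.
# cluster at several similarities and count the resulting contigs
for c in 0.80 0.85 0.90 0.95 0.99;
do
cd-hit-est -i rainbow.fasta -o ref.$c.fasta -M 0 -T 0 -c $c &> /dev/null
echo -n "c=$c contigs: "
mawk '/>/' ref.$c.fasta | wc -l
done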
In this example, it's easy to know the correct number of reference contigs, but with real data this is less obvious. As you just demonstrated, varying the unique sequence copy cutoff and the final clustering similarity have the largest effect on the number of final contigs. You could go back and retype all the steps above to explore the data, but scripting makes this easier. Place remake_reference.sh (in this repository) in ddocentdir, or whatever your working directory is.
Remake the reference by calling the script with new cutoff values and a clustering similarity.
bash remake_reference.sh 4 4 0.90 PE 2
This command will remake the reference with cutoffs of 4 (copies of a unique sequence within an individual) and 4 (individuals a sequence appears in) and a final clustering similarity of 90%. It will output the number of reference sequences and create a new, indexed reference with the given parameters. The output from the code above should be "1000". Experiment with some different values on your own. What you choose for a final number of contigs will be something of a judgement call. However, we could try to heuristically search the parameter space to find an optimal value.
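For example, a sketch that loops remake_reference.sh over a few cutoff pairs (holding similarity at 90%); each run prints its resulting number of reference sequences:
# explore a small grid of cutoff values
for k1 in 2 4 6;
do
for k2 in 2 4 6;
do
echo -n "K1=$k1 K2=$k2 -> "
bash remake_reference.sh $k1 $k2 0.90 PE 2
done
done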
Use this repository's ReferenceOpt.sh to automate this process. The script loops over an interval of cutoff values and over clustering similarity (c) values from 0.8 to 0.98, assembling a reference for each combination.
Run ReferenceOpt.sh. It takes a while to run (5-20 mins).
bash ReferenceOpt.sh 4 8 4 8 PE 16
You can see that the most common number of contigs across all iterations is 1000, but also that the top three occurring values and the average are all within 1% of the true value. Again, this is simulated data; with real data, the exact number of reference contigs is unknown and you will ultimately have to make a judgement call.
Examine the reference
bash remake_reference.sh 4 4 0.90 PE 2
head reference.fasta
>dDocent_Contig_1
NAATTCCTCCGACATTGTCGGCTTTAAATAGCTCATAACTTGAGCCCAGGTAAAGACTTTAGTATACTCGCACCTTCCGCTTATCCCCCGGCCGCNNNNNNNNNNATTCAACCGCGGGACCTGAACTAACATAGCGTTGTGTATACCATCCGAGGTAACCTTATAACTCTCTGCCATTCGGACAGGTAACACGGCATATCGTCCGN
>dDocent_Contig_2
NAATTCAGAATGGTCATACAGGGCGGTAGAATGGAATCCTGAATCGAATGGCGGTTGCATTGAGAACCTGGTACCAGATAGGATCTGGATTAAATNNNNNNNNNNGTCGGGTACTAATTATCTATTGGGTCCAAACCCTCCGCCCCGTTTACTGCCCACCCGGCATGCAGTCATGAGAATTCCAAGGAACTAAGATAAGAGACCGN
>dDocent_Contig_3
NAATTCGGGCTCCTTGGAGAGATTCTTTCAATTATGCCCCCTACGTGGGAAACAGGGTCGGAAGTGGTCGGCTGAGAATTACTCGAAAGCCGCTCNNNNNNNNNNCCACCAGCATGATAGGACTTCAAGCTTGCCGTTTGTTGGGAGGACCGGTCGCTACGGAGCTGACGCTATCTCCCGCATCGGACCTCGTGGACAAAAACCGN
>dDocent_Contig_4
NAATTCAAAAGTCGCCCATAGGTACGTGATGAATTAGGTCAAGCGGGGACGTCGCATAGATGCGTGACGTCTGGAGCATGATGTTGTTTCTAACCNNNNNNNNNNAATCACTCGGTCAACGTGGTCCGTGCTCTGCAACGAAAAAAACTTCGCATGTGAACGATGATGCCTATAGGTGCGACCGCCGTCAGAGGCCCGTTGACCGN
>dDocent_Contig_5
NAATTCATACGGATATGATACTTCGTCTGGCAGGGTGGCTAGCGAGTTTAAGGATTCTTGGATAAAGGTAGGTAAAATTCTCGAGATTCTGATCTNNNNNNNNNNTAGAGGTGCTGGCGGGGCCTAGACGTGTTTCTACGCTTACTGATCAAATTAGCTAGCTTAGGTTCCTATAGTCTACGCTGGATTGTCCTTAGATGCACCGN
You can now see that we have complete RAD fragments starting with our EcoRI restriction site (AATT), followed by R1, then a filler of 10 Ns, and then R2 ending with the MspI restriction site (CCG). The start and end of each sequence are buffered with a single N.
We can use simple shell commands to query this data. Find out how many lines are in the file (this is double the number of sequences):
wc -l reference.fasta
Find the number of sequences directly by counting lines that start with the header character ">":
mawk '/>/' reference.fasta | wc -l
We can test that all sequences follow the expected format.
mawk '/^NAATT.*N*.*CCGN$/' reference.fasta | wc -l
grep '^NAATT.*N*.*CCGN$' reference.fasta | wc -l
1000 it is!
Assemble references
Assemble references across cutoff values, then map 20 random samples and evaluate mappings to the reference, along with the number of contigs and coverage.
Run this repository's RefMapOpt.sh. It takes a while to run (5-20 mins).
bash RefMapOpt.sh 4 8 4 8 0.9 64 PE
This will loop across cutoffs of 4-8 using a similarity of 90% for clustering, parallelized across 64 processors, using the PE assembly technique.
The output is stored in a file called mapping.results
cat mapping.results
Cov Non0Cov Contigs MeanContigsMapped K1 K2 SUM_Mapped SUM_Properly Mean_Mapped Mean_Properly MisMatched
37.3382 39.6684 1000 942.25 4 4 747510 747342 37375.5 37367.1 0
37.4003 39.7343 1000 942.25 4 5 748753 748546 37437.7 37427.3 0
37.4625 39.7919 1000 942.45 4 6 749999 749874 37499.9 37493.7 0
37.4967 39.8282 1000 942.45 4 7 750685 750541 37534.2 37527.1 0
37.486 39.8169 1000 942.45 4 8 750469 750205 37523.4 37510.2 0
37.3517 39.6785 1000 942.35 5 4 747780 747612 37389 37380.6 0
37.4147 39.7454 1000 942.35 5 5 749042 748835 37452.1 37441.8 0
37.4701 39.7999 1000 942.45 5 6 750151 750009 37507.6 37500.4 0
37.4852 39.8161 1000 942.45 5 7 750453 750210 37522.7 37510.5 0
37.4551 39.7824 999 941.55 5 8 749102 748837 37455.1 37441.8 0
37.3561 39.6833 1000 942.35 6 4 747870 747731 37393.5 37386.6 0
37.453 39.7776 1000 942.55 6 5 749809 749734 37490.4 37486.7 0
37.4923 39.8193 1000 942.55 6 6 750595 750376 37529.8 37518.8 0
37.4784 39.8089 1000 942.45 6 7 750318 750075 37515.9 37503.8 0
37.4437 39.766 999 941.65 6 8 748874 748616 37443.7 37430.8 0
37.4013 39.7312 1000 942.35 7 4 748774 748698 37438.7 37434.9 0
37.4592 39.7907 1000 942.4 7 5 749934 749835 37496.7 37491.8 0
37.4682 39.7981 1000 942.45 7 6 750114 749897 37505.7 37494.8 0
37.4239 39.7468 1000 942.55 7 7 749227 748993 37461.3 37449.7 0
37.417 39.736 998 940.75 7 8 747591 747320 37379.6 37366 0
37.4413 39.761 1000 942.65 8 4 749575 749499 37478.8 37474.9 0
37.4492 39.7843 1000 942.3 8 5 749733 749562 37486.7 37478.1 0
37.4441 39.7711 998 940.6 8 6 748133 747888 37406.7 37394.4 0
37.4274 39.7517 997 939.7 8 7 747052 746779 37352.6 37338.9 0
37.5014 39.8269 989 932.25 8 8 742528 742279 37126.4 37113.9 0
The output contains the average coverage per contig, the average coverage per contig excluding zero-coverage contigs, the number of contigs, the mean number of contigs mapped, the two cutoff values used, the sum of all mapped reads, the sum of all properly mapped reads, the mean number of mapped reads, the mean number of properly mapped reads, and the number of reads mapped to mismatching contigs. Here, we are looking for values that maximize properly mapped reads, the mean number of contigs mapped, and coverage. In this example, it's easy: values 4,7 produce the highest number of properly mapped reads, coverage, and contigs. Real data will involve a judgement call.
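Because mapping.results is whitespace-delimited text, you can rank parameter pairs directly from the shell. For example, sorting numerically on the sum of properly mapped reads (SUM_Properly, field 8) puts the 4,7 pair at the top:
# skip the header line, then sort descending on field 8
tail -n +2 mapping.results | sort -k8,8 -gr | head -5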
Updated 5/7/21