--David B. Stern, Ph.D.--
The pipeline was developed to process Illumina sequencing data generated with CD Genomics' HPV Capture Kit.
The scripts directory contains bash and R scripts to process the data.
The ref directory contains a fasta file of papillomavirus reference genomes from PaVE, including non-reference genomes. The file is borrowed from HPV-EM, which provides nicely reformatted sequence names. This reference fasta needs to be indexed with Bowtie 2 before mapping.
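
Indexing the reference could look like the sketch below. The fasta name `ref/pave_genomes.fasta` is a placeholder, not the actual file name in the ref directory, and using the fasta path without its extension as the index prefix is an assumption. Check which index path the scripts expect before relying on it.

```bash
#!/usr/bin/env bash
# Hypothetical file name; substitute the actual fasta in the ref directory.
REF_FASTA="ref/pave_genomes.fasta"

# Derive an index prefix by stripping the final extension (an assumed convention).
index_prefix() {
    printf '%s\n' "${1%.*}"
}

PREFIX="$(index_prefix "$REF_FASTA")"

# Build the index only if Bowtie 2 is installed and the fasta exists;
# otherwise just print the command that would be run.
if command -v bowtie2-build >/dev/null 2>&1 && [ -f "$REF_FASTA" ]; then
    bowtie2-build "$REF_FASTA" "$PREFIX"
else
    echo "Would run: bowtie2-build $REF_FASTA $PREFIX"
fi
```

The resulting prefix is what Bowtie 2 takes through its `-x` option at mapping time.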
All scripts rely on a file called files.txt, which contains the names of all the samples to be processed.
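
As an illustration of how an array job can pick its sample from such a list, here is a minimal sketch. It assumes one sample name per line and that each task reads the line matching its task index; the actual scripts may do this differently. The helper `sample_for_task` and the sample names are made up for the example.

```bash
#!/usr/bin/env bash
# Return the sample name on line N of a sample list (one name per line, assumed).
sample_for_task() {
    local list="$1" task="$2"
    sed -n "${task}p" "$list"
}

# Demonstration with a throwaway list; these sample names are made up.
demo_list="$(mktemp)"
printf '%s\n' sampleA sampleB sampleC > "$demo_list"

# UGE sets SGE_TASK_ID inside an array job; default to 1 when run by hand.
SAMPLE="$(sample_for_task "$demo_list" "${SGE_TASK_ID:-1}")"
echo "Task ${SGE_TASK_ID:-1} would process: $SAMPLE"
```

Under this assumption, an array job would be submitted with one task per line of files.txt, for example with `qsub -t 1-N`, where N is the number of samples.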
- clean_map_abundance.sh: UGE array script to clean reads with bbduk, map reads to the reference genomes with Bowtie 2, and estimate the relative abundance of each genotype using msamtools.
- stats.sh: Collects coverage and pairwise ID statistics from the bam files using awk and NanoStat, and generates coverage plots using collect_stats.R. Should be run from the directory with the bam files. Be sure to check paths for reference fasta, index, and collect_stats.R.
- collect_stats.R: R script to generate table of statistics and coverage plots. Run automatically with stats.sh. Requires the tidyverse R package.
- merge_msamtools_stats.R: R script to merge the output of msamtools and collect_stats.R for each sample.
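
To give a feel for the kind of awk-based coverage summary that stats.sh collects, here is a minimal sketch. It is not the code in stats.sh. It assumes a three-column, tab-separated per-base depth table (reference, position, depth) such as the one `samtools depth -a` prints, and the function name `coverage_summary` and the demonstration values are hypothetical.

```bash
#!/usr/bin/env bash
# Summarize per-base depth (reference, position, depth) for each reference.
# Prints: reference, positions seen, mean depth, fraction of positions with depth > 0.
coverage_summary() {
    awk -F'\t' '
        { n[$1]++; sum[$1] += $3; if ($3 > 0) cov[$1]++ }
        END {
            for (r in n)
                printf "%s\t%d\t%.2f\t%.2f\n", r, n[r], sum[r] / n[r], cov[r] / n[r]
        }
    ' | sort
}

# Made-up depth table, for demonstration only.
printf 'HPV16\t1\t10\nHPV16\t2\t0\nHPV16\t3\t20\nHPV18\t1\t4\nHPV18\t2\t6\n' \
    | coverage_summary
```

With real data the function could be fed directly, for example `samtools depth -a sample.bam | coverage_summary`, where `sample.bam` stands in for one of the bam files.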