sourmash is our lab's implementation of an ultra-fast lightweight approach to nucleotide-level search and comparison, called MinHash.
You can read some background about MinHash sketches in this paper: Mash: fast genome and metagenome distance estimation using MinHash. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Genome Biol. 2016 Jun 20;17(1):132. doi: 10.1186/s13059-016-0997-x.
To install sourmash, run:
pip install /~https://github.com/dib-lab/sourmash/archive/2017-ucsc-metagenome.zip
(Note, we are installing from a development branch; many of the features below are not part of an official sourmash release yet. They should be included in sourmash 2.0.)
Use case: how much do two (or more!) unassembled metagenomes resemble each other?
Compute a scaled MinHash fingerprint from our reads:
mkdir ~/sourmash
cd ~/sourmash
sourmash compute --scaled 10000 ~/mapping/SRR*.pe.fq -k 21,31
Now, compare the two files at k=21:
sourmash compare SRR*.sig -k 21
or k=31:
sourmash compare SRR*.sig -k 31
Use case: how much of the read content is contained in the assembly?
Fingerprint the assembly:
sourmash compute --scaled 10000 -k 21,31 ~/mapping/subset_assembly.fa
and now evaluate containment, that is, what fraction of the read content is contained in the assembly:
sourmash search -k 21 SRR1976948.abundtrim.subset.pe.fq.sig \
subset_assembly.fa.sig --containment
and you should see:
1 matches; showing 3:
/home/titus/mapping/subset_assembly.fa 0.573 subset_assembly.fa.sig
Try the reverse - why is it bigger?
sourmash search -k 21 subset_assembly.fa.sig \
SRR1976948.abundtrim.subset.pe.fq.sig --threshold=0.0 --containment
what do you get if you do this with the other set of reads?
Adjust plotting (this is a bug in sourmash :) --
echo 'backend : Agg' > matplotlibrc
Do a comparison:
sourmash compare SRR*.sig subset*.sig -o comparison
and then plot:
sourmash plot --pdf comparison
which will produce a file comparison.matrix.pdf
and comparison.dendro.pdf
that you can grab view your Jupyter Notebook console.
Download and unpack the k=21 RefSeq index described in CTB's blog post:
curl -O http://spacegraphcats.ucdavis.edu.s3.amazonaws.com/microbe-sbt-k21-2016-11-27.tar.gz
tar xzf microbe-sbt-k21-2016-11-27.tar.gz
This produces a file microbes.sbt.json
and a whole bunch of hidden
files in the directory .sbt.microbes
. This is an index of about 60,000
microbial genomes from RefSeq.
Next, run the 'gather' command to see what's in there --
sourmash sbt_gather -k 21 microbes subset_assembly.fa.sig
and you should get:
Final composition (sorted by percent of original query):
p_query p_genome
2.4 14.3 NC_010003.1 Petrotoga mobilis SJ95, complete genome
0.5 1.6 NC_017934.1 Mesotoga prima MesG1.Ag.4.2, complete genome
3.0% (percent of query identified)
If you go to the SRA information for this project, you'll see that Petrotoga and Mesotoga are both in there - yay! Of course, in this case we're looking at largely unknown critters (3% at most is in genbank!) so we wouldn't expect many matches.