ref sourmash
please post questions and comments and thoughts over at sourmash#1265
inspired by this sourmash issue: sourmash-bio/sourmash#1251
note: this requires sourmash 3.5.x or sourmash 4.x.
-
click on the 'raw' link below for sourmash-uniqify.
-
you can grab the script directly from the repo:
wget https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284/raw/5953c1025bab0c23e7a06f316fc17dbcec299720/sourmash-uniqify.py
- but this may not get the latest version...
- perhaps better, you can clone the repository: recommended
git clone https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284
- and/or or you can download the repository as a zip, but again, this may not get the latest version.
wget https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284/archive/5953c1025bab0c23e7a06f316fc17dbcec299720.zip
python3 sourmash-uniqify.py [ multiple filenames, directory paths, or sourmash databases ]
Optional arguments:
- --prefix - output prefix, e.g.
--prefix=output_dir/clustme
will put all output inoutput_dir/clustme.*
- --threshold - Jaccard similarity threshold above which two signatures will be clustered. default 0.2
sourmash-uniqify shuffles the sequences using a random number generator seeded with --seed
. by default, this is fixed at 1, so unless you change the seed you will always get the same output for the same input arguments. set --seed to 0 to change the seed each time.
you can specify -k/--ksize and --moltype to load specific types of signatures; default is k=31, DNA.
note, all signatures loaded must be compatible for comparison in terms of scaled/num; right now there is no way to pick just the scaled signatures, or just the num signatures.
output of this script consists of a founder file for each cluster, a cluster file for each non-singleton cluster, and a summary CSV.
the summary CSV output at '{prefix}.summary.csv' contains the following columns:
- 'origin_path' - the file or path from which the signature was loaded (corresponds to command-line arguments)
- 'name' - signature name, as generated by
sourmash sketch
. may be duplicate. - 'filename' - filename of source sequences, as generated by
sourmash sketch
. may be duplicate. - 'md5sum' - ~unique hexadecimal string for each signature. should be unique.
- 'cluster' - cluster number (integer).
- 'member_type' - either 'founder' or 'member'. If 'founder', this is the signature to which all others were clustered; if 'member', this is a duplicate signature to the founder.
for each cluster, there is a founder signature output with the name '{prefix}.cluster.{n}.founder.sig'. This is the founder signature for this cluster, to which all members of this cluster match with similarity >= threshold (as specified with --threshold).
for each cluster with more than one signature in it, there will be a cluster file containing 1 or more signatures, under the name '{prefix}.cluster.{n}.cluster.sig'.
These are the signatures who compare to the founder signature with similarity >= threshold (as specified with --threshold).
In brief, you should be able to:
- take a directory with a bunch of genome files
- run
sourmash sketch dna *
to turn them into sourmash signatures. - make a directory for the output:
mkdir output
- then run this script,
sourmash-uniqify.py
, on the signatures in the directory:sourmash-uniqify.py *.sig --prefix=output/myproj
- examine
output/myproj.summary.csv
To select out the "unique" set of representative genomes, then do:
mkdir unique
ln $(grep ',founder' output/myproj.summary.csv | cut -d, -f2) unique/
et voila!
this uses an n-squared algorithm so you probably don't want to run this on more than a few thousand signatures.
it's also kinda brute-force dumb, for simplicity of prototype implementation.
file an issue in the sourmash tracker if you'd like it to scale better.