Skip to content

Latest commit

 

History

History
107 lines (69 loc) · 4.19 KB

README.md

File metadata and controls

107 lines (69 loc) · 4.19 KB

iterative greedy clustering of sourmash signatures

ref sourmash

please post questions and comments and thoughts over at sourmash#1265

inspired by this sourmash issue: sourmash-bio/sourmash#1251

note: this requires sourmash 3.5.x or sourmash 4.x.

obtaining this script

  1. click on the 'raw' link below for sourmash-uniqify.

  2. you can grab the script directly from the repo:

wget https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284/raw/5953c1025bab0c23e7a06f316fc17dbcec299720/sourmash-uniqify.py

- but this may not get the latest version...

  1. perhaps better, you can clone the repository: recommended
git clone https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284
  1. and/or or you can download the repository as a zip, but again, this may not get the latest version.
wget https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284/archive/5953c1025bab0c23e7a06f316fc17dbcec299720.zip

usage:

python3 sourmash-uniqify.py [ multiple filenames, directory paths, or sourmash databases ]

Optional arguments:

  • --prefix - output prefix, e.g. --prefix=output_dir/clustme will put all output in output_dir/clustme.*
  • --threshold - Jaccard similarity threshold above which two signatures will be clustered. default 0.2

sourmash-uniqify shuffles the sequences using a random number generator seeded with --seed. by default, this is fixed at 1, so unless you change the seed you will always get the same output for the same input arguments. set --seed to 0 to change the seed each time.

you can specify -k/--ksize and --moltype to load specific types of signatures; default is k=31, DNA.

note, all signatures loaded must be compatible for comparison in terms of scaled/num; right now there is no way to pick just the scaled signatures, or just the num signatures.

output:

output of this script consists of a founder file for each cluster, a cluster file for each non-singleton cluster, and a summary CSV.

summary CSV

the summary CSV output at '{prefix}.summary.csv' contains the following columns:

  • 'origin_path' - the file or path from which the signature was loaded (corresponds to command-line arguments)
  • 'name' - signature name, as generated by sourmash sketch. may be duplicate.
  • 'filename' - filename of source sequences, as generated by sourmash sketch. may be duplicate.
  • 'md5sum' - ~unique hexadecimal string for each signature. should be unique.
  • 'cluster' - cluster number (integer).
  • 'member_type' - either 'founder' or 'member'. If 'founder', this is the signature to which all others were clustered; if 'member', this is a duplicate signature to the founder.

founder file

for each cluster, there is a founder signature output with the name '{prefix}.cluster.{n}.founder.sig'. This is the founder signature for this cluster, to which all members of this cluster match with similarity >= threshold (as specified with --threshold).

cluster file

for each cluster with more than one signature in it, there will be a cluster file containing 1 or more signatures, under the name '{prefix}.cluster.{n}.cluster.sig'.

These are the signatures who compare to the founder signature with similarity >= threshold (as specified with --threshold).

using this output to cluster actual genomes

In brief, you should be able to:

  • take a directory with a bunch of genome files
  • run sourmash sketch dna * to turn them into sourmash signatures.
  • make a directory for the output: mkdir output
  • then run this script, sourmash-uniqify.py, on the signatures in the directory: sourmash-uniqify.py *.sig --prefix=output/myproj
  • examine output/myproj.summary.csv

To select out the "unique" set of representative genomes, then do:

mkdir unique
ln $(grep ',founder' output/myproj.summary.csv | cut -d, -f2) unique/

et voila!

other notes

this uses an n-squared algorithm so you probably don't want to run this on more than a few thousand signatures.

it's also kinda brute-force dumb, for simplicity of prototype implementation.

file an issue in the sourmash tracker if you'd like it to scale better.