Skip to content

Greedy iterative clustering of genomes using sourmash

Notifications You must be signed in to change notification settings

ctb/2022-sourmash-uniqify

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 

Repository files navigation

iterative greedy clustering of sourmash signatures

ref sourmash

please post questions and comments and thoughts over at sourmash#1265

inspired by this sourmash issue: sourmash-bio/sourmash#1251

note: this requires sourmash 3.5.x or sourmash 4.x.

obtaining this script

  1. click on the 'raw' link below for sourmash-uniqify.

  2. you can grab the script directly from the repo:

wget https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284/raw/5953c1025bab0c23e7a06f316fc17dbcec299720/sourmash-uniqify.py

- but this may not get the latest version...

  1. perhaps better, you can clone the repository: recommended
git clone https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284
  1. and/or or you can download the repository as a zip, but again, this may not get the latest version.
wget https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284/archive/5953c1025bab0c23e7a06f316fc17dbcec299720.zip

usage:

python3 sourmash-uniqify.py [ multiple filenames, directory paths, or sourmash databases ]

Optional arguments:

  • --prefix - output prefix, e.g. --prefix=output_dir/clustme will put all output in output_dir/clustme.*
  • --threshold - Jaccard similarity threshold above which two signatures will be clustered. default 0.2

sourmash-uniqify shuffles the sequences using a random number generator seeded with --seed. by default, this is fixed at 1, so unless you change the seed you will always get the same output for the same input arguments. set --seed to 0 to change the seed each time.

you can specify -k/--ksize and --moltype to load specific types of signatures; default is k=31, DNA.

note, all signatures loaded must be compatible for comparison in terms of scaled/num; right now there is no way to pick just the scaled signatures, or just the num signatures.

output:

output of this script consists of a founder file for each cluster, a cluster file for each non-singleton cluster, and a summary CSV.

summary CSV

the summary CSV output at '{prefix}.summary.csv' contains the following columns:

  • 'origin_path' - the file or path from which the signature was loaded (corresponds to command-line arguments)
  • 'name' - signature name, as generated by sourmash sketch. may be duplicate.
  • 'filename' - filename of source sequences, as generated by sourmash sketch. may be duplicate.
  • 'md5sum' - ~unique hexadecimal string for each signature. should be unique.
  • 'cluster' - cluster number (integer).
  • 'member_type' - either 'founder' or 'member'. If 'founder', this is the signature to which all others were clustered; if 'member', this is a duplicate signature to the founder.

founder file

for each cluster, there is a founder signature output with the name '{prefix}.cluster.{n}.founder.sig'. This is the founder signature for this cluster, to which all members of this cluster match with similarity >= threshold (as specified with --threshold).

cluster file

for each cluster with more than one signature in it, there will be a cluster file containing 1 or more signatures, under the name '{prefix}.cluster.{n}.cluster.sig'.

These are the signatures who compare to the founder signature with similarity >= threshold (as specified with --threshold).

using this output to cluster actual genomes

In brief, you should be able to:

  • take a directory with a bunch of genome files
  • run sourmash sketch dna * to turn them into sourmash signatures.
  • make a directory for the output: mkdir output
  • then run this script, sourmash-uniqify.py, on the signatures in the directory: sourmash-uniqify.py *.sig --prefix=output/myproj
  • examine output/myproj.summary.csv

To select out the "unique" set of representative genomes, then do:

mkdir unique
ln $(grep ',founder' output/myproj.summary.csv | cut -d, -f2) unique/

et voila!

other notes

this uses an n-squared algorithm so you probably don't want to run this on more than a few thousand signatures.

it's also kinda brute-force dumb, for simplicity of prototype implementation.

file an issue in the sourmash tracker if you'd like it to scale better.

About

Greedy iterative clustering of genomes using sourmash

Topics

Resources

Stars

Watchers

Forks

Languages