iterative greedy clustering of sourmash signatures

please post questions and comments and thoughts over at sourmash#1265

inspired by this sourmash issue: sourmash-bio/sourmash#1251

note: this requires sourmash 3.5.x or sourmash 4.x.

obtaining this script

click on the 'raw' link below for sourmash-uniqify.
you can grab the script directly from the repo:

wget https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284/raw/5953c1025bab0c23e7a06f316fc17dbcec299720/sourmash-uniqify.py

- but this may not get the latest version...

perhaps better, you can clone the repository: recommended

git clone https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284

and/or or you can download the repository as a zip, but again, this may not get the latest version.

wget https://gist.github.com/ctb/85fe2efb23b70bbd72ea6a69750d1284/archive/5953c1025bab0c23e7a06f316fc17dbcec299720.zip

usage:

python3 sourmash-uniqify.py [ multiple filenames, directory paths, or sourmash databases ]

Optional arguments:

--prefix - output prefix, e.g. --prefix=output_dir/clustme will put all output in output_dir/clustme.*
--threshold - Jaccard similarity threshold above which two signatures will be clustered. default 0.2

sourmash-uniqify shuffles the sequences using a random number generator seeded with --seed. by default, this is fixed at 1, so unless you change the seed you will always get the same output for the same input arguments. set --seed to 0 to change the seed each time.

you can specify -k/--ksize and --moltype to load specific types of signatures; default is k=31, DNA.

note, all signatures loaded must be compatible for comparison in terms of scaled/num; right now there is no way to pick just the scaled signatures, or just the num signatures.

output:

output of this script consists of a founder file for each cluster, a cluster file for each non-singleton cluster, and a summary CSV.

summary CSV

the summary CSV output at '{prefix}.summary.csv' contains the following columns:

'origin_path' - the file or path from which the signature was loaded (corresponds to command-line arguments)
'name' - signature name, as generated by sourmash sketch. may be duplicate.
'filename' - filename of source sequences, as generated by sourmash sketch. may be duplicate.
'md5sum' - ~unique hexadecimal string for each signature. should be unique.
'cluster' - cluster number (integer).
'member_type' - either 'founder' or 'member'. If 'founder', this is the signature to which all others were clustered; if 'member', this is a duplicate signature to the founder.

founder file

for each cluster, there is a founder signature output with the name '{prefix}.cluster.{n}.founder.sig'. This is the founder signature for this cluster, to which all members of this cluster match with similarity >= threshold (as specified with --threshold).

cluster file

for each cluster with more than one signature in it, there will be a cluster file containing 1 or more signatures, under the name '{prefix}.cluster.{n}.cluster.sig'.

These are the signatures who compare to the founder signature with similarity >= threshold (as specified with --threshold).

using this output to cluster actual genomes

In brief, you should be able to:

take a directory with a bunch of genome files
run sourmash sketch dna * to turn them into sourmash signatures.
make a directory for the output: mkdir output
then run this script, sourmash-uniqify.py, on the signatures in the directory: sourmash-uniqify.py *.sig --prefix=output/myproj
examine output/myproj.summary.csv

To select out the "unique" set of representative genomes, then do:

mkdir unique
ln $(grep ',founder' output/myproj.summary.csv | cut -d, -f2) unique/

et voila!

other notes

this uses an n-squared algorithm so you probably don't want to run this on more than a few thousand signatures.

it's also kinda brute-force dumb, for simplicity of prototype implementation.

file an issue in the sourmash tracker if you'd like it to scale better.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.gitignore		.gitignore
README.md		README.md
sourmash-uniqify.py		sourmash-uniqify.py
uniqify-genomes.py		uniqify-genomes.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

iterative greedy clustering of sourmash signatures

obtaining this script

usage:

output:

summary CSV

founder file

cluster file

using this output to cluster actual genomes

other notes

About

Languages

ctb/2022-sourmash-uniqify

Folders and files

Latest commit

History

Repository files navigation

iterative greedy clustering of sourmash signatures

obtaining this script

usage:

output:

summary CSV

founder file

cluster file

using this output to cluster actual genomes

other notes

About

Topics

Resources

Stars

Watchers

Forks

Languages