containment check perf boost #320

Merged
merged 1 commit into tmorse-nt-compression from tmorse-containment-check
Feb 5, 2024

Conversation

morsecodist
Collaborator

The code comment says it all: this should speed things up at essentially no cost.

Contributor

@phoenixAja phoenixAja left a comment

looks good! thanks!

@phoenixAja phoenixAja merged commit aee5216 into tmorse-nt-compression Feb 5, 2024
11 checks passed
@phoenixAja phoenixAja deleted the tmorse-containment-check branch February 5, 2024 20:08
phoenixAja added a commit that referenced this pull request Feb 16, 2024
phoenixAja added a commit that referenced this pull request Feb 27, 2024
phoenixAja added a commit that referenced this pull request May 8, 2024
* first draft

* performance boost

* lint

* fix accessions call

* fix logging

* thread xp

* no threads

* filtering bugfix and sorting

* rewrite it in rust

* tests + lint

* restore step after reset

* fix double bytes

* create dir

* splitter bugfix

* remove split

* typo

* fix

* fix

* split speedup

* taxid drop

* no prot in compress nt

* mapping fixes

* file splitting

* remove proteins

* missing taxids

* versioning bugfix

* adding tests, restructuring files

* subsample test data

* cargo update

* add compressed test data

* remove intermediate file

* use temp file for testing

* reorganizing logging

* add some comments

* log when accession is being represented by another sequence

* log when sequence was found in tree

* skip_split_by_taxid

* logging in tsv, also output containment values

* also log containment for discarding from chunk

* pass in logging files as args

* restore index_generation.wdl to previous version

* notebook to check sequence retention

* small updates, test file WIP

* logging files

* optionally enable logging, adding some small test stuff

* split into files

* clean up imports

* add arg for temp dir path

* add ability to add protein sketches

* add step to compress NR, mock download for nt/nr, use old s3 path instead

* fix

* tweak

* wdl check

* install awscli via pip3

* fix accession_mapping path

* remove nt or nr after breaking apart into individual taxons

* remove NR for now

* remove nr compression defaults

* attempt to have logging, download mapping files from s3

* optionally enable protein compression and logging

* fix args

* fix

* fix paths to download taxid mappings from s3

* change paths

* gunzip mapping s3 dir

* change diamond naming
change diamond naming to include letter and sequence length

* swap sequence and letters in diamond file rename

* patch for accession2taxid mappings

* patch alignment config

* enable download of most recent ncbi index/mapping files

* optional fields for old nt/nr and mapping files

* comment out nr_info.marisa

* optionally compress nuc

* save paths from step outputs

* remove old test assets

* tell seqkit to use fewer threads to limit memory usage

* hardcode paths for now

* reduce memory usage again

* put seqkit sort into separate task, also break apart NT/NR to make sort less mem intensive

* pad files to the left with zeros

* nt -> nt_sorted

* refer to nt/nr as input names

* run seqkitSort with docker

* cat files together with longest sequences on top

* make outputs dir for individually sorted files

* script fix

* add logging for progress

* add logger

* break apart input fasta in rust instead of python

* append to sorted chunk

* do groupings of sequence length in parallel

* larger chunk size

* use mutex when writing to sorted output

* renaming

* set threads to parallel

* split fasta and sort as different steps

* fix

* wip

* chmod on passed in directory

* copy files out instead

* read from input dir only

* merge break apart and sort back into one step

* test for test_break_up_fasta_by_sequence_length

* fix

* seqkit sort not in parallel

* break up fasta with bin size

* remove seqkit sort

* add test data, make better test paths

* breaking things up

* break into subcommands

* make commands directory

* only process logging files if enabled sequence logging

* cleaned up test, create new test

* test data for split_accessions_by_taxid

* add commands.rs

* add break apart by taxid as its own step

* fix

* change break apart taxid

* remove cpu arg

* fix to break and sort

* change arg name

* small tweaks

* different read_by_taxid dirs for nuc and protein

* small fixes

* write taxons not found to 0

* add dir name for splitting apart larger taxons

* turn back on real reads count

* more logging when acquiring and writing to lock

* specify cpus for other tasks

* mkdir for split taxids

* add local disk constraints

* cpu resource usages that make more sense for the tasks

* move bin size to 50

* add back logging

* add limit for the amount of open file handles

* memory efficient sort (#310)

* memory efficient sort

* typo

* actions debug

* disk mapping

* null change

* email log

* email remove

* test email

* fix build

* fix temporary directory issue

* bloom set

* lint

* lint

* missing bugfix

* remove length binning

* fuse split and compress

* fix nr split

* good spot

* simplify

* syntax

* syntax

* typo

* fix set heuristic

* update memory and cpu limits

---------

Co-authored-by: Phoenix Logan <plogan@chanzuckerberg.com>

* parameter tweak

* naive search

* add command to count accessions in each file in a taxid dir and write output tsv

* fix test to use tempfile

* check if logging_enabled is specifically set to true

* memory efficient loc db (#315)



---------

Co-authored-by: Phoenix Logan <plogan@chanzuckerberg.com>

* NCBI Compress - parallel split by taxid (#314)

* similarity check rearrangement (#317)

* don't exceed threshold for the amount of open files allowed

* reduce number of open files to 2000

* revert back to non parallel split by taxid

* add back in parallel parts that were working

* add notebook for comparing alignment times between runs

* some changes in analysis notebooks to compare alignment times between two projects

* small fix

* clean up WDL syntax and fix info bug (#319)

* add back in split and sorted taxid dirs

* containment check perf boost (#320)

* shuffle fasta by seq index (#321)

* add shuffle to compress e2e command

* fix comment

* optionally skip creating downstream assets from NT and NR (#324)

* remove uploading sorted dirs to s3

* more chunks for minimap

* Revert "containment check perf boost (#320)" (#327)

This reverts commit aee5216.

* add db shuffle as downstream step from compress (#333)

* NT and NR download with blastdbcmd (#331)

* only grab unzipped NT and NR if they were provided as inputs

* fix return empty select_first

* remove logging, add more tests, splitting up larger functions

* remove unused code

* more cleanup

* more files helpful for analysis

* index generation cleanup

* clean up notebooks add comment on bloomset

* add cargo test to index-generation test

* fix spacings

* gha for cargo tests

* fix count test to not be dependent on ordering

* remove unneeded actions code

* Delete workflows/index-generation/helpful_for_analysis/check_seq_lengths.sh

* taxon lineage change log jupyter notebook

* taxon lineage change log jupyter notebook

* fix count accessions by taxid test

* github actions for cargo tests

* remove comment

* add some comments

* take out paths to check if job triggers

* change GHA paths

* add description for commands.rs

* add examples for querying the other trie structures

* Create README.md for index gen

* Create README.md for ncbi-compress

* Update README.md

* clean notebook

* comment out display of data

* Update README.md

* update readme with more descriptions

* update

* Create README.md

* update

* add lucidchart

* update

* taxon lineage changelog with no intermediate sample report downloads

---------

Co-authored-by: phoenixAja <plogan@chanzuckerberg.com>