containment check perf boost #320
Merged
Conversation
phoenixAja approved these changes on Jan 26, 2024
looks good! thanks!
phoenixAja added a commit that referenced this pull request on Feb 16, 2024:
This reverts commit aee5216.
phoenixAja added a commit that referenced this pull request on Feb 27, 2024
phoenixAja added a commit that referenced this pull request on May 8, 2024, with the following commit message:
* first draft
* performance boost
* lint
* fix accessions call
* fix logging
* thread xp
* no threads
* filtering bugfix and sorting
* rewrite it in rust
* tests + lint
* restore step after reset
* fix double bytes
* create dir
* splitter bugfix
* remove split
* typo
* fix
* fix
* split speedup
* taxid drop
* no prot in compress nt
* mapping fixes
* file splitting
* remove proteins
* missing taxids
* versioning bugfix
* adding tests, restructuring files
* subsample test data
* cargo update
* add compressed test data
* remove intermediate file
* use temp file for testing
* reorganizing logging
* add some comments
* log when accession is being represented by another sequence
* log when sequence was found in tree
* skip_split_by_taxid
* logging in tsv, also output containment values
* also log containment for discarding from chunk
* pass in logging files as args
* restore index_generation.wdl to previous version
* notebook to check sequence retention
* small updates, test file WIP
* logging files
* optionally enable logging, adding some small test stuff
* split into files
* clean up imports
* add arg for temp dir path
* add ability to add protein sketches
* add step to compress NR, mock download for nt/nr, use old s3 path instead
* fix
* tweak
* wdl check
* install awscli via pip3
* fix accession_mapping path
* remove nt or nr after breaking apart into individual taxons
* remove NR for now
* remove nr compression defaults
* attempt to have logging, download mapping files from s3
* optionally enable protein compression and logging
* fix args
* fix
* fix paths to download taxid mappings from s3
* change paths
* gunzip mapping s3 dir
* change diamond naming to include letter and sequence length
* swap sequence and letters in diamond file rename
* patch for accession2taxid mappings
* patch alignment config
* enable download of most recent ncbi index/mapping files
* optional fields for old nt/nr and mapping files
* comment out nr_info.marisa
* optionally compress nuc
* save paths from step outputs
* remove old test assets
* tell seqkit to use fewer threads to limit memory usage
* hardcode paths for now
* reduce memory usage again
* put seqkit sort into separate task, also break apart NT/NR to make sort less mem intensive
* pad files to the left with zeros
* nt -> nt_sorted
* refer to nt/nr as input names
* run seqkitSort with docker
* cat files together with longest sequences on top
* make outputs dir for individually sorted files
* script fix
* add logging for progress
* add logger
* break apart input fasta in rust instead of python
* append to sorted chunk
* do groupings of sequence length in parallel
* larger chunk size
* use mutex when writing to sorted output
* renaming
* set threads to parallel
* split fasta and sort as different steps
* fix
* wip
* chmod on passed in directory
* copy files out instead
* read from input dir only
* merge break apart and sort back into one step
* test for test_break_up_fasta_by_sequence_length
* fix
* seqkit sort not in parallel
* break up fasta with bin size
* remove seqkit sort
* add test data, make better test paths
* breaking things up
* break into subcommands
* make commands directory
* only process logging files if sequence logging is enabled
* cleaned up test, create new test
* test data for split_accessions_by_taxid
* add commands.rs
* add break apart by taxid as its own step
* fix
* change break apart taxid
* remove cpu arg
* fix to break and sort
* change arg name
* small tweaks
* different read_by_taxid dirs for nuc and protein
* small fixes
* write taxons not found to 0
* add dir name for splitting apart larger taxons
* turn back on real reads count
* more logging when acquiring and writing to lock
* specify cpus for other tasks
* mkdir for split taxids
* add local disk constraints
* cpu resource usages that make more sense for the tasks
* move bin size to 50
* add back logging
* add limit for the amount of open file handles
* memory efficient sort (#310)
* memory efficient sort
* typo
* actions debug
* disk mapping
* null change
* email log
* email remove
* test email
* fix build
* fix temporary directory issue
* bloom set
* lint
* lint
* missing bugfix
* remove length binning
* fuse split and compress
* fix nr split
* good spot
* simplify
* syntax
* syntax
* typo
* fix set heuristic
* update memory and cpu limits
---------
Co-authored-by: Phoenix Logan <plogan@chanzuckerberg.com>
* parameter tweak
* naive search
* add command to count accessions in each file in a taxid dir and write output tsv
* fix test to use tempfile
* check if logging_enabled is specifically set to true
* memory efficient loc db (#315)
---------
Co-authored-by: Phoenix Logan <plogan@chanzuckerberg.com>
* NCBI Compress - parallel split by taxid (#314)
* similarity check rearrangement (#317)
* don't exceed threshold for the amount of open files allowed
* reduce number of open files to 2000
* revert back to non parallel split by taxid
* add back in parallel parts that were working
* add notebook for comparing alignment times between runs
* some changes in analysis notebooks to compare alignment times between two projects
* small fix
* clean up WDL syntax and fix info bug (#319)
* add back in split and sorted taxid dirs
* containment check perf boost (#320)
* shuffle fasta by seq index (#321)
* add shuffle to compress e2e command
* fix comment
* optionally skip creating downstream assets from NT and NR (#324)
* remove uploading sorted dirs to s3
* more chunks for minimap
* Revert "containment check perf boost (#320)" (#327). This reverts commit aee5216.
* add db shuffle as downstream step from compress (#333)
* NT and NR download with blastdbcmd (#331)
* only grab unzipped NT and NR if they were provided as inputs
* fix return empty select_first
* remove logging, add more tests, splitting up larger functions
* remove unused code
* more cleanup
* more files helpful for analysis
* index generation cleanup
* clean up notebooks, add comment on bloomset
* add cargo test to index-generation test
* fix spacings
* gha for cargo tests
* fix count test to not be dependent on ordering
* remove unneeded actions code
* Delete workflows/index-generation/helpful_for_analysis/check_seq_lengths.sh
* taxon lineage change log jupyter notebook
* taxon lineage change log jupyter notebook
* fix count accessions by taxid test
* github actions for cargo tests
* remove comment
* add some comments
* take out paths to check if job triggers
* change GHA paths
* add description for commands.rs
* add examples for querying the other trie structures
* Create README.md for index gen
* Create README.md for ncbi-compress
* Update README.md
* clean notebook
* comment out display of data
* Update README.md
* update readme with more descriptions
* update
* Create README.md
* update
* add lucidchart
* update
* taxon lineage changelog with no intermediate sample report downloads
---------
Co-authored-by: phoenixAja <plogan@chanzuckerberg.com>
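Several steps in the log above revolve around breaking the input FASTA apart by sequence length before sorting ("break up fasta with bin size", "pad files to the left with zeros", "move bin size to 50"). The following is a minimal, hypothetical sketch of that kind of step in Rust using only the standard library; the function names and file-naming scheme are illustrative assumptions and are not taken from this repository's code.

```rust
// Hypothetical sketch only -- not the repository's actual implementation.
// Splits a FASTA file into per-length-bin files with zero-padded bin indices,
// e.g. a 1,230 bp record with bin_size = 50 goes to out_dir/bin_00024.fasta.
use std::fs::{File, OpenOptions};
use std::io::{BufRead, BufReader, BufWriter, Write};
use std::path::Path;

fn write_record(out_dir: &Path, bin_size: usize, header: &str, seq: &str) -> std::io::Result<()> {
    if header.is_empty() {
        return Ok(()); // nothing buffered yet
    }
    let bin = seq.len() / bin_size;
    let path = out_dir.join(format!("bin_{:05}.fasta", bin));
    // Append so records from the whole input accumulate in their bin file.
    let file = OpenOptions::new().create(true).append(true).open(path)?;
    let mut out = BufWriter::new(file);
    writeln!(out, "{}", header)?;
    writeln!(out, "{}", seq)
}

fn split_fasta_by_length(input: &Path, out_dir: &Path, bin_size: usize) -> std::io::Result<()> {
    let reader = BufReader::new(File::open(input)?);
    let (mut header, mut seq) = (String::new(), String::new());
    for line in reader.lines() {
        let line = line?;
        if line.starts_with('>') {
            // A new header means the previous record is complete.
            write_record(out_dir, bin_size, &header, &seq)?;
            header = line;
            seq.clear();
        } else {
            seq.push_str(line.trim_end());
        }
    }
    write_record(out_dir, bin_size, &header, &seq) // flush the last record
}
```

Writing record by record keeps memory flat regardless of input size; a real implementation would likely cache per-bin writers rather than reopening files, and per the log the length binning was later removed in favor of a different approach.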
Comment says it all: this should speed things up at essentially no cost.
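The change itself makes the containment check cheaper (it was later reverted in #327, per the commit log above). As a rough, hypothetical illustration of the kind of shortcut that can speed up a thresholded containment check without changing its answer, the sketch below stops scanning as soon as the outcome is decided; the function name, types, and logic are assumptions for illustration, not the actual code from this PR.

```rust
use std::collections::HashSet;

/// Hypothetical illustration (not this PR's actual change): decide whether the
/// containment of `query` in `reference` meets `threshold` without always
/// scanning every hash. We exit early once enough matches have been seen, or
/// once the remaining hashes can no longer reach the threshold.
fn containment_meets_threshold(query: &[u64], reference: &HashSet<u64>, threshold: f64) -> bool {
    let needed = (threshold * query.len() as f64).ceil() as usize;
    let mut matched = 0usize;
    for (i, hash) in query.iter().enumerate() {
        if reference.contains(hash) {
            matched += 1;
            if matched >= needed {
                return true; // threshold already met; no need to keep counting
            }
        }
        // Even if every remaining hash matched, the threshold is out of reach.
        let remaining = query.len() - i - 1;
        if matched + remaining < needed {
            return false;
        }
    }
    matched >= needed
}
```

Because this returns the same yes/no answer as computing the full containment ratio |query ∩ reference| / |query| and comparing it to the threshold, the early exits change only the running time, which matches the "no cost" framing; the real implementation may look nothing like this sketch.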