add tax summarization dataclasses for safety and flexibility (#2439)
## Taxonomy Refactor Overview

In an attempt to allow usage of NCBI taxid (motivation: CAMI benchmarking) and alternate hierarchical taxonomic ranks (motivation: LINS), I ended up refactoring the taxonomy code in a four-PR series. Taxonomic summarization results should not change.

Minor caveat: I was previously obtaining `query_bp` in a hacky manner to allow gather <4.4 results. The class methods are more robust, and I'd like to stop supporting gather <4.4 results. To allow this, I had to add the `query_bp`, `ksize`, and `scaled` columns into some testing results to keep tests functioning.

1. #2437 modifies `LineagePair` from a two-item `collections.namedtuple` to a three-item `typing.NamedTuple` containing an additional field, `taxid`, for storing NCBI taxid information. It also introduces classes (`BaseLineageInfo`, `RankLineageInfo`), which move lineage manipulation (from `lca_utils.py`) to class methods in order to support robust summarization across compatible lineages (lineages of the same hierarchical ranks). To ensure these can be used as dictionary keys, these classes are frozen.
2. #2439 introduces classes that facilitate reading, summarization, and writing of gather results. First, it updates three prior `collections.namedtuple`s to `dataclasses`, used for storing information about the gather query (`QueryInfo`), summarized gather information for metagenome queries (`SummarizedGatherResult`), and classification information for genome queries (`ClassificationResult`). It also introduces three new classes for reading and manipulating gather results. `GatherRow` is used for reading each row from a gather file and automatically checking for required columns. `TaxResult` is used for storing a single row from a gather file, optionally (and ideally) with taxonomic information, stored as the `LineageInfo` class from PR 1. `QueryTaxResult` is used for storing all `TaxResult`s associated with a single query. `QueryTaxResult` adds methods to replicate the summarization previously done within `summarize_gather_at` in `tax_utils.py` and the classification thresholding in `genome` within `__main__.py`.
3. #2443 replaces the actual taxonomic summarization code in `tax/__main__.py` with code that uses the new classes. It modifies the gather loading code to read using `GatherRow`, `TaxResult`, and `QueryTaxResult`.
4. #2446 removes old, unused functions that are rendered redundant by the new classes. It also removes the associated tests.

## Additional details for this PR (#2439)

1. Renamed existing `namedtuple`s. This renaming allows me to introduce modified `dataclasses` (below) with the same names without breaking functioning code. This is temporary, as these are removed in #2443.
   - `QueryInfo` --> `QueryInf`
   - `SummarizedGatherResult` --> `SumGathInf`
   - `ClassificationResult` --> `ClassInf`
2. Update these `namedtuple`s to dataclasses to allow additional functionality.
   - `QueryInfo`
   - `SummarizedGatherResult`
   - `ClassificationResult`

   These have several advantages over the prior namedtuples. In particular, dataclass post-initialization methods allow type checking and casting string --> float or int as appropriate. I also added methods to these dataclasses to estimate ANI and produce formatted dictionaries for each output format. This pulls formatting into one place, rather than independent output-writing functions.
3. Add dataclasses for reading and manipulating gather results.
   - `GatherRow`, for reading each row from a gather file
     - innately checks that all required gather colnames are present when loading each gather csv
   - `TaxResult`, for storing a single row from a gather file
     - gets and stores `LineageInfo` taxonomic information directly with each row
     - `get_match_lineage` and `get_ident` are now methods
     - tracks whether lineage matching was attempted, including if the ident was `missed` or intentionally `skipped`. We don't actually allow skipping from the cli (yet?), but it's enabled in all methods to preserve existing functionality. I think I was using this to skip identical/exact database matches.
   - `QueryTaxResult`, for storing all `TaxResult`s associated with a single query
     - only allows summarization over gather results from the same query
     - more robust/simplified tracking of results with missed & skipped taxonomic identifiers + perfect matches
     - uses `LineageInfo` classes (within `TaxResult`) for lineage tracking + summarization, to get all benefits (optional `taxid`, any hierarchical ranks, etc)
     - contains methods to replicate the summarization previously done within `summarize_gather_at` in `tax_utils.py` and the classification thresholding in `genome` within `__main__.py`.

Co-authored-by: C. Titus Brown <titus@idyll.org>
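For readers unfamiliar with the pattern, here is a rough sketch of the two dataclass ideas described above: checking for required columns when a gather CSV is loaded, and casting string values to float or int in a post-initialization method. This is not the code from this PR. `GatherRowSketch` and `read_gather_rows` are hypothetical names, and the column subset is chosen only for illustration; the real `GatherRow` may differ in both fields and behavior.

```python
import csv
import io
from dataclasses import dataclass, fields


@dataclass
class GatherRowSketch:
    """Hypothetical stand-in for a gather row (illustrative column subset)."""
    query_name: str
    name: str
    f_unique_weighted: float
    unique_intersect_bp: int
    query_bp: int
    ksize: int
    scaled: int

    def __post_init__(self):
        # csv.DictReader yields strings, so cast numeric fields here,
        # once, instead of in every function that consumes a row.
        self.f_unique_weighted = float(self.f_unique_weighted)
        for attr in ("unique_intersect_bp", "query_bp", "ksize", "scaled"):
            setattr(self, attr, int(getattr(self, attr)))


def read_gather_rows(handle):
    """Read rows from an open CSV handle, failing early on missing columns."""
    reader = csv.DictReader(handle)
    required = [f.name for f in fields(GatherRowSketch)]
    present = reader.fieldnames or []
    missing = [col for col in required if col not in present]
    if missing:
        raise ValueError(f"gather file is missing required columns: {missing}")
    # Keep only the columns the sketch knows about; extra columns are ignored.
    return [GatherRowSketch(**{col: row[col] for col in required}) for row in reader]


example = io.StringIO(
    "query_name,name,f_unique_weighted,unique_intersect_bp,query_bp,ksize,scaled\n"
    "q1,genome A,0.25,2500,10000,31,1000\n"
)
rows = read_gather_rows(example)
print(rows[0].f_unique_weighted, rows[0].query_bp)
```

A file lacking `query_bp`, `ksize`, or `scaled` (as gather <4.4 output would) raises a `ValueError` in this sketch before any row is processed, which is the kind of early failure the column check is meant to provide.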