add tax summarization dataclasses for safety and flexibility (#2439)

## Taxonomy Refactor Overview In an attempt to allow usage of NCBI taxid (motivation: CAMI benchmarking) and alternate hierarchical taxonomic ranks (motivation: LINS), I ended up refactoring the taxonomy code in a four-PR series. Taxonomic summarization results should not change. Minor caveat: I was previously obtaining `query_bp` in a hacky manner to allow gather <4.4 results. The class methods are more robust, and I'd like to stop supporting gather <4.4 results. To allow this, I had to add the `query_bp`, `ksize`, and `scaled` columns into some testing results to keep tests functioning. 1. #2437 modifies `LineagePair` from a two-item `collections.namedtuple` to a three-item `typing.NamedTuple` containing an additional field, `taxid`, for storing NCBI taxid information. It also introduces classes (`BaseLineageInfo`, `RankLineageInfo`), which move lineage manipulation (from `lca_utils.py`) to class methods in order to support robust summarization across compatible lineages (lineages of same hierarchical ranks). To ensure these can be used as dictionary keys, these classes are frozen. 2. #2439 introduces classes that facilitate reading, summarization, and writing of gather results. First, it updates three prior `collections.namedtuple`s to `dataclasses` used for storing information about the gather query (`QueryInfo`), summarized gather information for metagenome queries (`SummarizedGatherResult`) and classification information for genome queries (`ClassificationResult`). It introduces three new classes for reading and manipulating gather results. `GatherRow`, is used for reading a each row from a gather file and automatically checking for required columns. `TaxResult` is used for storing a single row from gather file, optionally (and ideally) with taxonomic information, stored as `LineageInfo` class from PR 1. `QueryTaxResult` is used for storing all `TaxResult`s associated with a single query. `QueryTaxResult` add methods to replicate the summarization previously done within `summarize_gather_at` in `tax_utils.py` and the classification thresholding in `genome` within `__main__.py`. 3. #2443 replaces the actual taxonomic summarization code in `tax/__main__.py` with code that uses the new classes. Modifies gather loading code to read using `GatherRow`, `TaxResult`, and `QueryTaxResult`. 4. #2446 removes old, unused functions that are rendered redundant by the new classes. Also removes associated tests. ## Additional details for this PR (#2439) 1. Renamed existing `namedtuple`s. This renaming allows me to introduce modified `dataclasses` (below) with the same names without breaking functioning code. This is temporary, as these are removed in #2443. - `QueryInfo` --> `QueryInf` - `SummarizedGatherResult` -> `SumGathInf` - `ClassificationResult` --> `ClassInf` 2. Update these `namedtuple`s to dataclasses to allow additional functionality - `QueryInfo` - `SummarizedGatherResult` - `ClassificationResult` These contain several advantages over the prior namedtuples. In particular, dataclass post-initialization methods allows type checking and casting string--> float or int as appropriate. I also added methods to these dataclasses to estimate ANI and produce formatted dictionaries for each output format. This pulls formatting into one place, rather than independent output-writing functions. 3. Add dataclasses for reading and manipulating gather results. - `GatherRow`, for reading a each row from a gather file - innately checks that all required gather colnames are present when loading each gather csv - `TaxResult` for storing a single row from gather file - get and store `LineageInfo` taxonomic information directly with each row - `get_match_lineage` and `get_ident` are now methods - tracks whether lineage matching was attempted, including if the ident was `missed` or intentionally `skipped`. We don't actually allow skipping from the cli (yet?), but it's enabled in all methods to preserve existing functionality. I think I was using this to skip identical/exact database matches. - `QueryTaxResult` for storing all `TaxResult`s associated with a single query - only allows summarization over gather results from same query - more robust/simplified tracking of results with missed & skipped taxonomic identifiers + perfect matches - use `LineageInfo` classes (within `TaxResult`) for lineage tracking + summarization, to get all benefits (optional `taxid`, any hierarchical ranks, etc) - contains methods to replicate the summarization previously done within `summarize_gather_at` in `tax_utils.py` and the classification thresholding in `genome` within `__main__.py`. Co-authored-by: C. Titus Brown <titus@idyll.org>
sourmash-bio · Jan 16, 2023 · 145d1ad · 145d1ad
1 parent ca57995
commit 145d1ad
Show file tree

Hide file tree

Showing 3 changed files with 1,963 additions and 92 deletions.
diff --git a/src/sourmash/tax/__main__.py b/src/sourmash/tax/__main__.py
@@ -13,7 +13,7 @@
 from sourmash.lca.lca_utils import display_lineage, zip_lineage
 
 from . import tax_utils
-from .tax_utils import ClassificationResult, MultiLineageDB
+from .tax_utils import ClassInf, MultiLineageDB
 
 usage='''
 sourmash taxonomy <command> [<args>] - manipulate/work with taxonomy information.
@@ -222,7 +222,7 @@ def genome(args):
                 notify(f"WARNING: classifying query {sg.query_name} at desired rank {args.rank} does not meet containment threshold {args.containment_threshold}")
             else:
                 status="match"
-            classif = ClassificationResult(sg.query_name, status, sg.rank, sg.fraction, sg.lineage, sg.query_md5, sg.query_filename, sg.f_weighted_at_rank, sg.bp_match_at_rank, sg.query_ani_at_rank)
+            classif = ClassInf(sg.query_name, status, sg.rank, sg.fraction, sg.lineage, sg.query_md5, sg.query_filename, sg.f_weighted_at_rank, sg.bp_match_at_rank, sg.query_ani_at_rank)
             classifications[args.rank].append(classif)
             matched_queries.add(sg.query_name)
             if "krona" in args.output_format:
@@ -251,7 +251,7 @@ def genome(args):
                 elif sg.fraction >= args.containment_threshold:
                     status = "match"
                 if status == "match":
-                    classif = ClassificationResult(query_name=sg.query_name, status=status, rank=sg.rank,
+                    classif = ClassInf(query_name=sg.query_name, status=status, rank=sg.rank,
                                                     fraction=sg.fraction, lineage=sg.lineage,
                                                     query_md5=sg.query_md5, query_filename=sg.query_filename,
                                                     f_weighted_at_rank=sg.f_weighted_at_rank, bp_match_at_rank=sg.bp_match_at_rank,
@@ -261,7 +261,7 @@ def genome(args):
                     continue
                 elif rank == "superkingdom" and status == "nomatch":
                     status="below_threshold"
-                    classif = ClassificationResult(query_name=sg.query_name, status=status,
+                    classif = ClassInf(query_name=sg.query_name, status=status,
                                                    rank="", fraction=0, lineage="",
                                                    query_md5=sg.query_md5, query_filename=sg.query_filename,
                                                    f_weighted_at_rank=sg.f_weighted_at_rank, bp_match_at_rank=sg.bp_match_at_rank,