Skip to content

Commit

Permalink
add tax summarization dataclasses for safety and flexibility (#2439)
Browse files Browse the repository at this point in the history
## Taxonomy Refactor Overview

In an attempt to allow usage of NCBI taxid (motivation: CAMI
benchmarking) and alternate hierarchical taxonomic ranks (motivation:
LINS), I ended up refactoring the taxonomy code in a four-PR series.
Taxonomic summarization results should not change. Minor caveat: I was
previously obtaining `query_bp` in a hacky manner to allow gather <4.4
results. The class methods are more robust, and I'd like to stop
supporting gather <4.4 results. To allow this, I had to add the
`query_bp`, `ksize`, and `scaled` columns into some testing results to
keep tests functioning.

1. #2437 modifies
`LineagePair` from a two-item `collections.namedtuple` to a three-item
`typing.NamedTuple` containing an additional field, `taxid`, for storing
NCBI taxid information. It also introduces classes (`BaseLineageInfo`,
`RankLineageInfo`), which move lineage manipulation (from
`lca_utils.py`) to class methods in order to support robust
summarization across compatible lineages (lineages of same hierarchical
ranks). To ensure these can be used as dictionary keys, these classes
are frozen.

2. #2439 introduces classes
that facilitate reading, summarization, and writing of gather results.
First, it updates three prior `collections.namedtuple`s to `dataclasses`
used for storing information about the gather query (`QueryInfo`),
summarized gather information for metagenome queries
(`SummarizedGatherResult`) and classification information for genome
queries (`ClassificationResult`). It introduces three new classes for
reading and manipulating gather results. `GatherRow`, is used for
reading a each row from a gather file and automatically checking for
required columns. `TaxResult` is used for storing a single row from
gather file, optionally (and ideally) with taxonomic information, stored
as `LineageInfo` class from PR 1. `QueryTaxResult` is used for storing
all `TaxResult`s associated with a single query. `QueryTaxResult` add
methods to replicate the summarization previously done within
`summarize_gather_at` in `tax_utils.py` and the classification
thresholding in `genome` within `__main__.py`.

3. #2443 replaces the
actual taxonomic summarization code in `tax/__main__.py` with code that
uses the new classes. Modifies gather loading code to read using
`GatherRow`, `TaxResult`, and `QueryTaxResult`.

4. #2446 removes old,
unused functions that are rendered redundant by the new classes. Also
removes associated tests.

## Additional details for this PR (#2439) 

1. Renamed existing `namedtuple`s. This renaming allows me to introduce
modified `dataclasses` (below) with the same names without breaking
functioning code. This is temporary, as these are removed in #2443.
- `QueryInfo` --> `QueryInf`
- `SummarizedGatherResult` -> `SumGathInf`
- `ClassificationResult` --> `ClassInf`

2. Update these `namedtuple`s to dataclasses to allow additional
functionality
- `QueryInfo`
- `SummarizedGatherResult`
- `ClassificationResult`

These contain several advantages over the prior namedtuples. In
particular, dataclass post-initialization methods allows type checking
and casting string--> float or int as appropriate. I also added methods
to these dataclasses to estimate ANI and produce formatted dictionaries
for each output format. This pulls formatting into one place, rather
than independent output-writing functions.

3. Add dataclasses for reading and manipulating gather results.
- `GatherRow`, for reading a each row from a gather file
- innately checks that all required gather colnames are present when
loading each gather csv
- `TaxResult` for storing a single row from gather file
- get and store `LineageInfo` taxonomic information directly with each
row
    - `get_match_lineage` and `get_ident` are now methods
- tracks whether lineage matching was attempted, including if the ident
was `missed` or intentionally `skipped`. We don't actually allow
skipping from the cli (yet?), but it's enabled in all methods to
preserve existing functionality. I think I was using this to skip
identical/exact database matches.
- `QueryTaxResult` for storing all `TaxResult`s associated with a single
query
  -  only allows summarization over gather results from same query
- more robust/simplified tracking of results with missed & skipped
taxonomic identifiers + perfect matches
- use `LineageInfo` classes (within `TaxResult`) for lineage tracking +
summarization, to get all benefits (optional `taxid`, any hierarchical
ranks, etc)
- contains methods to replicate the summarization previously done within
`summarize_gather_at` in `tax_utils.py` and the classification
thresholding in `genome` within `__main__.py`.

Co-authored-by: C. Titus Brown <titus@idyll.org>
  • Loading branch information
bluegenes and ctb authored Jan 16, 2023
1 parent ca57995 commit 145d1ad
Show file tree
Hide file tree
Showing 3 changed files with 1,963 additions and 92 deletions.
8 changes: 4 additions & 4 deletions src/sourmash/tax/__main__.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@
from sourmash.lca.lca_utils import display_lineage, zip_lineage

from . import tax_utils
from .tax_utils import ClassificationResult, MultiLineageDB
from .tax_utils import ClassInf, MultiLineageDB

usage='''
sourmash taxonomy <command> [<args>] - manipulate/work with taxonomy information.
Expand Down Expand Up @@ -222,7 +222,7 @@ def genome(args):
notify(f"WARNING: classifying query {sg.query_name} at desired rank {args.rank} does not meet containment threshold {args.containment_threshold}")
else:
status="match"
classif = ClassificationResult(sg.query_name, status, sg.rank, sg.fraction, sg.lineage, sg.query_md5, sg.query_filename, sg.f_weighted_at_rank, sg.bp_match_at_rank, sg.query_ani_at_rank)
classif = ClassInf(sg.query_name, status, sg.rank, sg.fraction, sg.lineage, sg.query_md5, sg.query_filename, sg.f_weighted_at_rank, sg.bp_match_at_rank, sg.query_ani_at_rank)
classifications[args.rank].append(classif)
matched_queries.add(sg.query_name)
if "krona" in args.output_format:
Expand Down Expand Up @@ -251,7 +251,7 @@ def genome(args):
elif sg.fraction >= args.containment_threshold:
status = "match"
if status == "match":
classif = ClassificationResult(query_name=sg.query_name, status=status, rank=sg.rank,
classif = ClassInf(query_name=sg.query_name, status=status, rank=sg.rank,
fraction=sg.fraction, lineage=sg.lineage,
query_md5=sg.query_md5, query_filename=sg.query_filename,
f_weighted_at_rank=sg.f_weighted_at_rank, bp_match_at_rank=sg.bp_match_at_rank,
Expand All @@ -261,7 +261,7 @@ def genome(args):
continue
elif rank == "superkingdom" and status == "nomatch":
status="below_threshold"
classif = ClassificationResult(query_name=sg.query_name, status=status,
classif = ClassInf(query_name=sg.query_name, status=status,
rank="", fraction=0, lineage="",
query_md5=sg.query_md5, query_filename=sg.query_filename,
f_weighted_at_rank=sg.f_weighted_at_rank, bp_match_at_rank=sg.bp_match_at_rank,
Expand Down
Loading

0 comments on commit 145d1ad

Please sign in to comment.