Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use new tax classes for taxonomic summarization #2443

Merged
merged 51 commits into from
Feb 8, 2023
Merged

Conversation

bluegenes
Copy link
Contributor

@bluegenes bluegenes commented Jan 10, 2023

Taxonomy Refactor Overview

In an attempt to allow usage of NCBI taxid (motivation: CAMI benchmarking) and alternate hierarchical taxonomic ranks (motivation: LINS), I ended up refactoring the taxonomy code in a four-PR series. Taxonomic summarization results should not change. Minor caveat: I was previously obtaining query_bp in a hacky manner to allow gather <4.4 results. The class methods are more robust, and I'd like to stop supporting gather <4.4 results. To allow this, I had to add the query_bp, ksize, and scaled columns into some testing results to keep tests functioning.

  1. Add new classes for lineage manipulation #2437 modifies LineagePair from a two-item collections.namedtuple to a three-item typing.NamedTuple containing an additional field, taxid, for storing NCBI taxid information. It also introduces classes (BaseLineageInfo, RankLineageInfo), which move lineage manipulation (from lca_utils.py) to class methods in order to support robust summarization across compatible lineages (lineages of same hierarchical ranks). To ensure these can be used as dictionary keys, these classes are frozen.

  2. add tax summarization dataclasses for safety and flexibility #2439 introduces classes that facilitate reading, summarization, and writing of gather results. First, it updates three prior collections.namedtuples to dataclasses used for storing information about the gather query (QueryInfo), summarized gather information for metagenome queries (SummarizedGatherResult) and classification information for genome queries (ClassificationResult). It introduces three new classes for reading and manipulating gather results. GatherRow, is used for reading a each row from a gather file and automatically checking for required columns. TaxResult is used for storing a single row from gather file, optionally (and ideally) with taxonomic information, stored as LineageInfo class from PR 1. QueryTaxResult is used for storing all TaxResults associated with a single query. QueryTaxResult add methods to replicate the summarization previously done within summarize_gather_at in tax_utils.py and the classification thresholding in genome within __main__.py.

  3. Use new tax classes for taxonomic summarization #2443 replaces the actual taxonomic summarization code in tax/__main__.py with code that uses the new classes. Modifies gather loading code to read using GatherRow, TaxResult, and QueryTaxResult.

  4. cleanup taxonomy code after refactor #2446 removes old, unused functions that are rendered redundant by the new classes. Also removes associated tests.

Additional details for this PR (#2443)

Make use of LineagePair, LineageInfo classes (#2371) and GatherRow, SummarizedGatherResult, ClassificationResult, TaxResult, and QueryTaxResult classes (#2439) to do taxonomic summarization.

Replaces old code where it's easy to see changes, leaves larger cleanup for #2446 where it might make this PR confusing.

Benefits of this refactor: safer/more robust taxonomic summarization

  • innately checks that required gather columns are present when loading each gather csv
  • only allows summarization over gather results from same query, stores all info for a given query together.
  • only allows summarization/comparison between compatible lineages (same hierarchical ranks)
  • more robust/simplified tracking of results with missed & skipped taxonomic identifiers + perfect matches
  • where multi-query outputs would be misleading, prevent users from producing these outputs when summarizing multiple queries
  • enables future additions:
    • NCBI taxid can be included and reported during summarization (motivating reason: CAMI)
    • Alternate rank systems should work with all the functions tested for named ranks (motivating reason: LINS)

Neutral changes:

  • Some error messages changed slightly, because information was stored slightly differently/was differently available
  • requires sourmash gather >4.4, which contains columns needed for robust query compatibility checks
  • ANI is estimated by default, since we now require all columns needed for ANI estimation
  • added ksize, scaled columns to test csv files, where needed
  • fixed minor error introduced in [EXP] make lineage utils more flexible #2371, where an extra 0 was added in the query_bp column of test1.gather.csv

@codecov
Copy link

codecov bot commented Jan 10, 2023

Codecov Report

Merging #2443 (1a1bb1a) into latest (33bd439) will decrease coverage by 0.11%.
The diff coverage is 97.71%.

@@            Coverage Diff             @@
##           latest    #2443      +/-   ##
==========================================
- Coverage   84.67%   84.57%   -0.11%     
==========================================
  Files         132      132              
  Lines       15670    15484     -186     
  Branches     2558     2507      -51     
==========================================
- Hits        13269    13096     -173     
+ Misses       2093     2084       -9     
+ Partials      308      304       -4     
Flag Coverage Δ
python 92.54% <97.71%> (-0.01%) ⬇️
rust 57.79% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/tax/__main__.py 93.77% <93.65%> (+0.08%) ⬆️
src/sourmash/tax/tax_utils.py 98.07% <99.00%> (+0.63%) ⬆️
src/sourmash/lca/lca_utils.py 92.37% <0.00%> (-0.85%) ⬇️
src/sourmash/cli/tax/metagenome.py 93.33% <0.00%> (+3.33%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@bluegenes bluegenes marked this pull request as ready for review January 13, 2023 19:54
@bluegenes
Copy link
Contributor Author

bluegenes commented Jan 13, 2023

ok @ctb, here's the PR where I actually use #2371 and #2439 -- ready for review.

Tests passing, but I'm somehow missing codecov target - is this because of deletions/refactoring?

Base automatically changed from upd-tax-summarization to latest January 16, 2023 17:48
bluegenes added a commit that referenced this pull request Jan 16, 2023
## Taxonomy Refactor Overview

In an attempt to allow usage of NCBI taxid (motivation: CAMI
benchmarking) and alternate hierarchical taxonomic ranks (motivation:
LINS), I ended up refactoring the taxonomy code in a four-PR series.
Taxonomic summarization results should not change. Minor caveat: I was
previously obtaining `query_bp` in a hacky manner to allow gather <4.4
results. The class methods are more robust, and I'd like to stop
supporting gather <4.4 results. To allow this, I had to add the
`query_bp`, `ksize`, and `scaled` columns into some testing results to
keep tests functioning.

1. #2437 modifies
`LineagePair` from a two-item `collections.namedtuple` to a three-item
`typing.NamedTuple` containing an additional field, `taxid`, for storing
NCBI taxid information. It also introduces classes (`BaseLineageInfo`,
`RankLineageInfo`), which move lineage manipulation (from
`lca_utils.py`) to class methods in order to support robust
summarization across compatible lineages (lineages of same hierarchical
ranks). To ensure these can be used as dictionary keys, these classes
are frozen.

2. #2439 introduces classes
that facilitate reading, summarization, and writing of gather results.
First, it updates three prior `collections.namedtuple`s to `dataclasses`
used for storing information about the gather query (`QueryInfo`),
summarized gather information for metagenome queries
(`SummarizedGatherResult`) and classification information for genome
queries (`ClassificationResult`). It introduces three new classes for
reading and manipulating gather results. `GatherRow`, is used for
reading a each row from a gather file and automatically checking for
required columns. `TaxResult` is used for storing a single row from
gather file, optionally (and ideally) with taxonomic information, stored
as `LineageInfo` class from PR 1. `QueryTaxResult` is used for storing
all `TaxResult`s associated with a single query. `QueryTaxResult` add
methods to replicate the summarization previously done within
`summarize_gather_at` in `tax_utils.py` and the classification
thresholding in `genome` within `__main__.py`.

3. #2443 replaces the
actual taxonomic summarization code in `tax/__main__.py` with code that
uses the new classes. Modifies gather loading code to read using
`GatherRow`, `TaxResult`, and `QueryTaxResult`.

4. #2446 removes old,
unused functions that are rendered redundant by the new classes. Also
removes associated tests.

## Additional details for this PR (#2439) 

1. Renamed existing `namedtuple`s. This renaming allows me to introduce
modified `dataclasses` (below) with the same names without breaking
functioning code. This is temporary, as these are removed in #2443.
- `QueryInfo` --> `QueryInf`
- `SummarizedGatherResult` -> `SumGathInf`
- `ClassificationResult` --> `ClassInf`

2. Update these `namedtuple`s to dataclasses to allow additional
functionality
- `QueryInfo`
- `SummarizedGatherResult`
- `ClassificationResult`

These contain several advantages over the prior namedtuples. In
particular, dataclass post-initialization methods allows type checking
and casting string--> float or int as appropriate. I also added methods
to these dataclasses to estimate ANI and produce formatted dictionaries
for each output format. This pulls formatting into one place, rather
than independent output-writing functions.

3. Add dataclasses for reading and manipulating gather results.
- `GatherRow`, for reading a each row from a gather file
- innately checks that all required gather colnames are present when
loading each gather csv
- `TaxResult` for storing a single row from gather file
- get and store `LineageInfo` taxonomic information directly with each
row
    - `get_match_lineage` and `get_ident` are now methods
- tracks whether lineage matching was attempted, including if the ident
was `missed` or intentionally `skipped`. We don't actually allow
skipping from the cli (yet?), but it's enabled in all methods to
preserve existing functionality. I think I was using this to skip
identical/exact database matches.
- `QueryTaxResult` for storing all `TaxResult`s associated with a single
query
  -  only allows summarization over gather results from same query
- more robust/simplified tracking of results with missed & skipped
taxonomic identifiers + perfect matches
- use `LineageInfo` classes (within `TaxResult`) for lineage tracking +
summarization, to get all benefits (optional `taxid`, any hierarchical
ranks, etc)
- contains methods to replicate the summarization previously done within
`summarize_gather_at` in `tax_utils.py` and the classification
thresholding in `genome` within `__main__.py`.

Co-authored-by: C. Titus Brown <titus@idyll.org>
Copy link
Contributor

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

src/sourmash/tax/tax_utils.py Outdated Show resolved Hide resolved
src/sourmash/tax/tax_utils.py Outdated Show resolved Hide resolved
elif containment_threshold and self.fraction >= containment_threshold:
#elif containment_threshold and self.f_weighted_at_rank >= containment_threshold:
elif containment_threshold is not None and self.fraction >= containment_threshold:
#elif containment_threshold is not None and self.f_weighted_at_rank >= containment_threshold:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modified - at some point, I'd like to chat about potentially using weighted results instead of unique results for ordering ..

tests/test_tax.py Outdated Show resolved Hide resolved
tests/test_tax.py Outdated Show resolved Hide resolved
tests/test_tax.py Outdated Show resolved Hide resolved
bluegenes and others added 3 commits February 7, 2023 12:50
## Taxonomy Refactor Overview

In an attempt to allow usage of NCBI taxid (motivation: CAMI
benchmarking) and alternate hierarchical taxonomic ranks (motivation:
LINS), I ended up refactoring the taxonomy code in a four-PR series.
Taxonomic summarization results should not change. Minor caveat: I was
previously obtaining `query_bp` in a hacky manner to allow gather <4.4
results. The class methods are more robust, and I'd like to stop
supporting gather <4.4 results. To allow this, I had to add the
`query_bp`, `ksize`, and `scaled` columns into some testing results to
keep tests functioning.

1. #2437 modifies
`LineagePair` from a two-item `collections.namedtuple` to a three-item
`typing.NamedTuple` containing an additional field, `taxid`, for storing
NCBI taxid information. It also introduces classes (`BaseLineageInfo`,
`RankLineageInfo`), which move lineage manipulation (from
`lca_utils.py`) to class methods in order to support robust
summarization across compatible lineages (lineages of same hierarchical
ranks). To ensure these can be used as dictionary keys, these classes
are frozen.

2. #2439 introduces classes
that facilitate reading, summarization, and writing of gather results.
First, it updates three prior `collections.namedtuple`s to `dataclasses`
used for storing information about the gather query (`QueryInfo`),
summarized gather information for metagenome queries
(`SummarizedGatherResult`) and classification information for genome
queries (`ClassificationResult`). It introduces three new classes for
reading and manipulating gather results. `GatherRow`, is used for
reading a each row from a gather file and automatically checking for
required columns. `TaxResult` is used for storing a single row from
gather file, optionally (and ideally) with taxonomic information, stored
as `LineageInfo` class from PR 1. `QueryTaxResult` is used for storing
all `TaxResult`s associated with a single query. `QueryTaxResult` add
methods to replicate the summarization previously done within
`summarize_gather_at` in `tax_utils.py` and the classification
thresholding in `genome` within `__main__.py`.

3. #2443 replaces the
actual taxonomic summarization code in `tax/__main__.py` with code that
uses the new classes. Modifies gather loading code to read using
`GatherRow`, `TaxResult`, and `QueryTaxResult`.

4. #2446 removes old,
unused functions that are rendered redundant by the new classes. Also
removes associated tests.

## Additional details for this PR (#2446) 

- Delete old functions that aren't used outside of taxonomic
summarization + associated tests
  - Including old `namedtuple`s: `QueryInf`, `SumGathInf`, `ClassInf`
- Make sure any old comments/documentation make it into new code
- Don't use unnecessary empty `()` for dataclasses
@bluegenes bluegenes force-pushed the use-new-tax-classes branch from dcddf9b to 1a1bb1a Compare February 7, 2023 21:49
@bluegenes
Copy link
Contributor Author

bluegenes commented Feb 8, 2023

@ctb Is this ok to merge with the codecov failure? I thought it would go away after merging #2446, but coverage still decreased (-0.11%). I assume this is a result of net deletions?

@ctb
Copy link
Contributor

ctb commented Feb 8, 2023

there's no uncovered code, right?! So sure, it's fine!

@bluegenes bluegenes merged commit ac400fa into latest Feb 8, 2023
@bluegenes bluegenes deleted the use-new-tax-classes branch February 8, 2023 16:13
@ctb ctb mentioned this pull request Mar 3, 2023
ctb added a commit that referenced this pull request Mar 3, 2023
# sourmash release 4.7.0

Major new features:

* provide an initial plugin architecture for sourmash that supports new
signature saving & loading mechanisms (#2428)
* add plugin support for new command-line subcommands (#2438)
* debias all containment values (#2243)

Minor new features:

* Use RankLineageInfo to simplify reading lineages (#2467)
* store taxids in lineageDB (#2466)
* Use new tax classes for taxonomic summarization (#2443)
* add tax summarization dataclasses for safety and flexibility (#2439)
* add `--scaled` to sourmash compare (#2414)
* replace `lca_utils.LineagePair` with `tax_utils.LineagePair` (#2441)
* Add new classes for lineage manipulation (#2437)

Cleanup and documentation updates:

* ReadTheDocs updates (#2445)
* update `sourmash compare` command-line docs (#2400)

Developer updates:

* fix python tests by bumping tox and pip cache versions (#2424)
* Update sphinx requirement from <6,>=4.4.0 to >=4.4.0,<7 (#2430)
* Build: replace milksnake with maturin (#2393)
* importlib_metadata is a dependency on old Python versions (#2484)
* Release docs: use two separate sed commands (#2483)
* minor fixes to release behavior (#2479)
* Use screed and maturin from nixpkgs in `flake.nix` (#2481)
* update release procedure after v4.6.0 and v4.6.1 (#2386)
* Update makefile and docs (#2432)

Dependabot updates:

* Bump once_cell from 1.17.0 to 1.17.1 (#2488)
* Bump ouroboros from 0.15.5 to 0.15.6 (#2487)
* Bump memmap2 from 0.5.8 to 0.5.9 (#2486)
* Bump supercharge/redis-github-action from 1.4.0 to 1.5.0 (#2485)
* Bump proptest from 1.0.0 to 1.1.0 (#2460)
* Bump web-sys from 0.3.60 to 0.3.61 (#2461)
* Bump serde_json from 1.0.91 to 1.0.93 (#2471)
* Bump wasm-bindgen-test from 0.3.33 to 0.3.34 (#2463)
* Bump cachix/install-nix-action from 18 to 19 (#2459)
* Bump wasm-bindgen from 0.2.83 to 0.2.84 (#2464)
* Bump typed-builder from 0.11.0 to 0.12.0 (#2451)
* Bump bumpalo from 3.9.1 to 3.12.0 (#2450)
* Bump pypa/cibuildwheel from 2.11.4 to 2.12.0 (#2447)
* Bump bzip2 from 0.4.3 to 0.4.4 (#2444)
* Bump once_cell from 1.14.0 to 1.17.0 (#2429)
* Bump serde from 1.0.151 to 1.0.152 (#2423)
* Bump pypa/cibuildwheel from 2.11.3 to 2.11.4 (#2422)
* Bump serde_json from 1.0.89 to 1.0.91 (#2418)
* Bump serde from 1.0.150 to 1.0.151 (#2419)
* Bump thiserror from 1.0.37 to 1.0.38 (#2417)
* Bump finch from 0.4.3 to 0.5.0 (#2416)
* Bump rayon from 1.6.0 to 1.6.1 (#2404)
* Bump serde from 1.0.149 to 1.0.150 (#2403)
* Bump pypa/cibuildwheel from 2.11.2 to 2.11.3 (#2402)
* Bump serde from 1.0.148 to 1.0.149 (#2397)
* Bump capnp from 0.14.5 to 0.14.11 (#2396)
bluegenes added a commit that referenced this pull request Apr 5, 2023
)

## Add taxonomic utilities for LINs; enable and test `tax metagenome`

With taxonomy refactoring (#2437, #2439, #2443, #2446, #2466, #2467), we
are (mostly) no longer tied to named ranks. Here, I add a class for LIN
taxonomies and use it within `tax metagenome` to allow summarization up
LINs and reporting at specified `lingroups`.

With this PR, users can now use the flag `--lins` to read and use `lin`
taxonomies from the provided tax (`-t`, `--taxonomy`) file. If used,
`sourmash tax` will look for a `lin` column in the taxonomy file instead
of looking for `superkingdom`...`strain` columns. The `lin` column
should contain `;`-separated LINs, preferably with a standard number of
positions (e.g. all 20 positions in length or all 10 positions in
length).

For `tax metagenome`:

By default, `tax metagenome` will summarize up _all_ available ranks/LIN
positions. If a `lingroup` file is provided, we will also report a
subset of this summary: just the LIN prefixes that match groups in the
`lingroup` file. The `lingroup` file requires two columns in any order:
`name`, the name of the group, and `lin`, the lin prefix of the group.
The prefix will be used to select results from the full summary for
reporting. The `lingroup` format will build a file with the following
name: `{base}.lingroup.tsv`, where `{base}` is the name provided via the
`-o`,` --output-base` option.

## Demo / Tutorial

A draft tutorial is available
[here](https://sourmash--2469.org.readthedocs.build/en/2469/tutorial-lin-taxonomy.html).
Note that it does not contain the installation info for this branch (see
below). You can run the interactive version via binder
[here](https://mybinder.org/v2/gh/bluegenes/2023-demo-sourmash-LIN/HEAD?labpath=sourmash-lin-demo.ipynb)


## Testing

### Option A: Use the Demo Binder

You can test via the
[binder](https://mybinder.org/v2/gh/bluegenes/2023-demo-sourmash-LIN/HEAD?labpath=sourmash-lin-demo.ipynb).
You can add new cells or modify any existing cells, and even download
additional files for testing. The downside is that you'll have to make
sure to download and save your results, since the binder won't save them
for you.

### Option B: Alternatively, install on your own computer/cluster: 
Here is one way to test this code before it gets fully integrated into
sourmash:
- If you don't have conda, I'd recommend installing `mamba`,
[instructions
here](https://mamba.readthedocs.io/en/latest/installation.html) instead.
- if you do have `mamba`, replace the word `conda` with `mamba` in the
following commands.

Download an environment file that points to this branch:
```
curl -JLO https://raw.githubusercontent.com/bluegenes/2023-demo-sourmash-LIN/main/sourmashLIN.yml
```
Create a virtual environment using this file:
```
conda env create -f sourmashLIN.yml
```
Activate that environment:
```
conda activate smashLIN
```
make sure `--lins` is in the `--help` for `sourmash tax metagenome`:
```
sourmash tax metagenome --help
```



## Command to run

The command to run is this one: 

```
sourmash tax metagenome -g $gather_csv  -t $taxonomy_csv \
                        --lins --lingroup $lingroups_csv
```

## Types of files you'll need

1. sketches of query metagenome
2. sketches of reference genomes (database)
3. taxonomy file with LIN information (two columns required: `ident`,
`lin`)
4. lingroup information file (two columns required: `name`, `lin`) 


To exit the environment when you're done testing, use `conda deactivate`

> Reminder, if you have `mamba`, you can use it in place of `conda` in
the commands above.


example `lingroup` output format. Note that the `1;0`.. paths are always
grouped together, but may come before or after the `0;0` and `2;0`
groups.

```
name	lin	percent_containment	num_bp_contained
lg3	2;0;0	1.56	192000
lg1	0;0;0	5.82	714000
lg2	1;0;0	5.05	620000
lg3	1;0;1	0.65	80000
lg4	1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0	0.65	80000
```

```
name	lin	percent_containment	num_bp_contained
lg2	1;0;0	5.05	620000
lg3	1;0;1	0.65	80000
lg4	1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0	0.65	80000
lg1	0;0;0	5.82	714000
lg3	2;0;0	1.56	192000
```


## A few implementation details:

- In `tax_utils.py`, I add a `LINLineageInfo` class for using and
manipulated LIN taxonomies. It implements new methods to enable
specifically reading in `LIN` taxonomies into the class, but otherwise
uses the taxonomic utilities available in `BaseLineageInfo`, e.g.
taxonomic summarization up ranks, assessing whether two taxonomies are a
match at a given rank.
- In `tax_utils.py`, I add functionality for reading `lingroup`
information and reporting taxonomic summarization specifically at these
ranks.

Changes and Additions:
- [x] Add `LINLineageInfo` for working with `LIN` taxonomies
- [x] Add method for reading `LIN`s into `LineageDB`
- [x] Add methods for reading `LINgroups` and summarizing to these
- [x] Add LineageTree that can use `LineageInfo` to perform
`build_tree`, `find_lca` functions (originally in `lca_utils.py`) and
produce an ordered list of lineage paths
- [x] Add code + tests to use `LIN`s taxonomy in:
  - [x] tax metagenome
  - [x] tax annotate
  - [x] tax summarize
 
The following require additional changes and will be punted to an
issue/separate PR (see #2499):
  - tax genome
  - tax prepare
  - tax grep

---------

Co-authored-by: C. Titus Brown <titus@idyll.org>
bluegenes added a commit that referenced this pull request Apr 19, 2023
This fixes a human error bug, where lineage_summary was erroneously
included as a single-query output format. Fixes
#2578

Looking back, this bug was introduced in #2443, which means it went into
v4.7.0.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants