-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] debias all containment values #2243
Conversation
Codecov Report
@@ Coverage Diff @@
## latest #2243 +/- ##
==========================================
+ Coverage 84.10% 92.22% +8.11%
==========================================
Files 130 101 -29
Lines 15065 11500 -3565
Branches 2211 2208 -3
==========================================
- Hits 12671 10606 -2065
+ Misses 2098 598 -1500
Partials 296 296
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
@dkoslicki @mahmudur is it expected that using the bias factor has such a small effect on containment and ANI values? |
To me, this is an improvement/bugfix, and I do not think it would/should require a major version bump. |
@bluegenes I wouldn't expect the de-biasing to cause large changes. It really only comes into play when the underlying sequence set is very small. And agreed this seems suitable for a 4.x, especially since the CLI didn't change with the bias term |
hot take - let's see what happens after merging #2057 first? |
how are we feeling about this one? I have no memory of this :) |
IIRC, as long as @bluegenes says it's all good, I'm good to review. Though I don't know if this incorporates the better set size estimation from #2268 |
@ctb you had wanted to wait to see how it went with #2057 🤷🏻♀️ @dkoslicki this shouldn't mess with any of the changes from #2268 (aka, they should still be in place). Could you review when you get a chance, please? Or @ctb? Primarily, this PR replaces the functionality in "contained_by" with the functionality in "contained_by_debiased" and removes the (now redundant) "debiased" function. This means we can remove duplicated/unnecessary tests. This only contains changes for containment - was there anything we needed to do with jaccard? |
Just noticed this detail. Yes, those need to be de-biased too, in broadly the same way. @bluegenes, the formula is the last theorem on page 20 of our manuscript. I'll get around to a PR/issue eventually unless @mahmudhera beats me to it. |
# sourmash release 4.7.0 Major new features: * provide an initial plugin architecture for sourmash that supports new signature saving & loading mechanisms (#2428) * add plugin support for new command-line subcommands (#2438) * debias all containment values (#2243) Minor new features: * Use RankLineageInfo to simplify reading lineages (#2467) * store taxids in lineageDB (#2466) * Use new tax classes for taxonomic summarization (#2443) * add tax summarization dataclasses for safety and flexibility (#2439) * add `--scaled` to sourmash compare (#2414) * replace `lca_utils.LineagePair` with `tax_utils.LineagePair` (#2441) * Add new classes for lineage manipulation (#2437) Cleanup and documentation updates: * ReadTheDocs updates (#2445) * update `sourmash compare` command-line docs (#2400) Developer updates: * fix python tests by bumping tox and pip cache versions (#2424) * Update sphinx requirement from <6,>=4.4.0 to >=4.4.0,<7 (#2430) * Build: replace milksnake with maturin (#2393) * importlib_metadata is a dependency on old Python versions (#2484) * Release docs: use two separate sed commands (#2483) * minor fixes to release behavior (#2479) * Use screed and maturin from nixpkgs in `flake.nix` (#2481) * update release procedure after v4.6.0 and v4.6.1 (#2386) * Update makefile and docs (#2432) Dependabot updates: * Bump once_cell from 1.17.0 to 1.17.1 (#2488) * Bump ouroboros from 0.15.5 to 0.15.6 (#2487) * Bump memmap2 from 0.5.8 to 0.5.9 (#2486) * Bump supercharge/redis-github-action from 1.4.0 to 1.5.0 (#2485) * Bump proptest from 1.0.0 to 1.1.0 (#2460) * Bump web-sys from 0.3.60 to 0.3.61 (#2461) * Bump serde_json from 1.0.91 to 1.0.93 (#2471) * Bump wasm-bindgen-test from 0.3.33 to 0.3.34 (#2463) * Bump cachix/install-nix-action from 18 to 19 (#2459) * Bump wasm-bindgen from 0.2.83 to 0.2.84 (#2464) * Bump typed-builder from 0.11.0 to 0.12.0 (#2451) * Bump bumpalo from 3.9.1 to 3.12.0 (#2450) * Bump pypa/cibuildwheel from 2.11.4 to 2.12.0 (#2447) * Bump bzip2 from 0.4.3 to 0.4.4 (#2444) * Bump once_cell from 1.14.0 to 1.17.0 (#2429) * Bump serde from 1.0.151 to 1.0.152 (#2423) * Bump pypa/cibuildwheel from 2.11.3 to 2.11.4 (#2422) * Bump serde_json from 1.0.89 to 1.0.91 (#2418) * Bump serde from 1.0.150 to 1.0.151 (#2419) * Bump thiserror from 1.0.37 to 1.0.38 (#2417) * Bump finch from 0.4.3 to 0.5.0 (#2416) * Bump rayon from 1.6.0 to 1.6.1 (#2404) * Bump serde from 1.0.149 to 1.0.150 (#2403) * Bump pypa/cibuildwheel from 2.11.2 to 2.11.3 (#2402) * Bump serde from 1.0.148 to 1.0.149 (#2397) * Bump capnp from 0.14.5 to 0.14.11 (#2396)
Alternative to #2057.
Instead of adding new functions to specifically use the debiased containment values, just debias within the original functions.
This does alter the containment values slightly, but not enough to screw up any tests. The only test I had to change was one where I was (improperly) asserting that the containment between downsampled minhashes was equal to the containment between those original minhashes. It also seems to affect ANI confidence intervals slightly.
But perhaps this is a big enough change to be for 5.0 instead of 4.x?