fastANI vs ANIm output differences #291

widdowquinn · 2021-06-17T11:09:12Z

widdowquinn
Jun 17, 2021
Maintainer

As a first impression, there appear to be systematic differences between fastANI and ANIm output. The literature leads us to expect differences in percentage identity, so that's no surprise, but it will be useful to compare the methods quantitatively.

IDEA 1: a new graphical/report output for comparing values between two runs. Possible CLI:

pyani compare run1 run2 -o outputdir --nographics

By default, I'm thinking this would write graphical and tabular comparison output showing, for each pairwise comparison, the difference between the two runs: heatmap/tabular output of the differences, plus a CSV/tab-separated file (tidy format) listing the pairwise comparisons and their difference.
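To make the proposed command line concrete, here is a minimal `argparse` sketch. It is not existing pyani code: the `compare` subcommand is only the proposal above, and treating `run1`/`run2` as integer run IDs and naming the long option `--outdir` are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a hypothetical `compare` subcommand."""
    parser = argparse.ArgumentParser(prog="pyani")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Assumption: runs are identified by integer IDs.
    compare = subparsers.add_parser("compare", help="compare results from two runs")
    compare.add_argument("run1", type=int, help="ID of the first run")
    compare.add_argument("run2", type=int, help="ID of the second run")
    compare.add_argument(
        "-o", "--outdir", type=Path, required=True, help="output directory"
    )
    compare.add_argument(
        "--nographics", action="store_true", help="write tabular output only"
    )
    return parser


# Parse the example invocation from the proposal, with 1 and 2 as run IDs.
args = build_parser().parse_args(["compare", "1", "2", "-o", "outputdir", "--nographics"])
print(args.command, args.run1, args.run2, args.outdir, args.nographics)
```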

IDEA 2: there seems to be a systematic difference between the coverage reported by fastANI and by ANIm. We should investigate this. My first thought is that the k-mer approach of fastANI collapses repeats (where ANIm preserves them), so the denominator in the coverage calculation is proportionately smaller - we should check whether this is the case.

widdowquinn · 2021-06-17T13:24:17Z

widdowquinn
Jun 17, 2021
Maintainer Author

Re: IDEA 2 - the current "Coverage" for fastANI is (matching fragments)/(all fragments) - so it's bound to overestimate whenever a fragment counts as a match with less than its full length aligned.
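A small sketch of why counting fragments overestimates coverage relative to counting aligned bases. The function names and the numbers (10 fragments of 3000 bp, four partial matches) are made up for illustration and are not taken from pyani or from fastANI output.

```python
from typing import Sequence


def fragment_coverage(n_matched: int, n_total: int) -> float:
    """Coverage as (matching fragments) / (all fragments)."""
    return n_matched / n_total


def base_coverage(aligned_lengths: Sequence[int], frag_len: int, n_total: int) -> float:
    """Coverage as (aligned bases) / (all bases across the fragments)."""
    return sum(aligned_lengths) / (frag_len * n_total)


# Hypothetical query: 10 fragments of 3000 bp, 4 of which match,
# but most matches align over only part of their fragment.
frag_len = 3000
n_fragments = 10
aligned = [3000, 2400, 1800, 1500]

frag_cov = fragment_coverage(len(aligned), n_fragments)
base_cov = base_coverage(aligned, frag_len, n_fragments)
print(f"fragment-based: {frag_cov:.2f}, base-based: {base_cov:.2f}")
```

The two measures agree only when every matching fragment aligns over its full length; otherwise the fragment count is the larger of the two.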

If the current calculation is (frags1 * len)/(frags2 * len) then we can save two multiplication operations by going to (frags1)/(frags2).
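As a quick check of that cancellation, the sketch below compares the two forms using exact rational arithmetic. The function names and fragment counts are hypothetical; this is not the pyani implementation.

```python
from fractions import Fraction


def coverage_with_length(frags1: int, frags2: int, frag_len: int) -> Fraction:
    """Literal form: (frags1 * len) / (frags2 * len)."""
    return Fraction(frags1 * frag_len, frags2 * frag_len)


def coverage_simplified(frags1: int, frags2: int) -> Fraction:
    """Simplified form: the shared fragment length cancels."""
    return Fraction(frags1, frags2)


# Hypothetical counts: 412 matching fragments out of 1530, at 3000 bp each.
literal = coverage_with_length(412, 1530, 3000)
simplified = coverage_simplified(412, 1530)
print(literal == simplified, float(simplified))
```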

5 replies

baileythegreen Jun 17, 2021

I think looking at how the calculations can be optimised would be good; that part of the code has changed a bit with the unit balancing I was doing earlier.

I can test the coverage question a bit by seeing how the plots look for runs with different minFraction values; that way we can provide more information about potential biases in the software.

widdowquinn Jun 17, 2021
Maintainer Author

That would be great - and very much enabled with a pyani compare function. I think it's time to write an enhancement issue.

baileythegreen Jun 17, 2021

As we think about the sorts of comparisons it would be useful to implement, we should also try to think about how they could be usefully visualised. The plots already produced by pyani might not be the easiest way to compare several runs with different values for one of the parameters, for instance, or for understanding how two parameters interact.

I'm not sure I have a good suggestion yet, though.

widdowquinn Jun 18, 2021
Maintainer Author

Some comparisons possibly suggest themselves:

For run1 and run2 (where we might change methodology, parameters, tool versions, whatever…), we'd want to be able to compare output values for each pairwise genome comparison, so we have a tidy-looking table:

genome_query,genome_subject,run_id,key,value
genome1,genome2,1,identity,0.97
genome1,genome2,1,coverage,0.17
genome1,genome2,1,aligned_length,1.3e6
genome1,genome2,1,similarity_errors,1.3e4
genome1,genome2,2,identity,0.96
genome1,genome2,2,coverage,0.45
genome1,genome2,2,aligned_length,2.5e6
genome1,genome2,2,similarity_errors,2.6e4
[...]

This could be its own output - possibly the primary output for any interested user to visualise how they like. I think any visualisation we provide should firstly serve our needs (which are likely to be general, anyway), but also anticipate common user needs.
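As a rough illustration of how such a tidy table could feed a comparison, the sketch below reads the example rows above and reports, for each comparison and key, the value in run 2 minus the value in run 1. The function name `run_differences` is hypothetical, and only the standard library is used.

```python
import csv
import io
from collections import defaultdict

# The example rows from the table above.
TIDY = """\
genome_query,genome_subject,run_id,key,value
genome1,genome2,1,identity,0.97
genome1,genome2,1,coverage,0.17
genome1,genome2,1,aligned_length,1.3e6
genome1,genome2,1,similarity_errors,1.3e4
genome1,genome2,2,identity,0.96
genome1,genome2,2,coverage,0.45
genome1,genome2,2,aligned_length,2.5e6
genome1,genome2,2,similarity_errors,2.6e4
"""


def run_differences(handle, run_a: str, run_b: str) -> dict:
    """Return {(query, subject, key): value in run_b - value in run_a}."""
    values = defaultdict(dict)
    for row in csv.DictReader(handle):
        comparison = (row["genome_query"], row["genome_subject"], row["key"])
        values[comparison][row["run_id"]] = float(row["value"])
    # Keep only comparisons that were reported in both runs.
    return {
        comparison: by_run[run_b] - by_run[run_a]
        for comparison, by_run in values.items()
        if run_a in by_run and run_b in by_run
    }


diffs = run_differences(io.StringIO(TIDY), "1", "2")
for (query, subject, key), diff in diffs.items():
    print(f"{query},{subject},{key},{diff:.4g}")
```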

The things that come to mind first of all are:

- scatterplot: value (run1) vs value (run2) for each comparison [for each of the reported keys above]
  - this helps identify systematic variation between values for each run
- kde/distribution: how much do value (run1) and value (run2) differ for each comparison [for each of the reported keys above]
  - this helps identify potential individual comparison outliers and any skew in differences (consistently lower/higher; by a certain amount, or over its own - potentially skewed - distribution)
- (clustered) heatmap: how much do value (run1) and value (run2) differ for each comparison [for each of the reported keys above]
  - this helps extend the kde/distribution interpretation to identify groups of comparisons that behave in the same way, maybe because there are a load of repeats in family1 and these get compressed by one method, or not if you switch to --maxmatch, for instance
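The data behind each of those three views can be prepared in a few lines. The sketch below is only illustrative: the coverage values and genome names are invented, plotting is left out, and only the standard library is used.

```python
from statistics import mean, median

# Hypothetical coverage values per (query, subject) comparison for two runs.
run1 = {("g1", "g2"): 0.17, ("g1", "g3"): 0.52, ("g2", "g3"): 0.20}
run2 = {("g1", "g2"): 0.45, ("g1", "g3"): 0.55, ("g2", "g3"): 0.49}

# Only comparisons present in both runs can be compared.
shared = sorted(run1.keys() & run2.keys())

# Scatterplot input: one (run1 value, run2 value) point per comparison.
points = [(run1[pair], run2[pair]) for pair in shared]

# Distribution input: per-comparison difference, plus simple summaries.
diffs = {pair: run2[pair] - run1[pair] for pair in shared}
summary = {"mean": mean(diffs.values()), "median": median(diffs.values())}

# Heatmap input: genome-by-genome matrix of differences.
# Cells with no comparison in the data are left as None.
genomes = sorted({genome for pair in shared for genome in pair})
matrix = [[diffs.get((query, subject)) for subject in genomes] for query in genomes]

print(points)
print(summary)
for genome, row in zip(genomes, matrix):
    print(genome, row)
```

A consistently positive difference, as in these invented numbers, is the kind of systematic shift the scatterplot and distribution views are meant to expose.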

widdowquinn Jun 18, 2021
Maintainer Author

> I think looking at how the calculations can be optimised would be good

TBH I'd expect virtually no detectable speedup from that change under most circumstances - but it's useful to notice when there is an easy "win" like having the same term in the numerator and denominator. Removing that term here eliminates a bug/typo opportunity, as well as simplifying the operation in code. It's always easy to let the computer "handle the maths" and code "literally" - but thinking about the maths before coding can sometimes lead to much bigger computational savings than this.
