Issue 48: Describe `pyani plot` graphical output. #305

baileythegreen · 2021-07-06T12:34:48Z

Adds descriptions of pyani plot graphical output (based on responses to Issues #48 and #303, as well as how things are calculated in the code).

Not ready to be merged, but ready for discussion of what else should be included / if anything needs to be removed.

Fixes #48.
Fixes #303.

Type of change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality not to work as expected)
This change requires a documentation update
This is a documentation update

Action Checklist

codecov · 2021-07-06T12:50:18Z

Codecov Report

Merging #305 (def0e87) into master (c4aa68b) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #305   +/-   ##
=======================================
  Coverage   76.12%   76.12%           
=======================================
  Files          52       52           
  Lines        3380     3380           
=======================================
  Hits         2573     2573           
  Misses        807      807

widdowquinn

I think that "interpreting output" belongs under "Basic Use" in the ToC, rather than at the same level as "installation" and "requirements" - please can we link it from basic_use.rst instead?

ANIb/ANIblastall/TETRA methods will be available in v0.3, so I don't think we should note a restriction there. (ll.51-55)

I'll not merge this until you've decided whether TETRA is symmetrical or not and changed the text. (l.55)

We should avoid double newlines (I think they're formatted the same, regardless…) (l.61)

Is the description on l.65 correct? It sounds like alignment length uses the length of the reference genome, as phrased.

The alignment length/similarity errors should be described for ANIb/ANIblastall methods (they don't crop up for TETRA).

l.75 is missing an "as" or "because", I think.

baileythegreen · 2021-07-06T15:00:41Z

I took l.65 from the docstring in anim.parse_delta(), l.270. That says 'reference_length', but I don't know if that means it's right.

widdowquinn · 2021-07-07T08:43:33Z

> I took l.65 from the docstring in anim.parse_delta(), l.270. That says 'reference_length', but I don't know if that means it's right.

Yes, that's not entirely clear is it? Especially out of the immediate context.

reference_length there means: "the number of bases from the reference sequence in the alignment."

I think this needs clarification in the documentation certainly, but maybe also a note in the comment at that point in the code might be useful.

baileythegreen · 2021-07-09T13:37:51Z

I've made those changes.

Are there any situations where coverage should be symmetrical, actually? Unless I've misunderstood them (and notes in the docstrings are wrong, in some cases) none of these methods are. In which case the explanation I lifted from one of your issue comments about how coverage 'can be' asymmetrical might need to be changed.

widdowquinn · 2021-07-09T14:04:15Z

> Are there any situations where coverage should be symmetrical, actually?

Two come to mind:

- Any situation where the alignment is symmetrical, and the participating genomes have the same lengths.
- Any situation where the alignment is not symmetrical, but the participating genomes happen to compensate by being different lengths.

It should be trivial, I think, to generate a completely symmetrical coverage output by renaming a single input file multiple times, and pretending they were different genomes.
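
To make the two cases concrete, here is a minimal sketch. It assumes coverage is the number of aligned bases from a genome divided by that genome's total length; the `coverage` function and all the numbers are made up for illustration and are not taken from the pyani code.

```python
def coverage(aligned_bases, genome_length):
    """Fraction of a genome that takes part in the alignment (assumed definition)."""
    return aligned_bases / genome_length


# Case 1: symmetrical alignment, equal genome lengths -> symmetrical coverage
case1 = (coverage(900, 1000), coverage(900, 1000))

# Case 2: asymmetrical alignment, but the genome lengths happen to compensate
case2 = (coverage(900, 1000), coverage(1800, 2000))

# More typical: same number of aligned bases, different genome lengths
typical = (coverage(900, 1000), coverage(900, 2000))

print(case1, case2, typical)
```

Only the last pair differs between the two directions, which is the situation the docs would describe as "usually asymmetrical".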

> Unless I've misunderstood them (and notes in the docstrings are wrong, in some cases) none of these methods are.

ANIb/ANIblastall/fastANI are not symmetrical, in general. Nor are they necessarily stable to circularly-permuted sequences, due to the sequence fragmentation step.
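
The fragmentation point can be shown with a toy example. This sketch assumes fixed-size, non-overlapping fragments cut from the start of the sequence; `fragment`, the fragment size of 4 and the sequence itself are hypothetical and chosen for readability, not taken from any of the tools.

```python
def fragment(seq, size):
    """Split seq into consecutive, non-overlapping fragments of at most `size` bases."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


genome = "ACGTTGCAAGGC"
# The same circular sequence, with its start point moved by two bases
rotated = genome[2:] + genome[:2]

original_frags = fragment(genome, 4)
rotated_frags = fragment(rotated, 4)

print(original_frags)
print(rotated_frags)
```

The two fragment sets share no members even though the underlying circular sequence is identical, so anything computed per fragment can change with the start point.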

TETRA is described here: https://doi.org/10.1111/j.1462-2920.2004.00624.x - having reminded myself with a quick skim, I think the pairwise score is calculated by:

1. determining the frequency Z-score for each possible 4-mer in the sequence of each genome (i.e. the normalised deviation from "expected frequency")
2. calculating a Pearson correlation coefficient between the two Z-score vectors

which sounds symmetrical to me. What are your thoughts?
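
A rough sketch of those two steps, to show where the symmetry comes from. Note the simplification: the Z-score here is just each 4-mer count standardised against the mean and standard deviation of all 256 counts, which is a stand-in for (not the same as) the expected-frequency model in the TETRA paper. The function names and toy sequences are hypothetical and this is not the pyani implementation.

```python
from itertools import product
from math import sqrt

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_zscores(seq):
    """Return simplified Z-scores for all 256 4-mers (standardised counts)."""
    counts = dict.fromkeys(KMERS, 0)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # skip windows containing ambiguity symbols
            counts[kmer] += 1
    values = [counts[k] for k in KMERS]
    mean = sum(values) / len(values)
    sd = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if sd == 0:
        raise ValueError("sequence too short or uniform to standardise")
    return [(v - mean) / sd for v in values]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def tetra_score(seq_a, seq_b):
    """Pairwise score: correlation between the two Z-score vectors."""
    return pearson(tetra_zscores(seq_a), tetra_zscores(seq_b))


genome_a = "ACGTTGCAAGGCTTAACCGGTTAACGTAGCTAGCTAGGATCC" * 3
genome_b = "TTGACCATGGCATCGATCGGATTACAGGCATCGAATTCGGCA" * 3

forward = tetra_score(genome_a, genome_b)
reverse = tetra_score(genome_b, genome_a)
```

Swapping the arguments only swaps the roles of `x` and `y` in `pearson`, so `forward` and `reverse` agree whichever Z-score definition is used in step 1.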

> In which case the explanation I lifted from one of your issue comments about how coverage 'can be' asymmetrical might need to be changed.

The mathematician in me tries to stop me from claiming that something is always true, if it is not. A common method of (dis)proof is by counterexample. The counterexamples above, contrived or coincidental as they might be, demonstrate that coverage can be symmetrical. If you feel that we need to state a stronger expectation of asymmetry, how about "will usually be asymmetrical"?

widdowquinn · 2021-07-09T14:16:25Z

I think there's opportunity for an IJSEM paper discussing how algorithm choice affects measurement, along the lines of https://doi.org/10.1099/ijsem.0.004124

baileythegreen · 2021-07-09T14:19:03Z

Did I misunderstand something, or infer too much?

I think it wasn't clear to me from the comment that you were only talking about coverage being asymmetrical.

You're correct that coverage, alignment length, etc. don't apply for TETRA. It is kind of a proto-MinHash distance measurement, rather than an alignment.

My issue is purely with the combination of 'this can be asymmetrical', as though this is a possible, but low-probability event, followed by 'every method we implement is asymmetrical (almost always)'. "[W]ill usually be symmetrical" resolves this.

*asymmetrical ?

baileythegreen · 2021-07-21T17:57:59Z

The description for similarity errors in ANIm should perhaps be modified. Currently, this does not use what NUCmer/MUMmer themselves identify as similarity errors, but non-identities + indels. The discrepancy could lead to confusion if people try to compare across tools (whether or not such comparisons actually make sense).
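
For illustration, a minimal sketch of a "non-identities + indels" count over a single pairwise alignment. It assumes the alignment is available as two equal-length gapped strings with `-` marking a gap; `count_similarity_errors` is a hypothetical helper, not pyani's or MUMmer's implementation.

```python
def count_similarity_errors(ref_aln, qry_aln):
    """Count alignment columns that are mismatches or indels."""
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned strings must have equal length")
    errors = 0
    for ref_base, qry_base in zip(ref_aln, qry_aln):
        if ref_base == "-" or qry_base == "-":
            errors += 1  # indel column
        elif ref_base != qry_base:
            errors += 1  # non-identity (mismatch)
    return errors


ref = "ACGT-ACGTA"
qry = "ACCTTAC-TA"
errors = count_similarity_errors(ref, qry)  # one mismatch plus two indels
```

A tool that reports a different quantity under the same name would give a different number for the same alignment, which is the source of the possible confusion.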

widdowquinn · 2021-11-05T12:52:24Z

I've checked the docs here and made modifications where necessary (e.g. alignment length in the BLAST methods does not subtract mismatches; adding summaries of the scatterplot/distribution output).

NOTE: we will need to modify the descriptions of some measures when we correct ANIm calculations according to #340

baileythegreen added 2 commits July 6, 2021 13:27

Add documentation page explaining pyani plot output

021788d

Remove comment

99215a4

baileythegreen requested a review from widdowquinn as a code owner July 6, 2021 12:34

baileythegreen changed the title ~~Issue 48~~ Issue 48: Describe pyani plot graphical output. Jul 6, 2021

widdowquinn reviewed Jul 6, 2021

baileythegreen added 2 commits July 9, 2021 14:33

Move interpreting_plots to the basic_use menu

45e712f

Fill in missing information on ANIb, ANIblastall, and Tetra

e48d04b

baileythegreen mentioned this pull request Jul 29, 2021

Issue 175: Plot identity vs coverage scatter plot #319

Merged

21 tasks

baileythegreen added the documentation (documentation is unclear or incomplete) and visualisation (issues relating to plot outputs) labels Sep 6, 2021

Merge branch 'master' of https://github.com/widdowquinn/pyani into issue_48

8e4c24e

baileythegreen added the PR of Supreme Importance (The PR Bailey really, really wants merged right now) label Sep 7, 2021

baileythegreen added 2 commits September 8, 2021 15:44

Merge branch 'master' of https://github.com/widdowquinn/pyani into issue_48

c53bf1b

Add documentation pertaining to scatterplot creation

d789624

baileythegreen added and removed the PR of Supreme Importance (The PR Bailey really, really wants merged right now) label Sep 9, 2021

update plot interpretation docs

def0e87

- add notes on distribution/scatter plots
- clarify interpretations of heatmaps
- correct method descriptions

widdowquinn merged commit ad64861 into master Nov 5, 2021

widdowquinn deleted the issue_48 branch November 5, 2021 13:43
