Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 48: Describe pyani plot graphical output. #305

Merged
merged 8 commits into from
Nov 5, 2021
Merged

Conversation

baileythegreen
Copy link
Contributor

@baileythegreen baileythegreen commented Jul 6, 2021

Adds descriptions of pyani plot graphical output (based on responses to Issues #48 and #303, as well as how things are calculated in the code).

Not ready to be merged, but ready for discussion of what else should be included / if anything needs to be removed.

Fixes #48.
Fixes #303.

Type of change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality not to work as expected)
  • This change requires a documentation update
  • This is a documentation update

Action Checklist

  • Work on a single issue/concept (if there are multiple separate issues to address, please use a separate pull request for each)
  • Fork the pyani repository under your own account (please allow write access for repository maintainers)
  • Set up an appropriate development environment (please see CONTRIBUTING.md)
  • Create a new branch with a short, descriptive name
  • Work on this branch
    • style guidelines have been followed
    • new code is commented, especially in hard-to-understand areas
    • corresponding changes to documentation have been made
    • tests for the change have been added that demonstrate the fix or feature works
  • Test locally with pytest -v non-passing code will not be merged
  • Rebase against origin/master
  • Check changes with flake8 and black before submission
  • Commit branch
  • Submit pull request, describing the content and intent of the pull request
  • Request a code review
  • Continue the discussion at Pull requests section in the pyani repository

@baileythegreen baileythegreen changed the title Issue 48 Issue 48: Describe pyani plot graphical output. Jul 6, 2021
@codecov
Copy link

codecov bot commented Jul 6, 2021

Codecov Report

Merging #305 (def0e87) into master (c4aa68b) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #305   +/-   ##
=======================================
  Coverage   76.12%   76.12%           
=======================================
  Files          52       52           
  Lines        3380     3380           
=======================================
  Hits         2573     2573           
  Misses        807      807           

Copy link
Owner

@widdowquinn widdowquinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that "interpreting output" belongs under "Basic Use" in the ToC, rather than at the same level as "installation" and requirements" - please can we link it from basic_use.rst instead?

ANIb/ANIblastall/TETRA methods will be available in v0.3, so I don't think we should note a restriction there. (ll.51-55)

I'll not merge this until you've decided whether TETRA is symmetrical or not and changed the text. (l.55)

We should avoid double newlines (I think they're formatted the same, regardless…) (l.61)

Is the description on l.65 correct? It sounds like alignment length uses the length of the reference genome, as phrased.

The alignment length/similarity errors should be described for ANIb/ANIblastall methods (they don't crop up for TETRA).

l.75 is missing an "as" or "because", I think.

@baileythegreen
Copy link
Contributor Author

I took l.65 from the docstring in anim.parse_delta(), l.270. That says 'reference_length', but I don't know if that means it's right.

@widdowquinn
Copy link
Owner

I took l.65 from the docstring in anim.parse_delta(), l.270. That says 'reference_length', but I don't know if that means it's right.

Yes, that's not entirely clear is it? Especially out of the immediate context.

reference_length there means: "the number of bases from the reference sequence in the alignment."

I think this needs clarification in the documentation certainly, but maybe also a note in the comment at that point in the code might be useful.

@baileythegreen
Copy link
Contributor Author

I've made those changes.

Are there any situations where coverage should be symmetrical, actually? Unless I've misunderstood them (and notes in the docstrings are wrong, in some cases) none of these methods are. In which case the explanation I lifted from one of your issue comments about how coverage 'can be' asymmetrical might need to be changed.

@widdowquinn
Copy link
Owner

widdowquinn commented Jul 9, 2021

Are there any situations where coverage should be symmetrical, actually?

Two come to mind:

  • Any situation where the alignment is symmetrical, and the participating genomes have the same lengths.
  • Any situation where the alignment is not symmetrical, but the participating genomes happen to compensate by being different lengths.

It should be trivial, I think, to generate a completely symmetrical coverage output by renaming a single input file multiple times, and pretending they were different genomes.

Unless I've misunderstood them (and notes in the docstrings are wrong, in some cases) none of these methods are.

ANIb/ANIblastall/fastANI are not symmetrical, in general. Nor are they necessarily stable to circularly-permuted sequences, due to the sequence fragmentation step.

TETRA is described here: https://doi.org/10.1111/j.1462-2920.2004.00624.x - having reminded myself with a quick skim, I think the pairwise score is calculated by:

  1. determining the frequency Z-score for each possible 4-mer in the sequence of each genome (i.e. the normalised deviation from "expected frequency")
  2. calculating a Pearson correlation coefficient between the two Z-score vectors

which sounds symmetrical to me. What are your thoughts?

In which case the explanation I lifted from one of your issue comments about how coverage 'can be' asymmetrical might need to be changed.

The mathematician in me tries to stop me from claiming that something is always true, if it is not. A common method of (dis)proof is by counterexample. The counterexamples above, contrived or coincidental as they might be, demonstrate that coverage can be symmetrical. If you feel that we need to state a stronger expectation of asymmetry, how about "will usually be asymmetrical"?

@widdowquinn
Copy link
Owner

I think there's opportunity for an IJSEM paper discussing how algorithm choice affects measurement, along the lines of https://doi.org/10.1099/ijsem.0.004124

@baileythegreen
Copy link
Contributor Author

baileythegreen commented Jul 9, 2021

Did I misunderstand something, or infer too much?

I think it wasn't clear to me from the comment that you were only talking about coverage being asymmetrical.

You're correct that coverage, alignment length, etc. don't apply for TETRA. It is kind of a proto-MinHash distance measurement, rather than an alignment.


My issue is purely with the combination of 'this can be asymmetrical', as though this is a possible, but low-probability event, followed by 'every method we implement is asymmetrical (almost always)'. "[W]ill usually be symmetrical" resolves this.

*asymmetrical ?

@baileythegreen
Copy link
Contributor Author

The description for similarity errors in ANIm should perhaps be modified. Currently, this does not use what NUCmer/MUMmer themselves identify as similarity errors, but non-identities + indels. The discrepancy could lead to confusion if people try to compare across tools (whether or not such comparisons actually make sense).

@baileythegreen baileythegreen added documentation documentation is unclear or incomplete visualisation issues relating to plot outputs labels Sep 6, 2021
@baileythegreen baileythegreen added the PR of Supreme Importance The PR Bailey really, really wants merged right now label Sep 7, 2021
@baileythegreen baileythegreen added PR of Supreme Importance The PR Bailey really, really wants merged right now and removed PR of Supreme Importance The PR Bailey really, really wants merged right now labels Sep 9, 2021
- add notes on distribution/scatter plots
- clarify interpretations of heatmaps
- correct method descriptions
@widdowquinn
Copy link
Owner

I've checked the docs here and made modifications where necessary (e.g. alignment length in the BLAST methods does not subtract mismatches; adding summaries of the scatterplot/distribution output).

NOTE: we will need to modify the descriptions of some measures when we correct ANIm calculations according to #340

@widdowquinn widdowquinn merged commit ad64861 into master Nov 5, 2021
@widdowquinn widdowquinn deleted the issue_48 branch November 5, 2021 13:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation documentation is unclear or incomplete PR of Supreme Importance The PR Bailey really, really wants merged right now visualisation issues relating to plot outputs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ANIm ouput files need some explanation Description of Heatmap Outputs
2 participants