Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ANIm should not be symmetric #373

Closed
donovan-h-parks opened this issue Jan 17, 2022 · 8 comments · May be fixed by #376
Closed

ANIm should not be symmetric #373

donovan-h-parks opened this issue Jan 17, 2022 · 8 comments · May be fixed by #376
Assignees
Labels
bug something isn't working how it should HIGH PRIORITY high priority issue Marked for closure Ready to close
Milestone

Comments

@donovan-h-parks
Copy link

Summary:

PyANI treats ANIm as symmetric, but the order of genomes impacts this calculation.

Description:

I have two test genomes. The ANIm results produced by PyANI change depending on the order these genomes are processed. That is, the results depend on which genome is considered the "reference" and which the "query" when running nucmer.

Reproducible Steps:

  1. Process the attached genomes with PyANI using average_nucleotide_identity.py -m ANIm --workers 1.
  2. Swap the names of the genomes (i.e. genome1.fna becomes genome2.fna and genome2.fna becomes genome1.fna).
  3. Run PyANI again as in step 1, but changing the output directory.
  4. Compare the results and observe that the ANIm values are now different.

As a concrete results, the number of similarity errors changes from 512 to 561 depending on the order of the genomes.

Expected Output:

test_genomes.tar.gz

The ANI-percentage_identity.tab, ANIm_similarity_errors.tab, and ANIm_alignment_lengths.tab files are non-symmetric matrices that reflect which genome was the reference and which genome was the query.

pyani Version:

v0.2.11

Operating System:

Ubuntu

@widdowquinn widdowquinn added the bug something isn't working how it should label Jan 17, 2022
@widdowquinn
Copy link
Owner

Thanks @dparks1134 - apologies for the bug, and we'll deal with this as a priority.

Quick notes

  • although it stems from the same misunderstanding of/assumptions about mummer's operation, this is not the same issue as Alignment coverage >1.0 #340 and will probably not be fixed by the proposed changes to resolve that issue.
  • the resolution currently appears to be to run mummer for the reciprocal pairwise comparisons.

L.

@widdowquinn widdowquinn added the HIGH PRIORITY high priority issue label Jan 17, 2022
@widdowquinn
Copy link
Owner

widdowquinn commented Jan 19, 2022

NOTE: Handling under branch issue_373 for version 0.3.

@widdowquinn widdowquinn mentioned this issue Jan 20, 2022
21 tasks
widdowquinn added a commit that referenced this issue Jan 20, 2022
Issue 373 - preliminary merge with multiple fixes to annoyances (warnings/small errors) prior to real fix of #373.
@baileythegreen
Copy link
Contributor

For reference, two sets of output for the genomes provided by @dparks1134, named <reference>_vs_<query>.zip.

genome1_vs_genome2.zip
genome2_vs_genome1.zip

@baileythegreen
Copy link
Contributor

baileythegreen commented Jan 24, 2022

I have pushed changes to issue_373 that remove the assumption that ANIm is symmetric; skip a test that fails if we no longer enforce genome sorting alphabetically as a result; and updated docs, so that they don't claim it is symmetric.

I will see if there are other things I have missed; in particular, I think perhaps the test outputs might need to be updated; (they pass, but this might just mean the differences fall within the designated tolerances).

@widdowquinn
Copy link
Owner

I think perhaps the test outputs might need to be updated; (they pass, but this might just mean the differences fall within the designated tolerances).

I think your guess is probably correct. It would be useful to have (for the record) a pyani compare output comparing (i) the old, symmetric behaviour and (ii) the new, non-symmetric behaviour so we can see how big the difference is for at least one real-world example we're familiar with.

@baileythegreen
Copy link
Contributor

baileythegreen commented Apr 4, 2022

I tried uploading the archive of inputs I used here, but I think that must be too big; I've instead included the classes.txt file. @widdowquinn, it's a dataset you sent me, so you should already have the sequence files.

classes.txt

Here's the output from running pyani compare on the symmetric and asymmetric versions of anim, using default settings. I have not yet addressed the suggestions made on pyani compare output in #364, so apologies for the less-than-useful colour schemes, et cetera.

summary_run1_vs_run2.md
ref_1_vs_query_2_plots.zip


Update

Here are the updated versions of the summary report and plots; changes include—better colour maps, more useful axes, and more informative titles. (The run numbers differ because I always forget when you need to specify labels, et cetera.)

updated_plots_with_labels.zip

summary_run4_vs_run5.md

@baileythegreen
Copy link
Contributor

It looks like Dickeya didantii NCPPB 898 chromosome, whole genome shotgun sequence specifically sees a lot of differences between the two.

@kiepczi
Copy link
Collaborator

kiepczi commented Apr 18, 2024

I'm also marking this for closure, because my understanding is that this issue was fixed with 1f10a6c.

Specifically, we have now decided to run two pairwise comparisons (forward and reverse), and calculate ANIm values for query for each comparison.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug something isn't working how it should HIGH PRIORITY high priority issue Marked for closure Ready to close
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants