Alignment coverage >1.0 #340
Comments
I can add that ANIb produces the expected result for these two genomes with an alignment coverage of 0.98 and 0.99. |
Thanks @dparks1134 - that's weird. If you could provide the input genomes that would be very helpful. Cheers, L. |
FWIW I found (and fixed) a related bug while trying this out on a couple of (unrelated) virus genomes. I don't think it will fix your issue, though. |
Zip file with the two genomes and PyANI ANIm results are attached. |
Thanks - I can confirm your ANIm output. Looking into why, just now. |
Well - the issue appears to be that MUMmer is reporting two overlapping alignments. One alignment runs from position 85 to 37713 in the query; the other alignment runs from 17709 to 39253 in 0264574. Inspecting the alignment length output shows that the aligned sequence length is longer than either genome. We use the I think that what we should have been doing all along is also to use the I feel quite bad about missing that. I'm not going to defend it by claiming that we're reimplementing the Rossello-Mora et al. implementation - by default we use I'm going to add a command-line option I'll make the changes and push a v0.2.12 release with an apology and explanation. Many thanks for spotting this and bringing this to our attention @dparks1134 - this is a very important improvement that will impact directly on the next iteration of Cheers, L. |
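To make the double-counting concrete, here is a minimal sketch using the two alignment ranges quoted above. It assumes both ranges lie on the same sequence and that coordinates are 1-based and inclusive; the helper names `summed_length` and `unique_length` are illustrative and are not part of pyani or MUMmer.

```python
def summed_length(intervals):
    """Sum of alignment lengths; overlapping bases are counted repeatedly."""
    return sum(end - start + 1 for start, end in intervals)


def unique_length(intervals):
    """Number of distinct positions covered by 1-based, inclusive intervals."""
    covered = 0
    block_start, block_end = None, None
    for start, end in sorted(intervals):
        if block_end is None or start > block_end:
            # Disjoint from the running block: close it and open a new one.
            if block_end is not None:
                covered += block_end - block_start + 1
            block_start, block_end = start, end
        else:
            # Overlaps the running block: extend it if needed.
            block_end = max(block_end, end)
    if block_end is not None:
        covered += block_end - block_start + 1
    return covered


# The two ranges reported in the comment above (assumed to share one sequence).
alignments = [(85, 37713), (17709, 39253)]
print(summed_length(alignments))  # 59174
print(unique_length(alignments))  # 39169
```

Dividing the summed length by the genome length is what can push coverage above 1.0. Dividing the unique length by the genome length cannot, because the number of distinct covered positions never exceeds the sequence length.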
@all-contributors please add @dparks1134 for bug |
I've put up a pull request to add @dparks1134! 🎉 |
Thanks for the detailed explanation. Nice to know what is going on. |
@widdowquinn I've created a new issue for this in version 3, and will prioritise that, since it probably qualifies as a bug. |
Hi. I implemented the proposed |
Thanks @dparks1134 - it does look like even I've been talking with @baileythegreen about potentially using interval graphs to ensure we don't double-count and, if you've found further examples of double-counting with I expect this hasn't been noticed already because the overlap size probably gets lost in the noise for bacterial genomes, and the values look plausible. |
Hi @widdowquinn, do you know if it is sensible to run |
Thanks for the suggestion - I've taken a look at
I'm surprised and disappointed that even the |
Does it make sense to follow the |
Yes, the Currently, the ANI identity calculation will be double-counting the overlap, and |
Largely a note for me and @baileythegreen after today's discussion... It looks like we might be able to get everything we need from a combination of the However, for very large comparisons on a cluster we probably don't want |
My initial optimism may have been misplaced. There appear to be some disturbing differences in the Caulobacter test set - especially wrt coverage. ANIm_alignment_coverage_noextend.pdf |
This also runs a --noextend test set in the Makefile, and adds a test for command creation (but not output regression). The --noextend option appears to affect low coverage and identity scores quite significantly. NOTE: this is not in itself a fix for #340. That requires changes to how we write and parse nucmer output.
Further investigation of the The difference in output between what we do and what Where the I'd argue that we should be only using 1:1 alignments, so However, we were not originally accounting for overlapping alignments. When we do account for them,
and our standard
With our alignment choices (including 1:1 match filtering) we align about 1.38Mbp where However, if we use
and almost exactly recover the So, while our output differs, it appears to do so mostly because we use However, that's not the problem with issue #340 ;) The issue with #340 is that we were not emulating the overlap removal that
We can fix this by running an
and recovers |
So, I think we should be generating a I think we already allow users to choose between I don't understand the process by which Once we correct for the overlaps, I think that our agreement with |
Historical note I think I know why I misunderstood
I think I might have internalised this explanation and never realised that |
I have been looking at the I think Example:
When any of these are not specified for a given
where Because
This does not remove those attributes from the class (so it wouldn't affect any backend Whether this is preferable to using the custom classes I have written may depend on speed and whatever benefits come from using |
I had a feeling that, in my testing, the route to getting what I believe to be the correct count of
#!/usr/bin/env python3
import csv
from pathlib import Path

import intervaltree


# Parse .coords files and try to replicate AlignedBases
def parse_coords(infname):
    aln_ref = 0
    aln_query = 0
    cov_ref = 0
    cov_query = 0
    num_rows = 0
    ref_intervals = []
    query_intervals = []
    with infname.open() as ifh:
        fieldnames = [
            "start_ref",
            "end_ref",
            "start_query",
            "end_query",
            "aln_len_ref",
            "aln_len_query",
            "aln_id",
            "len_ref",
            "len_query",
            "cov_ref",
            "cov_query",
            "id_ref",
            "id_query",
        ]
        reader = csv.DictReader(ifh, delimiter="\t", fieldnames=fieldnames)
        for row in reader:
            num_rows += 1
            aln_ref += int(row["aln_len_ref"])
            aln_query += int(row["aln_len_query"])
            cov_ref += float(row["cov_ref"])
            cov_query += float(row["cov_query"])
            # Sort so that reverse-strand hits (start > end) become (low, high).
            # intervaltree intervals are half-open, so store the inclusive
            # range [low, high] as [low, high + 1).
            ref_low, ref_high = sorted((int(row["start_ref"]), int(row["end_ref"])))
            ref_intervals.append((ref_low, ref_high + 1))
            query_low, query_high = sorted(
                (int(row["start_query"]), int(row["end_query"]))
            )
            query_intervals.append((query_low, query_high + 1))

    # Merge overlapping alignments so each base is counted only once
    ref_tree = intervaltree.IntervalTree.from_tuples(ref_intervals)
    ref_tree.merge_overlaps()
    ref_aligned_size = 0
    for interval in ref_tree:
        ref_aligned_size += interval.end - interval.begin

    query_tree = intervaltree.IntervalTree.from_tuples(query_intervals)
    query_tree.merge_overlaps()
    query_aligned_size = 0
    for interval in query_tree:
        query_aligned_size += interval.end - interval.begin

    print(f"{infname=}")
    print(f"{aln_ref=}, {aln_query=}, {cov_ref=}, {cov_query=}, {num_rows=}")
    print(f"{len(ref_tree)=}, {len(query_tree)=}")
    print(f"{ref_aligned_size=}, {query_aligned_size=}")
    print()


# virus output
infname = Path("2021-11-10/virus_dnadiff_output/virus_dnadiff.1coords")
parse_coords(infname)
infname = Path("2021-11-10/virus_dnadiff_output/virus_dnadiff.mcoords")
parse_coords(infname)
infname = Path("2021-11-10/virus_nucmer_output/virus_nucmer.coords")
parse_coords(infname) |
@widdowquinn I think that script is merging intervals it should not; at least, based on what I understood from a discussion a while back about when intervals should be merged. For example: Assume the input file:
This has data on two query contigs:
and
If overlapping intervals should be merged so long as they belong to the same contig, then the query intervals for If, however, the identity of the reference contig must also be considered before deciding to merge, then With your script, I get the former (inserting a Which is it? |
We could have left this to tomorrow's meeting, but it may be useful to have a written record here, to avoid misunderstanding. In a pairwise comparison, there are two genomes being compared. Let's call them genomes A and B. In your example above, contigs We are interested in which parts of genome A align to genome B (and vice versa, but as in your example let's consider only genome A). We are interested, for the purpose of calculating genome coverage, in the number of unique bases in genome A that align to genome B. It doesn't matter to which parts of genome B the regions of genome A align, and specifically it doesn't matter which contigs they align to, so long as we are willing to accept each of the individual alignments as valid. Here, genome A is divided into two contigs: The regions of The regions of I hope this clarifies things. |
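The rule described above can be sketched as follows: group alignments by the genome A contig only, merge overlaps within each group, and ignore which genome B contig each region aligns to. The contig names, coordinates, and the function `unique_aligned_bases` are hypothetical, chosen only for illustration; coordinates are assumed to be 1-based and inclusive.

```python
from collections import defaultdict


def unique_aligned_bases(alignments):
    """Count distinct aligned bases for each genome A contig.

    `alignments` holds (contig_a, start, end, contig_b) tuples. The genome B
    contig is deliberately ignored when deciding whether to merge.
    """
    by_contig = defaultdict(list)
    for contig_a, start, end, _contig_b in alignments:
        by_contig[contig_a].append(tuple(sorted((start, end))))

    totals = {}
    for contig, intervals in by_contig.items():
        covered = 0
        block_start, block_end = None, None
        for start, end in sorted(intervals):
            if block_end is not None and start <= block_end:
                # Overlaps the running block on the same genome A contig.
                block_end = max(block_end, end)
            else:
                if block_end is not None:
                    covered += block_end - block_start + 1
                block_start, block_end = start, end
        covered += block_end - block_start + 1
        totals[contig] = covered
    return totals


# Hypothetical alignments: two genome A contigs, two genome B contigs.
example = [
    ("A1", 1, 100, "B1"),
    ("A1", 51, 150, "B2"),  # overlaps the first hit, different genome B contig
    ("A2", 1, 100, "B1"),   # same coordinates, but a different genome A contig
]
totals = unique_aligned_bases(example)
print(totals)                # {'A1': 150, 'A2': 100}
print(sum(totals.values()))  # 250
```

The two hits on A1 are merged even though they align to different genome B contigs, while the hit on A2 is never merged with those on A1 despite sharing coordinates, because it covers different bases of genome A.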
Fixed with 1f10a6c. |
Summary:
I am comparing a large number of genomes and occasionally the ANIm_alignment_coverage.tab file indicates a value >1.0. I would have thought this value was bounded between [0, 1.0]. How does one interpret an alignment coverage > 1.0?
Description:
I am using PyANI v0.2.11 as follows:
average_nucleotide_identity.py -i genomes/ -o anim -v -m ANIm
I can narrow the issue down to a pair of viral genomes which are in the genomes directory.
Reproducible Steps:
I can provide the pair of viral genomes which cause this issue.
Current Output:
The ANIm_alignment_coverage.tab file indicates the alignment coverage between these pairs of genomes is 1.49 and 1.51, respectively.
Expected Output:
I would have thought the alignment coverage would have a maximum value of 1.0. That is, any region of the query genome would align to at most one region in the target genome and thus it would be impossible to cover more than 100% of a genome.
pyani Version:
0.2.11