Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alignment coverage >1.0 #340

Closed
donovan-h-parks opened this issue Sep 30, 2021 · 26 comments
Closed

Alignment coverage >1.0 #340

donovan-h-parks opened this issue Sep 30, 2021 · 26 comments
Assignees
Labels
bug something isn't working how it should HIGH PRIORITY high priority issue
Milestone

Comments

@donovan-h-parks
Copy link

Summary:

I am comparing a large number of genomes and occasionally the ANIm_alignment_coverage.tab file indicates a value >1.0. I would have though this value was bounded between [0, 1.0]. How does one interpret an alignment coverage > 1.0?

Description:

I am using PyANI v0.2.11 as follows:
average_nucleotide_identity.py -i genomes/ -o anim -v -m ANIm

I can narrow the issue down to a pair of viral genomes which are in the genomes directory.

Reproducible Steps:

I can provide the pair of viral genomes which cause this issue.

Current Output:

The ANIm_alignment_coverage.tab file indicates the alignment coverage between these pairs of genomes is 1.49 and 1.51, respectively.

Expected Output:

I would have though the alignment coverage would have a maximum value of 1.0. That is, any region of the query genome would align to at most one region in the target genome and thus it would be impossible to cover more than 100% of a genome.

pyani Version:

0.2.11

@donovan-h-parks
Copy link
Author

I can add that ANIb produces the expect result for these two genomes with an alignment coverage of 0.98 and 0.99.

@widdowquinn
Copy link
Owner

Thanks @dparks1134 - that's weird. If you could provide the input genomes that would be very helpful.

Cheers,

L.

@widdowquinn widdowquinn added the bug something isn't working how it should label Sep 30, 2021
@widdowquinn widdowquinn self-assigned this Sep 30, 2021
@widdowquinn
Copy link
Owner

widdowquinn commented Sep 30, 2021

FWIW I found (and fixed) a related bug while trying this out on a couple of (unrelated) virus genomes. I don't think it will fix your issue, though.

@donovan-h-parks
Copy link
Author

Zip file with the two genomes and PyANI ANIm results are attached.

high_align_cov.zip

@widdowquinn
Copy link
Owner

Thanks - I can confirm your ANIm output. Looking into why, just now.

@widdowquinn
Copy link
Owner

widdowquinn commented Sep 30, 2021

Well - the issue appears to be that MUMmer is reporting two overlapping alignments. One alignment runs from position 85 to 37713 in the query; the other alignment runs from 17709 to 39253 in 0264574. Inspecting the alignment length output shows that the aligned sequence length is longer than either genome.

We use the --mum option to ensure that there are unique anchors for the NUCmer alignment, so this was not the behaviour I was expecting. I errored in my understanding of how MUMmer generates alignment output.

I think that what we should have been doing all along is also to use the --noextend option, by default. With this option we still get alignments at {85, 27720} and {27702, 39253} in the query and the ANI results are otherwise in line with ANIb. I suspect, though don't yet know for certain, that the default --extend option in NUCmer allows the anchors to extend into each other - and using --mum --noextend is needed to ensure that alignments are unique to both query and subject sequence (though in this example they still overlap by 18 bases!).

I feel quite bad about missing that. I'm not going to defend it by claiming that we're reimplementing the Rossello-Mora et al. implementation - by default we use --mum (not indicated in that paper) to try to ensure we don't align repeats. It's just a mistake.

I'm going to add a command-line option --noextend to pyani v0.2, so that we retain backwards compatibility by default. @baileythegreen - I think we should maybe make --noextend the default NUCmer behaviour for v0.3, and implement an --extend option for pyani anim that allows us to use pyani compare to see what the damage is likely to have been under the assumption above.

I'll make the changes and push a v0.2.12 release with an apology and explanation.

Many thanks for spotting this and bringing this to our attention @dparks1134 - this is a very important improvement that will impact directly on the next iteration of pyani. I appreciate the heads-up, and hope it doesn't put you off using the tool.

Cheers,

L.

@widdowquinn
Copy link
Owner

@all-contributors please add @dparks1134 for bug

@allcontributors
Copy link
Contributor

@widdowquinn

I've put up a pull request to add @dparks1134! 🎉

@donovan-h-parks
Copy link
Author

Thanks for the detailed explanation. Nice to know what is going on.

@baileythegreen
Copy link
Contributor

@widdowquinn I've created a new issue for this in version 3, and will prioritise that, since it probably qualifies as a bug.

@donovan-h-parks
Copy link
Author

Hi. I implemented the proposed noextend fix in v0.2.11. This does reduce the number of pairs with an alignment coverage >1, but this still occur in a few instances. The values are just above one in all cases I have observed so perhaps this is fine, but it would be good to ensure things aren't being double counted. I've attached an email pair of genomes.

AF_bug_v2.zip

@widdowquinn
Copy link
Owner

Thanks @dparks1134 - it does look like even --noextend cannot guarantee that no bases are double-counted in the MUMmer alignment.

I've been talking with @baileythegreen about potentially using interval graphs to ensure we don't double-count and, if you've found further examples of double-counting with --noextend already, it looks like we're going to have to do that instead.

I expect this hasn't been noticed already because the overlap size probably gets lost in the noise for bacterial genomes, and the values look plausible.

@donovan-h-parks
Copy link
Author

Hi @widdowquinn, do you know if it is sensible to run dnadiff and look at the generated 1-to-1 coordinates and/or alignment file? I know at least one person that has taken this approach, but I don't know enough about the MUMmer package to know if this is actually correct.

@widdowquinn
Copy link
Owner

widdowquinn commented Oct 4, 2021

Thanks for the suggestion - I've taken a look at dnadiff and it doesn't appear to do anything differently to what pyani already does - it runs a default nucmer search and then uses delta-filter to generate a 1:1 alignment (extension .1coords with dnadiff). It does some things in addition like identifying SNPs and reporting M-to-M alignments, but they don't contribute to ANI calculations.

pyani's current ANIm nucmer output and the .1coords output are identical. There are no useful options to dnadiff that I see; we can't, for instance, pass through either --mum or --noextend to nucmer.

nucmer default, followed by delta-filter -1:

/Users/lpritc/Development/GitHub/pyani/issue_340/high_align_cov/MGV-GENOME-0264574.fna /Users/lpritc/Development/GitHub/pyani/issue_340/high_align_cov/MGV-GENOME-0266457.fna
NUCMER
>MGV_MGV-GENOME-0264574 MGV_MGV-GENOME-0266457 39253 39594
85 37713 1 37636 215 215 0
-3013
-24624
-1
-1
-1
-1
-1
0
17709 39253 17626 39176 7 7 0
-9994
-1
-1
-1
-1
-1
0

dnadiff's .1delta output:

/Users/lpritc/Development/GitHub/pyani/issue_340/high_align_cov/MGV-GENOME-0264574.fna /Users/lpritc/Development/GitHub/pyani/issue_340/high_align_cov/MGV-GENOME-0266457.fna
NUCMER
>MGV_MGV-GENOME-0264574 MGV_MGV-GENOME-0266457 39253 39594
85 37713 1 37636 215 215 0
-3013
-24624
-1
-1
-1
-1
-1
0
17709 39253 17626 39176 7 7 0
-9994
-1
-1
-1
-1
-1
0

dnadiff's .report - notice that it claims a total alignment of 59kbp for the 39kbp genomes:

$ mkdir -p dnadiff_default && dnadiff -p dnadiff_default/high_align_test high_align_cov/MGV-GENOME-0264574.fna high_align_cov/MGV-GENOME-0266457.fna
[...]
/Users/lpritc/Development/GitHub/pyani/issue_340/high_align_cov/MGV-GENOME-0264574.fna /Users/lpritc/Development/GitHub/pyani/issue_340/high_align_cov/MGV-GENOME-0266457.fna
NUCMER

                               [REF]                [QRY]
[Sequences]
TotalSeqs                          1                    1
AlignedSeqs               1(100.00%)           1(100.00%)
UnalignedSeqs               0(0.00%)             0(0.00%)

[Bases]
TotalBases                     39253                39594
AlignedBases           39169(99.79%)        39176(98.94%)
UnalignedBases             84(0.21%)           418(1.06%)

[Alignments]
1-to-1                             2                    2
TotalLength                    59174                59187
AvgLength                   29587.00             29593.50
AvgIdentity                    99.63                99.63
[...]

I'm surprised and disappointed that even the --noextend option in nucmer gives overlapping alignments. We can account for this, but it will take some additional output filtering of the nucmer output prior to calculation. We're investigating how much of a difference using --noextend makes to our analyses (the tests seem to be nearly all within tolerance, so far), but it's clearly not going to be the actual solution, if we want to ensure that there are no overlaps between alignments.

@donovan-h-parks
Copy link
Author

Dose it make sense to follow the AlignedBases approach used by dnadiff? The answer certainly seems sensible in this case, but it is unclear to me exactly how this number is established.

@widdowquinn
Copy link
Owner

widdowquinn commented Oct 4, 2021

Yes, the AlignedBases value looks to be a good place to start for coverage. For instance, dnadiff reports 39169 AlignedBases for the reference, which corresponds to the 39253 - 85 interval we'd expect when not double-counting the overlap. That helps towards improving the coverage/aligned fraction calculation.

Currently, the ANI identity calculation will be double-counting the overlap, and AlignedBases doesn't look to help with that. I think we may still need to deduplicate overlapping regions in order to give an accurate ANI value.

@widdowquinn
Copy link
Owner

Largely a note for me and @baileythegreen after today's discussion...

It looks like we might be able to get everything we need from a combination of the .1coords and .snps files that dnadiff produces, so we can get the "right" answers for identity, coverage, etc.

However, for very large comparisons on a cluster we probably don't want dnadiff to generate that suite of multiple output files, so it might be best to do something like run show-snps and show-coords directly (catching each STDOUT) in a wrapper, and pass the results of the calculation to the database, in v0.3+ - though there may be a better solution.

@widdowquinn
Copy link
Owner

widdowquinn commented Nov 10, 2021

My initial optimism may have been misplaced. There appear to be some disturbing differences in the Caulobacter test set - especially wrt coverage.

ANIm_alignment_coverage_noextend.pdf
ANIm_alignment_coverage.pdf

widdowquinn added a commit that referenced this issue Nov 10, 2021
This also runs a --noextend test set in the Makefile, and adds a test
for command creation (but not output regression).

The --noextend option appears to affect low coverage and identity
scores quite significantly.

NOTE: this is not in itself a fix for #340. That requires changes to
how we write and parse nucmer output.
@widdowquinn
Copy link
Owner

widdowquinn commented Nov 10, 2021

Further investigation of the dnadiff and nucmer parsing is making me optimistic again. It's quite clear that dnadiff is not using --noextend, but is using something like a tweaked --maxmatch, and is then using the output of M-to-M matches (which we avoid), and processing them to remove overlaps to get the value in AlignedBases.

The difference in output between what we do and what dnadiff does turns out to be quite small for a bacterial test set.

Where the dnadiff .report output gives AlignedBases = 1414384 (reference) or 1424179 (query), the 1:1 alignments are actually 1408533 and 1409770 in total length on reference and query genomes; the M:M alignments are 1442608 and 1443869 respectively.

I'd argue that we should be only using 1:1 alignments, so AlignedBases is not the right number to use.

However, we were not originally accounting for overlapping alignments. When we do account for them, dnadiff gives (look at ref_aligned_size and query_aligned_size):

infname=PosixPath('2021-11-10/test_dnadiff_output/NC_002696_vs_NC_010338.1coords')
aln_ref=1408533, aln_query=1409770, cov_ref=35.050000000000125, cov_query=25.850000000000044, num_rows=1206, overlaps=11, overlap_len=17664
len(ref_tree)=1195, len(query_tree)=1201
ref_aligned_size=1401112, query_aligned_size=1408639

infname=PosixPath('2021-11-10/test_dnadiff_output/NC_002696_vs_NC_010338.mcoords')
aln_ref=1442608, aln_query=1443869, cov_ref=35.82000000000012, cov_query=26.40000000000006, num_rows=1275, overlaps=53, overlap_len=50359
len(ref_tree)=1222, len(query_tree)=1230
ref_aligned_size=1414420, query_aligned_size=1424194

and our standard pyani approach gives:

infname=PosixPath('2021-11-10/test_nucmer_output/NC_002696_vs_NC_010338.fcoords')
aln_ref=1388340, aln_query=1389491, cov_ref=34.59000000000013, cov_query=25.51000000000005, num_rows=1173, overlaps=5, overlap_len=7845
len(ref_tree)=1168, len(query_tree)=1172
ref_aligned_size=1384814, query_aligned_size=1389479

With our alignment choices (including 1:1 match filtering) we align about 1.38Mbp where dnadiff aligns 1.41Mbp.

However, if we use --maxmatch rather than --mum, we get:

infname=PosixPath('2021-11-10/test_nucmer_maxmatch_output/NC_002696_vs_NC_010338.fcoords')
aln_ref=1408533, aln_query=1409770, cov_ref=35.050000000000125, cov_query=25.850000000000044, num_rows=1206, overlaps=11, overlap_len=17664
len(ref_tree)=1195, len(query_tree)=1201
ref_aligned_size=1401112, query_aligned_size=1408639

and almost exactly recover the dnadiff output.

So, while our output differs, it appears to do so mostly because we use --mum rather than --maxmatch. As --maxmatch is an option intended to find inexact repeats and uses all anchors, whether unique or not, --mum uses only anchors that are unique in both reference and query. We do offer the --maxmatch option to users already if they don't agree with our default, so I think we might have this covered.

However, that's not the problem with issue #340 ;)

The issue with #340 is that we were not emulating the overlap removal that dnadiff employs when generating its .report file. Note that the raw alignment lengths, and the .1coord/.mcoord outputs have the same problem we did - this from the original virus comparison:

[Bases]
TotalBases                     39253                39594
AlignedBases           39169(99.79%)        39176(98.94%)
UnalignedBases             84(0.21%)           418(1.06%)

[Alignments]
1-to-1                             2                    2
TotalLength                    59174                59187
AvgLength                   29587.00             29593.50
AvgIdentity                    99.63                99.63

M-to-M                             2                    2
TotalLength                    59174                59187
AvgLength                   29587.00             29593.50
AvgIdentity                    99.63                99.63

AlignedBases is great, but the alignment lengths are not.

We can fix this by running an IntervalTree over the alignment co-ordinates. I did that for the virus comparisons to get:

infname=PosixPath('2021-11-10/virus_dnadiff_output/virus_dnadiff.1coords')
aln_ref=59174, aln_query=59187, cov_ref=150.75, cov_query=149.48, num_rows=2, overlaps=1, overlap_len=39169
len(ref_tree)=1, len(query_tree)=1
ref_aligned_size=39169, query_aligned_size=39176

infname=PosixPath('2021-11-10/virus_dnadiff_output/virus_dnadiff.mcoords')
aln_ref=59174, aln_query=59187, cov_ref=150.75, cov_query=149.48, num_rows=2, overlaps=1, overlap_len=39169
len(ref_tree)=1, len(query_tree)=1
ref_aligned_size=39169, query_aligned_size=39176

infname=PosixPath('2021-11-10/virus_nucmer_output/virus_nucmer.coords')
aln_ref=59174, aln_query=59187, cov_ref=150.75, cov_query=149.48, num_rows=2, overlaps=1, overlap_len=39169
len(ref_tree)=1, len(query_tree)=1
ref_aligned_size=39169, query_aligned_size=39176

and recovers AlignedBases exactly, whether we use dnadiff or our existing approach.

@widdowquinn
Copy link
Owner

So, I think we should be generating a .coords file output and parsing it through an IntervalTree to remove/account for overlaps. That's the main change.

I think we already allow users to choose between --maxmatch/--mum so no change there.

I don't understand the process by which AlignedBases is generated exactly, but as it appears to be based on M:M alignments, I think it's possibly the wrong value for us to use, if we're trying to identify uniquely-homologous regions.

Once we correct for the overlaps, I think that our agreement with dnadiff/other methods should be pretty good and, except in unusual cases like the MAG viruses, I don't currently think it will affect pyani outputs to a large degree - though it will be an improvement.

@widdowquinn widdowquinn added the HIGH PRIORITY high priority issue label Jan 17, 2022
@widdowquinn widdowquinn mentioned this issue Jan 20, 2022
21 tasks
@widdowquinn
Copy link
Owner

widdowquinn commented Feb 10, 2022

Historical note

I think I know why I misunderstood mummer's operation. I taught sequence alignment as part of a computational biology course for a few years, and would use the Progressive Mauve paper in that (I was rereading it for other reasons, last week). This paper describes mummer as follows (my emphasis):

MUMmer can identify and align genomes with rearranged homologous sequences. However MUMmer does not align paralogous sequences (repeats within a genome), nor does it align all copies of multi-copy orthologous sequence. Because it aligns any site to at most one site in the other genome, and due to the way it anchors alignment of repetitive sequence using neighboring unique regions, MUMmer often aligns the positionally conserved copy of a repeat element.

I think I might have internalised this explanation and never realised that mummer can in fact extend those alignments without regard to whether it maps one site in the reference to two or more sites in the query (or vice versa)…

@baileythegreen
Copy link
Contributor

I have been looking at the pybedtools library after rereading over discussion in #342. The suggestion there was to use it for nucmer file parsing.

I think pybedtools concept of an Interval has a lot of attributes that are not necessarily relevant in our case—strand for one, chromosome for another, but the class itself is more amenable to extension/modification than is the case with the IntervalTree Interval class.

Example:

pybedtools.Interval has the following attributes (the Python OOP concept of attributes, not what pybedtools refers to as attributes):

  • chrom
  • start
  • stop
  • name
  • score
  • strand
  • additional key:value pairs

When any of these are not specified for a given Interval, they are assigned the value .. We really only need our intervals to have four attributes for the kind of manipulation we want to do:

  • sequence
  • start
  • stop
  • counterpart

where sequence is the particular query sequence, and counterpart is the particular reference sequence.

Because pybedtools use a more standard class definition for their concept of Interval, I can define a class which inherits from pybedtools.Interval and use property decorators and corresponding setter methods to provide alternate names for select attributes, so that they are called things which make more sense within the context of our data:

  • chrom --> sequence
  • start
  • stop
  • name --> counterpart
  • score
  • strand
  • additional key:value pairs

This does not remove those attributes from the class (so it wouldn't affect any backend pybedtools functionality, should we need to use that; it just provides a different name for them that we can use in the pyani code. It is bypassing the way the pybedtools documentation indicates Intervals should be created, but those are all based on file structures that do not apply here, so I think this is fine.

Whether this is preferable to using the custom classes I have written may depend on speed and whatever benefits come from using pybedtools, if any.

@widdowquinn
Copy link
Owner

I had a feeling that, in my testing, the route to getting what I believe to be the correct count of AlignedBases was not very tricky. The (hacky) script I used for that is here:

#!/usr/bin/env python3

import csv

from pathlib import Path

import intervaltree

# Parse .coords files and try to replicate AlignedBases
def parse_coords(infname):
    aln_ref = 0
    aln_query = 0
    cov_ref = 0
    cov_query = 0
    num_rows = 0
    last_row = None
    overlaps = 0
    overlap_len = 0

    ref_intervals = []
    query_intervals = []

    with infname.open() as ifh:
        fieldnames = [
            "start_ref",
            "end_ref",
            "start_query",
            "end_query",
            "aln_len_ref",
            "aln_len_query",
            "aln_id",
            "len_ref",
            "len_query",
            "cov_ref",
            "cov_query",
            "id_ref",
            "id_query",
        ]
        reader = csv.DictReader(ifh, delimiter="\t", fieldnames=fieldnames)
        for row in reader:
            num_rows += 1
            aln_ref += int(row["aln_len_ref"])
            aln_query += int(row["aln_len_query"])
            cov_ref += float(row["cov_ref"])
            cov_query += float(row["cov_query"])

            ref_intervals.append(sorted((int(row["start_ref"]), int(row["end_ref"]))))
            query_intervals.append(
                sorted((int(row["start_query"]), int(row["end_query"])))
            )

    ref_tree = intervaltree.IntervalTree.from_tuples(ref_intervals)
    ref_tree.merge_overlaps()
    ref_aligned_size = 0
    for interval in ref_tree:
        ref_aligned_size += interval.end - interval.begin + 1

    query_tree = intervaltree.IntervalTree.from_tuples(query_intervals)
    query_tree.merge_overlaps()
    query_aligned_size = 0
    for interval in query_tree:
        query_aligned_size += interval.end - interval.begin + 1

    print(f"{infname=}")
    print(
        f"{aln_ref=}, {aln_query=}, {cov_ref=}, {cov_query=}, {num_rows=}"  # , {overlaps=}, {overlap_len=}"
    )
    print(f"{len(ref_tree)=}, {len(query_tree)=}")
    print(f"{ref_aligned_size=}, {query_aligned_size=}")
    print()


# virus output
infname = Path("2021-11-10/virus_dnadiff_output/virus_dnadiff.1coords")
parse_coords(infname)

infname = Path("2021-11-10/virus_dnadiff_output/virus_dnadiff.mcoords")
parse_coords(infname)

infname = Path("2021-11-10/virus_nucmer_output/virus_nucmer.coords")
parse_coords(infname)

@baileythegreen
Copy link
Contributor

@widdowquinn I think that script is merging intervals it should not; at least, based on what I understood from a discussion a while back about when intervals should be merged. For example:

Assume the input file:


1	5	20	24	300	300	90.0	500	330	1.0	1.0	id1	id2
3	7	97	24	300	300	90.0	500	400	1.0	1.0	id1	id3
1146	1183	1	37	300	300	100.00	2000	700	3.14	100.00	id4	id5
1	164	164	1	300	300	100.00	2000	800	13.88	21.78	id4	id6
163	166	179	182	300	300	100.00	2000	800	0.29	0.46	id4	id6
165	1140	1	974	300	300	100.00	2000	1000	82.34	100.00	id4	id7

This has data on two query contigs:

id1

1	5	20	24	300	300	90.0	500	330	1.0	1.0	id1	id2
3	7	97	24	300	300	90.0	500	400	1.0	1.0	id1	id3

and id4

1146	1183	1	37	300	300	100.00	2000	700	3.14	100.00	id4	id5
1	164	164	1	300	300	100.00	2000	800	13.88	21.78	id4	id6
163	166	179	182	300	300	100.00	2000	800	0.29	0.46	id4	id6
165	1140	1	974	300	300	100.00	2000	1000	82.34	100.00	id4	id7

If overlapping intervals should be merged so long as they belong to the same contig, then the query intervals for id1 should be merged, producing one interval of 1–7, and for id4, the last three intervals should also be merged, giving [1–1140, 1146–1183].

If, however, the identity of the reference contig must also be considered before deciding to merge, then id1 reduces to [1–5, 3–7] and id4 to [1–166, 165–1140, 1146–1183].

With your script, I get the former (inserting a print statement following the merge_overlaps() calls), but from our discussion a while back, I was under the impression that it should be the latter.

Which is it?

@widdowquinn
Copy link
Owner

widdowquinn commented May 4, 2022

We could have left this to tomorrow's meeting, but it may be useful to have a written record here, to avoid misunderstanding.

In a pairwise comparison, there are two genomes being compared. Let's call them genomes A and B.

In your example above, contigs id1 and id4 belong to genome A. Contigs id2, id3, id5, id6 and id7 belong to genome B.

We are interested in which parts of genome A align to genome B (and vice versa, but as in your example let's consider only genome A). We are interested, for the purpose of calculating genome coverage, in the number of unique bases in genome A that align to genome B. It doesn't matter to which parts of genome B the regions of genome A align, and specifically it doesn't matter which contigs they align to, so long as we are willing to accept each of the individual alignments as valid.

Here, genome A is divided into two contigs: id1 and id4.

The regions of id1 that align to genome B are 1-5 and 3-7, so we merge these to get the single region 1-7.

The regions of id4 that align to genome B are 1-164, 163-166, 165-1140, and 1146-1183, so we merge these to get the two regions 1-1140 and 1146-1183.

I hope this clarifies things.

@widdowquinn
Copy link
Owner

Fixed with 1f10a6c.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug something isn't working how it should HIGH PRIORITY high priority issue
Projects
Status: Done
Development

No branches or pull requests

3 participants