Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

let scaffolder use ambiguous alignments when no unambiguous ones exist #904

Merged
merged 28 commits into from
Feb 14, 2019

Conversation

notestaff
Copy link
Contributor

When scaffolding contigs to a reference, if no unambiguous alignments cover a part of reference, let the scaffolder consider ambiguous ones, when all alignments mostly agree on the sequence. The ambiguity can be the result of a misassembly, where the same reference part appears twice in a contig. Also, in case of a repeat, dropping all ambiguous alignments could leave a hole in the assembly.

@dpark01
Copy link
Member

dpark01 commented Dec 5, 2018

Oh wow... you delved into some of the more complicated code that actually does some of the science. In particular, I think this prior limitation was simply because I didn't have the time to think about how to implement the general case at the time. Let me stare at it a bit.

Is the only appropriate/needed test around this scenario the part that you modified in the expected output of the pre-existing unit test?

@notestaff
Copy link
Contributor Author

I can add more tests, that involve actual repeats in the reference. (In the existing test, the ambiguity was due to a misassembly.)

Copy link
Member

@dpark01 dpark01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, looks pretty slick... nice add. So let's see if I'm understanding this correctly:

This augments the existing scaffolding code so it no longer cops out on some potentially easy scenarios where multiple contigs align to the same part of the reference. But the way it's currently written, it's still pretty conservative: it will only recover scenarios that had exactly two contigs aligning where their alignments happen to be exactly the same length (no indel differences between them) and the SNP differences between the two is < 1%. There's one line where we can turn any of those knobs to be more generous, but for now, that's how it's set.

Is that right?

Or is this less about multiple contigs aligning to the same region of the reference and more about multiple alignments from repetitive regions in a single contig hitting the same region of a reference? (because there's still the unchanged contig_chooser stuff below).

tools/mummer.py Outdated
@@ -282,17 +282,36 @@ def scaffold_contigs_custom(self, refFasta, contigsFasta, outFasta,
# (# assembled segments)
continue

def frac_unambig(seqs):
"""Given a list of seqs of the same length, return the fraction of positions on which they all agree"""
assert len(set(map(len, alt_seqs_f)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think assert should only be used in tests--I think they're entirely ignored when python executes under certain circumstances.

tools/mummer.py Show resolved Hide resolved
tools/mummer.py Outdated Show resolved Hide resolved
tools/mummer.py Outdated
assert len(set(map(len, alt_seqs_f)))
n_tot = len(seqs[0])
n_unambig = 0
for i in range(n_tot):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought, could use zip instead? Like...

return sum(1 if (len(set(bases)==1) else 0 for bases in zip(seqs)) / float(n_tot)

@notestaff
Copy link
Contributor Author

"is this less about multiple contigs aligning to the same region of the reference and more about multiple alignments from repetitive regions in a single contig hitting the same region of a reference" -- right. The former case is still handled by contig_chooser.

"it's still pretty conservative" -- that's by design; I assumed the original check was meant to prevent some undesirable scenarios, so did not want to relax it too much. The relaxed check seems to be enough for the misassemblies seen in @tomkinsc 's / Kayla's ebov data. But it's possible that more assembly holes could be closed by relaxing the check, and/or by considering ambiguous aligns together with unambiguous ones (since a misjoined contig is really two contigs).

@dpark01
Copy link
Member

dpark01 commented Dec 5, 2018

Maybe just for record keeping, can you link or describe examples from the EBOV data you're seeing? If this is primarily about the scenario when multiple parts of the same contig align to the same region of a reference, then it's always about repetitive sequence (either because it's biologically repetitive, or because of a misassembly that produces that).

I appreciate the conservatism. That said, the real reason this scenario wasn't supported in the past was just implementation laziness. But if these parameters recover everything we need to recover in the data you're seeing, that's good enough. Or ... would it make any sense to expose them as knobs for the user? Eh... maybe later if the need arises.

So, if you're saying this scenario gets triggered because two contigs got misjoined by the assembler, this PR would help effectively pull them apart?

@notestaff
Copy link
Contributor Author

"this scenario gets triggered because two contigs got misjoined by the assembler, this PR would help effectively pull them apart" -- two contigs from the same genome part. (And to really pull them apart, maybe we should add both variants to alt_seqs, together with variants from non-ambiguous aligns, and let contig_chooser choose.) For misjoined contigs that pull together two different genome parts (as opposed to repeating the same part twice), there's still the problem that show_tiling uses each contig at most once, so one of the genome parts may get lost. So it may make sense to also detect such contigs and either break them or duplicate them before show_tiling.

notestaff and others added 4 commits December 6, 2018 15:09
@dpark01 dpark01 closed this Jan 9, 2019
@tomkinsc tomkinsc reopened this Jan 10, 2019
@notestaff notestaff merged commit 8c17bb0 into master Feb 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants