Consider replacing Gotoh alignment algorithm #556

Open
2 of 3 tasks
donkirkby opened this issue Apr 17, 2020 · 2 comments

@donkirkby
Member

donkirkby commented Apr 17, 2020

As I've been working on #549 to add support for SARS-CoV-2 references, I've had some trouble with running out of memory. I think it's partly that I'm running on equipment with less memory than I usually use, and partly that the SARS-CoV-2 genome is longer than HIV or HCV. The specific step that I've had most trouble with is aligning two consensus sequences using our Gotoh algorithm, so maybe it's time to look at alternatives.

@jeff-k had suggested we move from Gotoh to BWA, and that project seems to have been superseded by minimap2. Experiment with these tools for aligning the SARS-CoV-2 consensus sequences, and then decide whether they are worth switching to.

Tasks

@donkirkby donkirkby added this to the 7.13 milestone Apr 17, 2020
@donkirkby
Member Author

donkirkby commented Apr 17, 2020

First impressions of minimap2:

  • Good - it includes a Python wrapper and installs with pip. Nice!
  • Not so good - any error conditions cause a None return value instead of raising an exception.
  • Good - alignments seem to work as expected.
  • Good - memory usage is at least 100 times lower for the current task.
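
One way to live with the `None` return values is a thin guard that turns silent failures into exceptions. This is only a sketch: `AlignmentError` and `best_hit` are hypothetical names, and the aligner is assumed to be any object with a `map(seq)` method that yields hits carrying an `mlen` (matched length) attribute, which is how I understand the minimap2 Python wrapper to behave. Building the aligner itself (probably something like `mappy.Aligner(seq=reference)`) is left out and should be checked against the wrapper's documentation.

```python
class AlignmentError(Exception):
    """Raised when the aligner fails quietly instead of reporting an error."""


def best_hit(aligner, query):
    """Return the hit with the most matched bases, or raise AlignmentError.

    `aligner` is assumed (not confirmed here) to expose `map(seq)` and to be
    falsy when its index could not be built.
    """
    if not aligner:
        raise AlignmentError("aligner index failed to build or load")
    hits = aligner.map(query)
    if hits is None:
        raise AlignmentError("aligner returned None for the query")
    hits = list(hits)  # map() may be a generator, so materialize it once
    if not hits:
        raise AlignmentError("no alignment found for the query")
    return max(hits, key=lambda hit: hit.mlen)
```

Callers can then use ordinary `try`/`except` handling instead of checking for `None` after every call.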

@jeff-k
Contributor

jeff-k commented Apr 17, 2020

For local alignment of two long consensus sequences, assuming one of them spans a range of the other (amplicon vs. reference), Smith-Waterman or Gotoh are sound choices. I have never used minimap2, but maybe the PacBio or Nanopore features for handling long reads would approximate this use case. I don't know what else those features would do, though.

At 30k bp, SARS-CoV-2 is going to stress a SW implementation that builds an entire N x M backtracking matrix. A useful optimization for SW space complexity is banding.
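
To put rough numbers on that, here is a minimal sketch of how banding shrinks the matrix, plus a score-only banded Smith-Waterman. It makes several assumptions that are not from our code: linear gap penalties instead of Gotoh's affine gaps, made-up scoring values, no backtracking, and a fixed band around the main diagonal. `matrix_cells` and `banded_local_score` are hypothetical names.

```python
def matrix_cells(n, m, band=None):
    """Count DP cells for sequences of length n and m, full or banded."""
    if band is None:
        return (n + 1) * (m + 1)
    # A band keeps at most 2 * band + 1 columns per row.
    return (n + 1) * min(m + 1, 2 * band + 1)


def banded_local_score(a, b, band, match=2, mismatch=-1, gap=-2):
    """Best local score using only cells where |i - j| <= band.

    Keeps just the previous row of the band, so memory is O(band).
    Scoring values are illustrative assumptions, with linear gaps.
    """
    best = 0
    prev = {}  # column index -> score for the previous row's in-band cells
    for i in range(1, len(a) + 1):
        curr = {}
        lo = max(1, i - band)
        hi = min(len(b), i + band)
        for j in range(lo, hi + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            # Cells outside the band read as 0, which the local-alignment
            # floor of 0 already allows, so they never raise a score.
            diag = prev.get(j - 1, 0) + step
            up = prev.get(j, 0) + gap
            left = curr.get(j - 1, 0) + gap
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best
```

For two 30,000-base sequences, `matrix_cells(30000, 30000)` is about 900 million cells, while `matrix_cells(30000, 30000, band=100)` is about 6 million. The catch is that the band has to be wide enough to contain the true alignment: an amplicon that starts far from the diagonal would need the band centred on an estimated offset rather than on `i == j`.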

The rust-bio library has a good API for this: https://docs.rs/bio/0.20.3/bio/alignment/pairwise/banded/index.html which would be a good test bed for working out the alignment parameters.
