List of assemblers -- please contribute! #71

Closed
rcedgar opened this issue May 3, 2020 · 42 comments
Labels
Bioinformatics (Bioinformatics task), good first issue (Good for newcomers)

Comments

@rcedgar
Collaborator

rcedgar commented May 3, 2020

I started a list of assemblers that might be useful here:

doc/assemblers.md

Please edit / send me updates if you have ideas.

@rcedgar rcedgar changed the title from "List of assemblers" to "List of assemblers -- please contribute!" May 3, 2020
@rcedgar rcedgar added the labels Bioinformatics (Bioinformatics task) and good first issue (Good for newcomers) May 3, 2020
@hussius
Collaborator

hussius commented May 11, 2020

Since I don't seem to have push access to this repo, I'll add some suggestions as a comment below this one. Just to give a bit of context - my comments and recommendations stem from a project about assembling RNA and DNA virus genomes from mixed samples (host + virus) that I was involved in around 2015-2016. In terms of the tools mentioned in the review paper which is linked from the doc/assemblers.md document, I've tried more than half. (Specifically, CLC, IDBA-UD, Megahit, MetaVelvet, Mira, SPAdes, Velvet, Vicuna; I've also used Ray Meta but only for bacterial metagenome assembly, where it worked well). The ones I ended up using the most were SPAdes (which is also highlighted in the paper as a consistently well-performing tool, although they used the SPAdes meta variety whereas I just used SPAdes) and IDBA-UD (which the review paper authors also seem to like). I also quite liked Megahit, which has low memory requirements, but did not end up using that simply because it was not appreciably better in terms of assembly quality than SPAdes and IDBA-UD, which were already in our pipeline.

All just my two cents of course!

@hussius
Collaborator

hussius commented May 11, 2020

SPAdes
Type: De-novo genomic / transcriptomic / metagenomic (different varieties exist - rnaSPAdes, SPAdes meta etc.)
Code: https://github.com/ablab/spades
Paper: doi: 10.1089/cmb.2012.0021
Comments: Well-supported and generally robust assembler. SPAdes meta was highlighted in the review article at the top of the document ("Choice of assembly software has a critical impact on virome characterisation") as performing "consistently well".

Megahit
Type: De-novo genomic / metagenomic
Code: https://github.com/voutcn/megahit
Paper: doi: 10.1093/bioinformatics/btv033
Comments: Very memory-efficient.

IDBA
Type: De-novo metagenomic
Code: https://github.com/loneknightpy/idba
Paper: doi: 10.1093/bioinformatics/bts174
Comments: Anecdotally (i.e. in my own experience) works well for viral genome assembly. Also positively reviewed in the review paper above.

@hussius
Collaborator

hussius commented May 11, 2020

@rcedgar Do you know what hardware you will be trying to run the assembly on? (If AWS instances, what size RAM etc.)

@rcedgar
Collaborator Author

rcedgar commented May 11, 2020

@hussius We can use any instance type we need, smaller and cheaper are preferred but Amazon are donating credits and we have a lot of flexibility. If you have expertise here, would you be up for doing some testing ASAP? The key question is whether an existing de novo assembler can handle a dataset under these conditions:

(+) Host reads are a large majority, virus is a small fraction.

(+) Host is a species where we don't have a finished genome, an obscure bat or whatever.

(+) Virus is only recognized from a fragment, so we cannot use known virus genomes as a positive filter. (Edit added by RCE).

Even if the assembler generates virus contigs, how do we filter out the host? If the virus is closely related to a known strain, this may be straightforward, but otherwise it could be challenging.

@charlescongxu
Collaborator

The host species is typically annotated, no? You can map to a similar enough host species since divergence with viruses will be large.
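
A minimal sketch of the read-removal step this suggestion implies, assuming reads have already been aligned to a related host genome with some aligner and the names of the reads that mapped have been collected into a set. The function names `parse_fastq` and `drop_host_reads` are hypothetical and not part of any Serratus code; the sketch loads the whole file into memory, so it is only meant to illustrate the logic.

```python
from typing import Iterable, Iterator, Set, Tuple

FastqRecord = Tuple[str, str, str]  # (read name, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (name, sequence, quality) from well-formed 4-line FASTQ records."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(rows) - 3, 4):
        header, seq, _plus, qual = rows[i:i + 4]
        # The read name is the first whitespace-delimited token after '@'.
        yield header[1:].split()[0], seq, qual


def drop_host_reads(records: Iterable[FastqRecord],
                    host_read_names: Set[str]) -> Iterator[FastqRecord]:
    """Keep only reads whose names are absent from the host-mapped set."""
    for name, seq, qual in records:
        if name not in host_read_names:
            yield name, seq, qual


if __name__ == "__main__":
    example = ["@r1\n", "ACGT\n", "+\n", "IIII\n",
               "@r2\n", "GGCC\n", "+\n", "IIII\n"]
    # Assumption: r1 aligned to the (related) host genome, r2 did not.
    for name, seq, qual in drop_host_reads(parse_fastq(example), {"r1"}):
        print(f"@{name}\n{seq}\n+\n{qual}")
```

How well this works depends entirely on how many host reads the related genome captures; unmapped host reads stay in the output.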

@rcedgar
Collaborator Author

rcedgar commented May 11, 2020

Show us how! @ababaian and @rcedgar are overextended and don't have time to try this ourselves right now.

I will offer a US$250 Amazon gift certificate to the first person to implement an open-source method which creates de novo contigs from GroupA: SRR10951654-655 and GroupB: SRR10829951-958. Contigs must be validated against a close known virus reference. Virus reference genomes must not be used as a positive filter before assembly; the key challenge is how to assemble when only a small fragment is recognized.

Offer expires in a week -- contigs and method documentation must be posted before 12pm Pacific time Sunday May 17th.

RCE -- edit to clarify -- You must start from the full SRA dataset, not the BAM file generated by Serratus. In the datasets above, there is a close relative of a known virus, so the BAM files probably include almost all virus reads and very little host. This is easy. The situation we don't know how to handle is where we see reads hitting a short fragment, say one CDS, but not an entire genome. In that case, most of the virus reads will be missing from the Serratus BAM file. Host filtering is allowed (the above datasets have pig hosts, which are a good model for this situation), but filtering by the virus reference is not allowed.

@hussius
Collaborator

hussius commented May 11, 2020

I'll give it a try! The condition "(+) Host reads are a large majority, virus is a small fraction." is typical and was possible to overcome in my old project. But we'll see how the assemblers work on your suggested datasets!

@charlescongxu
Collaborator

Let me know if you need help!

@ababaian
Owner

Hello all interested parties,

I think we're starting to reach a critical mass of people regarding how to process assembly for Serratus. I'd like to propose we all get on a technical group call and we can begin to address how we want to tackle this and how best to divide the work among us.

I'd like to propose Friday morning (PST) which is Friday evening in Europe. Please submit this dudle poll with your availability and we'll convene with a clear plan of attack then.

@ababaian
Owner

We will be meeting Friday 9AM PDT on Skype. Please DM me your Skype details on Slack if we haven't already had a chance to chat.

@cmorganl
Collaborator

MATAM
Type: Reference-guided, metagenomic
Code: https://github.com/bonsai-team/matam
Paper: https://doi.org/10.1093/bioinformatics/btx644
Comments: Given the amount of data we're working with, and that the coronavirus genome is substantially larger than the ~1500 nucleotides of a 16S rRNA gene, I'm not sure how it will scale. But I'm very interested in testing it when the datasets become available. I'd also like to hear feedback if someone has already tried it.

@taltman
Collaborator

taltman commented May 15, 2020

Hi @rcedgar, are you caught up on incorporating these suggestions into the documentation? Or is there something more that we need here? Thanks!

@taltman taltman added this to the Assembly: Pipelines milestone May 15, 2020
@rcedgar
Collaborator Author

rcedgar commented May 15, 2020

@cmorganl "I'm very interested in testing it when the datasets become available." See #89

@AndreaGuarracino

Shasta
Type: De novo assembly from Oxford Nanopore reads
Link to code: https://github.com/chanzuckerberg/shasta
Link to paper: https://www.nature.com/articles/s41587-020-0503-6
Comments: It works well, but it needs parameter tuning to do so.

@hussius
Collaborator

hussius commented May 16, 2020

@rcedgar I have written up a methods description for making de novo contigs from your two SRA accession groups here.

Briefly, for the high-coverage case (GroupB), Megahit was able to create a single contig (link) covering the presumably intended target genome. For the low-coverage case, the assembly is more fragmented (I think 37 contigs) so while the contigs (link) do span most of the presumable target genome, there are quite a few gaps.
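
To put numbers on "single contig" versus "more fragmented", one option is to compute the contig count, total length and N50 from each contig FASTA file. The sketch below is an illustration only, not the method used in the write-up; `read_fasta` and `n50` are hypothetical helper names.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {contig name: sequence} from FASTA-formatted lines."""
    parts: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first token of the header as the contig name.
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def n50(lengths: Iterable[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return 0  # no contigs


if __name__ == "__main__":
    contigs = read_fasta([">c1", "ACGTACGT", ">c2", "ACGT", ">c3", "AC"])
    lengths = [len(s) for s in contigs.values()]
    print(len(lengths), sum(lengths), n50(lengths))
```

A single-contig assembly has an N50 equal to its total length, so the gap between N50 and total length gives a quick read on fragmentation.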

@ababaian
Owner

@hussius Can you scaffold that onto a genome for the low-coverage set and create a 'genome' with NNNN in between?

@rcedgar
Collaborator Author

rcedgar commented May 16, 2020

@hussius Congratulations! Just made it in time, or maybe a bit late... Send an email to robert@drive5.com and let me know which amazon you prefer (.com, .de, .es, .mx...) -- if that actually matters, not sure.

Would be great to see an alignment of your contigs, or even better scaffold, against my assembly (see #89).

@taltman
Collaborator

taltman commented May 17, 2020

@hussius
Collaborator

hussius commented May 17, 2020

@ababaian Sure. If you are only talking about inserting NNNs between contigs that should be a fairly straightforward scripting task, but if you mean a more "serious" scaffolding, I'd happily take suggestions on good tools to use. I tried something called Medusa on these contigs and it was able to get the number of contigs down to three, but it also lost some sequence so the end product was a ~25 kb assembly.
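
For the simple case, inserting runs of N between contigs could look roughly like the sketch below. It assumes the contigs are already in the right order and orientation (which has to come from somewhere else, for example an alignment to a related genome); the gap length of 100 is an arbitrary placeholder, and `join_with_gaps` and `wrap_fasta` are hypothetical names.

```python
from typing import Sequence


def join_with_gaps(ordered_contigs: Sequence[str], gap_length: int = 100) -> str:
    """Concatenate contigs, separating neighbours with a run of N."""
    # Assumption: contigs are pre-ordered and pre-oriented by the caller.
    return ("N" * gap_length).join(ordered_contigs)


def wrap_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [f">{name}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    scaffold = join_with_gaps(["ACGTACGT", "GGCCGG", "TTAA"], gap_length=5)
    print(wrap_fasta("scaffold_1", scaffold, width=10), end="")
```

The N runs only mark that a gap exists; their length says nothing about the true gap size unless it is estimated separately.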

@hussius
Collaborator

hussius commented May 17, 2020

@ababaian OK, I've posted a tentative scaffolded genome for the low-coverage samples here. I would have uploaded into the serratus-public AWS S3, but it wouldn't let me create a new directory (bucket) with aws s3 mb and I wasn't sure where to put it if I couldn't make my own directory. I'm sure this scaffolded genome can be improved but I will leave it as it is for now. The best BLAST hit:
[Image: Screenshot from 2020-05-17 17-21-15]

@rcedgar
Collaborator Author

rcedgar commented May 17, 2020

@hussius can you comment on suitability of your assembly approach for HPC, i.e. putting in a container and running it in the cloud?

@hussius
Collaborator

hussius commented May 17, 2020

@rcedgar The problem I see is that automating the host sequence removal could be hard, especially if the host doesn't have a good reference genome. Apart from that, the workflow is fairly lightweight and should be reasonably easy to containerize.

@hussius
Collaborator

hussius commented May 17, 2020

@rcedgar Here is an alignment between my assembly for the low-coverage pig virus vs. a FASTA file combining your reference based assemblies for SRR10951654 (called "Genome1" in the file) and SRR10951655 (called "Genome2" in the file):
C2X4M578114-Alignment.txt I don't know if this alignment format is convenient; if not, you can BLAST your assemblies against my tentative scaffolded assembly. (My assembly is called "Chr0_RaGOO" because I used a program called RaGOO for scaffolding)

@rcedgar
Collaborator Author

rcedgar commented May 17, 2020

@hussius "automating the host sequence removal could be hard". Yes, exactly. Maybe I should have disallowed host filtering and potentially saved myself $250, but it seemed we were making very little progress on assembly so I thought solving an easier problem could get some momentum going. So, how do we tackle host filtering in general?

@ababaian
Owner

"@ababaian OK, I've posted a tentative scaffolded genome for the low-coverage samples here. I would have uploaded into the serratus-public AWS S3, but it wouldn't let me create a new directory (bucket) with aws s3 mb and I wasn't sure where to put it if I couldn't make my own directory. I'm sure this scaffolded genome can be improved but I will leave it as it is for now."

How is this scaffolded? Did you use another reference genome as the backbone?

@taltman
Collaborator

taltman commented May 17, 2020

I've created a stub on the Serratus Assembly Wiki to capture this list of assemblers:

https://github.com/ababaian/serratus/wiki/Serratus-Assembly#list-of-assemblers-to-consider

Please help migrate this great content over there!

@rcedgar
Collaborator Author

rcedgar commented May 17, 2020

Sounds like duplicated effort to me. The wiki can link to the issue. Let's be pragmatic and not expend unnecessary effort conforming to a system, we need to focus on the "real" work as much as possible.

ababaian pushed a commit that referenced this issue May 17, 2020
Notes on de novo assembly methods used for issue #71
@taltman
Collaborator

taltman commented May 17, 2020

@rcedgar Regarding host removal, I am testing the use of Kraken2 for removing reads that are predicted to be prokaryotic or eukaryotic in origin. The hope is that this will reduce the overhead tremendously. I'll post results on Slack.
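
A rough sketch of how the per-read Kraken2 output could be turned into a keep-list, assuming the default tab-separated output (classified flag C/U, read ID, numeric taxonomy ID, length, k-mer mapping) and assuming the set of taxonomy IDs to discard has already been expanded from the prokaryote and eukaryote clades by some other means. `reads_to_keep` and `discard_taxids` are hypothetical names, and the taxids in the example are placeholders for that precomputed set.

```python
from typing import Iterable, List, Set


def reads_to_keep(kraken_lines: Iterable[str], discard_taxids: Set[int]) -> List[str]:
    """Return IDs of reads that are unclassified or not assigned to a discarded taxon."""
    keep = []
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        # Unclassified reads are kept: a novel virus may well end up here.
        if status == "U" or int(taxid) not in discard_taxids:
            keep.append(read_id)
    return keep


if __name__ == "__main__":
    example = [
        "C\tread1\t9823\t150\t9823:116",
        "U\tread2\t0\t150\t0:116",
        "C\tread3\t562\t150\t562:116",
    ]
    # Assumption: 9823 and 562 belong to the precomputed discard set.
    print(reads_to_keep(example, {9823, 562}))
```

Keeping unclassified reads is a deliberate choice here, since reads from a divergent virus are unlikely to be classified at all.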

@victorlin
Collaborator

List of assemblers has been copied from assemblers.md to the relevant wiki page.

@ababaian
Owner

You're a gentleman and a scholar Victor.

@sjackman

I've found Unicycler very effective for small genomes, especially when you want a usable GFA file for visualization with Bandage. Both tools are available in Homebrew.
https://github.com/rrwick/Unicycler
https://github.com/brewsci/homebrew-bio/blob/master/Formula/unicycler.rb

@taltman
Collaborator

taltman commented May 24, 2020

@sjackman Thanks! We're planning on having a call to discuss our assembly plan, would you be available to join? Here's the link:

https://dudle.inf.tu-dresden.de/serratus001/

@sjackman

Sorry I didn't see this question until now, and it looks like the meeting has already happened. I'm happy to chat more here on GitHub.

@ababaian
Owner

@sjackman We're meeting tomorrow 9AM :P

@sjackman

I can make that. Please share the Zoom (or whatever) link with me.

@ababaian
Owner

@sjackman Can you email me your Skype ID?

@sjackman

Reposted from #86 (comment)

Mash Screen seems relevant.
Mash Screen: high-throughput sequence containment estimation for genome discovery
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1841-x
See the paragraph Novel virus assembly

Mash Screen is implemented in C++ and is integrated into the existing Mash codebase as of v2.0.

https://github.com/marbl/Mash
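
To illustrate the containment idea itself, here is a toy calculation using exact k-mer sets: the fraction of a reference's k-mers that also occur somewhere in the reads. This is not how Mash computes it (Mash works on MinHash sketches rather than full k-mer sets), and the sketch ignores reverse complements; `kmers` and `containment` are hypothetical names.

```python
from typing import Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers in a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(reference: str, reads: Iterable[str], k: int = 21) -> float:
    """Fraction of the reference's distinct k-mers found in at least one read."""
    ref_kmers = kmers(reference, k)
    if not ref_kmers:
        return 0.0  # reference shorter than k
    read_kmers: Set[str] = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    return len(ref_kmers & read_kmers) / len(ref_kmers)


if __name__ == "__main__":
    # Tiny k only so the toy sequences below produce any k-mers at all.
    print(containment("ACGTACGA", ["ACGTAC", "TTTT"], k=4))
```

Unlike a symmetric similarity score, containment stays high when a small genome is fully present inside a much larger read set, which is the situation with a virus in host-dominated data.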

@rchikhi
Collaborator

rchikhi commented May 31, 2020

As per Issue #130, a benchmark of some of the assemblers is available at https://github.com/ababaian/serratus/wiki/Assembly-benchmark-results-for-8-coronavirus-candidates-datasets. Please let me know if you'd like it to be updated with your favorite method.

@ababaian
Owner

Good to close for now?

@rcedgar
Collaborator Author

rcedgar commented Jun 16, 2020

Yes.

@rcedgar rcedgar closed this as completed Jun 16, 2020