List of assemblers -- please contribute! #71

rcedgar · 2020-05-03T18:15:36Z

I started a list of assemblers that might be useful here:

Please edit / send me updates if you have ideas.

hussius · 2020-05-11T13:06:56Z

Since I don't seem to have push access to this repo, I'll add some suggestions as a comment below this one. Just to give a bit of context - my comments and recommendations stem from a project about assembling RNA and DNA virus genomes from mixed samples (host + virus) that I was involved in around 2015-2016. In terms of the tools mentioned in the review paper which is linked from the doc/assemblers.md document, I've tried more than half. (Specifically, CLC, IDBA-UD, Megahit, MetaVelvet, Mira, SPAdes, Velvet, Vicuna; I've also used Ray Meta but only for bacterial metagenome assembly, where it worked well). The ones I ended up using the most were SPAdes (which is also highlighted in the paper as a consistently well-performing tool, although they used the SPAdes meta variety whereas I just used SPAdes) and IDBA-UD (which the review paper authors also seem to like). I also quite liked Megahit, which has low memory requirements, but did not end up using that simply because it was not appreciably better in terms of assembly quality than SPAdes and IDBA-UD, which were already in our pipeline.

All just my two cents of course!

hussius · 2020-05-11T13:07:38Z

SPAdes
Type: De-novo genomic / transcriptomic / metagenomic (different varieties exist - rnaSPAdes, SPAdes meta etc.)
Code: https://github.com/ablab/spades
Paper: doi: 10.1089/cmb.2012.0021
Comments: Well-supported and generally robust assembler. SPAdes meta was highlighted in the review article at the top of the document ("Choice of assembly software has a critical impact on virome characterisation") as performing "consistently well".

Megahit
Type: De-novo genomic / metagenomic
Code: https://github.com/voutcn/megahit
Paper: doi: 10.1093/bioinformatics/btv033
Comments: Very memory-efficient.

IDBA
Type: De-novo metagenomic
Code: https://github.com/loneknightpy/idba
Paper: doi: 10.1093/bioinformatics/bts174
Comments: Anecdotally (i.e. in my own experience) works well for viral genome assembly. Also positively reviewed in the review paper above.

hussius · 2020-05-11T13:08:38Z

@rcedgar Do you know what hardware you will be trying to run the assembly on? (If AWS instances, what size RAM etc.)

rcedgar · 2020-05-11T14:33:35Z

@hussius We can use any instance type we need, smaller and cheaper are preferred but Amazon are donating credits and we have a lot of flexibility. If you have expertise here, would you be up for doing some testing ASAP? The key question is whether an existing de novo assembler can handle a dataset under these conditions:

(+) Host reads are a large majority, virus is a small fraction.

(+) Host is a species where we don't have a finished genome, an obscure bat or whatever.

(+) Virus is only recognized from a fragment, so we cannot use known virus genomes as a positive filter. (Edit added by RCE).

Even if the assembler generates virus contigs, how do we filter out the host? If the virus is closely related to a known strain, this may be straightforward, but otherwise it could be challenging.

charlescongxu · 2020-05-11T14:45:00Z

The host species is typically annotated, no? You can map to a similar enough host species since divergence with viruses will be large.
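
A minimal sketch of the "map to a related host, keep what does not map" idea, assuming the reads have already been aligned to a host genome by some aligner that writes SAM. The function name `unmapped_read_names` and the example records are hypothetical; the only fact relied on is that bit 0x4 of the SAM FLAG field marks an unmapped read.

```python
def unmapped_read_names(sam_lines):
    """Yield names of reads whose SAM FLAG has the 0x4 (unmapped) bit set."""
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # a SAM record has 11 mandatory fields
            continue
        name, flag = fields[0], int(fields[1])
        if flag & 0x4:  # read did not align to the host reference
            yield name


# Made-up records: one read aligned to the host, one unaligned.
example_sam = [
    "@HD\tVN:1.6",
    "host_read\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "candidate_read\t4\t*\t0\t0\t*\t*\t0\t0\tGGCCA\tIIIII",
]
print(list(unmapped_read_names(example_sam)))
```

Reads that fail to map are only candidates: they may also come from host regions missing from the related reference, so this step reduces the input rather than isolating the virus.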

rcedgar · 2020-05-11T15:13:55Z

Show us how! @ababaian and @rcedgar are overextended and don't have time to try this ourselves right now.

I will offer a US$250 Amazon gift certificate to the first person to implement an open-source method which creates de novo contigs from GroupA: SRR10951654-655 and GroupB: SRR10829951-958. Contigs must be validated against a close known virus reference. Virus reference genomes must not be used as a positive filter before assembly; the key challenge is how to assemble when only a small fragment is recognized.

Offer expires in a week -- contigs and method documentation must be posted before 12pm Pacific time Sunday May 17th.

RCE -- edit to clarify -- You must start from the full SRA dataset, not the BAM file generated by Serratus. In the datasets above, there is a close relative of a known virus, so the BAM files probably include almost all virus reads and very little host. This is easy. The situation we don't know how to handle is where we see reads hitting a short fragment, say one CDS, but not an entire genome. In that case, most of the virus reads will be missing from the Serratus BAM file. Host filtering is allowed (the above datasets have pig hosts, which are a good model for this situation), but filtering by the virus reference is not allowed.

hussius · 2020-05-11T16:56:31Z

I'll give it a try! The condition "(+) Host reads are a large majority, virus is a small fraction." is typical and was possible to overcome in my old project. But we'll see how the assemblers work on your suggested datasets!

charlescongxu · 2020-05-11T17:04:20Z

Let me know if you need help!

ababaian · 2020-05-12T20:46:59Z

Hello all interested parties,

I think we're starting to reach a critical mass of people regarding how to process assembly for Serratus. I'd like to propose we all get on a technical group call and we can begin to address how we want to tackle this and how best to divide the work among us.

I'd like to propose Friday morning (PST) which is Friday evening in Europe. Please submit this dudle poll with your availability and we'll convene with a clear plan of attack then.

ababaian · 2020-05-13T17:36:13Z

We will be meeting Friday 9AM PDT on Skype. Please DM me your Skype details on Slack if we haven't already had a chance to chat.

cmorganl · 2020-05-15T17:54:00Z

MATAM
Type: Reference-guided, metagenomic
Code: https://github.com/bonsai-team/matam
Paper: https://doi.org/10.1093/bioinformatics/btx644
Comments: Given the amount of data we're working with, and that the coronavirus genome is substantially larger than the ~1500 nucleotides of a 16S rRNA gene, I'm not sure how it will scale. But I'm very interested in testing it when the datasets become available. I'd also like to hear feedback if someone has already tried it.

taltman · 2020-05-15T18:01:28Z

Hi @rcedgar, are you caught up on incorporating these suggestions into the documentation? Or is there something more that we need here? Thanks!

rcedgar · 2020-05-15T19:31:02Z

@cmorganl "I'm very interested in testing it when the datasets become available." See #89

AndreaGuarracino · 2020-05-15T21:13:52Z

Shasta
Type: De novo assembly from Oxford Nanopore reads
Link to code: https://github.com/chanzuckerberg/shasta
Link to paper: https://www.nature.com/articles/s41587-020-0503-6
Comments: It works well, but it needs parameter tuning to do so.

hussius · 2020-05-16T09:41:55Z

@rcedgar I have written up a methods description for making de novo contigs from your two SRA accession groups here.

Briefly, for the high-coverage case (GroupB), Megahit was able to create a single contig (link) covering the presumably intended target genome. For the low-coverage case, the assembly is more fragmented (I think 37 contigs) so while the contigs (link) do span most of the presumable target genome, there are quite a few gaps.

ababaian · 2020-05-16T15:32:27Z

@hussius Can you scaffold that onto a genome for the low-coverage set and create a 'genome' with NNNN in between?

rcedgar · 2020-05-16T15:42:50Z

@hussius Congratulations! Just made it in time, or maybe a bit late... Send an email to robert@drive5.com and let me know which amazon you prefer (.com, .de, .es, .mx...) -- if that actually matters, not sure.

Would be great to see an alignment of your contigs, or even better scaffold, against my assembly (see #89).

taltman · 2020-05-17T07:34:16Z

https://sanger-pathogens.github.io/iva/

hussius · 2020-05-17T10:27:42Z

@ababaian Sure. If you are only talking about inserting NNNs between contigs that should be a fairly straightforward scripting task, but if you mean a more "serious" scaffolding, I'd happily take suggestions on good tools to use. I tried something called Medusa on these contigs and it was able to get the number of contigs down to three, but it also lost some sequence so the end product was a ~25 kb assembly.
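
For the simple case, a minimal sketch of joining contigs with runs of N could look like the following. It assumes the contigs are already ordered and oriented along the target genome, which is the hard part and is not solved here; `parse_fasta`, `join_contigs` and the gap length of 100 are illustrative choices, not what was used in this thread.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def join_contigs(sequences, gap=100):
    """Concatenate sequences in the given order, separated by `gap` N characters."""
    return ("N" * gap).join(sequences)


# Made-up contigs, assumed to be in genome order and on the same strand.
example_fasta = ">contig1\nACGT\nAC\n>contig2\nGGTT\n"
contigs = [seq for _, seq in parse_fasta(example_fasta)]
print(join_contigs(contigs, gap=4))
```

The gap length carries no information about the true distance between contigs; it only marks that sequence is missing.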

hussius · 2020-05-17T15:19:53Z

@ababaian OK, I've posted a tentative scaffolded genome for the low-coverage samples here. I would have uploaded into the serratus-public AWS S3, but it wouldn't let me create a new directory (bucket) with aws s3 mb and I wasn't sure where to put it if I couldn't make my own directory. I'm sure this scaffolded genome can be improved but I will leave it as it is for now. The best BLAST hit:

rcedgar · 2020-05-17T15:23:04Z

@hussius can you comment on suitability of your assembly approach for HPC, i.e. putting in a container and running it in the cloud?

hussius · 2020-05-17T17:25:33Z

@rcedgar The problem I see is that automating the host sequence removal could be hard, especially if the host doesn't have a good reference genome. Apart from that, the workflow is fairly lightweight and should be reasonably easy to containerize.

hussius · 2020-05-17T17:37:16Z

@rcedgar Here is an alignment between my assembly for the low-coverage pig virus vs. a FASTA file combining your reference based assemblies for SRR10951654 (called "Genome1" in the file) and SRR10951655 (called "Genome2" in the file):
C2X4M578114-Alignment.txt
I don't know if this alignment format is convenient; if not, you can BLAST your assemblies against my tentative scaffolded assembly. (My assembly is called "Chr0_RaGOO" because I used a program called RaGOO for scaffolding)

rcedgar · 2020-05-17T17:46:43Z

@hussius "automating the host sequence removal could be hard". Yes, exactly. Maybe I should have disallowed host filtering and potentially saved myself $250, but it seemed we were making very little progress on assembly so I thought solving an easier problem could get some momentum going. So, how do we tackle host filtering in general?

ababaian · 2020-05-17T18:10:57Z

@hussius wrote: "@ababaian OK, I've posted a tentative scaffolded genome for the low-coverage samples here. I would have uploaded into the serratus-public AWS S3, but it wouldn't let me create a new directory (bucket) with aws s3 mb and I wasn't sure where to put it if I couldn't make my own directory. I'm sure this scaffolded genome can be improved but I will leave it as it is for now."

How is this scaffolded? Did you use another reference genome as the backbone?

taltman · 2020-05-17T22:25:14Z

I've created a stub on the Serratus Assembly Wiki to capture this list of assemblers:

https://github.com/ababaian/serratus/wiki/Serratus-Assembly#list-of-assemblers-to-consider

Please help migrate this great content over there!

rcedgar · 2020-05-17T22:27:12Z

Sounds like duplicated effort to me. The wiki can link to the issue. Let's be pragmatic and not expend unnecessary effort conforming to a system, we need to focus on the "real" work as much as possible.

Notes on de novo assembly methods used for issue #71

taltman · 2020-05-17T22:33:33Z

@rcedgar Regarding host removal, I am testing the use of Kraken2 for removing reads that are predicted to be prokaryotic or eukaryotic in origin. The hope is that will reduce the overhead tremendously. I'll post results on Slack.
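
A rough sketch of how such a filter could be applied to Kraken2's default per-read output, which has tab-separated columns starting with C/U (classified or unclassified), the read ID and the assigned taxonomy ID. The function `reads_to_keep` and the `keep_taxids` parameter are hypothetical, the example lines are made up, and this is not necessarily how the test above was run. It keeps unclassified reads plus reads assigned exactly to a taxon on the keep list; resolving descendants of a taxon would need the taxonomy tree and is not shown.

```python
def reads_to_keep(kraken_lines, keep_taxids=frozenset()):
    """Return IDs of reads that are unclassified or assigned to a taxid in keep_taxids."""
    kept = []
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:  # skip malformed or empty lines
            continue
        status, read_id, taxid = fields[0], fields[1], fields[2]
        if status == "U" or taxid in keep_taxids:
            kept.append(read_id)
    return kept


# Made-up example lines in the default Kraken2 output layout.
example_lines = [
    "C\tread1\t9823\t150\t9823:116",    # assigned to a host taxon: dropped
    "U\tread2\t0\t150\t0:116",          # unclassified: kept
    "C\tread3\t10239\t150\t10239:116",  # assigned to Viruses (taxid 10239): kept
]
print(reads_to_keep(example_lines, keep_taxids={"10239"}))
```

Keeping unclassified reads matters here because a novel virus may not be classified at all.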

victorlin · 2020-05-18T02:57:55Z

List of assemblers has been copied from assemblers.md to the relevant wiki page.

ababaian · 2020-05-18T02:59:49Z

You're a gentleman and a scholar Victor.

sjackman · 2020-05-21T21:43:34Z

IVA mentioned above by @taltman is missing. It's in Homebrew. #71 (comment)
https://github.com/sanger-pathogens/iva
https://github.com/brewsci/homebrew-bio/blob/master/Formula/iva.rb

sjackman · 2020-05-21T21:45:08Z

I've found Unicycler very effective for small genomes, especially when you want a usable GFA file for visualization with Bandage. Both tools in Homebrew.
https://github.com/rrwick/Unicycler
https://github.com/brewsci/homebrew-bio/blob/master/Formula/unicycler.rb

sjackman · 2020-05-21T22:17:26Z

Shovill
https://github.com/tseemann/shovill
https://github.com/brewsci/homebrew-bio/blob/develop/Formula/shovill.rb
Used by
Isolation and rapid sharing of the 2019 novel coronavirus (SARS‐CoV‐2) from the first patient diagnosed with COVID‐19 in Australia
https://onlinelibrary.wiley.com/doi/full/10.5694/mja2.50569
https://onlinelibrary.wiley.com/action/downloadSupplement?doi=10.5694%2Fmja2.50569&file=mja250569-sup-0001-Supinfo.pdf

taltman · 2020-05-24T22:44:39Z

@sjackman Thanks! We're planning on having a call to discuss our assembly plan, would you be available to join? Here's the link:

https://dudle.inf.tu-dresden.de/serratus001/

sjackman · 2020-05-25T21:42:50Z

Sorry I didn't see this question until now, and it looks like the meeting has already happened. I'm happy to chat more here on GitHub.

ababaian · 2020-05-25T21:59:00Z

@sjackman We're meeting tomorrow 9AM :P

sjackman · 2020-05-25T22:08:28Z

I can make that. Please share the Zoom (or whatever) link with me.

ababaian · 2020-05-26T16:00:37Z

@sjackman can you email me your skype id

sjackman · 2020-05-26T16:06:56Z

Reposted from #86 (comment)

Mash Screen seems relevant.
Mash Screen: high-throughput sequence containment estimation for genome discovery
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1841-x
See the paragraph Novel virus assembly

Mash Screen is implemented in C++ and is integrated into the existing Mash codebase as of v2.0.

https://github.com/marbl/Mash
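
To illustrate what "containment estimation" means, here is a toy sketch, not Mash's implementation: take a bottom-s sketch of the reference k-mer hashes and report the fraction of those hashes that also occur among the k-mers of the reads. All names and parameters (`containment`, `k`, `s`) are illustrative, and the sketch ignores reverse complements, which a real tool would handle.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_hash(kmer):
    """Hash a k-mer to a 64-bit integer."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def containment(reference, reads, k=5, s=100):
    """Estimate the fraction of reference k-mers present in the reads.

    Uses the s smallest reference k-mer hashes as a sketch of the reference.
    """
    ref_sketch = set(sorted(kmer_hash(x) for x in kmers(reference, k))[:s])
    if not ref_sketch:
        return 0.0
    read_hashes = {kmer_hash(x) for read in reads for x in kmers(read, k)}
    return len(ref_sketch & read_hashes) / len(ref_sketch)


# Made-up sequences: the first read set fully covers the reference.
reference = "ACGTACGTTGCA"
print(containment(reference, ["ACGTACGTTGCA", "TTTTTTTT"]))
print(containment(reference, ["TTTTTTTT"]))
```

A high score for a reference in a mixed sample suggests that most of that genome is present in the reads, even when it makes up a small fraction of the data.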

rchikhi · 2020-05-31T17:22:08Z

As per Issue #130, a benchmark of some of the assemblers is available here: https://github.com/ababaian/serratus/wiki/Assembly-benchmark-results-for-8-coronavirus-candidates-datasets
Please let me know if you'd like it to be updated with your favorite method.

ababaian · 2020-06-16T23:13:45Z

Good to close for now?

rcedgar · 2020-06-16T23:14:39Z

Yes.

rcedgar changed the title ~~List of assemblers~~ List of assemblers -- please contribute! May 3, 2020

rcedgar added the Bioinformatics and good first issue labels May 3, 2020

rcedgar mentioned this issue May 3, 2020

Assembly protocol of COV sequences #65

Closed

ababaian mentioned this issue May 14, 2020

State of Assembly #86

Closed

taltman assigned hussius, JustinChu, cmorganl, ababaian and charlescongxu May 15, 2020

taltman added this to the Assembly: Pipelines milestone May 15, 2020

hussius added a commit that referenced this issue May 17, 2020

Notes on de novo assembly methods used for issue #71

bce8e3d

ababaian pushed a commit that referenced this issue May 17, 2020

Merge pull request #105 from ababaian/mhuss_dev

b1fe1b1

Notes on de novo assembly methods used for issue #71

rcedgar closed this as completed Jun 16, 2020
