List of assemblers -- please contribute! #71
Since I don't seem to have push access to this repo, I'll add some suggestions as a comment below this one. Just to give a bit of context - my comments and recommendations stem from a project about assembling RNA and DNA virus genomes from mixed samples (host + virus) that I was involved in around 2015-2016. In terms of the tools mentioned in the review paper which is linked from the doc/assemblers.md document, I've tried more than half. (Specifically, CLC, IDBA-UD, Megahit, MetaVelvet, Mira, SPAdes, Velvet, Vicuna; I've also used Ray Meta but only for bacterial metagenome assembly, where it worked well). The ones I ended up using the most were SPAdes (which is also highlighted in the paper as a consistently well-performing tool, although they used the SPAdes meta variety whereas I just used SPAdes) and IDBA-UD (which the review paper authors also seem to like). I also quite liked Megahit, which has low memory requirements, but did not end up using that simply because it was not appreciably better in terms of assembly quality than SPAdes and IDBA-UD, which were already in our pipeline. All just my two cents of course! |
SPAdes Megahit IDBA |
@rcedgar Do you know what hardware you will be trying to run the assembly on? (If AWS instances, what size RAM etc.) |
@hussius We can use any instance type we need. Smaller and cheaper are preferred, but Amazon is donating credits and we have a lot of flexibility. If you have expertise here, would you be up for doing some testing ASAP? The key question is whether an existing de novo assembler can handle a dataset under these conditions: (+) Host reads are a large majority, virus is a small fraction. (+) Host is a species where we don't have a finished genome, an obscure bat or whatever. (+) Virus is only recognized from a fragment, so we cannot use known virus genomes as a positive filter. (Edit added by RCE.) Even if the assembler generates virus contigs, how do we filter out the host? If the virus is closely related to a known strain, this may be straightforward, but otherwise it could be challenging.
The host species is typically annotated, no? You can map to a similar enough host species since divergence with viruses will be large. |
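A minimal sketch of the host-mapping idea above, assuming the reads have already been aligned to a related host genome with some read mapper and written out as SAM text. The function name `non_host_read_names` is hypothetical. The only thing it relies on is the standard SAM FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). Reads that fail to map to the host are kept as candidates for assembly; for paired reads, both mates must be unmapped.

```python
PAIRED = 0x1         # SAM FLAG bit: read is part of a pair
UNMAPPED = 0x4       # SAM FLAG bit: this read did not align
MATE_UNMAPPED = 0x8  # SAM FLAG bit: the mate did not align


def non_host_read_names(sam_lines):
    """Return the names of reads (or read pairs) that did not map to the host.

    sam_lines: iterable of SAM text lines, header lines included.
    """
    keep = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        name, flag = fields[0], int(fields[1])
        if not flag & UNMAPPED:
            continue  # read aligned to the host: treat as host
        if flag & PAIRED and not flag & MATE_UNMAPPED:
            continue  # mate aligned to the host: drop the whole pair
        keep.add(name)
    return keep


if __name__ == "__main__":
    example = [
        "@SQ\tSN:chr1\tLN:1000",
        "r1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "r1\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "r2\t99\tchr1\t10\t60\t4M\t=\t50\t44\tACGT\tIIII",
    ]
    print(sorted(non_host_read_names(example)))
```

Dropping a pair when either mate hits the host is the conservative choice; with a divergent host reference it may be worth loosening, since some true host reads will fail to map anyway.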
Show us how! @ababaian and @rcedgar are overextended and don't have time to try this ourselves right now. I will offer a US$250 Amazon gift certificate to the first person to implement an open-source method which creates de novo contigs from GroupA: SRR10951654-655 and GroupB: SRR10829951-958. Contigs must be validated against a close known virus reference. Virus reference genomes must not be used as a positive filter before assembly; the key challenge is how to assemble when only a small fragment is recognized. Offer expires in a week -- contigs and method documentation must be posted before 12pm Pacific time Sunday May 17th. RCE -- edit to clarify -- You must start from the full SRA dataset, not the BAM file generated by Serratus. In the datasets above, there is a close relative of a known virus, so the BAM files probably include almost all virus reads and very little host. This is easy. The situation we don't know how to handle is where we see reads hitting a short fragment, say one CDS, but not an entire genome. In that case, most of the virus reads will be missing from the Serratus BAM file. Host filtering is allowed (the above datasets have pig hosts, which are a good model for this situation), but filtering by the virus reference is not allowed.
I'll give it a try! The condition "(+) Host reads are a large majority, virus is a small fraction." is typical and was possible to overcome in my old project. But we'll see how the assemblers work on your suggested datasets! |
Let me know if you need help! |
Hello all interested parties, I think we're starting to reach a critical mass of people interested in how to handle assembly for Serratus. I'd like to propose that we all get on a technical group call to work out how we want to tackle this and how best to divide the work among us. I'd like to propose Friday morning (PST), which is Friday evening in Europe. Please fill in this dudle poll with your availability and we'll convene with a clear plan of attack then.
We will be meeting Friday 9AM PDT on Skype. Please DM me your skype details on slack if we haven't already had a chance to chat. |
MATAM |
Hi @rcedgar, are you caught up on incorporating these suggestions into the documentation? Or is there something more that we need here? Thanks! |
Shasta |
@rcedgar I have written up a methods description for making de novo contigs from your two SRA accession groups here. Briefly, for the high-coverage case (GroupB), Megahit was able to create a single contig (link) covering the presumably intended target genome. For the low-coverage case, the assembly is more fragmented (I think 37 contigs) so while the contigs (link) do span most of the presumable target genome, there are quite a few gaps. |
@hussius Can you scaffold that onto a genome for the low-coverage set and create a 'genome' with NNNN in between? |
@hussius Congratulations! Just made it in time, or maybe a bit late... Send an email to robert@drive5.com and let me know which amazon you prefer (.com, .de, .es, .mx...) -- if that actually matters, not sure. Would be great to see an alignment of your contigs, or even better scaffold, against my assembly (see #89). |
@ababaian Sure. If you are only talking about inserting NNNs between contigs that should be a fairly straightforward scripting task, but if you mean a more "serious" scaffolding, I'd happily take suggestions on good tools to use. I tried something called Medusa on these contigs and it was able to get the number of contigs down to three, but it also lost some sequence so the end product was a ~25 kb assembly. |
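The simple "insert NNNs between contigs" option mentioned above could look something like the sketch below. It assumes the contigs have already been put in the right order and orientation (for example by aligning them to a related genome), which is the hard part and is not done here. The function names and the gap length of 100 are illustrative assumptions, not the method actually used in this thread.

```python
def join_contigs_with_gaps(contigs, gap_length=100):
    """Concatenate ordered, oriented contigs with runs of N between them.

    The gap length is arbitrary: it marks an unknown distance,
    not a measured one.
    """
    gap = "N" * gap_length
    return gap.join(contigs)


def format_fasta(name, sequence, width=60):
    """Return a FASTA record with the sequence wrapped at `width` columns."""
    lines = [">" + name]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    ordered_contigs = ["ACGTACGT", "GGCCGGCC", "TTAATTAA"]  # toy sequences
    scaffold = join_contigs_with_gaps(ordered_contigs, gap_length=10)
    print(format_fasta("scaffold_1", scaffold), end="")
```

A fixed-length gap only says "something is missing here"; if the gap sizes can be estimated from an alignment to a relative, passing those sizes in per junction would be a natural extension.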
@ababaian OK, I've posted a tentative scaffolded genome for the low-coverage samples here. I would have uploaded into the serratus-public AWS S3, but it wouldn't let me create a new directory (bucket) with |
@hussius can you comment on suitability of your assembly approach for HPC, i.e. putting in a container and running it in the cloud? |
@rcedgar The problem I see is that automating the host sequence removal could be hard, especially if the host doesn't have a good reference genome. Apart from that, the workflow is fairly lightweight and should be reasonably easy to containerize. |
@rcedgar Here is an alignment between my assembly for the low-coverage pig virus vs. a FASTA file combining your reference based assemblies for SRR10951654 (called "Genome1" in the file) and SRR10951655 (called "Genome2" in the file): |
@hussius "automating the host sequence removal could be hard". Yes, exactly. Maybe I should have disallowed host filtering and potentially saved myself $250, but it seemed we were making very little progress on assembly so I thought solving an easier problem could get some momentum going. So, how do we tackle host filtering in general? |
How is this scaffolded? Did you use another reference genome as the backbone? |
I've created a stub on the Serratus Assembly Wiki to capture this list of assemblers: Please help migrate this great content over there! |
Sounds like duplicated effort to me. The wiki can link to the issue. Let's be pragmatic and not expend unnecessary effort conforming to a system; we need to focus on the "real" work as much as possible.
Notes on de novo assembly methods used for issue #71
@rcedgar Regarding host removal, I am testing the use of Kraken2 for removing reads that are predicted to be prokaryotic or eukaryotic in origin. The hope is that will reduce the overhead tremendously. I'll post results on Slack. |
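As a rough illustration of the Kraken2 filtering idea above, here is a sketch that reads Kraken2's default per-read output (tab-separated: C/U status, read ID, taxonomy ID, length, LCA mapping) and keeps every read that is unclassified or not assigned to a taxon on a drop list. The function name `reads_to_keep` and the `drop_taxids` set are hypothetical; building that set (all taxids under Bacteria, Archaea and Eukaryota) needs the taxonomy tree and is left to the caller. It also assumes Kraken2 was run without `--use-names`, so the third column is a plain number.

```python
def reads_to_keep(kraken_lines, drop_taxids):
    """Return IDs of reads that are unclassified or not in drop_taxids.

    kraken_lines: lines of Kraken2's default per-read output.
    drop_taxids:  set of integer taxonomy IDs considered host/cellular
                  (assumed to be precomputed by the caller).
    """
    keep = []
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, read_id, taxid = fields[0], fields[1], int(fields[2])
        if status == "U" or taxid not in drop_taxids:
            keep.append(read_id)
    return keep


if __name__ == "__main__":
    # Toy records; 9823 is Sus scrofa, 562 is Escherichia coli,
    # 10239 is the top-level Viruses taxon in the NCBI taxonomy.
    example = [
        "C\tread1\t9823\t150\t9823:116",
        "U\tread2\t0\t150\t0:116",
        "C\tread3\t10239\t150\t10239:116",
        "C\tread4\t562\t150\t562:116",
    ]
    print(reads_to_keep(example, drop_taxids={9823, 562}))
```

Keeping unclassified reads matters here: a divergent virus known only from a fragment is likely to end up unclassified rather than labelled as viral.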
List of assemblers has been copied from assemblers.md to the relevant wiki page. |
You're a gentleman and a scholar Victor. |
IVA mentioned above by @taltman is missing. It's in Homebrew. #71 (comment) |
I've found Unicycler very effective for small genomes, especially when you want a usable GFA file for visualization with Bandage. Both tools in Homebrew. |
Shovill |
@sjackman Thanks! We're planning on having a call to discuss our assembly plan, would you be available to join? Here's the link: |
Sorry I didn't see this question until now, and it looks like the meeting has already happened. I'm happy to chat more here on GitHub. |
@sjackman We're meeting tomorrow 9AM :P |
I can make that. Please share the Zoom (or whatever) link with me. |
@sjackman can you email me your skype id |
Reposted from #86 (comment) Mash Screen seems relevant.
As per Issue #130, a benchmark of some of the assemblers is available at https://github.com/ababaian/serratus/wiki/Assembly-benchmark-results-for-8-coronavirus-candidates-datasets. Please let me know if you'd like it to be updated with your favorite method.
Good to close for now? |
Yes. |
I started a list of assemblers that might be useful here:
doc/assemblers.md
Please edit / send me updates if you have ideas.