~~> The Assembly Benchmark <~~ #130
Comments
Excellent! Please deposit fastq, bam and sam if possible, thanks.
Suggest designating an S3 directory structure for contigs, something like contigs/method/SRA.fa, e.g. contigs/rce/SRR7287110.fa, so we can all find & review.
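For concreteness, a minimal sketch of depositing and browsing contigs under that layout; the serratus-public bucket prefix is an assumption carried over from elsewhere in this thread, and rce/SRR7287110.fa is just the example path above:

# Hypothetical deposit under the proposed contigs/<method>/<SRA>.fa layout
aws s3 cp SRR7287110.fa s3://serratus-public/contigs/rce/SRR7287110.fa
# Browse everything deposited so far for review
aws s3 ls --recursive s3://serratus-public/contigs/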
Just so we're not all trying the same things: I'm running metaviralSPAdes on all those datasets.
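For reference, a hedged sketch of what such a run might look like, assuming the SPAdes 3.14+ --metaviral pipeline; read file names, thread count, and output directory are placeholders:

# metaviralSPAdes via the SPAdes --metaviral flag (placeholder file names)
spades.py --metaviral \
  -1 SRR7287110_1.fastq -2 SRR7287110_2.fastq \
  -t 8 -o metaviral_SRR7287110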
This is probably clear to all, but I wanted to emphasize just in case: The goal is to implement a fully automated containerized workflow. IMO it's fine to cut corners to do the benchmark test if that makes things easier, but manual curation steps, e.g. to separate host from virus, are out of bounds unless it's clear how to automate them. There should be a clear and short path from the method used to make the contigs to something we can run in the cloud.
I'm doing an assembly of all of them, and then was planning to use CheckV (or others) to identify relevant contigs. This might not be the best strategy, but I'll just explore this route.
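A minimal sketch of that assemble-then-screen route, assuming MEGAHIT as the assembler and a locally downloaded CheckV database; all paths and file names are placeholders:

# Assemble one dataset (placeholder reads), then screen the contigs with CheckV
megahit -1 SRR10829953_1.fastq -2 SRR10829953_2.fastq -t 8 -o megahit_SRR10829953
checkv end_to_end megahit_SRR10829953/final.contigs.fa checkv_SRR10829953 \
  -d /path/to/checkv-db -t 8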
So, in terms of SPAdes, we believe that plain SPAdes should work more or less straight out of the box. Single-cell mode should probably be used to ensure that no assumptions about uniform coverage are made (though this mode also adds some code to clean up MDA artifacts). The reality is that the proper solution should be somewhere in the middle between metaSPAdes and metaviralSPAdes :)
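A minimal sketch of plain SPAdes in single-cell mode, as suggested above; file names are placeholders:

# Plain SPAdes with --sc so no uniform-coverage assumption is made (placeholder reads)
spades.py --sc -1 reads_1.fastq -2 reads_2.fastq -t 8 -o spades_sc_out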
I actually ended up running much more than metaviralSPAdes. Here is a benchmark of metaviralSPAdes, MEGAHIT, and Minia (with several parameters). An up-to-date version will be kept here: https://github.com/ababaian/serratus/wiki/Assembly-benchmark-results-for-8-coronavirus-candidates-datasets

Assembly benchmark results for 8 coronavirus candidates datasets
See #130 for the motivation and description of the datasets. Viral contigs are available in
A collection of scripts that were used to produce this benchmark is in

Benchmark setup
Reads were given as-is to each assembler (no quality/adapter trimming). I ran the assemblers with default parameters and gave the contigs to CheckV. I then reported any contig hit for the genomes in
MetaviralSPAdes did not detect anything in its regular output, so I used

Results
Nothing was found, apparently.

Performance
On 4 threads, dataset

Some thoughts
There are 3 types of datasets:

Clearly, types 1 and 2 are "easy" in the sense that we can run any assembler and get pretty much the same result quality. I was wondering how many datasets were of type 3, which makes the choice of method harder. It seems only
PRICE: elapsed time is ~1 hr with 4 threads.
I'm checking some odd annotation results in our co-assembly. @ababaian, can you help me understand which version (i.e., generated on which date) of which flavor of reference DB (e.g., cov0, cov2, cov3m, etc.) was used for each of the runs referenced in the benchmark above?
A little bit outdated by this point, but this is all from the zoonotic reservoir dataset which began on 200505. A small subset of that data is aligned to cov2r (only Pan-Coronavirus) and the majority of the data is aligned to
For completeness, here are the results from the metagenomic co-assembly test: https://github.com/ababaian/serratus/wiki/Metagenomic-Co-Assembly-Proof-of-Concept
Can you add "Conclusions"? Is this something you would recommend including in production, or not? Seems like a good idea, but maybe didn't work so well in practice?
I believe the key point in the co-assembly approach there was the removal of host reads; otherwise the size of the data would be enormous.
Maybe I misunderstood; I thought co-assembly was combining reads from different SRAs which probably have the same virus. Naively, this would make sense when coverage is low. I thought they used CheckV as the final step to separate out the host reads?
@rcedgar I don't know why you say it doesn't work so well in practice. We recovered a full-length IBV CoV genome from the "low abundance" IBV study, whereas other assemblers recovered only a fraction of that, 790 bp at most (see @rchikhi's results above). I think it worked very well, as I described on the call. Yes, it takes ~4x as long as the Minia assembler, so I don't recommend it for everything. I recommend it for all studies where there are multiple samples and the Minia assembler approach fails to recover a full-length CoV genome. And the Hallam Lab has graciously volunteered a significant amount of compute to do that for the Serratus Project, so AWS cost budgeting is not an issue. I just need a list of the SRAs where we didn't recover a full sequence. There's no reason not to do this.
Sorry for the misunderstanding -- it was a question, not a statement; I was hoping you could summarize the conclusions. "I recommend it for all studies where there are multiple samples and the Minia assembler approach fails to recover a full-length CoV genome" is exactly what I was looking for, thanks!
Correction: @rchikhi's results show that 790 bp were recovered, not 5 kbp. Corrected above.
Sorry if I misread your comment. I'll make sure to mirror the conclusions that I've written here to the Wiki report. Please leave this open until that documentation is complete; feel free to reassign to me.
@asl is correct: we filter reads up-front to remove host reads using BMTagger. We will probably move to Kraken in the future. We used CheckV at the end to check the completeness of the recovered contigs.
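A hedged sketch of that up-front host filtering step with BMTagger; the pre-built bitmask/srprism indexes and all file names are assumptions, and the index-building commands are omitted:

# Flag host reads against a pre-built host index (placeholder paths)
bmtagger.sh -b host.bitmask -x host.srprism -T tmp -q1 \
  -1 sample_1.fastq -2 sample_2.fastq -o sample.host_ids
# sample.host_ids lists the reads flagged as host; remove them before assembly,
# then run CheckV on the resulting contigs to check completeness.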
Got it, thanks. And general apologies to all for not RTFM'ing everything; I'm skimming a lot in an attempt to maximize my own productivity.
For the sake of completeness, SPAdes recovered 25 kbp of contigs from a single run (though the max contig was only 2 kbp there) and we obtained 27 kbp from 2 SRAs. Here is the QUAST report:
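For context, a minimal sketch of the kind of QUAST invocation that produces such a report; contig and reference file names are assumptions:

# Compare assembled contigs against a reference genome (placeholder files)
quast.py spades_contigs.fasta -r reference.fasta -t 4 -o quast_report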
Thanks @asl. Did the metaviralSPAdes tool declare all of those contigs to be viral in origin?
This was not a metaviralSPAdes run, as it is unsuitable for RNA viruses. But I will check with viralVerify.
@taltman viralVerify classified the majority of these contigs as viral. Also, in the full assembly it shows many other viral contigs as well; I'm cross-checking with CheckV.
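A hedged sketch of that viralVerify pass, assuming the standalone viralverify script with a local HMM database; the database path and contig file name are placeholders:

# Classify contigs as viral / non-viral with viralVerify (placeholder paths)
viralverify -f spades_contigs.fasta --hmm /path/to/hmm_database -t 8 -o viralverify_out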
Yes, there is, for example:
which is likely the same as (judging from the length and # of genes)
as reported in the co-assembly results.
I've added coronaSPAdes results to https://github.com/ababaian/serratus/wiki/Assembly-benchmark-results-for-8-coronavirus-candidates-datasets
Commentary: looks real good, especially on Ginger; the best method so far. For some reason, in combination with CheckV it didn't return anything on the very-low-coverage datasets, but in some cases the
@rchikhi The assemblies are fragmented on low-coverage datasets, so we don't have strong matches for HMM-based approaches. And CheckV has some thresholds as well (I believe it won't report anything useful for contigs < 4 kbp).
Sounds fair! Well, CheckV did manage to get hits on small (200-300 bp) MEGAHIT/Minia contigs. But this is a minor issue anyway, and it will be resolved by co-assembly.
These are the SRA accessions which form our 'representative' set. Based on the assemblies/contigs generated by each method, we will decide on an implementation to use on June 3rd, which means we should have assemblies for these done by June 1st at the latest so that we can compare results.
Delivery
Completion of each benchmark should contain as much of this information as possible. All data should be uploaded to s3://serratus-public/notebook/200526_assembly/<your_initials>/

- ~/notebook.md / .ipynb / .make: A rough script/procedure for exactly how the assembly was made (QC, software/versions, etc.)
- time on an 8-core, ~64 GB RAM machine (be sure to start the count from post-unzip)
- ~/contigs/: Viral contig sequence output
- ~/metrics/: Any assembly metrics associated with the output
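A hedged sketch of producing those deliverables for one dataset; the assembler choice and file names are placeholders, and <your_initials> is the placeholder from the S3 path above:

# Decompress first, then start timing (per the post-unzip requirement)
gunzip SRR10829953_*.fastq.gz
time megahit -1 SRR10829953_1.fastq -2 SRR10829953_2.fastq -t 8 -o asm_out

# Collect deliverables and upload to the agreed S3 prefix
mkdir -p contigs metrics
cp asm_out/final.contigs.fa contigs/SRR10829953.fa
aws s3 cp --recursive . s3://serratus-public/notebook/200526_assembly/<your_initials>/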
The Dataset
To download the raw fq files and the cov3m-aligned bam files (output of Serratus):

fq data:
aws s3 cp --recursive s3://serratus-public/notebook/200526_assembly/fq/ ./

bam data (includes the cov3m reference sequence):
aws s3 cp --recursive s3://serratus-public/notebook/200526_assembly/bam/ ./
High Coverage Porcine Epidemic Virus in Sus scrofa
- SRR10829953: ~180K reads to KT323979.1
- SRR10829957: ~195K reads to KP728470.1

Low Coverage Infectious Bronchitis Virus in Sus scrofa
- SRR10951656: ~4x coverage to MH878976.1
- SRR10951660: ~1x coverage to MH878976.1

Experimental SARS/MERS infection of Vero Cell Line
- SRR1194066: ~16K read coverage to KF600647.1
- SRR1168901: ~2.6K reads, low coverage to KF600647.1

Frank - Unidentified Bat Alphacoronavirus
- ERR2756788: ~8K mapped coverage, closest hit is fragment EU769558.1

Ginger - Unidentified Feline Coronavirus
- SRR7287110: ~46K mapped coverage to various feline CoV, closest hit is MN165107.1

I will download all of these accessions and host them in the S3 bucket so we can access them quickly and not have to go through the SRA each time. Give me an hour or two.