Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

~~> The Assembly Benchmark <~~ #130

Closed
6 tasks
ababaian opened this issue May 26, 2020 · 32 comments
Closed
6 tasks

~~> The Assembly Benchmark <~~ #130

ababaian opened this issue May 26, 2020 · 32 comments
Labels
Bioinformatics Bioinformatics task Priority: High Plz do this

Comments

@ababaian
Copy link
Owner

ababaian commented May 26, 2020

These are the SRA accessions which are our 'representative' set. Based on the assemblies / contigs generated by each method we will decide on an implementation to use on June 3rd. Which means we should have assemblies for these done by June 1st the latest such that we can compare results.

Delivery

Completion of each benchmark should contain as much of this information as possible. All data should be uploaded to ~ s3://serratus-public/notebook/200526_assembly/<your_initials>/

  • ~/notebook.md / .ipynb / .make : A Rough script/procedure for exactly how the assembly was made (QC, software/versions etc...)
  • time on an 8-core ~64 GB RAM machine (be sure to start count from post-unzip)
  • Peak memory usage
  • ~/contigs/ Viral contig sequence output
  • ~/metrics/ Any assembly metrics associated with output
  • Size of CoV viral contigs

The Dataset

To download the raw fq files and the cov3m aligned bam files (output of Serratus).

fq data: aws s3 cp --recursive s3://serratus-public/notebook/200526_assembly/fq/ ./
bam data: aws s3 cp --recursive s3://serratus-public/notebook/200526_assembly/bam/ ./
(includes cov3m reference sequence)

High Coverage Porcine Epidemic Virus in Sus Scrofa

  • SRR10829953 : ~180K reads to KT323979.1
  • SRR10829957: ~195K reads to KP728470.1

Low Coverage Infectious Bronchitis Virus in Sus Scrofa

  • SRR10951656 : ~4x coverage to MH878976.1
  • SRR10951660 : ~1x coverage to MH878976.1

Experimental SARS/MERS infection of Vero Cell Line

  • SRR1194066 : ~16K read coverage to KF600647.1
  • SRR1168901 :~2.6K reads low coverage to KF600647.1

Frank - Unidentified Bat Alphacoronavirus

  • ERR2756788 : ~8K mapped coverage, closest hit is fragment EU769558.1

Ginger - Unidentified Feline Coronavirus

  • SRR7287110 : ~46k mapped coverage to various Feline Cov, cloest hit is MN165107.1

I will download all of these accessions and host them in the S3 bucket so we can access them quickly and not have to go through the SRA each time. Give me an hour or two.

@rcedgar
Copy link
Collaborator

rcedgar commented May 26, 2020

Excellent! Please deposit fastq, bam and sam if possible, thanks.

@rcedgar
Copy link
Collaborator

rcedgar commented May 26, 2020

Suggest designating an S3 directory structure for contigs, something like contigs/method/SRA.fa e.g. contigs/rce/SRR7287110.fa so we can all find & review.

@ababaian ababaian added Bioinformatics Bioinformatics task Priority: High Plz do this labels May 26, 2020
@rchikhi
Copy link
Collaborator

rchikhi commented May 26, 2020

Just so we're not all trying the same things: I'm running metaviralSpades on all those datasets.

@rcedgar
Copy link
Collaborator

rcedgar commented May 26, 2020

This is probably clear to all, but I wanted to emphasize just in case: The goal is to implement a fully automated containerized workflow. IMO it's fine to cut corners to do the benchmark test if that makes things easier, but manual curation steps, e.g. to separate host from virus, are out of bounds unless it's clear how to automate them. There should be a clear and short path from the method used to make the contigs to something we can run in the cloud.

@taltman
Copy link
Collaborator

taltman commented May 27, 2020

@rchikhi Rats, you took mine! I guess I'll try to marshal some forces to try to do some metagenomic assembly voo-doo instead. @ababaian , is there a low-coverage CoV hit that comes from an experiment with multiple samples, that you might recommend?

@ababaian
Copy link
Owner Author

The sus scrofa series with IBV SRR10951656/SRR10951660 are part of that series, I took two examples but you can try that entire BioProject for this.

@taltman
Copy link
Collaborator

taltman commented May 27, 2020

@ababaian Thanks! That looks perfect.

@rchikhi Are you planning on using Kraken2 or some other tool for trying to pare down the number of host reads? Or do you plan to assemble them all, and then identify contigs of interest? Or both? :-)

@rchikhi
Copy link
Collaborator

rchikhi commented May 27, 2020

I'm doing an assembly of all, and then was planning to use CheckV (or others) to identify relevant contigs. This might not be the best strategy, but I'll just explore this route..

@asl
Copy link

asl commented May 27, 2020

So, in terms of SPAdes, we believe that just SPAdes should work more or less straight out of the box. Probably single cell mode should be used to ensure that no assumptions about uniform coverage are made (though this mode also adds some code to clean up some MDA artifacts).

The reality is that the proper solution should be somewhere in the middle of metaSPAdes and metaviralSPAdes :)

@rchikhi
Copy link
Collaborator

rchikhi commented May 31, 2020

I actually ended up running much more than MetaViralSPAdes. Here is a benchmark of MetaViralSpades, Megahit, Minia (with several parameters). An up-to-date version will be there: https://github.com/ababaian/serratus/wiki/Assembly-benchmark-results-for-8-coronavirus-candidates-datasets

Assembly benchmark results for 8 coronavirus candidates datasets

See #130 for the motivation and description of the datasets.

Viral contigs are available in s3://serratus-public/notebook/200526_assembly/RC/contigs.

A collection of scripts that were used to produce this benchmark is in s3://serratus-public/notebook/200526_assembly/RC/scripts/.

Benchmark setup

Reads were given as-is to each assembler (no quality/adapter trim). I ran the assemblers with default parameters and gave the contigs to CheckV. I then reported any contig hit for the genomes in checkv_genbank.tsv that match the regexp [Cc]orona.

MetaviralSpades did not detect anything in its regular output so I used K*/before_chromosome_removal.fasta (the unfiltered final assembly), and I couldn't run it on single-end reads (SRR1168901.fastq.gz, SRR10951660.fastq.gz, SRR10951656.fastq.gz).

Results

SRR10829953 : ~180K reads to KT323979.1

Method Detected virus Contig length CheckV estim. completeness CheckV AA avg ID%
Megahit GCA_900205315.1 28530 101.4 97.55
Megahit GCA_000913415.1 583 2.0 62.7
Minia-pipeline GCA_900205315.1 24266 86.3 97.72
Minia-pipeline GCA_900205315.1 2650 9.5 96.26
Minia k71 GCA_900205315.1 21295 75.7 97.88
Minia k71 GCA_900205315.1 2553 9.1 98.2
Minia k31 GCA_900205315.1 24280 86.4 97.72
Minia k31 GCA_900205315.1 1040 3.7 98.4
Minia k31 GCA_900205315.1 1003 3.6 93.0
MetaViralSPAdes GCA_900205315.1 18984 67.5 97.51
MetaViralSPAdes GCA_900205315.1 8808 31.4 96.9

SRR10829957 : ~195K reads to KP728470.1

Method Detected virus Contig length CheckV estim. completeness CheckV AA avg ID%
Megahit GCA_900205315.1 1548 5.5 95.4
Megahit GCA_900205315.1 26185 93.1 97.65
Minia-pipeline GCA_900205315.1 24272 86.3 97.72
Minia-pipeline GCA_900205315.1 3650 13.0 96.42
Minia k71 GCA_900205315.1 24243 86.2 97.72
Minia k71 GCA_900205315.1 2363 8.4 98.2
Minia k71 GCA_900205315.1 805 2.9 98.5
Minia k31 GCA_900205315.1 17109 60.9 98.19
Minia k31 GCA_900205315.1 5041 17.9 95.0
Minia k31 GCA_900205315.1 1451 5.2 98.1
MetaViralSPades GCA_900205315.1 28060 99.8 97.55

SRR10951656 : ~4x coverage to MH878976.1

Method Detected virus Contig length CheckV estim. completeness CheckV AA avg ID%
Megahit GCA_000880055.1 460 1.7 93.8
Megahit GCA_000880055.1 569 2.1 95.2
Megahit GCA_000880055.1 790 2.9 87.4
Minia-pipeline GCA_000880055.1 206 0.7 84.4
Minia k31 GCA_000880055.1 203 0.7 92.2
Minia k31 GCA_000862965.1 343 1.2 96.5
Minia k31 GCA_000880055.1 255 0.9 85.0

SRR10951660 : ~1x coverage to MH878976.1

Method Detected virus Contig length CheckV estim. completeness CheckV AA avg ID%
Megahit GCA_000862965.1 346 1.3 96.5
Megahit GCA_000880055.1 634 2.3 95.7

SRR1194066 : ~16K read coverage to KF600647.1

Method Detected virus Contig length CheckV estim. completeness CheckV AA avg ID%
Megahit GCA_000901155.1 7601 25.2 99.18
Minia-pipeline GCA_000901155.1 6973 23.2 99.1
Minia k71 GCA_000901155.1 2502 8.3 100.0
Minia k71 GCA_000901155.1 760 2.5 100.0
Minia k71 GCA_000901155.1 285 1.0 100.0
Minia k31 GCA_000901155.1 5691 19.0 100.0
MetaViralSPAdes GCA_000901155.1 3180 10.6 100.0
MetaViralSPAdes GCA_000901155.1 2952 9.8 100.0

ERR2756788 : Frank, ~8K mapped coverage, closest hit is fragment EU769558.1

Method Detected virus Contig length CheckV estim. completeness CheckV AA avg ID%
Megahit GCA_000872845.1 1998 6.3 32.2
Megahit GCA_003972065.1 29219 105.2 55.71
Minia-pipeline GCA_001503155.1 26412 95.1 56.18
Minia-pipeline GCA_003972065.1 2778 9.7 47.6
Minia-pipeline GCA_000872845.1 1801 5.6 32.2
Minia k71 GCA_000899495.1 7104 25.3 48.61
Minia k71 GCA_000899495.1 5044 17.7 80.3
Minia k31 GCA_003972065.1 28908 104.1 55.71
Minia k31 GCA_000872845.1 798 2.5 32.2
MetaViralSPAdes GCA_000899495.1 26406 95.1 55.75
MetaViralSPAdes GCA_003972065.1 1337 4.7 46.9
MetaViralSPAdes GCA_000872845.1 1185 3.7 32.2

SRR7287110 : Ginger, ~46k mapped coverage to various Feline Cov, cloest hit is MN165107.1

Method Detected virus Contig length CheckV estim. completeness CheckV AA avg ID%
Megahit GCA_000856025.1 26905 95.1 85.3
Megahit GCA_000856025.1 1995 6.9 90.89
Megahit GCA_000856025.1 2598 9.0 91.14
Megahit GCA_000870985.1 3972 14.4 37.0
Minia-pipeline GCA_000856025.1 11269 39.5 79.61
Minia-pipeline GCA_000856025.1 6024 20.9 91.23
Minia-pipeline GCA_000856025.1 1058 3.7 68.1
Minia-pipeline GCA_000856025.1 2288 7.9 90.61
Minia-pipeline GCA_000856025.1 2291 7.9 91.14
Minia k71 GCA_000856025.1 10075 35.0 78.89
Minia k71 GCA_000856025.1 1000 3.4 89.4
Minia k71 GCA_000856025.1 843 3.0 90.8
Minia k71 GCA_000856025.1 846 3.0 90.8
Minia k71 GCA_000856025.1 666 2.3 97.8
Minia k71 GCA_000856025.1 666 2.3 94.6
Minia k31 GCA_000856025.1 4902 17.2 96.1
Minia k31 GCA_000856025.1 1891 6.5 92.4
Minia k31 GCA_000856025.1 1368 4.8 87.08
Minia k31 GCA_000856025.1 832 3.0 95.5
MetaViralSPAdes GCA_000856025.1 307 1.1 68.1
MetaViralSPAdes GCA_000856025.1 307 1.1 72.1
MetaViralSPAdes GCA_001504755.1 2879 10.4 36.0
MetaViralSPAdes GCA_000856025.1 1064 3.6 91.4
MetaViralSPAdes GCA_000856025.1 2201 7.6 91.8
MetaViralSPAdes GCA_000856025.1 1217 4.2 92.1
MetaViralSPAdes GCA_000856025.1 899 3.1 88.4
MetaViralSPAdes GCA_000856025.1 8785 30.5 95.9
MetaViralSPAdes GCA_000856025.1 858 3.0 95.5
MetaViralSPAdes GCA_000856025.1 9000 31.4 91.3

SRR1168901

Nothing was found, apparently.

Performance

On 4 threads, dataset SRR10829957 (4 GB compressed)

Method Time Memory (GB)
Megahit 2h59m 6.8
Minia-pipeline 2h 3.8
Minia k61 24m 1.7
Minia k31 34m 1.7
MetaviralSpades 18h21m 21

Some thoughts

There are 3 types of datasets:

  1. (easy-medium instances) those on which the virus assembles into 1-3 contigs with any assembler, typically one large contig covering >90% of the genome
  2. (hard instances) and those where the coverage is too low and reference-based assembly will be needed.
  3. (assembler-critical instances) on those, only some assemblers do well with default parameters

Clearly types 1 and 2 are "easy" in the sense that we can run any assembler and get pretty much the same result quality. I was wondering how many datasets were of type 3, which makes the choice of method harder. It seems only SRR7287110 is. Of course I am biased, but if all datasets were of type 1 and 2, then Minia k31 is an energy-saving way to triage datasets between those of type 1 and those of type 2.

@rcedgar
Copy link
Collaborator

rcedgar commented May 31, 2020

PRICE: elapsed times are ~1hr with 4 threads.
Longest contigs:
Frank ERR2756788 8504
Ginger SRR7287110 15691
Pedv1 SRR10829953 23559
Pedv2 SRR10829957 27949
Lowcvg SRR10951656 238 (total of 628 contigs)
Fully automated, no attempt to tune, no metadata needed except for paired vs. unpaired and read length

@taltman
Copy link
Collaborator

taltman commented Jun 2, 2020

I'm checking some odd annotation results in our co-assembly. @ababaian , can you help me understand which version (i.e., generated on which date) of which flavor of reference DB (e.g., cov0, cov2, cov3m, etc.) was used for each of the runs referenced in the benchmark above?

@ababaian
Copy link
Owner Author

ababaian commented Jun 2, 2020

A little bit outdated by this point but this is all from the zoonotic reservoir dataset which began on 200505. A small sub-set of that data is aligned to cov2r (only Pan-Coronavirus) and the majority of the data is aligned to cov2m which is flom1, or a sub-set of vertebrate viruses. Essentially as we were learning about this data-set we were going through major changes of our reference genome, it will all have to be repeated soon anyways, I can do this on the weekend.

@taltman
Copy link
Collaborator

taltman commented Jun 7, 2020

For completeness, here are the results from the metagenomic co-assembly test:

https://github.com/ababaian/serratus/wiki/Metagenomic-Co-Assembly-Proof-of-Concept

@rcedgar
Copy link
Collaborator

rcedgar commented Jun 7, 2020

Can you add "Conclusions"? Is this something you would recommend including in production, or not? Seems like a good idea, but maybe didn't work so well in practice?

@asl
Copy link

asl commented Jun 7, 2020

I believe the key point in co-assembly approach there was the removal of host reads. Otherwise the size of the data would be enormous

@rcedgar
Copy link
Collaborator

rcedgar commented Jun 7, 2020

Maybe I misunderstood, I thought co-assembly was combining reads from different SRAs which probably have the same virus. Naively, this would make sense when coverage is low. I thought they used checkv as the final step to separate out the host reads?

@taltman
Copy link
Collaborator

taltman commented Jun 7, 2020

@rcedgar I don't know why you say it doesn't work so well in practice. We recovered a full-length IBV CoV genome from the "low abundance" IBV study, whereas using other assemblers a fraction of that at 790 bp max was recovered (see @rchikhi 's results above) . I think it worked very well, as I described on the call.

Yes, it takes ~4x as long as the Minia assembler, so I don't recommend it for everything. I recommend it for all studies where there are multiple samples and the Minia assembler approach fails to recover a full-length CoV genome. And the Hallam Lab has graciously volunteered a significant amount of compute to do that for the Serratus Project, so AWS cost budgeting is not an issue. I just need a list of the SRAs where we didn't recover a full sequence. There's no reason not to do this.

@rcedgar
Copy link
Collaborator

rcedgar commented Jun 7, 2020

Sorry for the misunderstanding -- it was a question, not a statement, I was hoping you could summarize the conclusions. "I recommend it for all studies where there are multiple samples and the Minia assembler approach fails to recover a full-length CoV genome" is exactly what I was looking for, thanks!

@taltman
Copy link
Collaborator

taltman commented Jun 7, 2020

Correction: @rchikhi 's results show that 790 bp were recovered, not 5kbp. Corrected above.

@taltman
Copy link
Collaborator

taltman commented Jun 7, 2020

Sorry if I misread your comment. I'll make sure to mirror the conclusions that I've written here to the Wiki report. Please leave this open until that documentation is complete, feel free to reassign to me.

@taltman
Copy link
Collaborator

taltman commented Jun 7, 2020

@asl Is correct, we filter reads up-front to remove host reads using BMTagger. We will probably move to Kraken in the future. We used CheckV at the end to check for completeness of the recovered contigs.

@rcedgar
Copy link
Collaborator

rcedgar commented Jun 7, 2020

Got it thanks. And general apologies to all for not RTFM'ing everything, I'm skimming a lot in an attempt to maximize my own productivity.

@asl
Copy link

asl commented Jun 7, 2020

For the sake of completeness, SPAdes recovered 25 kbp of contigs from single run (though, max contig was only 2 kbp there) and we obtained 27 kbp from 2 SRAs. Here is the QUAST report:

Assembly                    corona_pe2
# contigs (>= 0 bp)         13
# contigs (>= 1000 bp)      10
# contigs (>= 5000 bp)      0
# contigs (>= 10000 bp)     0
# contigs (>= 25000 bp)     0
# contigs (>= 50000 bp)     0
Total length (>= 0 bp)      27648
Total length (>= 1000 bp)   26277
Total length (>= 5000 bp)   0
Total length (>= 10000 bp)  0
Total length (>= 25000 bp)  0
Total length (>= 50000 bp)  0
# contigs                   11
Largest contig              4836
Total length                26987
GC (%)                      38.31
N50                         2824
N75                         1657
L50                         4
L75                         7
# N's per 100 kbp           0.00

@taltman
Copy link
Collaborator

taltman commented Jun 8, 2020

Thanks @asl. Did the metaviralSPAdes tools declare all of those contigs to be viral in origin?

@asl
Copy link

asl commented Jun 8, 2020

This was not a metaviralSPAdes run as it is unsuitable for RNA viruses. But I will check with viralVerify.

@asl
Copy link

asl commented Jun 8, 2020

@taltman viralVerify classified majority of these contigs as viral. Also, in the full assembly it shows many other viral contigs as well, I'm cross-checking with checkv.

@asl
Copy link

asl commented Jun 8, 2020

Yes, there is, for example:

NODE_2_length_15157_cov_8.893458

which is likely the same as (judging from the length and # of genes)

k119_5808       MH996950.1      11176   Avian avulavirus 1      15205   15147   15138   99.908  99      27237   0.0

as reported in co-assembly results

@rchikhi
Copy link
Collaborator

rchikhi commented Jun 16, 2020

@rchikhi
Copy link
Collaborator

rchikhi commented Jun 16, 2020

commentary: looks real good, especially on Ginger, the best method so far.

For some reason, in combination with CheckV it didn't return anything on the very-low-coverage datasets, but in some cases the gene_clusters.fasta contained a few hundreds of sequences bases. Maybe lack of sensitivity from CheckV.

@asl
Copy link

asl commented Jun 16, 2020

@rchikhi The assemblies are fragmented on low coverage datasets. So we don't have strong matches for HMM-based approaches. And checkv had some thresholds as well (I believe it won't report anything useful to contigs < 4 kbp)

@rchikhi
Copy link
Collaborator

rchikhi commented Jun 16, 2020

sounds fair! well, CheckV did manage to get hits on small (200-300bp) Megahit/minia contigs. But this is anyway a minor issue, that will be resolved by coassembly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bioinformatics Bioinformatics task Priority: High Plz do this
Projects
None yet
Development

No branches or pull requests

5 participants