~~> The Assembly Benchmark <~~ #130
Comments
Excellent! Please deposit fastq, bam and sam if possible, thanks.
Suggest designating an S3 directory structure for contigs, something like contigs/method/SRA.fa, e.g. contigs/rce/SRR7287110.fa, so we can all find & review.
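For concreteness, a minimal sketch of depositing and browsing contigs under that layout; the serratus-public bucket prefix is an assumption carried over from elsewhere in this thread, and rce/SRR7287110.fa is just the example path above:

# Hypothetical deposit under the proposed contigs/<method>/<SRA>.fa layout
aws s3 cp SRR7287110.fa s3://serratus-public/contigs/rce/SRR7287110.fa
# Browse everything deposited so far for review
aws s3 ls --recursive s3://serratus-public/contigs/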
Just so we're not all trying the same things: I'm running metaviralSPAdes on all those datasets.
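For reference, a hedged sketch of what such a run might look like, assuming the SPAdes 3.14+ --metaviral pipeline; read file names, thread count, and output directory are placeholders:

# metaviralSPAdes via the SPAdes --metaviral flag (placeholder file names)
spades.py --metaviral \
  -1 SRR7287110_1.fastq -2 SRR7287110_2.fastq \
  -t 8 -o metaviral_SRR7287110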
This is probably clear to all, but I wanted to emphasize just in case: The goal is to implement a fully automated containerized workflow. IMO it's fine to cut corners to do the benchmark test if that makes things easier, but manual curation steps, e.g. to separate host from virus, are out of bounds unless it's clear how to automate them. There should be a clear and short path from the method used to make the contigs to something we can run in the cloud.
I'm doing an assembly of all of them, and then was planning to use CheckV (or others) to identify relevant contigs. This might not be the best strategy, but I'll just explore this route.
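A minimal sketch of that assemble-then-screen route, assuming MEGAHIT as the assembler and a locally downloaded CheckV database; all paths and file names are placeholders:

# Assemble one dataset (placeholder reads), then screen the contigs with CheckV
megahit -1 SRR10829953_1.fastq -2 SRR10829953_2.fastq -t 8 -o megahit_SRR10829953
checkv end_to_end megahit_SRR10829953/final.contigs.fa checkv_SRR10829953 \
  -d /path/to/checkv-db -t 8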
So, in terms of SPAdes, we believe that plain SPAdes should work more or less straight out of the box. Single-cell mode should probably be used to ensure that no assumptions about uniform coverage are made (though this mode also adds some code to clean up MDA artifacts). The reality is that the proper solution should be somewhere in the middle between metaSPAdes and metaviralSPAdes :)
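A minimal sketch of plain SPAdes in single-cell mode, as suggested above; file names are placeholders:

# Plain SPAdes with --sc so no uniform-coverage assumption is made (placeholder reads)
spades.py --sc -1 reads_1.fastq -2 reads_2.fastq -t 8 -o spades_sc_out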
I actually ended up running much more than metaviralSPAdes. Here is a benchmark of metaviralSPAdes, MEGAHIT, and Minia (with several parameters). An up-to-date version will be kept here: https://github.com/ababaian/serratus/wiki/Assembly-benchmark-results-for-8-coronavirus-candidates-datasets

Assembly benchmark results for 8 coronavirus candidates datasets
See #130 for the motivation and description of the datasets. Viral contigs are available in
A collection of scripts that were used to produce this benchmark is in

Benchmark setup
Reads were given as-is to each assembler (no quality/adapter trimming). I ran the assemblers with default parameters and gave the contigs to CheckV. I then reported any contig hit for the genomes in
MetaviralSPAdes did not detect anything in its regular output, so I used

Results
Nothing was found, apparently.

Performance
On 4 threads, dataset

Some thoughts
There are 3 types of datasets:

Clearly, types 1 and 2 are "easy" in the sense that we can run any assembler and get pretty much the same result quality. I was wondering how many datasets were of type 3, which makes the choice of method harder. It seems only
PRICE: elapsed time is ~1 hr with 4 threads.
I'm checking some odd annotation results in our co-assembly. @ababaian, can you help me understand which version (i.e., generated on which date) of which flavor of reference DB (e.g., cov0, cov2, cov3m, etc.) was used for each of the runs referenced in the benchmark above?
A little bit outdated by this point, but this is all from the zoonotic reservoir dataset which began on 200505. A small subset of that data is aligned to cov2r (only Pan-Coronavirus) and the majority of the data is aligned to
For completeness, here are the results from the metagenomic co-assembly test: https://github.com/ababaian/serratus/wiki/Metagenomic-Co-Assembly-Proof-of-Concept
Can you add "Conclusions"? Is this something you would recommend including in production, or not? Seems like a good idea, but maybe didn't work so well in practice?
I believe the key point in the co-assembly approach there was the removal of host reads; otherwise the size of the data would be enormous.
Maybe I misunderstood; I thought co-assembly was combining reads from different SRAs which probably have the same virus. Naively, this would make sense when coverage is low. I thought they used CheckV as the final step to separate out the host reads?
@rcedgar I don't know why you say it doesn't work so well in practice. We recovered a full-length IBV CoV genome from the "low abundance" IBV study, whereas other assemblers recovered only a fraction of that, 790 bp at most (see @rchikhi's results above). I think it worked very well, as I described on the call. Yes, it takes ~4x as long as the Minia assembler, so I don't recommend it for everything. I recommend it for all studies where there are multiple samples and the Minia assembler approach fails to recover a full-length CoV genome. And the Hallam Lab has graciously volunteered a significant amount of compute to do that for the Serratus Project, so AWS cost budgeting is not an issue. I just need a list of the SRAs where we didn't recover a full sequence. There's no reason not to do this.
Sorry for the misunderstanding -- it was a question, not a statement; I was hoping you could summarize the conclusions. "I recommend it for all studies where there are multiple samples and the Minia assembler approach fails to recover a full-length CoV genome" is exactly what I was looking for, thanks!
Correction: @rchikhi's results show that 790 bp were recovered, not 5 kbp. Corrected above.
Sorry if I misread your comment. I'll make sure to mirror the conclusions that I've written here to the Wiki report. Please leave this open until that documentation is complete; feel free to reassign to me.
@asl is correct: we filter reads up-front to remove host reads using BMTagger. We will probably move to Kraken in the future. We used CheckV at the end to check the completeness of the recovered contigs.
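A hedged sketch of that up-front host filtering step with BMTagger; the pre-built bitmask/srprism indexes and all file names are assumptions, and the index-building commands are omitted:

# Flag host reads against a pre-built host index (placeholder paths)
bmtagger.sh -b host.bitmask -x host.srprism -T tmp -q1 \
  -1 sample_1.fastq -2 sample_2.fastq -o sample.host_ids
# sample.host_ids lists the reads flagged as host; remove them before assembly,
# then run CheckV on the resulting contigs to check completeness.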
Got it, thanks. And general apologies to all for not RTFM'ing everything; I'm skimming a lot in an attempt to maximize my own productivity.
For the sake of completeness, SPAdes recovered 25 kbp of contigs from a single run (though the max contig was only 2 kbp there) and we obtained 27 kbp from 2 SRAs. Here is the QUAST report:
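For context, a minimal sketch of the kind of QUAST invocation that produces such a report; contig and reference file names are assumptions:

# Compare assembled contigs against a reference genome (placeholder files)
quast.py spades_contigs.fasta -r reference.fasta -t 4 -o quast_report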
Thanks @asl. Did the metaviralSPAdes tool declare all of those contigs to be viral in origin?
This was not a metaviralSPAdes run, as it is unsuitable for RNA viruses. But I will check with viralVerify.
@taltman viralVerify classified the majority of these contigs as viral. Also, in the full assembly it shows many other viral contigs as well; I'm cross-checking with CheckV.
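A hedged sketch of that viralVerify pass, assuming the standalone viralverify script with a local HMM database; the database path and contig file name are placeholders:

# Classify contigs as viral / non-viral with viralVerify (placeholder paths)
viralverify -f spades_contigs.fasta --hmm /path/to/hmm_database -t 8 -o viralverify_out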
Yes, there is, for example:
which is likely the same as (judging from the length and # of genes)
as reported in the co-assembly results.
I've added coronaSPAdes results to https://github.com/ababaian/serratus/wiki/Assembly-benchmark-results-for-8-coronavirus-candidates-datasets
Commentary: looks real good, especially on Ginger; the best method so far. For some reason, in combination with CheckV it didn't return anything on the very-low-coverage datasets, but in some cases the
@rchikhi The assemblies are fragmented on low-coverage datasets, so we don't have strong matches for HMM-based approaches. And CheckV has some thresholds as well (I believe it won't report anything useful for contigs < 4 kbp).
Sounds fair! Well, CheckV did manage to get hits on small (200-300 bp) MEGAHIT/Minia contigs. But this is a minor issue anyway, and it will be resolved by co-assembly.
These are the SRA accessions which form our 'representative' set. Based on the assemblies/contigs generated by each method, we will decide on an implementation to use on June 3rd, which means we should have assemblies for these done by June 1st at the latest so that we can compare results.
Delivery
Completion of each benchmark should contain as much of this information as possible. All data should be uploaded to s3://serratus-public/notebook/200526_assembly/<your_initials>/

- ~/notebook.md / .ipynb / .make: A rough script/procedure for exactly how the assembly was made (QC, software/versions, etc.)
- time on an 8-core, ~64 GB RAM machine (be sure to start the count from post-unzip)
- ~/contigs/: Viral contig sequence output
- ~/metrics/: Any assembly metrics associated with the output
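A hedged sketch of producing those deliverables for one dataset; the assembler choice and file names are placeholders, and <your_initials> is the placeholder from the S3 path above:

# Decompress first, then start timing (per the post-unzip requirement)
gunzip SRR10829953_*.fastq.gz
time megahit -1 SRR10829953_1.fastq -2 SRR10829953_2.fastq -t 8 -o asm_out

# Collect deliverables and upload to the agreed S3 prefix
mkdir -p contigs metrics
cp asm_out/final.contigs.fa contigs/SRR10829953.fa
aws s3 cp --recursive . s3://serratus-public/notebook/200526_assembly/<your_initials>/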
The Dataset
To download the raw fq files and the cov3m-aligned bam files (output of Serratus):

fq data:
aws s3 cp --recursive s3://serratus-public/notebook/200526_assembly/fq/ ./

bam data (includes the cov3m reference sequence):
aws s3 cp --recursive s3://serratus-public/notebook/200526_assembly/bam/ ./
High Coverage Porcine Epidemic Virus in Sus scrofa
- SRR10829953: ~180K reads to KT323979.1
- SRR10829957: ~195K reads to KP728470.1

Low Coverage Infectious Bronchitis Virus in Sus scrofa
- SRR10951656: ~4x coverage to MH878976.1
- SRR10951660: ~1x coverage to MH878976.1

Experimental SARS/MERS infection of Vero Cell Line
- SRR1194066: ~16K read coverage to KF600647.1
- SRR1168901: ~2.6K reads, low coverage to KF600647.1

Frank - Unidentified Bat Alphacoronavirus
- ERR2756788: ~8K mapped coverage, closest hit is fragment EU769558.1

Ginger - Unidentified Feline Coronavirus
- SRR7287110: ~46K mapped coverage to various feline CoV, closest hit is MN165107.1

I will download all of these accessions and host them in the S3 bucket so we can access them quickly and not have to go through the SRA each time. Give me an hour or two.