Assembly Data
s3://lovelywater/ # A Read-Only Archive of Serratus Data Releases
├── assembly/ # Viral assembly and annotation data
│   ├── cov/ # .fasta : Assembled/filtered coronaviruses
│   ├── contigs/ # CoronaSPAdes output, contigs, graphs, stats...
│   └── annotation/ # CoV annotation and taxonomic assignments
The cov/ folder holds the 11,120 coronavirus assemblies made with coronaSPAdes. Their contigs were filtered using either CheckV or coronaSPAdes' bgc-statistics. See the Serratus manuscript for more details.
SRRXXXXXX.[assembler].assembly_graph_with_scaffolds.gfa.gz
SRRXXXXXX.[assembler].bgc_statistics.txt
SRRXXXXXX.[assembler].contigs.fa.mfc
SRRXXXXXX.[assembler].domain_graph.dot
SRRXXXXXX.[assembler].gene_clusters.fa
SRRXXXXXX.[assembler].scaffolds.fasta.gz
SRRXXXXXX.[assembler].scaffolds.paths
SRRXXXXXX.[assembler].log
SRRXXXXXX.[assembler].txt
All of the files above are [assembler] outputs stored in contigs/, where [assembler] is either coronaSPAdes or rnaviralSPAdes.
Depending on the assembler, only a subset of these files is present for each accession.
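As a minimal sketch of the naming scheme above, the following helper expands an accession and an assembler label into every file name that may exist in contigs/. The suffixes are copied from the listing. The accession `SRR0000001` and the spelling `coronaspades` for the `[assembler]` token are placeholders assumed for illustration; check the bucket for the exact spelling used.

```python
from typing import List, Tuple

# Suffixes copied from the listing above; only a subset exists per accession.
CONTIG_SUFFIXES = [
    "assembly_graph_with_scaffolds.gfa.gz",
    "bgc_statistics.txt",
    "contigs.fa.mfc",
    "domain_graph.dot",
    "gene_clusters.fa",
    "scaffolds.fasta.gz",
    "scaffolds.paths",
    "log",
    "txt",
]


def candidate_contig_files(accession: str, assembler: str) -> List[str]:
    """Return every file name that may exist for one accession in contigs/."""
    return [f"{accession}.{assembler}.{suffix}" for suffix in CONTIG_SUFFIXES]


def split_contig_name(filename: str) -> Tuple[str, str, str]:
    """Split 'ACCESSION.ASSEMBLER.SUFFIX' into its three parts."""
    accession, assembler, suffix = filename.split(".", 2)
    return accession, assembler, suffix


# Hypothetical accession and assembler label, for illustration only.
for name in candidate_contig_files("SRR0000001", "coronaspades"):
    print(name)
```

Because only a subset of files is present for each accession, treat the returned names as candidates to look up, not as a guarantee that each object exists.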
Beware: contigs.fa.mfc actually contains the content of coronaSPAdes' scaffolds.fasta compressed with MFCompress.
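To make the naming caveat concrete, here is a hedged sketch that builds a decompression command and names the output after what the file really holds (scaffolds, not contigs). It assumes the MFCompress decompressor is installed as `MFCompressD` and accepts `-o` for the output file, as described in the MFCompress documentation; verify both against your installed version. The function name and the example file name are hypothetical.

```python
from pathlib import Path
from typing import List

MFC_SUFFIX = ".contigs.fa.mfc"


def mfc_decompress_command(mfc_path: str, binary: str = "MFCompressD") -> List[str]:
    """Build (but do not run) a command that decompresses a .contigs.fa.mfc file.

    The output is named *.scaffolds.fasta because the archive holds the
    assembler's scaffolds, despite the 'contigs' in its name.
    """
    src = Path(mfc_path)
    if not src.name.endswith(MFC_SUFFIX):
        raise ValueError(f"expected a file ending in {MFC_SUFFIX}: {src.name}")
    out_name = src.name[: -len(MFC_SUFFIX)] + ".scaffolds.fasta"
    # Assumption: MFCompressD takes '-o <output>' followed by the input file.
    return [binary, "-o", str(src.with_name(out_name)), str(src)]


# Hypothetical file name, for illustration only.
print(mfc_decompress_command("SRR0000001.coronaspades.contigs.fa.mfc"))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the binary is on your PATH.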
The annotation/ folder contains the annotation results of several programs applied to different inputs.
CheckV applied to the scaffolds.fasta and/or gene_clusters.fasta:
SRRXXXXXX.[assembler].checkv.completeness.tsv.gz
SRRXXXXXX.[assembler].checkv.contamination.tsv.gz
SRRXXXXXX.[assembler].checkv.quality_summary.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.completeness.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.contamination.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.quality_summary.tsv.gz
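The quality_summary files are gzip-compressed, tab-separated tables, so they can be read with the Python standard library alone. The sketch below assumes the standard CheckV column names `contig_id` and `checkv_quality` and the usual CheckV quality tiers. The tiers kept here are an illustrative choice, not necessarily the filter Serratus applied, and the function names are hypothetical.

```python
import csv
import gzip
from typing import Dict, Iterable, List


def read_quality_summary(path: str) -> List[Dict[str, str]]:
    """Read a CheckV quality_summary.tsv.gz file into a list of row dicts."""
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def select_by_quality(
    rows: Iterable[Dict[str, str]],
    keep: Iterable[str] = ("Complete", "High-quality"),
) -> List[str]:
    """Return the contig_id of every row whose checkv_quality is in `keep`."""
    wanted = set(keep)
    return [row["contig_id"] for row in rows if row["checkv_quality"] in wanted]
```

For example, `select_by_quality(read_quality_summary(path))` returns the contig identifiers that CheckV rated Complete or High-quality in one downloaded file.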
serraplace (phylogenetic placement) output of CheckV-filtered gene clusters:
SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serraplace.tar.gz
serratax (taxonomic identification) output of CheckV-filtered gene clusters:
SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serratax.final
SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serratax.tar.gz
The following files annotate the assemblies in cov/. They include the outputs of Darth, a pipeline created within Serratus for annotation of coronavirus assemblies.
SRRXXXXXX.fa.darth.alignments.fasta
SRRXXXXXX.fa.darth.alignments.sto
SRRXXXXXX.fa.darth.input_md5
SRRXXXXXX.fa.darth.stripped.tar.gz
SRRXXXXXX.fa.darth.tar.gz
SRRXXXXXX.fa.darth.transeq.alignments.fasta
SRRXXXXXX.fa.serraplace.tar.gz
SRRXXXXXX.fa.serratax.final
SRRXXXXXX.fa.serratax.tar.gz
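Since annotation/ mixes outputs from several tools and two kinds of input, a small classifier can help when sorting a directory listing. This is a sketch based only on the file-name patterns listed above; the function names and the example file names are hypothetical.

```python
import re
from typing import Optional

# A tool name counts only when it is a whole dot-separated field,
# so 'checkv_filtered' is not mistaken for CheckV output.
TOOL_PATTERN = re.compile(r"\.(checkv|serraplace|serratax|darth)(?:\.|$)")


def annotation_tool(filename: str) -> Optional[str]:
    """Return the tool that produced an annotation file, or None if unknown."""
    match = TOOL_PATTERN.search(filename)
    return match.group(1) if match else None


def annotates_cov_assembly(filename: str) -> bool:
    """True for 'SRRXXXXXX.fa.<tool>...' names, which annotate cov/ assemblies."""
    parts = filename.split(".")
    return len(parts) > 2 and parts[1] == "fa"


# Hypothetical file names, for illustration only.
for name in [
    "SRR0000001.fa.darth.tar.gz",
    "SRR0000001.coronaspades.gene_clusters.checkv_filtered.fa.serratax.final",
]:
    print(name, annotation_tool(name), annotates_cov_assembly(name))
```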
See also: Accessing Serratus Data