Skip to content

Latest commit

 

History

History
88 lines (63 loc) · 3.14 KB

meta_assembly.md

File metadata and controls

88 lines (63 loc) · 3.14 KB

Metagenome Assembly

In this tutorial you'll learn how to inspect the quality of High-throughput sequencing and perform a metagenomic assembly.

We will use data under the accession SRS018585 in the Sequence Read Archive. this sample is "a Human Metagenome sample from G_DNA_Anterior nares of a male participant in the dbGaP study HMP Core Microbiome Sampling Protocol A (HMP-A)"

Table of Contents

Softwares Required for this Tutorial

Getting the Data

wget http://downloads.hmpdacc.org/data/Illumina/anterior_nares/SRS018585.tar.bz2
tar xjf SRS018585.tar.bz2
cd SRS018585

Quality Control

we'll use FastQC to check the quality of our data. FastQC can be downloaded and ran on a Windows or LINUX computer without installation. It is available here

Start FastQC and select the fastq files you just downloaded with file -> open

What is the average read length? The average quality?

Now we'll trim the reads using sickle

sickle pe -f SRS018585.denovo_duplicates_marked.trimmed.1.fastq \
-r SRS018585.denovo_duplicates_marked.trimmed.2.fastq -t sanger \
-o SRS018585_trimmed_1.fastq -p SRS018585_trimmed_2.fastq -s unpaired.fastq

sickle normally gives you a summary of how many reads were trimmed.

Assembly

SPAdes will be used for the assembly. Since version 3.7, SPAdes includes a metagenomic version of its algorithm, callable with the option --meta

spades.py --meta -1 SRS018585_trimmed_1.fastq -2 SRS018585_trimmed_2.fastq -t 8 -o assembly

the resulting assenmbly can be found under assembly/scaffolds.fasta. How many contigs does this assembly contain? How long is the longest contig and to what organism does it belong to?

Taxonomic Classification and Visualization

For the vizualisation of the assembly we will use a tool called blobtools. Blobtools produces "Taxon annotated GC-coverage plots" (TAGC) and was orignially made for the visualisation of (draft) genome assemblies.

mkdir blobtools && cd $_
blastn -num_threads 8 -db nt -query ../assembly/scaffolds.fasta -out blastresults.txt -outfmt '6 qseqid staxids bitscore'

This blast step is necessary to obtain the taxonomic information of your contigs. It might take a while. Be patient!

blobtools create -i ../assembly/scaffolds.fasta -y spades -t blastresults.txt \
    --nodes /export/databases/taxonomy/nodes.dmp \
    --names /export/databases/taxonomy/names.dmp \
    -o scaffolds --title SRS018585
blobtools plot -i scaffolds.blob.BlobDB.json -o scaffolds --title -r family

Inspect the plot, what is the most abundant families? try to play with the parameters (especially -r)