- Tom Delmont, Antti Karkman, Jenni Hultman*
Anvio is an analysis and visualization platform for omics data. You can read more from Anvio's webpage.
Go to your course folder and make a new folder called ANVIO. All task on this section are to be done in this folder.
mkdir ANVIO
cd ANVIO
We need to do some tricks for the contigs from assembly before we can use them on Friday. You'll need to load bioconda and activate Anvio.
Open a screen for Anvi'o workflow. screen -S anvio
# allocate resources and log in to a computing node
salloc -n 1 --mem=10000 -t 06:00:00 -p serial
srun --pty $SHELL
# load bioconda and activate Anvi'o environment
module load bioconda/3
source activate anvio3
Anvio wants sequence IDs in your FASTA file as simple as possible. Therefore we need to reformat the headerlines to remove spaces and non-numeric characters. Also contigs shorter than 2500 bp will be removed.
anvi-script-reformat-fasta ../co-assembly/final.contigs.fa -l 2500 --simplify-names --prefix MEGAHIT_co_assembly -r REPORT -o MEGAHIT_co-assembly_2500nt.fa
Deattach from the screen with Ctrl a+d
Next thing to do is mapping all the reads back to the assembly. We use the renamed >2,500 nt contigs and do it sample-wise, so each sample is mapped separately using the trimmed R1 & R2 reads.
We will need to two scripts for that, one for the actual mapping and another to run it as an array job. Save both scripts to your scripts
folder.
But before doing that we have make a bowtie2 index from the contig file. Run the following command:
bowtie2-build MEGAHIT_co-assembly_2500nt.fa co-assembly
co-assembly
is the base name for the resulting index files.
The mapping script from Tom, modified to be used in Taito:
#!/bin/bash
# example run: ./bowtie2-map-batch.sh SAMPLE1_R1.fastq SAMPLE1_R2.fastq SAMPLE1 bt2_index
# $1: Forward reads
# $2: Reverse reads
# $3: Sample name
# $4: Bowtie2 index
set -e
bowtie2 --threads $SLURM_CPUS_PER_TASK -x $4 -1 $1 -2 $2 -S $3.sam --no-unal
samtools view -F 4 -bS $3.sam > $3-RAW.bam
samtools sort $3-RAW.bam -o $3.bam
samtools index $3.bam
rm $3.sam $3-RAW.bam
The array job script:
#!/bin/bash -l
#SBATCH -J array_map
#SBATCH -o array_map_out_%A_%a.txt
#SBATCH -e array_map_err_%A_%a.txt
#SBATCH -t 00:30:00
#SBATCH --mem-per-cpu=1000
#SBATCH --array=1-6
#SBATCH -n 1
#SBATCH --cpus-per-task=6
#SBATCH -p serial
cd $WRKDIR/BioInfo_course/co-assembly
# we need Bowtie2 from the biokit
module load biokit
# each job will get one sample from the sample names file
name=$(sed -n "$SLURM_ARRAY_TASK_ID"p ../sample_names.txt)
# run mapping script for each sample
bash ../scripts/bowtie2-map-batch.sh ../trimmed_data/$name"_R1_trimmed.fastq" \
../trimmed_data/$name"_R2_trimmed.fastq" \
$name ../ANVIO/co-assembly
Then again submit the array job with sbatch
.
During luch break check what happens in the different steps in the mapping script.
bowtie2
samtools view
samtools sort
samtools index
Reattach to your Anvi'o screen
screen -r anvio
Contigs database (contigs.db) contains information on contig length, open reading frames (searched with Prodigal) and kmers. See Anvio webpage for more information.
anvi-gen-contigs-database -f MEGAHIT_co-assembly_2500nt.fa -o MEGAHIT_co-assembly_2500nt_CONTIGS.db -n MEGAHIT_co-assembly
anvi-run-hmms -c MEGAHIT_co-assembly_2500nt_CONTIGS.db -T 6
While the HMM identification is running, deattach from the screen and check if the mapping has been done.
squeue -l -u $USER
When the mapping is done for all samples and the contigs database is ready, we can profile the samples using the DB and the mapping output. Write an array script for the profiling and submit it to the queue.
#!/bin/bash -l
#SBATCH -J array_profiling
#SBATCH -o array_profiling_out_%A_%a.txt
#SBATCH -e array_profiling_err_%A_%a.txt
#SBATCH -t 00:30:00
#SBATCH --mem-per-cpu=1000
#SBATCH --array=1-6
#SBATCH -n 1
#SBATCH --cpus-per-task=6
#SBATCH -p serial
cd $WRKDIR/BioInfo_course/co-assembly
# we need to load Bioconda
module load bioconda/3
# then activate Anvi'o
source activate anvio3
# and also load the biokit, since Anvi'o uses samtools in the profiling
module load biokit
# each job will get one sample from the sample names file
name=$(sed -n "$SLURM_ARRAY_TASK_ID"p ../sample_names.txt)
# then the actual profiling
anvi-profile -c ../ANVIO/MEGAHIT_co-assembly_2500nt_CONTIGS.db -M 2500 -T $SLURM_CPUS_PER_TASK -i $name.bam -o $name
Submit the job with sbatch
as previously.
Next we annotate genes in contigs database with functions from the NCBI’s Clusters of Orthologus Groups (COGs). Again first reattach to your Anvio'o screen.
anvi-run-ncbi-cogs -c MEGAHIT_co-assembly_2500nt_CONTIGS.db -T 6
With this command we export the genecalls from Prodigal to gene-calls.fa and do taxonomic annotation against centrifuge database you installed on Wednesday
anvi-get-dna-sequences-for-gene-calls -o gene-calls.fa -c MEGAHIT_co-assembly_2500nt_CONTIGS.db
“classification engine that enables rapid, accurate and sensitive labeling of reads and quantification of species on desktop computers”. Read more from here.
Remember to set the environmental variable pointing to the centrifuge folder as shown in MetagenomeInstallations.
centrifuge -f -p 6 -x $CENTRIFUGE_BASE/p+h+v gene-calls.fa -S centrifuge_hits.tsv
anvi-import-taxonomy -i centrifuge_report.tsv centrifuge_hits.tsv -p centrifuge -c MEGAHIT_co-assembly_2500nt_CONTIGS.db
# run DIAMOND
diamond blastx -d ~/appl_taito/CARD/CARD -q gene-calls.fa \
--max-target-seqs 1 -o CARD.out -f 6 --id 90 --min-orf 20 -p 6 --masking 0
# parse the output for Anvi'o
printf "gene_callers_id\tsource\taccession\tfunction\te_value\n" > CARD_functions.txt
awk '{print $1"\tCARD\t\t"$2"\t"$11}' CARD.out >> CARD_functions.txt
# and finally import the functions to Anvi'o contigs DB
anvi-import-functions -c MEGAHIT_co-assembly_2500nt_CONTIGS.db -i CARD_functions.txt
When the profiling is done, you can merge them with one command.
anvi-merge ../co-assembly/*/PROFILE.db -o SAMPLES-MERGED -c MEGAHIT_co-assembly_2500nt_CONTIGS.db
Open a new ssh window. In mac:
ssh -L 8080:localhost:8080 YOUR_USERNAME@taito.csc.fi
in Windows with Putty: In SSH tab select "tunnels". Add
Source port: 8080
Destination: localhost:8080
Click add and log in to Taito.
Activate anvio
anvi-interactive -c MEGAHIT_co-assembly_2500nt_CONTIGS.db -p SAMPLES-MERGED/PROFILE.db --server-only -P 8080
Then open google chrome and go to address