Skip to content

Latest commit

 

History

History
740 lines (735 loc) · 85.2 KB

Module6_Metatranscriptomics_Tutorial_MIC-2022.md

File metadata and controls

740 lines (735 loc) · 85.2 KB

MetaPro Metatranscriptomics Practical Lab

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International. This means that you are able to copy, share and modify the work, as long as the result is distributed under the same license.

This tutorial was developed by Billy Taj (billy.taj@sickkids.ca), Mobolaji Adeolu (adeolum@mcmaster.ca), John Parkinson (john.parkinson@utoronto.ca) & Xuejian Xiong (xuejian@sickkids.ca), and updated by Ana Popovic for the IMPACTT/IMC Bioinformatics Workshop, 2022.

Overview

This tutorial will take you through the MetaPro pipeline for processing and annotation of metatranscriptomic data (https://doi.org/10.1101/2021.02.23.432558). The pipeline was developed by the Parkinson lab, and consists of the following steps:

  1. Remove adapter sequences (added during library preparation for sequencing) and low-quality reads.
  2. Remove duplicate reads to reduce processing time for downstream steps.
  3. Remove vector contamination (reads derived from cloning vectors, spike-ins, and primers).
  4. Remove host reads (if exploring a host-derived microbiome).
  5. Remove rRNA sequences, which typically dominate metatranscriptomic datasets.
  6. Repopulate duplicate reads (removed in step 2).
  7. Assemble reads into contigs, and predict ORFs.
  8. Annotate ORFs to known genes and proteins.
  9. Classify reads to known taxonomic groups, and visualize the taxonomic composition of your dataset.
  10. Predict enzyme function.
  11. Generate output files, including gene annotations with normalized gene expression values, and enzyme predictions.
  12. Visualize metabolic pathway activity, as mapped onto KEGG-defined pathways, within Cytoscape.

The MetaPro metatranscriptomics pipeline includes a series of Python scripts that handle the orchestrated invocation of existing bioinformatics tools, read-conflict resolution, file format conversion, and output parsing. We will go through these steps to illustrate the complexity of the process and the underlying tools and scripts. The pipeline is designed to run all steps automatically in succession. However, a tutorial mode is available to allow running of single steps, which we will employ in this tutorial for the purpose of learning.

New, faster, and/or more accurate tools are being developed all the time, and it is worth bearing in mind that any pipeline needs to be flexible to incorporate these tools as they are adopted as standards by the community. For example, over the past two years, our lab has transitioned from using Trimmomatic to AdapterRemoval, and from BLAST to DIAMOND.

To illustrate the process, we are going to use sequenced reads generated from the colon contents of a mouse. The original dataset contains 25 million 150 bp single-end reads. Rather than use the entire set, which might take several days to process on a desktop, the tutorial will take you through processing a subset of 100,000 single-ended reads. Paired-end reads can also be used, and are often preferred because they can improve annotation quality when there is enough overlap in read pairs to improve the effective average read length. Working with paired-end data involves an additional data processing step (merging of overlapping reads) and produces more files (files for merged/singleton reads, forward reads and reverse reads), but the structure of the remaining steps is the same as that described below.

Note: The purpose of this tutorial is to demonstrate MetaPro's various steps. The pipeline is fully capable of running all steps without the intervention of the user, beyond an initial call to the program. If you wish to use automated MetaPro pipeline, do not use the --tutorial option.

Preliminaries

i. Launch the MetaPro pipeline

MetaPro operates in a containerized environment running Ubuntu Linux 18.04. The pipeline and its dependent programs exist within a Docker container image, and may be accessed using the Docker or Singularity tools. For the purposes of this tutorial, Docker and Singlarity have both already been installed, and the MetaPro pipeline has been downloaded to your environment.

Docker and Singularity maintain different access modes to use their containers.

  1. Scripted-mode: where the user can call software within the container, to run it.
  2. Interactive-mode: where the user can enter into the container and use it like an operating system.

In this tutorial, we will be using Singularity in interactive mode to run MetaPro commands. The command to download and create the MetaPro image (below) has already been run for you:

    singularity pull docker://parkinsonlab/metapro:develop [DO NOT RUN]



Navigate to your workspace and create a new tutorial folder (we will be using the absolute path):

cd /media/data/workspace  
mkdir metapro_tutorial/  
cd metapro_tutorial  

Launch MetaPro using Singularity in interactive mode (using the command: singularity shell path/to/pipline metapro.sif

singularity shell /media/data/CourseData/Module6/Tools/metapro_develop.sif



To access MetaPro using Docker, we include instructions below: (not required for the current tutorial)

Install Docker:
https://www.docker.com/products/docker-desktop

Next, pull the MetaPro docker image:
   docker pull parkinsonlab/metapro:develop

Launch MetaPro within the Docker interactive mode:

    docker run -it -v :

    An example would be:
    docker run -it -v /home/ubuntu/MetaPro_tutorial:/MetaPro_docker_tutorial parkinsonlab/metapro:develop

ii. Download the data

Our data set consists of 150 bp single-end Illumina reads generated from mouse colon contents. Download the data and precomputed files:

wget https://github.com/ParkinsonLab/MetaPro_tutorial/releases/download/1.0/tutorial_files.tar.gz

Unzip the data folder, and view contents:

tar -xzvf tutorial_files.tar.gz  
ls

The downloaded contents include:

  • the input sequence file mouse1.fastq
  • the MetaPro configuration file containing paths to tools and databases, as well as parameters config_mouse_tutorial.ini (see below for description)
  • a folder containing databases required for the tutorial databases\
  • the output directory containing some precomputed files mouse1_run\
  • an example Cytoscape file which may be generated with MetaPro to view microbial metabolic pathway activity Example.cys

MetaPro's tools may take a long time to run if the user does not have the necessary computing resources. Therefore, we provide pre-computed output files (within the mouse1_run/ folder) so that the user is not forced to run computationally intensive steps during the tutorial.

Change the permissions of the mouse1_run/ folder in order to view its contents through a browser:

chmod -R 777 mouse1_run

Access your workspace in a web browser to view the output directory: (keep the tab open!)

http://[ insert student number ].uhn-hpc.ca/

iii. Edit the configuration file

MetaPro controls many of its features with a configuration file. It contains paths to tools and databases used by the program. These have already been pre-set for you in a modified config file located in the course data folder.

Copy the config file and preview the contents:

cp /media/data/CourseData/Module6/Data/config_mouse_tutorial.ini .
less config_mouse_tutorial.ini
  • verify that the path to the database folder is correct.
  • [Databases]
    database_path: /media/data/workspace/metapro_tutorial/databases
    UniVec_Core: %(database_path)s/univec_core/UniVec_Core.fasta #(the UniVec core database)
    Adapter: %(database_path)s/Trimmomatic_adapters/TruSeq3-PE-2.fa #(The adapters database)
    Host: %(database_path)s/Mouse_cds/Mouse_cds.fasta #(The host database)
    Rfam: %(database_path)s/Rfam/Rfam.cm #(The Infernal rRNA database)
    DNA_DB: %(database_path)s/ChocoPhlAn/ChocoPhlAn.fasta #(The BWA database.  It assumes the index is in the same directory)
    DNA_DB_Split: %(database_path)s/ChocoPhlAn/ChocoPhlAn_split/ #(The split database, for BLAT)
    ...
    
  • press q to exit
  • iv. Databases and licenses

    This tutorial relies on a few external databases and libraries to perform the filtering tasks associated with MetaPro.
    We have assembled the smaller databases in our precomputed files package

    The UniVec Core database
    A mouse host sequence database In this tutorial, we will use one from Ensembl

    There are optional databases that are mentioned in this tutorial. Due to their size, they will not be used here. It is highly recommended, however, that they be downloaded and used during a proper MetaPro run:

    These optional databases require indexing prior to use.

    • MetaPro requires 2 versions of ChocoPhlan:
      • One with all of the sequences in a single file, for BWA
      • One with all of the sequences separated, for BLAT

    The commands used to build the indexed databases are as follows [DO NOT RUN]

    • To unpack chocophlan and combine all of the sequences:
      • tar -xvf chocophlan.tar.gz && cd chocophlan
      • for i in $(ls | grep ".gz"); do gunzip $i; done
      • for i in $(ls | grep ".m8"); do cat $i >> chocophlan_full.fasta
      • mv chocophlan_full.fasta ..
    • To prepare ChocoPhlAn for BWA:
      • bwa index -a bwtsw /path/to/chocophlan_full.fasta
      • samtools faidx /path/to/chocophlan_full.fasta
    • To prepare NR for DIAMOND:
      • diamond makedb -p 8 --in /path/to/nr -d /path/to/nr
    • To prepare Kaiju:
      • /pipeline_tools/kaiju/makeDB.sh -r /destination/path/for/KaijuDB
    • Additionally, a license from MetaGeneMark is required to run to the contig assembly step

      v. Store paths to the configuration file and the output folder in variables

      config=/media/data/workspace/metapro_tutorial/config_mouse_tutorial.ini
      output=/media/data/workspace/metapro_tutorial/mouse1_run
      

      These paths will be used in running MetaPro commands, structured as follows: python3 /pipeline/MetaPro.py -c $config -s [sequence file] -o $output --tutorial [processing step]

      Verify the variables:

      echo $config
      echo $output
      

      Notes:

      All MetaPro steps share the same file directory scheme:

      • data: location of interim files for each step.
      • final results: location of final outputs for each step.
      • All of MetaPro's commands are output in separate shell scripts in the relevant folders.

      vi. Inspect the sequences and check read quality with FastQC

      Inspect the sequences:

      less mouse1.fastq
      

      Notes:

      • Type q to exit less.

      Check the read quality with FastQC:

      /pipeline_tools/FastQC/fastqc mouse1.fastq
      

      The FastQC report is generated as an HTML file mouse1_fastqc.html. A zip file is also generated which includes data files used to generate the report.

      Open the FastQC report in a browser!

      You can find the following information in the report:

      • Basic Statistics: Basic information of the mouse RNA-seq data, e.g. the total number of reads, read length, GC content.
      • Per base sequence quality: An overview of the range of quality values across all bases at each position.
      • Per Base Sequence Content: A plot showing nucleotide bias across sequence length.
      • Adapter Content: Provides information on the level of adapter contamination in your sequence sample.

      Process the Reads

      Step 1: Remove adapter sequences and trim low quality sequences.

      In the first step, MetaPro removes adaptor sequences, trims low-quality reads, and removes duplicate reads in one pass.

      The format of the MetaPro command is:

          read1='<path to input sequence>'
          python3 /pipeline/MetaPro.py -c $config -s $read1 -o $output --tutorial quality

      Run the command:

      read1=/media/data/workspace/metapro_tutorial/mouse1.fastq
      python3 /pipeline/MetaPro.py -c $config -s $read1 -o $output --tutorial quality
      

      In this Quality-filtering stage, MetaPro will perform several actions:

      • filter reads below a quality score of 75
      • filter reads below a minimum length of 30 bp
      • remove adapters
      • remove duplicate reads within the dataset


      Question 1.1: How many low quality sequences have been removed?



      Use FastQC to check the quality of the reads filtered for low quality bases and short length:

      /pipeline_tools/FastQC/fastqc mouse1_run/quality_filter/data/4_quality_filter/singletons_hq.fastq
      

      Navigate to the ~/4_quality_filter/ directory in your browser to view the HTML report.

      Compare with the previous report to see changes in the following sections:

      • Basic Statistics
      • Per base sequence quality



      Question 1.2: How has the per read sequence quality curve changed in the final filtered output?



      Use FastQC to check the quality of the final filtered output:

      cd mouse1_run/quality_filter/final_results/
      /pipeline_tools/FastQC/fastqc singletons_hq.fastq
      cd ../../..
      

      Compare with the previous reports to see changes in the following sections:

      • Basic Statistics
      • Per base sequence quality
      • Per sequence quality

      Notes on read quality filtering

      AdapterRemoval, which was used to remove the adapters and trim low quality bases in the reads, uses a sliding window method to remove contigous regions of low quality bases in reads. However, it is worthwhile to impose an overall read quality threshold to ensure that all reads being used in our analyses are of sufficiently error-free. For this we use the tool VSEARCH which can be found at this website (when processing paired-end data, this step should come after the read merging step):

      Example command only [DO NOT RUN]. MetaPro already automatically perfoms this task.
      vsearch --fastq_filter mouse1_trim.fastq --fastq_maxee 2.0 --fastqout mouse1_qual.fastq

      The command line parameters are: - --fastq_filter Instructs VSEARCH to use the quality filtering algorithm to remove low quality reads - --fastq_maxee 2.0 The expected error threshold. Set at 1. Any reads with quality scores suggesting that the average expected number of errors in the read are greater than 1 will be filtered. - --fastqout Indicates the output file contains the quality filtered reads


      Step 2. Remove duplicate reads

      To significantly reduce the amount of computating time required for identification and filtering of rRNA reads, we perform a dereplication step to remove duplicated reads using the software tool CD-HIT which can be obtained from this website.

      Example command only [DO NOT RUN]. MetaPro already calls this command as part of the Quality-filtering step
      /usr/local/prg/cd-hit-v4.6.7-2017-0501/cd-hit-auxtools/cd-hit-dup -i mouse1_qual.fastq -o mouse1_unique.fastq

      Notes:

      The command line parameters are: - -i: The input fasta or fastq file. - -o: The output file containing dereplicated sequences, where a unique representative sequence is used to represent each set of sequences with multiple replicates. A second output file mouse1_unique.fastq.clstr is created which shows exactly which replicated sequences are represented by each unique sequence in the dereplicated file and a third, empty, output file, mouse1_unique.fastq2.clstr is also created which is only used for paired-end reads.


      Question 2.1: Can you find how many unique reads there are?



      Navigate to mouse1_run/quality_filter/final_results/ to view the FastQC report, or look at the generated singletons.fastq file itself in the output directory.

      While the number of removed duplicate reads in this small dataset is relatively low, this step can reduce file size by as much as 50-80% in larger datasets.


      Step 3. Remove vector contamination

      To identify and filter reads from sources of vector, adapter, linker, and primer contamination we use the Burrows Wheeler aligner (BWA) and the BLAST-like alignment tool (BLAT) to search against a database of cow sequences. As a reference database for identifying contaminating vector and adapter sequences we rely on the UniVec_Core dataset which is a fasta file of known vectors and common sequencing adapters, linkers, and PCR Primers derived from the NCBI Univec Database.

      The format of the MetaPro command to perform vector filtering is:

          read1='<path to your quality filter final results mouse.fastq>'
          python3 /pipeline/MetaPro.py -c $config -s $read1 -o $output --tutorial vector

      Run the command:

      read1=/media/data/workspace/metapro_tutorial/mouse1_run/quality_filter/final_results/singletons.fastq
      python3 /pipeline/MetaPro.py -c $config -s $read1 -o $output --tutorial vector
      

      MetaPro will automatically run the following:

      • Index the vector contamination database.
            bwa index -a bwtsw UniVec_Core
            samtools faidx UniVec_Core
            makeblastdb -in UniVec_Core -dbtype nucl
      • Use BWA to filter out reads aligning to the vector contamination database.
            bwa mem -t 4 UniVec_Core mouse1_unique.fastq > mouse1_univec_bwa.sam
            samtools view -bS mouse1_univec_bwa.sam > mouse1_univec_bwa.bam
            samtools fastq -n -F 4 -0 mouse1_univec_bwa_contaminats.fastq mouse1_univec_bwa.bam
            samtools fastq -n -f 4 -0 mouse1_univec_bwa.fastq mouse1_univec_bwa.bam
      • Use BLAT to filter out any remaining reads that align to the vector contamination database. Sicne BLAT only accepts fasta files, VSEARCH is used to first convert the reads from fastq to fasta.
            vsearch --fastq_filter mouse1_univec_bwa.fastq --fastaout mouse1_univec_bwa.fasta
            blat -noHead -minIdentity=90 -minScore=65 UniVec_Core mouse1_univec_bwa.fasta -fine -q=rna -t=dna -out=blast8 mouse1_univec.blatout
      • Lastly, a python script is used to filter the reads that BLAT does not confidently align to any vector sequences.
            python3 /pipeline/Scripts/read_BLAT_Filter_v3.py single high mouse1_univec_bwa.fastq mouse1_univec.blatout mouse1_univec_blat.fastq mouse1_univec_blat_contaminats.fastq

      Notes:

      The commands do the following tasks:
      - bwa index, samtools faidx, and makeblastdb: Index the UniVec core database for BWA and BLAT
      - bwa mem: Generates alignments of reads to the vector contaminant database
      - samtools view: Converts the .sam output of bwa into .bam for the following steps
      - samtools fastq: Generates fastq outputs for all reads that mapped to the vector contaminant database (-F 4) and all reads that did not map to the vector contaminant database (-f 4)

      The command line parameters for BLAT are:
      - -noHead: Suppresses .psl header (so it's just a tab-separated file).
      - -minIdentity: Sets minimum sequence identity is 90%.
      - -minScore: Sets minimum score is 65. This is the matches minus the mismatches minus some sort of gap penalty.
      - -fine: For high-quality mRNAs.
      - -q: Query type is RNA sequence.
      - -t: Database type is DNA sequence.

      The argument structure for the final python script is:
          read_BLAT_Filter_v3.py <operating mode: either "single" or "paired"> <filter stringency. to handle paired-read conflicts. "low" or "high"≶ <Input_Reads.fq> <BLAT_Output_File≶ <Unmapped_Reads_Output> <Mapped_Reads_Output>`

      Here, BLAT does not identify any additional sequences which align to the vector contaminant database. However, we have found that BLAT is often able find alignments not identified by BWA, particularly when searching against a database consisting of whole genomes.

      In handling paired-ended data, cases will arise where one read maps to a vector, while the pair does not. The filter stringency decides how to resolve such cases:

      • Low filter stringency will only remove reads where both pairs aligned to a vector.
      • High filter stringency will remove reads where either pair aligned to a vector.



      Question 3.1: Can you find how many reads BWA mapped to the vector database?


      Step 4. Remove host reads

      To identify and filter host reads (here, reads of mouse origin) we repeat the steps above using a database of mouse DNA sequences. For our purposes we use a mouse genome database downloaded from Ensembl.

      The format of the MetaPro command is:

          read1='<path to your vector filter final results mouse.fastq>'
          python3 /pipeline/MetaPro.py -c $config -s $read1 -o $output --tutorial host

      Run the command:

      read1=/media/data/workspace/metapro_tutorial/mouse1_run/vector_read_filter/final_results/singletons.fastq  
      python3 /pipeline/MetaPro.py -c $config -s $read1 -o $output --tutorial host  
      

      This call will perform the following steps:

      • Prepare the host database for alignment (BWA + BLAT)
      • perform alignment using BWA
      • convert the unaligned reads from BWA to a format for BLAT
      • perform alignment of the unaligned reads using BLAT
      • Run a script to remove the host reads from the input sample

      The following are the commands run by the script:
          bwa index -a bwtsw mouse_cds.fa
          samtools faidx mouse_cds.fa
          makeblastdb -in mouse_cds.fa -dbtype nucl
          bwa mem -t 4 mouse_cds.fa mouse1_univec_blat.fastq > mouse1_mouse_bwa.sam
          samtools view -bS mouse1_mouse_bwa.sam > mouse1_mouse_bwa.bam
          samtools fastq -n -F 4 -0 mouse1_mouse_bwa_contaminats.fastq mouse1_mouse_bwa.bam
          samtools fastq -n -f 4 -0 mouse1_mouse_bwa.fastq mouse1_mouse_bwa.bam
          vsearch --fastq_filter mouse1_mouse_bwa.fastq --fastaout mouse1_mouse_bwa.fasta
          blat -noHead -minIdentity=90 -minScore=65 mouse_cds.fa mouse1_mouse_bwa.fasta -fine -q=rna -t=dna -out=blast8 mouse1_mouse.blatout
          ./1_BLAT_Filter.py mouse1_mouse_bwa.fastq mouse1_mouse.blatout mouse1_mouse_blat.fastq mouse1_mouse_blat_contaminats.fastq


      Question 4.1: How many reads did BWA and BLAT align to the mouse host sequence database?


      Optional: In your own future analyses you can choose to complete steps 3 and 4 simultaneously by combining the vector contamination database and the host sequence database using cat UniVec_Core mouse_cds.fa > contaminants.fa. However, doing these steps together makes it difficult to tell how much of your reads came specifically from your host organism.


      Step 5. Remove abundant rRNA sequences *** [DO NOT RUN]

      rRNA genes tend to be highly expressed in all samples and must therefore be screened out to avoid lengthy downstream processing times for the assembly and annotation steps. MetaPro uses Barrnap and Infernal. You could use sequence similarity tools such as BWA or BLAST for this step, but we find Infernal, albeit slower, is more sensitive as it relies on a database of covariance models (CMs) describing rRNA sequence profiles based on the Rfam database. Due to the reliance on CMs, Infernal, can take as much as 4 hours for ~100,000 reads on a single core. In an effort to shrink the computing time, we leverage a computing cluster's multiple cores.

      Here, MetaPro demonstrates the case for automation. MetaPro subdivides the input data, coordinates the concurrent processes, and collects the results into one single file after all of the scanning has been complete.

      MetaPro will perform the following:

      • subdivide the input data into user-defined chunk sizes (e.g. 1000 reads).
      • Each chunk is then run independently:
        • run each chunk through Barrnap
        • using the results of Barrnap, filter the data chunk into mRNA, and leftover data for further scanning.
        • run each leftover chunk through Infernal.
        • filter the Barrnap leftover chunk using the Infernal results, to get mRNA, and "other"
      • collect all of the data pieces (Barrnap mRNA, Infernal mRNA) into mRNA, and "other"

      By running things this way, the rRNA step takes 4 minutes (as recorded with a 40-core computing node with 200 GB RAM, and an rRNA chunksize of 1000 reads), but it requires significant computing power, memory, and storage space, not available on a typical desktop PC.

      If you were to run this on your own, you will need the RFam database.

      The format of the MetaPro command is:

          read1='<path to your host filter final results mouse.fastq>'
          python3 /pipeline/MetaPro.py -c $config -s $read1 -o $output --tutorial rRNA

      The command would look as follows: [DO NOT RUN]

      read1=/media/data/workspace/metapro_tutorial/mouse1_run/host_read_filter/final_results/singletons.fastq
      python3 /pipeline/MetaPro.py -c $config -s $read1 -o $output --tutorial rRNA
      

      We have provided you with the pre-computed results:

      ls mouse1_run/rRNA_filter/final_results
      

      Here, we only remove about a thousand reads that map to rRNA, but in some datasets rRNA may represent up to 80% of the sequenced reads.


      Question 5.1: How many rRNA sequences were identified? How many reads are now remaining?


      Step 6. Rereplication / duplicate repopulation

      After removing contaminants, host sequences, and rRNA, we need to replace the previously removed replicate reads back in our data set.

      The format of the MetaPro command is:

          read1='<path to your rRNA filter final results mouse.fastq>'
          python3 /pipeline/MetaPro.py -c $config -s $read1 -o $output --tutorial repop

      Run the command:

      read1=/media/data/workspace/metapro_tutorial/mouse1_run/rRNA_filter/final_results/mRNA/singletons.fastq
      python3 /pipeline/MetaPro.py -c $config -s $read1 -o $output --tutorial repop
      

      Now that we have filtered vectors, adapters, linkers, primers, host sequences, and rRNA, check read quality of putative mRNA reads with FastQC:

      /pipeline_tools/FastQC/fastqc mouse1_run/duplicate_repopulation/final_results/singletons.fastq
      


      Question 6.1: How many total contaminant, host, and rRNA reads were filtered out?


      Step 7. Contig assembly *** [DO NOT RUN]

      We have now identified the putative mRNA reads, and here we assemble the reads into contigs. Previous studies have shown that assembling reads into larger contigs significantly increases the ability to annotate them to known genes through sequence similarity searches. Here we will apply the SPAdes transcript assembly algorithm to our set of putative mRNA reads.

      The format of the MetaPro command is:

          read1='<path to your rereplicated mRNA final results mouse.fastq>'
          python3 /pipeline/MetaPro.py -c $config -s $read1 -o $output --tutorial contigs

      The command would look as follows: [DO NOT RUN]

      read1=/media/data/workspace/metapro_tutorial/mouse1_run/duplicate_repopulation/final_results/singletons.fastq
      python3 /pipeline/MetaPro.py -c $config -s $read1 -o $output --tutorial contigs
      

      The pre-computed output is located in the following directory:

      ls mouse1_run/assemble_contigs/final_results
      

      Notes: In this step, MetaPro does the following:

      • Assemble the reads into contigs using SPAdes
      • Use MetaGeneMark to predict the genes in these contigs
      • Use BWA to align the mRNA reads against these split-contigs to find out which reads were consumed by the process.
      • Produce a relational map of split-contig and their constituent reads.

      SPAdes assembles long contigs, but MetaPro requires that each contig only represent 1 gene. Thus the need to disassemble them, using MetaGeneMark MetaGeneMark requires the user to register and obtain a free license. Thus, we have provided the results in the precomputed files package.


      Question 7.1: How many reads were used in the assembled contigs?
      Hint: check how many are left in the 'singletons' (unassembled reads) file.


      Question 7.2: How many genes were predicted in the assembled contigs?
      Hint: try using the command tail contigs.fasta


      Step 8. Annotate reads to known genes/proteins *** [DO NOT RUN]

      Here we will attempt to infer the specific genes our putative mRNA reads originated from. In our pipeline we rely on a tiered set of sequence similarity searches of decreasing accuracy - BWA, BLAT, and DIAMOND. While BWA provides high stringency, sequence diversity that occurs at the nucleotide level results in few matches observed for these processes. Nonetheless it is quick. To account for sequence diversity that occurs at the nucleotide level, particularly in the absence of reference microbial genomes, we use a cascaded method involving two other tools: BLAT, and DIAMOND. BLAT provides a more sensitive alignment, along with quality scores to rank the matches. DIAMOND is used to provide more sensitive peptide-based searches, which are unaffected by sequence changes between strains.

      Since BWA and BLAT perform nucleotide searches, we rely on the ChocoPhlan pangenome database from The Huttenhower lab, which contains reference genomes for over 10000 microbial organisms in separate .ffn files. We create a merged copy of these sequences, and index it for BWA to use. We leave it in its separated state for BLAT to use.

      For DIAMOND searches we use the Non-Redundant (NR) protein database from the NCBI.

      This is a computationally intensive step. We employ our subdivision strategy here, similar to our design for rRNA removal, as seen below

      • The mRNA read data (contigs, remaining singletons, and remaining paired reads if applicable) are split into smaller sequence files, or 'chunks', (GA_chunksize in the configuration)
      • Each chunk is sent through BWA to be aligned against the ChocoPhlan database
      • Reads not annotated by BWA are isolated, and are sent through to BLAT.
      • Reads not annotated by BLAT are isolated, and are sent through to DIAMOND.
      • At each step, a map of annotated genes (or proteins) to constituent reads is produced.

      In all, this leaves us with (split into chunks):

      • a batch of BWA-annotated gene-to-read maps
      • a batch of BLAT-annotated gene-to-read maps
      • a batch of DIAMOND-annotated protein-to-read maps
      • a batch of reads unannotated by BWA, BLAT, and DIAMOND.

      These files are then merged using a custom script, and output into the GA_FINAL_MERGE folder (GA = "gene annotation"). Files include:

      • a gene map gene_map.tsv
      • unannotated reads GA_leftover_contigs.fasta GA_leftover_singletons.fasta
      • translated protein sequences of all genes annotated over the three annotation steps all_proteins.faa

      The format of the MetaPro command is:

          read1='<path to your unassembled singletons.fastq>'
          contig='<path to your contigs.fasta>'
          python3 /pipeline/MetaPro.py -c $config -s $read1 --contig $contig -o $output --tutorial GA

      The command would look as follows: [DO NOT RUN] This step is heavily computationally intensive.

      read1=/media/data/workspace/metapro_tutorial/mouse1_run/assemble_contigs/final_results/singletons.fastq
      contig=/media/data/workspace/metapro_tutorial/mouse1_run/assemble_contigs/final_results/contigs.fasta
      python3 /pipeline/MetaPro.py -c $config -s $read1 --contig $contig -o $output --tutorial GA
      

      We have provided pre-computed outputs in the following directory:

      ls mouse1_run/GA_FINAL_MERGE/final_results
      

      Notes:

      • The gene annotation step will create 4 different subdirectories: GA_BWA, GA_BLAT, GA_DIAMOND, and GA_FINAL_MERGE

      • MetaPro will take the top-quality hits of each tool to count towards annotation:

        • In BWA: the CIGAR string is decoded. Any hit with less than 90% match is rejected
        • In BLAT and DIAMOND: the only reads that pass annotation are reads with all 3 conditions satisfied:
          • A sequence identity score of greater than 85
          • An alignment length of greater than 65%
          • A bitscore of over 60
      • MetaPro makes a number of extra considerations to account for multi-mapped reads:

        • Every read scanned by BWA, BLAT, and DIAMOND is affixed with a quality score suffix taken from the aligner's report
        • In BWA: the alignment score is used.
        • In BLAT and DIAMOND: the bitscore is used.
        • Should there be a case where a read is annotated to multiple genes or proteins, each tool will use their quality scores to rank the hit
          • Alignment information is iterated through, from the top of the file down to the bottom.
          • The hit with the higher score is used for all cases.
          • If the scores are tied, then the incumbent hit is used.
        • This read-ranking is done on each chunk when the gene-to-read maps are formed.
        • There are further considerations made for paired-end annotation:
          • MetaPro views paired-end data as 2 copies of the same read
          • If there are disagreements between the forward and reverse read's annotation, the quality scores are used to resolve the conflict.
          • If both reads agree on the same gene or protein, the read is counted only once.
          • This paired-read conflict resolution is performed in the GA_FINAL_MERGE step.
      • Unless you are running this tutorial on a computing cluster, most systems do not have enough memory to handle indexing or searching large databases like ChocoPhlan (19GB) and nr (>60GB). The descriptions in this section are purely for your information. Please use our precomputed gene, protein, and read mapping files from the tar file tar -xzf tutorial_files.tar.gz


      Step 9. Taxonomic Classification *** [DO NOT RUN]

      Now that we have putative mRNA transcripts, we can begin to infer the origins of our mRNA reads. Firstly, we will attempt to use a reference based short read classifier to infer the taxonomic orgin of our reads. Here we will use Kaiju, Centrifuge, and our Gene Annotation results to generate taxonomic classifications for our reads based on a reference database. Kaiju can classify prokaryotic reads at speeds of millions of reads per minute using the proGenomes database on a system with less than 16GB of RAM (~13GB). Using the entire NCBI nr database as a reference takes ~43GB. Similarly fast classification tools require >100GB of RAM to classify reads against large databases.

      However, Kaiju still takes too much memory for the systems in the workshop so we have precompiled the classifications, mouse1_classification.tsv, in the tar file tutorial_files.tar.gz. Centrifuge is a lightweight rapid microbial classification engine. It uses methods similar to BWA and the Ferrgina-Manzini (FM) index to make quick work of assigning taxomony.

      The ChocoPhlan Pangenome Database contains taxonomic information that MetaPro extracts. Kaiju, Centrifuge, and the extracted taxa are combined using WEVOTE. WEVOTE is the Weighted Voting Taxonomic Identification system. It performs consensus merging of various taxa results and reconciles the taxa identification from various sources.
      MetaPro uses this to settle on one confident taxon amongst Kaiju, Centrifuge, and the ChocoPhlan database choices.

      The format of the MetaPro command is:

          read1='<path to your unassembled singletons.fastq>'
          contig='<path to your contigs.fasta>'
          python3 /pipeline/MetaPro.py -c $config -s $read1 --contig $contig -o $output --tutorial TA

      The command would look as follows: [DO NOT RUN] MetaPro assumes the Gene Annotation step has completed.

      read1=/media/data/workspace/metapro_tutorial/mouse1_run/assemble_contigs/final_results/singletons.fastq
      contig=/media/data/workspace/metapro_tutorial/mouse1_run/assemble_contigs/final_results/contigs.fasta
      python3 /pipeline/MetaPro.py -c $config -s $read1 --contig $contig -o $output --tutorial TA
      

      We have provided pre-computed results here:

      ls mouse1_run/taxonomic_annotation/final_results
      

      We can use Krona to generate a hierarchical multi-layered pie chart summary of the taxonomic composition of our dataset. First, the export of MetaPro's taxonomic annotations needs to be slightly modified. Run the following commands to generate a Krona output:

      python3 /pipeline/Scripts/alter_taxa_for_krona.py mouse1_run/taxonomic_annotation/final_results/taxonomic_classifications.tsv mouse1_classification.tsv
      /pipeline_tools/kaiju/kaiju2krona -t databases/nodes.dmp -n databases/names.dmp -i mouse1_classification.tsv -o mouse1_classification_Krona.txt
      /pipeline_tools/KronaTools/scripts/ImportText.pl -o mouse1_classification.html mouse1_classification_Krona.txt
      

      View the pie chart representation of the taxonomies detected through a web browser.


      Question 9.1: What is the most abundant family in our dataset? What is the most abundant phylum?
      Hint: Try decreasing the Max depth value on the top left of the screen and/or double-clicking on specific taxa.


      Step 10. Enzyme Function Annotation *** [DO NOT RUN]

      To help interpret our metatranscriptomic datasets from a functional perspective, we rely on mapping our data to functional networks such as metabolic pathways and maps of protein complexes. Here we predict enzyme functions (EC numbers) in our dataset, such that we may later map them to KEGG metabolic pathways.

      MetaPro uses three tools to produce its enzyme annotations: DETECT, PRIAM, and DIAMOND.

      We use DIAMOND to identify homologs of our genes/proteins in the SWISS-PROT database that have assigned enzyme functions. This is a relatively coarse and straightforward way to annotate enzyme function by homology. We also use more robust methods for enzymatic function annotation, such as our own probability density-based enzyme function annotation tool, DETECT, and the tool PRIAM. MetaPro combines the predictions of all three tools to give two answers, a lower-confidence set of enzyme predictions lq_proteins.ECs_All, and a higher-confidence set of predictions proteins.ECs_All.

      PRIAM is incredibly resource-intensive and slow to run. For the sake of time, pre-computed results have been provided.

      The format of the MetaPro command is:

          read1='<path to your unassembled singletons.fastq>'
          contig='<path to your contigs.fasta>'
          python3 /pipeline/MetaPro.py -c $config -s $read1 --contig $contig -o $output --tutorial EC

      The command would look as follows: [DO NOT RUN] This command assumes that MetaPro has performed the Gene annotation steps.

      read1=/media/data/workspace/metapro_tutorial/mouse1_run/assemble_contigs/final_results/singletons.fastq
      contig=/media/data/workspace/metapro_tutorial/mouse1_run/assemble_contigs/final_results/contigs.fasta
      python3 /pipeline/MetaPro.py -c $config -s $read1 --contig $contig -o $output --tutorial EC
      

      The pre-computed results are provided in the following directory:

      ls mouse1_run/enzyme_annotation/final_results
      

      Notes:

      MetaPro's high-confidence and low-confidence are determined as follows:

      • DIAMOND: Low-confidence hits have an e-value of 1e-5 or smaller; high-confidence hits have an e-value of 1e-10 or smaller
      • PRIAM: Low-confidence hits have e-values lower than 1e-5; high-confidence hits have a probability value of 0.5 or higher
      • DETECT: No separation.

      MetaPro reconciles the three sets of annotations in the following manner:

      • Enzymes predicted by DETECT are included, followed by annotations that agree between PRIAM and DIAMOND
      • In the event that multiple enzymes are annotated to the same protein:
      • Each enzyme annotation comes with a probability score.
      • MetaPro includes an enzyme co-occurence database (compiled from the ENZYME database) that contains pairs of enzymes known to exist together. Using this co-occurence database, the pipeline filters out invalid predictions. If a pair of enzymes does not exist in the database, the enzyme with the higher probability score is declared the proper annotation.
      • In cases where more than two enzymes are annotated to a protein, the top two enzymes are assigned based on the probability score.


      Question 10.1: How many high-confidence unique enzyme functions were identified in our dataset?


      Step 11. Generate output files *** [DO NOT RUN]

      We have removed low quality bases/reads, vectors, adapters, linkers, primers, host sequences, and rRNA sequences and annotated reads to the best of our ability. We will now generate summaries of the gene counts, predicted functions and taxonomies in our microbiome.

      MetaPro generates many output files:

      • An account of read numbers during various filtering and annotation steps
      • A gene expression table of counts and RPKM values
      • RPKM values of genes in the 20-most prevalent taxa in the sample
      • A Cytoscape-compatible network file
      • An enzyme superpathway heatmap to visualize the distribution of enzymes found
      • A gene/protein-to-read map of all genes and proteins identified by MetaPro, followed by its constituent reads
      • A histogram of read quality
      • A summary of all taxa identified, followed by the number of reads associated with those taxa

      The format of the MetaPro command is:

          read1='<path to your unassembled singletons.fastq>'
          contig='<path to your contigs.fasta>'
          python3 /pipeline/MetaPro.py -c $config -s $read1 --contig $contig -o $output --tutorial output

      The command would appear as follows: [DO NOT RUN] The command assumes that MetaPro has performed the gene, taxa, and enzyme annotations.

      read1=/media/data/workspace/metapro_tutorial/mouse1_run/assemble_contigs/final_results/singletons.fastq
      contig=/media/data/workspace/metapro_tutorial/mouse1_run/assemble_contigs/final_results/contigs.fasta
      python3 /pipeline/MetaPro.py -c $config -s $read1 --contig $contig -o $output --tutorial output  
      

      We have provided the pre-computed output files generated by MetaPro. View the files:

      ls mouse1_run/outputs/final_results/
      


      Question 11.1: How many unique enzyme activities were predicted, at low and high confidence? View the read_count.txt file.


      Question 11.2: Have a look at the RPKM_table.tsv file. What are the most highly expressed genes?


      Question 11.3: Which taxa appear to contribute the most metabolic activity? View the enzyme_superpathway_heatmap.jpg.


      Step 12. Visualize the results using a KEGG Pathway as a scaffold in Cytoscape.

      This step will be performed on your workstation, using a Cytoscape file output from MetaPro.

      To visualize our processed microbiome dataset in the context of the carbohydrate metabolism pathways, we use the network visualization tool Cytoscape together with the enhancedGraphics and KEGGscape plugins. Some useful commands for loading in networks, node attributes and changing visual properties are provided below (there are many Cytoscape tutorials available online). Use Cytoscape 3.7.2 instead of the latest version.

      Download the metabolic pathway

      First, download the carbohydrate metabolism pathways from KEGG onto your workstation by pasting the following addresses into a browser:

      https://github.com/ParkinsonLab/Metatranscriptome-Workshop/releases/download/EC/ec00010.xml
      https://github.com/ParkinsonLab/Metatranscriptome-Workshop/releases/download/EC/ec00500.xml
      

      Alternatively, if you are using Linux, you may use the wget command as follows:

      wget https://github.com/ParkinsonLab/Metatranscriptome-Workshop/releases/download/EC/ec00010.xml
      wget https://github.com/ParkinsonLab/Metatranscriptome-Workshop/releases/download/EC/ec00500.xml
      

      You can find other pathways on KEGG which can also be imported into Cytoscape by selecting the Download KGML option on the top of the page for each pathway.

      Next, download the Cytoscape_network.tsv file through the browser. It is located within the mouse1_run/outputs/final_results folder. This file contains expression values of enzymes (as RPKMs) of all taxa detected at >1% abundance within the dataset.

      Install the Cytoscape plugins

      • Select Apps -> App Manager
      • Search for enhancedGraphics
      • Select enhancedGraphics in the middle column then click Install in the bottom right
      • Search for KEGGScape
      • Select KEGGScape in the middle column then click Install in the bottom right

      Import an XML from KEGG into Cytoscape

      • Select File -> Import -> Network -> File...
      • Select the XML file, ec00010.xml or ec00500.xml and click Open
      • Check Import pathway details from KEGG Database box then select OK

      Note: If you get an error while loading network via KEGGscape, you can alternatively install CyKEGGparser plugin within Cytoscape. Once installed, select Apps -> KEGGparser -> Load KGML -> Load local KGML -> ...

    Loading a node attribute text file (.txt) - This will map attributes (e.g. expression values) to nodes in your network which you can subsequently visualize

    • Select File -> Import -> Table -> File...
    • Select the Cytoscape_network.tsv file and click Open
    • Click OK
    • If you do not see the node table in the panel below the network, you can try another network file. First, copy it in your directory running the command

      cp ~/CourseData/Module6/Data/Cytoscape_network_1.tsv ~/workspace/metapro_tutorial/mouse1_run/outputs/final_results/.

      Then download this file to your own workstation and open in Cytoscape.

    Visualizing your node attributes

    • In the left Control Panel select the Style tab
    • Check the Lock node width and height box
    • Click the left-most box by the Size panel and change the default node size to 20.0
    • Click the blank box immediately to the right of the box you clicked to change the default size, change the Column field to RPKM and the Mapping Type field to Continuous Mapping
    • Click the left-most box by the Image/Chart 1 panel, switch to the Charts tab, Click the doughnut ring icon, and press the >> "add all" button between the two column fields before clicking apply (make sure to remove overall RPKM from the fields that are added to the doughnut ring)
    • If you do not see the Image/Chart 1 panel, select Properties -> Paint -> Custom Paint 1 -> Image/Chart 1 from the to left corner of the control panel
    • To improve the visualization you can modify colour properties under Image/Chart 1 -> Charts -> Options, or modify other properties such as Label Font Size, Label Position, Fill Color, Node location, and edge properties

    Notes:

    • A cytoscape file with node attributes precalculated is provided for your convenience, tar -xzf tutorial_files.tar.gz Example.cys, feel free to open it and play with different visualizations and different layouts - compare the circular layouts with the spring embedded layouts for example. If you want to go back to the original layout then you will have to reload the file.
    • Cytoscape can be temperamental. If you don't see pie charts for the nodes, they appear as blank circles, you can show these manually. Under the 'properties' panel on the left, there is an entry labeled 'Custom Graphics 1'. Double click the empty box on the left (this is for default behavior) - this will pop up a new window with a choice of 'Images' 'Charts' and 'Gradients' - select 'Charts', choose the chart type you want (pie chart or donut for example) and select the different bacterial taxa by moving them from "Available Columns" to "Selected Columns". Finally click on 'Apply' in bottom right of window (may not be visible until you move the window).

    Visualization Questions:

    • Which genes are most highly expressed in these two systems?
    • Which taxa are responsible for most gene expression?
    • Can you identify sub-systems (groups of interacting genes) that display anomalous taxonomic profiles?
    • Think about how you might interpret these findings; for example are certain taxa responsible for a specific set of genes that operate together to fulfill a key function?
    • Can you use the gene annotations to identify the functions of these genes through online searches?
    • Think about the implications of sequence homology searches, what may be some caveats associated with interpreting these datasets?