From reads to assembly: working with INNUca pipeline

From microbial genomics to metagenomics
- Bacterial Genomics
  - Prepare the Virtual Machine
  - From reads to assembly: working with INNUca pipeline
  - In silico typing using ReMatCh and Abricate
  - Annotation with Prokka and intro of Roary

Note 1: replace whatever is between <> with the proper value. For example, in "Organize the data", replace <your_species_name> with the name of the species you selected (something like campylobacter_jejuni).
Note 2: if the VM has 16 CPUs, use 16 in CPUs/threads instead of 8.
Note 3: do the steps below for the bacterial species of your choice. Streptococcus agalactiae is used as an example.

Get genomic data

Get genomic information

On your computer

On the NCBI website:

Select "Genome" in the dropdown menu and search for "Streptococcus agalactiae"
In the top box, below the "All XX genomes for species" section, click on "Browse the list"
Under the "Levels" options, select only "Complete"
Take note of the average genome size using the "Size (Mb)" column. For Streptococcus agalactiae, 2.1 Mb will be used.

Organize the data

In the VM

# Create a folder to store the HTS reads

mkdir ~/reads
mkdir ~/reads/<your_species_name>

# Create a folder for the Streptococcus agalactiae example
mkdir ~/reads/streptococcus_agalactiae_example


# Create a folder to store the genomes

mkdir ~/genomes
mkdir ~/genomes/<your_species_name>

# Create a folder for the Streptococcus agalactiae example
mkdir ~/genomes/streptococcus_agalactiae_example
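
The folder layout above can also be created in one step. The following is a minimal sketch, not part of the original course material: `make_species_dirs` is a hypothetical helper name, and the demo runs in a temporary directory so nothing under your home directory is touched. It relies on `mkdir -p`, which creates missing parent folders and does not fail when a folder already exists.

```shell
# Hypothetical helper: create the reads/ and genomes/ folders for one species.
# Usage: make_species_dirs <species_name> [base_dir]   (base_dir defaults to $HOME)
make_species_dirs() {
    local species="$1"
    local base_dir="${2:-$HOME}"
    # -p creates parent folders as needed and is silent if they already exist
    mkdir -p "$base_dir/reads/$species" "$base_dir/genomes/$species"
}

# Demo in a throwaway directory
demo_dir="$(mktemp -d)"
make_species_dirs streptococcus_agalactiae_example "$demo_dir"
ls "$demo_dir"
```

On the VM, `make_species_dirs <your_species_name>` would create both folders under the home directory.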

Get HTS (High-throughput sequencing) data

Upload a file with IDs to download

On your computer

UNIX terminal

scp -i </path/to/provided/private/ssh/key/mgmc.key> </path/to/file/with/IDs.txt> cloud-user@<VM.IP>:~/reads/<your_species_name>

FileZilla

Get FileZilla here
More information on using FileZilla with SSH key here
Upload the file to /home/cloud-user/reads/<your_species_name>/

Get the data

In the VM

# Using the Streptococcus agalactiae example
# 10 samples

# Get the file with IDs
wget -O ~/reads/streptococcus_agalactiae_example/MPM_GBS_samples.tab https://raw.githubusercontent.com/INNUENDOCON/MicrobialGenomeMetagenomeCourse/master/MPM_GBS_samples.tab

# Produce a clean file by removing the header line (first line) and keeping only the first column
# The next command pipes two different commands and redirects the output to a file
sed 1d ~/reads/streptococcus_agalactiae_example/MPM_GBS_samples.tab | cut -f 1 > ~/reads/streptococcus_agalactiae_example/ids.txt

# Download data using getSeqENA
getSeqENA.py --listENAids ~/reads/streptococcus_agalactiae_example/ids.txt \
             --outdir ~/reads/streptococcus_agalactiae_example/ \
             --asperaKey  ~/NGStools/aspera/connect/etc/asperaweb_id_dsa.openssh \
             --downloadLibrariesType PAIRED \
             --downloadInstrumentPlatform ILLUMINA \
             --threads 8 \
             --SRAopt

# Runtime :0.0h:2.0m:10.12s
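
Once the download finishes, it may be worth confirming that every requested ID produced some data. The sketch below assumes that the downloader writes one subfolder per ID, named after the ID, inside the output directory; check this against your own output before relying on it. `missing_samples` is a hypothetical helper name, and the IDs in the demo are made up.

```shell
# Hypothetical helper: print the IDs that have no matching subfolder.
# Assumption: one subfolder per ID, named after the ID, inside <reads_dir>.
# Usage: missing_samples <ids_file> <reads_dir>
missing_samples() {
    local ids_file="$1" reads_dir="$2" id
    # "|| [ -n "$id" ]" also handles a last line without a trailing newline
    while IFS= read -r id || [ -n "$id" ]; do
        # Skip empty lines
        [ -n "$id" ] || continue
        # Report the ID when its folder does not exist
        [ -d "$reads_dir/$id" ] || echo "$id"
    done < "$ids_file"
}

# Demo with made-up IDs in a throwaway directory
demo_reads="$(mktemp -d)"
printf 'ERR000001\nERR000002\n' > "$demo_reads/ids.txt"
mkdir "$demo_reads/ERR000001"
missing_samples "$demo_reads/ids.txt" "$demo_reads"
```

For the example data, the call would be `missing_samples ~/reads/streptococcus_agalactiae_example/ids.txt ~/reads/streptococcus_agalactiae_example/`. No output means every ID has a folder.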

More information about piping and redirection here
More information on skipping lines here
For more information about cutting text based on delimiters: cut --help or man cut, and here
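
To see what the `sed 1d ... | cut -f 1` pipeline does, it can be tried on a small toy file first. In this sketch the column names and IDs are made up for illustration and do not come from MPM_GBS_samples.tab; `extract_ids` is a hypothetical helper name.

```shell
# Hypothetical helper wrapping the same pipeline used above:
# sed 1d deletes line 1 (the header); cut -f 1 keeps the first tab-separated column
extract_ids() {
    sed 1d "$1" | cut -f 1
}

# Toy tab-separated file with a header line and two samples
demo_tab="$(mktemp)"
printf 'run_id\tserotype\nERR000001\tIa\nERR000002\tIII\n' > "$demo_tab"

extract_ids "$demo_tab"
# prints:
# ERR000001
# ERR000002
```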

Assemble HTS data

Assemble HTS data using INNUca

In the VM

# INNUca basic command

# You should specify where the output goes whenever there is an option to do that
# Whenever possible use the option to specify the number of CPUs/threads to be used

docker run --rm -u $(id -u):$(id -g) -it -v ~/:/data/ ummidock/innuca:3.1 \
       INNUca.py --inputDirectory /data/reads/<your_species_name>/ \
                 --speciesExpected "<your species name with space>" \
                 --genomeSizeExpectedMb <your_species_genome_size> \
                 --outdir /data/genomes/<your_species_name>/innuca/ \
                 --threads 8

Using the Streptococcus agalactiae example:

# Run inside a screen
screen -S streptococcus_agalactiae_example

# INNUca
docker run --rm -u $(id -u):$(id -g) -it -v ~/:/data/ ummidock/innuca:3.1 \
       INNUca.py --inputDirectory /data/reads/streptococcus_agalactiae_example/ \
                 --speciesExpected "Streptococcus agalactiae" \
                 --genomeSizeExpectedMb 2.1 \
                 --outdir /data/genomes/streptococcus_agalactiae_example/innuca/ \
                 --threads 8 \
                 --fastQCproceed \
                 --fastQCkeepFiles \
                 --trimKeepFiles \
                 --saveExcludedContigs

# Detach the screen
# Press Ctrl + A (release) and then D

# Runtime :1.0h:14.0m:33.47s

More information about screen here and man screen

Organize assemblies

Store all assembled genomes (good and bad assemblies) in a single folder for use with the next tools.

In the VM

# Create the folder where assemblies will be stored
mkdir ~/genomes/<your_species_name>/all_assemblies

# Using the Streptococcus agalactiae example

mkdir ~/genomes/streptococcus_agalactiae_example/all_assemblies

# Copy INNUca's final assemblies
# The next command pipes several commands together:
## The first sed command reads the INNUca combine_samples_reports.tab file and skips the header line
## The cut command gets the final_assembly column
## grep ignores the samples that did not produce a final assembly ("NA")
## The resulting list then feeds the parallel command, which copies the file for each entry
### Inside parallel, {} is replaced by each input line
### Because INNUca ran inside Docker, the assembly path starts with /data/ (the mount point inside the container)
### The second sed (inside parallel) replaces /data/ with the user's HOME directory in each input line

sed 1d ~/genomes/streptococcus_agalactiae_example/innuca/combine_samples_reports.*.tab | \
          cut -f 23 | \
          grep --invert-match "NA" | \
          parallel --jobs 8 'cp $(sed s#/data/#$HOME/#1 <(echo {})) $HOME/genomes/streptococcus_agalactiae_example/all_assemblies/'

For more information about finding patterns: grep --help or man grep, and here
More information about using parallel: introduction and bioinformatics examples; manual; tutorial
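
The command above selects the final_assembly column by position (`cut -f 23`). If the column order ever differs in your report, selecting the column by its header name is a more robust alternative. The sketch below is an illustration under assumptions, not the course's method: `get_column` and `container_to_host` are hypothetical helper names, the toy report has only two columns, and the path rewrite assumes the same `-v ~/:/data/` mount used when running INNUca. It also uses `grep --line-regexp` so that only lines that are exactly "NA" are dropped; a plain `grep --invert-match "NA"` would also drop any path that happens to contain "NA".

```shell
# Hypothetical helper: print the values of a named column from a
# tab-separated file whose first line is a header.
# Usage: get_column <file> <column_name>
get_column() {
    awk -F '\t' -v name="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
        col     { print $col }
    ' "$1"
}

# Hypothetical helper: rewrite container paths (/data/...) read from stdin
# to host paths, assuming the home directory was mounted at /data/.
container_to_host() {
    sed "s#^/data/#$HOME/#"
}

# Toy report: sample B has no final assembly ("NA")
demo_report="$(mktemp)"
printf 'sample\tfinal_assembly\nA\t/data/genomes/demo/A.fasta\nB\tNA\n' > "$demo_report"

get_column "$demo_report" final_assembly | \
    grep --invert-match --line-regexp "NA" | \
    container_to_host
```

The resulting list of host paths could then be passed to `parallel` or to a `while read` loop to copy each file into the all_assemblies folder.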

Remove INNUca image

To avoid running out of disk space on the VM during the course, remove unused Docker images.

In the VM

# List Docker images
docker images
# Remove INNUca image
# Find the INNUca image line starting with ummidock/innuca
# Get the Image ID, something like 1f467865b7f3
docker rmi <INNUca_Image_ID>
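
Looking up the image ID by eye can also be scripted. The sketch below is one possible way to do it, offered as an illustration: `image_ids_for` is a hypothetical helper name, and it only parses text, so it can be tried without Docker. The Docker commands in the comments are meant to be run on the VM.

```shell
# Hypothetical helper: read lines of "<repository> <tag> <image_id>" from
# stdin and print the image ID(s) whose repository matches the argument.
# Usage: ... | image_ids_for <repository>
image_ids_for() {
    awk -v repo="$1" '$1 == repo { print $3 }'
}

# Intended use on the VM (not run here):
#   docker images --format '{{.Repository}} {{.Tag}} {{.ID}}' | image_ids_for ummidock/innuca
# An image can also be removed by name and tag instead of by ID:
#   docker rmi ummidock/innuca:3.1

# Demo on text shaped like that listing (the second line is made up)
printf 'ummidock/innuca 3.1 1f467865b7f3\nsome/other_image 1.0 0123456789ab\n' | \
    image_ids_for ummidock/innuca
```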
