From reads to assembly: working with INNUca pipeline

Note 1: replace anything between <> with the proper value. For example, in "Organize the data", replace <your_species_name> with the species name you selected (something like campylobacter_jejuni).
Note 2: if the VM has 16 CPUs, use 16 in CPUs/threads instead of 8.
Note 3: do the steps below for the bacterial species of your choice. Streptococcus agalactiae is used as an example.
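Note 2 asks you to match the thread count to the CPUs of the VM. As a minimal sketch, assuming GNU coreutils is installed on the VM, the CPU count can be read with nproc and stored in a variable (the name threads is only illustrative):

```shell
# Count the CPUs available on this machine (nproc is part of GNU coreutils)
threads=$(nproc)

# Show the value that could be passed to --threads in the commands below
echo "Using ${threads} threads"
```

The variable could then be used as --threads "$threads" instead of a hard-coded 8.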

Get genomic data

Get genomic information

In your computer

In NCBI website:

  1. Select "Genome" in dropdown menu and search "Streptococcus agalactiae"
  2. In the top box, below the "All XX genomes for species" section, click on "Browse the list"
  3. On "Levels" options, only select "Complete"
  4. Take note of the average genome size using "Size (Mb)" column. For Streptococcus agalactiae, 2.1 Mb will be used.

Organize the data

In the VM

# Create a folder to store the HTS reads

mkdir ~/reads
mkdir ~/reads/<your_species_name>

# Create a folder for Streptococcus agalactiae example
mkdir ~/reads/streptococcus_agalactiae_example

# Create a folder to store the genomes

mkdir ~/genomes
mkdir ~/genomes/<your_species_name>

# Create a folder for Streptococcus agalactiae example
mkdir ~/genomes/streptococcus_agalactiae_example
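The folder creation above can also be sketched as a single reusable step. This is only an illustration: make_species_dirs is a hypothetical helper name, and the demonstration runs in a temporary directory rather than in the home directory so it does not touch your real data.

```shell
# Hypothetical helper: create the reads/ and genomes/ folders for one species
# under a base directory (the tutorial itself uses the home directory)
make_species_dirs() {
    local base="$1" species="$2"
    # -p creates missing parent folders and does not fail if they already exist
    mkdir -p "${base}/reads/${species}" "${base}/genomes/${species}"
}

# Demonstration in a throwaway directory
demo_base=$(mktemp -d)
make_species_dirs "$demo_base" streptococcus_agalactiae_example
ls "$demo_base"
```

In the VM, the equivalent call would be make_species_dirs ~ <your_species_name>.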

Get HTS (High-throughput sequencing) data

Upload a file with IDs to download

In your computer

UNIX terminal

scp -i </path/to/provided/private/ssh/key/mgmc.key> </path/to/file/with/IDs.txt> cloud-user@<VM.IP>:~/reads/<your_species_name>


  • Get FileZilla here
  • More information on using Filezilla with SSH key here
  • Upload the file to /home/cloud-user/reads/<your_species_name>/

Get the data

In the VM

# Using the Streptococcus agalactiae example
# 10 samples

# Get the file with IDs
wget -O ~/reads/streptococcus_agalactiae_example/

# Produce a clean file by removing the header line (first line) and containing only the first column
# The next command pipes two different commands and redirects the output to a file
sed 1d ~/reads/streptococcus_agalactiae_example/ | cut -f 1 > ~/reads/streptococcus_agalactiae_example/ids.txt

# Download data
getSeqENA --listENAids ~/reads/streptococcus_agalactiae_example/ids.txt \
             --outdir ~/reads/streptococcus_agalactiae_example/ \
             --asperaKey  ~/NGStools/aspera/connect/etc/asperaweb_id_dsa.openssh \
             --downloadLibrariesType PAIRED \
             --downloadInstrumentPlatform ILLUMINA \
             --threads 8

# Runtime :0.0h:2.0m:10.12s
  • More information about piping and redirection here
  • More information on skipping lines here
  • For more information about cutting text based on delimiters: cut --help or man cut, and here
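To see what the header removal and column extraction do, here is a small sketch on a toy table. The column names and run IDs below are made up for illustration and do not come from the course file.

```shell
demo_dir=$(mktemp -d)

# Toy tab-separated table with a header line (illustrative values only)
printf 'run_accession\tinstrument\nERR0000001\tILLUMINA\nERR0000002\tILLUMINA\n' \
    > "${demo_dir}/table.tsv"

# sed 1d drops the first (header) line
# cut -f 1 keeps only the first tab-separated column
# > redirects the result to a file instead of the terminal
sed 1d "${demo_dir}/table.tsv" | cut -f 1 > "${demo_dir}/ids.txt"

cat "${demo_dir}/ids.txt"
```

The resulting ids.txt holds one ID per line with no header, which is the kind of list passed to --listENAids above.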

Assemble HTS data

Assemble HTS data using INNUca

In the VM

# INNUca basic command

# You should specify where the output goes whenever there is an option to do that
# Whenever possible use the option to specify the number of CPUs/threads to be used

docker run --rm -u $(id -u):$(id -g) -it -v ~/:/data/ ummidock/innuca:3.1 \
                 --inputDirectory /data/reads/<your_species_name>/ \
                 --speciesExpected "<your species name with space>" \
                 --genomeSizeExpectedMb <your_species_genome_size> \
                 --outdir /data/genomes/<your_species_name>/innuca/ \
                 --threads 8

Using the Streptococcus agalactiae example:

# Run inside a screen
screen -S streptococcus_agalactiae_example

# INNUca
docker run --rm -u $(id -u):$(id -g) -it -v ~/:/data/ ummidock/innuca:3.1 \
                 --inputDirectory /data/reads/streptococcus_agalactiae_example/ \
                 --speciesExpected "Streptococcus agalactiae" \
                 --genomeSizeExpectedMb 2.1 \
                 --outdir /data/genomes/streptococcus_agalactiae_example/innuca/ \
                 --threads 8 \
                 --fastQCproceed \
                 --fastQCkeepFiles \
                 --trimKeepFiles

# Detach the screen
# Press Ctrl + A (release) and then D

# Runtime :1.0h:14.0m:33.47s
  • More information about screen here and man screen

Organize assemblies

Store all assembled genomes (good and bad assemblies) in a single folder for use with the next tools.

In the VM

# Create the folder where assemblies will be stored
mkdir ~/genomes/<your_species_name>/all_assemblies

# Using the Streptococcus agalactiae example

mkdir ~/genomes/streptococcus_agalactiae_example/all_assemblies

# Copy INNUca's final assemblies
# The next command pipes several commands together:
## The first sed command reads the INNUca report and skips the header line
## The cut command gets the final_assembly column
## grep ignores the samples that did not produce a final assembly ("NA")
## The resulting list then feeds the parallel command, which copies the file for each entry
### Inside parallel, {} is replaced by each input line
### Because INNUca ran inside Docker, the assembly paths refer to the container's /data/ folder
### The second sed (inside parallel) replaces /data/ with the user's HOME directory in each input line

sed 1d ~/genomes/streptococcus_agalactiae_example/innuca/combine_samples_reports.*.tab | \
          cut -f 23 | \
          grep --invert-match "NA" | \
          parallel --jobs 8 'cp $(sed s#/data/#$HOME/#1 <(echo {})) $HOME/genomes/streptococcus_agalactiae_example/all_assemblies/'
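The individual steps of this pipeline can be tried on a toy report. The sketch below makes several assumptions: column_index and container_to_host are hypothetical helpers, the report content is invented, and the column is looked up by its header name (final_assembly) instead of the fixed position 23. Unlike the command above, this variant only rewrites /data/ at the start of a path and matches "NA" as a whole line.

```shell
# Hypothetical helper: print the 1-based index of a named column
# in the header of a tab-separated file
column_index() {
    local file="$1" name="$2"
    head -n 1 "$file" | tr '\t' '\n' \
        | grep --line-number --line-regexp --fixed-strings "$name" \
        | cut -d : -f 1
}

# Hypothetical helper: replace a leading /data/ (path inside the container)
# with a directory on the host; the directory must not contain '#'
container_to_host() {
    local host_dir="$1"
    sed "s#^/data/#${host_dir}/#"
}

# Toy report with two samples, one without a final assembly (illustrative values)
demo_dir=$(mktemp -d)
printf 'sample\tfinal_assembly\nA\t/data/genomes/x/A.fasta\nB\tNA\n' \
    > "${demo_dir}/report.tab"

col=$(column_index "${demo_dir}/report.tab" final_assembly)

# Skip the header, keep the assembly column, drop "NA", rewrite the paths
sed 1d "${demo_dir}/report.tab" \
    | cut -f "$col" \
    | grep --invert-match --line-regexp "NA" \
    | container_to_host "/home/cloud-user"
```

Printing the rewritten paths first is a cheap way to check them before handing the list to parallel for copying.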

Remove INNUca image

To avoid running out of disk space on the VM during the course, remove unused Docker images.

In the VM

# List Docker images
docker images
# Remove INNUca image
# Find the INNUca image line starting with ummidock/innuca
# Get the Image ID, something like 1f467865b7f3
docker rmi <INNUca_Image_ID>
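Finding the Image ID by eye can be replaced by a small filter. This is a sketch only: image_ids_for is a hypothetical helper that reads the text printed by docker images, and the sample listing below is invented (only the example ID 1f467865b7f3 comes from the comment above).

```shell
# Hypothetical helper: read `docker images` output on stdin and print the
# IMAGE ID (third column) of every row whose repository starts with the given name
image_ids_for() {
    awk -v repo="$1" 'NR > 1 && index($1, repo) == 1 { print $3 }'
}

# Illustrative sample of what `docker images` prints (made-up rows)
sample_listing='REPOSITORY        TAG     IMAGE ID       CREATED        SIZE
ummidock/innuca   3.1     1f467865b7f3   2 months ago   3.2GB
ubuntu            16.04   0123456789ab   3 months ago   120MB'

printf '%s\n' "$sample_listing" | image_ids_for ummidock/innuca
```

In the VM, the same filter would be fed with the real listing, as in docker images | image_ids_for ummidock/innuca, and the printed ID passed to docker rmi.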