- From microbial genomics to metagenomics
- Bacterial Genomics
- Prepare the Virtual Machine
- From reads to assembly: working with INNUca pipeline
- In silico typing using ReMatCh and Abricate
- Annotation with Prokka and intro of Roary
- Bacterial Genomics
Note 1: replace whatever is between <> with the proper value. For example, in "Organize the data", for <your_species_name> write the species name you selected (something like campylobacter_jejuni).
Note 2: if the VM has 16 CPUs, use 16 in CPUs/threads instead of 8.
Note 3: do the steps below for the bacterial species of your choice. Streptococcus agalactiae is used as an example.
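For Note 2, the number of CPUs on the VM can be checked rather than guessed. A minimal sketch, assuming the GNU coreutils `nproc` command is available (it usually is on Linux); the `threads` variable name is only illustrative:

```shell
# Ask the system how many CPUs are available to this session
threads=$(nproc)
echo "Use ${threads} for the CPUs/threads options"
```

The reported value can then be used wherever the commands in this section show 8 threads.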
On your computer
On the NCBI website:
- Select "Genome" in dropdown menu and search "Streptococcus agalactiae"
- In the top box, below the "All XX genomes for species" section, click on "Browse the list"
- In the "Levels" options, select only "Complete"
- Take note of the average genome size using "Size (Mb)" column. For Streptococcus agalactiae, 2.1 Mb will be used.
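If the genome list is saved as a table, the average can also be computed on the command line. The sketch below is only illustrative: the file name `genomes_toy.tsv`, its two-column layout and the three sizes are made up and do not reflect NCBI's actual export format.

```shell
# Toy tab-separated table standing in for an exported genome list (made-up values)
work_dir=$(mktemp -d)
printf 'Assembly\tSize (Mb)\n' > "$work_dir/genomes_toy.tsv"
printf 'toy_genome_1\t2.0\ntoy_genome_2\t2.1\ntoy_genome_3\t2.2\n' >> "$work_dir/genomes_toy.tsv"

# Skip the header (NR > 1), sum column 2 and divide by the number of data rows
average_size=$(awk -F '\t' 'NR > 1 { total += $2; n++ } END { if (n > 0) printf "%.1f", total / n }' "$work_dir/genomes_toy.tsv")
echo "Average genome size: ${average_size} Mb"
```

The rounded value is what would be passed as the expected genome size in Mb later in this section.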
In the VM
# Create a folder to store the HTS reads
mkdir ~/reads
mkdir ~/reads/<your_species_name>
# Create a folder for Streptococcus agalactiae example
mkdir ~/reads/streptococcus_agalactiae_example
# Create a folder to store the genomes
mkdir ~/genomes
mkdir ~/genomes/<your_species_name>
# Create a folder for Streptococcus agalactiae example
mkdir ~/genomes/streptococcus_agalactiae_example
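The folder creation above can also be written with a variable and `mkdir -p`, which creates missing parent folders and does not fail if a folder already exists. In this sketch `base_dir` and `species` are illustrative names, and a scratch directory is used so it can be tried safely; on the VM, `base_dir` would be `$HOME`.

```shell
# Illustrative names; on the VM set base_dir="$HOME" and species to your own species name
base_dir=$(mktemp -d)
species="streptococcus_agalactiae_example"

# -p creates parent folders as needed and is safe to re-run
mkdir -p "$base_dir/reads/$species" "$base_dir/genomes/$species"

# Show the two top-level folders that were created
ls "$base_dir"
```

Keeping the species name in one variable avoids typing it differently in the reads and genomes paths.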
Upload a file with IDs to download
On your computer
UNIX terminal
scp -i </path/to/provided/private/ssh/key/mgmc.key> </path/to/file/with/IDs.txt> cloud-user@<VM.IP>:~/reads/<your_species_name>
FileZilla
- Get FileZilla here
- More information on using FileZilla with an SSH key here
- Upload the file to /home/cloud-user/reads/<your_species_name>/
Get the data
In the VM
# Using the Streptococcus agalactiae example
# 10 samples
# Get the file with IDs
wget -O ~/reads/streptococcus_agalactiae_example/MPM_GBS_samples.tab https://raw.githubusercontent.com/INNUENDOCON/MicrobialGenomeMetagenomeCourse/master/MPM_GBS_samples.tab
# Produce a clean file by removing the header line (first line) and keeping only the first column
# The next command pipes two commands together and redirects the output to a file
sed 1d ~/reads/streptococcus_agalactiae_example/MPM_GBS_samples.tab | cut -f 1 > ~/reads/streptococcus_agalactiae_example/ids.txt
# Download data using getSeqENA
getSeqENA.py --listENAids ~/reads/streptococcus_agalactiae_example/ids.txt \
--outdir ~/reads/streptococcus_agalactiae_example/ \
--asperaKey ~/NGStools/aspera/connect/etc/asperaweb_id_dsa.openssh \
--downloadLibrariesType PAIRED \
--downloadInstrumentPlatform ILLUMINA \
--threads 8 \
--SRAopt
# Runtime :0.0h:2.0m:10.12s
- More information about piping and redirection here
- More information on skipping lines here
- For more information about cutting text based on delimiters: cut --help or man cut, and here
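To see what the `sed 1d | cut -f 1` step does before running it on the real file, it can be tried on a toy table. The file below and its IDs are made up for illustration; only the pipeline itself comes from the steps above.

```shell
toy_dir=$(mktemp -d)

# Toy tab-separated table: a header line followed by two samples (made-up IDs)
printf 'run_id\tsample_name\n' > "$toy_dir/samples.tab"
printf 'RUN0001\tsample_A\nRUN0002\tsample_B\n' >> "$toy_dir/samples.tab"

# sed 1d deletes the first line; cut -f 1 keeps the first tab-delimited column
sed 1d "$toy_dir/samples.tab" | cut -f 1 > "$toy_dir/ids.txt"

# Show the resulting IDs and count them
cat "$toy_dir/ids.txt"
id_count=$(wc -l < "$toy_dir/ids.txt" | tr -d ' ')
echo "Number of IDs: ${id_count}"
```

Running the same count on the real ids.txt is a quick check that the number of IDs matches the number of samples expected (10 in the Streptococcus agalactiae example).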
Assemble HTS data using INNUca
In the VM
# INNUca basic command
# You should specify where the output goes whenever there is an option to do that
# Whenever possible use the option to specify the number of CPUs/threads to be used
docker run --rm -u $(id -u):$(id -g) -it -v ~/:/data/ ummidock/innuca:3.1 \
INNUca.py --inputDirectory /data/reads/<your_species_name>/ \
--speciesExpected "<your species name with space>" \
--genomeSizeExpectedMb <your_species_genome_size> \
--outdir /data/genomes/<your_species_name>/innuca/ \
--threads 8
Using the Streptococcus agalactiae example:
# Run inside a screen
screen -S streptococcus_agalactiae_example
# INNUca
docker run --rm -u $(id -u):$(id -g) -it -v ~/:/data/ ummidock/innuca:3.1 \
INNUca.py --inputDirectory /data/reads/streptococcus_agalactiae_example/ \
--speciesExpected "Streptococcus agalactiae" \
--genomeSizeExpectedMb 2.1 \
--outdir /data/genomes/streptococcus_agalactiae_example/innuca/ \
--threads 8 \
--fastQCproceed \
--fastQCkeepFiles \
--trimKeepFiles \
--saveExcludedContigs
# Detach the screen
# Press Ctrl + A (release) and then D
# Runtime :1.0h:14.0m:33.47s
- More information about screen here and man screen
Store all assembled genomes (good and bad assemblies) in a single folder for use with the next tools.
In the VM
# Create the folder where assemblies will be stored
mkdir ~/genomes/<your_species_name>/all_assemblies
# Using the Streptococcus agalactiae example
mkdir ~/genomes/streptococcus_agalactiae_example/all_assemblies
# Copy INNUca's final assemblies
# The next command pipes different commands:
## The first sed command reads the INNUca combine_samples_reports.tab file and skips the header line
## The cut command gets the final_assembly column
## grep ignores the samples that did not produce a final assembly ("NA")
## Then, the resulting list feeds the parallel command, which copies the file for each entry
### Inside parallel, {} is replaced by each input line
### Because INNUca ran inside Docker, the assembly path is relative to /data/
### The second sed (inside parallel) replaces /data/ with the user's HOME directory in each input line
sed 1d ~/genomes/streptococcus_agalactiae_example/innuca/combine_samples_reports.*.tab | \
cut -f 23 | \
grep --invert-match "NA" | \
parallel --jobs 8 'cp $(sed s#/data/#$HOME/#1 <(echo {})) $HOME/genomes/streptococcus_agalactiae_example/all_assemblies/'
- For more information about finding patterns: grep --help or man grep, and here
- More information about using parallel: introduction and bioinformatics examples; manual; tutorial
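The path rewriting in the copy step can be easier to follow as a plain loop. The sketch below is not the course command: it uses a scratch folder in place of HOME, a made-up sample and a two-line list standing in for the final_assembly column, and it copies files one at a time instead of using parallel.

```shell
# Scratch folder standing in for the user's HOME (hypothetical layout and file names)
fake_home=$(mktemp -d)
mkdir -p "$fake_home/genomes/toy/innuca/sample_A" "$fake_home/genomes/toy/all_assemblies"
printf '>contig_1\nACGT\n' > "$fake_home/genomes/toy/innuca/sample_A/sample_A.fasta"

# Toy final_assembly values: one path as seen inside Docker and one NA
printf '%s\n' '/data/genomes/toy/innuca/sample_A/sample_A.fasta' 'NA' > "$fake_home/final_assemblies.txt"

while IFS= read -r container_path; do
  # Skip samples without a final assembly
  if [ "$container_path" = "NA" ]; then
    continue
  fi
  # Strip the leading /data/ used inside Docker and prepend the host folder
  host_path="$fake_home/${container_path#/data/}"
  cp "$host_path" "$fake_home/genomes/toy/all_assemblies/"
done < "$fake_home/final_assemblies.txt"

# List what was copied
ls "$fake_home/genomes/toy/all_assemblies/"
```

Comparing each entry to exactly "NA" skips only the samples without an assembly, and the `${container_path#/data/}` expansion does the same job as the second sed in the parallel command.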
To avoid VM space problems during the course, unused Docker images will be removed
In the VM
# List Docker images
docker images
# Remove the INNUca image
# Find the INNUca image line starting with ummidock/innuca
# Get the Image ID, something like 1f467865b7f3
docker rmi <INNUca_Image_ID>
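Finding the Image ID by eye can be replaced with a small filter. The sketch below runs on a toy copy of `docker images` output so it can be tried without Docker; the CREATED and SIZE values are made up, and the Image ID is the example one mentioned above.

```shell
# Toy "docker images" output (CREATED and SIZE are made up); on the VM use the real output instead
docker_images_output='REPOSITORY        TAG   IMAGE ID       CREATED        SIZE
ummidock/innuca   3.1   1f467865b7f3   2 months ago   3GB'

# Keep the line whose first column is ummidock/innuca and print its third column (the Image ID)
image_id=$(printf '%s\n' "$docker_images_output" | awk '$1 == "ummidock/innuca" { print $3 }')
echo "$image_id"

# On the VM, the ID would then be passed to: docker rmi "$image_id"
```

On the VM the same awk filter can read from `docker images` directly through a pipe.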