In silico typing using ReMatCh and Abricate

From microbial genomics to metagenomics
- Bacterial Genomics
  - Prepare the Virtual Machine
  - From reads to assembly: working with INNUca pipeline
  - In silico typing using ReMatCh and Abricate
  - _ Annotation with Prokka and intro of Roary_

Note 1: replace whatever is between <> with the proper value. For example, in "Organize the data" <your_species_name>, write the species name you selected (something like campylobacter_jejuni).
Note 2: if the VM has 16 CPUs, use 16 in CPUs/threads instead of 8.
Note 3: do the steps bellow for the bacteria species of your choise. Streptococcus agalactiae is used as example.

Get DB of antimicrobial resistance genes

Download a database containing antimicrobial resistance genes

Resfinder

What is Resfinder?

ResFinder identifies acquired antimicrobial resistance genes and/or find chromosomal mutations in total or partial sequenced isolates of bacteria

But it also provides a curated database for antimicrobial resistance genes.

Get ResFinder DB
In your computer

In Genomic Epidemiology databases Bitbucket webpage:

resfinder_db
Copy the HTTPS link

In the VM

Althoug ResFinder is a curated database, small details impair a simple concatenation for downstream use. Therefore several steps are required to properlly prepare the DB.

# Get ResFinder DB
git clone https://bitbucket.org/genomicepidemiology/resfinder_db.git
mv resfinder_db/ /media/volume/DBs/

# Create a single file containing the sequences of genes confering resistance to all antibiotic classes together
## Make sure all files end with a newline to avoid concatening two sequences together
mkdir temp_concat_resfinder
ls /media/volume/DBs/resfinder_db/*.fsa | parallel --jobs 8 'cp {} temp_concat_resfinder/; echo "" >> temp_concat_resfinder/{/}'
## Correct several details
### Concatenate all files
### Convert DOS to UNIX newlines special chracters
### Ignore blank lines
### Pass everytnhing to parallel
#### Split everything that enters parallel by sequence (--recstart '>')
#### cat receives each sequence
#### Each sequence is saved in seq variable
#### The header line is saved by grepping the line with ">" while removing trailing spaces with sed
#### The sequnce is saved in a file named as the sequence header without the '>' character, therefore removing duplicated sequences (at least on headers level)
mkdir temp_concat_resfinder/seq_file
cat temp_concat_resfinder/*.fsa | \
       sed $'s/\r$//' | \
       grep --invert-match "^$" | \
       parallel --jobs 1 --pipe -N 1 --recstart '>' 'cat | { seq=$(cat); header=$(echo "$seq" | grep ">" | sed -e "s/[[:space:]]*$//"); echo "$seq" > temp_concat_resfinder/seq_file/${header:1}.fasta; }'

# Concatenate the cleaned sequences
cat temp_concat_resfinder/seq_file/*.fasta > /media/volume/DBs/resfinder_db/resfinder_db.fasta
rm -r temp_concat_resfinder/

More informations:

Adding newlines at the end of files
Reading different files together: cat --help or man cat, and here
Converting newline characters from DOS to UNIX
Skipping blank lines
Removing trailing spaces
Removing characters from the beggining of a string

Update Abricate DB

After a Docker image is built, its content is static, which mean it might carry outdated databases. However, some small steps can overwrite the provided DBs with updated ones.

In the VM

Before start, make sure Abricate image is already in the VM

docker pull ummidock/abricate:latest

Start updating Abricate DB

# Create folder to store Abricate DBs
mkdir /media/volume/DBs/abricate

# Copy DB folder to outside image
## External folder will overwrite image inside folder when mapped
docker run --rm -u $(id -u):$(id -g) -it -v /media/volume/DBs/abricate/:/data/ ummidock/abricate:latest \
       cp -r /NGStools/miniconda/db/ /data/
mv /media/volume/DBs/abricate/db/* /media/volume/DBs/abricate/
rm -r /media/volume/DBs/abricate/db/

# Update DB in a folder that will persist after container stops
docker run --rm -u $(id -u):$(id -g) -it -v /media/volume/DBs/abricate/:/NGStools/miniconda/db/ ummidock/abricate:latest \
      abricate-get_db --db resfinder --force

Typing samples

Two different approaches can be used to type HTS data: directly from reads using a reference mapping approach, or throught similarity sequence search using an already produced genome assembly.

Since the similarity sequence search approach uses a reduced representation of original HTS reads dataset, it's a very fast method
Therefore, we will leave the first read command running and move to de novo assembly approach

In the VM

# Create folder to store typing results
mkdir /media/volume/typing
mkdir /media/volume/typing/<your_species_name>

# Create a folder for Streptococcus agalactiae example
mkdir /media/volume/typing/streptococcus_agalactiae_example

Using reads

MLST for a given taxon

Let's start with Multi-Locus Sequence Typing (MLST), and type all HTS reads available for a given taxon using ReMatCh.

For example purposes, we will use Bartonella bacilliformis because only around 30 reads datasets are available
Only Illumina reads will be used

In the VM

# Run inside a screen
screen -S rematch_mlst_Bb

rematch.py \
    --workdir /media/volume/typing/rematch_mlst_Bb_example/ \
    --mlst "Bartonella bacilliformis" \
    --mlstReference \
    --doubleRun \
    --threads 8 \
    --taxon "Bartonella bacilliformis" \
    --asperaKey ~/NGStools/aspera/connect/etc/asperaweb_id_dsa.openssh \
    --SRAopt

# Detatch the screen
# Press Ctrl + A (release) and then D

# 29 samples out of 30 run successfully
# Runtime :0.0h:55.0m:6.53s

Antibiotic resistance genes

To screen antibiotic resistance genes we will use the Resfinder DB obtained before. For that, we will:

Move back to Streptococcus agalactiae example
Re-download example's samples while running, or using already downloaded reads data

In the VM

Re-download example's samples while running

# Run inside a screen
screen -S rematch_ARgenes_download

rematch.py \
    --workdir /media/volume/typing/streptococcus_agalactiae_example/rematch_ARgenes_download/ \
    --reference /media/volume/DBs/resfinder_db/resfinder_db.fasta \
    --reportSequenceCoverage \
    --notWriteConsensus \
    --summary \
    --threads 8 \
    --listIDs ~/reads/streptococcus_agalactiae_example/ids.txt \
    --asperaKey  ~/NGStools/aspera/connect/etc/asperaweb_id_dsa.openssh \
    --SRAopt

# Detatch the screen
# Press Ctrl + A (release) and then D

# 10 samples out of 10 run successfully
# Runtime :0.0h:59.0m:54.58s

In the VM

Using already downloaded reads data

# Create a ReMatCh working directory
mkdir /media/volume/typing/streptococcus_agalactiae_example/rematch_ARgenes_localSamples

# Set ReMatCh working directory
## Using a bash for-loop for that
## Create symlinks instead of duplicate data
### Iterate over samples directories from reads folders
#### Create sample directory by obtaining only last directory name instead of directory path using basename
#### Create symbolic links to sample's reads inside newly created sample's directory
for sample_dir in $(ls -d ~/reads/streptococcus_agalactiae_example/*/); do
  mkdir /media/volume/typing/streptococcus_agalactiae_example/rematch_ARgenes_localSamples/$(basename $sample_dir)
  ln -s $(ls ${sample_dir}*_1.fq.gz) /media/volume/typing/streptococcus_agalactiae_example/rematch_ARgenes_localSamples/$(basename $sample_dir)/$(basename $(ls ${sample_dir}*_1.fq.gz))
  ln -s $(ls ${sample_dir}*_2.fq.gz) /media/volume/typing/streptococcus_agalactiae_example/rematch_ARgenes_localSamples/$(basename $sample_dir)/$(basename $(ls ${sample_dir}*_2.fq.gz))
done

# ReMatCh command example
# Not necessary to run because the output will exactly the same as the previous command
# Added fake option to avoid ReMatCh to run
# Don't forget use screen
rematch.py \
    --workdir /media/volume/typing/streptococcus_agalactiae_example/rematch_ARgenes_localSamples/ \
    --reference /media/volume/DBs/resfinder_db/resfinder_db.fasta \
    --reportSequenceCoverage \
    --notWriteConsensus \
    --summary \
    --threads 8 \
    --avoidRunningReMatCh

# 10 samples out of 10 run successfully
# Runtime :0.0h:56.0m:30.66s

More informations:

Using bash for-loop
Listing only folders: ls --help or man ls and here
Using subcommands and other things
Create symbolic links: ln --help or man ln and here

for fastq in $(ls /media/volume/mydata/Fastqfiles/*.fq.gz); do \
   mkdir /media/volume/typing/staphylococcus_aureus/rematch_ARgenes_localSamples/$(basename $fastq | cut -d "_" -f 1); \
   ln -s $fastq /media/volume/typing/staphylococcus_aureus/rematch_ARgenes_localSamples/$(basename $fastq | cut -d "_" -f 1)/$(basename $fastq); \
done

Using de novo assembly

Typing with Abricate using Resfinder DB (the one previously updated)

Antibiotic resistance genes

In the VM

# Create folder to store Abricate outputs
mkdir /media/volume/typing/streptococcus_agalactiae_example/abricate

# Run Abricate for each sample
## List all produced assemblies and pass to parallel
### Run Docker without TTY (-t option)
### Map the assemblies folder in /data/ folder and the updated Abricate DB in Docker Abricate DB folder
### Redirect the Abricate output to a file named as the assembly but without the extension (using {/.})
### Note that the redirection should be done using filesystem paths outside the container
ls ~/genomes/streptococcus_agalactiae_example/all_assemblies/* | \
      parallel --jobs 8 'docker run --rm -u $(id -u):$(id -g) -v ~/genomes/streptococcus_agalactiae_example/all_assemblies/:/data/ -v /media/volume/DBs/abricate/:/NGStools/miniconda/db/ ummidock/abricate:latest abricate --db resfinder /data/{/} > /media/volume/typing/streptococcus_agalactiae_example/abricate/{/.}.abricate_out.tab'

# Summarize all Abricate results into a single file
## With Docker, when using * to refer multiple files, the paths obtained refer to the filesystem outside the container
## Therefore, a shell script file is created containing the command to be run inside the container
## Note that the command inside the shell script only contains paths refering to mapped folder /data/
echo 'abricate --summary /data/*.abricate_out.tab > /data/abricate_summary.resfinder.tab' > /media/volume/typing/streptococcus_agalactiae_example/abricate/abricate_commands.sh
## Run the summary Abricate command
docker run --rm -u $(id -u):$(id -g) -it -v /media/volume/typing/streptococcus_agalactiae_example/abricate/:/data/ -v /media/volume/DBs/abricate/:/NGStools/miniconda/db/ ummidock/abricate:latest \
      sh /data/abricate_commands.sh

More informations on what TTY is here

Remove image

To avoid VM space problems during the course, unused Docker images will be removed

In the VM

# List Docker images
docker images
# Remove Abricate image
# Find the Abricate image line starting with ummidock/abricate
# Get the Image ID, something like 1f467865b7f3
docker rmi <Abricate_Image_ID>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MPM_ReMatCh_Abricate.md

MPM_ReMatCh_Abricate.md

In silico typing using ReMatCh and Abricate

Get DB of antimicrobial resistance genes

Resfinder

Update Abricate DB

Typing samples

Using reads

Using de novo assembly

Remove image

Files

MPM_ReMatCh_Abricate.md

Latest commit

History

MPM_ReMatCh_Abricate.md

File metadata and controls

In silico typing using ReMatCh and Abricate

Get DB of antimicrobial resistance genes

Resfinder

Update Abricate DB

Typing samples

Using reads

Using de novo assembly

Remove image