Note 1: replace whatever is between <>
with the proper value. For example, in "Organize the data" <your_species_name>
, write the species name you selected (something like campylobacter_jejuni
).
Note 2: if the VM has 16 CPUs, use 16
in CPUs/threads instead of 8
.
Note 3: do the steps bellow for the bacteria species of your choise. Streptococcus agalactiae is used as example.
Download a database containing antimicrobial resistance genes
What is Resfinder?
ResFinder identifies acquired antimicrobial resistance genes and/or find chromosomal mutations in total or partial sequenced isolates of bacteria
But it also provides a curated database for antimicrobial resistance genes.
Get ResFinder DB
In your computer
In Genomic Epidemiology databases Bitbucket webpage:
- resfinder_db
- Copy the HTTPS link
In the VM
Althoug ResFinder is a curated database, small details impair a simple concatenation for downstream use. Therefore several steps are required to properlly prepare the DB.
# Get ResFinder DB
git clone https://bitbucket.org/genomicepidemiology/resfinder_db.git
mv resfinder_db/ /media/volume/DBs/
# Create a single file containing the sequences of genes confering resistance to all antibiotic classes together
## Make sure all files end with a newline to avoid concatening two sequences together
mkdir temp_concat_resfinder
ls /media/volume/DBs/resfinder_db/*.fsa | parallel --jobs 8 'cp {} temp_concat_resfinder/; echo "" >> temp_concat_resfinder/{/}'
## Correct several details
### Concatenate all files
### Convert DOS to UNIX newlines special chracters
### Ignore blank lines
### Pass everytnhing to parallel
#### Split everything that enters parallel by sequence (--recstart '>')
#### cat receives each sequence
#### Each sequence is saved in seq variable
#### The header line is saved by grepping the line with ">" while removing trailing spaces with sed
#### The sequnce is saved in a file named as the sequence header without the '>' character, therefore removing duplicated sequences (at least on headers level)
mkdir temp_concat_resfinder/seq_file
cat temp_concat_resfinder/*.fsa | \
sed $'s/\r$//' | \
grep --invert-match "^$" | \
parallel --jobs 1 --pipe -N 1 --recstart '>' 'cat | { seq=$(cat); header=$(echo "$seq" | grep ">" | sed -e "s/[[:space:]]*$//"); echo "$seq" > temp_concat_resfinder/seq_file/${header:1}.fasta; }'
# Concatenate the cleaned sequences
cat temp_concat_resfinder/seq_file/*.fasta > /media/volume/DBs/resfinder_db/resfinder_db.fasta
rm -r temp_concat_resfinder/
More informations:
- Adding newlines at the end of files
- Reading different files together:
cat --help
orman cat
, and here - Converting newline characters from DOS to UNIX
- Skipping blank lines
- Removing trailing spaces
- Removing characters from the beggining of a string
After a Docker image is built, its content is static, which mean it might carry outdated databases. However, some small steps can overwrite the provided DBs with updated ones.
In the VM
Before start, make sure Abricate image is already in the VM
docker pull ummidock/abricate:latest
Start updating Abricate DB
# Create folder to store Abricate DBs
mkdir /media/volume/DBs/abricate
# Copy DB folder to outside image
## External folder will overwrite image inside folder when mapped
docker run --rm -u $(id -u):$(id -g) -it -v /media/volume/DBs/abricate/:/data/ ummidock/abricate:latest \
cp -r /NGStools/miniconda/db/ /data/
mv /media/volume/DBs/abricate/db/* /media/volume/DBs/abricate/
rm -r /media/volume/DBs/abricate/db/
# Update DB in a folder that will persist after container stops
docker run --rm -u $(id -u):$(id -g) -it -v /media/volume/DBs/abricate/:/NGStools/miniconda/db/ ummidock/abricate:latest \
abricate-get_db --db resfinder --force
Two different approaches can be used to type HTS data: directly from reads using a reference mapping approach, or throught similarity sequence search using an already produced genome assembly.
- Since the similarity sequence search approach uses a reduced representation of original HTS reads dataset, it's a very fast method
- Therefore, we will leave the first read command running and move to de novo assembly approach
In the VM
# Create folder to store typing results
mkdir /media/volume/typing
mkdir /media/volume/typing/<your_species_name>
# Create a folder for Streptococcus agalactiae example
mkdir /media/volume/typing/streptococcus_agalactiae_example
MLST for a given taxon
Let's start with Multi-Locus Sequence Typing (MLST), and type all HTS reads available for a given taxon using ReMatCh.
- For example purposes, we will use Bartonella bacilliformis because only around 30 reads datasets are available
- Only Illumina reads will be used
In the VM
# Run inside a screen
screen -S rematch_mlst_Bb
rematch.py \
--workdir /media/volume/typing/rematch_mlst_Bb_example/ \
--mlst "Bartonella bacilliformis" \
--mlstReference \
--doubleRun \
--threads 8 \
--taxon "Bartonella bacilliformis" \
--asperaKey ~/NGStools/aspera/connect/etc/asperaweb_id_dsa.openssh \
--SRAopt
# Detatch the screen
# Press Ctrl + A (release) and then D
# 29 samples out of 30 run successfully
# Runtime :0.0h:55.0m:6.53s
Antibiotic resistance genes
To screen antibiotic resistance genes we will use the Resfinder DB obtained before. For that, we will:
- Move back to Streptococcus agalactiae example
- Re-download example's samples while running, or using already downloaded reads data
In the VM
Re-download example's samples while running
# Run inside a screen
screen -S rematch_ARgenes_download
rematch.py \
--workdir /media/volume/typing/streptococcus_agalactiae_example/rematch_ARgenes_download/ \
--reference /media/volume/DBs/resfinder_db/resfinder_db.fasta \
--reportSequenceCoverage \
--notWriteConsensus \
--summary \
--threads 8 \
--listIDs ~/reads/streptococcus_agalactiae_example/ids.txt \
--asperaKey ~/NGStools/aspera/connect/etc/asperaweb_id_dsa.openssh \
--SRAopt
# Detatch the screen
# Press Ctrl + A (release) and then D
# 10 samples out of 10 run successfully
# Runtime :0.0h:59.0m:54.58s
In the VM
Using already downloaded reads data
# Create a ReMatCh working directory
mkdir /media/volume/typing/streptococcus_agalactiae_example/rematch_ARgenes_localSamples
# Set ReMatCh working directory
## Using a bash for-loop for that
## Create symlinks instead of duplicate data
### Iterate over samples directories from reads folders
#### Create sample directory by obtaining only last directory name instead of directory path using basename
#### Create symbolic links to sample's reads inside newly created sample's directory
for sample_dir in $(ls -d ~/reads/streptococcus_agalactiae_example/*/); do
mkdir /media/volume/typing/streptococcus_agalactiae_example/rematch_ARgenes_localSamples/$(basename $sample_dir)
ln -s $(ls ${sample_dir}*_1.fq.gz) /media/volume/typing/streptococcus_agalactiae_example/rematch_ARgenes_localSamples/$(basename $sample_dir)/$(basename $(ls ${sample_dir}*_1.fq.gz))
ln -s $(ls ${sample_dir}*_2.fq.gz) /media/volume/typing/streptococcus_agalactiae_example/rematch_ARgenes_localSamples/$(basename $sample_dir)/$(basename $(ls ${sample_dir}*_2.fq.gz))
done
# ReMatCh command example
# Not necessary to run because the output will exactly the same as the previous command
# Added fake option to avoid ReMatCh to run
# Don't forget use screen
rematch.py \
--workdir /media/volume/typing/streptococcus_agalactiae_example/rematch_ARgenes_localSamples/ \
--reference /media/volume/DBs/resfinder_db/resfinder_db.fasta \
--reportSequenceCoverage \
--notWriteConsensus \
--summary \
--threads 8 \
--avoidRunningReMatCh
# 10 samples out of 10 run successfully
# Runtime :0.0h:56.0m:30.66s
More informations:
- Using bash for-loop
- Listing only folders:
ls --help
orman ls
and here - Using subcommands and other things
- Create symbolic links:
ln --help
orman ln
and here
for fastq in $(ls /media/volume/mydata/Fastqfiles/*.fq.gz); do \
mkdir /media/volume/typing/staphylococcus_aureus/rematch_ARgenes_localSamples/$(basename $fastq | cut -d "_" -f 1); \
ln -s $fastq /media/volume/typing/staphylococcus_aureus/rematch_ARgenes_localSamples/$(basename $fastq | cut -d "_" -f 1)/$(basename $fastq); \
done
Typing with Abricate using Resfinder DB (the one previously updated)
Antibiotic resistance genes
In the VM
# Create folder to store Abricate outputs
mkdir /media/volume/typing/streptococcus_agalactiae_example/abricate
# Run Abricate for each sample
## List all produced assemblies and pass to parallel
### Run Docker without TTY (-t option)
### Map the assemblies folder in /data/ folder and the updated Abricate DB in Docker Abricate DB folder
### Redirect the Abricate output to a file named as the assembly but without the extension (using {/.})
### Note that the redirection should be done using filesystem paths outside the container
ls ~/genomes/streptococcus_agalactiae_example/all_assemblies/* | \
parallel --jobs 8 'docker run --rm -u $(id -u):$(id -g) -v ~/genomes/streptococcus_agalactiae_example/all_assemblies/:/data/ -v /media/volume/DBs/abricate/:/NGStools/miniconda/db/ ummidock/abricate:latest abricate --db resfinder /data/{/} > /media/volume/typing/streptococcus_agalactiae_example/abricate/{/.}.abricate_out.tab'
# Summarize all Abricate results into a single file
## With Docker, when using * to refer multiple files, the paths obtained refer to the filesystem outside the container
## Therefore, a shell script file is created containing the command to be run inside the container
## Note that the command inside the shell script only contains paths refering to mapped folder /data/
echo 'abricate --summary /data/*.abricate_out.tab > /data/abricate_summary.resfinder.tab' > /media/volume/typing/streptococcus_agalactiae_example/abricate/abricate_commands.sh
## Run the summary Abricate command
docker run --rm -u $(id -u):$(id -g) -it -v /media/volume/typing/streptococcus_agalactiae_example/abricate/:/data/ -v /media/volume/DBs/abricate/:/NGStools/miniconda/db/ ummidock/abricate:latest \
sh /data/abricate_commands.sh
More informations on what TTY is here
To avoid VM space problems during the course, unused Docker images will be removed
In the VM
# List Docker images
docker images
# Remove Abricate image
# Find the Abricate image line starting with ummidock/abricate
# Get the Image ID, something like 1f467865b7f3
docker rmi <Abricate_Image_ID>