Mob 3.0.1 new release #73

Merged
merged 5 commits into from
Oct 30, 2020
36 changes: 20 additions & 16 deletions README.md
@@ -1,22 +1,26 @@
![](https://img.shields.io/conda/dn/bioconda/mob_suite)
![](https://img.shields.io/docker/pulls/kbessonov/mob_suite)
![](https://img.shields.io/pypi/dm/mob-suite)
![](https://img.shields.io/github/v/release/phac-nml/mob-suite?include_prereleases)
![](https://img.shields.io/github/last-commit/phac-nml/mob-suite)
![](https://img.shields.io/github/issues/phac-nml/mob-suite)

# MOB-suite: Software tools for clustering, reconstruction and typing of plasmids from draft assemblies

## Introduction ##
## Introduction
Plasmids are mobile genetic elements (MGEs), which allow for rapid evolution and adaption of
bacteria to new niches through horizontal transmission of novel traits to different genetic
backgrounds. The MOB-suite is designed to be a modular set of tools for the typing and
reconstruction of plasmid sequences from WGS assemblies.


The MOB-suite depends on a series of databases which are too large to be hosted on GitHub. They can be downloaded or updated by running `mob_init`. Alternatively, when any of the tools is run for the first time, the databases are downloaded and initialized automatically unless you specify an alternate database location. Because the databases are large, the first run can take a long time, depending on your connection and the speed of your computer.
Databases can be manually downloaded from https://share.corefacility.ca/index.php/s/rYaAH7oxrSVtilN/download or https://zenodo.org/record/3786915/files/data.tar.gz?download=1. <br>
Our new automatic chromosome depletion feature in MOB-recon can be based on any collection of closed chromosome sequences but we have a prebuilt database available here: https://share.corefacility.ca/index.php/s/GJOgxxtbhWoX8fV/download
Databases can be manually downloaded from [here](https://share.corefacility.ca/index.php/s/rYaAH7oxrSVtilN/download) or [here](https://zenodo.org/record/3786915/files/data.tar.gz?download=1). <br>
Our new automatic chromosome depletion feature in MOB-recon can be based on any collection of closed chromosome sequences but we have a prebuilt database available [here](https://share.corefacility.ca/index.php/s/GJOgxxtbhWoX8fV/download).

### MOB-init
On first run of MOB-typer or MOB-recon, MOB-init should run to download the databases from figshare, sketch the databases and setup the blast databases. However, it can be run manually if the databases need to be re-initialized OR if you want to initialize the databases in an alternative directory.
On first run of MOB-typer or MOB-recon, MOB-init (invoked by `mob_init` command) should run to download the databases from figshare, sketch the databases and setup the blast databases. However, it can be run manually if the databases need to be re-initialized OR if you want to initialize the databases in an alternative directory.

```
% mob_init
```

### MOB-cluster
This tool creates plasmid similarity groups from fast genomic distance estimates computed with Mash. Plasmids are grouped into clusters using complete-linkage clustering, and the cluster code accessions assigned by the tool approximate operational taxonomic units (OTUs). The plasmid nomenclature is designed to group together highly similar plasmids which are unlikely to have multiple representatives within a single cell. It has a strong concordance with replicon and relaxase typing but is universally applicable, since it uses the complete sequence of the plasmid itself rather than specific biomarkers.
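
The grouping step described above can be illustrated with a minimal, standard-library-only sketch of complete-linkage clustering over pairwise distances. This is not the MOB-cluster implementation: the function name `complete_linkage`, the plasmid names, the distance values and the `0.05` threshold are all hypothetical and chosen only to show how the method behaves.

```python
from itertools import combinations


def complete_linkage(names, dist, threshold):
    """Greedily merge clusters while the farthest pair of members
    between two clusters is still within the threshold."""
    clusters = [[n] for n in names]

    def pair_distance(a, b):
        # The distance table stores each unordered pair only once.
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    def linkage(c1, c2):
        # Complete linkage: cluster distance is the maximum pairwise distance.
        return max(pair_distance(a, b) for a in c1 for b in c2)

    while len(clusters) > 1:
        (i, j), best = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > threshold:
            break  # no remaining pair of clusters is close enough
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return [sorted(c) for c in clusters]


# Hypothetical Mash distances between four plasmids (illustrative values only).
distances = {
    ("pA", "pB"): 0.01,
    ("pA", "pC"): 0.04,
    ("pB", "pC"): 0.08,
    ("pA", "pD"): 0.30,
    ("pB", "pD"): 0.32,
    ("pC", "pD"): 0.29,
}
clusters = complete_linkage(["pA", "pB", "pC", "pD"], distances, threshold=0.05)
print(clusters)
```

In this toy data `pC` is close to `pA` but not to `pB`, so complete linkage keeps it out of the `pA`/`pB` cluster. Every member of a cluster is therefore within the threshold of every other member, which is the property that makes the resulting groups tight.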
@@ -30,8 +34,9 @@ Provides in silico predictions of the replicon family, relaxase type, mate-pair
## Installation ##

## Requires
+ Python v. 3.7 +
+ ete3 >= 3
+ Python >= 3.7
+ ete3 >= 3.1.2
+ pandas >=0.22.0,<=1.0.5
+ biopython >= 1.70
+ pytables >= 3.3
+ pycurl >= 7.43
@@ -62,21 +67,20 @@ We recommend installing MOB-Suite via bioconda but you can install it via pip us

```
% pip3 install mob_suite

```

### Docker image
A docker image is also available at [https://hub.docker.com/r/kbessonov/mob_suite](https://hub.docker.com/r/kbessonov/mob_suite)
```
% docker pull kbessonov/mob_suite:2.0.0
% docker run --rm -v $(pwd):/mnt/ "kbessonov/mob_suite:2.0.0 " mob_recon -i /mnt/assembly.fasta -t -o /mnt/mob_recon_output

```
% docker pull kbessonov/mob_suite:3.0.1
% docker run --rm -v $(pwd):/mnt/ "kbessonov/mob_suite:3.0.1" mob_recon -i /mnt/assembly.fasta -t -o /mnt/mob_recon_output
```

### Singularity image
A singularity image could be built via singularity recipe donated by Eric Deveaud.
The recipe (`recipe.singularity`) is located in the singularity folder of this repository.
The docker image section also has instructions on how to create singularity image from a docker image.
The docker image [README section](https://hub.docker.com/repository/docker/kbessonov/mob_suite) also has instructions on how to create singularity image from a docker image.

```bash
% singularity build mobsuite.simg recipe.singularity
@@ -104,7 +108,6 @@ You can perform plasmid typing using a fasta formated file containing a single p

# Multiple independent plasmids
% mob_typer --multi --infile assembly.fasta --out_file sample_mobtyper_results.txt

```

## Using MOB-recon to reconstruct plasmids from draft assemblies
@@ -120,12 +123,13 @@ As of v. 3.0.0, we have added the ability of users to provide their own specific

```
### User sequence mask
% mob_recon --infile assembly.fasta --outdir my_out_dir --
% mob_recon --infile assembly.fasta --outdir my_out_dir --filter_db filter.fasta
```

As of v. 3.0.0, we have provided the ability to use a collection of closed genomes, which is quickly screened with Mash for genetically close genomes so that blast searches can be limited to those chromosomes. This more nuanced and automatic approach is recommended for users where there are sequences which should be filtered in one genomic context but not another. As an optional download, we provide a set of closed Enterobacteriaceae genomes from NCBI which can be used to provide added accuracy for some organisms such as E. coli and Klebsiella, where there are sequences which switch between chromosome and plasmids.
<br><br>
If reconstructed plasmids exceed the Mash distance for primary cluster assignment, then they will be assigned a name in the format novel_{md5}, where the md5 hash is calculated from all of the sequences belonging to that reconstructed plasmid. This provides a unique name for them, but any change to those sequences will change the md5 hash. Using these groups for further analyses is not advised. Rather, they should be highlighted as cases where targeted long-read sequencing is required to obtain a closer database representative of that plasmid.
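
As an illustration of why these names are unstable, the sketch below builds a `novel_{md5}` style name from a set of contig sequences. It is not MOB-recon's code: the helper `novel_cluster_name` is hypothetical, and the choice to upper-case and sort the sequences before hashing is an assumption made here so that contig order does not matter; the README does not state how the sequences are combined.

```python
import hashlib


def novel_cluster_name(sequences):
    """Build a 'novel_<md5>' name from the sequences of one reconstructed plasmid.

    Assumption for this sketch: sequences are upper-cased and sorted before
    hashing, so the name does not depend on contig order or letter case.
    """
    digest = hashlib.md5()
    for seq in sorted(s.upper() for s in sequences):
        digest.update(seq.encode("ascii"))
    return "novel_" + digest.hexdigest()


# Toy sequences, illustrative only.
name_a = novel_cluster_name(["ACGTACGT", "TTGACA"])
name_b = novel_cluster_name(["TTGACA", "ACGTACGT"])  # same contigs, other order
name_c = novel_cluster_name(["ACGTACGA", "TTGACA"])  # one base changed

print(name_a == name_b)  # order does not matter in this sketch
print(name_a == name_c)  # a single-base change yields a different name
```

Because the hash covers the full sequence content, re-assembling the same sample with slightly different contigs would give a different name, which is why such groups are poorly suited to downstream comparisons.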

```
### Autodetected close genome filter
% mob_recon --infile assembly.fasta --outdir my_out_dir -g 2019-11-NCBI-Enterobacteriacea-Chromosomes.fasta
2 changes: 1 addition & 1 deletion mob_suite/blast/__init__.py
@@ -6,7 +6,7 @@
import os

import pandas as pd
from pandas.io.common import EmptyDataError
from pandas.errors import EmptyDataError



28 changes: 14 additions & 14 deletions mob_suite/conda/meta.yaml
@@ -1,4 +1,4 @@
{% set version = "3.0.0" %}
{% set version = "3.0.1" %}

package:
name: mob_suite
@@ -10,27 +10,27 @@ build:
script: python -m pip install --no-deps --ignore-installed .

source:
#path: /root/mob_suite/mob-suite
path: /root/mob_suite/mob-suite
#url: https://github.com/phac-nml/mob-suite/archive/{{ version }}.tar.gz
#sha256: 221dc24eb6d98b119c25cabff5110709cd345790d9836cf5865bec9262fddc3f
git_url: https://github.com/phac-nml/mob-suite.git
git_rev: master
#git_url: https://github.com/phac-nml/mob-suite.git
#git_rev: master

requirements:
host:
- python >=3.7
- pip
run:
- python >=3.7
- numpy >=1.11.1
- pytables >=3.3
- pandas >=0.22.0
- biopython >=1.70
- pycurl >=7.43
- scipy >=1.1
- ete3 >=3.0
- blast >= 2.9.0
- mash >= 2.2.2
- python >=3.7,<4
- numpy >=1.11.1,<2
- pytables >=3.3,<4
- pandas >=0.22.0,<=1.0.5
- biopython >=1.70,<2
- pycurl >=7.43,<8
- scipy >=1.1,<2
- ete3 >=3.0,<4
- blast >=2.9.0,<3
- mash >=2.2.2,<3


test:
149 changes: 100 additions & 49 deletions mob_suite/wrappers/mob_recon.xml
@@ -1,89 +1,140 @@
<tool id="mob_recon" name="MOB-Recon" version="1.4.8">
<tool id="mob_recon" name="MOB-Recon" version="3.0.0">
<description>Type contigs and extract plasmid sequences</description>
<requirements>
<requirement type="package" version="1.4.8">mob_suite</requirement>
</requirements>
<requirement type="package" version="3.0.0">mob_suite</requirement>
</requirements>
<version_command>mob_recon --version</version_command>
<command detect_errors="exit_code">
<![CDATA[
#import re
#import os.path

#set $named_input = re.sub(r'(\s|\(|\)|:|!)', '_', str($input.element_identifier)+".fasta")
ln -s "$input" $named_input &&
#set $named_input = re.sub(r'(\s|\(|\)|:|!)', '_', str($input.element_identifier)+'.fasta')
ln -s '$input' '$named_input' &&

mob_recon --num_threads \${GALAXY_SLOTS:-4} --infile "${named_input}"
#if str($adv_param.unicycler_contigs) == "True":

mob_recon --num_threads \${GALAXY_SLOTS:-4} --infile '${named_input}' --run_typer

#if $adv_param.unicycler_contigs:
--unicycler_contigs
#end if
#if str($adv_param.run_circlator) == "True":
--run_circlator
#end if
#if str($adv_param.min_length_condition.min_length_param) == "True":
--min_length ${adv_param.min_length_condition.min_length_value}

#if $adv_param.run_overhang:
--run_overhang
#end if
--run_typer --min_rep_evalue '${adv_param.min_rep_evalue}'

#if $adv_param.debug:
--debug
#end if

#if $adv_param.plasmid_db
--plasmid_db '$adv_param.plasmid_db'
#end if

#if $adv_param.plasmid_mash_db
--plasmid_mash_db '$adv_param.plasmid_mash_db'
#end if

#if $adv_param.plasmid_meta
--plasmid_meta '$adv_param.plasmid_meta'
#end if

#if $adv_param.repetitive_mask
--repetitive_mask '$adv_param.repetitive_mask'
#end if

#if $adv_param.plasmid_mob
--plasmid_mob '$adv_param.plasmid_mob'
#end if

#if $adv_param.plasmid_mpf
--plasmid_mpf '$adv_param.plasmid_mpf'
#end if

#if $adv_param.plasmid_orit
--plasmid_orit '$adv_param.plasmid_orit'
#end if

--min_length '${adv_param.min_length}'
--min_rep_evalue '${adv_param.min_rep_evalue}'
--min_rep_evalue '${adv_param.min_rep_evalue}'
--min_mob_evalue '${adv_param.min_mob_evalue}'
--min_con_evalue '${adv_param.min_con_evalue}'
--min_rep_ident '${adv_param.min_rep_ident}'
--min_mob_ident '${adv_param.min_mob_ident}'
--min_con_ident '${adv_param.min_con_ident}'
--min_rpp_ident '${adv_param.min_rpp_ident}'
--outdir '.' &&
mkdir ./sequences && (cp plasmid*.fasta chromosome.fasta ./sequences 2> /dev/null || true)

--min_rep_cov '${adv_param.min_rep_cov}'
--min_mob_cov '${adv_param.min_mob_cov}'
--min_con_cov '${adv_param.min_con_cov}'
--min_rpp_cov '${adv_param.min_rpp_cov}'
--outdir 'outdir' &&
mkdir ./outdir/plasmids && (mv outdir/plasmid*.fasta ./outdir/plasmids 2> /dev/null || true)
]]>
</command>
<inputs>
<param name="input" type="data" format="fasta" label="Input" help="FASTA file with contig(s)"/>
<section name="adv_param" title="Advanced parameters" expanded="False">
<param name="unicycler_contigs" label="Check for circularity flag generated by unicycler in contigs fasta headers" type="select" value="True">
<option value="True">Yes</option>
<option value="False">No</option>
</param>
<param name="run_circlator" label="Run circlator minums2 pipeline to check for circular contigs" type="select" value="True">
<option value="True">Yes</option>
<option value="False">No</option>
</param>
<conditional name="min_length_condition">
<param name="min_length_param" label="Minimum length of contigs to process" type="select" value="False">
<option value="False">No</option>
<option value="True">Yes</option>
</param>
<when value="True">
<param name="min_length_value" type="integer" value="500" min="50"/>
</when>
<when value="False"/>
</conditional>
<param name="unicycler_contigs" type="boolean" truevalue="true" falsevalue="" checked="true" label="Check for circularity flag generated by unicycler in contigs fasta headers?"/>
<param name="run_overhang" type="boolean" truevalue="true" falsevalue="" checked="true" label="Detect circular contigs (i.e. potential plasmids) with assembly overhangs?"/>
<param name="debug" type="boolean" truevalue="true" falsevalue="" checked="false" label="Provide debug information?"/>

<param name="min_rep_evalue" label="Minimum evalue threshold for replicon blastn" type="float" min="0.00001" max="1" value="0.00001"/>
<param name="min_mob_evalue" label="Minimum evalue threshold for relaxase tblastn" type="float" min="0.00001" max="1" value="0.00001"/>
<param name="min_con_evalue" label="Minimum evalue threshold for contig blastn" type="float" min="0.00001" max="1" value="0.00001"/>
<param name="min_rpp_evalue" label="Minimum evalue threshold for repetitive elements blastn" type="float" min="0.00001" max="1" value="0.00001"/>
<param name="min_length" label="Minimum length of contigs to classify" type="integer" value="1000"/>
<param name="min_rep_ident" label="Minimum sequence identity for replicons" type="integer" min="0" max="100" value="80"/>
<param name="min_mob_ident" label="Minimum sequence identity for relaxases" type="integer" min="0" max="100" value="80"/>
<param name="min_con_ident" label="Minimum sequence identity for contigs" type="integer" min="0" max="100" value="80"/>
<param name="min_rpp_ident" label="Minimum sequence identity for repetitive elements" type="integer" min="0" max="100" value="80"/>

<param name="min_rep_cov" label="Minimum percentage coverage of replicon query by input assembly" type="integer" min="0" max="100" value="80"/>
<param name="min_mob_cov" label="Minimum percentage coverage of relaxase query by input assembly" type="integer" min="0" max="100" value="80"/>
<param name="min_con_cov" label="Minimum percentage coverage of assembly contig by the plasmid reference database to be considered" type="integer" min="0" max="100" value="60"/>
<param name="min_rpp_cov" label="Minimum percentage coverage of contigs by repetitive elements" type="integer" min="0" max="100" value="80"/>

<param name="plasmid_db" optional="true" type="data" format="fasta" label="Reference Database of complete plasmids" help=""/>
<param name="plasmid_mash_db" optional="true" type="data" format="binary" label="Custom MASH database of plasmids" help="MASH sketch of the reference plasmids database"/>
<param name="plasmid_meta" type="data" optional="true" format="text" label="Plasmid cluster metadata file" help=""/>
<param name="plasmid_replicons" type="data" optional="true" format="fasta" label="FASTA file with plasmid replicons" help=""/>
<param name="repetitive_mask" type="data" optional="true" format="fasta" label="FASTA of known repetitive elements" help=""/>
<param name="plasmid_mob" type="data" optional="true" format="fasta" label="FASTA of plasmid relaxases" help=""/>
<param name="plasmid_mpf" type="data" optional="true" format="fasta" label="FASTA of known plasmid mate-pair proteins" help=""/>
<param name="plasmid_orit" type="data" optional="true" format="fasta" label="FASTA of known plasmid oriT dna sequences" help=""/>
</section>
</inputs>
<outputs>
<data name="outfile1" format="tabular" from_work_dir="contig_report.txt" label="${tool.name} on ${on_string}: Overall contig MOB-recon report"/>
<data name="outfile2" format="tabular" from_work_dir="repetitive_blast_report.txt" label="${tool.name} on ${on_string}: Repetitive elements BLAST report"/>
<data name="outfile3" format="tabular" from_work_dir="mobtyper_aggregate_report.txt" label="${tool.name} on ${on_string}: Aggregate MOB-typer report for all contigs"/>
<collection name="seqhits" type="list" label="${tool.name} on ${on_string}: Extracted sequences (plasmids,chromosome(s))">
<discover_datasets pattern="__name_and_ext__" directory="sequences" />
<data name="contig_report" format="tabular" from_work_dir="outdir/contig_report.txt" label="${tool.name} on ${input.element_identifier}: Overall contig MOB-recon report"/>
<data name="mobtyper_aggregate_report" format="tabular" from_work_dir="outdir/mobtyper_results.txt" label="${tool.name} on ${input.element_identifier}: Aggregate MOB-typer report for all contigs"/>
<data name="chromosome" format="fasta" from_work_dir="outdir/chromosome.fasta" label="${tool.name} on ${input.element_identifier}: Chromosomal sequences"/>
<collection name="plasmids" type="list" label="${tool.name} on ${input.element_identifier}: Plasmids">
<discover_datasets pattern="__name_and_ext__" directory="outdir/plasmids" />
</collection>
</outputs>
<tests>
<test>
<param name="input" value="plasmid_476.fasta" ftype="fasta"/>
<section name="adv_param">
<param name="unicycler_contigs" value="True"/>
<param name="run_circlator" value="True"/>
</section>
<output name="outfile1">
<assert_contents>
<has_text text="NC_019097"/>
</assert_contents>
</output>
<param name="input" value="Ecoli_strain_KV7_complete_LT795502.fasta" ftype="fasta"/>
<section name="adv_param">
<param name="unicycler_contigs" value="True"/>
<param name="run_overhang" value="True"/>
</section>
<output name="contig_report">
<assert_contents>
<has_text text="chromosome"/>
<has_text text="plasmid"/>
<has_text text="IncHI1A"/>
<has_text text="IncN"/>
</assert_contents>
</output>
<output name="mobtyper_aggregate_report">
<assert_contents>
<has_text text="conjugative"/>
<has_text text="Gammaproteobacteria"/>
<has_text text="223020"/>
</assert_contents>
</output>
</test>
</tests>
<help>
@@ -96,7 +147,7 @@ For more information please visit https://github.com/phac-nml/mob-suite/.

**Workflow**

This preliminary \"Mobilome and Resistome Analysis Workflow\" linking mob_recon with staramr provides reports on mobilome and resistome for a given isolate given a draft genome assembly. The workflow is located in Shared Data --> Workflows --> Mobilome and Resistome Analysis Workflow (MOB-Recon and STARAMR). The workflow file can also be mamanually downloaded from https://raw.githubusercontent.com/phac-nml/galaxy_tools/master/tools/mob_suite/workflows/AMRworkflow_STARAMR.ga.
This preliminary \"Mobilome and Resistome Analysis Workflow\" linking mob_recon with staramr provides reports on mobilome and resistome for a given isolate given a draft genome assembly. The workflow is located in Shared Data --> Workflows --> Mobilome and Resistome Analysis Workflow (MOB-Recon and STARAMR). The workflow file can also be manually downloaded from https://raw.githubusercontent.com/phac-nml/galaxy_tools/master/tools/mob_suite/workflows/AMRworkflow_STARAMR.ga.

-----
