phac-nml · kbessonov1984 · Oct 30, 2020 · Aug 4, 2020 · Oct 30, 2020 · Oct 30, 2020
diff --git a/README.md b/README.md
@@ -1,22 +1,26 @@
+![](https://img.shields.io/conda/dn/bioconda/mob_suite)
+![](https://img.shields.io/docker/pulls/kbessonov/mob_suite)
+![](https://img.shields.io/pypi/dm/mob-suite)
+![](https://img.shields.io/github/v/release/phac-nml/mob-suite?include_prereleases)
+![](https://img.shields.io/github/last-commit/phac-nml/mob-suite)
+![](https://img.shields.io/github/issues/phac-nml/mob-suite)
+
 # MOB-suite: Software tools for clustering, reconstruction and typing of plasmids from draft assemblies
 
-## Introduction ## 
+## Introduction
 Plasmids are mobile genetic elements (MGEs), which allow for rapid evolution and adaption of
 bacteria to new niches through horizontal transmission of novel traits to different genetic
 backgrounds. The MOB-suite is designed to be a modular set of tools for the typing and
 reconstruction of plasmid sequences from WGS assemblies.
 
 
 The MOB-suite depends on a series of databases which are too large to be hosted on GitHub. They can be downloaded or updated by running `mob_init`. Alternatively, if you run any of the tools for the first time without specifying an alternate database location, the databases will download and initialize automatically. Because they are quite large, the first run will take a long time, depending on your connection and the speed of your computer.
-Databases can be manually downloaded from https://share.corefacility.ca/index.php/s/rYaAH7oxrSVtilN/download or https://zenodo.org/record/3786915/files/data.tar.gz?download=1. <br>
-Our new automatic chromosome depletion feature in MOB-recon can be based on any collection of closed chromosome sequences but we have a prebuilt database available here: https://share.corefacility.ca/index.php/s/GJOgxxtbhWoX8fV/download
+Databases can be manually downloaded from [here](https://share.corefacility.ca/index.php/s/rYaAH7oxrSVtilN/download) or [here](https://zenodo.org/record/3786915/files/data.tar.gz?download=1). <br>
+Our new automatic chromosome depletion feature in MOB-recon can be based on any collection of closed chromosome sequences but we have a prebuilt database available [here](https://share.corefacility.ca/index.php/s/GJOgxxtbhWoX8fV/download).
 
 ### MOB-init
-On first run of MOB-typer or MOB-recon, MOB-init should run to download the databases from figshare, sketch the databases and setup the blast databases. However, it can be run manually if the databases need to be re-initialized OR if you want to initialize the databases in an alternative directory.
+On first run of MOB-typer or MOB-recon, MOB-init (invoked by the `mob_init` command) should run to download the databases from figshare, sketch the databases and set up the BLAST databases. However, it can be run manually if the databases need to be re-initialized or if you want to initialize the databases in an alternative directory.
 
-```
-% mob_init
-```
 
 ### MOB-cluster
 This tool creates plasmid similarity groups using fast genomic distance estimation with Mash. Plasmids are grouped into clusters using complete-linkage clustering, and the cluster code accessions provided by the tool approximate operational taxonomic units (OTUs). The plasmid nomenclature is designed to group together highly similar plasmids which are unlikely to have multiple representatives within a single cell. It has a strong concordance with replicon and relaxase typing but is universally applicable, since it uses the complete sequence of the plasmid itself rather than specific biomarkers.
@@ -30,8 +34,9 @@ Provides in silico predictions of the replicon family, relaxase type, mate-pair
 ## Installation ##
 
 ## Requires
-+ Python v. 3.7 +
-+ ete3 >= 3
++ Python >= 3.7
++ ete3 >= 3.1.2
++ pandas >=0.22.0,<=1.0.5
 + biopython >= 1.70
 + pytables >= 3.3
 + pycurl >= 7.43
@@ -62,21 +67,20 @@ We recommend installing MOB-Suite via bioconda but you can install it via pip us
 
 ```
 % pip3 install mob_suite
-
 ```
 
 ### Docker image
 A docker image is also available at [https://hub.docker.com/r/kbessonov/mob_suite](https://hub.docker.com/r/kbessonov/mob_suite)
-```
-% docker pull kbessonov/mob_suite:2.0.0 
-% docker run --rm -v $(pwd):/mnt/ "kbessonov/mob_suite:2.0.0 " mob_recon -i /mnt/assembly.fasta -t -o /mnt/mob_recon_output
 
+```
+% docker pull kbessonov/mob_suite:3.0.1 
+% docker run --rm -v $(pwd):/mnt/ "kbessonov/mob_suite:3.0.1" mob_recon -i /mnt/assembly.fasta -t -o /mnt/mob_recon_output
 ```
 
 ### Singularity image
 A singularity image could be built via singularity recipe donated by Eric Deveaud. 
 The recipe (`recipe.singularity`) is located in the singularity folder of this repository. 
-The docker image section also has instructions on how to create singularity image from a docker image.
+The docker image [README section](https://hub.docker.com/repository/docker/kbessonov/mob_suite) also has instructions on how to create a singularity image from a docker image.
 
 ```bash
 % singularity build mobsuite.simg recipe.singularity
@@ -104,7 +108,6 @@ You can perform plasmid typing using a fasta formated file containing a single p
 
 # Multiple independent plasmids
 % mob_typer --multi --infile assembly.fasta --out_file sample_mobtyper_results.txt
-
 ```
 
 ## Using MOB-recon to reconstruct plasmids from draft assemblies
@@ -120,12 +123,13 @@ As of v. 3.0.0, we have added the ability of users to provide their own specific
 
 ```
 ### User sequence mask
-% mob_recon --infile assembly.fasta --outdir my_out_dir --
+% mob_recon --infile assembly.fasta --outdir my_out_dir --filter_db filter.fasta
 ```
 
 As of v. 3.0.0, we have provided the ability to use a collection of closed genomes, which is quickly screened with Mash for genomes that are genetically close to the input so that blast searches are limited to those chromosomes. This more nuanced and automatic approach is recommended where there are sequences which should be filtered in one genomic context but not another. We provide, as an optional download, a set of closed Enterobacteriaceae genomes from NCBI which can be used to provide added accuracy for some organisms such as E. coli and Klebsiella, where there are sequences which switch between chromosome and plasmids.
 <br><br>
 If reconstructed plasmids exceed the Mash distance for primary cluster assignment, then they will get assigned a name in the format novel_{md5}, where the md5 hash is calculated based on all of the sequences belonging to that reconstructed plasmid. This will provide a unique name for them, but any change in the sequences will result in a change in the md5 hash. It is inadvisable to use these groups for further analyses. Rather, they should be highlighted as cases where targeted long read sequencing is required to obtain a closer database representative of that plasmid.
+
 ```
 ### Autodetected close genome filter
 % mob_recon --infile assembly.fasta --outdir my_out_dir -g 2019-11-NCBI-Enterobacteriacea-Chromosomes.fasta

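Reviewer note on the `novel_{md5}` naming described in the README hunk above: the sketch below shows how a label of that shape can be derived from a set of contig sequences, and why any change to a sequence changes the label. The helper name `novel_cluster_name`, the upper-casing, the sorting and the newline separator are all assumptions made for illustration; the README does not say how MOB-recon orders or joins the sequences before hashing.

```python
import hashlib


def novel_cluster_name(sequences):
    """Hypothetical helper: build a novel_{md5} label from contig sequences."""
    digest = hashlib.md5()
    # Assumption: sort so the label does not depend on contig order.
    for seq in sorted(s.upper() for s in sequences):
        digest.update(seq.encode("ascii"))
        # Assumption: separator keeps ["AC", "GT"] distinct from ["ACGT"].
        digest.update(b"\n")
    return "novel_" + digest.hexdigest()


same_a = novel_cluster_name(["ACGT", "GGCC"])
same_b = novel_cluster_name(["GGCC", "ACGT"])   # same contigs, other order
changed = novel_cluster_name(["ACGT", "GGCA"])  # one base differs

print(same_a == same_b)   # order does not matter in this sketch
print(same_a == changed)  # a single base change yields a different label
```

Because the label is a digest of the sequences themselves, it is only stable for byte-identical reconstructions, which is consistent with the README's advice not to use these groups for further analyses.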
diff --git a/mob_suite/blast/__init__.py b/mob_suite/blast/__init__.py
@@ -6,7 +6,7 @@
 import os
 
 import pandas as pd
-from pandas.io.common import EmptyDataError
+from pandas.errors import EmptyDataError
 
 
 

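Reviewer note on the import change above: `EmptyDataError` is exposed publicly as `pandas.errors.EmptyDataError`, and `pandas.read_csv` raises it when the input has no columns to parse. The sketch below shows the usual pattern for tolerating an empty tabular report. The function name `read_blast_table` and the tab-separated, header-less layout are assumptions for illustration, not the module's actual code.

```python
import io

import pandas as pd
from pandas.errors import EmptyDataError


def read_blast_table(handle):
    """Hypothetical helper: read a tab-separated report, tolerating empty input."""
    try:
        return pd.read_csv(handle, sep="\t", header=None)
    except EmptyDataError:
        # No hits were written, so return an empty frame instead of failing.
        return pd.DataFrame()


empty_result = read_blast_table(io.StringIO(""))
one_hit = read_blast_table(io.StringIO("contig_1\trep_cluster_2\n"))

print(empty_result.empty)
print(one_hit.shape)
```

Importing from `pandas.errors` rather than from an internal module such as `pandas.io.common` keeps the code on the documented public path.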
diff --git a/mob_suite/conda/meta.yaml b/mob_suite/conda/meta.yaml
@@ -1,4 +1,4 @@
-{% set version = "3.0.0" %}
+{% set version = "3.0.1" %}
 
 package:
  name: mob_suite
@@ -10,27 +10,27 @@ build:
  script: python -m pip install --no-deps --ignore-installed .
 
 source:
- #path: /root/mob_suite/mob-suite
+ path: /root/mob_suite/mob-suite
  #url: https://github.com/phac-nml/mob-suite/archive/{{ version }}.tar.gz
  #sha256: 221dc24eb6d98b119c25cabff5110709cd345790d9836cf5865bec9262fddc3f
- git_url: https://github.com/phac-nml/mob-suite.git
- git_rev: master
+ #git_url: https://github.com/phac-nml/mob-suite.git
+ #git_rev: master
 
 requirements:
  host:
  - python >=3.7
  - pip
  run:
- - python >=3.7
- - numpy >=1.11.1
- - pytables >=3.3
- - pandas >=0.22.0
- - biopython >=1.70
- - pycurl >=7.43
- - scipy >=1.1
- - ete3 >=3.0
- - blast >= 2.9.0
- - mash >= 2.2.2
+ - python >=3.7,<4
+ - numpy >=1.11.1,<2
+ - pytables >=3.3,<4
+ - pandas >=0.22.0,<=1.0.5
+ - biopython >=1.70,<2
+ - pycurl >=7.43,<8
+ - scipy >=1.1,<2
+ - ete3 >=3.0,<4
+ - blast >=2.9.0,<3
+ - mash >=2.2.2,<3
 
 
 test:

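Reviewer note on the version bounds in the `run:` requirements above: the sketch below checks a version string against a comma-separated specifier such as `>=0.22.0,<=1.0.5` using plain numeric comparison of the dotted components. It is a simplified stand-in written for illustration (the names `parse_version` and `satisfies` are hypothetical) and does not reproduce conda's or pip's full matching rules. It does show one practical point: under numeric component comparison a bound spelled `1.05` is read as 1.5, which is looser than `1.0.5`.

```python
import operator

# Two-character operators are listed first so ">=" is not mistaken for ">".
OPERATORS = [
    (">=", operator.ge),
    ("<=", operator.le),
    ("==", operator.eq),
    (">", operator.gt),
    ("<", operator.lt),
]


def parse_version(text):
    """Turn '1.0.5' into (1, 0, 5); only plain numeric versions are handled."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(version, spec):
    """Return True if version meets every clause of a spec like '>=1.1,<2'."""
    have = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol, compare in OPERATORS:
            if clause.startswith(symbol):
                bound = parse_version(clause[len(symbol):])
                # Pad with zeros so (1, 0) and (1, 0, 0) compare as equal.
                width = max(len(have), len(bound))
                left = have + (0,) * (width - len(have))
                right = bound + (0,) * (width - len(bound))
                if not compare(left, right):
                    return False
                break
        else:
            raise ValueError("unsupported clause: " + clause)
    return True


print(satisfies("1.0.5", ">=0.22.0,<=1.0.5"))  # inside the pinned range
print(satisfies("1.1.0", ">=0.22.0,<=1.0.5"))  # above the upper bound
print(satisfies("1.1.0", ">=0.22.0,<=1.05"))   # '1.05' is read as 1.5 here
```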
diff --git a/mob_suite/wrappers/mob_recon.xml b/mob_suite/wrappers/mob_recon.xml
@@ -1,89 +1,140 @@
-<tool id="mob_recon" name="MOB-Recon" version="1.4.8">
+<tool id="mob_recon" name="MOB-Recon" version="3.0.0">
  <description>Type contigs and extract plasmid sequences</description>
  <requirements>
- <requirement type="package" version="1.4.8">mob_suite</requirement>
- </requirements> 
+ <requirement type="package" version="3.0.0">mob_suite</requirement>
+ </requirements>
+ <version_command>mob_recon --version</version_command>
  <command detect_errors="exit_code">
  <![CDATA[ 
  #import re
  #import os.path
 
- #set $named_input = re.sub(r'(\s|\(|\)|:|!)', '_', str($input.element_identifier)+".fasta")
- ln -s "$input" $named_input && 
+ #set $named_input = re.sub(r'(\s|\(|\)|:|!)', '_', str($input.element_identifier)+'.fasta')
+ ln -s '$input' '$named_input' &&
 
- mob_recon --num_threads \${GALAXY_SLOTS:-4} --infile "${named_input}" 
- #if str($adv_param.unicycler_contigs) == "True":
+
+ mob_recon --num_threads \${GALAXY_SLOTS:-4} --infile '${named_input}' --run_typer
+
+ #if $adv_param.unicycler_contigs:
  --unicycler_contigs 
  #end if 
- #if str($adv_param.run_circlator) == "True":
- --run_circlator 
- #end if 
- #if str($adv_param.min_length_condition.min_length_param) == "True":
- --min_length ${adv_param.min_length_condition.min_length_value}
+
+ #if $adv_param.run_overhang:
+ --run_overhang
  #end if 
- --run_typer --min_rep_evalue '${adv_param.min_rep_evalue}'
+
+ #if $adv_param.debug:
+ --debug
+ #end if
+
+ #if $adv_param.plasmid_db
+ --plasmid_db '$adv_param.plasmid_db'
+ #end if
+
+ #if $adv_param.plasmid_mash_db
+ --plasmid_mash_db '$adv_param.plasmid_mash_db'
+ #end if
+
+ #if $adv_param.plasmid_meta
+ --plasmid_meta '$adv_param.plasmid_meta'
+ #end if
+
+ #if $adv_param.repetitive_mask
+ --repetitive_mask '$adv_param.repetitive_mask'
+ #end if
+
+ #if $adv_param.plasmid_mob
+ --plasmid_mob '$adv_param.plasmid_mob'
+ #end if
+
+ #if $adv_param.plasmid_mpf
+ --plasmid_mpf '$adv_param.plasmid_mpf'
+ #end if
+
+ #if $adv_param.plasmid_orit
+ --plasmid_orit '$adv_param.plasmid_orit'
+ #end if
+
+ --min_length '${adv_param.min_length}' 
+ --min_rpp_evalue '${adv_param.min_rpp_evalue}'
  --min_rep_evalue '${adv_param.min_rep_evalue}'
  --min_mob_evalue '${adv_param.min_mob_evalue}'
  --min_con_evalue '${adv_param.min_con_evalue}'
  --min_rep_ident '${adv_param.min_rep_ident}'
  --min_mob_ident '${adv_param.min_mob_ident}'
  --min_con_ident '${adv_param.min_con_ident}'
  --min_rpp_ident '${adv_param.min_rpp_ident}'
- --outdir '.' && 
- mkdir ./sequences && (cp plasmid*.fasta chromosome.fasta ./sequences 2> /dev/null || true)
+
+ --min_rep_cov '${adv_param.min_rep_cov}'
+ --min_mob_cov '${adv_param.min_mob_cov}'
+ --min_con_cov '${adv_param.min_con_cov}'
+ --min_rpp_cov '${adv_param.min_rpp_cov}'
+ --outdir 'outdir' &&
+ mkdir ./outdir/plasmids && (mv outdir/plasmid*.fasta ./outdir/plasmids 2> /dev/null || true)
  ]]> 
  </command>
  <inputs>
  <param name="input" type="data" format="fasta" label="Input" help="FASTA file with contig(s)"/>
  <section name="adv_param" title="Advanced parameters" expanded="False">
- <param name="unicycler_contigs" label="Check for circularity flag generated by unicycler in contigs fasta headers" type="select" value="True">
- <option value="True">Yes</option>
- <option value="False">No</option>
- </param>
- <param name="run_circlator" label="Run circlator minums2 pipeline to check for circular contigs" type="select" value="True">
- <option value="True">Yes</option>
- <option value="False">No</option>
- </param>
- <conditional name="min_length_condition">
- <param name="min_length_param" label="Minimum length of contigs to process" type="select" value="False">
- <option value="False">No</option>
- <option value="True">Yes</option>
- </param> 
- <when value="True">
- <param name="min_length_value" type="integer" value="500" min="50"/> 
- </when> 
- <when value="False"/>
- </conditional> 
+ <param name="unicycler_contigs" type="boolean" truevalue="true" falsevalue="" checked="true" label="Check for circularity flag generated by unicycler in contigs fasta headers?"/>
+ <param name="run_overhang" type="boolean" truevalue="true" falsevalue="" checked="true" label="Detect circular contigs (i.e. potential plasmids) with assembly overhangs?"/> 
+ <param name="debug" type="boolean" truevalue="true" falsevalue="" checked="false" label="Provide debug information?"/>
+
  <param name="min_rep_evalue" label="Minimum evalue threshold for replicon blastn" type="float" min="0.00001" max="1" value="0.00001"/>
  <param name="min_mob_evalue" label="Minimum evalue threshold for relaxase tblastn" type="float" min="0.00001" max="1" value="0.00001"/>
  <param name="min_con_evalue" label="Minimum evalue threshold for contig blastn" type="float" min="0.00001" max="1" value="0.00001"/>
  <param name="min_rpp_evalue" label="Minimum evalue threshold for repetitve elements blastn" type="float" min="0.00001" max="1" value="0.00001"/>
+ <param name="min_length" label="Minimum length of contigs to classify" type="integer" value="1000"/>
  <param name="min_rep_ident" label="Minimum sequence identity for replicons" type="integer" min="0" max="100" value="80"/>
  <param name="min_mob_ident" label="Minimum sequence identity for relaxases" type="integer" min="0" max="100" value="80"/>
  <param name="min_con_ident" label="Minimum sequence identity for contigs" type="integer" min="0" max="100" value="80"/>
  <param name="min_rpp_ident" label="Minimum sequence identity for repetitive elements" type="integer" min="0" max="100" value="80"/>
+
+ <param name="min_rep_cov" label="Minimum percentage coverage of replicon query by input assembly" type="integer" min="0" max="100" value="80"/>
+ <param name="min_mob_cov" label="Minimum percentage coverage of relaxase query by input assembly" type="integer" min="0" max="100" value="80"/>
+ <param name="min_con_cov" label="Minimum percentage coverage of assembly contig by the plasmid reference database to be considered" type="integer" min="0" max="100" value="60"/>
+ <param name="min_rpp_cov" label="Minimum percentage coverage of contigs by repetitive elements" type="integer" min="0" max="100" value="80"/>
+
+ <param name="plasmid_db" optional="true" type="data" format="fasta" label="Reference Database of complete plasmids" help=""/>
+ <param name="plasmid_mash_db" optional="true" type="data" format="binary" label="Custom MASH database of plasmids" help="MASH sketch of the reference plasmids database"/>
+ <param name="plasmid_meta" type="data" optional="true" format="text" label="Plasmid cluster metadata file" help=""/>
+ <param name="plasmid_replicons" type="data" optional="true" format="fasta" label="FASTA file with plasmid replicons" help=""/>
+ <param name="repetitive_mask" type="data" optional="true" format="fasta" label="FASTA of known repetitive elements" help=""/>
+ <param name="plasmid_mob" type="data" optional="true" format="fasta" label="FASTA of plasmid relaxases" help=""/>
+ <param name="plasmid_mpf" type="data" optional="true" format="fasta" label="FASTA of known plasmid mate-pair proteins" help=""/>
+ <param name="plasmid_orit" type="data" optional="true" format="fasta" label="FASTA of known plasmid oriT dna sequences" help=""/>
  </section> 
  </inputs>
  <outputs>
- <data name="outfile1" format="tabular" from_work_dir="contig_report.txt" label="${tool.name} on ${on_string}: Overall contig MOB-recon report"/> 
- <data name="outfile2" format="tabular" from_work_dir="repetitive_blast_report.txt" label="${tool.name} on ${on_string}: Repetitive elements BLAST report"/>
- <data name="outfile3" format="tabular" from_work_dir="mobtyper_aggregate_report.txt" label="${tool.name} on ${on_string}: Aggregate MOB-typer report for all contigs"/>
- <collection name="seqhits" type="list" label="${tool.name} on ${on_string}: Extracted sequences (plasmids,chromosome(s))">
-  <discover_datasets pattern="__name_and_ext__" directory="sequences" />
+ <data name="contig_report" format="tabular" from_work_dir="outdir/contig_report.txt" label="${tool.name} on ${input.element_identifier}: Overall contig MOB-recon report"/> 
+ <data name="mobtyper_aggregate_report" format="tabular" from_work_dir="outdir/mobtyper_results.txt" label="${tool.name} on ${input.element_identifier}: Aggregate MOB-typer report for all contigs"/>
+ <data name="chromosome" format="fasta" from_work_dir="outdir/chromosome.fasta" label="${tool.name} on ${input.element_identifier}: Chromosomal sequences"/>
+ <collection name="plasmids" type="list" label="${tool.name} on ${input.element_identifier}: Plasmids">
+ <discover_datasets pattern="__name_and_ext__" directory="outdir/plasmids" />
  </collection>
  </outputs>
  <tests>
  <test>
- <param name="input" value="plasmid_476.fasta" ftype="fasta"/>
- <section name="adv_param">
- <param name="unicycler_contigs" value="True"/>
- <param name="run_circlator" value="True"/>
- </section>
- <output name="outfile1">
- <assert_contents>
- <has_text text="NC_019097"/>
- </assert_contents> 
- </output> 
+ <param name="input" value="Ecoli_strain_KV7_complete_LT795502.fasta" ftype="fasta"/>
+ <section name="adv_param">
+ <param name="unicycler_contigs" value="True"/>
+ <param name="run_overhang" value="True"/>
+ </section>
+ <output name="contig_report">
+ <assert_contents>
+ <has_text text="chromosome"/>
+ <has_text text="plasmid"/>
+ <has_text text="IncHI1A"/>
+ <has_text text="IncN"/>
+ </assert_contents>
+ </output>
+ <output name="mobtyper_aggregate_report">
+ <assert_contents>
+ <has_text text="conjugative"/>
+ <has_text text="Gammaproteobacteria"/>
+ <has_text text="223020"/>
+ </assert_contents>
+ </output>
  </test>
  </tests>
  <help>
@@ -96,7 +147,7 @@ For more information please visit https://github.com/phac-nml/mob-suite/.
 
 **Workflow**
 
-This preliminary \"Mobilome and Resistome Analysis Workflow\" linking mob_recon with staramr provides reports on mobilome and resistome for a given isolate given a draft genome assembly. The workflow is located in Shared Data --> Workflows --> Mobilome and Resistome Analysis Workflow (MOB-Recon and STARAMR). The workflow file can also be mamanually downloaded from https://raw.githubusercontent.com/phac-nml/galaxy_tools/master/tools/mob_suite/workflows/AMRworkflow_STARAMR.ga.
+This preliminary \"Mobilome and Resistome Analysis Workflow\" linking mob_recon with staramr provides reports on mobilome and resistome for a given isolate given a draft genome assembly. The workflow is located in Shared Data --> Workflows --> Mobilome and Resistome Analysis Workflow (MOB-Recon and STARAMR). The workflow file can also be manually downloaded from https://raw.githubusercontent.com/phac-nml/galaxy_tools/master/tools/mob_suite/workflows/AMRworkflow_STARAMR.ga.
 
 -----
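
Reviewer note on the `#set $named_input = re.sub(...)` line in the wrapper's command block: the Cheetah expression is ordinary Python, so its effect can be checked outside Galaxy. The sketch below applies the same pattern to a sample dataset name. The function name `galaxy_safe_name` and the sample identifier are made up for illustration.

```python
import re


def galaxy_safe_name(element_identifier):
    """Mirror the wrapper's substitution: whitespace, parentheses, ':' and '!'
    each become '_', and a '.fasta' suffix is appended."""
    return re.sub(r'(\s|\(|\)|:|!)', '_', str(element_identifier) + '.fasta')


print(galaxy_safe_name("sample 1 (run:2)"))  # -> sample_1__run_2_.fasta
```

Each matching character is replaced individually, so runs of special characters produce runs of underscores. The pattern does not cover quote characters, so an identifier containing a single quote would still reach the single-quoted `ln -s` and `--infile` arguments unchanged; whether that can occur depends on how Galaxy constrains element identifiers.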