This module is designed to function both as a standalone MAG binning pipeline and as a component of the larger CAP2/CAMP metagenome analysis pipeline. As such, it is both self-contained (ex. instructions included for the setup of a versioned environment, etc.) and seamlessly compatible with other CAMP modules (ex. ingests and spawns standardized input/output config files, etc.).

The design philosophy of the binning procedure is simply to replicate the functionality of MetaWRAP (one of the original ensemble binning methods) with i) better dependency conflict management and ii) improved integration with new binning algorithms.
Currently, the binning algorithms MetaBAT2, CONCOCT, VAMB, and MaxBin2 are wrapped along with the bin refinement tool DAS Tool.
1. Clone the repository from GitHub.
2. Set up the conda environment (which contains Snakemake) using ``configs/conda/binning.yaml``.

   - There are some compatibility issues that I haven't ironed out (``bowtie2`` clashes with RedHat's geriatric dependencies), so you may have to substitute in your own version.
3. The conda version of MaxBin2 doesn't seem to work, so the best way to add it to the module is to install it separately::
    cd bin/
    wget https://sourceforge.net/projects/maxbin2/files/latest/download
    tar -xf download
    spack load gcc@6.3.0 # This is only necessary for HPCs with extremely old gcc's
    cd MaxBin-2.2.7/src
    make
    ./autobuild_auxiliary
    wget https://github.com/loneknightpy/idba/releases/download/1.1.3/idba-1.1.3.tar.gz
    tar -xf idba-1.1.3.tar.gz
    cd idba-1.1.3/
    ./configure --prefix=/home/lam4003/bin/MaxBin-2.2.7/auxiliary/idba-1.1.3 # IDBA-UD was not included in the auxiliary build
    make

    # Optional: Export or add the following to ~/.bashrc
    export PATH=$PATH:/path/to/bin/MaxBin-2.2.7:/path/to/bin/MaxBin-2.2.7/auxiliary/FragGeneScan_1.30:/path/to/bin/MaxBin-2.2.7/auxiliary/hmmer-3.1b1/src:/path/to/bin/MaxBin-2.2.7/auxiliary/bowtie2-2.2.3:/path/to/bin/MaxBin-2.2.7/auxiliary/idba-1.1.3/bin
4. Update the locations of the test datasets in ``samples.csv``, and the relevant parameters in ``configs/parameters.yaml``.
5. Make sure the installed pipeline works correctly.
   - Note: VAMB generates >600 MB of bin FastAs, which can be deleted immediately after running the test if storage space is an issue. It is also not included in the sample output directory ``test_data/test_out/`` for this reason.
::

    # Create and activate conda environment
    cd camp_binning
    conda env create -f configs/conda/binning.yaml
    conda activate binning

    # Run tests on the included sample dataset
    python /path/to/camp_binning/workflow/binning.py test
Input: ``/path/to/samples.csv`` provided by the user.

Output: 1) An output config file summarizing the locations of 2) the MAGs generated by MetaBAT2, CONCOCT, and VAMB. See ``test_data/test_out.tar.gz`` for a sample output work directory.
- ``/path/to/work/dir/binning/final_reports/samples.csv``: for ingestion by the next module (ex. quality-checking)
- ``/path/to/work/dir/binning/*/sample_name/``, where ``*`` is either ``1_metabat2`` or ``2_concoct``: the directories containing FastAs (``*.fa``) of MAGs inferred by MetaBAT2 and CONCOCT respectively
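For reference, here is a sketch of the work directory layout implied by the paths above (a real run may contain additional subdirectories, ex. for VAMB, which is excluded from the sample output as noted earlier)::

    binning/
    ├── 1_metabat2/
    │   └── sample_name/
    │       └── *.fa
    ├── 2_concoct/
    │   └── sample_name/
    │       └── *.fa
    └── final_reports/
        └── samples.csv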
Structure::

    └── workflow
        ├── Snakefile
        ├── binning.py
        ├── utils.py
        └── __init__.py
- ``workflow/binning.py``: Click-based CLI that wraps the ``snakemake`` and other commands for clean management of parameters, resources, and environment variables.
- ``workflow/Snakefile``: The ``snakemake`` pipeline.
- ``workflow/utils.py``: Sample ingestion and work directory setup functions, and other utility functions used in the pipeline and the CLI.
1. Make your own ``samples.csv`` based on the template in ``configs/samples.csv``; a hypothetical example is sketched below.

   - ``ingest_samples`` in ``workflow/utils.py`` expects Illumina reads in FastQ format (possibly gzipped) and de novo assembled contigs in FastA format.
   - ``samples.csv`` requires either absolute paths or symlinks relative to the directory that the module is being run in.
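   To give a sense of the expected layout, here is a minimal sketch of a ``samples.csv``; the column names here are hypothetical placeholders, so defer to the actual template in ``configs/samples.csv``::

       sample_name,illumina_fwd,illumina_rev,ctg_fa
       sample_1,/abs/path/sample_1_R1.fastq.gz,/abs/path/sample_1_R2.fastq.gz,/abs/path/sample_1.contigs.fasta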
2. Update the relevant ``metabat2``, ``concoct``, ``vamb``, and ``maxbin2`` parameters in ``configs/parameters.yaml``.
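   For illustration, a hypothetical fragment of ``configs/parameters.yaml`` showing the expected tool-keyed layout (all keys and values below are made up; consult the shipped file for the real ones)::

       metabat2:
           min_contig_len: 1500    # hypothetical: smallest contig to bin
       concoct:
           chunk_size: 10000       # hypothetical: contig fragmentation size
       vamb:
           min_fasta_size: 200000  # hypothetical: smallest bin FastA to keep
       maxbin2:
           markers: 107            # hypothetical: marker gene set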
3. Update the computational resources available to the pipeline in ``resources.yaml``.
4. To run CAMP on the command line, use the following, where ``/path/to/work/dir`` is replaced with the absolute path of your chosen working directory, and ``/path/to/samples.csv`` is replaced with your copy of ``samples.csv``.

   - The default number of cores available to Snakemake is 1, which is enough for the test data but should probably be adjusted to 10+ for a real dataset.
   - Relative or absolute paths to the Snakefile and/or the working directory (if you're running elsewhere) are accepted!
::

    python /path/to/camp_binning/workflow/binning.py \
        (-c max_number_of_local_cpu_cores) \
        -d /path/to/work/dir \
        -s /path/to/samples.csv
- Note: This setup allows the main Snakefile to live outside of the work directory.
5. To run CAMP on a job submission cluster (for now, only Slurm is supported), use the following. ``--slurm`` is an optional flag that submits all rules in the Snakemake pipeline as ``sbatch`` jobs.

   - In Slurm mode, the ``-c`` flag refers to the maximum number of ``sbatch`` jobs submitted in parallel, not the pool of cores available to run the jobs. Each job will request the number of cores specified by ``threads`` in ``configs/resources/slurm.yaml``.
::

    sbatch -J jobname -o jobname.log << "EOF"
    #!/bin/bash
    python /path/to/camp_binning/workflow/binning.py --slurm \
        (-c max_number_of_parallel_jobs_submitted) \
        -d /path/to/work/dir \
        -s /path/to/samples.csv
    EOF
6. After checking over ``final_reports/`` and making sure you have everything you need, you can delete all intermediate files to save space::
    python /path/to/camp_binning/workflow/binning.py cleanup \
        -d /path/to/work/dir \
        -s /path/to/samples.csv
7. If for some reason the module keeps failing, CAMP can print a script containing all of the remaining commands, which can then be run manually::
    python /path/to/camp_binning/workflow/binning.py --dry_run \
        -d /path/to/work/dir \
        -s /path/to/samples.csv > cmds.txt
    python /path/to/camp_binning/workflow/binning.py commands cmds.txt
What if you've customized some components of the module, but you still want to update the rest of it with the latest version of the standard CAMP? Just do the following from within the module's home directory:

- The ``-X ours`` flag forces conflicting hunks to be auto-resolved cleanly by favoring the local (i.e. your) version.
::

    cd /path/to/camp_binning
    git pull -X ours
We love to see it! This module was partially envisioned as a dependable, prepackaged sandbox for developers to test their shiny new tools in.
These instructions are meant for developers who have made a tool and want to integrate or demo its functionality as part of the standard binning workflow, or for developers who want to integrate an existing tool.
1. Write a module rule that wraps your tool and integrates its input and output into the pipeline; a sketch is given after this item's notes.

   - The Snakemake tutorial is a great resource for writing basic Snakemake rules.
   - If you're adding new tools from an existing YAML, use ``conda env update --file configs/conda/existing.yaml --prune``.
   - If you're using external scripts and resource files that i) cannot easily be integrated into either ``utils.py`` or ``parameters.yaml``, and ii) are not as large as databases that would justify an externally stored download, add them to ``workflow/ext/`` or ``workflow/ext/scripts/`` and use ``rule external_rule`` as a template to wrap them.
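   Here is a minimal sketch of what such a wrapper rule might look like. The rule name, file paths, environment YAML, and tool invocation are all hypothetical placeholders, not part of the actual module::

       # Hypothetical rule wrapping a new binning tool; all names and paths are placeholders
       rule my_new_binner:
           input:
               ctg = 'tmp/{sample}.contigs.fasta',  # de novo assembled contigs
               bam = 'tmp/{sample}.sorted.bam'      # reads mapped back to the contigs
           output:
               directory('4_my_binner/{sample}')
           conda:
               'configs/conda/my_binner.yaml'       # per-rule env sidesteps dependency conflicts
           threads: 8
           shell:
               "my_binner --contigs {input.ctg} --bam {input.bam} "
               "--threads {threads} --out {output}"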
2. Update the ``make_config`` rule in ``workflow/Snakefile`` to check for your tool's output files. Update ``samples.csv`` to document its output if downstream modules/tools are meant to ingest it.

   - If you plan to integrate multiple tools into the module that serve the same purpose but with different input or output requirements (ex. for alignment, Minimap2 for Nanopore reads vs. Bowtie2 for Illumina reads), you can toggle between these different 'streams' by setting the final files expected by ``make_config`` using the example function ``workflow_mode``; a sketch follows below.
   - Update the description of the ``samples.csv`` input fields in the CLI script ``workflow/binning.py``.
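   A rough sketch of how a ``workflow_mode``-style toggle could select the final files for ``make_config``; the mode names and paths are hypothetical, and the module's actual implementation may differ::

       # Hypothetical: pick the final files make_config should expect per 'stream'
       def workflow_mode(mode):
           if mode == 'illumina':    # ex. Bowtie2-based alignment stream
               return ['1_metabat2/{sample}', '2_concoct/{sample}']
           elif mode == 'nanopore':  # ex. Minimap2-based alignment stream
               return ['4_nanopore_binner/{sample}']
           raise ValueError('Unknown workflow mode: %s' % mode)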
3. If applicable, update the default conda config using ``conda env export > configs/conda/binning.yaml`` with your tool and its dependencies.

   - If there are dependency conflicts, make a new conda YAML under ``configs/conda`` and specify its usage in specific rules using the ``conda`` option (see ``first_rule`` for an example, or the sketch below).
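   For instance, a rule can be pinned to its own environment file like so (the rule, output, and YAML names are hypothetical; requires running Snakemake with ``--use-conda``)::

       # Hypothetical: isolate a tool with conflicting dependencies in its own env
       rule run_conflicting_tool:
           output:
               'conflicting_tool.done'
           conda:
               'configs/conda/conflicting_tool.yaml'
           shell:
               'conflicting_tool && touch {output}'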
4. Add your tool's installation and running instructions to the module documentation and (if applicable) add the repo to your Read the Docs account + turn on the Read the Docs service hook.
5. Run the pipeline once through to make sure everything works, using the test data in ``test_data/`` if appropriate, or your own appropriately-sized test data.

   - Note: Python functions imported from ``utils.py`` into the ``Snakefile`` should be debugged on the command line first before being added to a rule, because Snakemake doesn't port standard output/error well when using ``run:``.
6. Increment the version number of the modular pipeline: ``patch`` for bug fixes (increments E), ``minor`` for substantial changes to the rules and/or workflow (increments C), and ``major`` (increments A) only for major releases of CAMP. For a version number ``A.C.E``::
    bump2version --current-version A.C.E patch
7. If you want your tool integrated into the main CAP2/CAMP pipeline, send a pull request and we'll have a look at it ASAP!

   - Please make it clear what your tool intends to do by including a summary in the commit/pull request (ex. "Release X.Y.Z: Integration of tool A, which does B to C and outputs D").
There is a dependency error that hasn't been addressed yet: ``bowtie2`` in the main ``camp_binning`` conda environment has conflicting C++ and Perl dependencies with some other packages.
- This package was created with Cookiecutter as a simplified version of the project template.
- This module is heavily inspired by four Snakefiles from MAG Snakemake workflow (Saheb Kashaf et al. 2021).
- The MAG N50, size, and GC calculation rule was adapted from a script in MetaWRAP.
- Free software: MIT
- Documentation: https://binning.readthedocs.io.