CRISPRloci provides an automated and comprehensive in silico characterization of CRISPR-Cas systems in bacterial and archaeal genomes. It is a full suite for CRISPR locus characterization that includes CRISPR array orientation, detection of conserved leaders, cas gene annotation and subtype classification.
The web server interface of CRISPRloci is freely available at: rna.informatik.uni-freiburg.de/trunk/CRISPRloci
If you use CRISPRloci, please cite our papers:
- CRISPRidentify: identification of CRISPR arrays using machine learning approach. Alexander Mitrofanov, Omer S. Alkhnbashi, Sergey A. Shmakov, Kira S. Makarova, Eugene V. Koonin, Rolf Backofen. Nucleic Acids Research, DOI: https://doi.org/10.1093/nar/gkaa1158.
- Casboundary: Automated definition of integral Cas cassettes. Victor A. Padilha, Omer S. Alkhnbashi, Van Dinh Tran, Shiraz A. Shah, André C. P. L. F. de Carvalho, Rolf Backofen. Bioinformatics, 2020, DOI: 10.1093/bioinformatics/btaa984.
- CRISPRCasIdentifier: Machine learning for accurate identification and classification of CRISPR-Cas systems. Victor A. Padilha, Omer S. Alkhnbashi, Shiraz A. Shah, André C. P. L. F. de Carvalho, Rolf Backofen. GigaScience, 2020, DOI: 10.1093/gigascience/giaa062.
CRISPRloci_standalone.py has been tested with Python 3.7. To run it, we recommend installing the same library versions we used. Because we exported our classifiers following the model persistence guidelines from scikit-learn, they are not guaranteed to work properly when loaded with other Python or library versions. We therefore recommend using our Docker image or a conda virtual environment. Both make it easy to install the correct Python and library dependencies without affecting the rest of the operating system (see below).
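Since the persisted models are tied to the interpreter version, a small guard can be run before starting the tool. The following is a minimal sketch and not part of CRISPRloci: the function name `check_python_version` is hypothetical, and only the Python 3.7 requirement comes from the text above.

```python
import sys

# Version the standalone script was tested with (from the text above).
TESTED_VERSION = (3, 7)


def check_python_version(version_info=None, tested=TESTED_VERSION):
    """Return True if the major.minor version matches the tested one."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(tested)


if __name__ == "__main__":
    if not check_python_version():
        found = "%d.%d" % tuple(sys.version_info[:2])
        print(
            "Warning: running Python %s, but the models were tested with "
            "Python %d.%d" % ((found,) + TESTED_VERSION),
            file=sys.stderr,
        )
```

The check only compares the major and minor version; library versions would still need to match the ones pinned in the environment file.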
wget https://github.com/BackofenLab/CRISPRloci/archive/1.0.0.tar.gz
tar -xzf 1.0.0.tar.gz
Second step: download the Hidden Markov Model (HMM) and Machine Learning (ML) models for CRISPRcasIdentifier
Due to GitHub's file size constraints, we made our HMM and ML models available on Google Drive. You can download them here and here. Save both tar.gz files inside CRISPRcasIdentifier's directory. It is not necessary to extract them, since the tool will do that the first time it is run.
Third step: download the Hidden Markov Model (HMM) and Machine Learning (ML) models for Casboundary
We made our HMM and ML models available on Google Drive. You can download them from the following links:
Save all tar.gz files inside Casboundary's folder. It is not necessary to extract them, since the tool will do that the first time it is run.
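Before the first run it can be useful to confirm that the downloaded archives ended up in the right folders. The sketch below only lists the tar.gz files found directly inside a directory. The helper name `find_model_archives` is hypothetical, and the archive names themselves are not given in this document, so none are assumed.

```python
from pathlib import Path


def find_model_archives(tool_dir):
    """Return the sorted names of all .tar.gz files directly inside tool_dir."""
    return sorted(p.name for p in Path(tool_dir).glob("*.tar.gz"))
```

Calling it on the CRISPRcasIdentifier and Casboundary folders (paths depend on where the release was unpacked) should return a non-empty list for each once the downloads are in place.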
First, install Miniconda for Python 3, which can be downloaded from the Miniconda website.
Install Miniconda.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh
Create and activate environment for CRISPRloci.
conda env create -f CRISPRloci-env.yml -n CRISPRloci-env
conda activate CRISPRloci-env
After using CRISPRloci_standalone.py you can deactivate the environment.
conda deactivate
To test the DNA mode, run the following command:
python3.7 CRISPRloci_standalone.py -f Example/NC_005230.fasta -st dna
To test the protein mode, run the following command:
python3.7 CRISPRloci_standalone.py -f Example/NC_005230_proteins.fasta -st protein
To test the viral mode, run the following command:
python3.7 CRISPRloci_standalone.py -f Example/Input3.fa -st virus
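The three test commands differ only in the input file and the value of -st, so they can be generated from one small table. This is a minimal sketch, not part of the distribution: `build_command` and `run_all_examples` are hypothetical helper names, while the script name, example paths and flags are copied from the commands above.

```python
import subprocess

SCRIPT = "CRISPRloci_standalone.py"

# Example inputs shipped with the repository, keyed by the -st value.
EXAMPLES = {
    "dna": "Example/NC_005230.fasta",
    "protein": "Example/NC_005230_proteins.fasta",
    "virus": "Example/Input3.fa",
}


def build_command(seq_type, fasta=None, python="python3.7"):
    """Build the argument list for one run; defaults to the bundled example."""
    if seq_type not in EXAMPLES:
        raise ValueError("unknown sequence type: %r" % (seq_type,))
    return [python, SCRIPT, "-f", fasta or EXAMPLES[seq_type], "-st", seq_type]


def run_all_examples():
    """Run the three example commands in turn, stopping at the first failure."""
    for seq_type in EXAMPLES:
        subprocess.run(build_command(seq_type), check=True)
```

`run_all_examples()` is meant to be called from the repository root with the CRISPRloci environment activated.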
- -f: input DNA fasta file path.
- -output: folder where results will be stored.
- -cpu: number of CPUs to use.

- -r: list of regressors to use. Available options: CART, ERT or SVM (default: ERT).
- -c: list of classifiers to use. Available options: CART, ERT or SVM (default: ERT).
- -s: list of HMM models to use. Available options: HMM1 to HMM5 and HMM2019 (default: HMM2019). The models HMM1 to HMM5 are the ones that were originally used in our paper. HMM2019 consists of the HMM models that were obtained from the most recent dataset by Makarova (2019). Setting this parameter is enough for the tool to know which ML models should be used.
- -sc: sequence completeness (used only when -st is set to dna). Available options: complete or partial (default: complete).
- -m: run mode. Available options: classification, regression or combined (default: combined).
- -cg: maximum number of contiguous gaps allowed in a cassette (default: 1).
- -cm: which ML models to use. Available options: ERT or DNN (default: ERT).
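The documented choices for these options can be checked before a long run is started. The sketch below is illustrative only and does not reflect how the tool parses its arguments: `invalid_values` is a hypothetical helper, and expanding "HMM1 to HMM5" into the literal names HMM1, HMM2, HMM3, HMM4 and HMM5 is an assumption.

```python
# Documented choices per flag (from the option list above).
# Assumption: "HMM1 to HMM5" means the literal names HMM1 ... HMM5.
ALLOWED = {
    "-r": {"CART", "ERT", "SVM"},
    "-c": {"CART", "ERT", "SVM"},
    "-s": {"HMM1", "HMM2", "HMM3", "HMM4", "HMM5", "HMM2019"},
    "-sc": {"complete", "partial"},
    "-m": {"classification", "regression", "combined"},
    "-cm": {"ERT", "DNN"},
}


def invalid_values(options):
    """Return {flag: [undocumented values]} for the flags listed in ALLOWED.

    `options` maps a flag to a single value or a list of values.
    Flags not present in ALLOWED (for example -cg) are ignored.
    """
    problems = {}
    for flag, values in options.items():
        if flag not in ALLOWED:
            continue
        if isinstance(values, str):
            values = [values]
        bad = [v for v in values if v not in ALLOWED[flag]]
        if bad:
            problems[flag] = bad
    return problems
```

An empty result means every supplied value is one of the documented choices.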

- --model: model for the CRISPR array classification. Takes the values 8, 9, 10 or ALL and specifies the classification model. The default value is ALL.
- --strand: specifies if the array orientation should be predicted. Available options: True/False. The default value is True.
- --is_element: specifies if IS-Elements should be predicted. Available options: True/False. The default value is False.
- --fast_run: option to skip the candidate enhancement. Available options: True/False. The default value is False.
- --degenerated: allows search for degenerated repeat candidates on both ends of the CRISPR array candidate. Available options: True/False. The default value is True.
- --min_len_rep: specifies the minimum length of repeats in a CRISPR array. The default value is 21.
- --max_len_rep: specifies the maximum length of repeats in a CRISPR array. The default value is 55.
- --min_len_spacer: specifies the minimum average length of spacers in a CRISPR array. The default value is 18.
- --max_len_spacer: specifies the maximum average length of spacers in a CRISPR array. The default value is 78.
- --min_repeats: specifies the minimum number of repeats in a CRISPR array. The default value is 3.
- --enhancement_max_min: specifies if the filter approximation based on the max. and min. elements should be built. The default value is True.
- --enhancement_start_end: specifies if the filter approximation based on the max. and min. elements should be built. The default value is True.
- --max_identical_spacers: specifies the maximum number of identical spacers in a CRISPR array. The default value is 4.
- --max_identical_cluster_spacers: specifies the maximum number of identical consecutive spacers in a CRISPR array. The default value is 3.
- --margin_degenerated: specifies the maximum length difference between a new spacer sequence (obtained with the search of degenerated repeats) and the average value of spacer length in the array. The default value is 30.
- --max_edit_distance_enhanced: specifies the number of editing operations for candidate enhancement. The default value is 6.
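To make the length and count limits concrete, the sketch below checks a candidate array against the default values listed above. It illustrates what the thresholds mean and is not the tool's own filtering code: the function name `within_default_limits` is hypothetical, and treating every limit as inclusive is an assumption.

```python
# Default values copied from the option list above.
DEFAULT_LIMITS = {
    "min_len_rep": 21,
    "max_len_rep": 55,
    "min_len_spacer": 18,
    "max_len_spacer": 78,
    "min_repeats": 3,
}


def within_default_limits(repeats, spacers, limits=DEFAULT_LIMITS):
    """Return True if a candidate array satisfies the length and count limits.

    `repeats` and `spacers` are lists of sequence strings.
    Assumption: all bounds are inclusive.
    """
    if len(repeats) < limits["min_repeats"]:
        return False
    for repeat in repeats:
        if not limits["min_len_rep"] <= len(repeat) <= limits["max_len_rep"]:
            return False
    if not spacers:
        return False
    # The spacer limits apply to the average spacer length of the array.
    average = sum(len(s) for s in spacers) / len(spacers)
    return limits["min_len_spacer"] <= average <= limits["max_len_spacer"]
```

Passing a modified copy of `DEFAULT_LIMITS` mimics changing the corresponding command-line values.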

- -f: input proteins fasta file path.
- -output: folder where results will be stored.
- -cpu: number of CPUs to use.

- -r: list of regressors to use. Available options: CART, ERT or SVM (default: ERT).
- -c: list of classifiers to use. Available options: CART, ERT or SVM (default: ERT).
- -s: list of HMM models to use. Available options: HMM1 to HMM5 and HMM2019 (default: HMM2019). The models HMM1 to HMM5 are the ones that were originally used in our paper. HMM2019 consists of the HMM models that were obtained from the most recent dataset by Makarova (2019). Setting this parameter is enough for the tool to know which ML models should be used.
- -sc: sequence completeness (used only when -st is set to dna). Available options: complete or partial (default: complete).
- -m: run mode. Available options: classification, regression or combined (default: combined).
- -cg: maximum number of contiguous gaps allowed in a cassette (default: 1).
- -cm: which ML models to use. Available options: ERT or DNN (default: ERT).

- -f: input proteins fasta file path.
- -output: folder where results will be stored.
- -cpu: number of CPUs to use.
- evalue_s: the number of expected hits with spacer database. The default value is 1e-7.
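As an illustration of how an e-value cutoff such as the evalue_s default behaves, the sketch below keeps only hits at or below the threshold. The hit format (a name paired with an e-value), the helper name `filter_hits` and the inclusive comparison are all assumptions made for this example; the tool's real output format is not described here.

```python
DEFAULT_EVALUE = 1e-7  # default of evalue_s from the option list above


def filter_hits(hits, max_evalue=DEFAULT_EVALUE):
    """Keep (name, evalue) pairs whose e-value is at or below max_evalue."""
    return [hit for hit in hits if hit[1] <= max_evalue]
```

A smaller threshold keeps fewer, more significant matches, while a larger one admits weaker matches.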