Improvement of global structure of protein 3D models via molecular dynamics (MD) and structural averaging (as seen in CASP14).
The PREFMD2 pipeline has two modes:
- Single initial model mode: the default mode. MD sampling is conducted starting from an initial 3D model supplied by the user. This mode is the one implemented in the publicly available Feig lab webserver: https://feig.bch.msu.edu/web/services/prefmd/.
- Multiple initial models mode: MD sampling is conducted from a user-supplied initial 3D model and from four additional conformations. These conformations are obtained by hybridizing (through the Iterative hybridize protocol of Rosetta) the user-supplied model with multiple homology models of the same protein (generated by MODELLER). This mode is computationally more expensive than the default one, but usually produces more accurate results when the target protein has templates available in the PDB.
For a detailed description and comparison of the two modes see [1].
For the Feig lab's refinement protocol used in CASP13, please see: https://github.com/feiglab/prefmd
The PREFMD2 pipeline runs on Linux systems. To use it on your machine, you need Python 3.6+ and a series of dependencies. Note: running the multiple initial models mode requires installing extra dependencies (which are not required to run the single initial model mode).
Make sure to install these Python libraries.
- OpenMM
- Website: http://openmm.org/
- Note: by default PREFMD2 will use the CUDA platform. You can change the OpenMM platform that PREFMD2 will use by setting the $PREFMD2_OPENMM_PLATFORM environmental variable.
- Role in the pipeline: running MD simulations.
- mdtraj
- Website: https://github.com/mdtraj/mdtraj
- Role in the pipeline: parsing and extracting data from MD trajectory files.
- scikit-learn
- Website: https://scikit-learn.org/stable/
- Role in the pipeline: clustering of MD snapshots.
- MODELLER (optional, used only in multiple initial models mode)
- Website: https://salilab.org/modeller/
- Role in the pipeline: performing template-based 3D modeling in the multiple initial models mode.
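The OpenMM note above says that the platform can be changed through $PREFMD2_OPENMM_PLATFORM. As a minimal sketch of how a script could honor that variable (this is an illustration, not PREFMD2's actual code; the function name `resolve_platform_name` is hypothetical), the lookup could fall back to CUDA when the variable is unset:

```python
import os

# Default stated in this README; any other value is taken from the environment.
DEFAULT_PLATFORM = "CUDA"


def resolve_platform_name(environ=None):
    """Return the OpenMM platform name requested through the environment.

    Hypothetical helper: reads PREFMD2_OPENMM_PLATFORM and falls back to
    the default when the variable is unset or blank.
    """
    env = os.environ if environ is None else environ
    name = env.get("PREFMD2_OPENMM_PLATFORM", "").strip()
    return name or DEFAULT_PLATFORM


# With OpenMM installed, the resolved name could then be handed to OpenMM:
#   from openmm import Platform   # "from simtk.openmm import Platform" on older releases
#   platform = Platform.getPlatformByName(resolve_platform_name())
```

For example, exporting PREFMD2_OPENMM_PLATFORM=CPU before launching a job would select OpenMM's CPU platform on a machine without a GPU.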
Make sure to install these dependencies and to set the required environmental variables (as explained in the Configuration sections). The .bashrc files in the default directory of this repository give an example of what your environmental variables should look like.
- CHARMM
- Obtain from: http://charmm.chemistry.harvard.edu
- Configuration: once you have installed CHARMM, set the following environmental variable:
  - CHARMMEXEC: path to the executable file of CHARMM.
- Role in the pipeline: it is a dependency for locPREFMD (see below) and is used to prepare input files for the MD runs.
- MMTSB
- Obtain from: https://github.com/mmtsb/toolset
- Configuration: once you have compiled the toolset, make sure that you have set the following environmental variables (you should already have set them during the MMTSB installation process, but they are repeated here as a double check):
  - MMTSBDIR: top directory of the locally installed Git repository.
  - CHARMMDATA: path to $MMTSBDIR/data/charmm.
  - Add $MMTSBDIR/bin and $MMTSBDIR/perl to your $PATH.
- Role in the pipeline: contains scripts necessary to manipulate PDB files and it is a dependency for locPREFMD (see below).
- locPREFMD
- Obtain from: https://github.com/feiglab/locprefmd
- Configuration: follow the installation instructions in the GitHub link and make sure that you have set the following environmental variables (note that you should already have set them during the locPREFMD installation):
  - LOCPREFMD: path to the locPREFMD Git repository after checking out.
  - MOLPROBITY: path to the top of the MolProbity tree.
- Role in the pipeline: used to perform initial stereochemical refinement on the input model and on the averaged models.
- mdconv
- Obtain from: https://github.com/feiglab/mdconv
- Configuration: download the source code, compile and:
  - Add the directory with the mdconv executable to your $PATH.
- Role in the pipeline: modifies the trajectory files generated in the production MD runs.
- TMscore
- Obtain from: https://zhanglab.ccmb.med.umich.edu/TM-score/
- Configuration: download the source code, compile and:
  - Add the directory with the TMscore executable to your $PATH.
- Role in the pipeline: in the scoring phase, it compares the structures extracted from the MD trajectories to the initial model.
- RWplus
- Obtain from: https://zhanglab.ccmb.med.umich.edu/RW/
- Configuration: download the calRWplus program. Then set the following environmental variable:
  - RWPLUS_HOME: path to the RWplus home directory (where the calRWplus executable is located).
- Role in the pipeline: scores the structures extracted from the MD trajectories in order to filter them before the averaging stage.
- Scwrl4
- Obtain from: http://dunbrack.fccc.edu/SCWRL3.php/
- Configuration: download and install the Scwrl4 program. Then:
  - Add the directory with the scwrl4 executable to your $PATH.
- Role in the pipeline: repacks the side chains of the averaged models.
- HHsuite
- Obtain from: https://github.com/soedinglab/hh-suite
- Also make sure to obtain:
  - A Uniclust30 database (to be used by hhblits): https://uniclust.mmseqs.com/
  - A PDB70 database (to be used by hhsearch when scanning for templates): http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/
- Configuration: install the suite and the required databases, then define the following environmental variables:
  - HHSUITE_SEQ_DB: database path for the Uniclust30 database. This and the following variable ($HHSUITE_PDB_DB) should be the same paths that you supply as the -d argument when using the hhblits or hhsearch programs.
  - HHSUITE_PDB_DB: database path for the PDB70 database.
  - Add the directory with the HHsuite executables to your $PATH.
- Role in the pipeline: identifies templates for the input protein. The templates will be used to build homology models of the protein using MODELLER.
- TMalign
- Obtain from: https://zhanglab.ccmb.med.umich.edu/TM-align/
- Configuration: download the source code, compile and:
  - Add the directory with the TMalign executable to your $PATH.
- Role in the pipeline: compares the initial model 3D structure with the templates identified by the HHsuite programs.
- Rosetta software suite
- Obtain from: https://www.rosettacommons.org/software
- Configuration: once you have installed Rosetta, set the following environmental variables:
  - ROSETTA_HOME: path of the home directory of Rosetta (this is the directory where the demos, documentation, main and tools directories of the Rosetta suite are located).
  - ROSETTA_EXTENSION (optional): name of the extension of the Rosetta binary files. If you do not specify it, PREFMD2 will assume that your Rosetta binaries have the linuxgccrelease extension. Depending on how you obtained the Rosetta binaries, you may have to modify it. For example, if you are using pre-compiled binaries on Linux, you should set this to static.linuxgccrelease.
- Role in the pipeline: runs a modified version of the Iterative hybridize protocol in order to hybridize the initial user-supplied 3D model with the template-based models built by MODELLER.
- GNU parallel
- Obtain from: https://www.gnu.org/software/parallel/
- Note: you can probably install this program using the package manager of your Linux distribution.
- Configuration: the directory where the parallel executable file is located must be in your $PATH.
- Role in the pipeline: used to parallelize the Iterative hybridize protocol of Rosetta.
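Because the dependency list above relies on many environment variables and on executables being found in $PATH, a small pre-flight check can catch a misconfigured shell before a long job starts. The following is a minimal sketch, not part of PREFMD2: the function name `find_missing` is hypothetical, and the split between dependencies needed by every job and those needed only by the multiple initial models mode is an assumption inferred from the roles described above.

```python
import os
import shutil

# Variables and executables named in the dependency list above.
REQUIRED_VARS = ["CHARMMEXEC", "MMTSBDIR", "CHARMMDATA",
                 "LOCPREFMD", "MOLPROBITY", "RWPLUS_HOME"]
REQUIRED_EXES = ["mdconv", "TMscore", "scwrl4"]

# Assumption: these are only needed for the multiple initial models mode.
HYBRID_VARS = ["HHSUITE_SEQ_DB", "HHSUITE_PDB_DB", "ROSETTA_HOME"]
HYBRID_EXES = ["hhblits", "hhsearch", "TMalign", "parallel"]


def find_missing(hybrid=False, environ=None, which=shutil.which):
    """Return (missing_variables, missing_executables) for the chosen mode."""
    env = os.environ if environ is None else environ
    variables = REQUIRED_VARS + (HYBRID_VARS if hybrid else [])
    executables = REQUIRED_EXES + (HYBRID_EXES if hybrid else [])
    # A variable counts as missing when it is unset or empty.
    missing_vars = [name for name in variables if not env.get(name)]
    # shutil.which returns None when the executable is not found in PATH.
    missing_exes = [name for name in executables if which(name) is None]
    return missing_vars, missing_exes


# Example use:
#   missing_vars, missing_exes = find_missing(hybrid=True)
#   print("Unset variables:", missing_vars)
#   print("Executables not in PATH:", missing_exes)
```

The check only confirms that names are set and resolvable. It does not verify that the tools were built correctly or that the paths point to the right installations.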
Once you have installed the required PREFMD2 dependencies, clone the PREFMD2 GitHub repository on your system. Run:
git clone https://github.com/feiglab/prefmd2.git
Then set the following environmental variable:
- PREFMD2_HOME: this should be the path of the PREFMD2 directory that you cloned on your system.
For the MD runs in the main sampling stage and in the equilibration that precedes it, PREFMD2 uses a modified version of the CHARMM36m force field. The files for this force field are provided in this repository.
For the relaxation of averaged structures and model quality assessment steps, the original version of CHARMM36m is used. The files for this force field are NOT provided in this repository. In order to use PREFMD2, you will need to provide your own CHARMM36m force field files, which are available from the CHARMM distribution (in the toppar directory). Note that CHARMM provides protein force field files separately from water and ions. In order to use PREFMD2 you will need to specify the following three environmental variables:
- $PREFMD2_FF_PARAMETER: path to the parameter file of the selected force field, for example $HOME/apps/charmm/toppar/par_all36_prot.prm (assuming that your CHARMM installation is in $HOME/apps/charmm).
- $PREFMD2_FF_TOPOLOGY: path to the topology file of the selected force field, for example $HOME/apps/charmm/toppar/top_all36_prot.rtf.
- $PREFMD2_FF_WATER_IONS: path to the water and ions parameter file of the selected force field, for example $HOME/apps/charmm/toppar/toppar_water_ions.str.
Although in principle any force field can be used in PREFMD2, we recommend the CHARMM36m force field.
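Since the three force field variables must point to files that this repository does not ship, it can help to confirm that each one resolves to an existing file. The sketch below is an illustration under that assumption only; `check_force_field_files` is a hypothetical helper and is not provided by PREFMD2.

```python
import os
from pathlib import Path

# The three variables described above.
FF_VARS = ["PREFMD2_FF_PARAMETER", "PREFMD2_FF_TOPOLOGY", "PREFMD2_FF_WATER_IONS"]


def check_force_field_files(environ=None):
    """Return a dict mapping each problematic variable to a short description.

    An empty dict means every variable is set and points to an existing file.
    """
    env = os.environ if environ is None else environ
    problems = {}
    for var in FF_VARS:
        value = env.get(var)
        if not value:
            problems[var] = "not set"
        elif not Path(value).expanduser().is_file():
            problems[var] = f"no such file: {value}"
    return problems


# Example use:
#   for var, problem in check_force_field_files().items():
#       print(f"{var}: {problem}")
```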
Prepare an input protein structure in PDB format. Then run:
python $PREFMD2_HOME/scripts/prefmd2.py -t my_refinement_job -i input.pdb
This will run the default single initial model mode. The -t argument is the name of the refinement job (the prefmd2.py script will create a directory named my_refinement_job and write all its output files in it) and -i is the path of the PDB file of the 3D model that you want to refine. Using the default options, a typical refinement job for a ~120 amino acid protein takes about 24 hours to complete on a single GPU. Once a job is completed, prefmd2.py will output 5 final models [1]. They can be found in the final directory inside the output directory (in the example above, the my_refinement_job directory).
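When several models need to be refined, the command above can be assembled from a Python script. This is a minimal sketch: the helper names `build_command` and `list_final_models` are hypothetical, and the assumption that the final models carry a .pdb extension is not confirmed by this README.

```python
import os
from pathlib import Path


def build_command(job_name, input_pdb, prefmd2_home=None, hybrid=False):
    """Build the prefmd2.py command line as a list of arguments."""
    home = prefmd2_home or os.environ["PREFMD2_HOME"]
    script = Path(home) / "scripts" / "prefmd2.py"
    cmd = ["python", str(script), "-t", job_name, "-i", str(input_pdb)]
    if hybrid:
        # Switches to the multiple initial models mode.
        cmd.append("--hybrid")
    return cmd


def list_final_models(job_dir):
    """List the files in the 'final' directory of a finished job.

    Assumption: final models are written with a .pdb extension.
    """
    final_dir = Path(job_dir) / "final"
    if not final_dir.is_dir():
        return []
    return sorted(final_dir.glob("*.pdb"))


# Example use (runs the full pipeline, so it may take many hours):
#   import subprocess
#   subprocess.run(build_command("my_refinement_job", "input.pdb"), check=True)
#   print(list_final_models("my_refinement_job"))
```

Passing the command as a list to subprocess.run avoids shell quoting problems with file names that contain spaces.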
- -d/--dir: working directory. The pipeline will be executed here and the output directory will be written in it.
- -v/--verbose: set verbose mode.
- --cpus: number of CPUs to be used (default: 8).
- --gpus: ids of the GPUs to use in the job (default: 0). Examples: 1 (only GPU 1 will be used), 0:1 (use GPU 0 and 1), 0:1:3 (use GPU 0, 1 and 3). Each GPU will be used for an MD run when multiple MD runs can be run in parallel (e.g. when performing the 5 MD production runs). This option will only take effect if your OpenMM platform uses GPU acceleration.
- --hybrid: perform the multiple initial models mode.
- --extensive: use longer MD production runs.
- --force: overwrite a previous output directory if needed.
- --stage: name of the stage of the refinement pipeline to be run. By default it is 'all' (the whole pipeline will be executed).
- -j/--json: file path of the JSON file in a PREFMD2 output directory. It must be supplied when resuming a previous job to execute a specific stage using the --stage argument.
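The colon-separated format accepted by --gpus can be illustrated with a short parser. This is a sketch of the format only, not PREFMD2's own parsing code; both function names are hypothetical, and the round-robin assignment of MD runs to GPUs is an assumption made for illustration.

```python
def parse_gpu_ids(spec):
    """Parse a --gpus value such as "0:1:3" into a list of integer ids."""
    ids = [int(token) for token in spec.split(":")]
    if any(gpu_id < 0 for gpu_id in ids):
        raise ValueError(f"GPU ids must be non-negative: {spec!r}")
    if len(set(ids)) != len(ids):
        raise ValueError(f"Duplicate GPU id in: {spec!r}")
    return ids


def assign_runs_to_gpus(n_runs, gpu_ids):
    """Assign each MD run to a GPU in round-robin order (assumed scheme)."""
    return [gpu_ids[run % len(gpu_ids)] for run in range(n_runs)]


# Example use:
#   gpus = parse_gpu_ids("0:1")
#   print(assign_runs_to_gpus(5, gpus))
```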
- 1/21/2020: set up the repository.
[1] Heo L, Arbour CF, Janson G, Feig M. Improved Sampling Strategies for Protein Model Refinement Based on Molecular Dynamics Simulation. J Chem Theory Comput (2021) Feb 9. PMID: 33562962.