[ICLR 2024] MARCEL: Machine Learning over Molecular Conformer Ensembles

MARCEL is a PyTorch-based benchmark library that evaluates the potential of machine learning on conformer ensembles across a diverse set of molecules, datasets, and models.

Why Learn over Conformer Ensembles?

Molecules are not rigid, static objects. Thermodynamically permissible rotations of chemical bonds, small vibrational motions, and dynamic intermolecular interactions cause molecules to continuously interconvert between different conformations. As a consequence, many experimentally observable chemical properties depend on the full distribution of thermodynamically accessible conformers. Moreover, it is often challenging to determine a priori which conformers contribute most to a molecular property without running prohibitively expensive simulations. It is therefore important to investigate the collective power of the many conformer structures lying at local minima of the potential energy surface, known as the conformer ensemble, for improving molecular representation learning models.

Datasets

MARCEL includes four datasets that cover a diverse range of chemical space and focus on four chemically relevant tasks for both molecules and reactions, with an emphasis on Boltzmann-averaged properties of conformer ensembles computed at the Density Functional Theory (DFT) level.

Drugs-75K

Drugs-75K is a subset of the GEOM-Drugs dataset, containing 75,099 molecules with at least 5 rotatable bonds. For each molecule, Auto3D is used to generate and optimize the conformer ensemble, and AIMNet-NSE is used to calculate three important DFT-based reactivity descriptors: ionization potential, electron affinity, and electronegativity.

Links: Download, Instructions

Kraken

Kraken is a dataset of 1,552 monodentate organophosphorus (III) ligands along with their DFT-computed conformer ensembles. We consider four 3D catalytic ligand descriptors exhibiting significant variance among conformers: Sterimol B5, Sterimol L, buried Sterimol B5, and buried Sterimol L. These descriptors quantify the steric size of a substituent in Å, and are commonly employed for Quantitative Structure-Activity Relationship (QSAR) modeling. The buried Sterimol variants describe the steric effects within the first coordination sphere of a metal.

Links: Download, Instructions

EE

EE is a dataset of 872 catalyst-substrate pairs involving 253 Rhodium (Rh)-bound atropisomeric catalysts derived from chiral bisphosphines, with 10 enamides as substrates. The dataset includes conformations of the catalyst-substrate transition state complexes in two separate pro-S and pro-R configurations. The task is to predict the Enantiomeric Excess (EE) of the chemical reaction involving the substrate, defined as the absolute ratio between the concentrations of the two enantiomers in the product distribution. This dataset is generated with Q2MM, which automatically derives Transition State Force Fields (TSFFs) to simulate the conformer ensembles of each prochiral transition state complex. EE can then be computed from the conformer ensembles by Boltzmann-averaging the activation energies of the competing transition states. Unlike the properties in Drugs-75K and Kraken, EE depends on the conformer ensembles of both the pro-R and pro-S complexes.
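
As a rough sketch of this relationship (using our own notation and the standard transition-state-theory relation, which may differ from the exact convention used to generate the labels), the enantiomeric excess follows from the difference between the Boltzmann-averaged activation free energies of the competing pathways:

$$
\mathrm{EE} = \frac{k_R - k_S}{k_R + k_S} = \tanh\!\left(\frac{\Delta\Delta G^{\ddagger}}{2RT}\right), \qquad \Delta\Delta G^{\ddagger} = \Delta G^{\ddagger}_{S} - \Delta G^{\ddagger}_{R},
$$

where $k_R$ and $k_S$ are the rates of formation of the two enantiomers and $\Delta G^{\ddagger}_{R}$, $\Delta G^{\ddagger}_{S}$ are the Boltzmann-averaged activation free energies of the pro-R and pro-S transition states.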

Links: Dataset access not publicly available, Instructions

BDE

BDE is a dataset containing 5,915 organometallic catalysts ML₁L₂ consisting of a metal center (M = Pd, Pt, Au, Ag, Cu, Ni) coordinated to two flexible organic ligands (L₁ and L₂), each selected from a 91-membered ligand library. The data includes conformations of each unbound catalyst, as well as conformations of the catalyst when bound to ethylene and bromide after oxidative addition with vinyl bromide. Each catalyst has an electronic binding energy, computed as the difference in the minimum energies of the bound-catalyst complex and unbound catalyst, following the DFT-optimization of their respective conformer ensembles. Although the binding energies are computed via DFT, the conformers provided for modeling are generated with Open Babel. This realistically represents the setting in which precise conformer ensembles are unknown at inference.

Links: Download, Instructions

Benchmarks

Prerequisites

The following packages are required for running the benchmarks.

  • pytorch >= 1.13.1
  • pyg >= 2.0
  • rdkit
  • nni
  • ogb

Dataset Loaders

MARCEL provides PyG data loaders for each dataset. Download each dataset and place its zipped file under the corresponding directory, i.e., datasets/<NAME>/raw.

| Dataset   | Dataloader class |
| --------- | ---------------- |
| Drugs-75K | `data.drugs.Drugs` |
| Kraken    | `data.kraken.Kraken` |
| EE        | `data.ee.EE_2D` for 2D models, `data.ee.EE` for the others |
| BDE       | `data.bde.BDE` |
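
As a minimal sketch of how these loaders are intended to be used (the constructor signature, in particular the root argument, is an assumption; consult the classes in the data package for the exact interface):

```python
# Minimal sketch -- the `root` keyword is an assumption; check data/kraken.py
# for the actual constructor signature.
from data.kraken import Kraken

# Expects the downloaded zip to have been placed under datasets/Kraken/raw
dataset = Kraken(root='datasets/Kraken')

print(len(dataset))   # number of samples
print(dataset[0])     # a PyG data object (exact per-item semantics depend on the dataset class)
```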

Batch Samplers

For Drugs-75K and Kraken, use the EnsembleSampler from loaders.samplers to sample mini-batches of molecules. The sampling strategy can be set to random, which randomly samples one conformer per ensemble; first, which always loads the first conformer of each ensemble; or all, which loads all conformers.
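
A minimal sketch of this setup, assuming EnsembleSampler takes the dataset and a strategy argument and is plugged into a PyG DataLoader as a batch sampler (the exact constructor and wiring may differ; see loaders/samplers.py and the training scripts):

```python
# Sketch only: constructor arguments and DataLoader wiring are assumptions.
from torch_geometric.loader import DataLoader
from loaders.samplers import EnsembleSampler
from data.drugs import Drugs

dataset = Drugs(root='datasets/Drugs')                                 # hypothetical root path
sampler = EnsembleSampler(dataset, batch_size=32, strategy='random')   # 'first' or 'all' also possible
loader = DataLoader(dataset, batch_sampler=sampler)                    # assumed usage as a batch sampler

for batch in loader:
    ...  # each batch holds one sampled conformer per molecule under the 'random' strategy
```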

Since EE and BDE involve interactions between two molecules, we implement another sampler, EnsembleMultiBatchSampler, in loaders.samplers. In this case, each conformer of the system is loaded as a tuple [data_0, data_1], whose entries correspond to the two molecules in the system.
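
A corresponding sketch for the two-molecule datasets, under the same assumptions about the constructor and DataLoader wiring:

```python
# Sketch only: constructor arguments and DataLoader wiring are assumptions;
# see loaders/samplers.py for the real interface.
from torch_geometric.loader import DataLoader
from loaders.samplers import EnsembleMultiBatchSampler
from data.bde import BDE

dataset = BDE(root='datasets/BDE')                                              # hypothetical root path
sampler = EnsembleMultiBatchSampler(dataset, batch_size=32, strategy='random')  # assumed signature
loader = DataLoader(dataset, batch_sampler=sampler)

for data_0, data_1 in loader:
    ...  # data_0 and data_1 are the two molecules of each system, as described above
```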

Instructions on Reproducing Results

The default hyperparameters are set in config.py; other model-dependent parameters are stored separately in the params folder. To reproduce results for a given model, change the config parameter in ConfigLoader to the corresponding model parameter file, then specify dataset and target and override other parameters (e.g., learning_rate) as needed via command-line arguments.

| Model | Training script and key parameters |
| ----- | ---------------------------------- |
| 1D fingerprint model | `train_fp_rf.py` |
| 1D SMILES-based sequential model | `train_1d.py --model1d:model SEQ_ENCODER` |
| 2D model | `train_2d.py --model2d:model GRAPH_ENCODER` |
| Single-conformer 3D model | `train_3d.py --model3d:augmentation False --model3d:model GRAPH_ENCODER` |
| 3D model with conformer sampling | `train_3d.py --model3d:augmentation True --model3d:model GRAPH_ENCODER` |
| Conformer ensemble model | `train_ensemble.py --model4d:set_encoder SET_ENCODER --model4d:graph_encoder GRAPH_ENCODER` |
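
For instance, a 2D-model run might look like the following. This is illustrative only: the flag spellings for dataset and target and the target name are assumptions, and GRAPH_ENCODER is the placeholder from the table above; check config.py and the params files for the actual argument names and accepted values.

```sh
# Illustrative command only -- argument spellings are assumptions.
python train_2d.py --dataset Kraken --target B5 --model2d:model GRAPH_ENCODER --learning_rate 1e-4
```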

License

The MARCEL benchmarks are licensed under the Apache 2.0 License.
