GEM-2: Next Generation Molecular Property Prediction Network by Modeling Full-range Many-body Interactions
GEM-2 is a molecular modeling framework that comprehensively models full-range many-body interactions in molecules. Multiple tracks are utilized to model the full-range interactions among many-bodies of different orders, and a novel axial attention mechanism is designed to approximate full-range interaction modeling at a much lower computational cost. A preprint version of our work can be found here.
The overall framework of GEM-2. First, a molecule is described by the representations of many-bodies of multiple orders. Then, Optimus blocks are designed to update the representations. Each Optimus block contains multiple tracks, one per many-body order, which exchange information through the axial attention mechanism.
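To give a rough idea of how axial attention keeps the cost of modeling high-order interactions manageable, below is a minimal, self-contained NumPy sketch. It is a generic single-head axial attention over a pair (2-body) representation, not the actual Optimus block from this repo; all names, shapes, and the toy usage are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(pair, Wq, Wk, Wv, axis):
    """Single-head self-attention applied along one axis of a pair tensor.

    pair: [N, N, C] 2-body (pair) representation.
    axis=1 attends over the second index, axis=0 over the first.
    """
    x = pair if axis == 1 else pair.transpose(1, 0, 2)
    q, k, v = x @ Wq, x @ Wk, x @ Wv                          # [N, N, C]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # [N, N, N]
    out = softmax(scores) @ v                                 # [N, N, C]
    return out if axis == 1 else out.transpose(1, 0, 2)

# Toy usage: N atoms with channel size C.
N, C = 8, 16
rng = np.random.default_rng(0)
pair = rng.normal(size=(N, N, C))
Wq, Wk, Wv = (rng.normal(size=(C, C)) / np.sqrt(C) for _ in range(3))

# Two axial passes (one per axis) approximate full pair-to-pair attention
# at O(N^3) cost instead of O(N^4).
pair = pair + axial_attention(pair, Wq, Wk, Wv, axis=1)
pair = pair + axial_attention(pair, Wq, Wk, Wv, axis=0)
print(pair.shape)  # (8, 8, 16)
```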
PCQM4Mv2 is a large-scale quantum chemistry dataset containing the DFT-calculated HOMO-LUMO energy gaps. The OGB leaderboard for PCQM4Mv2 can be found here.
LIT-PCBA is a virtual screening dataset containing protein targets with their corresponding active and inactive compounds selected from high-confidence PubChem Bioassay data.
- OS support: Linux
- Python version: 3.6, 3.7, 3.8
name | version |
---|---|
numpy | - |
pandas | - |
paddlepaddle | >=2.0.0 |
rdkit-pypi | - |
sklearn | - |
First, download or clone the latest GitHub repository:
git clone https://github.com/PaddlePaddle/PaddleHelix.git
git checkout dev
cd apps/pretrained_compound/ChemRL/GEM-2
You can download the PCQM4Mv2 dataset from the OGB website:
wget https://dgl-data.s3-accelerate.amazonaws.com/dataset/OGB-LSC/pcqm4m-v2.zip
You can also download the processed PCQM4Mv2 dataset with RDKit-generated 3D information from here. Then use tar to extract the data.
wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/compound_datasets/pcqm4mv2_gem2.tgz
mkdir -p ../data
tar xzf pcqm4mv2_gem2.tgz -C ../data
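As an optional sanity check that the raw labels are readable, the snippet below loads the standard OGB raw CSV with pandas. The exact paths depend on which archive you downloaded and where you extracted it, so treat them as assumptions.

```python
import pandas as pd

# Path is an assumption: the official OGB zip unpacks to pcqm4m-v2/raw/data.csv.gz;
# adjust it to wherever you extracted the data (e.g. under ../data).
df = pd.read_csv("../data/pcqm4m-v2/raw/data.csv.gz")

print(df.shape)
print(df.columns.tolist())  # should include the SMILES strings and the HOMO-LUMO gap labels
```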
We release the checkpoint for reproducing the results on PCQM4Mv2, which can also serve as a pretrained model for downstream tasks.
wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/models/molecular_modeling/gem2_l12_c256.pdparams
mkdir -p model
mv gem2_l12_c256.pdparams model
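To verify the checkpoint downloaded correctly, you can load it with `paddle.load`; the commented-out model-building call below is only a hypothetical placeholder, since the actual construction code lives in this repo's training scripts.

```python
import paddle

# Load the released GEM-2 parameters into a plain state dict.
state_dict = paddle.load("model/gem2_l12_c256.pdparams")
print(f"{len(state_dict)} parameter tensors loaded")

# Hypothetical usage: build the model from the repo's config and restore the weights.
# model = build_gem2_model(config)      # placeholder, not an actual API in this repo
# model.set_state_dict(state_dict)
```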
You can adjust the JSON files in the config folder to change the training settings. The main fields are:
- `data_dir`: where the data is located
- `task_names`: the name of the label column in the data file
- `model`: model-related settings, such as the channel size and dropout
- `data`: data transform settings
- `lr`: the learning rate
- `warmup_step`: the number of steps used to warm the learning rate up to `lr`
- `mid_step`: the number of steps before the learning rate starts to decay
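For example, you could adjust a config programmatically with Python's json module; the file name below is a placeholder, so point it at the actual JSON file in the config folder.

```python
import json

# Placeholder file name: replace with the actual JSON file in the config folder.
config_path = "config/gem2_default.json"

with open(config_path) as f:
    config = json.load(f)

config["lr"] = 1e-4            # learning rate
config["warmup_step"] = 1000   # steps used to warm the learning rate up to lr

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```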
sh scripts/train.sh
The models will be saved under `./model`. It takes around 60 minutes to finish one epoch on 16 A100 cards with a total batch size of 512.
To reproduce the result on the OGB leaderboard, just run the inference command:
sh scripts/inference.sh
If you use the code or data in this repo, please cite:
@article{liu2022gem-2,
doi = {10.48550/ARXIV.2208.05863},
url = {https://arxiv.org/abs/2208.05863},
author = {Liu, Lihang and He, Donglong and Fang, Xiaomin and Zhang, Shanzhuo and Wang, Fan and He, Jingzhou and Wu, Hua},
title = {GEM-2: Next Generation Molecular Property Prediction Network by Modeling Full-range Many-body Interactions},
publisher = {arXiv},
year = {2022}
}