Accepted in TMLR (Transactions on Machine Learning Research) and spotlighted at the NeurIPS GenBio Workshop 2023 [arxiv].
Official GitHub repository for TacoGFN: Target Conditioned GFlowNet for Structure-based Drug Design by Tony Shen, Seonghwan Seo, Grayson Lee, Mohit Pandey, Jason Smith, Artem Cherkasov, Woo Youn Kim and Martin Ester.
We frame structure-based drug design as a reinforcement learning task, where the goal is to search the wider chemical space for molecules with desirable properties rather than to fit a training data distribution. We propose TacoGFN, a Generative Flow Network conditioned on protein pocket structure, using binding affinity, drug-likeness and synthesizability measures as our reward.
Empirically, our method outperforms state-of-the-art methods on the CrossDocked2020 benchmark for every molecular property (Vina score, QED, SA), while reducing generation time by multiple orders of magnitude. TacoGFN achieves a median docking score of −8.82 and a Novel Hit Rate of 52.63%.
If you have any problems or need help with the code, please open an issue or contact tonyzshen@gmail.com.
Before running any scripts, please download the necessary package:
https://figshare.com/s/2738ce20d82463227113
This package includes:
- trained model weights
- pre-computed pharmacophores in lmdb
- saved pocket-graphs in lmdb
- misc files (data splitting, avg vina score, pocket centroid, generated molecules)
Please also set up the Conda environment and install the necessary dependencies.
conda env create -f environment.yml
conda activate tacogfn
cd src/molvoxel
pip install -e .
If you wish to re-train the model, we provide the hyperparameter (HPS) files for the 3 model variants presented in our paper:
- TacoGFN: hps/crossdocked_mo_256.json
- TacoGFN (ZINCDock-15M): hps/zinc_mo_256.json
- TacoGFN no pocket conditioning: hps/zinc_mo_256_noph.json
Note: TacoGFN_ranked is the same model as TacoGFN - we just generate 500 instead of 100 molecules at inference time, and rank by predicted docking score. The inference script takes care of that.
python3 src/tacogfn/tasks/pharmaco_frag.py --hps_path "$HPS_PATH"
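As a convenience, the three variants listed above can be mapped to their hyperparameter files and turned into the training command programmatically. The following is a minimal sketch and not part of the repository: the `VARIANTS` keys and the `training_command` helper are hypothetical names, and only the file paths, the script path and the `--hps_path` flag come from the instructions above.

```python
import shlex
from typing import List

# Hyperparameter files for the three variants named in this README.
# The short keys are illustrative labels, not identifiers used by the code base.
VARIANTS = {
    "tacogfn": "hps/crossdocked_mo_256.json",
    "tacogfn_zincdock": "hps/zinc_mo_256.json",
    "tacogfn_no_pocket": "hps/zinc_mo_256_noph.json",
}

TRAIN_SCRIPT = "src/tacogfn/tasks/pharmaco_frag.py"


def training_command(variant: str) -> List[str]:
    """Return the argv list that trains the given variant."""
    if variant not in VARIANTS:
        raise ValueError(
            f"unknown variant {variant!r}; choose from {sorted(VARIANTS)}"
        )
    return ["python3", TRAIN_SCRIPT, "--hps_path", VARIANTS[variant]]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(training_command("tacogfn")))
```

The printed command can be pasted into a shell from the repository root, or passed to `subprocess.run` as a list if you prefer to launch training from Python.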
If you just wish to generate molecules and evaluate them, we also provide trained model files. The following script re-generates molecules and computes metrics on them (docking needs to be computed separately).
bash scripts/generate_and_evaluate.sh
Note: if you have re-trained a model, you can specify your model path to generate molecules and evaluate its performance. Set $NUM_PER_POCKET to 100 for normal runs, or to 500 if you'd like to run TacoGFN_ranked.
python3 src/tasks/generate_molecules.py \
--model_path "$MODEL_PATH" \
--num_per_pocket $NUM_PER_POCKET \
--comment "${COMMENT}"
python3 src/tasks/evaluate_molecules.py \
--molecules_path "misc/generated_molecules/1.0_1.0_${NUM_PER_POCKET}_${COMMENT}.json"
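The TacoGFN_ranked setting mentioned above (generate 500 molecules per pocket, then rank by predicted docking score) can be illustrated with a small sketch. The inference script already takes care of this, so the code below is only for intuition: the `rank_by_predicted_docking` helper, the `(smiles, score)` tuple layout and the choice to keep the top 100 are assumptions, not the repository's implementation. It relies on the usual convention that a more negative docking score indicates stronger predicted binding.

```python
from typing import List, Tuple

Scored = Tuple[str, float]  # (SMILES string, predicted docking score)


def rank_by_predicted_docking(candidates: List[Scored], keep: int = 100) -> List[Scored]:
    """Sort candidates from most to least negative score and keep the first `keep`."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # Ascending sort puts the most negative (best) scores first.
    ranked = sorted(candidates, key=lambda item: item[1])
    return ranked[:keep]


if __name__ == "__main__":
    # Toy scores for illustration only; these are not model outputs.
    pool = [("CCO", -4.1), ("c1ccccc1", -5.3), ("CC(=O)O", -3.8)]
    for smiles, score in rank_by_predicted_docking(pool, keep=2):
        print(f"{smiles}\t{score:.1f}")
```

Sorting the full pool is sufficient at this scale (500 candidates per pocket); a heap-based selection would only matter for much larger pools.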
To display the metrics, we provide the generated molecules from our model and baseline models in misc/evaluations. The following script computes the metrics reported in Table 1 and Table 2:
bash scripts/see_all_results.sh
Note: if you've generated molecules from a trained model, please compute docking scores using QVina 2.1 first. You can then run the following:
python3 src/tasks/aggergate_evals.py --eval_path "$EVAL_FILE"
@article{
shen2024tacogfn,
title={Taco{GFN}: Target-conditioned {GF}lowNet for Structure-based Drug Design},
author={Tony Shen and Seonghwan Seo and Grayson Lee and Mohit Pandey and Jason R Smith and Artem Cherkasov and Woo Youn Kim and Martin Ester},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2024},
url={https://openreview.net/forum?id=N8cPv95zOU},
}
This project modifies the GFlowNet library for graph and molecular data.