This repository hosts DST (Differentiable Scaffolding Tree for Molecule Optimization) by Tianfan Fu*, Wenhao Gao*, Cao Xiao, Jacob Yasonik, Connor W. Coley, and Jimeng Sun, which enables gradient-based optimization on chemical graphs.
To install locally, we recommend installing via pip and conda. Please see conda.yml for the package dependencies.
conda create -n dst python=3.7
conda activate dst
pip install torch
pip install PyTDC
conda install -c rdkit rdkit
Activate the conda environment.
conda activate dst
Make the directories for saved models and results.
mkdir -p save_model result
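To confirm the environment is set up, here is a quick import check (a convenience snippet, not part of the repository):

```python
# Sanity check: these imports should all succeed inside the dst environment.
import torch
import rdkit
import tdc

print("torch", torch.__version__, "| rdkit", rdkit.__version__, "| tdc OK")
```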
In our setup, we restrict the number of oracle calls. In realistic discovery settings, the oracle acquisition cost is usually not negligible.
We use the ZINC database, which contains around 250K drug-like molecules and can be downloaded as follows.
python src/download.py
- output
  - data/zinc.tab: all the SMILES in ZINC (around 250K).
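To inspect the downloaded file, one can load it with pandas (an illustrative snippet that assumes a tab-separated file with a header; check the actual layout on your copy):

```python
import pandas as pd

# data/zinc.tab is produced by src/download.py; we assume a tab-separated
# layout whose rows contain SMILES strings (verify the header if it differs).
df = pd.read_csv("data/zinc.tab", sep="\t")
print(len(df), "molecules")
print(df.head())
```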
An oracle is a property evaluator: a function whose input is a molecular structure and whose output is the property score. We consider the following oracles:
- JNK3: biological activity against JNK3, ranging from 0 to 1.
- GSK3B: biological activity against GSK3B, ranging from 0 to 1.
- QED: Quantitative Estimate of Drug-likeness, ranging from 0 to 1.
- SA: Synthetic Accessibility; we normalize SA to (0,1).
- LogP: solubility and synthetic accessibility of a compound; it ranges from negative infinity to positive infinity.
For all the property scores above, higher is more desirable.
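All of these oracles are available through the PyTDC package installed above; for example:

```python
from tdc import Oracle

# Each TDC oracle maps a SMILES string (or a list of SMILES) to its score.
jnk3 = Oracle(name="JNK3")
qed = Oracle(name="QED")

aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
print("JNK3:", jnk3(aspirin))
print("QED:", qed(aspirin))
```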
There are two kinds of optimization tasks: single-objective and multi-objective. The multi-objective tasks are jnkgsk (JNK3 + GSK3B) and qedsajnkgsk (QED + SA + JNK3 + GSK3B).
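As a sketch of how such scores might be combined (the repository's exact aggregation may differ), one can average the individual oracle values, which here all lie in [0, 1]:

```python
from tdc import Oracle

jnk3 = Oracle(name="JNK3")
gsk3b = Oracle(name="GSK3B")

def jnkgsk_score(smiles):
    # Arithmetic mean of the two activity scores; both lie in [0, 1].
    return 0.5 * (jnk3(smiles) + gsk3b(smiles))
```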
In this project, the basic unit is the substructure, which can be an atom or a single ring. The vocabulary is the set of frequent substructures; see the sketch after the list below for an illustration.
python src/vocabulary.py
- input
  - data/zinc.tab: all the SMILES in ZINC (around 250K).
- output
  - data/substructure.txt: all the substructures in ZINC.
  - data/vocabulary.txt: the vocabulary, i.e. the frequent substructures.
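To illustrate what counts as a substructure (a simplified sketch; the actual extraction in src/vocabulary.py may differ in details), atoms and single rings can be enumerated with RDKit:

```python
from rdkit import Chem

def substructures(smiles):
    """Atom symbols plus the SMILES of each single ring (illustrative only)."""
    mol = Chem.MolFromSmiles(smiles)
    subs = {atom.GetSymbol() for atom in mol.GetAtoms()}
    for ring in mol.GetRingInfo().AtomRings():
        subs.add(Chem.MolFragmentToSmiles(mol, atomsToUse=list(ring)))
    return subs

print(substructures("c1ccccc1CCO"))  # e.g. {'C', 'O', 'c1ccccc1'}
```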
We remove molecules that contain substructures outside the vocabulary; a sketch of this filtering follows the list below.
python src/clean.py
- input
  - data/vocabulary.txt: the vocabulary.
  - data/zinc.tab: all the SMILES in ZINC.
- output
  - data/zinc_clean.txt: the cleaned SMILES.
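A sketch of the filtering logic, assuming data/vocabulary.txt lists one substructure per line (the real file format may differ) and reusing the illustrative substructures() above:

```python
# Keep only molecules whose substructures all appear in the vocabulary.
vocab = {line.strip() for line in open("data/vocabulary.txt") if line.strip()}

smiles_list = ["c1ccccc1CCO", "CCN"]     # placeholder: SMILES loaded from data/zinc.tab
clean = [smi for smi in smiles_list
         if substructures(smi) <= vocab]  # substructures() from the sketch above

with open("data/zinc_clean.txt", "w") as f:
    f.write("\n".join(clean))
```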
We use the oracles to evaluate molecular properties and obtain the labels for training the graph neural network (GNN).
python src/labelling.py
- input
  - data/zinc_clean.txt: the cleaned SMILES from ZINC.
- output
  - data/zinc_label.txt: six columns: smiles, qed, sa, jnk, gsk, logp. To limit oracle calls, we only label a 10K subset of ZINC. A labelling sketch follows this list.
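Here is a sketch of how such a label table could be built with the TDC oracles (src/labelling.py is authoritative; note in particular that the repository normalizes SA to (0,1), whereas TDC's raw SA score is on a different scale):

```python
import pandas as pd
from tdc import Oracle

# Score a 10K subset with the five oracles; column names follow the
# description of data/zinc_label.txt above.
oracles = {"qed": Oracle(name="QED"), "sa": Oracle(name="SA"),
           "jnk": Oracle(name="JNK3"), "gsk": Oracle(name="GSK3B"),
           "logp": Oracle(name="LogP")}

# Assumes one SMILES per line in the cleaned file (format may differ).
smiles_list = open("data/zinc_clean.txt").read().split()
subset = smiles_list[:10000]

table = pd.DataFrame({"smiles": subset})
for col, oracle in oracles.items():
    table[col] = oracle(subset)  # TDC oracles accept a list of SMILES
table.to_csv("data/zinc_label.txt", sep="\t", index=False)
```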
In our setup, we restrict the number of oracle calls in both GNN training and de novo design.
This step corresponds to Section 3.2 of the paper.
python src/train.py $prop $train_oracle
- prop: the property to optimize, one of qed, logp, jnk, gsk, jnkgsk, qedsajnkgsk.
- train_oracle: the number of oracle calls allowed when training the GNN.
- input
  - data/zinc_label.txt: training data of (SMILES, y) pairs, where SMILES is the molecule and y is the label, so that the model learns y = GNN(SMILES).
- output
  - save_model/model_epoch_*.ckpt: the saved GNN model.
- log
  - loss/{$prop}.pkl: saves the validation loss.
For example (a simplified sketch of the supervised setup follows the command):
python src/train.py jnkgsk 5000
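To make the supervised setup concrete, here is a deliberately simplified stand-in for y = GNN(SMILES): an MLP regressor on Morgan fingerprints. The repository trains a GNN on scaffolding trees, so this sketch only illustrates the training objective, not the actual model:

```python
import numpy as np
import torch
import torch.nn as nn
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles, n_bits=2048):
    # Morgan fingerprint as a float tensor (stand-in for graph features).
    fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.float32)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return torch.from_numpy(arr)

# Sigmoid suits the [0,1]-valued properties (qed, sa, jnk, gsk); drop it for logp.
model = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

train_pairs = [("CCO", 0.4), ("c1ccccc1", 0.45)]  # placeholder: rows from data/zinc_label.txt
for smi, y in train_pairs:
    pred = model(fingerprint(smi))
    loss = loss_fn(pred, torch.tensor([y], dtype=torch.float32))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```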
This step corresponds to Sections 3.3 and 3.4 of the paper.
python src/denovo.py $prop $denovo_oracle
- prop: the property to optimize, one of qed, logp, jnk, gsk, jnkgsk, qedsajnkgsk.
- denovo_oracle: the number of oracle calls allowed.
- input
  - save_model/{$prop}_*.ckpt: the saved GNN model, where * is the epoch number.
- output
  - result/{$prop}.pkl: the set of generated molecules.
For example (a conceptual sketch of the gradient step follows the command):
python src/denovo.py jnkgsk 5000
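Conceptually, DST relaxes the discrete choice of substructures into continuous weights, so gradient ascent on the GNN's prediction can propose edits, after which the molecule is discretized back from the optimized weights. A toy sketch of this loop (the linear surrogate and vocabulary size here are placeholders; the real implementation in src/denovo.py operates on scaffolding trees):

```python
import torch

vocab_size = 100                               # illustrative vocabulary size
surrogate = torch.nn.Linear(vocab_size, 1)     # stand-in for the trained GNN

logits = torch.zeros(vocab_size, requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=0.1)

for _ in range(100):
    weights = torch.softmax(logits, dim=0)     # soft (differentiable) node selection
    score = surrogate(weights).squeeze()       # predicted property of the soft molecule
    (-score).backward()                        # gradient ascent on the prediction
    optimizer.step()
    optimizer.zero_grad()

best = int(torch.argmax(logits.detach()))      # discretize: most-favored substructure
```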
python src/evaluate.py $prop
- input
  - result/{$prop}.pkl: the generated molecules.
- output
  - diversity, novelty, and average property of the top-100 molecules with the highest property scores.
For example (an evaluation sketch using TDC follows the command):
python src/evaluate.py jnkgsk
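Diversity and novelty can also be computed directly with TDC evaluators (a sketch; src/evaluate.py may use different definitions):

```python
from tdc import Evaluator

diversity = Evaluator(name="Diversity")
novelty = Evaluator(name="Novelty")

generated = ["CCO", "c1ccccc1"]  # placeholder: top-100 SMILES from result/{prop}.pkl
train = ["CCN", "CCO"]           # placeholder: training SMILES from ZINC

print("diversity:", diversity(generated))
print("novelty:", novelty(generated, train))
```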
python src/multiobjective.py
Please contact futianfan@gmail.com or gaowh19@gmail.com for help or submit an issue.
If you found this package useful, please cite our paper:
@article{fu2020differentiable,
title={Differentiable Scaffolding Tree for Molecule Optimization},
author={Fu, Tianfan and Gao, Wenhao and Xiao, Cao and Yasonik, Jacob and Coley, Connor W. and Sun, Jimeng},
journal={International Conference on Learning Representations (ICLR)},
year={2022}
}