Wengong Jin, Regina Barzilay, Tommi Jaakkola. Junction Tree Variational Autoencoder for Molecular Graph Generation. arXiv preprint arXiv:1802.04364, 2018.
JTNN uses the junction tree algorithm to decompose a molecular graph into a tree of chemical substructures.
The model then encodes the tree and the graph into two separate latent vectors `z_T` and `z_G`.
Details can be found in the original paper. The brief process is as below (from the original paper):
Goal: JTNN is an autoencoder model, aiming to learn latent representations of molecular graphs. These representations can be used for downstream tasks, such as property prediction or molecule optimization.
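For intuition, below is a minimal sketch of the cluster-extraction step of the tree decomposition, using RDKit. It is a simplification: the full algorithm in the paper also merges rings sharing more than two atoms and connects overlapping clusters into a junction tree via a spanning tree.

```python
from rdkit import Chem

def extract_clusters(smiles):
    # One cluster per ring, plus one two-atom cluster per non-ring bond.
    mol = Chem.MolFromSmiles(smiles)
    rings = [list(ring) for ring in Chem.GetSymmSSSR(mol)]
    bonds = [[bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()]
             for bond in mol.GetBonds() if not bond.IsInRing()]
    return rings + bonds

# The benzene ring of fluorobenzene forms one cluster; the C-F bond another.
print(extract_clusters('Fc1ccccc1'))
```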
The ZINC database is a curated collection of commercially available chemical compounds prepared especially for virtual screening. (introduction from Wikipedia)
Generally speaking, molecules in the ZINC dataset are more drug-like. We use ~220,000 molecules for training and 5,000 molecules for validation.
The class `JTNNDataset` processes a SMILES string into a dict, consisting of a junction tree, a graph with encoded nodes (atoms) and edges (bonds), and other information for the model to use.
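As a quick illustration, you can inspect a processed sample as below. This is a sketch only: the import path and constructor arguments are assumptions based on this example's layout, so check the dataset class in this directory for the exact API.

```python
# Assumed import path and arguments -- verify against this example's source.
from jtnn import JTNNDataset

dataset = JTNNDataset(data='train', vocab='vocab', training=True)
sample = dataset[0]
# Each item is a dict holding the junction tree, the molecular graph with
# encoded atom/bond features, and auxiliary inputs for the model.
print(sample.keys())
```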
To start training, use `python train.py`. By default, the script will use the ZINC dataset with a preprocessed vocabulary, and save model checkpoints periodically in the current working directory.
To start evaluation, use `python reconstruct_eval.py`. By default, we will perform evaluation with DGL's pre-trained model. During the evaluation, the program will print out the success rate of molecule reconstruction.
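Reconstruction success is typically judged by canonical-SMILES equality after an encode-decode round trip. The helper below sketches that check with RDKit; the function name is ours, not part of `reconstruct_eval.py`.

```python
from rdkit import Chem

def same_molecule(smiles_in, smiles_out):
    # Canonicalize both SMILES strings before comparing them.
    def canonical(s):
        return Chem.MolToSmiles(Chem.MolFromSmiles(s))
    return canonical(smiles_in) == canonical(smiles_out)
```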
Below gives the statistics of our pre-trained `JTNN_ZINC` model.

| Pre-trained model | % Reconstruction Accuracy |
| ----------------- | ------------------------- |
| `JTNN_ZINC`       | 73.7                      |
Here we draw some "neighbors" of a given molecule, by adding noise to its intermediate representations. You can download the notebook with `wget https://data.dgl.ai/dgllife/jtnn_viz_neighbor_mol.ipynb`. Please put this notebook in the current directory (`examples/pytorch/model_zoo/chem/generative_models/jtnn/`).
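The idea behind the notebook can be sketched as follows: perturb the two latent vectors with Gaussian noise and decode each perturbed pair. Here `model.decode` and the latent-vector inputs are stand-ins for the model's actual API, and the noise scale is an illustrative choice; see the notebook for the real calls.

```python
import torch

def sample_neighbors(model, z_tree, z_graph, n_neighbors=8, noise_scale=0.5):
    # Decode molecules from Gaussian-perturbed copies of the latent vectors.
    neighbors = []
    for _ in range(n_neighbors):
        zt = z_tree + noise_scale * torch.randn_like(z_tree)
        zg = z_graph + noise_scale * torch.randn_like(z_graph)
        neighbors.append(model.decode(zt, zg))  # assumed decoding call
    return neighbors
```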
If you want to use your own dataset, please create a file with one SMILES string per line, as below:
```
CCO
Fc1ccccc1
```
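A malformed SMILES may break preprocessing, so it can be worth validating the file first. Below is a small check with RDKit; `my_dataset.txt` is a placeholder for your own file.

```python
from rdkit import Chem

with open('my_dataset.txt') as f:  # placeholder path
    for i, line in enumerate(f, 1):
        smiles = line.strip()
        if smiles and Chem.MolFromSmiles(smiles) is None:
            print(f'Line {i}: invalid SMILES {smiles!r}')
```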
You can generate the vocabulary file corresponding to your dataset with `python vocab.py -d X -v Y`, where `X` is the path to the dataset and `Y` is the path to save the vocabulary file. An example vocabulary file corresponding to the two molecules above would be:
```
CC
CF
C1=CC=CC=C1
CO
```
If you want to develop a model based on DGL's pre-trained model, it's important to make sure that the vocabulary generated above is a subset of the vocabulary we use for the pre-trained model. When running `vocab.py` as above, we also check whether the new vocabulary is a subset of the default vocabulary and print the result in the terminal as follows:
```
The new vocabulary is a subset of the default vocabulary: True
```
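If you want to reproduce the check yourself, it boils down to a set comparison over the two files; the paths below are placeholders.

```python
def load_vocab(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

new_vocab = load_vocab('my_vocab.txt')    # vocabulary generated above
default_vocab = load_vocab('vocab.txt')   # vocabulary of the pre-trained model
print('The new vocabulary is a subset of the default vocabulary:',
      new_vocab <= default_vocab)
```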
To train on this new dataset, run `python train.py -t X`, where `X` is the path to the new dataset. If you want to use the vocabulary generated above, also add `-v Y`, where `Y` is the path to the vocabulary file we just saved.
To evaluate on this new dataset, run `python reconstruct_eval.py` with the same arguments as above.