Junction Tree Variational Autoencoder for Molecular Graph Generation (JTNN)

Wengong Jin, Regina Barzilay, Tommi Jaakkola. Junction Tree Variational Autoencoder for Molecular Graph Generation. arXiv preprint arXiv:1802.04364, 2018.

JTNN uses the junction tree algorithm to decompose a molecular graph into a tree. The model then encodes the tree and the graph into two separate latent vectors, z_T and z_G. Details can be found in the original paper; the overall process is summarized below (figure from the original paper):

[Figure: overview of the JTNN encode/decode process, from the original paper]

Goal: JTNN is an autoencoder that learns latent representations of molecular graphs. These representations can be used for downstream tasks such as property prediction and molecule optimization.
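To make the two-vector latent code concrete, here is a minimal sketch. The dimensions and values are purely illustrative; in the real model z_T and z_G are produced by the tree and graph encoders.

```python
# Sketch: the JTNN encoder produces two latent vectors, z_T (from the
# junction tree) and z_G (from the molecular graph). The full latent
# code for a molecule is their concatenation. Values are illustrative.
z_T = [0.1, 0.2]   # tree latent vector (stand-in)
z_G = [0.3, 0.4]   # graph latent vector (stand-in)
z = z_T + z_G      # combined latent code
print(len(z))
```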

Dataset

ZINC

The ZINC database is a curated collection of commercially available chemical compounds prepared especially for virtual screening (description from Wikipedia).

Generally speaking, molecules in the ZINC dataset are more drug-like. We use ~220,000 molecules for training and 5,000 for validation.

Preprocessing

The JTNNDataset class processes a SMILES string into a dict consisting of a junction tree, a graph with encoded nodes (atoms) and edges (bonds), and other information used by the model.
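The shape of such a dataset can be sketched as below. This is a toy stand-in, not the actual JTNNDataset: the real class additionally builds the junction tree and featurizes atoms and bonds, which is only stubbed here.

```python
# A minimal sketch of a dataset mapping SMILES strings to per-sample
# dicts, mirroring the role of JTNNDataset (hypothetical simplification;
# the real class also returns the junction tree and featurized graph).
class ToySMILESDataset:
    def __init__(self, smiles_list):
        self.smiles = smiles_list

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, idx):
        s = self.smiles[idx]
        # The real JTNNDataset dict also contains the junction tree and
        # the molecular graph with encoded atoms/bonds.
        return {"smiles": s}

ds = ToySMILESDataset(["CCO", "Fc1ccccc1"])
print(len(ds), ds[0]["smiles"])
```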

Usage

Training

To start training, use python train.py. By default, the script will use the ZINC dataset with a preprocessed vocabulary and save model checkpoints periodically in the current working directory.

Evaluation

To start evaluation, use python reconstruct_eval.py. By default, the script evaluates DGL's pre-trained model and prints the success rate of molecule reconstruction.

Pre-trained models

Below are the statistics of our pre-trained JTNN_ZINC model.

| Pre-trained model | % Reconstruction Accuracy |
| ----------------- | ------------------------- |
| JTNN_ZINC         | 73.7                      |

Visualization

Here we draw some "neighbors" of a given molecule by adding noise to its latent representation. You can download the notebook with wget https://data.dgl.ai/dgllife/jtnn_viz_neighbor_mol.ipynb. Please place it in the current directory (examples/pytorch/model_zoo/chem/generative_models/jtnn/).
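The perturbation step can be sketched as follows. This is a simplified illustration, not the notebook's code: z stands in for the latent code the JTNN encoder would produce, and real code would decode each noisy vector back into a molecule.

```python
import random

# Sketch: generate "neighbor" latent codes by adding Gaussian noise to
# a molecule's latent vector. z is a stand-in for the (z_T, z_G) code;
# sigma controls how far the neighbors stray from the original.
random.seed(0)

def neighbors(z, n=3, sigma=0.1):
    return [[zi + random.gauss(0.0, sigma) for zi in z] for _ in range(n)]

z = [0.5, -1.2, 0.3]        # illustrative latent vector
near = neighbors(z)          # three perturbed copies of z
print(len(near), len(near[0]))
```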

Given Molecule

[Figure: the given molecule]

Neighbor Molecules

[Figure: neighbor molecules generated from the noisy latent codes]

Dataset configuration

If you want to use your own dataset, please create a file with one SMILES string per line, as below:

CCO
Fc1ccccc1
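Such a file can be produced from a list of SMILES strings with a few lines of Python; the file name my_dataset.txt below is just an example.

```python
# Write a custom dataset file: one SMILES string per line.
smiles = ["CCO", "Fc1ccccc1"]
with open("my_dataset.txt", "w") as f:
    f.write("\n".join(smiles) + "\n")

# Read it back to confirm the format.
with open("my_dataset.txt") as f:
    lines = [line.strip() for line in f if line.strip()]
print(lines)
```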

You can generate the vocabulary file corresponding to your dataset with python vocab.py -d X -v Y, where X is the path to the dataset and Y is the path where the vocabulary file will be saved. An example vocabulary file corresponding to the two molecules above would be:

CC
CF
C1=CC=CC=C1
CO

If you want to develop a model based on DGL's pre-trained model, make sure that the vocabulary generated above is a subset of the vocabulary used for the pre-trained model. When you run vocab.py as above, it also checks whether the new vocabulary is a subset of the pre-trained model's vocabulary and prints the result in the terminal as follows:

The new vocabulary is a subset of the default vocabulary: True
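The check itself amounts to a set-containment test, which can be sketched as below. The default vocabulary shown is a small illustrative fragment, not the actual pre-trained vocabulary.

```python
# Sketch of the subset check vocab.py performs: every cluster in the
# new dataset's vocabulary must appear in the pre-trained model's
# vocabulary. Both sets here are illustrative fragments.
default_vocab = {"CC", "CF", "CO", "C1=CC=CC=C1", "CN", "C=O"}
new_vocab = {"CC", "CF", "CO", "C1=CC=CC=C1"}

is_subset = new_vocab.issubset(default_vocab)
print("The new vocabulary is a subset of the default vocabulary:", is_subset)
```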

To train on this new dataset, run

python train.py -t X

where X is the path to the new dataset. To use the vocabulary generated above, also add -v Y, where Y is the path to the saved vocabulary file.

To evaluate on the new dataset, run python reconstruct_eval.py with the same arguments as above.