Skip to content

Latest commit

 

History

History
110 lines (78 loc) · 3.76 KB

ch12_generativemodels.asciidoc

File metadata and controls

110 lines (78 loc) · 3.76 KB

Chapter 12: Let the computer think about the chemical structure

jupyter

A generation model is one of the things that Deep Learning has had a great impact on the medicinal chemistry. In particular, the evolution of generation models in the last few years is amazing. Here, let’s propose a new synthesis proposal using REINVENT developed by Marcus Olivecrona.

What is a Generation Model?

The prediction model built in Chapter 11 is generally called a discrimination model. On the other hand, by modeling the distribution of inputs, it is possible to generate sampling or input data from the model. This is called a generative model.

For more details , we recommend reading PRML 1.5.4

Preparation

Install a deep learning library called PyTorch with conda. It does not work with the new version, so specify the version and install it.

What is pytorch?

Like keras, it is a library to use TensorFlow more conveniently.

$ conda install pytorch=0.3.1 -c pytorch

Then clone REINVENT itself from GitHub.

$ cd <path to your working directory>
$ git clone https://github.com/MarcusOlivecrona/REINVENT.git

Next, download a pre-trained model with about 1.1 million data sets of ChEMBL and replace it with the original data. This data takes five or six hours using the GTX 1080Ti GPU machine, but if you want to train yourself, the GPU machine is a must.

$ wget https://github.com/Mishima-syk/13/raw/master/generator_handson/data.zip
$ unzip data.zip
$ mv data ./REINVENT/

Now you are ready.

Illustration

Here we create a model that produces an analogue of the antidiabetic drug sitagliptin, known commercially as Januvia.

First, train the model to generate a highly similar structure using the tanimoto coefficients as scores. This time I will train 3000 steps, but it will take about 7 or 8 hours with Macbook Air, which is a little earlier. If you can not wait, please use the data here.

./main.py --scoring-function tanimoto --scoring-function-kwargs query_structure 'N[C@@H](CC(=O)N1CCn2c(C1)nnc2C(F)(F)F)Cc3cc(F)c(F)cc3F' --num-steps 3000 --sigma 80

From here, I will launch jupyter notebook.

Load the necessary libraries. Specify the REINVENT directory for sys.path.append.

%matplotlib inline
import sys
sys.path.append("[Your REINVENT DIR]")
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs, Draw
import torch
from model import RNN
from data_structs import Vocabulary
from utils import seq_to_smiles

Next, sample 50 compounds from the trained model.

voc = Vocabulary(init_from_file="/Users/kzfm/mishima_syk/REINVENT/data/Voc")
Agent = RNN(voc)
Agent.rnn.load_state_dict(torch.load("sitagliptin_agent_3000/Agent.ckpt"))
seqs, agent_likelihood, entropy = Agent.sample(50)
smiles = seq_to_smiles(seqs, voc)

Let’s see what kind of structure was actually generated.

mols = []
for smi in smiles:
    mol = Chem.MolFromSmiles(smi)
    if mol is not None:
        mols.append(mol)

Draw.MolsToGridImage(mols, molsPerRow=3, subImgSize=(500,400))

Is there anything like that?

Sitagliptin_analogues

About REINVENT