A generation model is one of the things that Deep Learning has had a great impact on the medicinal chemistry. In particular, the evolution of generation models in the last few years is amazing. Here, let’s propose a new synthesis proposal using REINVENT developed by Marcus Olivecrona.
The prediction model built in Chapter 11 is generally called a discrimination model. On the other hand, by modeling the distribution of inputs, it is possible to generate sampling or input data from the model. This is called a generative model.
For more details , we recommend reading PRML 1.5.4
Install a deep learning library called PyTorch with conda. It does not work with the new version, so specify the version and install it.
Like keras, it is a library to use TensorFlow more conveniently.
$ conda install pytorch=0.3.1 -c pytorch
Then clone REINVENT itself from GitHub.
$ cd <path to your working directory>
$ git clone https://github.com/MarcusOlivecrona/REINVENT.git
Next, download a pre-trained model with about 1.1 million data sets of ChEMBL and replace it with the original data. This data takes five or six hours using the GTX 1080Ti GPU machine, but if you want to train yourself, the GPU machine is a must.
$ wget https://github.com/Mishima-syk/13/raw/master/generator_handson/data.zip
$ unzip data.zip
$ mv data ./REINVENT/
Now you are ready.
Here we create a model that produces an analogue of the antidiabetic drug sitagliptin, known commercially as Januvia.
First, train the model to generate a highly similar structure using the tanimoto coefficients as scores. This time I will train 3000 steps, but it will take about 7 or 8 hours with Macbook Air, which is a little earlier. If you can not wait, please use the data here.
./main.py --scoring-function tanimoto --scoring-function-kwargs query_structure 'N[C@@H](CC(=O)N1CCn2c(C1)nnc2C(F)(F)F)Cc3cc(F)c(F)cc3F' --num-steps 3000 --sigma 80
From here, I will launch jupyter notebook.
Load the necessary libraries. Specify the REINVENT directory for sys.path.append.
%matplotlib inline
import sys
sys.path.append("[Your REINVENT DIR]")
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs, Draw
import torch
from model import RNN
from data_structs import Vocabulary
from utils import seq_to_smiles
Next, sample 50 compounds from the trained model.
voc = Vocabulary(init_from_file="/Users/kzfm/mishima_syk/REINVENT/data/Voc")
Agent = RNN(voc)
Agent.rnn.load_state_dict(torch.load("sitagliptin_agent_3000/Agent.ckpt"))
seqs, agent_likelihood, entropy = Agent.sample(50)
smiles = seq_to_smiles(seqs, voc)
Let’s see what kind of structure was actually generated.
mols = []
for smi in smiles:
mol = Chem.MolFromSmiles(smi)
if mol is not None:
mols.append(mol)
Draw.MolsToGridImage(mols, molsPerRow=3, subImgSize=(500,400))
Is there anything like that?
By all means, please read Molecular De Novo Design through Deep Reinforcement Learning