Error in sampling #7

Closed
Rinkumc opened this issue Nov 23, 2023 · 5 comments

Rinkumc commented Nov 23, 2023

(reinvent4) rinku@admin:~/REINVENT4/configs/toml$ reinvent -l sampling.log sampling.toml
Traceback (most recent call last):
  File "/home/rinku/miniconda3/envs/reinvent4/bin/reinvent", line 8, in <module>
    sys.exit(main())
  File "/home/rinku/miniconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/Reinvent.py", line 284, in main
    runner(input_config, actual_device, tb_logdir, responder_config)
  File "/home/rinku/miniconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/runmodes/samplers/run_sampling.py", line 101, in run_sampling
    sampled = sampler.sample(input_smilies)
  File "/home/rinku/miniconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/runmodes/samplers/mol2mol.py", line 50, in sample
    dataset = Dataset(smilies, self.model.get_vocabulary(), tokenizer)
  File "/home/rinku/miniconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/models/mol2mol/dataset/dataset.py", line 25, in __init__
    enc = self._vocabulary.encode(tokenized)
  File "/home/rinku/miniconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/models/mol2mol/models/vocabulary.py", line 60, in encode
    ohe_vect[i] = self._tokens[token]
KeyError: '[S@+]'
Error occured during sampling

#############################
Sampling.toml
# REINVENT4 TOML input example for sampling
#


run_type = "sampling"
use_cuda = true  # run on the GPU if true, on the CPU if false
json_out_config = "_sampling.json"  # write this TOML to JSON


[parameters]

# Uncomment one of the comment blocks below.  Each generator needs a model
# file and possibly a SMILES file with seed structures.

## Reinvent: de novo sampling
#model_file = "/home/rinku/REINVENT4/priors/reinvent.prior"

## LibInvent: find R-groups for the given scaffolds
#model_file = "priors/libinvent.prior"
#smiles_file = "scaffolds.smi"  # 1 scaffold per line with attachment points

## LinkInvent: find a linker/scaffold to link two fragments
#model_file = "priors/linkinvent.prior"
#smiles_file = "warheads.smi"  # 2 warheads per line separated with '|'

## Mol2Mol: find molecules similar to the provided molecules
model_file = "/home/rinku/REINVENT4/priors/mol2mol_medium_similarity.prior"
smiles_file = "mol2mol.smi"  # 1 compound per line
sample_strategy = "beamsearch"  # multinomial or beamsearch (deterministic)
temperature = 1.0 # temperature in multinomial sampling
tb_logdir = "tb_logs"  # name of the TensorBoard logging directory

output_file = 'sampling.csv'  # sampled SMILES and NLL in CSV format

num_smiles = 110  # number of SMILES to be sampled, 1 per input SMILES
unique_molecules = true  # if true, remove all duplicates based on canonicalized SMILES
randomize_smiles = true # if true shuffle atoms in SMILES randomly

################
Input.smi
O=C1O[C@@H](C(=O)N)CN1c2cc(F)c(cc2)[C@H]3CC[S@](=O)CC3

halx (Contributor) commented Nov 23, 2023

As far as I can see, the problem is that the first and last stereocentres are not real. You should remove those and try again.

Rinkumc (Author) commented Nov 24, 2023

Okay

halx (Contributor) commented Nov 24, 2023

I am sorry, but I realize that I pasted the wrong SMILES string. I see now that you posted O=C1O[C@@H](C(=O)N)CN1c2cc(F)c(cc2)[C@H]3CC[S@](=O)CC3. All stereocentres in this molecule are real.

We will have a look into this case and see how to resolve this.

halx (Contributor) commented Nov 24, 2023

So, the issue is with the current prior models for Mol2Mol. They were trained on molecule pairs from ChEMBL, but pairs whose molecules did not come from the same publication were pruned. This was done under the assumption that molecules from the same publication belong to the same series and hence follow chemical intuition. Unfortunately, this leads to more limited chemistry in the models, which affects, among other things, the sulfoxide in your molecule. In the end, those priors are essentially just proof-of-concept.

At some point in the future we will release models trained on the larger PubChem dataset, without making assumptions about how pairs were or should be constructed. For the time being, you can only try to remove the stereochemistry annotation on the sulfur.
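
For anyone who wants to script that workaround, here is a minimal sketch using only the Python standard library. The helper name `strip_sulfur_stereo` is hypothetical and not part of REINVENT4. It edits the SMILES as text, so it assumes simple `@`/`@@` tags on sulfur and does not handle extended tags such as `@TH1`. Whether the resulting SMILES is then accepted by the Mol2Mol prior has not been verified here.

```python
import re

# Bracketed sulfur atom with a chirality tag, e.g. [S@], [S@@], [S@+], [S@@H].
# Groups: optional isotope, the chirality tag, and the remainder
# (hydrogen count, charge, atom map). [Se@], [Si@] etc. do not match
# because the tag must directly follow the "S".
_CHIRAL_S = re.compile(r"\[(\d*)S(@{1,2})([^\]]*)\]")


def strip_sulfur_stereo(smiles: str) -> str:
    """Remove @/@@ tags from sulfur atoms; all other atoms are left untouched."""

    def _fix(match: re.Match) -> str:
        isotope, _tag, rest = match.groups()
        if not isotope and not rest:
            # Nothing else in the bracket: write a plain "S".
            # Assumption: the sulfur's bond orders sum to a normal valence
            # (2, 4 or 6), so dropping the brackets adds no implicit hydrogen.
            return "S"
        # Keep the brackets when isotope, H count or charge are present.
        return f"[{isotope}S{rest}]"

    return _CHIRAL_S.sub(_fix, smiles)


if __name__ == "__main__":
    original = "O=C1O[C@@H](C(=O)N)CN1c2cc(F)c(cc2)[C@H]3CC[S@](=O)CC3"
    print(strip_sulfur_stereo(original))
    # prints O=C1O[C@@H](C(=O)N)CN1c2cc(F)c(cc2)[C@H]3CCS(=O)CC3
```

The carbon stereocentres are kept, so only the sulfoxide configuration is lost. A cheminformatics toolkit would be a more robust way to do this than a regular expression, but the text-based version avoids extra dependencies.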

Rinkumc (Author) commented Nov 25, 2023

Okay Thanks!!
