
Invalid token while using Inception #46

Closed
Horizon371 opened this issue Mar 18, 2024 · 6 comments

Comments

@Horizon371

Hello,

Oftentimes, when I run reinforcement learning with the Inception module enabled, I get the following error:

Traceback (most recent call last):
  File "/home/jovyan/cristian/REINVENT4/reinvent/Reinvent.py", line 312, in <module>
    main()
  File "/home/jovyan/cristian/REINVENT4/reinvent/Reinvent.py", line 292, in main
    runner(input_config, actual_device, tb_logdir, responder_config)
  File "//home/jovyan/cristian/REINVENT4/reinvent/runmodes/RL/run_staged_learning.py", line 377, in run_staged_learning
    terminate = optimize(package.terminator)
  File "//home/jovyan/cristian/REINVENT4/reinvent/runmodes/RL/learning.py", line 143, in optimize
    agent_lls, prior_lls, augmented_nll, loss = self.update(results)
  File "//home/jovyan/cristian/REINVENT4/reinvent/runmodes/RL/reinvent.py", line 29, in update
    return self.reward_nlls(
  File "//home/jovyan/cristian/REINVENT4/reinvent/runmodes/RL/reward.py", line 157, in __call__
    loss = inception_filter(
  File "//home/jovyan/cristian/REINVENT4/reinvent/runmodes/RL/memories/inception.py", line 219, in inception_filter
    agent_lls = -agent.likelihood_smiles(inception_smilies)
  File "//home/jovyan/cristian/REINVENT4/reinvent/models/model_factory/reinvent_adapter.py", line 17, in likelihood_smiles
    return self.model.likelihood_smiles(smiles)
  File "//home/jovyan/cristian/REINVENT4/reinvent/models/reinvent/models/model.py", line 264, in likelihood_smiles
    encoded = [self.vocabulary.encode(token) for token in tokens]
  File "//home/jovyan/cristian/REINVENT4/reinvent/models/reinvent/models/model.py", line 264, in <listcomp>
    encoded = [self.vocabulary.encode(token) for token in tokens]
  File "//home/jovyan/cristian/REINVENT4/reinvent/models/reinvent/models/vocabulary.py", line 60, in encode
    vocab_index[i] = self._tokens[token]
KeyError: '%11'

Another token for which I often get this error is [SH].

I was wondering about any possible solutions to this error and why REINVENT ends up generating smiles strings with tokens outside of the original vocabulary. Thank you in advance!

@halx
Contributor

halx commented Mar 19, 2024

Hi,

many thanks for your interest in REINVENT and welcome to the community!

The reason for this is that the inception memory stores SMILES which are created from the sequence (basically an integer array) generated by the Reinvent model. This is done with RDKit, which has particular rules for denoting chemistry as unambiguously as possible. This means, for example, that RDKit may add hydrogens to a heteroatom, as in your case. The SMILES then needs to be translated back to a sequence because the RL algorithm also needs the prior NLL. For that, the SMILES must be tokenized, which fails if an unknown token is found. So REINVENT does not really generate SMILES outside the known token set; it is an error in translation.
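As a minimal sketch (not REINVENT's actual code, and with a hypothetical token set), this is the shape of the failure: a vocabulary built from training-time tokens has no entry for a token that RDKit introduced during canonicalization, and the encode step raises KeyError instead of mapping it to an unknown-token index.

```python
# Toy illustration of the failing round-trip: sequence -> SMILES (via RDKit)
# -> tokens -> sequence. The last step fails on a token the vocabulary
# has never seen, e.g. "%11" or "[SH]".

class Vocabulary:
    def __init__(self, tokens):
        # token -> integer index, fixed at training time
        self._tokens = {tok: i for i, tok in enumerate(tokens)}

    def encode(self, tokens):
        # Mirrors the failing line in vocabulary.py: an unseen token
        # raises KeyError rather than falling back to an <UNK> index.
        return [self._tokens[tok] for tok in tokens]

vocab = Vocabulary(["C", "c", "1", "(", ")", "O", "N", "S"])
print(vocab.encode(["C", "1", "C", "1"]))   # all tokens known: encodes fine
try:
    vocab.encode(["c", "%11", "c"])         # "%11" was never in training data
except KeyError as exc:
    print("KeyError:", exc)
```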

I am a bit sceptical about the "%11", as this means that, in RDKit's view, there is a compound needing ring-closure number 11. Does that compound by any chance actually come from an input file?
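For context, a sketch of why "%11" is a single token at all: in SMILES grammar, ring-bond numbers above 9 are written as "%" followed by two digits, so a tokenizer treats "%11" as one unit. The regex below is a hypothetical tokenizer, similar in spirit to those used by SMILES-based generative models, not necessarily REINVENT's exact one.

```python
import re

# "%" plus two digits is one ring-closure token; bracket atoms like [nH]
# or [SH] are also single tokens. A vocabulary built from molecules that
# never needed ring numbers above 9 will therefore lack "%10", "%11", ...
TOKEN_RE = re.compile(r"(\[[^\]]+\]|%\d{2}|Br|Cl|.)")

def tokenize(smiles):
    return TOKEN_RE.findall(smiles)

print(tokenize("c%11ccccc%11"))
# ['c', '%11', 'c', 'c', 'c', 'c', 'c', '%11']
print(tokenize("N=c1[nH]c2ccccc2s1")[:5])
# ['N', '=', 'c', '1', '[nH]']
```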

We are working on fixing this and we will push the new code out as soon as it is ready.

Cheers,
Hannes.

@Horizon371
Author

Horizon371 commented Mar 19, 2024

Hello Hannes,

Thank you for the quick response!

I don't give any input smiles to REINVENT. I searched for the %11 token in the staged_learning.csv file and I found it in the following SMILES:

98.6043,96.0853,-50.0053,0.3600000,CN(C)CCOc1ccc(-c2ccc(Cc3ccc4cc(C5CCCCN5C(=O)c5cc(-c6nccs6)ccc5-c5cc(C(=O)O)c(-c6cc(-c7nc(N8CCc9cccc(C(=O)N=c%10[nH]c%11ccccc%11s%10)c9C8)ccc7-c7nc(O)cn7CCO)on6)c(C(F)(F)F)c5)ccn34)cc2)cn1,O=C(N=c1[nH]c2ccccc2s1)c1cccc2c1CN(c1ccc(-c3ncc[nH]3)c(-c3cc(-c4ccc(-c5ccc(-c6nccs6)cc5C(=O)N5CCCCC5c5ccn6c(Cc7ccc(-c8cccnc8)cc7)ccc6c5)cc4)no3)n1)CC2,0.3600000,0.3600,55

It has a good score relative to the other generated compounds, so it probably ended up being stored and then sampled from Inception.

I attached some files that might come in handy if you want to look further into this issue. Feel free to reach out if you need anything else!

_staged_learning.json
staged_learning_1.csv
z_default_run.log

Best,
Cristian

@halx
Contributor

halx commented Mar 19, 2024

That's a rather unusual setup. You have a batch size of 550 plus 50 from the replay memory. It is more typical to run with a number around 100. The batch size will have an influence on the SGD, potentially a rather strong one. I also note from your log file that the total score is very low and the number of valid SMILES drops rather quickly to below 50%. Given that your molecule size also increases by quite a bit, you must have a rather unusual scoring component or, potentially, a faulty one.

@Horizon371
Author

Thank you for the feedback! I really appreciate it.

I have been using the DRD2 binding site oracle from TD Commons for the scoring.

Setting unique_sequences = false makes the score increase considerably while the number of valid SMILES stays above 95% during the whole training and the fraction of duplicates hovers around 1%. The molecule size still increases slightly though.

It is interesting that changing only this setting made such a big difference. Are there any potential shortcomings to doing this?

@halx
Contributor

halx commented Mar 22, 2024

Hi,

what unique_sequences does is suppress sampled duplicate sequences. A sequence is an integer array that is translated to "tokens", which are then interpreted as SMILES. So this is a mechanism to filter out duplicate molecules. However, the same molecule may have different SMILES/sequences, i.e. different orderings of atoms and nestings of branches, especially because nowadays our priors are trained on randomized SMILES. I do not really know why we have this functionality; maybe it was an early form of deduplication that worked at the time. We only keep it for backward compatibility. It is clear that this will result in incomplete duplicate filtering.

Sequence duplicates are entirely removed, meaning that no downstream code will be aware of this fact. This is important, however, for the diversity filter (DF) you are using. The DF explicitly accounts for duplicates. First, it looks at duplicate molecules (the SMILES have been canonicalized at this step such that each SMILES is unique to a molecule) and scores every further occurrence of the SMILES as zero. Second, the augmented "NLL" is a sum of the prior NLL and the scaled scores. The score will only be non-zero the first time a SMILES is encountered and will be zero for every further occurrence. However, the prior NLL for this SMILES will keep its value. So it makes a difference if and how duplicates are handled.
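The duplicate handling described above can be sketched as follows. This is a toy model of the assumed behaviour, not REINVENT's actual DF implementation, and the sigma value and sign convention are illustrative only:

```python
# Toy diversity-filter sketch: the first occurrence of a canonical SMILES
# keeps its score; repeats are scored zero. The prior NLL contribution to
# the augmented term survives either way (sign conventions vary by code).

def apply_diversity_filter(records, sigma=128.0):
    seen = set()
    out = []
    for smiles, score, prior_nll in records:
        eff = score if smiles not in seen else 0.0  # zero out repeats
        seen.add(smiles)
        # augmented term = prior NLL + scaled (possibly zeroed) score
        out.append((smiles, eff, prior_nll + sigma * eff))
    return out

batch = [
    ("c1ccccc1O", 0.8, -20.0),
    ("c1ccccc1O", 0.8, -20.0),  # duplicate molecule: score becomes zero
    ("CCO",       0.4, -10.0),
]
for row in apply_diversity_filter(batch):
    print(row)
```

Note that the prior NLL of the duplicate row is untouched; only its score contribution vanishes, which is the asymmetry described above.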

We have seen rather pathological cases like yours, although not with your setup and not in connection with unique SMILES. So, thank you for reporting this.

Hope that helps,
Hannes.

@Horizon371
Author

Thank you!

@halx halx closed this as completed Mar 27, 2024