Invalid token while using Inception #46
Hi, many thanks for your interest in REINVENT and welcome to the community! The reason for this is that the inception memory stores SMILES which are created from the sequence (basically an integer array) generated by the Reinvent model. This is done with RDKit, which has particular rules for denoting chemistry as unambiguously as possible. This means, for example, that RDKit may add hydrogens to a hetero-atom, as in your case. The SMILES then needs to be translated back to a sequence because the RL algorithm also needs the prior NLL. For that, the SMILES needs to be tokenized, and tokenization fails if it finds an unknown token. So REINVENT does not really generate SMILES outside the known token set; it is an error in translation. I am a bit sceptical about the "%11", as this means that, as far as RDKit is concerned, the compound has 11 rings. Does that compound by any chance actually come from an input file? We are working on fixing this and will push the new code out as soon as it is ready. Cheers,
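The translation failure described above can be illustrated with a toy tokenizer. This is a hypothetical, simplified sketch; REINVENT's actual vocabulary and tokenization regex differ, but the failure mode is the same: a canonical SMILES from RDKit contains a token the model's fixed vocabulary never saw.

```python
import re

# Simplified SMILES tokenizer: bracket atoms, two-digit ring closures
# (%nn), two-letter halogens, then single characters.
TOKEN_RE = re.compile(r"\[[^\]]+\]|%\d{2}|Br|Cl|.")

# Toy vocabulary; a trained model's vocabulary is fixed at training time.
VOCAB = {"C", "c", "N", "n", "O", "o", "S", "s", "F", "Cl", "Br",
         "(", ")", "=", "#", "-", "1", "2", "3", "4", "5", "6",
         "[nH]", "[O-]", "[N+]"}

def tokenize(smiles):
    tokens = TOKEN_RE.findall(smiles)
    unknown = [t for t in tokens if t not in VOCAB]
    if unknown:
        # Mirrors the "invalid token" error: the canonicalized SMILES
        # contains tokens outside the known token set.
        raise ValueError(f"unknown tokens: {unknown}")
    return tokens

print(tokenize("c1ccccc1"))   # all tokens known: no error
# tokenize("c%11ccccc%11")    # raises ValueError: '%11' not in VOCAB
```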
Hello Hannes, Thank you for the quick response! I don't give any input SMILES to REINVENT. I searched for the %11 token in the staged_learning.csv file and found it in the following SMILES: 98.6043,96.0853,-50.0053,0.3600000,CN(C)CCOc1ccc(-c2ccc(Cc3ccc4cc(C5CCCCN5C(=O)c5cc(-c6nccs6)ccc5-c5cc(C(=O)O)c(-c6cc(-c7nc(N8CCc9cccc(C(=O)N=c%10[nH]c%11ccccc%11s%10)c9C8)ccc7-c7nc(O)cn7CCO)on6)c(C(F)(F)F)c5)ccn34)cc2)cn1,O=C(N=c1[nH]c2ccccc2s1)c1cccc2c1CN(c1ccc(-c3ncc[nH]3)c(-c3cc(-c4ccc(-c5ccc(-c6nccs6)cc5C(=O)N5CCCCC5c5ccn6c(Cc7ccc(-c8cccnc8)cc7)ccc6c5)cc4)no3)n1)CC2,0.3600000,0.3600,55 It has a good score relative to the other generated compounds, so it probably ended up being stored and then sampled from inception. I attached some files that might come in handy if you want to look further into this issue. Feel free to reach out if you need anything else! _staged_learning.json Best,
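For context, `%11` in SMILES is not an element token but a two-digit ring-closure label: ring-bond numbers above 9 are written as `%nn`, and each label must appear exactly twice. A quick pure-Python check on a fragment of the SMILES above (no RDKit required):

```python
import re
from collections import Counter

# Fragment of the reported SMILES containing the %10/%11 ring closures.
frag = "C(=O)N=c%10[nH]c%11ccccc%11s%10"

labels = Counter(re.findall(r"%\d{2}", frag))
print(labels)  # each two-digit ring-closure label appears exactly twice
```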
That's a rather unusual setup. You have a batch size of 550 plus 50 from the replay memory. It is more typical to run with a number around 100. The batch size will have an influence on the SGD, potentially a rather strong one. I also note from your log file that the total score is very low and the number of valid SMILES drops rather quickly to below 50%. Given that your molecule size also increases by quite a bit, you must have a rather unusual scoring component or, potentially, a faulty one.
Thank you for the feedback! I really appreciate it. I have been using the DRD2 binding-site oracle from TD Commons for scoring. Setting unique_sequences = false makes the score increase considerably, while the number of valid SMILES stays above 95% during the whole training and the fraction of duplicates hovers around 1%. The molecule size still increases slightly, though. It is interesting that changing only this setting made such a big difference. Are there any potential shortcomings to doing it?
Hi, sequence duplicates are entirely removed, meaning that no downstream code will be aware of this fact. This matters, however, for the diversity filter (DF) you are using. The DF explicitly takes account of duplicates. First, it looks at duplicate molecules (the SMILES have been canonicalized at this step such that the SMILES is unique to a molecule) and scores every further occurrence of the SMILES as zero. Second, the augmented "NLL" is a sum of the prior NLL and the scaled scores. The score will only be non-zero the first time a SMILES is encountered and will be zero for every further occurrence. However, the prior NLL for this SMILES will keep its value. So, it makes a difference if and how duplicates are handled. We have seen rather pathological cases like yours, although not with your setup and not in connection with unique SMILES. So, thank you for reporting this. Hope that helps,
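A minimal sketch of the duplicate handling described above (a hypothetical simplification; REINVENT's actual diversity filter does more, e.g. scaffold bucketing, and assumes the SMILES are already canonicalized upstream):

```python
def dedup_scores(canonical_smiles, scores):
    """Zero the score of every repeat occurrence of a canonical SMILES."""
    seen = set()
    filtered = []
    for smi, score in zip(canonical_smiles, scores):
        if smi in seen:
            filtered.append(0.0)    # repeat occurrence: scored as zero
        else:
            seen.add(smi)
            filtered.append(score)  # first occurrence keeps its score
    return filtered

# The prior NLL is untouched by this filter, so for a duplicate the
# augmented term is driven by the prior NLL alone.
print(dedup_scores(["c1ccccc1", "CCO", "c1ccccc1"], [0.9, 0.5, 0.9]))
# → [0.9, 0.5, 0.0]
```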
Thank you!
Hello,
Often, when I run reinforcement learning with the inception module enabled, I get the following error:
Another token for which I often get this error is [SH].
I was wondering about possible solutions to this error and why REINVENT ends up generating SMILES strings with tokens outside of the original vocabulary. Thank you in advance!