
[Project]: Train REINVENT Mol2MolSimilarity model to predict molecules similar in 3d shape #1128

Closed
ankitskvmdam opened this issue May 21, 2024 · 17 comments

@ankitskvmdam

Summary

Currently, the REINVENT 4 Mol2MolSimilarity model generates new molecules with a similar 2D structure but not necessarily a similar 3D structure. Our goal is to train the REINVENT 4 Mol2MolSimilarity model to produce molecules with a 3D structure similar to the input molecule.

Approach 1

To achieve this, we will train the Mol2MolSimilarity model to generate new molecules with a similar 3D shape. The training process involves the following steps:

  1. Input the molecule into the Mol2MolSimilarity model.
  2. Pass the generated molecule to smiles-to-3d, which will generate 3D conformers from the SMILES notation.
  3. smiles-to-3d will produce an SDF file, which we will then pass to vsflow to obtain the similarity score.
  4. Use this similarity score as feedback for the Mol2MolSimilarity model to improve its performance.
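The loop above can be sketched as follows. All three helper functions are hypothetical stand-ins: in practice they would call the Mol2MolSimilarity model, smiles-to-3d, and vsflow respectively.

```python
# Sketch of the Approach 1 feedback loop. The three helpers below are
# hypothetical placeholders, not real APIs of the named tools.

def generate_similar(smiles: str) -> list[str]:
    # Placeholder for the Mol2MolSimilarity generator.
    return [smiles]

def smiles_to_sdf(smiles: str) -> str:
    # Placeholder for smiles-to-3d: would return the path to an SDF
    # file containing 3D conformers generated from the SMILES.
    return f"/tmp/{abs(hash(smiles))}.sdf"

def shape_similarity(query_sdf: str, ref_sdf: str) -> float:
    # Placeholder for a vsflow 3D shape-similarity score in [0, 1].
    return 0.5

def score_generated(input_smiles: str) -> list[tuple[str, float]]:
    """Score each generated molecule by 3D shape similarity to the input."""
    ref_sdf = smiles_to_sdf(input_smiles)
    scored = []
    for candidate in generate_similar(input_smiles):
        score = shape_similarity(smiles_to_sdf(candidate), ref_sdf)
        scored.append((candidate, score))  # feedback signal for training
    return scored

scored = score_generated("CCO")
```

The scores collected in `scored` would then be fed back to the model as the training signal in step 4.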

Approach 2

While most steps are similar to Approach 1, this approach explores alternative tools for generating 3D conformers and calculating 3D shape similarity scores. One such tool we can investigate is Cheese.

Scope

Initiative 🐋

Objective(s)

To develop a model that can efficiently produce new molecules with a 3D structure similar to that of the input molecule.

Team

| Role & Responsibility | Username(s) |
| --- | --- |
| DRI / Lead Developer | @ankitskvmdam |
| Supervisor | @miquelduranfrigola |

Timeline

TBD

Documentation

@miquelduranfrigola
Member

Hello @ankitskvmdam,

After some thought, I suggest the following:

  1. Let's use REINVENT in transfer learning mode, as you suggested. As a starting model, we can use the Mol2MolSimilarity.
  2. Let's use 3D shape search via CHEESE.

The process will be the following:

  1. A molecule A is passed as input.
  2. We do a 3D shape similarity search via the CHEESE API. We can use the Enamine REAL database as the reference library. From that search, we get the top N compounds (here, N = 100).
  3. Of these 100 compounds, we use 80% as a training set for the transfer learning, 10% as a validation set, and 10% as a held-out test set.
  4. We train (fine-tune) REINVENT with transfer learning using the training set, controlling its performance with the validation set.
  5. With the test set, we make sure that, indeed, the test compounds are similar in 3D shape to molecule A. This 3D shape comparison can be done with VSFlow, if that is easier.
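Step 3 above can be sketched as a simple seeded 80/10/10 split:

```python
import random

def split_80_10_10(compounds, seed=42):
    """Split the top-N CHEESE hits into train/validation/test sets (80/10/10)."""
    shuffled = list(compounds)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n = len(shuffled)
    n_train = int(0.8 * n)
    n_valid = int(0.1 * n)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

# With the top 100 compounds this gives 80 / 10 / 10.
train, valid, test = split_80_10_10([f"mol_{i}" for i in range(100)])
```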

I hope this makes sense?

Then, I have more ideas to complicate things further (for example, to search against other databases such as ZINC in CHEESE, to do multiple similarity searches, to penalize molecules that are similar in 2D (favouring scaffold hopping), etc.). But let's go step by step.

Please let me know if something is not clear, @ankitskvmdam !

@ankitskvmdam
Author

@miquelduranfrigola It makes sense. I will proceed with this.

@ankitskvmdam
Author

Hi @miquelduranfrigola!

Following up on our last meeting:

Cheese API updates:

  • Need to adjust cheese settings to download the similar molecules.
  • We'll be lowering the n_neighbors parameter from 500 to 100. This is because the Morgan Tanimoto score seems to decrease beyond 100 neighbors.
  • Switch search_quality to very_accurate to enhance the Morgan Tanimoto score of neighboring molecules.
  • Do four queries per input molecule: morgan, espsim_electrostatic, espsim_shape, and consensus. By querying 100 molecules for each search type, we'll obtain a total of ~400 unique molecules.
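Merging the four searches into one deduplicated pool could look like the sketch below. The result shape (a list of dicts with a `smiles` key per search type) is an assumption for illustration, not the actual CHEESE API response format.

```python
# Hypothetical result shape: one list of hits per search type, each hit a
# dict with at least a "smiles" key. Merging the four query types and
# deduplicating by SMILES yields the pool of unique neighbours.

def merge_unique(results_per_search: dict[str, list[dict]]) -> list[dict]:
    seen = set()
    unique = []
    for search_type, hits in results_per_search.items():
        for hit in hits:
            if hit["smiles"] not in seen:
                seen.add(hit["smiles"])
                unique.append(hit)
    return unique

# Toy example with overlapping hits across search types:
example = {
    "morgan": [{"smiles": "CCO"}, {"smiles": "CCN"}],
    "espsim_shape": [{"smiles": "CCO"}, {"smiles": "CCC"}],
}
unique = merge_unique(example)
```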

Validation of newly generated molecule

To validate the newly trained model, we'll develop a classifier. This classifier will be trained using two datasets:

  • The 400 molecules we already downloaded using Cheese API.
  • A set of 1000 "negative" molecules obtained from a reference library (these represent undesirable molecules).

The classifier's purpose will be to determine how many newly generated molecules belong to the class of the "400 molecules".
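A minimal sketch of such a classifier, using random binary vectors as stand-ins for molecular features (in practice these would be fingerprints computed for the 400 CHEESE hits and the negative reference molecules) and scikit-learn in place of the autoML setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in features: random 64-bit binary vectors instead of
# real molecular fingerprints. Label 1 = the 400 CHEESE hits, label 0 =
# the negative reference molecules.
X_pos = rng.integers(0, 2, size=(400, 64))
X_neg = rng.integers(0, 2, size=(1000, 64))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 400 + [0] * 1000)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# For newly generated molecules, count how many the classifier assigns
# to the "400 molecules" (positive) class.
X_new = rng.integers(0, 2, size=(50, 64))
n_positive = int((clf.predict(X_new) == 1).sum())
```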

@miquelduranfrigola
Member

Thanks @ankitskvmdam this is a perfect summary.
Can you maybe also link explicitly the notebook and script that you shared with me on Slack? I think @GemmaTuron will appreciate this work :)

@ankitskvmdam
Author

Follow up

  1. When we download similar molecules using the cheese API, the names of the keys in the response are not appropriate. We are planning to rename the following keys:
    1. zinc_id to identifier.
    2. Morgan Tanimoto to similarity.
  2. We will update the settings we are passing to autoML. For the metric, we are currently using accuracy; we need to update it to roc_auc.
  3. @miquelduranfrigola or @GemmaTuron will provide a reference library to download ~4000 negative smiles, which we will use for training our classifier model.
  4. Instead of using automl.predict, we will use automl.predict_proba, which will provide insights into the classifier model's confidence levels.
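The value of `predict_proba` over `predict` in point 4 is that it allows ranking and filtering by confidence rather than a hard 0/1 call. A tiny sketch, with made-up probabilities and molecule names:

```python
# predict_proba returns one probability per class; the positive-class
# column can be used to rank generated molecules by confidence.
# Names and probabilities below are made up for illustration.
probs = [0.92, 0.31, 0.55, 0.08, 0.77]
names = ["m1", "m2", "m3", "m4", "m5"]

ranked = sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)
confident = [name for name, p in ranked if p >= 0.5]  # keep >= 50% confidence
```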

@miquelduranfrigola
Member

Hi @ankitskvmdam this is fantastic. As a reference library, you can use the following one: https://github.com/ersilia-os/groverfeat/blob/main/data/reference_library.csv

@ankitskvmdam
Author

Hi @miquelduranfrigola

I have updated the notebook by adding the above changes.

I have a question regarding the reference library, which contains 1,000,000 SMILES. How should we select SMILES from this library? If we select randomly, there is a possibility of selecting SMILES with similar 3D structures. In the notebook, I used RMSD to filter out 3D-similar molecules.

Also, instead of training a classifier model, can we just use RMSD?

@miquelduranfrigola
Member

Thanks @ankitskvmdam

Let's discuss now in our meeting.

There is a possibility of selecting molecules with similar 3D structures, but I think this possibility is residual and we should not worry too much about it, especially since we are "just" evaluating the model and the worst that can happen is that we underestimate performance.

As for the classifier vs RMSD - the classifier should be faster since no conformers need to be generated.

@ankitskvmdam
Author

Hi @miquelduranfrigola

I made the following changes to the notebook:

  1. Removed the dependency on RMSD.
  2. Asked the new model to generate around 10,000 molecules (on average, it generates about 1,100 new molecules; we could probably ask for 20 or 30 thousand).
  3. Randomly selected 500 molecules from the reference library (previously, when selecting 4,000, the classifier filtered out more than 60% of the newly generated molecules by the end of the prediction).
  4. Selected the top 1,000 results (usually around 500-600).

P.S. I tried generating 30,000 new molecules; it took 3 minutes, and in the end I was able to get 1,000 molecules.

@miquelduranfrigola
Member

Thanks @ankitskvmdam this looks very good and the notebook is great.
About point 3, I think we want to randomly sample more than 500 molecules - I would do 4,000 (i.e. a 1:10 ratio) or so. What do you mean by filtered out more than 60%? If I understand correctly, we are only doing ranking, so you can still keep the top 1,000 results, right?

@ankitskvmdam
Author

@miquelduranfrigola, what I am doing is only taking the SMILES that have more than a 60% probability of belonging to the positive class.

Questions:

  1. Do we want to keep all the SMILES even if the probability score is less than 60% or 50%?
  2. In the case where the new model generates fewer than 1,000 molecules, do we still need to train a classifier?

By "filtered out more than 60%", I mean:

If we train the model with a 1:10 ratio and then run autoML.predict_proba, more than 60% of the generated SMILES get a probability score below 50%.
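A tiny illustration of this effect, with made-up probabilities:

```python
# Illustration of the observation above: with a 1:10 class ratio, a large
# fraction of generated molecules can end up with a positive-class
# probability below 0.5. Probabilities here are invented for illustration.
probas = [0.8, 0.45, 0.3, 0.2, 0.55, 0.4, 0.35, 0.48, 0.9, 0.1]
frac_below_half = sum(p < 0.5 for p in probas) / len(probas)
```

Here 7 of the 10 generated molecules fall below the 0.5 threshold, even though they may still be the best-ranked candidates available.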

@miquelduranfrigola
Member

Thanks for the clarification @ankitskvmdam, I understand. Those are very valid points.
When we have an imbalanced dataset, classification scores tend to get lower. A possible way of solving this is by changing your FLAML metric. You are using ROC AUC as your FLAML metric, which in principle is the right thing to do but does not care about the magnitude of the score. Perhaps using something like the F1-score would increase our chances of having a better-calibrated score. I would certainly try this.
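A quick demonstration of why the metric choice matters on imbalanced data, using scikit-learn's metric functions (rather than FLAML itself) on a degenerate classifier:

```python
from sklearn.metrics import accuracy_score, f1_score

# On a 10:90 imbalanced set, a classifier that always predicts the
# majority class looks good on accuracy but scores zero on F1.
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100  # degenerate "always negative" classifier

acc = accuracy_score(y_true, y_pred)            # 0.9
f1 = f1_score(y_true, y_pred, zero_division=0)  # 0.0
```

F1 penalizes the model for never identifying a positive, which is exactly the failure mode accuracy hides here.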

In any case, I don't think we need to worry too much about it. It is still fine to keep molecules with "low" scores. I definitely think we can keep the top 1000 even if they do not have high scores.

You mentioned that you will ask for 20k or 30k molecules. Based on your experience, do you think we will ever get a case where we will only have less than 1000 molecules? If this is the case, I agree, no classifier is needed.

@ankitskvmdam
Author

@miquelduranfrigola

I have updated the notebook:

  1. Used the f1 metric when training the classifier.
  2. Usually when we ask for 10,000 new molecules we get almost 1,100 new ones, but I think it would be good to ask for 15,000 molecules.
  3. At the end, only returning the top 1,000 molecules.

@miquelduranfrigola
Member

Hi @ankitskvmdam thanks for the update. Let's do 15k molecules, this sounds like a good compromise.

@GemmaTuron
Member

Hi @miquelduranfrigola and @ankitskvmdam

What is the status of this? Are we aiming to complete it, or have we put it on hold?

@miquelduranfrigola
Member

This is essentially finished; Mol2MolSimilarity works great. I'll close the issue. @ankitskvmdam, feel free to re-open if there are things you want to discuss further.
