
[Project]: Train REINVENT Mol2MolSimilarity model to predict molecules similar in 3d shape #1128

Closed
ankitskvmdam opened this issue May 21, 2024 · 17 comments

@ankitskvmdam

Summary

Currently, the REINVENT 4 Mol2MolSimilarity model generates new molecules with a similar 2D structure but not necessarily a similar 3D structure. Our goal is to train the REINVENT 4 Mol2MolSimilarity model to produce molecules with a 3D structure similar to the input molecule.

Approach 1

To achieve this, we will train the Mol2MolSimilarity model to generate new molecules with a similar 3D shape. The training process involves the following steps:

  1. Input the molecule into the Mol2MolSimilarity model.
  2. Pass the generated molecule to smiles-to-3d, which will generate 3D conformers from the SMILES notation.
  3. smiles-to-3d will produce an SDF file, which we will then pass to vsflow to obtain the similarity score.
  4. Use this similarity score as feedback for the Mol2MolSimilarity model to improve its performance.
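The loop above can be sketched as follows. All three helper functions are hypothetical stand-ins: in practice they would call the Mol2MolSimilarity model, smiles-to-3d, and vsflow respectively.

```python
# Sketch of the Approach 1 feedback loop. The three helpers below are
# hypothetical placeholders, not real APIs of the named tools.

def generate_similar(smiles: str) -> list[str]:
    # Placeholder for the Mol2MolSimilarity generator.
    return [smiles]

def smiles_to_sdf(smiles: str) -> str:
    # Placeholder for smiles-to-3d: would return the path to an SDF
    # file containing 3D conformers generated from the SMILES.
    return f"/tmp/{abs(hash(smiles))}.sdf"

def shape_similarity(query_sdf: str, ref_sdf: str) -> float:
    # Placeholder for a vsflow 3D shape-similarity score in [0, 1].
    return 0.5

def score_generated(input_smiles: str) -> list[tuple[str, float]]:
    """Score each generated molecule by 3D shape similarity to the input."""
    ref_sdf = smiles_to_sdf(input_smiles)
    scored = []
    for candidate in generate_similar(input_smiles):
        score = shape_similarity(smiles_to_sdf(candidate), ref_sdf)
        scored.append((candidate, score))  # feedback signal for training
    return scored

scored = score_generated("CCO")
```

The scores collected in `scored` would then be fed back to the model as the training signal in step 4.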

Approach 2

While most steps are similar to Approach 1, this approach explores alternative tools for generating 3D conformers and calculating 3D shape similarity scores. One such tool we can investigate is Cheese.

Scope

Initiative 🐋

Objective(s)

To develop a model that can efficiently produce new molecules with a 3D structure similar to that of the input molecule.

Team

| Role & Responsibility | Username(s) |
| --- | --- |
| DRI / Lead Developer | @ankitskvmdam |
| Supervisor | @miquelduranfrigola |

Timeline

TBD

Documentation

@miquelduranfrigola
Member

Hello @ankitskvmdam,

After some thought, I suggest the following:

  1. Let's use REINVENT in transfer learning mode, as you suggested. As a starting model, we can use the Mol2MolSimilarity.
  2. Let's use 3D shape search via CHEESE.

The process will be the following:

  1. A molecule A is passed as input.
  2. We do a 3D shape similarity search via the CHEESE API. We can use the Enamine REAL database as the reference library. From that search, we get the top N compounds (here, N = 100).
  3. Of these 100 compounds, we use 80% as a training set for the transfer learning, 10% as a validation set, and 10% as a held-out test set.
  4. We train (fine-tune) REINVENT with transfer learning using the training set, controlling its performance with the validation set.
  5. With the test set, we make sure that, indeed, the test compounds are similar in 3D shape to molecule A. This 3D shape comparison can be done with VSFlow, if that is easier.
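Step 3 above can be sketched as a simple seeded 80/10/10 split:

```python
import random

def split_80_10_10(compounds, seed=42):
    """Split the top-N CHEESE hits into train/validation/test sets (80/10/10)."""
    shuffled = list(compounds)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n = len(shuffled)
    n_train = int(0.8 * n)
    n_valid = int(0.1 * n)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

# With the top 100 compounds this gives 80 / 10 / 10.
train, valid, test = split_80_10_10([f"mol_{i}" for i in range(100)])
```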

I hope this makes sense?

Then, I have more ideas to complicate things further (for example, to search against other databases such as ZINC in CHEESE, to do multiple similarity searches, to penalize molecules that are similar in 2D (favouring scaffold hopping), etc.). But let's go step by step.

Please let me know if something is not clear, @ankitskvmdam !

@ankitskvmdam
Author

@miquelduranfrigola It makes sense. I will proceed with this.

@ankitskvmdam
Author

Hi @miquelduranfrigola!

Following up on our last meeting:

Cheese API updates:

  • Need to adjust cheese settings to download the similar molecules.
  • We'll be lowering the n_neighbors parameter from 500 to 100. This is because the Morgan Tanimoto score seems to decrease beyond 100 neighbors.
  • Switch search_quality to very_accurate to enhance the Morgan Tanimoto score of neighboring molecules.
  • Do four queries per input molecule: morgan, espsim_electrostatic, espsim_shape, and consensus. By querying 100 molecules for each search type, we'll obtain a total of ~400 unique molecules.
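Merging the four searches into one deduplicated pool could look like the sketch below. The result shape (a list of dicts with a `smiles` key per search type) is an assumption for illustration, not the actual CHEESE API response format.

```python
# Hypothetical result shape: one list of hits per search type, each hit a
# dict with at least a "smiles" key. Merging the four query types and
# deduplicating by SMILES yields the pool of unique neighbours.

def merge_unique(results_per_search: dict[str, list[dict]]) -> list[dict]:
    seen = set()
    unique = []
    for search_type, hits in results_per_search.items():
        for hit in hits:
            if hit["smiles"] not in seen:
                seen.add(hit["smiles"])
                unique.append(hit)
    return unique

# Toy example with overlapping hits across search types:
example = {
    "morgan": [{"smiles": "CCO"}, {"smiles": "CCN"}],
    "espsim_shape": [{"smiles": "CCO"}, {"smiles": "CCC"}],
}
unique = merge_unique(example)
```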

Validation of newly generated molecule

To validate the newly trained model, we'll develop a classifier. This classifier will be trained using two datasets:

  • The 400 molecules we already downloaded using Cheese API.
  • A set of 1000 "negative" molecules obtained from a reference library (these represent undesirable molecules).

The classifier's purpose will be to determine how many newly generated molecules belong to the class of the "400 molecules".
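A minimal sketch of such a classifier, using random binary vectors as stand-ins for molecular features (in practice these would be fingerprints computed for the 400 CHEESE hits and the negative reference molecules) and scikit-learn in place of the autoML setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in features: random 64-bit binary vectors instead of
# real molecular fingerprints. Label 1 = the 400 CHEESE hits, label 0 =
# the negative reference molecules.
X_pos = rng.integers(0, 2, size=(400, 64))
X_neg = rng.integers(0, 2, size=(1000, 64))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 400 + [0] * 1000)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# For newly generated molecules, count how many the classifier assigns
# to the "400 molecules" (positive) class.
X_new = rng.integers(0, 2, size=(50, 64))
n_positive = int((clf.predict(X_new) == 1).sum())
```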

@miquelduranfrigola
Member

Thanks @ankitskvmdam this is a perfect summary.
Can you maybe also link explicitly the notebook and script that you shared with me on Slack? I think @GemmaTuron will appreciate this work :)

@ankitskvmdam
Author

Follow up

  1. When we download similar molecules using the cheese API, the names of the keys in the response are not appropriate. We are planning to rename the following keys:
    1. zinc_id to identifier.
    2. Morgan Tanimoto to similarity.
  2. We will update the settings we are passing to autoML. For the metric, we are currently using accuracy; we need to update it to roc_auc.
  3. @miquelduranfrigola or @GemmaTuron will provide a reference library to download ~4000 negative smiles, which we will use for training our classifier model.
  4. Instead of using automl.predict, we will use automl.predict_proba, which will provide insights into the classifier model's confidence levels.
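The value of `predict_proba` over `predict` in point 4 is that it allows ranking and filtering by confidence rather than a hard 0/1 call. A tiny sketch, with made-up probabilities and molecule names:

```python
# predict_proba returns one probability per class; the positive-class
# column can be used to rank generated molecules by confidence.
# Names and probabilities below are made up for illustration.
probs = [0.92, 0.31, 0.55, 0.08, 0.77]
names = ["m1", "m2", "m3", "m4", "m5"]

ranked = sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)
confident = [name for name, p in ranked if p >= 0.5]  # keep >= 50% confidence
```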

@miquelduranfrigola
Member

Hi @ankitskvmdam this is fantastic. As a reference library, you can use the following one: https://github.com/ersilia-os/groverfeat/blob/main/data/reference_library.csv

@ankitskvmdam
Author

Hi @miquelduranfrigola

I have updated the notebook by adding the above changes.

I have a question regarding the reference library, which contains 1,000,000 SMILES. How should we select SMILES from this library? If we select randomly, there is a possibility of selecting SMILES with similar 3D structures. In the notebook, I used RMSD to filter out 3D-similar molecules.

Also, instead of training a classifier model, can we just use RMSD?

@miquelduranfrigola
Member

Thanks @ankitskvmdam

Let's discuss now in our meeting.

There is a possibility of selecting molecules with similar 3D structures, but I think this possibility is residual and we should not worry too much about it, especially since we are "just" evaluating the model and the worst that can happen is that we underestimate performance.

As for the classifier vs RMSD - the classifier should be faster since no conformers need to be generated.

@ankitskvmdam
Author

Hi @miquelduranfrigola

I made the following changes to the notebook:

  1. Removed the dependency on RMSD.
  2. Asked the new model to generate around 10,000 molecules (on average, it generates about 1,100 new molecules; we could probably ask for 20 or 30 thousand).
  3. Randomly selected 500 molecules from the reference library (previously, when selecting 4,000, the classifier filtered out more than 60% of the newly generated molecules by the end of the prediction).
  4. Selected the top 1,000 results (usually around 500-600).

P.S. I tried generating 30,000 new molecules; it took 3 minutes, and in the end I was able to get 1,000 molecules.

@miquelduranfrigola
Member

Thanks @ankitskvmdam this looks very good and the notebook is great.
About point 3, I think we want to randomly sample more than 500 molecules - I would do 4,000 (i.e. a 1:10 ratio) or so. What do you mean by filtered out more than 60%? If I understand correctly, we are only doing ranking, so you can still keep the top 1,000 results, right?

@ankitskvmdam
Author

@miquelduranfrigola, what I am doing is only taking the SMILES that have more than a 60% probability of belonging to the positive class.

Questions:

  1. Do we want to keep all the SMILES even if the probability score is less than 60% or 50%?
  2. In the case where the new model generates fewer than 1,000 molecules, do we still need to train a classifier?

By "filtered out more than 60%", I mean:

If we train the model with a 1:10 ratio and then run autoML.predict_proba, more than 60% of the generated SMILES get a probability score below 50%.
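A tiny illustration of this effect, with made-up probabilities:

```python
# Illustration of the observation above: with a 1:10 class ratio, a large
# fraction of generated molecules can end up with a positive-class
# probability below 0.5. Probabilities here are invented for illustration.
probas = [0.8, 0.45, 0.3, 0.2, 0.55, 0.4, 0.35, 0.48, 0.9, 0.1]
frac_below_half = sum(p < 0.5 for p in probas) / len(probas)
```

Here 7 of the 10 generated molecules fall below the 0.5 threshold, even though they may still be the best-ranked candidates available.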

@miquelduranfrigola
Member

Thanks for the clarification @ankitskvmdam, I understand. Those are very valid points.
When we have an imbalanced dataset, classification scores tend to get lower. A possible way of solving this is by changing your FLAML metric. You are using ROC AUC as your FLAML metric, which in principle is the right thing to do but does not care about the magnitude of the score. Perhaps using something like the F1-score would increase our chances of having a better-calibrated score. I would certainly try this.
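A quick demonstration of why the metric choice matters on imbalanced data, using scikit-learn's metric functions (rather than FLAML itself) on a degenerate classifier:

```python
from sklearn.metrics import accuracy_score, f1_score

# On a 10:90 imbalanced set, a classifier that always predicts the
# majority class looks good on accuracy but scores zero on F1.
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100  # degenerate "always negative" classifier

acc = accuracy_score(y_true, y_pred)            # 0.9
f1 = f1_score(y_true, y_pred, zero_division=0)  # 0.0
```

F1 penalizes the model for never identifying a positive, which is exactly the failure mode accuracy hides here.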

In any case, I don't think we need to worry too much about it. It is still fine to keep molecules with "low" scores. I definitely think we can keep the top 1000 even if they do not have high scores.

You mentioned that you will ask for 20k or 30k molecules. Based on your experience, do you think we will ever get a case where we will only have less than 1000 molecules? If this is the case, I agree, no classifier is needed.

@ankitskvmdam
Author

@miquelduranfrigola

I have updated the notebook:

  1. Used the f1 metric when training the classifier.
  2. Usually when we ask for 10,000 new molecules we get almost 1,100 new ones, but I think it would be good to ask for 15,000 molecules.
  3. At the end, only returning the top 1,000 molecules.

@miquelduranfrigola
Member

Hi @ankitskvmdam thanks for the update. Let's do 15k molecules, this sounds like a good compromise.

@GemmaTuron
Member

Hi @miquelduranfrigola and @ankitskvmdam

What is the status of this? Are we aiming to complete it, or have we put it on hold?

@miquelduranfrigola
Member

This is essentially finished; Mol2MolSimilarity works great. I'll close the issue. @ankitskvmdam, feel free to re-open if there are things you want to discuss further.
