Error Encountered While Running Reinvent_TLRL.ipynb in REINVENT4 #88

Closed
kingljy0818 opened this issue May 27, 2024 · 13 comments

@kingljy0818

Hi,

I have correctly installed REINVENT4 and generated the Reinvent_TLRL.ipynb file in the notebook directory using the jupytext command. When running the cell in Reinvent_TLRL.ipynb:

%%time
!reinvent -l stage1.log $stage1_config_filename

the following error message appears:


Traceback (most recent call last):
File "/home/Anaconda3/envs/reinvent4/bin/reinvent", line 8, in <module>
sys.exit(main())
File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/Reinvent.py", line 302, in main
runner(input_config, actual_device, tb_logdir, responder_config)
File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/runmodes/RL/run_staged_learning.py", line 248, in run_staged_learning
adapter, _, model_type = create_adapter(prior_model_filename, "inference", device)
File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/runmodes/create_adapter.py", line 49, in create_adapter
compatibility_setup(model)
File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/runmodes/create_adapter.py", line 120, in compatibility_setup
from reinvent.models.mol2mol.models.vocabulary import Vocabulary
ImportError: cannot import name 'Vocabulary' from 'reinvent.models.mol2mol.models.vocabulary' (/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/models/mol2mol/models/vocabulary.py)

I need your help to resolve this issue. Thank you very much!

Best regards,

Jiyuan

@halx
Contributor

halx commented May 27, 2024

Most likely reinvent/models/mol2mol/models/vocabulary.py is not a symlink, as it should be, and the file contains only the path to the actual file. Either copy the actual file over it or replace its contents with

from reinvent.models.transformer.core.vocabulary import Vocabulary
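
A quick way to check which case applies is sketched below. It assumes, as described above, that a checkout without symlink support leaves a short text file whose only content is a path. The function name `diagnose_vocabulary_file` and the one-line heuristic are illustrative and not part of REINVENT.

```python
from pathlib import Path


def diagnose_vocabulary_file(path):
    """Classify a file as 'symlink', 'missing', 'path_stub' or 'python_module'.

    Heuristic (an assumption, not REINVENT code): a stub left behind by a
    checkout without symlink support is a single line that contains a
    path separator and no spaces.
    """
    p = Path(path)
    # Check for a symlink first so that dangling links are still reported.
    if p.is_symlink():
        return "symlink"
    if not p.is_file():
        return "missing"
    lines = p.read_text(encoding="utf-8", errors="replace").strip().splitlines()
    if len(lines) == 1 and "/" in lines[0] and " " not in lines[0]:
        return "path_stub"
    return "python_module"
```

Calling it on the vocabulary.py path from the traceback should return `'path_stub'` if the diagnosis above is right, and `'python_module'` once the file has been replaced with the import line.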

@kingljy0818
Author

kingljy0818 commented May 27, 2024

Thank you very much for your response. The previous errors have been resolved, but when I run %%time !reinvent -l stage1.log $stage1_config_filename, a new message appears:

Failed to find the pandas get_adjustment() function to patch
Failed to patch pandas - PandasTools will have limited functionality

However, the above error message was resolved by running conda install -c conda-forge rdkit pandas.

@halx
Contributor

halx commented May 27, 2024

That is only a warning message, caused by RDKit not being able to cope with newer Pandas versions (I believe versions 2.0 and above). Unless you use PandasTools there should be no impact.
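
To see which versions are actually installed in the environment, a small check along these lines may help. It only uses the standard library; the helper names are illustrative, and the 2.0 threshold is the tentative figure from the comment above, not a documented cutoff.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_version(version):
    """Extract the leading integer of a version string such as '2.2.1'."""
    return int(version.split(".")[0])


if __name__ == "__main__":
    # Print the versions of the two packages involved in the warning.
    for name in ("pandas", "rdkit"):
        print(name, installed_version(name))
```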

@kingljy0818
Author

kingljy0818 commented May 28, 2024

Hi,

While continuing to debug Reinvent_TLRL.ipynb, I encountered the following error when running the cell in the notebook:

%%time
!reinvent -l stage2.log $stage2_config_filename

The error message is:

Traceback (most recent call last):
File "/home/Anaconda3/envs/reinvent4/bin/reinvent", line 8, in <module>
sys.exit(main())
File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/Reinvent.py", line 302, in main
runner(input_config, actual_device, tb_logdir, responder_config)
File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/runmodes/RL/run_staged_learning.py", line 322, in run_staged_learning
packages = create_packages(reward_strategy, stages)
File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/runmodes/RL/run_staged_learning.py", line 178, in create_packages
scoring_function = Scorer(scoring_config)
File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/scoring/scorer.py", line 41, in __init__
self.components = get_components(config["component"])
File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/scoring/config.py", line 94, in get_components
component = Component(component_params)
File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent_plugins/components/comp_chemprop.py", line 85, in __init__
chemprop_args = chemprop.args.PredictArgs().parse_args(args)
File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/tap/tap.py", line 478, in parse_args
self.process_args()
File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/chemprop/args.py", line 796, in process_args
super(PredictArgs, self).process_args()
File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/chemprop/args.py", line 190, in process_args
self.checkpoint_paths = get_checkpoint_paths(
File "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/chemprop/args.py", line 58, in get_checkpoint_paths
raise ValueError(f'Failed to find any checkpoints with extension "{ext}" in directory "{checkpoint_dir}"')
ValueError: Failed to find any checkpoints with extension ".pt" in directory "/tmp/R4_notebooks_output/chemprop"
CPU times: user 60.2 ms, sys: 21.9 ms, total: 82.1 ms
Wall time: 5.48 s

I still need your guidance and help to resolve this error message. Thank you very much.

@halx
Contributor

halx commented May 28, 2024

You will need to copy the model file into that directory (see error message). You can find the download link for the file in the notebook.
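
Before rerunning the stage, it can be worth confirming that the file has landed where the error message looks for it. The following is a minimal sketch that searches a directory recursively for files with a given extension; it is modelled on the wording of the error message and is not ChemProp's own code. The name `find_checkpoints` is illustrative.

```python
import os


def find_checkpoints(checkpoint_dir, ext=".pt"):
    """Return sorted paths of all files under checkpoint_dir ending in ext.

    A non-existent directory simply yields an empty list.
    """
    found = []
    for root, _dirs, files in os.walk(checkpoint_dir):
        for name in files:
            if name.endswith(ext):
                found.append(os.path.join(root, name))
    return sorted(found)
```

For the case in this thread, `find_checkpoints("/tmp/R4_notebooks_output/chemprop")` should return a non-empty list once the model file has been copied into place.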

@kingljy0818
Author

kingljy0818 commented May 29, 2024

Hi,

Thank you very much for your guidance. I have successfully run through every cell of both Reinvent_demo.ipynb and Reinvent_TLRL.ipynb, but I still need your help with some logical issues. If I want to generate a Prior model based on Chembl33, in the stage1.toml file of Reinvent_demo.ipynb, the code is as follows:

run_type = "staged_learning"
device = "cuda:0"
tb_logdir = "tb_stage1"
json_out_config = "_stage1.json"

[parameters]

prior_file = "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/../priors/reinvent.prior"
agent_file = "/home/Anaconda3/envs/reinvent4/lib/python3.10/site-packages/reinvent/../priors/reinvent.prior"
summary_csv_prefix = "stage1"

batch_size = 100

use_checkpoint = false

[learning_strategy]

type = "dap"
sigma = 128
rate = 0.0001

[[stage]]

max_score = 1.0
max_steps = 300

chkpt_file = 'stage1.chkpt'

scoring_function.type = "custom_product"

[stage.scoring]
type = "geometric_mean"

[[stage.scoring.component]]
[stage.scoring.component.custom_alerts]

[[stage.scoring.component.custom_alerts.endpoint]]
name = "Alerts"

params.smarts = [
"[*;r8]",
"[*;r9]",
"[*;r10]",
"[*;r11]",
"[*;r12]",
"[*;r13]",
"[*;r14]",
"[*;r15]",
"[*;r16]",
"[*;r17]",
"[#8][#8]",
"[#6;+]",
"[#16][#16]",
"[#7;!n][S;!$(S(=O)=O)]",
"[#7;!n][#7;!n]",
"C#C",
"C(=[O,S])[O,S]",
"[#7;!n][C;!$(C(=[O,N])[N,O])][#16;!s]",
"[#7;!n][C;!$(C(=[O,N])[N,O])][#7;!n]",
"[#7;!n][C;!$(C(=[O,N])[N,O])][#8;!o]",
"[#8;!o][C;!$(C(=[O,N])[N,O])][#16;!s]",
"[#8;!o][C;!$(C(=[O,N])[N,O])][#8;!o]",
"[#16;!s][C;!$(C(=[O,N])[N,O])][#16;!s]"
]

[[stage.scoring.component]]
[stage.scoring.component.QED]

[[stage.scoring.component.QED.endpoint]]
name = "QED"
weight = 0.6

[[stage.scoring.component]]
[stage.scoring.component.NumAtomStereoCenters]

[[stage.scoring.component.NumAtomStereoCenters.endpoint]]
name = "Stereo"
weight = 0.4

transform.type = "left_step"
transform.low = 0

How should I modify it? In other words, I don't quite understand how prior models such as reinvent.prior in Reinvent_demo.ipynb and Reinvent_TLRL.ipynb are obtained. I look forward to your reply. Thank you very much!

@halx
Contributor

halx commented May 29, 2024

To create a new prior you would need to look into reinvent/runmodes/create_model/create_reinvent.py. This creates an "empty" model with a pre-defined vocabulary. To actually train the model you would need to carry out TL with your dataset, and I recommend creating a validation set. I would also suggest having a look at "Randomized SMILES strings improve the quality of molecular generative models" to understand how prior models can be improved with augmentation. Please note that data preparation is your responsibility, as there is currently not much in place for that.

You probably also want to consider carefully why you need a new prior, as it takes quite a bit of expertise to get this right. Chemical space coverage in ChEMBL has probably not evolved that much. But if you want to support additional chemistry (the vocabulary is fixed), for example, or want to support stereochemistry (but beware of imbalanced data), then the current priors are limited in this respect.
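
One way to judge whether the fixed vocabulary is a limitation for a given dataset is to tokenize the SMILES and look for tokens the vocabulary lacks. The sketch below uses a deliberately simplified regex tokenizer and a caller-supplied token set; both are assumptions for illustration, and the tokenizer and vocabulary actually used by the REINVENT priors may differ.

```python
import re

# Simplified SMILES tokenizer (assumption): bracket atoms, two-letter
# halogens, two-digit ring closures, then any other single character.
_TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into tokens using the simplified pattern."""
    return _TOKEN_RE.findall(smiles)


def unsupported_tokens(smiles, vocabulary):
    """Return the sorted tokens of a SMILES that are not in the vocabulary."""
    return sorted(set(tokenize(smiles)) - set(vocabulary))
```

A dataset for which `unsupported_tokens` is empty for every molecule can be used with the existing prior as far as the vocabulary is concerned; a non-empty result points to chemistry the prior cannot represent.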

@kingljy0818
Author

kingljy0818 commented May 29, 2024

I have found reinvent/runmodes/create_model/create_reinvent.py, but I still don't know how to create an empty model. In REINVENT 3.2, there was a Create_Model_Demo.ipynb notebook that could be used to create an empty model with Chembl33. Could you please guide me on how to create an empty model with Chembl33 in REINVENT 4?

I see that there are many pre-existing prior models in the Prior directory of REINVENT4, such as reinvent.prior. How are these models trained? Can these pre-existing prior models be used directly? How should each of these prior models in the Prior directory be used respectively? Is there a detailed usage guide? I would appreciate your continued guidance. Thank you!

@halx
Contributor

halx commented May 30, 2024

I suggest reading our paper Reinvent 4: Modern AI–driven generative molecule design and the papers cited therein.

create_reinvent.py reads in a TOML configuration file. An example is in the same directory.

@kingljy0818
Author

kingljy0818 commented May 30, 2024

It's quite a coincidence. Before receiving your reply, I carefully read your paper "Reinvent 4: Modern AI-driven generative molecule design" published in the Journal of Cheminformatics this afternoon. I have a basic understanding of the logic and operation mechanism of REINVENT4. However, after reading this paper, there are still a few questions that need your guidance:

  1. There are many pre-trained prior models in the Prior folder of REINVENT4. Can these models be directly applied to my own drug development scenarios?

  2. How was the model.pt in the chemprop directory of Reinvent_TLRL.ipynb trained?

  3. There are many pre-trained prior models in the Prior folder of REINVENT4. Can they handle most drug development scenarios? Is it necessary for me to train my own prior model, for example, training a prior model based on a database like Chembl34 which has 2 million compound structures?

  4. There's one part in the parameters section of the stage1.toml file in Reinvent_TLRL.ipynb that I don't quite understand. Why are both prior_file and agent_file using reinvent.prior? Why isn't agent_file using an agent model?

  5. In Reinvent_TLRL.ipynb, the input_model_file in the transfer_learning file uses the checkpoint file stage1.chkpt generated by stage1.toml. I want to ask if stage1.chkpt is equivalent to a prior model?

I look forward to your answers to these five questions. Thank you very much!

@halx
Contributor

halx commented Jun 5, 2024

  2. The model in the notebook was trained with the software ChemProp.

  3. You only need to train a new prior if you have specific needs in terms of supported chemistry.

  4. The prior_file serves as a reference for regularization in the loss function. The agent needs to start from somewhere, and so the starting point is the prior. The prior does not change during RL but the agent does.

  5. A checkpoint file is simply the current state of the agent network model. It still has the same network hyperparameters but different, fine-tuned weights, biases, etc.
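
The role of the prior as a regularization reference can be illustrated with the loss form described in the REINVENT literature: the prior log-likelihood is shifted by the scaled score, and the agent is trained towards that target. The sketch below works on plain floats for a single molecule; it is not the REINVENT4 implementation, the function name `dap_loss` is made up for illustration, and the default `sigma` is taken from the stage1.toml quoted earlier in this thread.

```python
def dap_loss(prior_loglik, agent_loglik, score, sigma=128.0):
    """Squared difference between augmented and agent log-likelihoods.

    prior_loglik: log-likelihood of the molecule under the fixed prior
    agent_loglik: log-likelihood of the same molecule under the agent
    score:        total score in [0, 1] from the scoring function
    sigma:        scaling factor for the score
    """
    augmented = prior_loglik + sigma * score
    return (augmented - agent_loglik) ** 2
```

At the first RL step the agent is a copy of the prior, so both log-likelihoods are equal and the loss is driven by `sigma * score` alone. As the agent's weights change, the prior term keeps pulling it back towards the chemistry the prior has learned.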

As for the other questions, you will need to get that basic knowledge from the literature, e.g. our paper. These things are not suitable for discussion in this forum.

@kingljy0818
Author

In the Reinvent_TLRL.ipynb notebook, the model.pt is annotated with: "This is a model that has been trained on free energy simulation data computed for the TNKS2 target." I browsed through the ChemProp GitHub site and it seems that ChemProp does not have the capability to compute binding free energy. I'm having trouble understanding this annotation, and would appreciate further clarification.

@halx
Contributor

halx commented Jun 6, 2024

ChemProp is software that allows the user to create deep learning models. The data to train on comes from the user. The model provided is just an example.

@halx halx closed this as completed Jun 13, 2024