
T018 Pipeline #125

Closed
wants to merge 61 commits into from

Conversation

corey-taylor
Collaborator

@corey-taylor corey-taylor commented Aug 25, 2021

Note: This PR has been superseded by #179

Details

Tech

Content review

  • Potential labels or categories (e.g. machine learning, small molecules, online APIs): small molecules, online APIs, CADD pipeline
  • One line summary: Automated structure-based virtual screening pipeline
  • Theory: Check if text is copy-pasted from other notebooks; if so indicate with quote block and reference respective talktorials for further reading
  • Practical: Add class schema (architecture)
  • Practical: Move big classes to utils.py script to make reading the notebook a bit more fluent
  • The table of contents reflects the talktorial story-line; order of #, ##, ### headers is correct
  • URLs are linked with meaningful words, instead of pasting the URL directly or linking words like here.
  • I have spell-checked the notebook
  • Images have enough resolution to be rendered with quality, without being too heavy.
  • All figures have a description
  • Markdown cell content is still in-line with code cell output (whenever results are discussed)
  • I have checked that cell outputs are not incredibly long (this applies also to DataFrames)
  • Formatting looks correct on the Sphinx render (bold, italics, figure placement)
  • Working title: "Automated pipeline for lead optimization" (update in manuscript if title differs)

Code review

  • Time it took to execute (approx.): code execution only = 30 minutes
  • Variable and function names follow snake case rules (e.g. a_variable_name vs aVariableName)
  • Spacing follows PEP8 (run Black on the code cells if needed)
  • Code lines are under 99 characters each (run `black -l 99`)
  • Comments are useful and well placed
  • There are no unpythonic idioms like `for i in range(len(list))` (see slides)
  • All 3rd party dependencies are listed at the top of the notebook
  • I have marked all code cells whose output is referenced in Markdown cells with the label `# NBVAL_CHECK_OUTPUT`
  • I have identified potential candidates for a code refactor / useful functions
  • All `import ...` lines are in the top cell (of the practical part), ordered by standard library / 3rd party packages / our own (`teachopencadd.*`)
  • I have updated the relative paths to absolute paths.
  • List here unfamiliar libraries you find in the imports and their intended use:
  • opencadd is not an in-house package any more; it is a third party package
  • Add reference to Hopkins paper mentioned in docstrings (check for other paper mentions without reference)
  • Run flake8 on the Python file and flake8-nb on the Jupyter notebook
  • Freeze PubChem CID dataset for CI; add comment on how users can unfreeze the dataset
  • Runs without problems on Jupyter Lab
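As a side note on the unpythonic-idiom checklist item above, a minimal sketch (with made-up example data) of the `for i in range(len(list))` pattern versus the idiomatic `enumerate`:

```python
# Made-up example data for illustration only
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]

# Unpythonic: manual index bookkeeping via range(len(...))
labels_bad = []
for i in range(len(smiles_list)):
    labels_bad.append(f"ligand_{i}: {smiles_list[i]}")

# Pythonic: enumerate yields the index and the item together
labels_good = [f"ligand_{i}: {smi}" for i, smi in enumerate(smiles_list)]

assert labels_bad == labels_good
```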

@dominiquesydow dominiquesydow mentioned this pull request Sep 8, 2021
27 tasks
@dominiquesydow dominiquesydow added the new talktorial New talktorial label Sep 8, 2021
Base automatically changed from t011-base to master September 9, 2021 10:02
@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

@dominiquesydow
Collaborator

Hi @corey-taylor,

Thanks a lot for your edits on T018! 🚀

I started with streamlining the text format with the other notebooks, e.g.

  • No title case in section headers
  • Splitting cells per paragraph (makes reviews via ReviewNB easier)
  • Using Markdown instead of HTML in text cells
  • Resized all image files to the size they are supposed to have in the talktorial (we avoid using HTML to resize images in the talktorial)

Before I continue with the content review, could you please go through the notebook and

  • Remove HTML text formatting
    • I fixed most of it in the Theory part; however, I did not remove the <b> and <i> tags outside of references yet - maybe we can split our forces here
    • Let's keep the HTML `<a id='similarity_theory'></a>` tags used to bookmark sections.
  • Add a note to cells that need execution times longer than a couple of seconds (we usually add a note like "Note: Running the next cell may take a couple of minutes/xxx minutes/up to a minute" in a Markdown cell above the code cell)

Do ping me if in doubt regarding the text format.
For quick checks, I usually have the template open in Jupyter Lab to remind myself of our formatting conventions: https://github.com/volkamerlab/teachopencadd/blob/master/teachopencadd/talktorials/T000_template/talktorial.ipynb

@corey-taylor
Collaborator Author

Yup, I wasn't sure whether to use HTML or Markdown, as the original had a mix of both, hence I made it consistent at least. Cheers for clarifying and for providing the template.

@corey-taylor
Collaborator Author

corey-taylor commented Oct 6, 2021

@dominiquesydow

All the HTML is now removed, other than the hyperlink tags you mentioned. On execution times, there were already notes with each cell where it takes a while to run but without specifically saying something like Note: this will take x minutes to complete. Is this okay with you?

@AndreaVolkamer
Member

AndreaVolkamer commented Oct 12, 2021

@corey-taylor I went through the theory part; it reads well, I just tried to shorten it further.

I mainly included a <details> part in most subsections, which is hidden unless expanded by the user. Check if you like it.
TBD whether the respective figures should also go in there or stay in the main notebook.

Small things that need to be checked:

  • Fig. author/source missing in most figure captions.
  • Can you add a ref for numbers (time/cost) that are mentioned for the DD pipeline?
  • Ideally, we would have a good reference directly linked in each main section
  • Note the href anchor tags disturb the header levels (at least in Jupyter Lab -> could be a problem on the website as well), can we change the order so that the Markdown header (## ...) comes first? Or other ideas?
    `<a id='drug_design_theory'></a>`
    `## Drug design pipeline`
  • Check links to other talktorials, they need to be updated since they are merged to master already.

@AAriam
Collaborator

AAriam commented Oct 13, 2021

@AndreaVolkamer, @corey-taylor

Figure 2 is copied from an old lecture slide that I had, but there was no reference for the figure in the slide.

Figure 4 is a screenshot from the DoGSiteScorer website, which is mentioned in the caption as "...as detected by the DoGSiteScorer web-service".

Figure 8 is a screenshot from the PLIP website, indicated in the caption as "...detected by the PLIP web-service".

All other figures are made by myself; that's why there is no reference in the captions.

Also, I've noticed that some figures have been removed, but the figure numbering has not been updated, so there are some gaps between the figure numbers.

[-> @Armin-Ariamajd: regarding your figure numbering comment, note, I did not remove figures, but some are hidden in the details section]

@AndreaVolkamer
Member

AndreaVolkamer commented Oct 13, 2021

@corey-taylor (and @Armin-Ariamajd), nicely done.

Again, I mostly shortened the text a bit and added the <details> flag here and there.

I do have a few questions or comments for you:

  • Fig 9: Lead Optimization pipeline, can we add classes to the title? Otherwise one would expect a more pipeline-like figure. And under ligand similarity, I think there is an 's' missing for analogs
  • Text below Fig 9, what is meant by constructing the software? maybe reformulate.
  • Generally, when mentioning functions, parameters or other code-related things in the text, use code formatting, e.g. `utils.py` instead of utils.py.
  • Use the data path as in other notebooks, e.g. `HERE = Path(_dh[-1])` and `DATA = HERE / "data"`
  • Check warnings in docking instantiation, cell 28: numpy DeprecationWarning
  • TBD: Not sure if NBVAL checks are a good idea for whole DataFrames?
  • Spotted this first in cell [29], but might apply earlier and surely later: We should rename affinity -> estimated affinity or so, to not confuse the reader: these are docking scores, not measured affinities!
  • Same cell: What is drug score total? Is that a value that is returned by smina? maybe add explanation.
  • Problems with nglview/widgets: Cell [33]: For me no image is shown when executing project1.Docking.visualize_all_poses() - I only see the drop down menu?
  • Same cell: Generally the drop down menu is quite extensive, can we only show the best pose per analog?
  • Cell [34]: Why do we dock the input ligand? Wasn't it co-crystallized in the first place?
  • In selected ligands, cell [39]: There is a docking_pose_2_interactions_hydrophobic instance, is there a maximum of poses associated with this information, or could there potentially be 100? which would make it very heavy ...
  • cell [41]: The interaction affinity correlation plot is not entirely clear to me: what is on the x-axis? Maybe a bit too crowded? If a correlation between # interactions and affinity is plotted, I would rather expect a scatter plot with those two values as x and y.
  • in cell [42]: again would it be possible to only show the best pose per compound?
  • in [45] interaction analysis: Does the user need to input all this info? If so, maybe we can specify what is included? And what is actually calculated?
  • Ligand output:
    • Selected ligands: can we call them ligand suggestions rather than optimized ligands? I.e. tone down the wording in the final output; these are 'only' predictions
    • cell [48]: Wouldn't it make more sense to collect the final results for the best docking pose per ligand only?
    • the 'best' value is termed highest affinity but it actually refers to the lowest value, consider changing.
    • "The pipeline successfully found an analog which is better than the input ligand in all of the three metrics, i.e. drug-likeness, binding affinity, and total number of protein-ligand interactions." Very strong statement. We do not know this! It suggests a promising ligand, more we can't state! Please adapt.
    • Can we get a 3D image how the suggested ligands might bind, also in comparison to the original ligand?
  • Putting all pieces together:
    • Do we really need the function in the nb (cell 50)? Or could we just call it (move to utils)? To me, it does not contain more information than what was shown individually above?
    • Please check: When I run project2, unfortunately, I get no results with better scores, but 60149 still shows as the optimized ligand?
    • Also in discussion figure make clear that these are estimated binding affinity!!!
  • Questions section is quite extensive ;) shorten to three general questions.
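Several of the points above ("only show the best pose per analog", "collect the final results for the best docking pose per ligand", "the 'best' value actually refers to the lowest value") boil down to picking the pose with the lowest docking score per ligand. A rough sketch with hypothetical field names and made-up scores (not the talktorial's actual data structures):

```python
# Hypothetical pose records: lower estimated affinity (docking score) = better.
# These are docking scores, not measured affinities.
poses = [
    {"ligand_cid": "65997", "pose": 1, "estimated_affinity": -7.2},
    {"ligand_cid": "65997", "pose": 2, "estimated_affinity": -8.1},
    {"ligand_cid": "2435",  "pose": 1, "estimated_affinity": -6.5},
    {"ligand_cid": "2435",  "pose": 2, "estimated_affinity": -6.9},
]

best_pose_per_ligand = {}
for pose in poses:
    cid = pose["ligand_cid"]
    current = best_pose_per_ligand.get(cid)
    # Keep the pose with the lowest (i.e. best) docking score
    if current is None or pose["estimated_affinity"] < current["estimated_affinity"]:
        best_pose_per_ligand[cid] = pose
```

With the made-up scores above, this keeps pose 2 for both ligands; only these best poses would then be shown in the dropdown menu or collected in the final results table.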

Note:

  • I did not check the suppl. part
  • And only inspected the notebook, not the utils fcts.

Member

@AndreaVolkamer AndreaVolkamer left a comment

@corey-taylor ready for you again :) Maybe coordinate with @Armin-Ariamajd he might be able to take over some of the questions/comments ...

@dominiquesydow
Collaborator

Hi @Armin-Ariamajd,

Thank you for your offer, that would be great indeed!

I won't work on this PR until Thursday. If you are free in the meantime, you could take over something from this list:

  • And by the way, I'd say we change all mentions of 'helper class(es)' to 'helper module(s)', since they are no longer classes.

    I agree, thanks for pointing that out!

  • Freeze ligand from round 2

    I have done this for round 1 as follows:

    frozen_cids = [
        "65997",
        "2435",
        "57469",
        "5011",
        "62389",
        "11292933",
        "214347",
        "1530",
        "11256587",
        "1935",
        "62274",
        "53462",
        "84759",
        "62275",
        "6451164",
        "103148",
        "7019",
        "62805",
        "675",
        "5546",
    ]
    
    from utils import Consts
    
    analogs = {
        cid: Ligand(
            identifier_type=Consts.Ligand.InputTypes.CID,
            identifier_value=cid,
            ligand_output_path=project1.Specs.OutputPaths.similarity_search,
        )
        for cid in frozen_cids
    }
    project1.Ligand.analogs = analogs

    Could you please check

    • if this workaround is correct (the pipeline does run, so I assume it is)
    • how we could add this workaround elegantly for round 2
  • You could also go ahead and resolve this PR's conflicts.

If you are making changes, could you please push them until Thursday morning or let me know to hold off for a bit longer so that we do not run into conflicts?

Thanks again for your help!

dominiquesydow and others added 11 commits November 8, 2021 11:16
Proteins.Plus is back online, so I rolled back the BindingSiteDetection section to its original form
there is no need for an `else` statement here, because the `if` clause is wrapped inside a while loop with an `else` statement. Basically, if the `if` statement evaluates to `True`, it breaks out of the while loop; otherwise the loop counter is increased by 1. When the loop counter reaches the defined value without the `if` statement ever evaluating to `True`, the while loop's `else` statement executes, i.e. it raises a `ValueError`
corrected the docstring of the `init` method;
added a new optional parameter called `frozen_data_filepath`, which can be used to provide the filepath of a CSV file containing the frozen analogs' data (i.e. CID and CanonicalSMILES); in that case, instead of querying the PubChem server, the class instance is built using the frozen data.
Added a demonstration for the `similarity_search` function of the `pubchem` module.
removed a redundant import of the `LeadOptimizationPipeline` class
activated BindingSiteDetection (rolled it back to the original form since Proteins.Plus is back online).

Added the `frozen_data_filepath` parameter to the `run` function, which passes it to the `LigandSimilaritySearch` class.
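The while/else pattern described in one of the commit messages above can be sketched in isolation (illustrative names, not the talktorial's actual code):

```python
# `break` on success skips the while-else branch; exhausting the counter
# without a single success runs the `else` branch, which raises.
def retry_until_success(check, max_attempts):
    attempt = 0
    while attempt < max_attempts:
        if check(attempt):
            break  # success: the while-else branch is skipped
        attempt += 1
    else:
        # Runs only when the loop finishes without hitting `break`
        raise ValueError("no attempt succeeded")
    return attempt

retry_until_success(lambda n: n == 2, max_attempts=5)  # returns 2
```

This is why no explicit `else` after the `if` is needed: the loop's own `else` clause already covers the "never succeeded" case.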
@dominiquesydow
Collaborator

Thanks a lot, @Armin-Ariamajd, for solving so many open TODOs!!

I am summarizing here the last bits before we can merge :)

  • Run notebook with Armin's changes to check if everything is running as planned (thank you for the round 2 freezing!!)
  • Resolve PR conflicts (so that the CI starts running; hoping we won't run into too much trouble there)
  • Render website (make html in docs folder) and resolve utils import issues
  • Check rendered website for formatting issues; resolve them

@corey-taylor
Collaborator Author

@dominiquesydow and I will knock these off today, hopefully.

@AAriam
Collaborator

AAriam commented Nov 11, 2021

@dominiquesydow, @corey-taylor,

You're welcome; I'm not completely finished though.
If you could give me another day or two, there are some other corrections I'd like to make.
I'll then let you know when I'm finished.

@dominiquesydow
Collaborator

Hi @Armin-Ariamajd, could you post a checklist of things that you still would like to do, please?

@AAriam
Collaborator

AAriam commented Nov 11, 2021

Hi @dominiquesydow,

I'm in the middle of modifying the talktorials to run with the frozen CIDs, so there are some changes that are still not finalized and I haven't pushed yet. Also in the initial test after freezing the CIDs it seemed to me that the results are still not completely deterministic. So I have to take a closer look at that as well.

I'm in the middle of a lecture right now, and today is a bit of a busy day for me; I'll try to finalize and push the changes later today in the evening, or tomorrow evening at the latest, if that's okay with you.

@dominiquesydow
Collaborator

Hi @Armin-Ariamajd,

Alrighty, we will wait :)

By tomorrow morning

  • please let us know if you are ready (then we pick up our open TODOs) or
  • push what you have and post what the remaining open issues are (just to check in and decide if you want to continue on Friday or if we can split tasks between us by morning/afternoon/evening, ok?)

adapted the notebook for the new way of freezing ligands for both runs.
corrected spacing in the list in "Aim of this talktorial"
proofread again and corrected some wording and formatting
@AAriam
Collaborator

AAriam commented Nov 12, 2021

Hi @dominiquesydow,

So far, I have:

  • changed all mentions of "helper class(es)" to "helper module(s)"

  • frozen the ligands for the second run as well, and changed the way frozen ligands are used.
    Your workaround was technically correct, but in my opinion there was a more elegant way to do it, so I changed the procedure for both runs as follows:

    For each run, the list of frozen ligands is saved in a separate CSV file under the data folder (named FrozenPubchemQuery_project1.csv and FrozenPubchemQuery_project2.csv). I then added a new parameter to the `__init__` function of the `ligand_similarity_search` class, called `frozen_data_filepath`. This parameter has the default value `None`, so if it is not provided when the class is instantiated, the similarity search is performed using PubChem as before. However, when the parameter is provided, `ligand_similarity_search` no longer performs a similarity search using PubChem and instead builds the instance from the ligands in the CSV file.

  • rolled back the binding site detection part to its original form, since Proteins.Plus is back online.

  • proofread the whole talktorial again and corrected some wording and formatting.

  • made some other small corrections (see commit comments)

The still remaining issues are:

  • The pipeline (i.e. its docking part) is still not deterministic. It was very hard to pinpoint why, because for some ligands (mostly the analogs in the first run) the docking results are always the same, but for some other ligands (mostly the analogs in the second run, i.e. the automated run at the end) they are always different! That's why the result of the first run is always the same, but for the second run, the output ligand changes from run to run.

    After a couple of hours I finally found out what the problem is: it's not in the docking part with Smina, but rather in the generation of 3D structures for docking using Open Babel.
    Before docking with Smina, the `docking` class calls the `create_pdbqt_from_pdb_file` and `create_pdbqt_from_smiles` functions of the `obabel` helper module to generate 3D structures for the protein and the ligands, respectively. Both of these functions call the `optimize_structure_for_docking` function of the `obabel` module, which in turn calls the `make3D` function of the `pybel` module in the `openbabel` package. This is where the problem lies: the generation of 3D structures has inherent randomness in its algorithm, but the `openbabel` Python API does not offer an option to provide a seed.

    Apparently, for some ligands (and probably for the protein as well) the algorithm converges to the same coordinates every time, but for some other ligands, it does not.

    This is a known issue with the `openbabel` package (see this issue on their GitHub: smiles to xyz conversion isn't repeatable openbabel/openbabel#1934).

    Apparently the developers have provided a way to set a seed via a shell command, but I couldn't find a way to use this from the Python API.
    At this point, the only way I can think of to make the pipeline deterministic is to also freeze the 3D structures (i.e. the PDBQT files) generated before docking.

  • When running the docking part, I now get the following warning from Open Babel:
    `Open Babel Warning in PerceiveBondOrders. Failed to kekulize aromatic bonds in OBMol::PerceiveBondOrders`

    I get the warning for the protein and some ligands (but not all of them). It comes from the functions `create_pdbqt_from_pdb_file` and `create_pdbqt_from_smiles` in the `obabel` helper module of the talktorial.

    However, as far as I can tell, it doesn't have any effect on the docking results.

    The only reason I can think of that may have caused this warning to appear now and not before is that I updated the biopandas package, which may have also updated the obabel package.

    There is also this warning from the Structure module of `opencadd.structure.core`, used in the function `extract_molecule_from_pdb_file` in the `pdb` helper module of the talktorial:

    `861: DeprecationWarning: Using the last letter of the segid for the chainID is now deprecated and will be changed in 2.0. In 2.0, the chainID attribute will be used if it exists, or a placeholder value. warnings.warn("Using the last letter of the segid for the chainID "`

Please feel free to let me know if there is anything else I can help you with.

P.S. I could not resolve the conflicts, as they do not show up for me on my local machine.

@dominiquesydow
Collaborator

Hi @Armin-Ariamajd,

Thanks a lot for all the work you put in!!

We will take it from here and try to resolve the last outstanding issues.

* frozen the ligands for the second run as well, and changed the way how to use frozen ligands.

I like the new setup!

* rolled back the binding site detection part to its original form, since Proteins.Plus is back online.

Perfect, thanks!

  So after a couple of hours I finally found out what the problem is; It's not in the docking part with Smina, but rather in the generation of 3D structures for docking, using Open Babel:
  Before docking with Smina, the `docking` class calls the `create_pdbqt_from_pdb_file` and the `create_pdbqt_from_smiles` functions of the `obabel` helper module to generate 3D structures for the protein and the ligands, respectively. Both these functions then call the `optimize_structure_for_docking` function of the `obabel` module, which in turn calls the `make3D` function of the `pybel` module in the `openbabel` package. This is where the problem lies; The generation of 3D structures has inherent randomness in the algorithm, but the `openbabel` python API does not have the option to provide a seed.
  At this point, the only way that comes to my mind for making the pipeline deterministic is to also freeze the 3d structures (i.e. the pdbqt files) generated before docking.

Thanks for boiling the problem down to this. My first thought was also to freeze the PDBQT files. I will give it some more thought.

* When running the docking part, I now get the following warning from Open Babel:
  `Open Babel Warning  in PerceiveBondOrders. Failed to kekulize aromatic bonds in OBMol::PerceiveBondOrders`

Looking into this, thanks.

  `861: DeprecationWarning: Using the last letter of the segid for the chainID is now deprecated and will be changed in 2.0. In 2.0, the chainID attribute will be used if it exists, or a placeholder value. warnings.warn("Using the last letter of the segid for the chainID "`

This is a warning from MDAnalysis < 2.0.0; since our environment requires >= 2.0.0 we should be fine. We'll check both warnings with a fresh teachopencadd environment.

@corey-taylor
Collaborator Author

Can't see a way around freezing the PDBQT files.

At least with something like Omega, the starting point is the same, i.e. if you ask for 50 conformers in one run and then 2000 in another, the first 50 of the latter should be identical. This is as opposed to Open Babel, where if you do multiple runs of 50, each batch of 50 will be slightly different.

For rigid docking, this is a really irritating problem as, unlike MD, it's meant to be deterministic. Cheers for hunting it down, @Armin-Ariamajd.

@dominiquesydow
Collaborator

Closing this PR; superseded by #179.

@dominiquesydow dominiquesydow deleted the ct-018-pipeline branch December 6, 2021 18:31