
T018 Pipeline #125

Closed
wants to merge 61 commits into from

Conversation

corey-taylor
Collaborator

@corey-taylor corey-taylor commented Aug 25, 2021

Note: This PR has been superseded by #179

Details

Tech

Content review

  • Potential labels or categories (e.g. machine learning, small molecules, online APIs): small molecules, online APIs, CADD pipeline
  • One line summary: Automated structure-based virtual screening pipeline
  • Theory: Check if text is copy-pasted from other notebooks; if so indicate with quote block and reference respective talktorials for further reading
  • Practical: Add class schema (architecture)
  • Practical: Move big classes to utils.py script to make reading the notebook a bit more fluent
  • The table of contents reflects the talktorial story-line; order of #, ##, ### headers is correct
  • URLs are linked with meaningful words, instead of pasting the URL directly or linking words like here.
  • I have spell-checked the notebook
  • Images have enough resolution to be rendered with quality, without being too heavy.
  • All figures have a description
  • Markdown cell content is still in-line with code cell output (whenever results are discussed)
  • I have checked that cell outputs are not incredibly long (this applies also to DataFrames)
  • Formatting looks correct on the Sphinx render (bold, italics, figure placement)
  • Working title: "Automated pipeline for lead optimization" (update in manuscript if title differs)

Code review

  • Time it took to execute (approx.): code execution only = 30 minutes
  • Variable and function names follow snake case rules (e.g. a_variable_name vs aVariableName)
  • Spacing follows PEP8 (run Black on the code cells if needed)
  • Code lines are under 99 characters each (run `black -l 99`)
  • Comments are useful and well placed
  • There are no unpythonic idioms like `for i in range(len(list))` (see slides)
  • All 3rd party dependencies are listed at the top of the notebook
  • I have marked all code cells whose output is referenced in Markdown cells with the label `# NBVAL_CHECK_OUTPUT`
  • I have identified potential candidates for a code refactor / useful functions
  • All `import ...` lines are in the top cell (of the practical part), ordered by standard library / 3rd party packages / our own (`teachopencadd.*`)
  • I have updated the relative paths to absolute paths.
  • List here unfamiliar libraries you find in the imports and their intended use:
  • opencadd is not an in-house package any more; it is a third party package
  • Add reference to Hopkins paper mentioned in docstrings (check for other paper mentions without reference)
  • Run flake8 on the Python file and flake8-nb on the Jupyter notebook
  • Freeze PubChem CID dataset for CI; add comment on how users can unfreeze the dataset
  • Runs without problems on Jupyter Lab
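As a side note on the unpythonic-idiom checklist item above, a minimal sketch (with made-up example data) of the `for i in range(len(list))` pattern versus the idiomatic `enumerate`:

```python
# Made-up example data for illustration only
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]

# Unpythonic: manual index bookkeeping via range(len(...))
labels_bad = []
for i in range(len(smiles_list)):
    labels_bad.append(f"ligand_{i}: {smiles_list[i]}")

# Pythonic: enumerate yields the index and the item together
labels_good = [f"ligand_{i}: {smi}" for i, smi in enumerate(smiles_list)]

assert labels_bad == labels_good
```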

@dominiquesydow dominiquesydow mentioned this pull request Sep 8, 2021
27 tasks
@dominiquesydow dominiquesydow added the new talktorial New talktorial label Sep 8, 2021
Base automatically changed from t011-base to master September 9, 2021 10:02
@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

@dominiquesydow
Collaborator

Hi @corey-taylor,

Thanks a lot for your edits on T018! 🚀

I started with streamlining the text format with the other notebooks, e.g.

  • No title case in section headers
  • Splitting cells per paragraph (makes reviews via ReviewNB easier)
  • Using Markdown instead of HTML in text cells
  • Resized all image files to the size they are supposed to have in the talktorial (we avoid using HTML to resize images in the talktorial)

Before I continue with the content review, could you please go through the notebook and

  • Remove HTML text formatting
    • I fixed most of it in the Theory part; however, I did not remove the <b> and <i> tags outside of references yet - maybe we can split our forces here
    • Let's keep the HTML `<a id='similarity_theory'></a>` tags used to bookmark sections.
  • Add a note to cells that need execution times longer than a couple of seconds (we usually add a note like "Note: Running the next cell may take a couple of minutes/xxx minutes/up to a minute" in a Markdown cell above the code cell)

Do ping me if in doubt regarding the text format.
For quick checks, I usually have the template open in Jupyter Lab to remind myself of our formatting conventions: https://github.com/volkamerlab/teachopencadd/blob/master/teachopencadd/talktorials/T000_template/talktorial.ipynb

@corey-taylor
Collaborator Author

Yup, I wasn't sure whether to use HTML or Markdown, as the original had a mix of both, hence I made it consistent at least. Cheers for clarifying and for providing the template.

@corey-taylor
Collaborator Author

corey-taylor commented Oct 6, 2021

@dominiquesydow

All the HTML is now removed, other than the hyperlink tags you mentioned. On execution times, there were already notes with each cell where it takes a while to run but without specifically saying something like Note: this will take x minutes to complete. Is this okay with you?

@AndreaVolkamer
Member

AndreaVolkamer commented Oct 12, 2021

@corey-taylor I went through the theory part; it reads well, I just tried to shorten it further.

I mainly included a <details> part in most subsections, which is hidden unless expanded by the user. Check if you like it.
TBD whether the respective figures should also go in there or stay in the main notebook.

Small things that need to be checked:

  • Fig. author/source missing in most figure captions.
  • Can you add a ref for numbers (time/cost) that are mentioned for the DD pipeline?
  • Ideally, we would have a good reference directly linked in each main section
  • Note the href anchor tags disturb the header levels (at least in Jupyter Lab -> could be a problem on the website as well), can we change the order so that the Markdown header (## ...) comes first? Or other ideas?
    `<a id='drug_design_theory'></a>`
    `## Drug design pipeline`
  • Check links to other talktorials, they need to be updated since they are merged to master already.

@AAriam
Collaborator

AAriam commented Oct 13, 2021

@AndreaVolkamer, @corey-taylor

Figure 2 is copied from an old lecture slide that I had, but there was no reference for the figure in the slide.

Figure 4 is a screenshot from the DoGSiteScorer website, which is mentioned in the caption as "...as detected by the DoGSiteScorer web-service".

Figure 8 is a screenshot from the PLIP website, indicated in the caption as "...detected by the PLIP web-service".

All other figures are made by myself; that's why there is no reference in the captions.

Also, I've noticed that some figures have been removed, but the figure numbering has not been updated, so there are some gaps between the figure numbers.

[-> @Armin-Ariamajd: regarding your figure numbering comment, note, I did not remove figures, but some are hidden in the details section]

@AndreaVolkamer
Member

AndreaVolkamer commented Oct 13, 2021

@corey-taylor (and @Armin-Ariamajd), nicely done.

Again, I mostly shortened the text a bit and added the <details> flag here and there.

I do have a few questions or comments for you:

  • Fig 9: Lead Optimization pipeline, can we add classes to the title? Otherwise one would expect a more pipeline-like figure. And under ligand similarity, I think there is an 's' missing for analogs
  • Text below Fig 9, what is meant by constructing the software? maybe reformulate.
  • Generally, when mentioning functions, parameters or other code-related things in the text, use code formatting, e.g. `utils.py` instead of utils.py.
  • Use the data path as in other notebooks, e.g. `HERE = Path(_dh[-1])` and `DATA = HERE / "data"`
  • Check warnings in docking instantiation, cell 28: numpy DeprecationWarning
  • TBD: Not sure if NBVAL checks are a good idea for whole DataFrames?
  • Spotted this first in cell [29], but might apply earlier and surely later: We should rename affinity -> estimated affinity or so, to not confuse the reader: these are docking scores, not measured affinities!
  • Same cell: What is drug score total? Is that a value that is returned by smina? maybe add explanation.
  • Problems with nglview/widgets: Cell [33]: For me no image is shown when executing project1.Docking.visualize_all_poses() - I only see the drop down menu?
  • Same cell: Generally the drop down menu is quite extensive, can we only show the best pose per analog?
  • Cell [34]: Why do we dock the input ligand? Wasn't it co-crystallized in the first place?
  • In selected ligands, cell [39]: There is a docking_pose_2_interactions_hydrophobic instance, is there a maximum of poses associated with this information, or could there potentially be 100? which would make it very heavy ...
  • cell [41]: The interaction affinity correlation plot is not entirely clear to me: what is on the x-axis? Maybe a bit too crowded? If a correlation between # interactions and affinity is plotted, I would rather expect a scatter plot with those two values as x and y.
  • in cell [42]: again would it be possible to only show the best pose per compound?
  • in [45] interaction analysis: Does the user need to input all this info? If so, maybe we can specify what is included? And what is actually calculated?
  • Ligand output:
    • Selected ligands: can we call them ligand suggestions rather than optimized ligands? I.e. tone down the wording in the final output; these are 'only' predictions
    • cell [48]: Wouldn't it make more sense to collect the final results for the best docking pose per ligand only?
    • the 'best' value is termed highest affinity but it actually refers to the lowest value, consider changing.
    • "The pipeline successfully found an analog which is better than the input ligand in all of the three metrics, i.e. drug-likeness, binding affinity, and total number of protein-ligand interactions." Very strong statement. We do not know this! It suggests a promising ligand, more we can't state! Please adapt.
    • Can we get a 3D image how the suggested ligands might bind, also in comparison to the original ligand?
  • Putting all pieces together:
    • Do we really need the function in the nb (cell 50)? Or could we just call it (move to utils)? To me, it does not contain more information than what was shown individually above?
    • Please check: When I run project2, unfortunately, I get no results with better scores, but 60149 still shows as the optimized ligand?
    • Also in discussion figure make clear that these are estimated binding affinity!!!
  • Questions section is quite extensive ;) shorten to three general questions.
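Several of the points above ("only show the best pose per analog", "collect the final results for the best docking pose per ligand", "the 'best' value actually refers to the lowest value") boil down to picking the pose with the lowest docking score per ligand. A rough sketch with hypothetical field names and made-up scores (not the talktorial's actual data structures):

```python
# Hypothetical pose records: lower estimated affinity (docking score) = better.
# These are docking scores, not measured affinities.
poses = [
    {"ligand_cid": "65997", "pose": 1, "estimated_affinity": -7.2},
    {"ligand_cid": "65997", "pose": 2, "estimated_affinity": -8.1},
    {"ligand_cid": "2435",  "pose": 1, "estimated_affinity": -6.5},
    {"ligand_cid": "2435",  "pose": 2, "estimated_affinity": -6.9},
]

best_pose_per_ligand = {}
for pose in poses:
    cid = pose["ligand_cid"]
    current = best_pose_per_ligand.get(cid)
    # Keep the pose with the lowest (i.e. best) docking score
    if current is None or pose["estimated_affinity"] < current["estimated_affinity"]:
        best_pose_per_ligand[cid] = pose
```

With the made-up scores above, this keeps pose 2 for both ligands; only these best poses would then be shown in the dropdown menu or collected in the final results table.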

Note:

  • I did not check the suppl. part
  • And only inspected the notebook, not the utils fcts.

Member

@AndreaVolkamer AndreaVolkamer left a comment

@corey-taylor ready for you again :) Maybe coordinate with @Armin-Ariamajd he might be able to take over some of the questions/comments ...

@dominiquesydow
Collaborator

Hi @Armin-Ariamajd,

Thank you for your offer, that would be great indeed!

I won't work on this PR until Thursday. If you are free in the meantime, you could take over something from this list:

  • And by the way, I'd say we change all mentions of 'helper class(es)' to 'helper module(s)', since they are no longer classes.

    I agree, thanks for pointing that out!

  • Freeze ligand from round 2

    I have done this for round 1 as follows:

    frozen_cids = [
        "65997",
        "2435",
        "57469",
        "5011",
        "62389",
        "11292933",
        "214347",
        "1530",
        "11256587",
        "1935",
        "62274",
        "53462",
        "84759",
        "62275",
        "6451164",
        "103148",
        "7019",
        "62805",
        "675",
        "5546",
    ]
    
    from utils import Consts
    
    analogs = {
        cid: Ligand(
            identifier_type=Consts.Ligand.InputTypes.CID,
            identifier_value=cid,
            ligand_output_path=project1.Specs.OutputPaths.similarity_search,
        )
        for cid in frozen_cids
    }
    project1.Ligand.analogs = analogs

    Could you please check

    • if this workaround is correct (the pipeline does run, so I assume it is)
    • how we could add this workaround elegantly for round 2
  • You could also go ahead and resolve this PR's conflicts.

If you are making changes, could you please push them until Thursday morning or let me know to hold off for a bit longer so that we do not run into conflicts?

Thanks again for your help!

dominiquesydow and others added 11 commits November 8, 2021 11:16
Proteins.Plus is back online, so I rolled back the BindingSiteDetection section to its original form
there is no need for an `else` statement here, because the `if` clause is wrapped inside a while loop with an `else` statement. Basically, if the `if` statement evaluates to `True`, it breaks out of the while loop; otherwise the loop counter is increased by 1. When the loop counter reaches the defined value without the `if` statement ever evaluating to `True`, the while loop's `else` statement executes, i.e. it raises a `ValueError`
corrected the docstring of the `init` method;
added a new optional parameter called `frozen_data_filepath`, which can be used to provide the filepath of a CSV file containing the frozen analogs' data (i.e. CID and CanonicalSMILES); in that case, instead of querying the PubChem server, the class instance is built using the frozen data.
Added a demonstration for the `similarity_search` function of the `pubchem` module.
removed a redundant import of the `LeadOptimizationPipeline` class
activated BindingSiteDetection (rolled it back to the original form since Proteins.Plus is back online).

Added the `frozen_data_filepath` parameter to the `run` function, which passes it to the `LigandSimilaritySearch` class.
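The while/else pattern described in one of the commit messages above can be sketched in isolation (illustrative names, not the talktorial's actual code):

```python
# `break` on success skips the while-else branch; exhausting the counter
# without a single success runs the `else` branch, which raises.
def retry_until_success(check, max_attempts):
    attempt = 0
    while attempt < max_attempts:
        if check(attempt):
            break  # success: the while-else branch is skipped
        attempt += 1
    else:
        # Runs only when the loop finishes without hitting `break`
        raise ValueError("no attempt succeeded")
    return attempt

retry_until_success(lambda n: n == 2, max_attempts=5)  # returns 2
```

This is why no explicit `else` after the `if` is needed: the loop's own `else` clause already covers the "never succeeded" case.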
@dominiquesydow
Collaborator

Thanks a lot, @Armin-Ariamajd, for solving so many open TODOs!!

I am summarizing here the last bits before we can merge :)

  • Run notebook with Armin's changes to check if everything is running as planned (thank you for the round 2 freezing!!)
  • Resolve PR conflicts (so that the CI starts running; hoping we won't run into too much trouble there)
  • Render website (make html in docs folder) and resolve utils import issues
  • Check rendered website for formatting issues; resolve them

@corey-taylor
Collaborator Author

@dominiquesydow and I will knock these off today, hopefully.

@AAriam
Collaborator

AAriam commented Nov 11, 2021

@dominiquesydow, @corey-taylor,

You're welcome; I'm not completely finished though.
If you could give me another day or two, there are some other corrections I'd like to make.
I'll then let you know when I'm finished.

@dominiquesydow
Collaborator

Hi @Armin-Ariamajd, could you post a checklist of things that you still would like to do, please?

@AAriam
Collaborator

AAriam commented Nov 11, 2021

Hi @dominiquesydow,

I'm in the middle of modifying the talktorials to run with the frozen CIDs, so there are some changes that are still not finalized and I haven't pushed yet. Also in the initial test after freezing the CIDs it seemed to me that the results are still not completely deterministic. So I have to take a closer look at that as well.

I'm in the middle of a lecture right now, and today is a bit of a busy day for me; I'll try to finalize and push the changes later today in the evening, or tomorrow evening at the latest, if that's okay with you.

@dominiquesydow
Collaborator

Hi @Armin-Ariamajd,

Alrighty, we will wait :)

By tomorrow morning

  • please let us know if you are ready (then we pick up our open TODOs) or
  • push what you have and post what the remaining open issues are (just to check in and decide if you want to continue on Friday or if we can split tasks between us by morning/afternoon/evening, ok?)

adapted the notebook for the new way of freezing ligands for both runs.
corrected spacing in the list in "Aim of this talktorial"
proofread again and corrected some wording and formatting
@AAriam
Collaborator

AAriam commented Nov 12, 2021

Hi @dominiquesydow,

So far, I have:

  • changed all mentions of "helper class(es)" to "helper module(s)"

  • frozen the ligands for the second run as well, and changed the way frozen ligands are used.
    Your workaround was technically correct, but in my opinion there was a more elegant way to do it, so I changed the procedure for both runs as follows:

    For each run, the list of frozen ligands is saved in a separate CSV file under the data folder (named FrozenPubchemQuery_project1.csv and FrozenPubchemQuery_project2.csv). I then added a new parameter to the `__init__` function of the `ligand_similarity_search` class, called `frozen_data_filepath`. This parameter has the default value `None`, so if it is not provided when the class is instantiated, the similarity search is performed using PubChem as before. However, when the parameter is provided, `ligand_similarity_search` no longer performs a similarity search using PubChem and instead builds the instance from the ligands in the CSV file.

  • rolled back the binding site detection part to its original form, since Proteins.Plus is back online.

  • proofread the whole talktorial again and corrected some wording and formatting.

  • made some other small corrections (see commit comments)

The still remaining issues are:

  • The pipeline (i.e. its docking part) is still not deterministic. It was very hard to pinpoint why, because for some ligands (mostly the analogs in the first run) the docking results are always the same, but for some other ligands (mostly the analogs in the second run, i.e. the automated run at the end) they are always different! That's why the result of the first run is always the same, but for the second run, the output ligand changes from run to run.

    After a couple of hours I finally found out what the problem is: it's not in the docking part with Smina, but rather in the generation of 3D structures for docking using Open Babel.
    Before docking with Smina, the `docking` class calls the `create_pdbqt_from_pdb_file` and `create_pdbqt_from_smiles` functions of the `obabel` helper module to generate 3D structures for the protein and the ligands, respectively. Both of these functions call the `optimize_structure_for_docking` function of the `obabel` module, which in turn calls the `make3D` function of the `pybel` module in the `openbabel` package. This is where the problem lies: the generation of 3D structures has inherent randomness in its algorithm, but the `openbabel` Python API does not offer an option to provide a seed.

    Apparently, for some ligands (and probably for the protein as well) the algorithm converges to the same coordinates every time, but for some other ligands, it does not.

    This is a known issue with the `openbabel` package (see this issue on their GitHub: smiles to xyz conversion isn't repeatable openbabel/openbabel#1934).

    Apparently the developers have provided a way to set a seed via a shell command, but I couldn't find a way to use this from the Python API.
    At this point, the only way I can think of to make the pipeline deterministic is to also freeze the 3D structures (i.e. the PDBQT files) generated before docking.

  • When running the docking part, I now get the following warning from Open Babel:
    `Open Babel Warning in PerceiveBondOrders. Failed to kekulize aromatic bonds in OBMol::PerceiveBondOrders`

    I get the warning for the protein and some ligands (but not all of them). It comes from the functions `create_pdbqt_from_pdb_file` and `create_pdbqt_from_smiles` in the `obabel` helper module of the talktorial.

    However, as far as I can tell, it doesn't have any effect on the docking results.

    The only reason I can think of that may have caused this warning to appear now and not before is that I updated the biopandas package, which may have also updated the obabel package.

    There is also this warning from the Structure module of `opencadd.structure.core`, used in the function `extract_molecule_from_pdb_file` in the `pdb` helper module of the talktorial:

    `861: DeprecationWarning: Using the last letter of the segid for the chainID is now deprecated and will be changed in 2.0. In 2.0, the chainID attribute will be used if it exists, or a placeholder value. warnings.warn("Using the last letter of the segid for the chainID "`

Please feel free to let me know if there is anything else I can help you with.

P.S. I could not resolve the conflicts, as they do not show up for me on my local machine.

@dominiquesydow
Collaborator

Hi @Armin-Ariamajd,

Thanks a lot for all the work you put in!!

We will take it from here and try to resolve the last outstanding issues.

* frozen the ligands for the second run as well, and changed the way how to use frozen ligands.

I like the new setup!

* rolled back the binding site detection part to its original form, since Proteins.Plus is back online.

Perfect, thanks!

  So after a couple of hours I finally found out what the problem is; It's not in the docking part with Smina, but rather in the generation of 3D structures for docking, using Open Babel:
  Before docking with Smina, the `docking` class calls the `create_pdbqt_from_pdb_file` and the `create_pdbqt_from_smiles` functions of the `obabel` helper module to generate 3D structures for the protein and the ligands, respectively. Both these functions then call the `optimize_structure_for_docking` function of the `obabel` module, which in turn calls the `make3D` function of the `pybel` module in the `openbabel` package. This is where the problem lies; The generation of 3D structures has inherent randomness in the algorithm, but the `openbabel` python API does not have the option to provide a seed.
  At this point, the only way that comes to my mind for making the pipeline deterministic is to also freeze the 3d structures (i.e. the pdbqt files) generated before docking.

Thanks for boiling the problem down to this. My first thought was also to freeze the PDBQT files. I will give it some more thought.

* When running the docking part, I now get the following warning from Open Babel:
  `Open Babel Warning  in PerceiveBondOrders. Failed to kekulize aromatic bonds in OBMol::PerceiveBondOrders`

Looking into this, thanks.

  `861: DeprecationWarning: Using the last letter of the segid for the chainID is now deprecated and will be changed in 2.0. In 2.0, the chainID attribute will be used if it exists, or a placeholder value. warnings.warn("Using the last letter of the segid for the chainID "`

This is a warning from MDAnalysis < 2.0.0; since our environment requires >= 2.0.0 we should be fine. We'll check both warnings with a fresh teachopencadd environment.

@corey-taylor
Collaborator Author

Can't see a way around freezing the PDBQT files.

At least with something like Omega, the starting point is the same, i.e. if you ask for 50 conformers in one run and then 2000 in another, the first 50 of the latter should be identical. This is as opposed to Open Babel, where if you do multiple runs of 50, each batch of 50 will be slightly different.

For rigid docking, this is a really irritating problem as, unlike MD, it's meant to be deterministic. Cheers for hunting it down, @Armin-Ariamajd.

@dominiquesydow
Collaborator

Closing this PR; superseded by #179.

@dominiquesydow dominiquesydow deleted the ct-018-pipeline branch December 6, 2021 18:31