Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvements to tokenization, deterministic sorting #76

Merged
merged 9 commits into from
Oct 12, 2022

Conversation

dotsdl
Copy link
Member

@dotsdl dotsdl commented Oct 4, 2022

We now do the following:

  • tokenization via _gufe_tokenize is done of keyed-dict form, avoiding
    a long chain of serialization for e.g. AlchemicalNetworks featuring
    ProteinComponents
  • GufeTokenizable.__lt__ is done on self.key value, not on the results
    from hash(self), which does not deliver the same result across
    Python processes
  • explicit sorting in key places for ChemicalSystem,
    AlchemicalNetwork, to ensure .key stability

We now do the following:
- tokenization via `_gufe_tokenize` is done of keyed-dict form, avoiding
  a long chain of serialization for e.g. `AlchemicalNetwork`s featuring
  `ProteinComponent`s
- `GufeTokenizable.__lt__` is done on `self.key` value, not on the results
  from `hash(self)`, which does not deliver the same result across
  Python processes
- explicit sorting in key places for `ChemicalSystem,
  `AlchemicalNetwork`, to ensure `.key` stability
@dotsdl dotsdl changed the title Performance improvements to tokenization, deterministic sorting [WIP] Performance improvements to tokenization, deterministic sorting Oct 4, 2022
@dotsdl
Copy link
Member Author

dotsdl commented Oct 4, 2022

Will add tests for this; a bit tricky since it requires spinning up separate Python processes to validate that keys don't change across processes.

@codecov
Copy link

codecov bot commented Oct 4, 2022

Codecov Report

Base: 96.96% // Head: 97.39% // Increases project coverage by +0.42% 🎉

Coverage data is based on head (efd7172) compared to base (5dddacc).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #76      +/-   ##
==========================================
+ Coverage   96.96%   97.39%   +0.42%     
==========================================
  Files          26       26              
  Lines        1419     1420       +1     
==========================================
+ Hits         1376     1383       +7     
+ Misses         43       37       -6     
Impacted Files Coverage Δ
gufe/chemicalsystem.py 100.00% <ø> (ø)
gufe/transformations/transformation.py 90.32% <ø> (+1.90%) ⬆️
gufe/network.py 100.00% <100.00%> (+6.25%) ⬆️
gufe/tokenization.py 95.73% <100.00%> (-2.39%) ⬇️
gufe/components/smallmoleculecomponent.py 100.00% <0.00%> (+6.17%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@dwhswenson
Copy link
Member

Obvious performance gains to be made: Components do not contain gufe objects, but their dict form may contain many dicts/iterables. The default to_dict uses dict_encode_dependencies, so recursively looks for gufe tokenizables in all containers. For objects that can not contain gufe tokenizables, we should override:

def to_dict(self, include_defaults=True):
    dct = self._to_dict()
    if not include_defaults:
         ...  # as exists currently

@dotsdl dotsdl requested a review from dwhswenson October 4, 2022 16:38
@dwhswenson
Copy link
Member

Will add tests for this; a bit tricky since it requires spinning up separate Python processes to validate that keys don't change across processes.

Can't you just hard-code a key? It shouldn't change across versions either. (Or if it does, we should be alerted with a failing test.)

The implementation so far looks good to me. Should I add the overrides for to_dict for components here or in a separate PR?

@dotsdl
Copy link
Member Author

dotsdl commented Oct 5, 2022

@dwhswenson feel free to add any to_dict optimizations here!

@dotsdl
Copy link
Member Author

dotsdl commented Oct 5, 2022

And yes, for Components I think it's safe to say that they are the bottom of the GufeTokenizable tree, so we can use that fact to optimize their tokenization behavior where possible. Good call!

@dotsdl
Copy link
Member Author

dotsdl commented Oct 5, 2022

Added stability checks in d2a7f02 for all GufeTokenizables that we expect should have stable keys.

@dwhswenson after you add the optimizations you have in mind for Component subclasses, we should be good to go on this PR.

Copy link
Contributor

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@dwhswenson
Copy link
Member

Anything that contains a ProteinComponent is failing with these keys (and locally I'm getting the same keys as tests). @dotsdl : I can replace these with the keys that I find, but you should be aware that something is different in your environment that is breaking the transferability of these keys.

I've added the faster to_dict implementations. I added them explicitly in SolventComponent and ExplicitMoleculeComponent rather than the generic component superclass because I figure it is possible that someone in the future will come up with a component that does contain other gufe tokenizables.

@dotsdl
Copy link
Member Author

dotsdl commented Oct 5, 2022

@dwhswenson that's concerning! I ran my tests on mskcc2, with openff-toolkit 0.11.0, rdkit 2022.03.05. Will try to isolate how this is happening.

dotsdl added 2 commits October 5, 2022 18:34
Something very odd is going on, and I can't quite pin it down.
May need another set of eyes here.
@dotsdl
Copy link
Member Author

dotsdl commented Oct 6, 2022

I'm pulling my hair out trying to pin down unstable .keys for AlchemicalNetwork, Transformation, NonTransformation and ProteinComponent examples in our tests. My suspicion is that something about ProteinComponent tokenization is nondeterministic and that's causing the other failures on up the tree.

What's really weird is I can get the tests to all pass locally, but after committing or pushing the files to my test host the tests fail again. It's spooky really, and I don't understand it.

@dotsdl
Copy link
Member Author

dotsdl commented Oct 6, 2022

@dwhswenson are there any aspects of the ProteinComponent._to_dict that may be nondeterministic? Or perhaps how the ProteinComponent.from_pdb_file pulls in its state in the first place?

@dwhswenson
Copy link
Member

@dwhswenson are there any aspects of the ProteinComponent._to_dict that may be nondeterministic? Or perhaps how the ProteinComponent.from_pdb_file pulls in its state in the first place?

I haven't investigated in too much depth, but relevant info is that I match CI's Ubuntu keys on macOS. It can't be completely random, and can't be Linux-specific.

@richardjgowers richardjgowers added this to the Next release milestone Oct 6, 2022
Needed deterministic `_to_dict` for `ProteinComponent` by sorting
molecule props.
@dotsdl
Copy link
Member Author

dotsdl commented Oct 7, 2022

So even with sorting the call to self._rdkit.GetPropsAsDict in ProteinComponent._to_dict, we're still getting the shell game behavior I demonstrated earlier today. Fixing the keys based on what the tests give works initially, but after committing the changes the tests are immediately broken again. It's the strangest thing to observe.

Our working hypothesis is that some things in ProteinComponent._to_dict are not deterministic. Since we are pulling things out of RDKit, are we guaranteed to get the same ordering of atoms, bonds, and conformers? Should we be sorting any of these instead?

@dwhswenson
Copy link
Member

Another suggestion: instead of tokenizing with the string rep, use json.dumps with sorted keys. It occurred to me on the flight that we wouldn't even have these dictionary order questions if we did that (and that's what I've always done before in this situation).

You can also compare those two string reps to see exactly what differs. (Do an assert test against that string, which you can dump to a file. pytest will give you the diff.) Might be easier than diving through details of rdkit code.

@dotsdl
Copy link
Member Author

dotsdl commented Oct 7, 2022

@richardjgowers can you have a look at the _to_dict implementation to see if there are any non-deterministic list generations in there?

@dotsdl dotsdl mentioned this pull request Oct 11, 2022
Also removed optimizations for ExplicitMoleculeComponent and
SolventComponent; they don't appear to really make a difference in my
tokenization timings, and I'd rather avoid the additional complexity of
having the overrides given this. Can re-add later if they prove vital
for larger systems.
@dotsdl
Copy link
Member Author

dotsdl commented Oct 12, 2022

FYI @dwhswenson: in efd7172 I removed optimizations for ExplicitMoleculeComponent and SolventComponent to_dict; they don't appear to really make a difference in my tokenization timings, and I'd rather avoid the additional complexity of having the overrides given this. We can re-add later if they prove vital for larger systems, though.

@dotsdl dotsdl changed the title [WIP] Performance improvements to tokenization, deterministic sorting Performance improvements to tokenization, deterministic sorting Oct 12, 2022
@dotsdl dotsdl merged commit 749d3d0 into main Oct 12, 2022
@dotsdl dotsdl deleted the tokenization-fixes branch April 1, 2023 00:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants