Performance improvements to tokenization, deterministic sorting #76

dotsdl · 2022-10-04T15:44:27Z

We now do the following:

tokenization via _gufe_tokenize is done of keyed-dict form, avoiding
a long chain of serialization for e.g. AlchemicalNetworks featuring
ProteinComponents
GufeTokenizable.__lt__ is done on self.key value, not on the results
from hash(self), which does not deliver the same result across
Python processes
explicit sorting in key places for ChemicalSystem,
AlchemicalNetwork, to ensure .key stability

We now do the following: - tokenization via `_gufe_tokenize` is done of keyed-dict form, avoiding a long chain of serialization for e.g. `AlchemicalNetwork`s featuring `ProteinComponent`s - `GufeTokenizable.__lt__` is done on `self.key` value, not on the results from `hash(self)`, which does not deliver the same result across Python processes - explicit sorting in key places for `ChemicalSystem, `AlchemicalNetwork`, to ensure `.key` stability

dotsdl · 2022-10-04T15:46:00Z

Will add tests for this; a bit tricky since it requires spinning up separate Python processes to validate that keys don't change across processes.

codecov · 2022-10-04T15:51:10Z

Codecov Report

Base: 96.96% // Head: 97.39% // Increases project coverage by +0.42% 🎉

Coverage data is based on head (efd7172) compared to base (5dddacc).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #76      +/-   ##
==========================================
+ Coverage   96.96%   97.39%   +0.42%     
==========================================
  Files          26       26              
  Lines        1419     1420       +1     
==========================================
+ Hits         1376     1383       +7     
+ Misses         43       37       -6

Impacted Files	Coverage Δ
gufe/chemicalsystem.py	`100.00% <ø> (ø)`
gufe/transformations/transformation.py	`90.32% <ø> (+1.90%)`	⬆️
gufe/network.py	`100.00% <100.00%> (+6.25%)`	⬆️
gufe/tokenization.py	`95.73% <100.00%> (-2.39%)`	⬇️
gufe/components/smallmoleculecomponent.py	`100.00% <0.00%> (+6.17%)`	⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

dwhswenson · 2022-10-04T16:38:47Z

Obvious performance gains to be made: Components do not contain gufe objects, but their dict form may contain many dicts/iterables. The default to_dict uses dict_encode_dependencies, so recursively looks for gufe tokenizables in all containers. For objects that can not contain gufe tokenizables, we should override:

def to_dict(self, include_defaults=True):
    dct = self._to_dict()
    if not include_defaults:
         ...  # as exists currently

dwhswenson · 2022-10-04T17:33:54Z

Will add tests for this; a bit tricky since it requires spinning up separate Python processes to validate that keys don't change across processes.

Can't you just hard-code a key? It shouldn't change across versions either. (Or if it does, we should be alerted with a failing test.)

The implementation so far looks good to me. Should I add the overrides for to_dict for components here or in a separate PR?

dotsdl · 2022-10-05T03:19:41Z

@dwhswenson feel free to add any to_dict optimizations here!

dotsdl · 2022-10-05T03:24:41Z

And yes, for Components I think it's safe to say that they are the bottom of the GufeTokenizable tree, so we can use that fact to optimize their tokenization behavior where possible. Good call!

dotsdl · 2022-10-05T05:15:38Z

Added stability checks in d2a7f02 for all GufeTokenizables that we expect should have stable keys.

@dwhswenson after you add the optimizations you have in mind for Component subclasses, we should be good to go on this PR.

richardjgowers

Lgtm

dwhswenson · 2022-10-05T15:41:20Z

Anything that contains a ProteinComponent is failing with these keys (and locally I'm getting the same keys as tests). @dotsdl : I can replace these with the keys that I find, but you should be aware that something is different in your environment that is breaking the transferability of these keys.

I've added the faster to_dict implementations. I added them explicitly in SolventComponent and ExplicitMoleculeComponent rather than the generic component superclass because I figure it is possible that someone in the future will come up with a component that does contain other gufe tokenizables.

dotsdl · 2022-10-05T16:18:27Z

@dwhswenson that's concerning! I ran my tests on mskcc2, with openff-toolkit 0.11.0, rdkit 2022.03.05. Will try to isolate how this is happening.

Something very odd is going on, and I can't quite pin it down. May need another set of eyes here.

dotsdl · 2022-10-06T05:57:16Z

I'm pulling my hair out trying to pin down unstable .keys for AlchemicalNetwork, Transformation, NonTransformation and ProteinComponent examples in our tests. My suspicion is that something about ProteinComponent tokenization is nondeterministic and that's causing the other failures on up the tree.

What's really weird is I can get the tests to all pass locally, but after committing or pushing the files to my test host the tests fail again. It's spooky really, and I don't understand it.

dotsdl · 2022-10-06T06:06:37Z

@dwhswenson are there any aspects of the ProteinComponent._to_dict that may be nondeterministic? Or perhaps how the ProteinComponent.from_pdb_file pulls in its state in the first place?

dwhswenson · 2022-10-06T06:21:20Z

@dwhswenson are there any aspects of the ProteinComponent._to_dict that may be nondeterministic? Or perhaps how the ProteinComponent.from_pdb_file pulls in its state in the first place?

I haven't investigated in too much depth, but relevant info is that I match CI's Ubuntu keys on macOS. It can't be completely random, and can't be Linux-specific.

Needed deterministic `_to_dict` for `ProteinComponent` by sorting molecule props.

dotsdl · 2022-10-07T03:25:39Z

So even with sorting the call to self._rdkit.GetPropsAsDict in ProteinComponent._to_dict, we're still getting the shell game behavior I demonstrated earlier today. Fixing the keys based on what the tests give works initially, but after committing the changes the tests are immediately broken again. It's the strangest thing to observe.

Our working hypothesis is that some things in ProteinComponent._to_dict are not deterministic. Since we are pulling things out of RDKit, are we guaranteed to get the same ordering of atoms, bonds, and conformers? Should we be sorting any of these instead?

dwhswenson · 2022-10-07T05:40:28Z

Another suggestion: instead of tokenizing with the string rep, use json.dumps with sorted keys. It occurred to me on the flight that we wouldn't even have these dictionary order questions if we did that (and that's what I've always done before in this situation).

You can also compare those two string reps to see exactly what differs. (Do an assert test against that string, which you can dump to a file. pytest will give you the diff.) Might be easier than diving through details of rdkit code.

dotsdl · 2022-10-07T17:38:00Z

@richardjgowers can you have a look at the _to_dict implementation to see if there are any non-deterministic list generations in there?

Also removed optimizations for ExplicitMoleculeComponent and SolventComponent; they don't appear to really make a difference in my tokenization timings, and I'd rather avoid the additional complexity of having the overrides given this. Can re-add later if they prove vital for larger systems.

dotsdl · 2022-10-12T01:03:14Z

FYI @dwhswenson: in efd7172 I removed optimizations for ExplicitMoleculeComponent and SolventComponent to_dict; they don't appear to really make a difference in my tokenization timings, and I'd rather avoid the additional complexity of having the overrides given this. We can re-add later if they prove vital for larger systems, though.

dotsdl changed the title ~~Performance improvements to tokenization, deterministic sorting~~ [WIP] Performance improvements to tokenization, deterministic sorting Oct 4, 2022

dotsdl requested a review from dwhswenson October 4, 2022 16:38

Added stability checks for GufeTokenizable keys where appropriate

d2a7f02

richardjgowers approved these changes Oct 5, 2022

View reviewed changes

Add faster to_dict/from_dict for components

ac568f2

dotsdl added 2 commits October 5, 2022 18:34

Update to known keys for ProteinComponent on up

8691d99

Another attempt at key fixes

9e18859

Something very odd is going on, and I can't quite pin it down. May need another set of eyes here.

richardjgowers added this to the Next release milestone Oct 6, 2022

Finally got stable hashes working for ProteinComponent, on up

e1b1d0f

Needed deterministic `_to_dict` for `ProteinComponent` by sorting molecule props.

Merge branch 'main' into tokenization-fixes

681a2a9

dotsdl mentioned this pull request Oct 11, 2022

Protein rework #77

Merged

dotsdl added 2 commits October 11, 2022 17:23

Merge branch 'main' into tokenization-fixes

a2c5af6

dotsdl changed the title ~~[WIP] Performance improvements to tokenization, deterministic sorting~~ Performance improvements to tokenization, deterministic sorting Oct 12, 2022

dotsdl merged commit 749d3d0 into main Oct 12, 2022

dotsdl deleted the tokenization-fixes branch April 1, 2023 00:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance improvements to tokenization, deterministic sorting #76

Performance improvements to tokenization, deterministic sorting #76

dotsdl commented Oct 4, 2022 •

edited

Loading

dotsdl commented Oct 4, 2022

codecov bot commented Oct 4, 2022 •

edited

Loading

dwhswenson commented Oct 4, 2022

dwhswenson commented Oct 4, 2022

dotsdl commented Oct 5, 2022

dotsdl commented Oct 5, 2022

dotsdl commented Oct 5, 2022

richardjgowers left a comment

dwhswenson commented Oct 5, 2022

dotsdl commented Oct 5, 2022

dotsdl commented Oct 6, 2022

dotsdl commented Oct 6, 2022

dwhswenson commented Oct 6, 2022

dotsdl commented Oct 7, 2022

dwhswenson commented Oct 7, 2022

dotsdl commented Oct 7, 2022

dotsdl commented Oct 12, 2022

Performance improvements to tokenization, deterministic sorting #76

Performance improvements to tokenization, deterministic sorting #76

Conversation

dotsdl commented Oct 4, 2022 • edited Loading

dotsdl commented Oct 4, 2022

codecov bot commented Oct 4, 2022 • edited Loading

Codecov Report

dwhswenson commented Oct 4, 2022

dwhswenson commented Oct 4, 2022

dotsdl commented Oct 5, 2022

dotsdl commented Oct 5, 2022

dotsdl commented Oct 5, 2022

richardjgowers left a comment

Choose a reason for hiding this comment

dwhswenson commented Oct 5, 2022

dotsdl commented Oct 5, 2022

dotsdl commented Oct 6, 2022

dotsdl commented Oct 6, 2022

dwhswenson commented Oct 6, 2022

dotsdl commented Oct 7, 2022

dwhswenson commented Oct 7, 2022

dotsdl commented Oct 7, 2022

dotsdl commented Oct 12, 2022

dotsdl commented Oct 4, 2022 •

edited

Loading

codecov bot commented Oct 4, 2022 •

edited

Loading