Performance improvements to tokenization, deterministic sorting #76
We now do the following:
- tokenization via `_gufe_tokenize` is done on the keyed-dict form, avoiding a long chain of serialization for e.g. `AlchemicalNetwork`s featuring `ProteinComponent`s
- `GufeTokenizable.__lt__` compares `self.key` values rather than the results of `hash(self)`, which are not the same across Python processes
- explicit sorting in key places for `ChemicalSystem` and `AlchemicalNetwork`, to ensure `.key` stability
Will add tests for this; a bit tricky since it requires spinning up separate Python processes to validate that keys don't change across processes.
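A minimal sketch of what such a cross-process check could look like. The `Token` class below is a hypothetical stand-in, not gufe's actual `GufeTokenizable`, and the way its `key` is built (SHA-256 over sorted JSON) is an assumption made only for illustration. The script runs the same snippet in two fresh interpreters with different `PYTHONHASHSEED` values and compares the results.

```python
import json
import os
import subprocess
import sys

# Code executed in each child interpreter. `Token` is a hypothetical
# stand-in for a tokenizable object; it is not gufe's implementation.
SNIPPET = r'''
import hashlib, json

class Token:
    def __init__(self, **attrs):
        self._attrs = attrs

    @property
    def key(self):
        # Assumed key scheme: class name plus a digest of sorted JSON.
        payload = json.dumps(self._attrs, sort_keys=True)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        return f"{type(self).__name__}-{digest}"

    def __lt__(self, other):
        # Order by the stable key, not by hash(self).
        return self.key < other.key

    def __hash__(self):
        # str hashes are salted per process unless PYTHONHASHSEED is fixed.
        return hash(self.key)

items = [Token(name=n) for n in ("ligand", "protein", "solvent")]
result = {"order": [t.key for t in sorted(items)], "hash": hash(items[0])}
print(json.dumps(result))
'''


def run_with_seed(seed):
    """Run SNIPPET in a new interpreter with the given hash seed."""
    env = {**os.environ, "PYTHONHASHSEED": str(seed)}
    proc = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env, capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)


first = run_with_seed(1)
second = run_with_seed(2)
print("sorted keys identical:", first["order"] == second["order"])
print("hash values identical:", first["hash"] == second["hash"])
```

Fixing the seed per child makes the comparison reproducible; leaving `PYTHONHASHSEED` unset would also give a different salt per process, just not a predictable one.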
Codecov Report
Base: 96.96% // Head: 97.39% // Increases project coverage by +0.42%
Additional details and impacted files
@@ Coverage Diff @@
## main #76 +/- ##
==========================================
+ Coverage 96.96% 97.39% +0.42%
==========================================
Files 26 26
Lines 1419 1420 +1
==========================================
+ Hits 1376 1383 +7
+ Misses 43 37 -6
Obvious performance gains to be made:
def to_dict(self, include_defaults=True):
    dct = self._to_dict()
    if not include_defaults:
        ...  # as exists currently
Can't you just hard-code a key? It shouldn't change across versions either. (Or if it does, we should be alerted with a failing test.) The implementation so far looks good to me. Should I add the overrides for |
@dwhswenson feel free to add any |
And yes, for |
Added stability checks in d2a7f02 for all @dwhswenson after you add the optimizations you have in mind for |
Lgtm
Anything that contains a I've added the faster |
@dwhswenson that's concerning! I ran my tests on |
Something very odd is going on, and I can't quite pin it down. May need another set of eyes here.
I'm pulling my hair out trying to pin down unstable What's really weird is I can get the tests to all pass locally, but after committing or pushing the files to my test host the tests fail again. It's spooky really, and I don't understand it.
@dwhswenson are there any aspects of the |
I haven't investigated in too much depth, but relevant info is that I match CI's Ubuntu keys on macOS. It can't be completely random, and can't be Linux-specific.
Needed deterministic `_to_dict` for `ProteinComponent` by sorting molecule props.
So even with sorting the call to Our working hypothesis is that some things in |
Another suggestion: instead of tokenizing with the string rep, use `json.dumps` with sorted keys. It occurred to me on the flight that we wouldn't even have these dictionary order questions if we did that (and that's what I've always done before in this situation). You can also compare those two string reps to see exactly what differs. (Do an assert test against that string, which you can dump to a file. pytest will give you the diff.) Might be easier than diving through details of rdkit code.
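A rough sketch of that suggestion, using only the standard library. The helper names `stable_token` and `explain_difference` and the example dictionaries are hypothetical, chosen for illustration; the use of SHA-256 is likewise an assumption, not what gufe does.

```python
import difflib
import hashlib
import json


def stable_token(dct):
    """Hash the JSON form with sorted keys, so insertion order is irrelevant."""
    serialized = json.dumps(dct, sort_keys=True)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def explain_difference(dct_a, dct_b):
    """Return a unified diff of the two sorted JSON forms ('' if identical)."""
    a = json.dumps(dct_a, sort_keys=True, indent=2).splitlines()
    b = json.dumps(dct_b, sort_keys=True, indent=2).splitlines()
    return "\n".join(difflib.unified_diff(a, b, lineterm=""))


# Same content, different insertion order (illustrative data only).
a = {"name": "protein", "props": {"charge": 0, "chain": "A"}}
b = {"props": {"chain": "A", "charge": 0}, "name": "protein"}
# Genuinely different content.
c = {"name": "protein", "props": {"charge": 1, "chain": "A"}}

print(str(a) == str(b))                      # string reps depend on order
print(stable_token(a) == stable_token(b))    # sorted JSON does not
print(explain_difference(a, c))              # shows which field differs
```

Dumping the sorted JSON to a file and asserting against it in pytest gives the same kind of diff on failure, which can narrow an unstable key down to a single field.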
@richardjgowers can you have a look at the |
Also removed optimizations for ExplicitMoleculeComponent and SolventComponent; they don't appear to really make a difference in my tokenization timings, and I'd rather avoid the additional complexity of having the overrides given this. Can re-add later if they prove vital for larger systems.
FYI @dwhswenson: in efd7172 I removed optimizations for |