ENH: optimize import times #66
Conversation
I am glad to see progress on this and very glad to see your involvement in
Thanks for the warning, this isn't particularly urgent, so don't rush to get to this.
friendly bump
multipledispatch/dispatcher.py
Outdated
def ordering(self, value):
    self.__class__ = self._resolved_type
    self.reorder()
    self.ordering = value
This looks like fiendish devilry to me. I wonder if there is a less magical way to achieve the same result without significant cost.
For example, I believe that try-except blocks are quite cheap if there is no exception. I wonder if something like the following would work well enough:
def add(...):
    del self._ordering
    ...

@property
def ordering(self):
    try:
        return self._ordering
    except AttributeError:
        self.reorder()
        return self._ordering
try/except is actually somewhat expensive because it needs to set up a block and unwind it. In blaze we have many millions of calls to multiply dispatched functions, so even small overheads get magnified. The goal of this was to have no overhead once the dispatcher is resolved.
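For readers following the thread, here is a minimal sketch of the __class__-swapping trick under discussion; the class names and attributes are invented for illustration and are not the PR's actual code. The unresolved type pays a one-time resolution cost, then rebinds the instance's type so that later accesses are plain attribute lookups with no branch and no property call:

class _Resolved(object):
    # After the swap, ``ordering`` lives in the instance dict:
    # reading it costs nothing beyond a normal attribute lookup.
    pass

class _Unresolved(object):
    @property
    def ordering(self):
        order = self._compute_ordering()  # one-time, possibly expensive
        self.__class__ = _Resolved        # the property disappears
        self.ordering = order             # now a plain instance attribute
        return order

    def _compute_ordering(self):
        return sorted(self.funcs)  # stand-in for the real reordering

d = _Unresolved()
d.funcs = [3, 1, 2]
print(d.ordering)  # first access resolves and swaps the class
print(d.ordering)  # subsequent accesses never touch the property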
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't touch ordering that often though (caching should catch most calls), and when we do it's well nested within several other try-except blocks and function calls.
Here is a (very untrustworthy) microbenchmark:

In [1]: %%timeit
   ...: try:
   ...:     1 + 1
   ...: except:
   ...:     pass
   ...:
The slowest run took 96.99 times longer than the fastest. This could mean that an intermediate result is being cached.
100000000 loops, best of 3: 18.6 ns per loop

In [2]: %timeit 1 + 1
100000000 loops, best of 3: 10.8 ns per loop
Compare that to function calls:

In [3]: def f():
   ...:     pass
   ...:

In [4]: %timeit f()
The slowest run took 17.84 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 69.9 ns per loop
It's unclear to me that adding a try-except block would significantly harm performance. Swapping out __class__ would, I suspect, have unexpected consequences when interacting with the other libraries of the ecosystem.
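For anyone who wants to rerun this comparison outside IPython, here is a rough standalone equivalent using the stdlib timeit module (Python 3; absolute numbers will of course vary by machine and interpreter):

import timeit

n = 10 ** 7

plain = timeit.timeit('1 + 1', number=n)
wrapped = timeit.timeit(
    'try:\n'
    '    1 + 1\n'
    'except Exception:\n'
    '    pass',
    number=n,
)

def f():
    pass

call = timeit.timeit('f()', globals={'f': f}, number=n)

print('plain add:      %.1f ns' % (plain / n * 1e9))
print('try/except add: %.1f ns' % (wrapped / n * 1e9))
print('function call:  %.1f ns' % (call / n * 1e9))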
@llllllllll reminder that the ball is in your court here :)
In general the approach here seems quite nice. I can see how it would have a large positive impact without any requirements on the user and without much complexity. Another option, should you be interested, would be to maintain some data structure so that we could update the ordering cheaply when we find a new signature. Currently we clear out all intermediate state when we solve. Presumably there is something we could keep around to make incremental additions efficient.
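As a sketch of what that incremental idea might look like (this is not what the PR implements; supercedes stands in for a specificity predicate like the one in multipledispatch.conflict): insert a new signature just before the first existing signature it beats, instead of re-sorting everything. Whether a single linear scan is sound for the full partial order would need checking; this only shows the shape of the idea.

def insert_signature(ordering, new_sig, supercedes):
    # O(n) insertion into an existing ordering rather than a full
    # re-sort on every addition.
    for i, existing in enumerate(ordering):
        if supercedes(new_sig, existing):
            ordering.insert(i, new_sig)
            return ordering
    ordering.append(new_sig)
    return ordering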
Force-pushed from 8eba021 to 13e7fbc.
I like the idea of using a better data structure to optimize insertions, but I think that can exist alongside this PR. I have removed the type swapping code and replaced it with a simple property.
Does this meet your operational performance constraints?
This is still a considerable improvement. I think there is not much difference in import time between my old version and this new version; if there is, it is hard to measure reliably.
You were also concerned about call-time costs. My hope is that caching handles almost all of these costs, but I wanted to check.
Ah, thanks for reminding me about the per-call cost. I will run a more serious benchmark using blaze and get back to you.
@llllllllll So we have the benchmark things in now; you may want to just close and reopen to get those to run?
Force-pushed from 13e7fbc to e6539ae.
@mariusvniekerk thanks for setting up the benchmarks. Here are the results for this branch (with and without the fiendish devilry), as well as your pytypes branch. (Note: this was all run on my laptop, so I wouldn't read too much into the absolute times; I am just comparing the relative timings.)

Python 3.6

Python 2.7

Note: I have no idea why the min and median are messed up in py2. This convinces me that my weird class swapping is likely not worth it. It also makes me very concerned about the pytypes branch; note that the units are in
Any outstanding blockers on merging this? |
A couple tiny comments. Happy to merge regardless though. Thank you for doing this and for the ping.
multipledispatch/dispatcher.py
Outdated
ordering = getattr(self, '_ordering', None)
if ordering is None:
    ordering = self.reorder()
return ordering
I suspect that the following is marginally faster:

try:
    return self._ordering
except AttributeError:
    return self.reorder()

Asking forgiveness tends to introduce less overhead in Python than asking permission, in the common case where permission is usually granted. This may not be relevant though; I haven't done benchmarks in this particular case.
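A throwaway way to check this particular case (as noted later in this thread, the answer differs between 2.7 and 3.6, so it is worth measuring on the target interpreter):

import timeit

setup = (
    'class C(object):\n'
    '    pass\n'
    'c = C()\n'
    'c._ordering = [1, 2, 3]'
)
n = 10 ** 6

getattr_way = timeit.timeit(
    "o = getattr(c, '_ordering', None)\n"
    'if o is None:\n'
    '    o = []',
    setup=setup,
    number=n,
)
try_way = timeit.timeit(
    'try:\n'
    '    o = c._ordering\n'
    'except AttributeError:\n'
    '    o = []',
    setup=setup,
    number=n,
)

print('getattr with default: %.3fs' % getattr_way)
print('try/except:           %.3fs' % try_way)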
amb = ambiguities(self.funcs)
if amb:
    on_ambiguity(self, amb)
return od
Why the temporary od variable here? Perhaps just return self._ordering?
This is actually slightly faster because we don't need to go through the attribute lookup again. Not a big deal on individual calls but across all multiple dispatch calls it can add up slightly.
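A small illustration of the point (the function bodies here are hypothetical): returning the local compiles to a single LOAD_FAST, while re-reading the attribute adds a LOAD_ATTR on every call.

import dis

def with_local(self):
    od = self.reorder()
    return od  # LOAD_FAST: cheap read of a local variable

def with_attribute(self):
    self.reorder()
    return self._ordering  # LOAD_ATTR: attribute lookup on every call

dis.dis(with_local)
dis.dis(with_attribute)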
When I had last profiled this on 2.7, the cost of entering the try/except was actually pretty high because it pushes and pops from the block stack. I reran this test on 3.6 and the results have flipped: the try/except was faster. I will clean that up.
updated
This is in. Thanks for following up on this @llllllllll
It is currently very expensive to import modules that use multiple dispatch because of all of the ordering that happens at module scope.

This change creates a notion of an unresolved dispatcher which will lazily resolve the order when needed. Dispatchers begin as regular dispatchers with no functions. When add is called, the cache is cleared and the object moves into the unresolved state. When the order attribute is looked up, the order is computed and the dispatcher moves back to the resolved state. This means that exact matches (resolved in self.funcs) will not trigger a reorder. Reordering can be explicitly triggered by calling the reorder method. This same machinery applies to the MethodDispatcher.
This is implemented by swapping the type under the dispatcher instance to avoid a branch on every call. One simple implementation would be to have a self._unresolved flag on the instance; however, this would be slower than the current solution. Most multipledispatch code starts with a large number of dispatch definitions followed by a considerably larger number of calls to the dispatcher, so I wanted to keep the per-call overhead as low as possible. This implementation also helps by batching sequences of additions, triggering only a single reorder when the ordering is next used.

As a small benchmark, this takes the import time of one component of our code from 7.502 seconds down to 3.210 seconds. Using the existing multiple dispatch, the top calls by total time were:

Now, multiple dispatch is not in the top 30.
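For reference, a "top calls by total time" listing like the one described can be produced with the stdlib profiler; the module name below is a placeholder, not something from this PR:

import cProfile
import pstats

# Profile the import and print the 30 most expensive calls by total time.
cProfile.run('import your_module', 'import.prof')
pstats.Stats('import.prof').sort_stats('tottime').print_stats(30)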
Another example is a small blaze backend implemented for a custom data format we have. This backend uses 34 calls to dispatch, and we spend 2.171 seconds in restart_ordering. The total import time for this file was 4.524 seconds. With just this change, this module's import time is down to 0.347 seconds.