Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: optimize import times #66

Merged
merged 3 commits into from
Feb 26, 2018
Merged

Conversation

llllllllll
Copy link
Collaborator

It is currently very expensive to import modules that use multiple dispatch because of all of the ordering that happens at module scope.

This change creates a notion of an unresolved dispatcher which will lazily resolve the order when needed. Dispatchers begin as regular dispatchers with no functions. When add is called, the cache is cleared and the object moves into the unresolved state. When the order attribute is looked up, the order is computed and the dispatcher moves back to the resolved state. This means that exact matches (resolved in self.funcs) will not trigger a reorder. Reordering can be explicitly triggered by calling the reorder method. This same machinery applies to the MethodDispatcher.

This is implemented by swapping the type under the dispatcher instance to avoid a branch on every call. One simple implementation would be to have a self._unresolved flag on the instance; however, this would be slower than the current solution. Most multupledispatch code starts with a large amount of dispatch definitions followed by a considerably larger amount of calls to the dispatcher. I wanted to keep the per-call overhead as low as possible. This implementation also helps by batching sequences of additions and only triggers a single reorder when used.

As a small benchmark, this takes the import time of one component of our code from 7.502 seconds down to 3.210 seconds. Using the existing mutiple dispatch the top calls by total time were:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1742368    0.763    0.000    1.541    0.000 multipledispatch/conflict.py:7(supercedes)
   845865    0.588    0.000    1.185    0.000 multipledispatch/conflict.py:12(consistent)
      175    0.530    0.003    2.206    0.013 multipledispatch/conflict.py:24(ambiguities)
   815514    0.395    0.000    0.565    0.000 {map}
  1695752    0.343    0.000    1.847    0.000 multipledispatch/conflict.py:43(edge)
      175    0.264    0.002    2.131    0.012 multipledispatch/conflict.py:56(ordering)

Now, multuple dispatch is not in the top 30.

Another example is a small blaze backend implemented for a custom data format we have. This backend uses 34 calls to dispatch, and we spend 2.171 seconds in restart_ordering. The total import time for this file was 4.524 seconds. With just this change, this module's import time is down to 0.347 seconds.

@mrocklin
Copy link
Owner

I am glad to see progress on this and very glad to see your involvement in multipledispatch generally. However, I am also particularly busy this week and probably won't look at this until next Tuesday at the earliest. Please don't take this as a sign that I am anything less than very enthusiastic :)

@llllllllll
Copy link
Collaborator Author

Thanks for the warning, this isn't particularly urgent so don't rush to get to this.

@llllllllll
Copy link
Collaborator Author

friendly bump

def ordering(self, value):
self.__class__ = self._resolved_type
self.reorder()
self.ordering = value
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like fiendish devilry to me. I wonder if there is a less magical way to achieve the same result without significant cost.

For example, I believe that try-except blocks are quite cheap if there is no exception. I wonder if something like the following would work well enough:

def add(...):
    del self._ordering
    ...

@property
def ordering(self):
    try:
        return self._ordering
    except AttributeError:
        self.reorder()
        return self._ordering

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try/except is actually somewhat expensive because it needs to setup a block and unwind it. In blaze we have many millions of calls to multiply dispatched functions so even small overhead gets magnified. The goal of this was to be have no overhead once the dispatcher is resolved.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't touch ordering that often though (caching should catch most calls), and when we do it's well nested within several other try-except blocks and function calls.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a (very untrustworthy) microbenchmark

In [1]: %%timeit
   ...: try:
   ...:     1 + 1
   ...: except:
   ...:     pass
   ...: 
The slowest run took 96.99 times longer than the fastest. This could mean that an intermediate result is being cached.
100000000 loops, best of 3: 18.6 ns per loop

In [2]: %timeit 1 + 1
100000000 loops, best of 3: 10.8 ns per loop

Compare that to function calls

In [3]: def f():
   ...:     pass
   ...: 

In [4]: %timeit f()
The slowest run took 17.84 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 69.9 ns per loop

It's unclear to me that adding try-except block would significantly harm performance. Swapping out __class__ would, I suspect, have unexpected consequences when interacting with all of the other libraries of the ecosystem.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@llllllllll reminder that the ball is in your court here :)

@mrocklin
Copy link
Owner

In general the approach here seems quite nice. I can see how it would have a large positive impact without any requirements on the user and without much complexity.

Another option, should you be interested, would be to maintain some data structure so that we could update the ordering cheaply when we find a new signature. Currently we clear out all intermediate state when we solve. Presumably there is something we could keep around to make incremental additions efficient.

@llllllllll
Copy link
Collaborator Author

I like the idea of using a better data structure to optimize insertions, but I think that can exist alongside this PR. I have removed the type swapping code and replaced it with a simple property.

@mrocklin
Copy link
Owner

Does this meet your operational performance constraints?

@llllllllll
Copy link
Collaborator Author

This is still a considerable improvement, I think there is not much difference in import time between my old version and this new version. If there is, it is hard to measure reliably.

@mrocklin
Copy link
Owner

You were also concerned about call-time costs. My hope is that caching handles almost all of these costs, but wanted to check.

@llllllllll
Copy link
Collaborator Author

Ah, thanks for reminding me about the per-call cost. I will run a more serious benchmark using blaze and get back to you.

@mariusvniekerk
Copy link
Collaborator

@llllllllll So we have the benchmark things in now, you may want to just close and reopen to get those to run?

@llllllllll llllllllll closed this Feb 2, 2018
@llllllllll llllllllll reopened this Feb 2, 2018
Joe Jevnik added 2 commits February 2, 2018 15:15

Verified

This commit was signed with the committer’s verified signature.
scala-steward Scala Steward
@llllllllll
Copy link
Collaborator Author

llllllllll commented Feb 2, 2018

@mariusvniekerk thanks for setting up the benchmarks, here are the results for this branch (with and without the fiendish devilry), as well as your pytypes branch. (Note: this was all run on my laptop, I wouldn't read too much into the absolute time. I am just comparing the relative timings)

Python 3.6

master:
------------------------------------------------------------------------------------------------------------ benchmark: 4 tests -----------------------------------------------------------------------------------------------------------
Name (time in ns)                                       Min                     Max                   Mean                 StdDev                 Median                    IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_call_single_dispatch[1]             784.0572 (1.0)       32,990.9381 (1.0)         913.7105 (1.0)         495.8431 (1.14)        888.0161 (1.0)          49.9422 (1.35)     933;2479    1,094.4385 (1.0)       55479           1
test_benchmark_call_single_dispatch[a]             791.0421 (1.01)      48,720.9763 (1.48)        960.3451 (1.05)        436.3127 (1.0)         905.0127 (1.02)         73.9237 (2.00)    2698;9789    1,041.2924 (0.95)      87109           1
test_benchmark_call_multiple_dispatch[val0]        928.9943 (1.18)      42,384.0247 (1.28)      1,107.9655 (1.21)        539.4652 (1.24)      1,044.0126 (1.18)         37.0201 (1.0)     3576;8080      902.5552 (0.82)      73476           1
test_benchmark_add_and_use_instance             69,356.9891 (88.46)    302,431.9885 (9.17)     87,918.5234 (96.22)    22,021.6821 (50.47)    76,741.0966 (86.42)    13,104.7273 (353.99)  1131;1163       11.3742 (0.01)       7249           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

my branch (with fiendish devilry):
------------------------------------------------------------------------------------------------------------ benchmark: 4 tests -----------------------------------------------------------------------------------------------------------
Name (time in ns)                                       Min                     Max                   Mean                StdDev                 Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_call_single_dispatch[1]             827.9458 (1.0)       20,279.8983 (1.12)        943.3185 (1.0)        464.8888 (1.0)         906.9918 (1.0)       28.9874 (1.07)     729;2655    1,060.0873 (1.0)       36588           1
test_benchmark_call_single_dispatch[a]             876.0253 (1.06)      18,173.9451 (1.0)         967.7661 (1.03)       481.0002 (1.03)        922.9407 (1.02)      40.0469 (1.48)       60;191    1,033.3076 (0.97)       2430           1
test_benchmark_call_multiple_dispatch[val0]        915.0244 (1.11)      26,053.9819 (1.43)      1,037.0130 (1.10)       516.1834 (1.11)      1,000.0076 (1.10)      27.0084 (1.0)     1740;6481      964.3081 (0.91)     103092           1
test_benchmark_add_and_use_instance             30,160.9980 (36.43)    107,357.0456 (5.91)     32,351.4916 (34.30)    4,398.7726 (9.46)     31,445.9903 (34.67)    603.9627 (22.36)    395;1085       30.9105 (0.03)      10070           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

my branch (without fiendish devilry):
------------------------------------------------------------------------------------------------------------- benchmark: 4 tests ------------------------------------------------------------------------------------------------------------
Name (time in ns)                                       Min                     Max                   Mean                StdDev                 Median                   IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_call_single_dispatch[a]             810.0178 (1.0)       36,802.9578 (1.96)        967.4502 (1.02)       678.9058 (1.51)        918.0512 (1.03)        30.0352 (1.15)     331;1142    1,033.6449 (0.98)      16287           1
test_benchmark_call_single_dispatch[1]             812.9282 (1.00)      18,785.9405 (1.0)         949.2156 (1.0)        510.0139 (1.14)        895.0010 (1.0)         28.0561 (1.08)    2236;5610    1,053.5014 (1.0)       58443           1
test_benchmark_call_multiple_dispatch[val0]        913.0454 (1.13)      26,431.0511 (1.41)      1,033.0780 (1.09)       448.2621 (1.0)       1,003.0344 (1.12)        26.0770 (1.0)      865;5750      967.9812 (0.92)      67268           1
test_benchmark_add_and_use_instance             31,430.0414 (38.80)    112,092.0060 (5.97)     34,607.3238 (36.46)    6,440.4205 (14.37)    32,797.9214 (36.65)    1,010.0193 (38.73)    735;1668       28.8956 (0.03)      12506           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

pytypes branch:
--------------------------------------------------------------------------------------------------- benchmark: 4 tests ---------------------------------------------------------------------------------------------------
Name (time in us)                                    Min                 Max                Mean             StdDev              Median                IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_call_single_dispatch[1]            7.9379 (1.05)      56.8990 (1.0)        8.9310 (1.0)       3.1097 (1.0)        8.2410 (1.0)       0.1900 (1.0)      629;2361      111.9695 (1.0)       14820           1
test_benchmark_call_single_dispatch[a]            7.5700 (1.0)      239.2530 (4.20)       9.6932 (1.09)      5.3742 (1.73)       8.2999 (1.01)      0.8687 (4.57)    1498;2791      103.1654 (0.92)      21355           1
test_benchmark_call_multiple_dispatch[val0]      15.6270 (2.06)     107.7730 (1.89)      19.7133 (2.21)      7.8631 (2.53)      16.2061 (1.97)      1.3144 (6.92)    2242;3514       50.7272 (0.45)      15357           1
test_benchmark_add_and_use_instance             229.0071 (30.25)    660.4010 (11.61)    287.5107 (32.19)    93.9847 (30.22)    245.0094 (29.73)    57.5915 (303.13)    517;519        3.4781 (0.03)       3252           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Python 2.7

master:
------------------------------------------------------------------------------------------------------------- benchmark: 4 tests ------------------------------------------------------------------------------------------------------------
Name (time in ns)                                       Min                     Max                   Mean                StdDev                 Median                   IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_call_multiple_dispatch[val0]          0.0000 (1.0)       21,934.5093 (1.16)        806.1531 (1.11)       525.9095 (1.0)         953.6743 (1.0)          0.0000 (1.0)    17675;30524    1,240.4592 (0.90)      83887           1
test_benchmark_call_single_dispatch[1]               0.0000 (1.0)       18,835.0677 (1.0)         728.6384 (1.0)        539.7561 (1.03)        953.6743 (1.0)        953.6743 (inf)      11750;21    1,372.4229 (1.0)       41528           1
test_benchmark_call_single_dispatch[a]               0.0000 (1.0)       22,172.9279 (1.18)        731.4872 (1.00)       594.9043 (1.13)        953.6743 (1.0)        953.6743 (inf)      24387;70    1,367.0779 (1.00)      83887           1
test_benchmark_add_and_use_instance             46,968.4601 (inf)      148,057.9376 (7.86)     49,903.3053 (68.49)    6,154.7847 (11.70)    48,875.8087 (51.25)    1,192.0929 (inf)      355;1806       20.0388 (0.01)      11097           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

my branch (with fiendish devilry):
----------------------------------------------------------------------------------------------------------- benchmark: 4 tests -----------------------------------------------------------------------------------------------------------
Name (time in ns)                                       Min                    Max                   Mean                StdDev                 Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_call_multiple_dispatch[val0]          0.0000 (1.0)      25,987.6251 (1.36)        835.0917 (1.11)       625.9680 (1.0)         953.6743 (1.0)        0.0000 (1.0)    17302;30104    1,197.4733 (0.90)      83887           1
test_benchmark_call_single_dispatch[1]               0.0000 (1.0)      19,073.4863 (1.0)         752.2889 (1.0)        652.8658 (1.04)        953.6743 (1.0)      953.6743 (inf)      13849;74    1,329.2766 (1.0)       49933           1
test_benchmark_call_single_dispatch[a]               0.0000 (1.0)      25,987.6251 (1.36)        758.9269 (1.01)       734.6962 (1.17)        953.6743 (1.0)      953.6743 (inf)       5258;31    1,317.6500 (0.99)      18158           1
test_benchmark_add_and_use_instance             13,828.2776 (inf)      64,134.5978 (3.36)     15,705.1898 (20.88)    3,115.8844 (4.98)     15,020.3705 (15.75)      0.0000 (1.0)      784;7383       63.6732 (0.05)      21291           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

my branch (without fiendish devilry):
------------------------------------------------------------------------------------------------------------ benchmark: 4 tests -----------------------------------------------------------------------------------------------------------
Name (time in ns)                                       Min                     Max                   Mean                StdDev                 Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_call_multiple_dispatch[val0]          0.0000 (1.0)       20,027.1606 (1.05)        839.2593 (1.15)       503.8814 (1.0)         953.6743 (1.0)        0.0000 (1.0)    13809;25941    1,191.5269 (0.87)      76261           1
test_benchmark_call_single_dispatch[1]               0.0000 (1.0)       19,073.4863 (1.0)         732.0398 (1.0)        555.6902 (1.10)        953.6743 (1.0)      953.6743 (inf)      16774;37    1,366.0460 (1.0)       58255           1
test_benchmark_call_single_dispatch[a]               0.0000 (1.0)       22,172.9279 (1.16)        734.9983 (1.00)       636.3345 (1.26)        953.6743 (1.0)      953.6743 (inf)       5650;19    1,360.5474 (1.00)      19600           1
test_benchmark_add_and_use_instance             15,974.0448 (inf)      118,970.8710 (6.24)     17,288.2343 (23.62)    2,810.4606 (5.58)     16,927.7191 (17.75)    953.6743 (inf)      503;1002       57.8428 (0.04)      19240           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

pytypes branch:
--------------------------------------------------------------------------------------------------- benchmark: 4 tests --------------------------------------------------------------------------------------------------
Name (time in us)                                    Min                 Max                Mean             StdDev              Median               IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_call_single_dispatch[a]            6.9141 (1.0)       71.0487 (2.03)       8.6218 (1.0)       2.1196 (1.08)       8.1062 (1.0)      0.9537 (1.0)       716;716      115.9847 (1.0)       19973           1
test_benchmark_call_single_dispatch[1]            7.8678 (1.14)      35.0475 (1.0)        8.8847 (1.03)      1.9700 (1.0)        8.8215 (1.09)     0.9537 (1.0)       487;487      112.5533 (0.97)      16913           1
test_benchmark_call_multiple_dispatch[val0]      14.7820 (2.14)      61.9888 (1.77)      16.5979 (1.93)      2.6539 (1.35)      15.9740 (1.97)     0.9537 (1.0)       515;619       60.2485 (0.52)      18158           1
test_benchmark_add_and_use_instance             222.9214 (32.24)    526.9051 (15.03)    242.9784 (28.18)    36.6928 (18.63)    231.9813 (28.62)    9.0599 (9.50)      276;605        4.1156 (0.04)       4203           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Note: I have no idea why the min and median are messed up in py2.

This convinces me that my weird class swapping is likely not worth it. It also makes me very concerned about the pytypes branch, note that the units are in us for those benchmarks, but the others are in ns. Blaze does many millions of dispatches in the course of a real computation, so I would not be comfortable with this level of slowdown.

@llllllllll
Copy link
Collaborator Author

Any outstanding blockers on merging this?

Copy link
Owner

@mrocklin mrocklin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple tiny comments. Happy to merge regardless though. Thank you for doing this and for the ping.

ordering = getattr(self, '_ordering', None)
if ordering is None:
ordering = self.reorder()
return ordering
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that the following is marginally faster:

try:
    return self._ordering
except AttributeError:
    return self.reorder()

Asking foregiveness tends to introduce less overhead in Python than asking permission in the common case where permission is usually granted.

This may not be relevant though. I haven't done benchmarks in this particular case.

amb = ambiguities(self.funcs)
if amb:
on_ambiguity(self, amb)
return od
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the temporary od variable here? Perhaps just return self._ordering?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually slightly faster because we don't need to go through the attribute lookup again. Not a big deal on individual calls but across all multiple dispatch calls it can add up slightly.

Copy link
Collaborator Author

@llllllllll llllllllll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I had last profiled this on 2.7, the cost of entering the try/except was actually pretty high because it pushes and pops from the block-stack. I reran this test on 3.6 and the results have flipped, the try/except was faster. I will clean that up

amb = ambiguities(self.funcs)
if amb:
on_ambiguity(self, amb)
return od
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually slightly faster because we don't need to go through the attribute lookup again. Not a big deal on individual calls but across all multiple dispatch calls it can add up slightly.

@llllllllll
Copy link
Collaborator Author

updated

@mrocklin mrocklin merged commit f1146be into mrocklin:master Feb 26, 2018
@mrocklin
Copy link
Owner

This is in. Thanks for following up on this @llllllllll

richafrank added a commit to quantopian/zipline that referenced this pull request Sep 11, 2018
taoxuau pushed a commit to taoxuau/zipline that referenced this pull request Aug 1, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants