ENH: optimize import times #66

llllllllll · 2017-10-24T20:59:20Z

It is currently very expensive to import modules that use multiple dispatch because of all of the ordering that happens at module scope.

This change creates a notion of an unresolved dispatcher which will lazily resolve the order when needed. Dispatchers begin as regular dispatchers with no functions. When add is called, the cache is cleared and the object moves into the unresolved state. When the order attribute is looked up, the order is computed and the dispatcher moves back to the resolved state. This means that exact matches (resolved in self.funcs) will not trigger a reorder. Reordering can be explicitly triggered by calling the reorder method. This same machinery applies to the MethodDispatcher.

This is implemented by swapping the type under the dispatcher instance to avoid a branch on every call. One simple implementation would be to have a self._unresolved flag on the instance; however, this would be slower than the current solution. Most multupledispatch code starts with a large amount of dispatch definitions followed by a considerably larger amount of calls to the dispatcher. I wanted to keep the per-call overhead as low as possible. This implementation also helps by batching sequences of additions and only triggers a single reorder when used.

As a small benchmark, this takes the import time of one component of our code from 7.502 seconds down to 3.210 seconds. Using the existing mutiple dispatch the top calls by total time were:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1742368    0.763    0.000    1.541    0.000 multipledispatch/conflict.py:7(supercedes)
   845865    0.588    0.000    1.185    0.000 multipledispatch/conflict.py:12(consistent)
      175    0.530    0.003    2.206    0.013 multipledispatch/conflict.py:24(ambiguities)
   815514    0.395    0.000    0.565    0.000 {map}
  1695752    0.343    0.000    1.847    0.000 multipledispatch/conflict.py:43(edge)
      175    0.264    0.002    2.131    0.012 multipledispatch/conflict.py:56(ordering)

Now, multuple dispatch is not in the top 30.

Another example is a small blaze backend implemented for a custom data format we have. This backend uses 34 calls to dispatch, and we spend 2.171 seconds in restart_ordering. The total import time for this file was 4.524 seconds. With just this change, this module's import time is down to 0.347 seconds.

mrocklin · 2017-10-24T21:41:50Z

I am glad to see progress on this and very glad to see your involvement in multipledispatch generally. However, I am also particularly busy this week and probably won't look at this until next Tuesday at the earliest. Please don't take this as a sign that I am anything less than very enthusiastic :)

llllllllll · 2017-10-24T21:53:05Z

Thanks for the warning, this isn't particularly urgent so don't rush to get to this.

llllllllll · 2017-11-15T01:28:41Z

friendly bump

mrocklin · 2017-11-16T00:43:55Z

multipledispatch/dispatcher.py

+    def ordering(self, value):
+        self.__class__ = self._resolved_type
+        self.reorder()
+        self.ordering = value


This looks like fiendish devilry to me. I wonder if there is a less magical way to achieve the same result without significant cost.

For example, I believe that try-except blocks are quite cheap if there is no exception. I wonder if something like the following would work well enough:

def add(...): del self._ordering ... @property def ordering(self): try: return self._ordering except AttributeError: self.reorder() return self._ordering

try/except is actually somewhat expensive because it needs to setup a block and unwind it. In blaze we have many millions of calls to multiply dispatched functions so even small overhead gets magnified. The goal of this was to be have no overhead once the dispatcher is resolved.

We don't touch ordering that often though (caching should catch most calls), and when we do it's well nested within several other try-except blocks and function calls.

Here is a (very untrustworthy) microbenchmark

In [1]: %%timeit ...: try: ...: 1 + 1 ...: except: ...: pass ...: The slowest run took 96.99 times longer than the fastest. This could mean that an intermediate result is being cached. 100000000 loops, best of 3: 18.6 ns per loop In [2]: %timeit 1 + 1 100000000 loops, best of 3: 10.8 ns per loop

Compare that to function calls

In [3]: def f(): ...: pass ...: In [4]: %timeit f() The slowest run took 17.84 times longer than the fastest. This could mean that an intermediate result is being cached. 10000000 loops, best of 3: 69.9 ns per loop

It's unclear to me that adding try-except block would significantly harm performance. Swapping out __class__ would, I suspect, have unexpected consequences when interacting with all of the other libraries of the ecosystem.

@llllllllll reminder that the ball is in your court here :)

mrocklin · 2017-11-16T00:46:04Z

In general the approach here seems quite nice. I can see how it would have a large positive impact without any requirements on the user and without much complexity.

Another option, should you be interested, would be to maintain some data structure so that we could update the ordering cheaply when we find a new signature. Currently we clear out all intermediate state when we solve. Presumably there is something we could keep around to make incremental additions efficient.

llllllllll · 2017-12-27T23:12:12Z

I like the idea of using a better data structure to optimize insertions, but I think that can exist alongside this PR. I have removed the type swapping code and replaced it with a simple property.

mrocklin · 2017-12-27T23:34:35Z

Does this meet your operational performance constraints?

llllllllll · 2017-12-27T23:43:12Z

This is still a considerable improvement, I think there is not much difference in import time between my old version and this new version. If there is, it is hard to measure reliably.

mrocklin · 2017-12-27T23:44:02Z

You were also concerned about call-time costs. My hope is that caching handles almost all of these costs, but wanted to check.

llllllllll · 2017-12-27T23:46:20Z

Ah, thanks for reminding me about the per-call cost. I will run a more serious benchmark using blaze and get back to you.

mariusvniekerk · 2018-02-02T16:40:55Z

@llllllllll So we have the benchmark things in now, you may want to just close and reopen to get those to run?

llllllllll · 2018-02-02T20:29:05Z

@mariusvniekerk thanks for setting up the benchmarks, here are the results for this branch (with and without the fiendish devilry), as well as your pytypes branch. (Note: this was all run on my laptop, I wouldn't read too much into the absolute time. I am just comparing the relative timings)

Python 3.6

master:
------------------------------------------------------------------------------------------------------------ benchmark: 4 tests -----------------------------------------------------------------------------------------------------------
Name (time in ns)                                       Min                     Max                   Mean                 StdDev                 Median                    IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_call_single_dispatch[1]             784.0572 (1.0)       32,990.9381 (1.0)         913.7105 (1.0)         495.8431 (1.14)        888.0161 (1.0)          49.9422 (1.35)     933;2479    1,094.4385 (1.0)       55479           1
test_benchmark_call_single_dispatch[a]             791.0421 (1.01)      48,720.9763 (1.48)        960.3451 (1.05)        436.3127 (1.0)         905.0127 (1.02)         73.9237 (2.00)    2698;9789    1,041.2924 (0.95)      87109           1
test_benchmark_call_multiple_dispatch[val0]        928.9943 (1.18)      42,384.0247 (1.28)      1,107.9655 (1.21)        539.4652 (1.24)      1,044.0126 (1.18)         37.0201 (1.0)     3576;8080      902.5552 (0.82)      73476           1
test_benchmark_add_and_use_instance             69,356.9891 (88.46)    302,431.9885 (9.17)     87,918.5234 (96.22)    22,021.6821 (50.47)    76,741.0966 (86.42)    13,104.7273 (353.99)  1131;1163       11.3742 (0.01)       7249           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

my branch (with fiendish devilry):
------------------------------------------------------------------------------------------------------------ benchmark: 4 tests -----------------------------------------------------------------------------------------------------------
Name (time in ns)                                       Min                     Max                   Mean                StdDev                 Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_call_single_dispatch[1]             827.9458 (1.0)       20,279.8983 (1.12)        943.3185 (1.0)        464.8888 (1.0)         906.9918 (1.0)       28.9874 (1.07)     729;2655    1,060.0873 (1.0)       36588           1
test_benchmark_call_single_dispatch[a]             876.0253 (1.06)      18,173.9451 (1.0)         967.7661 (1.03)       481.0002 (1.03)        922.9407 (1.02)      40.0469 (1.48)       60;191    1,033.3076 (0.97)       2430           1
test_benchmark_call_multiple_dispatch[val0]        915.0244 (1.11)      26,053.9819 (1.43)      1,037.0130 (1.10)       516.1834 (1.11)      1,000.0076 (1.10)      27.0084 (1.0)     1740;6481      964.3081 (0.91)     103092           1
test_benchmark_add_and_use_instance             30,160.9980 (36.43)    107,357.0456 (5.91)     32,351.4916 (34.30)    4,398.7726 (9.46)     31,445.9903 (34.67)    603.9627 (22.36)    395;1085       30.9105 (0.03)      10070           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

my branch (without fiendish devilry):
------------------------------------------------------------------------------------------------------------- benchmark: 4 tests ------------------------------------------------------------------------------------------------------------
Name (time in ns)                                       Min                     Max                   Mean                StdDev                 Median                   IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_call_single_dispatch[a]             810.0178 (1.0)       36,802.9578 (1.96)        967.4502 (1.02)       678.9058 (1.51)        918.0512 (1.03)        30.0352 (1.15)     331;1142    1,033.6449 (0.98)      16287           1
test_benchmark_call_single_dispatch[1]             812.9282 (1.00)      18,785.9405 (1.0)         949.2156 (1.0)        510.0139 (1.14)        895.0010 (1.0)         28.0561 (1.08)    2236;5610    1,053.5014 (1.0)       58443           1
test_benchmark_call_multiple_dispatch[val0]        913.0454 (1.13)      26,431.0511 (1.41)      1,033.0780 (1.09)       448.2621 (1.0)       1,003.0344 (1.12)        26.0770 (1.0)      865;5750      967.9812 (0.92)      67268           1
test_benchmark_add_and_use_instance             31,430.0414 (38.80)    112,092.0060 (5.97)     34,607.3238 (36.46)    6,440.4205 (14.37)    32,797.9214 (36.65)    1,010.0193 (38.73)    735;1668       28.8956 (0.03)      12506           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

pytypes branch:
--------------------------------------------------------------------------------------------------- benchmark: 4 tests ---------------------------------------------------------------------------------------------------
Name (time in us)                                    Min                 Max                Mean             StdDev              Median                IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_call_single_dispatch[1]            7.9379 (1.05)      56.8990 (1.0)        8.9310 (1.0)       3.1097 (1.0)        8.2410 (1.0)       0.1900 (1.0)      629;2361      111.9695 (1.0)       14820           1
test_benchmark_call_single_dispatch[a]            7.5700 (1.0)      239.2530 (4.20)       9.6932 (1.09)      5.3742 (1.73)       8.2999 (1.01)      0.8687 (4.57)    1498;2791      103.1654 (0.92)      21355           1
test_benchmark_call_multiple_dispatch[val0]      15.6270 (2.06)     107.7730 (1.89)      19.7133 (2.21)      7.8631 (2.53)      16.2061 (1.97)      1.3144 (6.92)    2242;3514       50.7272 (0.45)      15357           1
test_benchmark_add_and_use_instance             229.0071 (30.25)    660.4010 (11.61)    287.5107 (32.19)    93.9847 (30.22)    245.0094 (29.73)    57.5915 (303.13)    517;519        3.4781 (0.03)       3252           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Python 2.7

master:
------------------------------------------------------------------------------------------------------------- benchmark: 4 tests ------------------------------------------------------------------------------------------------------------
Name (time in ns)                                       Min                     Max                   Mean                StdDev                 Median                   IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_call_multiple_dispatch[val0]          0.0000 (1.0)       21,934.5093 (1.16)        806.1531 (1.11)       525.9095 (1.0)         953.6743 (1.0)          0.0000 (1.0)    17675;30524    1,240.4592 (0.90)      83887           1
test_benchmark_call_single_dispatch[1]               0.0000 (1.0)       18,835.0677 (1.0)         728.6384 (1.0)        539.7561 (1.03)        953.6743 (1.0)        953.6743 (inf)      11750;21    1,372.4229 (1.0)       41528           1
test_benchmark_call_single_dispatch[a]               0.0000 (1.0)       22,172.9279 (1.18)        731.4872 (1.00)       594.9043 (1.13)        953.6743 (1.0)        953.6743 (inf)      24387;70    1,367.0779 (1.00)      83887           1
test_benchmark_add_and_use_instance             46,968.4601 (inf)      148,057.9376 (7.86)     49,903.3053 (68.49)    6,154.7847 (11.70)    48,875.8087 (51.25)    1,192.0929 (inf)      355;1806       20.0388 (0.01)      11097           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

my branch (with fiendish devilry):
----------------------------------------------------------------------------------------------------------- benchmark: 4 tests -----------------------------------------------------------------------------------------------------------
Name (time in ns)                                       Min                    Max                   Mean                StdDev                 Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_call_multiple_dispatch[val0]          0.0000 (1.0)      25,987.6251 (1.36)        835.0917 (1.11)       625.9680 (1.0)         953.6743 (1.0)        0.0000 (1.0)    17302;30104    1,197.4733 (0.90)      83887           1
test_benchmark_call_single_dispatch[1]               0.0000 (1.0)      19,073.4863 (1.0)         752.2889 (1.0)        652.8658 (1.04)        953.6743 (1.0)      953.6743 (inf)      13849;74    1,329.2766 (1.0)       49933           1
test_benchmark_call_single_dispatch[a]               0.0000 (1.0)      25,987.6251 (1.36)        758.9269 (1.01)       734.6962 (1.17)        953.6743 (1.0)      953.6743 (inf)       5258;31    1,317.6500 (0.99)      18158           1
test_benchmark_add_and_use_instance             13,828.2776 (inf)      64,134.5978 (3.36)     15,705.1898 (20.88)    3,115.8844 (4.98)     15,020.3705 (15.75)      0.0000 (1.0)      784;7383       63.6732 (0.05)      21291           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

my branch (without fiendish devilry):
------------------------------------------------------------------------------------------------------------ benchmark: 4 tests -----------------------------------------------------------------------------------------------------------
Name (time in ns)                                       Min                     Max                   Mean                StdDev                 Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_call_multiple_dispatch[val0]          0.0000 (1.0)       20,027.1606 (1.05)        839.2593 (1.15)       503.8814 (1.0)         953.6743 (1.0)        0.0000 (1.0)    13809;25941    1,191.5269 (0.87)      76261           1
test_benchmark_call_single_dispatch[1]               0.0000 (1.0)       19,073.4863 (1.0)         732.0398 (1.0)        555.6902 (1.10)        953.6743 (1.0)      953.6743 (inf)      16774;37    1,366.0460 (1.0)       58255           1
test_benchmark_call_single_dispatch[a]               0.0000 (1.0)       22,172.9279 (1.16)        734.9983 (1.00)       636.3345 (1.26)        953.6743 (1.0)      953.6743 (inf)       5650;19    1,360.5474 (1.00)      19600           1
test_benchmark_add_and_use_instance             15,974.0448 (inf)      118,970.8710 (6.24)     17,288.2343 (23.62)    2,810.4606 (5.58)     16,927.7191 (17.75)    953.6743 (inf)      503;1002       57.8428 (0.04)      19240           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

pytypes branch:
--------------------------------------------------------------------------------------------------- benchmark: 4 tests --------------------------------------------------------------------------------------------------
Name (time in us)                                    Min                 Max                Mean             StdDev              Median               IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_call_single_dispatch[a]            6.9141 (1.0)       71.0487 (2.03)       8.6218 (1.0)       2.1196 (1.08)       8.1062 (1.0)      0.9537 (1.0)       716;716      115.9847 (1.0)       19973           1
test_benchmark_call_single_dispatch[1]            7.8678 (1.14)      35.0475 (1.0)        8.8847 (1.03)      1.9700 (1.0)        8.8215 (1.09)     0.9537 (1.0)       487;487      112.5533 (0.97)      16913           1
test_benchmark_call_multiple_dispatch[val0]      14.7820 (2.14)      61.9888 (1.77)      16.5979 (1.93)      2.6539 (1.35)      15.9740 (1.97)     0.9537 (1.0)       515;619       60.2485 (0.52)      18158           1
test_benchmark_add_and_use_instance             222.9214 (32.24)    526.9051 (15.03)    242.9784 (28.18)    36.6928 (18.63)    231.9813 (28.62)    9.0599 (9.50)      276;605        4.1156 (0.04)       4203           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Note: I have no idea why the min and median are messed up in py2.

This convinces me that my weird class swapping is likely not worth it. It also makes me very concerned about the pytypes branch, note that the units are in us for those benchmarks, but the others are in ns. Blaze does many millions of dispatches in the course of a real computation, so I would not be comfortable with this level of slowdown.

llllllllll · 2018-02-23T19:15:43Z

Any outstanding blockers on merging this?

mrocklin

A couple tiny comments. Happy to merge regardless though. Thank you for doing this and for the ping.

mrocklin · 2018-02-23T22:35:44Z

multipledispatch/dispatcher.py

+        ordering = getattr(self, '_ordering', None)
+        if ordering is None:
+            ordering = self.reorder()
+        return ordering


I suspect that the following is marginally faster:

try: return self._ordering except AttributeError: return self.reorder()

Asking foregiveness tends to introduce less overhead in Python than asking permission in the common case where permission is usually granted.

This may not be relevant though. I haven't done benchmarks in this particular case.

mrocklin · 2018-02-23T22:36:26Z

multipledispatch/dispatcher.py

+        amb = ambiguities(self.funcs)
+        if amb:
+            on_ambiguity(self, amb)
+        return od


Why the temporary od variable here? Perhaps just return self._ordering?

This is actually slightly faster because we don't need to go through the attribute lookup again. Not a big deal on individual calls but across all multiple dispatch calls it can add up slightly.

llllllllll

When I had last profiled this on 2.7, the cost of entering the try/except was actually pretty high because it pushes and pops from the block-stack. I reran this test on 3.6 and the results have flipped, the try/except was faster. I will clean that up

llllllllll · 2018-02-23T23:53:12Z

multipledispatch/dispatcher.py

+        amb = ambiguities(self.funcs)
+        if amb:
+            on_ambiguity(self, amb)
+        return od


This is actually slightly faster because we don't need to go through the attribute lookup again. Not a big deal on individual calls but across all multiple dispatch calls it can add up slightly.

llllllllll · 2018-02-26T10:45:11Z

updated

mrocklin · 2018-02-26T13:16:33Z

This is in. Thanks for following up on this @llllllllll

#66) from README (#79)

See mrocklin/multipledispatch#66

mrocklin reviewed Nov 16, 2017

View reviewed changes

llllllllll force-pushed the optimize-setup branch from 8eba021 to 13e7fbc Compare December 27, 2017 23:10

llllllllll closed this Feb 2, 2018

llllllllll reopened this Feb 2, 2018

Joe Jevnik added 2 commits February 2, 2018 15:15

ENH: optimize import times

Verified

This commit was signed with the committer’s verified signature.

scala-steward Scala Steward

GPG key ID: 7FE176635A140508

Verified
Learn about vigilant mode

1f2bf9d

MAINT: remove fiendish devilry

e6539ae

llllllllll force-pushed the optimize-setup branch from 13e7fbc to e6539ae Compare February 2, 2018 20:23

mrocklin reviewed Feb 23, 2018

View reviewed changes

llllllllll mentioned this pull request Feb 25, 2018

Hooks for XArray operations pydata/xarray#1938

Open

llllllllll commented Feb 25, 2018

View reviewed changes

MAINT: replace getattr with try/except

1e892aa

mrocklin merged commit f1146be into mrocklin:master Feb 26, 2018

mrocklin pushed a commit that referenced this pull request Mar 31, 2018

Remove mention of halt_oredering and restart ordering (deprecated by PR

ec1dca8

#66) from README (#79)

richafrank added a commit to quantopian/zipline that referenced this pull request Sep 11, 2018

PERF: Bump multipledispatch for faster import times

19cc508

See mrocklin/multipledispatch#66

richafrank mentioned this pull request Sep 11, 2018

Bump libs quantopian/zipline#2292

Merged

taoxuau pushed a commit to taoxuau/zipline that referenced this pull request Aug 1, 2019

PERF: Bump multipledispatch for faster import times

4d5b678

See mrocklin/multipledispatch#66

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: optimize import times #66

ENH: optimize import times #66

llllllllll commented Oct 24, 2017

mrocklin commented Oct 24, 2017

llllllllll commented Oct 24, 2017

llllllllll commented Nov 15, 2017

mrocklin Nov 16, 2017

llllllllll Nov 16, 2017

mrocklin Nov 16, 2017

mrocklin Nov 16, 2017

mrocklin Dec 27, 2017

mrocklin commented Nov 16, 2017

llllllllll commented Dec 27, 2017

mrocklin commented Dec 27, 2017

llllllllll commented Dec 27, 2017

mrocklin commented Dec 27, 2017

llllllllll commented Dec 27, 2017

mariusvniekerk commented Feb 2, 2018

llllllllll commented Feb 2, 2018 •

edited

Loading

llllllllll commented Feb 23, 2018

mrocklin left a comment

mrocklin Feb 23, 2018

mrocklin Feb 23, 2018

llllllllll Feb 23, 2018

llllllllll left a comment

llllllllll Feb 23, 2018

llllllllll commented Feb 26, 2018

mrocklin commented Feb 26, 2018

ENH: optimize import times #66

ENH: optimize import times #66

Conversation

llllllllll commented Oct 24, 2017

mrocklin commented Oct 24, 2017

llllllllll commented Oct 24, 2017

llllllllll commented Nov 15, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mrocklin commented Nov 16, 2017

llllllllll commented Dec 27, 2017

mrocklin commented Dec 27, 2017

llllllllll commented Dec 27, 2017

mrocklin commented Dec 27, 2017

llllllllll commented Dec 27, 2017

mariusvniekerk commented Feb 2, 2018

llllllllll commented Feb 2, 2018 • edited Loading

Python 3.6

Python 2.7

llllllllll commented Feb 23, 2018

mrocklin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

llllllllll left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

llllllllll commented Feb 26, 2018

mrocklin commented Feb 26, 2018

llllllllll commented Feb 2, 2018 •

edited

Loading