
Allow custom quickening of calls #432

Closed
mdboom opened this issue Jul 26, 2022 Discussed in #428 · 0 comments


mdboom commented Jul 26, 2022

Discussed in #428

Originally posted by mdboom July 15, 2022
I'm starting to wonder what it would take to extend adaptive quickening to a deeply-polymorphic data structure library such as Numpy. Some of this started in discussions with @markshannon so I won't take all the credit/blame :)

In some sense, this is a further generalization of the original idea proposed in Allow custom BINARY_OP.

Unlike that proposal, which is still based on Python types, a general "dispatch to specialized function calls based on the Python types of the arguments" wouldn't be very helpful for Numpy. Numpy uses essentially a single Python type, ndarray, and most function calls dispatch to a specific C function implementation based on the ndarray's dtype (a C-like numeric type, but possibly also a struct), strides, and shape. This mapping is very specific, and something only Numpy can really address, but it has a similar property: at a given call site, the same dtype/stride/shape triple typically appears over and over again, so recalculating how to perform the computation on every call is redundant work.
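To make that concrete, here's a rough Python sketch of the idea (this is not NumPy's actual dispatch machinery, and `CallSiteCache` and `_select_loop` are made-up names): a per-call-site cache keyed on the dtype/strides/shape of the operands, so that repeated calls with the same triple skip the dispatch work.

```python
import numpy as np

class CallSiteCache:
    """Hypothetical per-call-site cache for ufunc dispatch."""

    def __init__(self):
        self.key = None         # (dtype, strides, shape) seen on the last call
        self.inner_loop = None  # kernel selected for that key

    def multiply(self, x, y):
        key = (x.dtype, x.strides, x.shape, y.dtype, y.strides, y.shape)
        if key != self.key:
            # Slow path: redo the full dispatch and remember the result.
            self.key = key
            self.inner_loop = self._select_loop(x, y)
        # Fast path: the triple matched, so jump straight to the kernel.
        return self.inner_loop(x, y)

    def _select_loop(self, x, y):
        # Stand-in for NumPy's real inner-loop selection; here we just
        # hand back the generic implementation.
        return np.multiply

site = CallSiteCache()
a = np.ones(4) * 2
b = np.ones(4) * 4
site.multiply(a, b)  # first call: full dispatch, cache filled
site.multiply(a, b)  # same triple: fast path
```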

It's probably not worth discussing implementations yet, but here are my hand-wavy assumptions about what might be possible:

  • We introduce a new PyMethodDef type called METH_SPECIALIZABLE, which has extra parameters to handle a new quickening API. There would also need to be some metadata specifying how many cache entries the method needs, so that space can be allocated in the bytecode.
  • When the CALL_ADAPTIVE bytecode specializes and finds a METH_SPECIALIZABLE callable, it passes an argument telling the method to also specialize itself based on the current arguments. The method could fill in a buffer of cache entries, one of which, in most cases, would be a function pointer to a more specialized call.
  • Subsequent calls to the method would go directly to that more specialized path. There would also be an API to unquicken if the type assumptions no longer hold.
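The flow above can be modeled in plain Python (the names `CallSite`, `Deopt`, and `specialize` are illustrative stand-ins, not the actual CPython C API): the first call asks the callable to fill a cache entry with a specialized entry point, later calls jump straight to it, and a guard failure "unquickens" back to the generic path.

```python
class Deopt(Exception):
    """Raised when a specialization's guard no longer holds."""

def generic_add(x, y):
    return x + y

def specialized_int_add(x, y):
    # Guard: the type assumption recorded at specialization time.
    if type(x) is not int or type(y) is not int:
        raise Deopt
    return x + y

def specialize(func, args):
    # A METH_SPECIALIZABLE-style callable would return a more
    # specialized entry point for these arguments, or None.
    if func is generic_add and all(type(a) is int for a in args):
        return specialized_int_add
    return None

class CallSite:
    """Models one call site's inline cache entry."""

    def __init__(self, func):
        self.func = func
        self.cache = None  # would live in the bytecode's inline cache

    def __call__(self, *args):
        if self.cache is not None:
            try:
                return self.cache(*args)  # quickened fast path
            except Deopt:
                self.cache = None         # unquicken
        # Adaptive path: ask the callable to specialize itself.
        self.cache = specialize(self.func, args)
        return self.func(*args)

site = CallSite(generic_add)
site(1, 2)      # generic path; call site quickens
site(3, 4)      # specialized fast path
site("a", "b")  # guard fails: unquicken, fall back to generic
```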

Opportunity sizing:

I've run one simple experiment to see if this would be worth the effort. I hacked Numpy so that the results of all of the delegation within the np.multiply function are "cached" to a static variable after 8 calls, and from then on, only the core computation is performed each time (see my HACK-cache branch). Then I timed how long it takes to multiply two arrays of different sizes together:

```python
timeit.timeit(
    "np.multiply(x, y)",
    f"import numpy as np; x = np.ones({size}) * 2; y = np.ones({size}) * 4",
    number=1_000,
)
```

This gives us a sort of "best case scenario" if the scheme outlined above is even possible. This obviously ignores the overhead of the specialization / unspecialization checks.
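For reference, a runnable version of that timing loop (absolute numbers will vary by machine, and reproducing the "caching fork" column requires building the HACK-cache branch):

```python
import timeit

# Time 1,000 calls of np.multiply at each array size, mirroring the
# experiment above on stock NumPy.
for size in (1, 10, 100, 1_000, 10_000):
    t = timeit.timeit(
        "np.multiply(x, y)",
        f"import numpy as np; x = np.ones({size}) * 2; y = np.ones({size}) * 4",
        number=1_000,
    )
    print(f"{size:>8}: {t:.8f}")
```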

| Array size | Numpy upstream (s) | Caching fork (s) | Speedup |
| ---------: | -----------------: | ---------------: | ------: |
| 1          | 0.00064420         | 0.00026666       | 2.42x   |
| 10         | 0.00064755         | 0.00027479       | 2.36x   |
| 100        | 0.00073137         | 0.00034967       | 2.09x   |
| 1000       | 0.00147700         | 0.00102378       | 1.44x   |
| 10000      | 0.00908844         | 0.00855760       | 1.06x   |
| 100000     | 0.10119196         | 0.09983327       | 1.01x   |
| 1000000    | 1.16631011         | 1.34759164       | 0.87x   |
| 10000000   | 22.09005169        | 15.67308645      | 1.41x   |

This seems really promising for small arrays where the dispatch calculation dominates. These are an important class of operations -- "scalar" arrays are really common in real-world code. It certainly would be reasonable for Numpy to not quicken when the arrays are larger than a certain threshold.

The next experiment I hope to run is to take some real-world benchmarks and see how often call sites are non-polymorphic as a way to estimate how often we could expect specific call sites to be quickened.
