Discussed in #428

Originally posted by mdboom July 15, 2022
I'm starting to wonder what it would take to extend adaptive quickening to a deeply-polymorphic data structure library such as Numpy. Some of this started in discussions with @markshannon so I won't take all the credit/blame :)
In some sense, this is a further generalization of the original idea proposed in Allow custom `BINARY_OP`.
Unlike that proposal, which is still based on Python types, a general "dispatch to specialized function calls based on the Python types of the arguments" wouldn't be very helpful for Numpy. Numpy mostly uses a single Python type, `ndarray`, and most function calls dispatch to a specific C implementation based on the `ndarray`'s `dtype` (a C-like numeric type, but it could also be a struct), stride, and shape. This mapping is very specific, and something only Numpy can really address, but it has similar properties: at a given call site it would be typical to see the same `dtype`/stride/shape triple appearing over and over again, so recalculating how to perform the computation on every call is redundant work.
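To make that "same triple at the same call site" observation concrete, here's a toy pure-Python sketch of caching the dispatch decision per call site. This is purely illustrative: `resolve_loop` and `CallSiteCache` are invented names, and this bears no relation to NumPy's actual ufunc dispatch machinery.

```python
import numpy as np

def resolve_loop(x, y):
    # Stand-in for the (relatively expensive) dispatch step: pick an
    # "inner loop" based on dtype / shape / strides. Real NumPy walks
    # ufunc type-resolution tables; here we just return np.multiply.
    return np.multiply

class CallSiteCache:
    """Toy model of caching the dispatch decision at one call site."""
    def __init__(self):
        self.key = None
        self.loop = None

    def multiply(self, x, y):
        key = (x.dtype, x.shape, x.strides, y.dtype, y.shape, y.strides)
        if key != self.key:            # first sight (or the triple changed)
            self.key = key
            self.loop = resolve_loop(x, y)
        return self.loop(x, y)         # fast path: reuse the cached loop

site = CallSiteCache()
x = np.ones(10) * 2
y = np.ones(10) * 4
print(site.multiply(x, y))  # dispatch resolved once, reused on later calls
```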
It's probably not worth discussing implementation details yet, but here are my hand-wavy assumptions about what might be possible:
- We introduce a new `PyMethodDef` flag called `METH_SPECIALIZABLE`, which has extra parameters to handle a new quickening API. There would also need to be some metadata specifying how many cache entries the method needs, so space could be allocated in the bytecode.
- When the `CALL_ADAPTIVE` bytecode specializes and finds a `METH_SPECIALIZABLE` callable, it passes an argument to the method telling it to also specialize itself based on the current arguments. The method could fill in a buffer of cache entries, one of which, in most cases, would be a function pointer to a more specialized call.
- Subsequent calls to the method would go directly to the more specialized path. There would also be an API to unquicken if the type assumptions no longer hold. (A toy model of this flow is sketched right after this list.)
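Here is a pure-Python toy model of how that interaction could work. Everything here is made up for illustration — the names (`AdaptiveCallSite`, `SpecializableMethod`), the warm-up count, and the guard/unquicken mechanics — since the real thing would live in the C interpreter loop with the cache entries stored inline in the bytecode.

```python
class SpecializableMethod:
    """Toy stand-in for a METH_SPECIALIZABLE callable: it can hand the
    interpreter a version of itself specialized for the arguments it saw."""
    def __call__(self, x, y):
        return x * y  # generic path: figure everything out on each call

    def specialize(self, x, y):
        expected = (type(x), type(y))
        def specialized(a, b):
            if (type(a), type(b)) != expected:
                raise TypeError("guard failed")  # signal: assumptions broke
            return a * b                         # skip the generic dispatch
        return specialized

class AdaptiveCallSite:
    """Toy model of an adaptive call site: warm up, then cache the
    specialized function "pointer" returned by the callable."""
    WARMUP = 8

    def __init__(self, func):
        self.func = func
        self.calls = 0
        self.cached = None  # stand-in for a cache entry in the bytecode

    def call(self, x, y):
        if self.cached is not None:
            try:
                return self.cached(x, y)   # quickened fast path
            except TypeError:
                self.cached = None         # unquicken: assumptions no longer hold
        self.calls += 1
        if self.calls >= self.WARMUP and hasattr(self.func, "specialize"):
            self.cached = self.func.specialize(x, y)
        return self.func(x, y)

site = AdaptiveCallSite(SpecializableMethod())
for _ in range(10):
    site.call(3, 4)        # after 8 calls the site takes the specialized path
print(site.call(3.0, 4.0)) # guard fails, site unquickens and re-specializes
```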
**Opportunity sizing:**
I've run one simple experiment to see if this would be worth the effort. I hacked Numpy so that the results of all of the delegation within the `np.multiply` function are "cached" to a static variable after 8 calls, and from then on, only the core computation is performed each time (see my `HACK-cache` branch). Then I timed how long it takes to multiply two arrays of different sizes together:
`timeit.timeit("np.multiply(x, y)", f"import numpy as np; x = np.ones({size}) * 2; y = np.ones({size}) * 4", number=int(1_000))`
This gives us a sort of "best case scenario" if the scheme outlined above is even possible. This obviously ignores the overhead of the specialization / unspecialization checks.
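For reference, a self-contained version of that timing sweep might look like the following (the sizes are the ones in the table below; the "caching fork" column of course requires building the `HACK-cache` branch of Numpy):

```python
import timeit

# Array sizes matching the table below; number=1_000 as in the snippet above.
for size in (1, 10, 100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000):
    elapsed = timeit.timeit(
        "np.multiply(x, y)",
        setup=f"import numpy as np; x = np.ones({size}) * 2; y = np.ones({size}) * 4",
        number=1_000,
    )
    print(f"{size:>10,}  {elapsed:.8f} s")
```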
| Array size | Numpy upstream (s) | Caching fork (s) | Speedup |
| ---------: | -----------------: | ---------------: | ------: |
| 1 | 0.00064420 | 0.00026666 | 2.42x |
| 10 | 0.00064755 | 0.00027479 | 2.36x |
| 100 | 0.00073137 | 0.00034967 | 2.09x |
| 1000 | 0.00147700 | 0.00102378 | 1.44x |
| 10000 | 0.00908844 | 0.00855760 | 1.06x |
| 100000 | 0.10119196 | 0.09983327 | 1.01x |
| 1000000 | 1.16631011 | 1.34759164 | 0.87x |
| 10000000 | 22.09005169 | 15.67308645 | 1.41x |
This seems really promising for small arrays where the dispatch calculation dominates. These are an important class of operations -- "scalar" arrays are really common in real-world code. It certainly would be reasonable for Numpy to not quicken when the arrays are larger than a certain threshold.
The next experiment I hope to run is to take some real-world benchmarks and see how often call sites are non-polymorphic as a way to estimate how often we could expect specific call sites to be quickened.