Speedup DimShuffle and Reshape in C and Numba backends #1226
base: main
Conversation
Force-pushed from 4e37832 to 8fb41d6
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@ Coverage Diff @@
##             main    #1226   +/-   ##
=======================================
  Coverage   82.01%   82.01%
=======================================
  Files         188      188
  Lines       48561    48562      +1
  Branches     8679     8679
=======================================
+ Hits        39826    39827      +1
  Misses       6572     6572
  Partials     2163     2163
```
Force-pushed from 8fb41d6 to ebc4263
This was caused by 223ee15, which used the generic `PyArray_IntpConverter` to convert the shape numpy vector into a simple C-array for the Reshape operation. There seems to be no need for this change, as the strides were already used correctly. Profiling suggests the previous changes caused a 7.5x slowdown; the benchmark detects only a 2.3x slowdown due to the PyTensor call overhead.
Force-pushed from ebc4263 to b24ee8b
Force-pushed from b24ee8b to 94c84f8
Introduced in e593b0a due to a bug when inputs had zero strides. The bug can be fixed just by removing a block that assumed some `full`/`broadcasting` behavior by the operation, which is not something DimShuffle does.
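A small illustration (my own, not from the PR) of what a zero-stride input looks like: broadcasting yields arrays whose broadcast axis has stride 0, and a view-based DimShuffle only permutes shape and strides, so such inputs need no special-casing.

```python
import numpy as np

# Broadcasting a (3,) vector to (4, 3) gives the new axis a zero stride.
x = np.broadcast_to(np.arange(3), (4, 3))
assert x.strides[0] == 0  # the broadcast axis reuses the same memory

# A DimShuffle-style transpose just permutes shape and strides, so the
# zero stride passes through unchanged.
y = x.transpose(1, 0)
assert y.shape == (3, 4)
assert np.array_equal(y, np.stack([np.arange(3)] * 4).T)
```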
Force-pushed from 6c93750 to c243f36
```python
@numba_basic.numba_njit(inline="always")
def dimshuffle(x):
    return dimshuffle_inner(np.asarray(x), shuffle)
```

```python
return as_strided(x, shape=new_shape, strides=new_strides)
```
Is it correct that, since we only allow changing the order of the axes and don't introduce new axes or change the total number of elements, this never produces arrays where the same location in memory is used multiple times?
I think it would be a bad idea to introduce cases where that could happen. The as_strided docs also contain a warning note about this: https://numpy.org/devdocs/reference/generated/numpy.lib.stride_tricks.as_strided.html#numpy.lib.stride_tricks.as_strided
If that is in fact not an issue here, I think we should add a small note in the code pointing out why it's not a problem.
This does produce multiple arrays that use the same memory, and it does introduce/remove axes. That's why the Op is marked as a view Op, to signal it can't be mutated if the original input is needed elsewhere or is protected (i.e., provided by the user).
We don't create self-overlapping memory, and we use the original strides, which is what the docs recommend. The user could provide such an input, but that would break any inplace rewrite regardless of this Op.
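A quick sketch (mine, not from the PR) of why permuting the input's own strides cannot create self-overlapping memory: the index-to-offset map remains injective, so every element keeps a distinct byte offset.

```python
import numpy as np
from itertools import product
from numpy.lib.stride_tricks import as_strided

x = np.arange(6, dtype=np.int64).reshape(2, 3)

# A transpose expressed as a permutation of the original shape and strides.
view = as_strided(x, shape=(3, 2), strides=(x.strides[1], x.strides[0]))
assert np.array_equal(view, x.T)

# Every index maps to a distinct byte offset, so no element aliases another.
offsets = {
    sum(i * s for i, s in zip(idx, view.strides))
    for idx in product(*(range(d) for d in view.shape))
}
assert len(offsets) == view.size
```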
```c
Py_DECREF(transposed_input);
// Create the new array.
*res = (PyArrayObject*)PyArray_New(&PyArray_Type, nd_out, dimensions,
```
I'm not really used to writing Python C API code, so I could be very wrong, but shouldn't the decref of *res be just before this here? Otherwise we might be dropping an array that the caller still wants to use?
And the incref on the input should only happen after we check that the new array creation was successful?
The decref was always like this. I think PyTensor preallocates the output (with None, or with the last call if gc is turned off), so if you don't need it you can just decref it immediately?
You were saying something else... We don't use the original res anywhere? Why does it matter when we decref?
Because we have some checks between the call to decref and when we change the output pointer.
And in those checks we might exit early and never change the output pointer.
So, I think we should either:
- decref the thing pointed to by the output pointer and then change the output pointer to something else that we either increfed or created ourselves
- Write our results into the thing pointed to by the output pointer.
Mixing them doesn't sound like a good idea...
But we never care about the original output, even if we exit early, no?
I'll check again, but I'd be surprised if this were problematic, as it has been like this since early Theano.
I couldn't find the code that calls this function yet, so I don't know for sure what happens. But I'd expect something like the following to happen:
- Generate the output array (or incref it if it already exists).
- Call the function that does all the work.
- Check the return code. If the function failed, decref the outputs.
Now, if we return a failure code but decref the outputs ourselves, that'd be a double free, which could potentially segfault.
I guess it's also possible that the caller doesn't decref, but then I think we'd have a memory leak in those cases where we return an error but don't call decref ourselves.
Okay so are you saying I should move the decref/incref after the new size check?
TLDR: Make views cheap again!
All backends

- Remove the unused `inplace` option; less branching is always helpful.
Python backend

Implementation based on `as_strided` that mimics the old (and now current again) C-backend approach. This is debatable, since the Python implementation has historically been there more for readability than performance. However, it's useful for fast-compiling graphs not to suck.

Numba backend
Use `as_strided`, which simplifies the implementation by a ton, with the added benefit that the example with non-contiguous arrays in #1111 has no penalty now: a 10,000x speedup (the example is exaggerated with a very large array).
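As a rough sketch of the approach (the helper name is mine, not from the PR), a dimshuffle pattern reduces to a single shape/strides computation followed by `as_strided`, where `'x'` entries become length-1 axes with stride 0:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def dimshuffle_view(x, pattern):
    """Hypothetical helper: apply a DimShuffle pattern as a zero-copy view.

    Each pattern entry is an input axis index, or 'x' for a new
    broadcastable axis (length 1, stride 0).
    """
    new_shape = tuple(1 if p == "x" else x.shape[p] for p in pattern)
    new_strides = tuple(0 if p == "x" else x.strides[p] for p in pattern)
    return as_strided(x, shape=new_shape, strides=new_strides)

x = np.arange(6).reshape(2, 3)
y = dimshuffle_view(x, (1, "x", 0))  # transpose, then expand_dims in the middle
assert y.shape == (3, 1, 2)
assert np.shares_memory(x, y)
assert np.array_equal(y[:, 0, :], x.T)
```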
C backend

This was the original goal of this PR. There were two regressions, caused by e593b0a and 223ee15.
The DimShuffle changes introduced a `PyArray_IntpConverter` to compute a new shape, and used two calls to numpy C-functions: `PyArray_Transpose` and `PyArray_NewShape`. I suspect the slowdown comes mostly from the introduction of `PyArray_IntpConverter`, but I couldn't find anything wrong with the simpler logic from Theano times, other than the bug that motivated the changes. That bug had to do with a useless second pass on the dimension calculations, which is baffling in that DimShuffle does not do any broadcasting behavior other than expand_dims, which the first pass already handles correctly. Removing this loop fixes the original bug.
The Reshape changes were simpler: they introduced a generic `PyArray_IntpConverter` to convert the shape numpy vector into a simple C-array for the Reshape operation. The concern about strides in the commit message doesn't hold, because they were being used directly. I added direct tests for odd strides just in case.

In the long term, Reshape should take the separate shape entries as scalars instead of a single vector. The user never defines a graph with a vector anyway, but with isolated entries, so the whole packing and unpacking is just overhead. This is tracked in #881.
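To make the odd-strides point concrete, here is a minimal check (my own, not the PR's test) that reshaping a non-contiguous view reads elements through the view's strides correctly:

```python
import numpy as np

base = np.arange(24).reshape(4, 6)
x = base[::2, ::3]  # non-contiguous view: picks rows 0,2 and columns 0,3
assert not x.flags["C_CONTIGUOUS"]

# Reshape must walk the view's strides (copying if needed), so the result
# matches reshaping an explicit contiguous copy.
y = x.reshape(4)
assert np.array_equal(y, x.copy().reshape(4))
assert np.array_equal(y, [0, 3, 12, 15])
```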
Profiling a simple function suggests the previous changes caused an 8.6-11x slowdown per Op call for the DimShuffle operation and a 5.4x slowdown for the Reshape operation. The new benchmarks detect only a 3.6x and 2.3x slowdown, respectively, due to the PyTensor call overhead.