
Speedup DimShuffle and Reshape in C and Numba backends #1226

Draft · ricardoV94 wants to merge 5 commits into base `main`
Conversation

ricardoV94 (Member) commented on Feb 20, 2025:

TLDR: Make views cheap again!

All backends

Remove the unused `inplace` option; less branching is always helpful.

Python backend

Implementation based on `as_strided` that mimics the old (and now restored) C-backend approach. This is debatable, since the Python implementation has historically been there more for readability than performance. However, it's useful for fast-compiling graphs not to be painfully slow.

Numba backend

Use `as_strided`, which simplifies the implementation by a ton, with two benefits:

  1. Much faster compilation / caching. The new benchmark runs in 1.6s (including many runs) after caching, vs 4.6s before. When checking a simple function directly, I saw wall times go down from 1.2s to 600ms (first time after starting the interpreter) and from 300ms to 100ms in subsequent runs. This seems to scale with the number of DimShuffles in the graph.
  2. Slightly faster for contiguous arrays: the new benchmark is roughly 1.1-1.3x faster.
```python
import pytensor
import pytensor.tensor as pt
import numpy as np

x = pt.tensor("x", shape=(1, 2, 3))
y = x.dimshuffle(2, "x", 1, "x")

out = pytensor.function([pytensor.In(x, borrow=True)], pytensor.Out(y, borrow=True), mode="NUMBA")
out.trust_input = True

x_test = np.zeros((1, 2, 3))
out(x_test)
%timeit out(x_test)
```
The example with non-contiguous arrays in #1111 now has no penalty, which amounts to a 10,000x speedup (the example is exaggerated, with a very large array).
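For intuition, the whole DimShuffle above can be expressed as a pure stride trick, which is why it stays cheap even for non-contiguous inputs. A hypothetical standalone NumPy sketch (`dimshuffle_view` is an illustrative name, not the actual PyTensor implementation):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def dimshuffle_view(x, new_order):
    # "x" entries insert a new length-1 axis; integers pick input axes.
    shape = tuple(1 if d == "x" else x.shape[d] for d in new_order)
    # A length-1 axis can safely use stride 0; real axes reuse x's strides.
    strides = tuple(0 if d == "x" else x.strides[d] for d in new_order)
    return as_strided(x, shape=shape, strides=strides)

x = np.arange(6.0).reshape(1, 2, 3)
y = dimshuffle_view(x, (2, "x", 1, "x"))
assert y.shape == (3, 1, 2, 1)
assert np.shares_memory(x, y)           # a view: no data is copied
assert (y[:, 0, :, 0] == x[0].T).all()  # same elements, reordered
```

Since only the shape/strides metadata changes, the cost is independent of the array's size or contiguity.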

C backend

This was the original goal of this PR.

There were two regressions caused by e593b0a and 223ee15.

The DimShuffle changes introduced a `PyArray_IntpConverter` to compute the new shape and used two calls to the NumPy C functions `PyArray_Transpose` and `PyArray_NewShape`. I suspect the slowdown comes mostly from the introduction of `PyArray_IntpConverter`, but I couldn't find anything wrong with the simpler logic from Theano times, other than...

The bug that motivated the changes had to do with a useless second pass in the stride calculations:

```c
/* set the strides of the broadcasted dimensions.
 * This algorithm is from numpy: PyArray_Newshape() in
 * cvs/numpy/numpy/core/src/multiarraymodule.c */
if (nd_out > 0) {
    if (strides[nd_out - 1] == 0)
        strides[nd_out - 1] = PyArray_DESCR(basename)->elsize;
    for (npy_intp i = nd_out - 2; i > -1; --i) {
        if (strides[i] == 0)
            strides[i] = strides[i + 1] * dimensions[i + 1];
    }
}
```

This is baffling, because DimShuffle does no broadcasting other than expand_dims, which the first pass already handles correctly. Removing this loop fixes the original bug.
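To see why the second pass was harmful rather than merely useless, consider an input that legitimately has a zero stride (a broadcasted view). A hypothetical NumPy sketch of the two passes (assumed simplification, not the actual C code):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# An input whose leading axis is broadcast, i.e. legitimately has stride 0.
x = np.broadcast_to(np.arange(3.0), (4, 3))
assert x.strides == (0, x.itemsize)

# First pass (the logic that is kept): a transpose is just a permutation
# of the input's shape and strides, zero strides included.
y = as_strided(x, shape=(3, 4), strides=(x.strides[1], x.strides[0]))
assert (y == x.T).all()

# The removed second pass assumed any zero stride marked a broadcast/new
# axis and overwrote it with a "contiguous" value, clobbering the input's
# legitimate zero stride and producing a wrong view:
strides = list(y.strides)                  # [8, 0] for float64
if strides[-1] == 0:
    strides[-1] = y.itemsize
for i in range(len(strides) - 2, -1, -1):
    if strides[i] == 0:
        strides[i] = strides[i + 1] * y.shape[i + 1]
assert tuple(strides) != y.strides         # no longer describes the transpose
```

With expand_dims the new length-1 axis also gets stride 0 in the first pass, which is already valid: a size-1 dimension is never actually stepped over, so its stride value is irrelevant.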

The Reshape changes were simpler: they introduced a generic `PyArray_IntpConverter` to convert the shape NumPy vector into a plain C array for the Reshape operation.

The concern about strides in the commit message doesn't hold, because the strides were being used correctly. I added a direct test for odd strides just in case.
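A sketch of the kind of odd-strides check described (hypothetical, not the PR's actual test): build an input with a non-unit stride and compare the reshape result against a contiguous copy:

```python
import numpy as np

base = np.arange(24.0)
x = base[::3]                    # "odd" strides: every third element
assert x.strides[0] == 3 * base.itemsize

y = x.reshape(2, 4)              # reshape must honor the strides (copies here)
expected = np.ascontiguousarray(x).reshape(2, 4)
assert (y == expected).all()
```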

In the long term, Reshape should take the shape entries as separate scalars instead of a single vector. The user never defines a graph with a vector anyway, but with isolated entries, so the whole packing and unpacking is just overhead. This is tracked in #881.

Profiling a simple function suggests the previous changes caused an 8.6-11x slowdown per Op call for the DimShuffle operation and a 5.4x slowdown for the Reshape operation. The new benchmark detects only 3.6x and 2.3x slowdowns, respectively, because the PyTensor call overhead masks part of the difference.

## Op
### DimShuffle (8.6x slowdown)
Before: 1.74e-06s per call (class)
After: 2.02e-07s per call

### Reshape (5.4x slowdown)
Before: 1.50e-06s per call (class)
After: 2.78e-07s per call


codecov bot commented Feb 20, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.01%. Comparing base (3cdcfde) to head (94c84f8).
Report is 2 commits behind head on main.


@@           Coverage Diff           @@
##             main    #1226   +/-   ##
=======================================
  Coverage   82.01%   82.01%           
=======================================
  Files         188      188           
  Lines       48561    48562    +1     
  Branches     8679     8679           
=======================================
+ Hits        39826    39827    +1     
  Misses       6572     6572           
  Partials     2163     2163           
| Files with missing lines | Coverage Δ |
|---|---|
| pytensor/tensor/elemwise.py | 89.04% <100.00%> (+0.04%) ⬆️ |
| pytensor/tensor/shape.py | 90.10% <100.00%> (-0.05%) ⬇️ |

This was caused by 223ee15, which used the generic `PyArray_IntpConverter` to convert the shape NumPy vector into a plain C array for the Reshape operation. There seems to be no need for this change, as the strides were used correctly.

Profiling suggests the previous changes caused a 7.5x slowdown. The benchmark detects only a 2.3x slowdown due to the PyTensor call overhead.
ricardoV94 changed the title from "Revert regression in Reshape c-impl speed" to "Revert regression in DimShuffle and Reshape C-impl speed" on Feb 20, 2025
ricardoV94 marked this pull request as ready for review on February 20, 2025 14:23
ricardoV94 changed the title from "Revert regression in DimShuffle and Reshape C-impl speed" to "Speedup DimShuffle and Reshape in C and Numba backends" on Feb 21, 2025
ricardoV94 marked this pull request as draft on February 21, 2025 12:43
Introduced in e593b0a due to a bug when inputs had zero strides.

The bug can be fixed just by removing a block that assumed some `full`/`broadcasting` behavior by the operation, which does not happen with DimShuffle.
```python
@numba_basic.numba_njit(inline="always")
def dimshuffle(x):
    return dimshuffle_inner(np.asarray(x), shuffle)
```

```python
return as_strided(x, shape=new_shape, strides=new_strides)
```
Member:
Is it correct that, since we only allow changing the order of the axes and don't introduce new axes or change the total number of elements, this never produces arrays where the same location in memory is used multiple times?
I think it would be a bad idea to introduce cases where that could happen. The as_strided docs also contains a warning note about this: https://numpy.org/devdocs/reference/generated/numpy.lib.stride_tricks.as_strided.html#numpy.lib.stride_tricks.as_strided

If that is in fact not an issue here, I think we should add a small note in the code here pointing out why that's not a problem?

ricardoV94 (Member Author):
This does produce multiple arrays that use the same memory, and it does introduce/remove axes. That's why the Op is marked as a view op, to signal that it can't be mutated if the original input is needed elsewhere or protected (i.e. provided by the user).

ricardoV94 (Member Author) commented on Feb 22, 2025:
We don't create self-overlapping memory, and we use the original strides, which is what the docs recommend. A user could provide such an input themselves, but that would break any inplace rewrite regardless of this Op.
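To make the view semantics concrete, a standalone NumPy sketch (not PyTensor code) of why such an Op must be flagged as a view op: writing through the strided view mutates the original input.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

x = np.arange(6.0).reshape(2, 3)
# Equivalent of x.dimshuffle(1, "x", 0): permute the strides and put
# stride 0 only on the new length-1 axis.
y = as_strided(x, shape=(3, 1, 2), strides=(x.strides[1], 0, x.strides[0]))
assert np.shares_memory(x, y)    # same buffer, no copy

y[0, 0, 0] = -1.0                # writing through the view...
assert x[0, 0] == -1.0           # ...mutates the original input
```

Because every non-degenerate output axis maps to a distinct input axis with its original stride, the view addresses each input element exactly once, which is the self-overlap concern from the `as_strided` docs.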


```c
Py_DECREF(transposed_input);
// Create the new array.
*res = (PyArrayObject*)PyArray_New(&PyArray_Type, nd_out, dimensions,
```
Member:
I'm not really used to writing Python C API code, so I could be very wrong, but shouldn't the decref of `*res` be just before this line? Otherwise we might be dropping an array that the caller still wants to use.
And shouldn't the incref on the input only happen after we check that the new array creation was successful?

ricardoV94 (Member Author):
The decref was always like this. I think PyTensor preallocates the output (with None, or with the last call's result if gc is turned off), so if you don't need it you can just decref it immediately?

ricardoV94 (Member Author) commented on Feb 22, 2025:
You were saying something else... We don't use the original `*res` anywhere, so why does it matter when we decref?

Member:
Because we have some checks between the call to decref and the point where we change the output pointer.
And in those checks we might exit early and never change the output pointer.

aseyboldt (Member) commented on Feb 22, 2025:
So, I think we should either:

  • decref the thing pointed to by the output pointer, then change the output pointer to something else that we either increfed or created ourselves, or
  • write our results into the thing pointed to by the output pointer.

Mixing the two doesn't sound like a good idea...

ricardoV94 (Member Author):
But we never care about the original output, even if we exit early, no?

ricardoV94 (Member Author):

I'll check again, but I'd be surprised if this was problematic, as it has been like this since early Theano.

Member:
I couldn't find the code that calls this function yet, so I don't know for sure what happens. But I'd expect that usually something like the following happens:

1. Generate the output array (or incref it if it already exists).
2. Call the function that does all the work.
3. Check the return code. If the function failed, decref the outputs.

Now, if we return a failure code but decref the outputs ourselves, that'd be a double free and could potentially segfault.
It's also possible that the caller doesn't decref on failure, but then we'd have a memory leak in the cases where we return an error without calling decref ourselves.

ricardoV94 (Member Author):
Okay, so are you saying I should move the decref/incref after the new size check?
