Another attempt at supporting non-contiguous arrays #171

SyamGadde · 2018-02-21T17:35:47Z

NOTE: SUPERSEDED BY #172.

Inspired by:
https://lists.tiker.net/pipermail/pycuda/2016-December/004986.html
https://gitlab.tiker.net/inducer/pycuda/merge_requests/1

I tried a new approach to supporting non-contiguous arrays in PyCUDA (could be ported to PyOpenCL somewhat easily I think). The goals (some elicited by the above discussion and comments in the WIP) were:

element-wise support for arbitrarily-strided arrays (including negative strides)
backwards compatibility with current uses of get_elwise_module and get_elwise_range_module
limited performance impact on contiguous arrays
avoid use of '*', '%', and '/' in calculating indices, at least within the element-wise loop

The only way I could think of to support all those goals was to delay compilation (and source generation) until call-time, to take advantage of knowledge of the input array strides. Contiguous arrays get the kernels that PyCUDA has always given them; non-contiguous arrays get specialized kernels. The nice thing about doing this is that the actual shape and strides can be sent as '#define's to aid compiler optimization (this could even help with the contiguous kernels, though I have not tried that). The tricky thing is that some functions in the current implementation require the Module/Function before call-time, to get texrefs etc. So I basically implemented a proxy class for SourceModule, called DeferredSourceModule, which also defers the generation of the values created by get_function(), get_texref(), etc. until call-time.
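As a rough illustration of the deferral idea (not the code in this PR), the sketch below builds a kernel source only when the function is first called with a given set of shapes and strides, and caches it under a hashable key. The names DeferredModule, DeferredFunction, resolve and make_source are hypothetical, NumPy arrays stand in for GPUArrays, and a string stands in for the compiled module so that it runs without a GPU.

```python
import numpy as np


class DeferredModule:
    """Hypothetical proxy: nothing is generated until a function is called."""

    def __init__(self, make_source):
        self._make_source = make_source  # callable(key) -> source string
        self._cache = {}                 # key -> generated source
        self.builds = 0                  # number of times a source was generated

    def get_function(self, name):
        # Hand back a placeholder right away; real work happens in __call__.
        return DeferredFunction(self, name)

    def resolve(self, key):
        if key not in self._cache:
            self._cache[key] = self._make_source(key)
            self.builds += 1
        return self._cache[key]


class DeferredFunction:
    def __init__(self, module, name):
        self._module = module
        self._name = name

    def __call__(self, *arrays):
        # The key must be hashable, so shapes and strides stay tuples.
        key = (self._name,) + tuple((a.shape, a.strides) for a in arrays)
        return self._module.resolve(key)


def make_source(key):
    # Stand-in for real source generation and compilation.
    return "// kernel specialized for " + repr(key)


mod = DeferredModule(make_source)
square = mod.get_function("square")       # no source generated yet
a = np.zeros((1000, 100), dtype=np.float32)
square(a)                                 # first build
square(a.copy())                          # same shape/strides: cache hit
square(a[::2, :-1])                       # different strides: second build
```

In this sketch two calls with identical shape and strides share one cached entry, which is the property that keeps the contiguous path cheap after the first call.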

To make this all work, indexing (for non-contiguous arrays at least) for an array A needs to be A[A_i], rather than A[i]. If the module detects matching contiguous arrays as inputs, then each A_i is '#define'd to be i, so kernels using the old method will still work (as long as the input arrays are contiguous and match in strides). No regexes are needed to transform the user-specified kernel fragments; it's all directed by the user. Also, if you want to support non-contiguous arrays, you need to send the actual GPUArray objects, rather than their gpudata members, to the call or prepare_ functions.
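To make the A[A_i] convention concrete, here is a minimal sketch of how a preamble of '#define's could be chosen from the input arrays. It is an assumption-laden illustration, not the PR's generator: the function index_preamble and the macro names X_SHAPE_0 / X_STRIDE_0 are hypothetical, and NumPy arrays stand in for GPUArrays. Only the contiguous case (name_i defined as i) comes directly from the description above.

```python
import numpy as np


def index_preamble(arrays):
    """Return C preprocessor lines for a dict of name -> ndarray.

    Hypothetical helper: if every array is contiguous and all shapes and
    strides match, each NAME_i is just i; otherwise shape and strides
    (in elements) are emitted as compile-time constants.
    """
    values = list(arrays.values())
    first = values[0]
    all_match = all(
        (a.flags.c_contiguous or a.flags.f_contiguous)
        and a.shape == first.shape
        and a.strides == first.strides
        for a in values
    )
    lines = []
    for name, a in arrays.items():
        if all_match:
            lines.append(f"#define {name}_i i")
        else:
            for dim, (extent, stride) in enumerate(zip(a.shape, a.strides)):
                # Macro names below are made up for illustration.
                lines.append(f"#define {name.upper()}_SHAPE_{dim} {extent}")
                lines.append(
                    f"#define {name.upper()}_STRIDE_{dim} {stride // a.itemsize}"
                )
    return "\n".join(lines)


x = np.zeros((4, 5))
y = np.zeros((4, 5))
contiguous_preamble = index_preamble({"x": x, "y": y})
strided_preamble = index_preamble({"x": x[::2], "y": y[::2]})

# A user fragment written against the new convention:
fragment = "y[y_i] = x[x_i] * x[x_i]"
```

With the contiguous preamble, the fragment expands to the same code an old-style `y[i] = x[i] * x[i]` kernel would contain, which is why existing kernels keep working for matching contiguous inputs.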

All existing tests succeed. More would probably need to be added if it made sense to integrate this into PyCUDA.

Positive side-effects: GPUArray.get(), GPUArray.set() and GPUArray.copy() now work for arbitrarily sliced/strided arrays.

The performance hit for contiguous arrays is around 15% for modest-sized arrays (i.e. the 1000x100 array tested by Keegan in the above discussion) and, looking at profiling output, I think the hit is due to detecting contiguity/order (in ElementwiseSourceModule.create_key()). This could probably be improved. Performance for non-contiguous arrays is infinitely better, given that they weren't supported before, but I've seen a 40% slowdown relative to the contiguous version for the b1[::2,:-1]**2 test Keegan tried, due to the need to calculate indices at each iteration of the loop. The generated kernel tries to do this in a smart way, by pre-calculating the per-thread step for each dimension, and only using division/modulo to calculate the starting indices for each thread before the loop.
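The index scheme described above can be sketched on the host in plain Python. This is an illustration of the general technique under stated assumptions, not the kernel code from this PR: the function names strided_offsets and gather_view are hypothetical, strides are given in elements, arrays are traversed in C order, and at least one dimension is assumed. Division and modulo appear only in the one-time setup; inside the loop the per-dimension indices advance by a precomputed step with carry.

```python
import numpy as np


def strided_offsets(shape, strides, thread, total_threads):
    """Yield the element offsets one thread visits (flat index thread,
    thread + total_threads, ...), without div/mod inside the loop."""
    ndim = len(shape)
    size = 1
    for extent in shape:
        size *= extent

    def split(flat):
        # One-time setup: split a flat count into per-dimension digits.
        digits = [0] * ndim
        for k in range(ndim - 1, 0, -1):
            flat, digits[k] = divmod(flat, shape[k])
        digits[0] = flat
        return digits

    idx = split(thread)            # starting multi-index of this thread
    step = split(total_threads)    # per-dimension step between iterations
    offset = sum(i * s for i, s in zip(idx, strides))
    step_off = [d * s for d, s in zip(step, strides)]
    wrap_off = [n * s for n, s in zip(shape, strides)]

    flat = thread
    while flat < size:
        yield offset
        flat += total_threads
        carry = 0
        for k in range(ndim - 1, 0, -1):
            idx[k] += step[k] + carry
            offset += step_off[k] + (strides[k] if carry else 0)
            carry = 0
            if idx[k] >= shape[k]:     # at most one wrap is ever needed
                idx[k] -= shape[k]
                offset -= wrap_off[k]
                carry = 1
        idx[0] += step[0] + carry
        offset += step_off[0] + (strides[0] if carry else 0)


def gather_view(base, start, shape, strides, total_threads):
    """Emulate all threads and collect the view's elements in C order."""
    size = int(np.prod(shape))
    out = np.empty(size, dtype=base.dtype)
    for t in range(total_threads):
        offsets = strided_offsets(shape, strides, t, total_threads)
        for n, off in enumerate(offsets):
            out[t + n * total_threads] = base[start + off]
    return out


base = np.arange(60, dtype=np.int64)
view = base.reshape(6, 10)[::-2, 1:-1:3]     # negative and non-unit strides
elem_strides = [s // view.itemsize for s in view.strides]
start = (view.__array_interface__["data"][0]
         - base.__array_interface__["data"][0]) // view.itemsize
result = gather_view(base, start, view.shape, elem_strides, total_threads=4)
```

Here `result` should match `view.ravel()`; the same walk works for any number of emulated threads because each dimension below the first wraps at most once per iteration.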

Independently of whether these changes are merged in, I will continue to use and develop them to support some local needs, so comments are welcome. I hope this is useful!

SyamGadde · 2018-02-21T18:50:07Z

Also, wanted to reference the issues this patch may address:

#15
#66
#121
#145
#151
#154
#162

inducer · 2018-02-22T21:03:35Z

Wow. This looks like an impressive bit of work. I agree with the objectives here, and I would like to see this (or something like it) get merged. Unfortunately, I won't be able to take a close look before mid-April (work deadlines). Sorry about the long delay. I've made a note to take a look then. In the meantime, I think it would be a good idea to solicit reviews from folks on the mailing list.

Most of all, thanks for working on this!

SyamGadde · 2018-02-22T21:10:23Z

Absolutely. I will subscribe to the list and post. Thanks!

SyamGadde · 2018-03-01T18:57:30Z

Closing and creating a new pull request #172 with more fixes, more reasonably arranged.

Syam Gadde added 7 commits February 20, 2018 12:44

Fix slicing with negative stride.

7d09e44

Commit only DeferredSourceModule support without changing calling behavior of users of get_elwise_module or get_elwise_range_module, to test backwards compatibility and performance.

27bcf76

Smarter _new_like_me that handles discontiguous input. Have copy() use it too.

be4014c

Make sure key is hashable.

afc3251

Allow existing kernel calls to use non-contiguous arrays by sending actual GPUArrays and using ARRAY_i indices (rather than just i).

9cb80f7

Fix variable names.

b38c75f

Non-contiguous is OK now.

1f6486b

Syam Gadde added 2 commits February 21, 2018 14:36

Forgot to remove non-contiguity check.

edcc44a

Allow setting scalars.

e23c943

grlee77 and others added 16 commits February 28, 2018 14:08

fix: update signature of gpuarray.reshape to match the GPUArray method

9cfaf97

Add get_texref() to ElementwiseKernel

080ec59

Update bpl-subset, possibly including pypy support

a7cb982

Make characterize.platform_bits work with Pypy (patch by Emanuel Rietveld)

fb10ffd

Fix pytest script-based test invocation

71ec966

Fix DeferredFunction.__call__, and change modulelazy to deferredmod.

542cffa

Make sure 'texrefs' keyword arg is re-evaluated every time.

8d278fb

(Function may change based on kernel call arguments)

Store module in cache too.

5a4a2fa

Send grid and block to _delayed_get_function

05a5400

Fix comment.

11108fe

Add debug option to ElementwiseSourceModule.

1a6228c

Fix index calculation (found using _debug!)

b743698

Add shape to the key (so it needs to remain a tuple).

7e57a72

Remove unnecessary format key.

c5de070

Fix kernel name.

61bd908

Fix _array_like_helper to work with non-contiguous arrays.

33a0dd8

SyamGadde mentioned this pull request Mar 1, 2018

Another attempt at supporting non-contiguous arrays #172

Open

SyamGadde closed this Mar 1, 2018
