
Cuda array interface #326

Merged
merged 37 commits into flatironinstitute:master from cuda_array_interface on Sep 8, 2023

Conversation

blackwer
Member

@blackwer blackwer commented Jul 25, 2023

@janden @DiamonDinoia

This PR is a proof of concept for handling GPU arrays from different python libraries -- it stands as a point of discussion for how we'd like to handle this in the future. It came up because pycuda, in addition to having issues with F ordering, also seems to block on its async calls, defeating their point entirely. I spent several hours trying to get them not to block, while it took about 10 minutes to get async working with the nvidia cuda bindings -- but if someone has a working async example with pycuda, I'd like to see it.

Anyway... the ultimate goal here is to facilitate using streams to overlap computation with host <=> device transfers, which currently requires some insane hoop jumping as far as I can tell. And I just couldn't get it working with pycuda arrays. So now I have working numba/torch/pycuda arrays, mostly, probably.

First, getting pointers to device arrays is trivial at this point via the CUDA array interface:
https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html
Therefore, this whole PR would be absolutely trivial if it weren't for two things:

  1. Type comparisons. Not all GPU array wrappers use numpy dtypes (notably torch, but possibly others).
  2. Constructors. Since the output array is optional, we have to allocate GPU arrays ourselves, which means inferring what kind of GPU array the input is and constructing one of the same kind. None of the GPU array APIs are the same, the constructors differ for each library, and torch does its own thing entirely.

We have talked about dlpack before with regard to this, but I don't believe it solves the construction problem. We don't actually need to do anything to the data -- just pass the pointer to the shared library -- and that is handled fine by the CUDA array interface standard, so it's not clear dlpack adds much here. I could be misunderstanding something, though.
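
For reference, here is a minimal sketch of what the CUDA array interface exposes, shown with a cupy array (any library implementing the protocol looks the same). Note that the `typestr` field is already a numpy-style string regardless of the wrapping library, which helps with point 1 above; this sketch is illustrative and not part of this PR's code.

import numpy as np
import cupy as cp

d_x = cp.arange(8, dtype=np.float32)

# Every object implementing the protocol exposes this dict.
cai = d_x.__cuda_array_interface__
ptr, read_only = cai["data"]      # raw device pointer to hand to the shared library
shape = cai["shape"]              # tuple of ints
dtype = np.dtype(cai["typestr"])  # numpy-style typestr, e.g. '<f4', even for torch

print(hex(ptr), shape, dtype)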

Driver code here:

import sys
import numpy as np

def transfer_funcs(module: str):
    if module == 'numba':
        import numba.cuda
        to_gpu = numba.cuda.to_device
        def to_cpu(obj):
            return obj.copy_to_host()
    elif module == 'pycuda':
        import pycuda.autoinit
        from pycuda.gpuarray import to_gpu
        def to_cpu(obj):
            return obj.get()
    elif module == 'cupy':
        import cupy
        def to_cpu(obj):
            return obj.get()
        def to_gpu(obj):
            return cupy.array(obj)
    elif module == 'torch':
        import torch
        def to_gpu(obj):
            return torch.asarray(obj, device=torch.device('cuda'))
        def to_cpu(obj):
            return obj.cpu().numpy()
    else:
        errstr = f"Unsupported module: {module}"
        raise TypeError(errstr)

    return to_cpu, to_gpu


if len(sys.argv) != 3:
    print("Supply implementation: numba, pycuda, torch")
    print("Supply dtype: 32, 64")
    sys.exit(1)

to_cpu, to_gpu = transfer_funcs(sys.argv[1])

# breaks numba if imported before getting the gpu imports with my current glibc++/venv setup
import cufinufft

if sys.argv[2] == '32':
    dtype = np.float32
    complex_dtype = np.complex64
elif sys.argv[2] == '64':
    dtype = np.float64
    complex_dtype = np.complex128
else:
    errstr = f"Unsupported dtype: {sys.argv[2]}"
    raise TypeError(errstr)

# Set up parameters for problem.
N1, N2 = 59, 61                 # Size of uniform grid
M = 100                         # Number of nonuniform points
n_transf = 2                    # Number of input arrays
eps = 1e-6                      # Requested tolerance

np.random.seed(100)
# Generate coordinates of non-uniform points.
x = np.random.uniform(-np.pi, np.pi, size=M)
y = np.random.uniform(-np.pi, np.pi, size=M)

# Generate source strengths.
c = (np.random.standard_normal((n_transf, M))
     + 1j * np.random.standard_normal((n_transf, M)))

# Cast to desired datatype.
x = x.astype(dtype)
y = y.astype(dtype)
c = c.astype(complex_dtype)

# Initialize the plan and set the points.
plan = cufinufft.Plan(1, (N1, N2), n_transf, eps=eps, dtype=complex_dtype)
plan.setpts(to_gpu(x), to_gpu(y))

# Execute the plan, reading from the strengths array c and storing the
# result in fk_gpu.
fk_gpu = plan.execute(to_gpu(c))

# Retrieve the result from the GPU.
fk = to_cpu(fk_gpu)

# Check accuracy of the transform at position (nt1, nt2).
nt1 = int(0.37 * N1)
nt2 = int(0.26 * N2)

for i in range(n_transf):
    # Calculate the true value of the type 1 transform at the uniform grid
    # point (nt1, nt2), which corresponds to the coordinate nt1 - N1 // 2 and
    # nt2 - N2 // 2.
    m, n = nt1 - N1 // 2, nt2 - N2 // 2
    fk_true = np.sum(c[i] * np.exp(1j * (m * x + n * y)))

    # Calculate the absolute and relative error.
    err = np.abs(fk[i, nt1, nt2] - fk_true)
    rel_err = err / np.max(np.abs(fk[i]))

    print(f"[{i}] Absolute error on mode [{nt1}, {nt2}] is {err:.3g}")
    print(f"[{i}] Relative error on mode [{nt1}, {nt2}] is {rel_err:.3g}")

    assert(rel_err < 10 * eps)
Example invocation:

% python example2d1many.py numba 32

@blackwer blackwer requested a review from janden July 25, 2023 21:22
@blackwer blackwer marked this pull request as draft July 25, 2023 21:23
@chaithyagr
Contributor

I could help you with setting up the streams, as I did the same for gpunufft...

@blackwer
Member Author

I could help you with setting up the streams, as I did the same for gpunufft...

I already got cufinufft working in this regard, but thank you! I'll be publishing a working python example in a separate PR soon, hopefully.

That said, was the driver in python? If so, what python library were you using for cuda?

@paquiteau

paquiteau commented Jul 26, 2023

Hi there, allow me to shamelessly plug the work we have done in mri-nufft [1], which adds support for the CUDA array interface (I am more of a cupy guy, but we also have torch support):

As @chaithyagr just mentioned, this would typically be a case where we want streams and concurrent copies happening (to see the potential performance gains, you can look at what I did in [2]).

Footnotes

  1. https://github.com/mind-inria/mri-nufft

  2. https://github.com/flatironinstitute/cufinufft/issues/126

@blackwer
Member Author

  • Accessing the pointer of a torch array is protected when you want to do gradient computation, but a bug/feature makes it possible to access it anyway. We would be happy to have this handled upstream.

Thanks for this tip! Will implement this shortly.
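
For concreteness, the kind of workaround being referred to looks roughly like this (a sketch, not necessarily what mri-nufft does): reading `__cuda_array_interface__` on a tensor that requires grad raises, but `detach()` returns a view of the same storage whose interface can be read.

import torch

t = torch.randn(4, device="cuda", requires_grad=True)

# t.__cuda_array_interface__ raises a RuntimeError when requires_grad is set;
# detach() gives a view of the same device memory without the autograd guard.
cai = t.detach().__cuda_array_interface__
ptr, _ = cai["data"]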

  • and an example driver is available there:

Thanks again!

As @chaithyagr just mentioned, this would typically be a case where we want streams and concurrent copies happening (to see the potential performance gains, you can look at what I did in [2]).

Yeah. That's 100% the motivation here. Concurrent execution attempts were a net negative because most of the computation is memory bandwidth limited, so you ended up with even more thrashing, but I think concurrent copies/execution should be an easy win. I've already made some significant performance gains vs. the current release version without this, and I'm hoping that exploiting streams on top of that could lead to a 1.5 to 2x improvement in throughput for y'all's workflow vs. the naive way. I'll ping you when I have everything working to my satisfaction :)
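
As a rough illustration of the copy/compute overlap being discussed, here is a sketch using cupy streams and pinned host memory; `run_transform` is a stand-in for a plan execution on the GPU, not a cufinufft call, and the sizes are arbitrary.

import numpy as np
import cupy as cp
import cupyx

n_batches, M = 4, 1_000_000

# Pinned host buffers are required for host->device copies to be truly async.
host_bufs = [cupyx.empty_pinned((M,), dtype=np.complex64) for _ in range(n_batches)]
for buf in host_bufs:
    buf[:] = (np.random.standard_normal(M)
              + 1j * np.random.standard_normal(M)).astype(np.complex64)

copy_stream = cp.cuda.Stream(non_blocking=True)
exec_stream = cp.cuda.Stream(non_blocking=True)

def run_transform(d_c):
    # Stand-in for work running on exec_stream (e.g. a plan.execute call).
    return cp.fft.fft(d_c)

events = [cp.cuda.Event() for _ in range(n_batches)]
device_bufs = [None] * n_batches

for i in range(n_batches):
    with copy_stream:
        device_bufs[i] = cp.asarray(host_bufs[i])  # async H2D copy
        events[i].record(copy_stream)
    exec_stream.wait_event(events[i])              # compute once the copy lands
    with exec_stream:
        run_transform(device_bufs[i])

exec_stream.synchronize()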

@janden
Collaborator

janden commented Aug 7, 2023

@blackwer Thanks for putting this together! I haven't run anything yet, but it all looks quite reasonable. I had been thinking that specifying the framework would have to be done at Plan instantiation, but you're right in that it can always be autodetected when setpts and execute are called (in fact, a “fun” aspect of your approach is that you can set the points using cupy and execute the transform in torch …).

Some quick thoughts:

  1. It might be good to rename _get_array_ctor to something else, since it returns the tuple of constructor and module name. Actually, since the constructor is not called in setpts, maybe it would be better to only call some variant of _get_module there.
  2. It might be cleaner to factor out calls to x.size, x.dtype, is_contiguous into a separate wrapper class (mainly to deal with torch deviations) and just return that (_GpuArrayWrapper); a rough sketch follows below.
  3. Assuming other libraries play ball (i.e., unlike pycuda), we can return to the original idea of _ensure_array_type by forcing a copy to contiguous memory when the input isn't.
  4. Still need unit tests for all this, but that shouldn't be too bad.

Let me know what you think. I'm happy to get started on any of these things.
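
To make point 2 concrete, here is a rough sketch of what such a wrapper could look like, built on __cuda_array_interface__; the method names and contiguity check are illustrative, not a commitment to the final _compat design.

import numpy as np

class _GpuArrayWrapper:
    """Numpy-flavored view of shape/size/dtype/contiguity for any array
    that implements __cuda_array_interface__ (including torch tensors)."""

    def __init__(self, arr):
        self._arr = arr
        self._cai = arr.__cuda_array_interface__

    @property
    def shape(self):
        return self._cai["shape"]

    @property
    def size(self):
        return int(np.prod(self._cai["shape"], dtype=np.int64))

    @property
    def dtype(self):
        return np.dtype(self._cai["typestr"])

    @property
    def data_ptr(self):
        return self._cai["data"][0]

    def is_contiguous(self):
        # The protocol reports strides=None for C-contiguous arrays.
        strides = self._cai.get("strides")
        if strides is None:
            return True
        expected = self.dtype.itemsize
        for dim, stride in zip(reversed(self.shape), reversed(strides)):
            if stride != expected:
                return False
            expected *= dim
        return True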

@blackwer
Member Author

blackwer commented Aug 7, 2023

@janden all of these sound great and I'll happily take help on any/all of them. There's also probably a reasonable way to just keep the dtype as a string and work from there, if you have ideas on that. There are weird hoops to jump through no matter what, though.

janden added 11 commits August 24, 2023 11:54

  • Can't call `x.size`, need to use compatibility layer.
  • Avoids having to call `util.transfer_funcs` all the time and simplifies when we only want `to_gpu`.
  • Only for show right now, since we only test the `pycuda` framework pending updated interface.
  • Shows that these are *pycuda* examples, not for the other frameworks (to be added later).
  • Since we can run on any of the given frameworks, we no longer depend on the `pycuda` package.
@janden janden force-pushed the cuda_array_interface branch from ecbddb6 to 3911ebc on August 24, 2023 at 10:53
@janden
Collaborator

janden commented Aug 24, 2023

@blackwer So I was able to make the changes. On the design side, I ended up just creating a bunch of functions (see _compat.py) for dealing with the various array types instead of creating a class (this way, indexing, etc. stays the same and we only call out to the compatibility layer when needed). I also made sure the simple interfaces can handle the different frameworks (was only a one-line change!).

To simplify testing, I introduced a --framework option to pytest which lets the user choose which frameworks to test. This can be useful if (like me) you have a broken numba install locally and only want to test the other frameworks. It also lets you run each framework in a separate process to prevent collisions in contexts, etc. (again, numba is the odd one out here). I've set up Jenkins to run all of them and it seems ok.
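
For anyone reproducing this locally: the option can be wired up in a small conftest.py roughly like the following (a sketch; the actual implementation in the PR may differ).

# conftest.py (sketch)
FRAMEWORKS = ["pycuda", "numba", "cupy", "torch"]

def pytest_addoption(parser):
    parser.addoption(
        "--framework",
        action="append",
        choices=FRAMEWORKS,
        help="GPU array framework(s) to test; may be repeated. Defaults to all.",
    )

def pytest_generate_tests(metafunc):
    # Parametrize any test that takes a `framework` argument.
    if "framework" in metafunc.fixturenames:
        frameworks = metafunc.config.getoption("framework") or FRAMEWORKS
        metafunc.parametrize("framework", frameworks)

Running, e.g., pytest --framework=torch --framework=cupy then restricts which frameworks are exercised.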

@janden
Collaborator

janden commented Aug 24, 2023

Tests are currently failing since we were assigned a K40 GPU (the version of Torch we're using doesn't come with precompiled kernels for sm35). Simplest solution is to force Jenkins to use the newer GPUs, but I don't recall the exact constraint to put in the Jenkinsfile.

See here for a test run using V100.

@blackwer
Member Author

@janden done!

@janden
Collaborator

janden commented Aug 25, 2023

Thanks! Then I think we're ready for review here.

@janden janden marked this pull request as ready for review August 25, 2023 07:16
@janden janden requested a review from ahbarnett August 30, 2023 09:32
Collaborator

@janden janden left a comment


Looks good to me obviously, but probably deserves another pair of eyes.

@janden janden force-pushed the master branch 3 times, most recently from 2e637b9 to 0e5f3f3 on August 30, 2023 at 11:26
@blackwer
Member Author

I can't take a serious look for another week -- on vacation. Will go through it then

@janden
Collaborator

janden commented Aug 31, 2023

I can't take a serious look for another week -- on vacation. Will go through it then

@blackwer No worries.

I'm going to bring in the Jenkins CI changes in a separate PR though, just to get those green check marks back.

@blackwer blackwer merged commit 5baa68c into flatironinstitute:master Sep 8, 2023
@blackwer blackwer deleted the cuda_array_interface branch September 8, 2023 15:49