
Cuda array interface #326

Merged
merged 37 commits into flatironinstitute:master from cuda_array_interface on Sep 8, 2023

Conversation

blackwer
Member

@blackwer blackwer commented Jul 25, 2023

@janden @DiamonDinoia

This PR is a proof of concept for handling GPU arrays from different python libraries -- it stands as a point of discussion for how we'd like to handle this in the future. It came up because pycuda, in addition to having issues with F ordering, also seems to block on its async calls, defeating their point entirely. I spent several hours trying to get them not to block, while it took about 10 minutes to get async working with the nvidia cuda bindings -- but if someone has a working async example with pycuda, I'd like to see it.

Anyway... the ultimate goal here is to facilitate using streams to overlap computation with host <=> device transfers, which currently requires some insane hoop jumping as far as I can tell. And I just couldn't get it working with pycuda arrays. So now I have working numba/torch/pycuda arrays, mostly, probably.

First, getting pointers to device arrays is trivial at this point via the CUDA array interface:
https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html
Therefore, this whole PR would be absolutely trivial if it weren't for two things:

  1. Type comparisons. Not all GPU array wrappers use numpy dtypes (notably torch, but possibly others).
  2. Constructors. Since the output array is optional, we have to allocate GPU arrays ourselves, which means inferring what kind of GPU array the input is and constructing one of the same kind. None of the GPU array APIs are the same, the constructors differ for each library, and torch does its own thing entirely.

We have talked about dlpack before with regard to this, but I don't believe it solves the construction problem. We don't actually need to do anything to the data -- just pass the pointer to the shared library -- and that is handled fine by the CUDA array interface standard, so it's not clear dlpack adds much here. I could be misunderstanding something, though.
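
For reference, here is a minimal sketch of what the CUDA array interface exposes, shown with a cupy array (any library implementing the protocol looks the same). Note that the `typestr` field is already a numpy-style string regardless of the wrapping library, which helps with point 1 above; this sketch is illustrative and not part of this PR's code.

import numpy as np
import cupy as cp

d_x = cp.arange(8, dtype=np.float32)

# Every object implementing the protocol exposes this dict.
cai = d_x.__cuda_array_interface__
ptr, read_only = cai["data"]      # raw device pointer to hand to the shared library
shape = cai["shape"]              # tuple of ints
dtype = np.dtype(cai["typestr"])  # numpy-style typestr, e.g. '<f4', even for torch

print(hex(ptr), shape, dtype)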

Driver code here:

import sys
import numpy as np

def transfer_funcs(module: str):
    if module == 'numba':
        import numba.cuda
        to_gpu = numba.cuda.to_device
        def to_cpu(obj):
            return obj.copy_to_host()
    elif module == 'pycuda':
        import pycuda.autoinit
        from pycuda.gpuarray import to_gpu
        def to_cpu(obj):
            return obj.get()
    elif module == 'cupy':
        import cupy
        def to_cpu(obj):
            return obj.get()
        def to_gpu(obj):
            return cupy.array(obj)
    elif module == 'torch':
        import torch
        def to_gpu(obj):
            return torch.asarray(obj, device=torch.device('cuda'))
        def to_cpu(obj):
            return obj.cpu().numpy()
    else:
        errstr = f"Unsupported module: {module}"
        raise TypeError(errstr)

    return to_cpu, to_gpu


if len(sys.argv) != 3:
    print("Supply implementation: numba, pycuda, torch")
    print("Supply dtype: 32, 64")
    sys.exit(1)

to_cpu, to_gpu = transfer_funcs(sys.argv[1])

# breaks numba if imported before getting the gpu imports with my current glibc++/venv setup
import cufinufft

if sys.argv[2] == '32':
    dtype = np.float32
    complex_dtype = np.complex64
elif sys.argv[2] == '64':
    dtype = np.float64
    complex_dtype = np.complex128
else:
    errstr = f"Unsupported dtype: {sys.argv[2]}"
    raise TypeError(errstr)

# Set up parameters for problem.
N1, N2 = 59, 61                 # Size of uniform grid
M = 100                         # Number of nonuniform points
n_transf = 2                    # Number of input arrays
eps = 1e-6                      # Requested tolerance

np.random.seed(100)
# Generate coordinates of non-uniform points.
x = np.random.uniform(-np.pi, np.pi, size=M)
y = np.random.uniform(-np.pi, np.pi, size=M)

# Generate source strengths.
c = (np.random.standard_normal((n_transf, M))
     + 1j * np.random.standard_normal((n_transf, M)))

# Cast to desired datatype.
x = x.astype(dtype)
y = y.astype(dtype)
c = c.astype(complex_dtype)

# Initialize the plan and set the points.
plan = cufinufft.Plan(1, (N1, N2), n_transf, eps=eps, dtype=complex_dtype)
plan.setpts(to_gpu(x), to_gpu(y))

# Execute the plan, reading from the strengths array c and storing the
# result in fk_gpu.
fk_gpu = plan.execute(to_gpu(c))

# Retrieve the result from the GPU.
fk = to_cpu(fk_gpu)

# Check accuracy of the transform at position (nt1, nt2).
nt1 = int(0.37 * N1)
nt2 = int(0.26 * N2)

for i in range(n_transf):
    # Calculate the true value of the type 1 transform at the uniform grid
    # point (nt1, nt2), which corresponds to the coordinate nt1 - N1 // 2 and
    # nt2 - N2 // 2.
    m, n = nt1 - N1 // 2, nt2 - N2 // 2
    fk_true = np.sum(c[i] * np.exp(1j * (m * x + n * y)))

    # Calculate the absolute and relative error.
    err = np.abs(fk[i, nt1, nt2] - fk_true)
    rel_err = err / np.max(np.abs(fk[i]))

    print(f"[{i}] Absolute error on mode [{nt1}, {nt2}] is {err:.3g}")
    print(f"[{i}] Relative error on mode [{nt1}, {nt2}] is {rel_err:.3g}")

    assert(rel_err < 10 * eps)
Example invocation:

% python example2d1many.py numba 32

@blackwer blackwer requested a review from janden July 25, 2023 21:22
@blackwer blackwer marked this pull request as draft July 25, 2023 21:23
@chaithyagr
Contributor

I could help you with setting up the streams, as I did the same for gpunufft...

@blackwer
Member Author

I could help you with setting up the streams, as I did the same for gpunufft...

I already got cufinufft working in this regard, but thank you! I'll be publishing a working python example in a separate PR soon, hopefully.

That said, was the driver in python? If so, what python library were you using for cuda?

@paquiteau

paquiteau commented Jul 26, 2023

Hi there, allow me to shamelessly plug the work we have done in mri-nufft [1], which adds support for the CUDA array interface (I am more of a cupy guy, but we also have torch support):

As @chaithyagr just mentioned, this would typically be a case where we want streams and concurrent copies happening (to see the potential performance gains, you can look at what I did in [2]).

Footnotes

  1. https://github.com/mind-inria/mri-nufft

  2. https://github.com/flatironinstitute/cufinufft/issues/126

@blackwer
Member Author

  • Accessing the pointer of a torch array is protected when you want to do gradient computation, but a bug/feature makes it possible to access it anyway. We would be happy to have this handled upstream.

Thanks for this tip! Will implement this shortly.
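
For concreteness, the kind of workaround being referred to looks roughly like this (a sketch, not necessarily what mri-nufft does): reading `__cuda_array_interface__` on a tensor that requires grad raises, but `detach()` returns a view of the same storage whose interface can be read.

import torch

t = torch.randn(4, device="cuda", requires_grad=True)

# t.__cuda_array_interface__ raises a RuntimeError when requires_grad is set;
# detach() gives a view of the same device memory without the autograd guard.
cai = t.detach().__cuda_array_interface__
ptr, _ = cai["data"]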

  • and an example driver is available there:

Thanks again!

As @chaithyagr just mentioned, this would typically be a case where we want streams and concurrent copies happening (to see the potential performance gains, you can look at what I did in [2]).

Yeah. That's 100% the motivation here. Concurrent execution attempts were a net negative because most of the computation is memory bandwidth limited, so you ended up with even more thrashing, but I think concurrent copies/execution should be an easy win. I've already made some significant performance gains vs. the current release version without this, and I'm hoping that exploiting streams on top of that could lead to a 1.5 to 2x improvement in throughput for y'all's workflow vs. the naive way. I'll ping you when I have everything working to my satisfaction :)
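
As a rough illustration of the copy/compute overlap being discussed, here is a sketch using cupy streams and pinned host memory; `run_transform` is a stand-in for a plan execution on the GPU, not a cufinufft call, and the sizes are arbitrary.

import numpy as np
import cupy as cp
import cupyx

n_batches, M = 4, 1_000_000

# Pinned host buffers are required for host->device copies to be truly async.
host_bufs = [cupyx.empty_pinned((M,), dtype=np.complex64) for _ in range(n_batches)]
for buf in host_bufs:
    buf[:] = (np.random.standard_normal(M)
              + 1j * np.random.standard_normal(M)).astype(np.complex64)

copy_stream = cp.cuda.Stream(non_blocking=True)
exec_stream = cp.cuda.Stream(non_blocking=True)

def run_transform(d_c):
    # Stand-in for work running on exec_stream (e.g. a plan.execute call).
    return cp.fft.fft(d_c)

events = [cp.cuda.Event() for _ in range(n_batches)]
device_bufs = [None] * n_batches

for i in range(n_batches):
    with copy_stream:
        device_bufs[i] = cp.asarray(host_bufs[i])  # async H2D copy
        events[i].record(copy_stream)
    exec_stream.wait_event(events[i])              # compute once the copy lands
    with exec_stream:
        run_transform(device_bufs[i])

exec_stream.synchronize()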

@janden
Collaborator

janden commented Aug 7, 2023

@blackwer Thanks for putting this together! I haven't run anything yet, but it all looks quite reasonable. I had been thinking that specifying the framework would have to be done at Plan instantiation, but you're right in that it can always be autodetected when setpts and execute are called (in fact, a “fun” aspect of your approach is that you can set the points using cupy and execute the transform in torch …).

Some quick thoughts:

  1. It might be good to rename _get_array_ctor to something else, since it returns the tuple of constructor and module name. Actually, since the constructor is not called in setpts, maybe it would be better to only call some variant of _get_module there.
  2. It might be cleaner to factor out calls to x.size, x.dtype, is_contiguous into a separate wrapper class (mainly to deal with torch deviations) and just return that (_GpuArrayWrapper); a rough sketch follows below.
  3. Assuming other libraries play ball (i.e., unlike pycuda), we can return to the original idea of _ensure_array_type by forcing a copy to contiguous memory when the input isn't.
  4. Still need unit tests for all this, but that shouldn't be too bad.

Let me know what you think. I'm happy to get started on any of these things.
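
To make point 2 concrete, here is a rough sketch of what such a wrapper could look like, built on __cuda_array_interface__; the method names and contiguity check are illustrative, not a commitment to the final _compat design.

import numpy as np

class _GpuArrayWrapper:
    """Numpy-flavored view of shape/size/dtype/contiguity for any array
    that implements __cuda_array_interface__ (including torch tensors)."""

    def __init__(self, arr):
        self._arr = arr
        self._cai = arr.__cuda_array_interface__

    @property
    def shape(self):
        return self._cai["shape"]

    @property
    def size(self):
        return int(np.prod(self._cai["shape"], dtype=np.int64))

    @property
    def dtype(self):
        return np.dtype(self._cai["typestr"])

    @property
    def data_ptr(self):
        return self._cai["data"][0]

    def is_contiguous(self):
        # The protocol reports strides=None for C-contiguous arrays.
        strides = self._cai.get("strides")
        if strides is None:
            return True
        expected = self.dtype.itemsize
        for dim, stride in zip(reversed(self.shape), reversed(strides)):
            if stride != expected:
                return False
            expected *= dim
        return True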

@blackwer
Member Author

blackwer commented Aug 7, 2023

@janden all of these sound great and I'll happily take help on any/all of them. There's also probably a reasonable way to just keep the dtype as a string and work from there, if you have ideas on that. There are weird hoops to jump through no matter what, though.

janden added 11 commits August 24, 2023 11:54

  • Can't call `x.size`, need to use compatibility layer.
  • Avoids having to call `util.transfer_funcs` all the time and simplifies when we only want `to_gpu`.
  • Only for show right now, since we only test the `pycuda` framework pending updated interface.
  • Shows that these are *pycuda* examples, not for the other frameworks (to be added later).
  • Since we can run on any of the given frameworks, we no longer depend on the `pycuda` package.
@janden janden force-pushed the cuda_array_interface branch from ecbddb6 to 3911ebc on August 24, 2023 at 10:53
@janden
Collaborator

janden commented Aug 24, 2023

@blackwer So I was able to make the changes. On the design side, I ended up just creating a bunch of functions (see _compat.py) for dealing with the various array types instead of creating a class (this way, indexing, etc. stays the same and we only call out to the compatibility layer when needed). I also made sure the simple interfaces can handle the different frameworks (was only a one-line change!).

To simplify testing, I introduced a --framework option to pytest which lets the user choose which frameworks to test. This can be useful if (like me) you have a broken numba install locally and only want to test the other frameworks. It also lets you run each framework in a separate process to prevent collisions in contexts, etc. (again, numba is the odd one out here). I've set up Jenkins to run all of them and it seems ok.
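
For anyone reproducing this locally: the option can be wired up in a small conftest.py roughly like the following (a sketch; the actual implementation in the PR may differ).

# conftest.py (sketch)
FRAMEWORKS = ["pycuda", "numba", "cupy", "torch"]

def pytest_addoption(parser):
    parser.addoption(
        "--framework",
        action="append",
        choices=FRAMEWORKS,
        help="GPU array framework(s) to test; may be repeated. Defaults to all.",
    )

def pytest_generate_tests(metafunc):
    # Parametrize any test that takes a `framework` argument.
    if "framework" in metafunc.fixturenames:
        frameworks = metafunc.config.getoption("framework") or FRAMEWORKS
        metafunc.parametrize("framework", frameworks)

Running, e.g., pytest --framework=torch --framework=cupy then restricts which frameworks are exercised.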

@janden
Collaborator

janden commented Aug 24, 2023

Tests are currently failing since we were assigned a K40 GPU (the version of Torch we're using doesn't come with precompiled kernels for sm35). Simplest solution is to force Jenkins to use the newer GPUs, but I don't recall the exact constraint to put in the Jenkinsfile.

See here for a test run using V100.

@blackwer
Member Author

@janden done!

@janden
Collaborator

janden commented Aug 25, 2023

Thanks! Then I think we're ready for review here.

@janden janden marked this pull request as ready for review August 25, 2023 07:16
@janden janden requested a review from ahbarnett August 30, 2023 09:32
Collaborator

@janden janden left a comment


Looks good to me obviously, but probably deserves another pair of eyes.

@janden janden force-pushed the master branch 3 times, most recently from 2e637b9 to 0e5f3f3 on August 30, 2023 at 11:26
@blackwer
Member Author

I can't take a serious look for another week -- on vacation. Will go through it then

@janden
Collaborator

janden commented Aug 31, 2023

I can't take a serious look for another week -- on vacation. Will go through it then

@blackwer No worries.

I'm going to bring in the Jenkins CI changes in a separate PR though, just to get those green check marks back.

@blackwer blackwer merged commit 5baa68c into flatironinstitute:master Sep 8, 2023
@blackwer blackwer deleted the cuda_array_interface branch September 8, 2023 15:49