Cuda array interface #326
Conversation
I could help you with setting up the streams, as I did the same for gpuNUFFT...
I already got that part working. That said, was the driver in python? If so, what python library were you using for cuda?
Hi there, I allow myself to shamelessly plug the work we have done in mri-nufft, which adds support for the CUDA array interface (I am more of a cupy guy, but we also have torch support).

As @chaithyagr just mentioned, this would typically be a case where we want streams and concurrent copies happening (to see the potential performance gains, you can look at what I did there).
Thanks for this tip! Will implement this shortly.
Thanks again!
Yeah. That's 100% the motivation here. Concurrent execution attempts were a net negative because most of the computation is memory bandwidth limited, so you ended up with even more thrashing. But I think concurrent copies/execution should be an easy win. I've already made some significant performance gains vs. the current release version without this, but I'm hoping that exploiting streams on top of that could lead to a 1.5 to 2x improvement in throughput of y'all's workflow vs. the naive way. I'll ping you when I have everything working to my satisfaction :)
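For anyone following along, here is a minimal sketch of the kind of copy/compute overlap being discussed, assuming CuPy for the host/device plumbing; `process_chunk`, the chunk sizes, and the use of two streams are illustrative stand-ins rather than anything from this PR:

```python
import numpy as np
import cupy as cp

def process_chunk(d_chunk):
    # stand-in for the real per-chunk work (e.g. a NUFFT execute call)
    return cp.fft.fft(d_chunk)

n_chunks, chunk_len = 8, 1 << 20
itemsize = np.dtype(np.complex64).itemsize

# pinned (page-locked) host memory so host->device copies can actually run async
pinned = cp.cuda.alloc_pinned_memory(n_chunks * chunk_len * itemsize)
h_data = np.frombuffer(pinned, dtype=np.complex64).reshape(n_chunks, chunk_len)

streams = [cp.cuda.Stream(non_blocking=True) for _ in range(2)]
results = []
for i in range(n_chunks):
    stream = streams[i % 2]
    with stream:
        d_chunk = cp.empty(chunk_len, dtype=np.complex64)
        d_chunk.set(h_data[i], stream=stream)    # async H2D copy on this stream
        results.append(process_chunk(d_chunk))   # compute queued behind the copy

for stream in streams:
    stream.synchronize()
```

With two streams, the copy for chunk i+1 can proceed while the kernel for chunk i is still running, which is exactly the overlap described above.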
@blackwer Thanks for putting this together! I haven't run anything yet, but it all looks quite reasonable. I had been thinking that specifying the framework would have to be done at plan construction. Some quick thoughts:

Let me know what you think. I'm happy to get started on any of these things.
@janden all of these sound great and I'll happily take help on any/all of them. There's also probably a reasonable way to just keep the dtype as a string and work from there, if you have ideas on that. There are weird hoops you have to jump through no matter what, though.
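To make the "keep the dtype as a string" idea a bit more concrete, one hypothetical shape it could take (none of these helpers exist in the repo) is to stringify whatever dtype object each framework hands back and branch on that:

```python
def dtype_name(x):
    # numpy/cupy dtypes stringify as e.g. "complex64",
    # torch dtypes as "torch.complex64"; strip the prefix either way
    return str(x.dtype).split(".")[-1]

def is_single_precision(x):
    # e.g. pick the transform precision from the string alone
    return dtype_name(x) in ("float32", "complex64")
```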
Leave this in a separate module for reuse in other places.
Needed for cupy.
Looks like `torch.asarray` was introduced in v1.11, so for backwards compatibility, we use `as_tensor` instead.
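For illustration (assuming a pre-1.11 torch build), the two calls behave the same for this use case:

```python
import numpy as np
import torch

x = np.arange(6, dtype=np.float32)
t = torch.as_tensor(x)   # available on older torch releases; shares memory with x
# t = torch.asarray(x)   # equivalent here, but only exists in torch >= 1.11
```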
Should be using `_data`, not `data` since the former has been transformed to the appropriate dtype, stride, and shape.
Can't call `x.size`, need to use compatibility layer.
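For context, the incompatibility is that `.size` is an integer attribute on NumPy/CuPy/numba arrays but a method on torch tensors. A shim along these lines (the name `array_size` is only illustrative, not the repo's actual helper) hides the difference:

```python
def array_size(x):
    # torch: Tensor.size is a method; Tensor.numel() gives the element count
    if callable(getattr(x, "numel", None)):
        return x.numel()
    # numpy, cupy, and numba device arrays: plain integer attribute
    return x.size
```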
Avoids having to call `util.transfer_funcs` all the time and simplifies when we only want `to_gpu`.
Only for show right now, since we only test the `pycuda` framework pending updated interface.
Shows that these are *pycuda* examples, not for the other frameworks (to be added later).
Since we can run on any of the given frameworks, we no longer depend on the `pycuda` package.
@blackwer So I was able to make the changes. On the design side, I ended up just creating a bunch of functions (see the separate module). To simplify testing, I introduced a way to select which framework the tests run against.
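As a rough sketch of what that shape might look like (the names `to_gpu` and `framework` here are illustrative guesses, not the PR's actual helpers), a per-framework dispatch function plus a parametrized pytest fixture lets the same tests run against every supported array library:

```python
import numpy as np
import pytest

def to_gpu(host_arr, framework):
    if framework == "pycuda":
        import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
        import pycuda.gpuarray as gpuarray
        return gpuarray.to_gpu(host_arr)
    if framework == "numba":
        from numba import cuda
        return cuda.to_device(host_arr)
    if framework == "torch":
        import torch
        return torch.as_tensor(host_arr).cuda()
    raise ValueError(f"unknown framework: {framework}")

@pytest.fixture(params=["pycuda", "numba", "torch"])
def framework(request):
    return request.param

def test_device_array(framework):
    x = np.arange(8, dtype=np.float32)
    d_x = to_gpu(x, framework)
    # every framework exposes the device buffer via the CUDA array interface
    assert d_x.__cuda_array_interface__["shape"] == (8,)
```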
Tests are currently failing since we were assigned a K40 GPU (the version of Torch we're using doesn't come with precompiled kernels for that architecture). See here for a test run using a V100.
@janden done!
Thanks! Then I think we're ready for review here. |
Looks good to me obviously, but probably deserves another pair of eyes.
I can't take a serious look for another week -- on vacation. Will go through it then.
@blackwer No worries. I'm going to bring in the Jenkins CI changes in a separate PR though, just to get those green check marks back. |
@janden @DiamonDinoia

This PR is a proof of concept for handling GPU arrays from different python libraries -- it stands as a point of discussion for how we'd like to handle this in the future. This came up because `pycuda`, in addition to having issues with `F` ordering, also seems to block on its `async` calls, defeating the point of them entirely. I spent several hours trying to get them to not block, while it took about 10 minutes to get async working with the nvidia cuda bindings, but if someone has a working async example with `pycuda`, I'd like to see it.

Anyway... the ultimate goal here is to facilitate using `streams` to overlap computation with `host <=> device` transfers, which currently requires some insane hoop jumping as far as I can tell. And I just couldn't get it working with `pycuda` arrays. So now I have working numba/torch/pycuda arrays, mostly, probably.

First, pointers to device arrays are trivial at this point using:
https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html
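Concretely, consuming such an array looks something like this (a sketch using numba only as an example producer; any object implementing `__cuda_array_interface__` works the same way):

```python
import numpy as np
from numba import cuda

d_arr = cuda.to_device(np.zeros(16, dtype=np.complex64))

iface = d_arr.__cuda_array_interface__
ptr, read_only = iface["data"]   # raw integer device pointer plus a read-only flag
shape = iface["shape"]           # (16,)
typestr = iface["typestr"]       # "<c8", i.e. little-endian complex64

# the pointer can be handed straight to the compiled library,
# e.g. as ctypes.c_void_p(ptr)
```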
Therefore, this whole PR would be absolutely trivial if it wasn't for two things: output arrays have to be constructed by the appropriate library, and not all libraries use `numpy` types (notably torch, but possibly others).

We have talked about `dlpack` before in regards to this, but I don't believe it solves the problem of construction. And we don't actually need to do anything to the data, just pass the pointer to the shared library, so it's not clear that `dlpack` provides much of use here, since that's handled fine by the `cuda array interface` standard. I could be misunderstanding something though.

Driver code here: