Use mpi4py-fft with CuPy arrays? #14
Comments
Hi @leofang
Thanks, @mikaem! I am not sure I fully follow; maybe I am still missing something... Yes,
In terms of code, I am looking at the tutorial:

```python
import numpy as np
from mpi4py import MPI
from mpi4py_fft import PFFT, newDistArray

N = np.array([128, 128, 128], dtype=int)
fft = PFFT(MPI.COMM_WORLD, N, axes=(0, 1, 2), dtype=np.float64, grid=(-1,))
u = newDistArray(fft, False)
u[:] = np.random.random(u.shape).astype(u.dtype)
u_hat = fft.forward(u, normalize=True)
uj = np.zeros_like(u)
uj = fft.backward(u_hat, uj)
assert np.allclose(uj, u)
print(MPI.COMM_WORLD.Get_rank(), u.shape)
```

From a CuPy developer/user's perspective, I expect that this code should just work after replacing the NumPy arrays with CuPy arrays.
My point is that there are NumPy arrays within the PFFT class, or more precisely the FFT class. The arrays you set up with DistArray are copied into these before the transforms start, so just changing DistArray does not change the underlying routines. Transforms are not in-place. That is why I think you need to add a serial FFT class for CuPy just to get started. Is there a serial CuPy FFT library?
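CuPy does ship serial FFT routines through cupy.fft and cupyx.scipy.fft. As a rough illustration of the kind of serial GPU FFT class being suggested, here is a minimal sketch that holds its own CuPy work arrays the way the existing serial FFT class holds NumPy ones; the class and method names are hypothetical and not part of mpi4py-fft:

```python
import cupy as cp

class SerialCuPyFFT:
    """Hypothetical serial real-to-complex FFT backend backed by CuPy arrays."""

    def __init__(self, shape, axes, dtype=cp.float64):
        self.axes = tuple(axes)
        # Pre-allocated GPU work arrays, analogous to the NumPy work arrays
        # held by the existing serial FFT class.
        self.u = cp.empty(shape, dtype=dtype)
        out_shape = list(shape)
        out_shape[self.axes[-1]] = shape[self.axes[-1]] // 2 + 1  # r2c output size
        self.u_hat = cp.empty(out_shape, dtype=cp.complex128)

    def forward(self, u):
        self.u[...] = u
        self.u_hat[...] = cp.fft.rfftn(self.u, axes=self.axes)
        return self.u_hat

    def backward(self, u_hat):
        self.u_hat[...] = u_hat
        s = [self.u.shape[ax] for ax in self.axes]  # recover original lengths
        self.u[...] = cp.fft.irfftn(self.u_hat, s=s, axes=self.axes)
        return self.u
```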
What's the status of this? Support for subclassing has since been added to CuPy (see the issue mentioned above), and CuPy has FFTs as well.
There are no plans as far as I know, but contributions are welcome 😃
I didn't have bandwidth for this, so feel free to give it a shot and let us know how it goes!
I started out by CuPyfying the DistArray class. There were a few issues, but so far I think nothing catastrophic. I would like to merge this back here eventually. I guess you would reasonably ask for the changes to be tested. However, testing code on GPUs is not straightforward on GitHub, at least without paying. So is there any chance of merging this here? Do you have a way of testing on GPUs? Otherwise, I guess I would not have to pay as much attention to cleaning up the code, which is why I am asking already now. If you are interested, you can track my progress here.
I won't have time before the holidays, unfortunately, but after that I could help test it locally. I also agree that it should be merged if everything works locally. As for accessing GPUs in GitHub Actions, that's a longer discussion that won't happen overnight; what I can share now is that interested parties are working on this.
Thanks for working on this and sharing your findings, @brownbaerchen, this is very encouraging!
I'd let @dalcinl or @mikaem comment on this, but my hunch is that it's related to the CUDA-aware MPI implementation: which MPI did you use, and how did you build it?
I think this roughly explains the perf difference between CuPy + mpi4py-fft and cuFFTMp + NVSHMEM. If you share your source code for the mpi4py-fft integration, I could take a look next week and see if there's any improvement I can suggest. By the way, how did you use cuFFTMp in Python?
Regarding the slow … I didn't actually run cuFFTMp in Python. I may try to build bindings for that at some point. I just ran this file from the NVIDIA samples repository. My code is available on this branch.
It's basically copy-paste from the NumPy backend. I also tried … I don't construct plans myself because they are cached during the first transform, and that is fine by me. But maybe there is a plan that avoids going to …
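For reference, CuPy also lets you build a cuFFT plan explicitly and reuse it, instead of relying on the cache that is populated on the first transform. A minimal sketch (standalone, not tied to the mpi4py-fft code being discussed):

```python
import cupy as cp
from cupyx.scipy.fftpack import get_fft_plan

a = cp.random.random((128, 128, 128)) + 0j  # complex input for a C2C transform
plan = get_fft_plan(a, axes=(0, 1, 2), value_type='C2C')  # plan built once, up front
with plan:
    # cp.fft functions pick up the plan from the context manager instead of
    # creating/caching one on the first call.
    a_hat = cp.fft.fftn(a, axes=(0, 1, 2))
```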
@brownbaerchen would this https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuFFTMp/JAX_FFT help with running cuFFTMp in Python?
Thanks for mentioning this! Since I haven't done any work with pybind11 or JAX before, this is a bit opaque to me. I am wondering why this was done with JAX and not with CuPy. I guess I have three options: …
Any recommendations?
Sorry for the long delay. @brownbaerchen, your snippet above caught my attention. You shouldn't need the lines calling …
You're right, thanks for your input! I ignored this so far because it is only done once during setup. Indeed, just removing the detour to …
Cool, if you send a draft PR to this repo, it might be easier for me and the maintainers (Mikael/Lisandro) to review and suggest changes, and they might have thoughts on the PR scope and your custom …
Totally forgot to follow up, sorry. Since you tried two flavors of MPI implementations, do you happen to know if they are both CUDA-aware (that is, you can just pass a GPU array to mpi4py and the underlying MPI implementation knows how to handle GPU pointers)? If you already have mpi4py/CuPy in your local env, you can easily test them following Step 4 in conda-forge/openmpi-feedstock#128 (review).
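For illustration, a quick CUDA-awareness check along those lines could look like the sketch below: recent mpi4py versions accept CuPy arrays directly (via the CUDA array interface), and the collective only succeeds end to end if the underlying MPI can handle device pointers.

```python
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
sendbuf = cp.arange(10, dtype=cp.float64)
recvbuf = cp.empty_like(sendbuf)
cp.cuda.get_current_stream().synchronize()  # make sure the GPU data is ready
comm.Allreduce(sendbuf, recvbuf)  # device pointers are handed straight to MPI
assert cp.allclose(recvbuf, sendbuf * comm.Get_size())
print(comm.Get_rank(), "CUDA-aware MPI seems to work")
```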
Hi @leofang, yes, both MPI implementations are built with CUDA support. I talked to the support staff here in Jülich and they also don't know why this is happening, but they assured me that I loaded the correct modules. It would be nice if someone could test this on a different machine. Since it seems that NCCL is preferable over MPI anyway, maybe this doesn't matter too much, though.
@brownbaerchen It would be nice to gather more info. For Open MPI, please share the output of …
Since the output is quite lengthy, I uploaded the output of …
After @mhrywniak gave me some tips on how to use Nsight, I could finally see where the expensive memory operations I have been talking about occur. To recap: the launch of …
Also, I experimented with streams in the NCCL …
Any updates on cuFFTMp Python bindings?
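As a side note on the stream experiments mentioned above, this is roughly what enqueuing CuPy work on a non-default stream looks like (a generic sketch, not the actual code referred to in the comment):

```python
import cupy as cp

stream = cp.cuda.Stream(non_blocking=True)  # separate, non-default stream
a = cp.random.random((128, 128, 128))
with stream:
    # Kernels launched inside the context go onto `stream`, so they can
    # overlap with work (e.g. communication) running on other streams.
    a_hat = cp.fft.fftn(a)
stream.synchronize()  # wait for the FFT to finish before using a_hat
```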
Weird name, because … normalization is about scaling with a real value, not a complex value.
Good point, @dalcinl! Indeed, multiplying a real view in-place results in a kernel called … I noticed something else, though, but I am not sure if it is 100% correct. So please confirm the following:
Definitely. Good catch! The NumPy backend was kind of a proof of concept; I think we did not really think about the proper, performant way to do it.
Of course.
None that I can think of.
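For context on the normalization being discussed: scaling the spectral array by a real factor can be done in-place on the complex array itself, without going through a real view. A minimal CuPy sketch:

```python
import cupy as cp

u_hat = cp.random.random((8, 8)) + 1j * cp.random.random((8, 8))
u_hat *= 1.0 / u_hat.size  # in-place scaling by a real Python float
```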
@dalcinl I had a tiny bit of time over the holidays to pick up this old task that has been on my mind for a couple of years. I noticed that `DistArray` is the workhorse behind mpi4py-fft; however, it requires subclassing `numpy.ndarray` (mpi4py-fft/mpi4py_fft/distarray.py, line 10 in ac510f8), which makes it impossible to use `cupy.ndarray` as a drop-in replacement, as CuPy currently does not support subclassing (cupy/cupy#3972).

Is it absolutely necessary to make `DistArray` a subclass? Can it be tweaked (hopefully mildly) to become an array container that supports both NumPy and CuPy arrays (and potentially other array/tensor libraries)? Could this change break backward compatibility? From my first glimpse it seems possible, but I'd like to discuss further and see if I'm missing anything.
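To make the "array container" idea concrete, here is a very rough sketch of what such a container could look like; the class name and all details are purely illustrative and not mpi4py-fft API:

```python
import numpy as np

class DistArrayContainer:
    """Illustrative container holding either a numpy or a cupy array."""

    def __init__(self, shape, dtype=float, backend=np):
        # `backend` is assumed to be either the numpy or the cupy module,
        # which expose the same array-creation API.
        self._array = backend.zeros(shape, dtype=dtype)

    def __getitem__(self, key):
        return self._array[key]

    def __setitem__(self, key, value):
        self._array[key] = value

    def __getattr__(self, name):
        # Delegate everything else (shape, dtype, reshape, ...) to the
        # wrapped array instead of relying on ndarray subclassing.
        return getattr(self._array, name)


# Usage with the NumPy backend; swapping in `cupy` would put the data on GPU.
u = DistArrayContainer((4, 4), dtype=float)
u[:] = np.random.random((4, 4))
print(u.shape, u.dtype)
```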