Adding target option #62

Closed
wants to merge 1 commit into from


Contributor

@michel2323 michel2323 commented Apr 4, 2024

This adds a target option to the parallel function calls. For CUDA:

JACC.parallel_for(CUDABackend(), N, axpy, alpha, x_device_JACC, y_device_JACC)

The GPU packages provide these backends. JACC then defines ThreadsBackend() in addition to those.

Doing it this way should resolve the precompilation error and also resolve #56. In addition, preferences no longer need to be set, and the various backends can be used concurrently in the same code. A JACC.Array type is no longer needed either. The approach imitates the target offload pragma of OpenMP.

@PhilipFackler Let me know if there are any further issues with this solution.

Edit: These backends are also used by KernelAbstractions (except ThreadsBackend(), of course), so it would be easy now to write, for example, some GPU kernels in KA that don't require backend-specific functionality.
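
To make the idea of a target argument concrete, here is a minimal sketch of how a backend value passed as the first argument can select the implementation through multiple dispatch. All names prefixed with `Sketch` or `sketch_` are hypothetical stand-ins chosen for illustration. They are not JACC's actual types or code, and only Julia's standard library is used.

```julia
# Hypothetical stand-ins for backend types; not JACC's real definitions.
abstract type SketchBackend end
struct SketchSerialBackend <: SketchBackend end
struct SketchThreadsBackend <: SketchBackend end

# One method per backend; Julia picks the method from the type of the first argument.
function sketch_parallel_for(::SketchSerialBackend, N::Integer, f, args...)
    for i in 1:N
        f(i, args...)
    end
    return nothing
end

function sketch_parallel_for(::SketchThreadsBackend, N::Integer, f, args...)
    Threads.@threads for i in 1:N
        f(i, args...)
    end
    return nothing
end

# Same kernel as in the thread: x[i] += alpha * y[i]
function axpy(i, alpha, x, y)
    if i <= length(x)
        @inbounds x[i] += alpha * y[i]
    end
end

N = 8
alpha = 2.0
y = fill(3.0, N)
x_serial = ones(N)
x_threads = ones(N)

# The call site is identical apart from the backend value.
sketch_parallel_for(SketchSerialBackend(), N, axpy, alpha, x_serial, y)
sketch_parallel_for(SketchThreadsBackend(), N, axpy, alpha, x_threads, y)
```

Because the backend is an ordinary value, it can be chosen at runtime and several backends can coexist in one program, which appears to be the property this PR is after.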

@williamfgc williamfgc requested a review from PhilipFackler April 4, 2024 16:56
@williamfgc
Collaborator

@michel2323 thanks for adding this. I think we need to discuss offline as the changes remove JACC's public API portability across vendors for the same code. Am I seeing this right?

@michel2323
Contributor Author

> @michel2323 thanks for adding this. I think we need to discuss offline as the changes remove JACC's public API portability across vendors for the same code. Am I seeing this right?

I wouldn't say so. In what way? At some point you have to pick a backend, but the same is true for OpenMP. In Julia you can do this at runtime.

@michel2323
Contributor Author

michel2323 commented Apr 4, 2024

If your code only uses one backend, say CUDA, you could have a setup.jl where the user has to pick the backend. Or you could load all backend packages (CUDA.jl, AMDGPU.jl,...) and see which one is functional() (see tests). I think this is great in case you have a mix of AMD and NVIDIA GPUs on one system, for example. The code with the parallel() calls is the same across all vendors.

@michel2323
Contributor Author

michel2323 commented Apr 4, 2024

Ah, maybe I see what you mean: array types such as CuArray are vendor-specific. But Julia already provides a wonderful solution for that in the Adapt package, which all backends support.

using Adapt  # provides adapt()
x = zeros(10)
dx = adapt(backend, x)

So in the case where backend=CUDABackend(), dx will be of type CuArray and you never have to (nor should) use the vendor-specific types.

For @PhilipFackler this would also make things easier when a struct mixes host and device types. He would only have to define an Adapt.adapt(backend, mystruct) function.
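
As a rough illustration of the struct case, the sketch below uses `Adapt.adapt_structure`, which to my knowledge is the extension point Adapt.jl documents for user-defined types (rather than overloading `adapt` itself). The `Field` struct is hypothetical, and the example adapts to a plain `Array` so that it runs without a GPU. It assumes the Adapt package is installed.

```julia
using Adapt

# Hypothetical struct mixing a host-side scalar with array data.
struct Field{A<:AbstractArray}
    scale::Float64   # plain scalar, passed through unchanged
    data::A          # array storage, converted according to the adaptor
end

# Tell Adapt how to rebuild the struct: only the array field is adapted.
Adapt.adapt_structure(to, f::Field) = Field(f.scale, adapt(to, f.data))

host = Field(2.0, zeros(4))

# Adapting to Array keeps everything on the host; the struct is rebuilt field by field.
same = adapt(Array, host)
```

With a GPU backend in place of `Array`, for example `adapt(CUDABackend(), host)`, the `data` field would presumably come back as a CuArray, in line with the comment above. That part is not exercised here.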

@williamfgc
Collaborator

> I think this is great in case you have a mix of AMD and NVIDIA GPUs on one system

This is mostly a corner case that very rarely comes up, so we should focus on portable code across different vendors. I agree it's nice to have, but specifying a back end in the public API should be optional (perhaps a macro?), reserved for corner cases rather than the rule.

The back end selection follows Preferences.jl just like MPIPreferences.jl, so user code calling JACC (like the code in the tests) doesn't need to be changed from parallel_for(BackendX, ...) to parallel_for(BackendY, ...) when switching back ends, which matters especially in code with several calls to parallel_for. In fact, users only need to set LocalPreferences and add an "import XBackend". We can discuss offline.

@michel2323
Contributor Author

The argument would be a variable backend. If you want, you can make it a global variable or have a default based on what backend is functional. I don't think it's such a corner case since one has at least host and device backends available, and I doubt one wants to run everything on a device.

@williamfgc
Collaborator

> I doubt one wants to run everything on a device

For those cases, the user should rely on regular Julia Arrays and the CPU (host) if the code is not worth porting. JACC is targeted specifically at performance-portable pieces of code.

> The argument would be a variable backend. If you want, you can make it a global variable or have a default based on what backend is functional.

That's what JACCPreferences sets, but via LocalPreferences, see this line. The less vendor/system information is exposed to the target users (domain scientists), the better.

@michel2323
Contributor Author

michel2323 commented Apr 4, 2024

Let me add an example code:

# Setup code, placed in a setup.jl or run by the user before the application code
using CUDA
using JACC   # loaded here because it provides ThreadsBackend()
using Adapt  # provides adapt()

if CUDA.functional()
    backend = CUDABackend()
else
    backend = ThreadsBackend()
end

# Application code using JACC, which is the same across all vendors

function axpy(i, alpha, x, y)
    if i <= length(x)
        @inbounds x[i] += alpha * y[i]
    end
end

# N, alpha, and the host arrays x and y are assumed to be defined already
x = adapt(backend, x)
y = adapt(backend, y)

for i in 1:11
    @time JACC.parallel_for(backend, N, axpy, alpha, x, y)
end

# Copy back to host
x = adapt(ThreadsBackend(), x)
y = adapt(ThreadsBackend(), y)

So the difference is whether the backend is set through preferences or in a setup.jl. The preferences solution breaks precompilation with the current API (see #53). I don't know how else to resolve that.

@michel2323
Contributor Author

The difference between MPI and the GPU backends is that MPI has the same API across all implementations and the same array types are passed in. For the GPUs that's different.

@williamfgc
Collaborator

williamfgc commented Apr 4, 2024

> The difference between MPI and the GPU backends is that MPI has the same API across all implementations and the same array types are passed in. For the GPUs that's different.

Yeah, that's the goal of JACC. Users should not interact with back ends (at most minimally like it's done today with Preferences). "JACC-aware" MPI would be a noble goal, though.

@michel2323
Contributor Author

michel2323 commented Apr 5, 2024

Another stab at it. This defines a default_backend.

using JACC
# "Default backend is ThreadsBackend()"
println_default_backend()
using CUDA
# "Default backend is CUDABackend()"
println_default_backend()

And then there are parallel methods that pass this down. Of course, if multiple GPU packages are loaded by the user, this will pick whatever extension was compiled last.

Sorry, I really don't know how else to resolve the precompilation issue with Preferences. You can't redefine a method with the same arguments.
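
One way to picture the default-backend mechanism described here is a mutable slot that holds the current default, plus a backend-free method that forwards to the backend-specific one. The sketch below is only an illustration under that assumption: every name in it (`SketchFakeGPUBackend`, `DEFAULT_BACKEND`, `set_default_backend!`, `sketch_parallel_for`) is hypothetical and not taken from the PR, and the "GPU" backend simply runs a serial loop.

```julia
# Hypothetical names throughout; this is not JACC's code.
abstract type SketchBackend end
struct SketchThreadsBackend <: SketchBackend end
struct SketchFakeGPUBackend <: SketchBackend end  # stands in for a GPU extension's backend

# Mutable slot holding the current default; a package extension could overwrite it on load.
const DEFAULT_BACKEND = Ref{SketchBackend}(SketchThreadsBackend())
default_backend() = DEFAULT_BACKEND[]
set_default_backend!(b::SketchBackend) = (DEFAULT_BACKEND[] = b; b)

# Backend-specific methods; each returns a tag so the chosen path is visible.
function sketch_parallel_for(::SketchThreadsBackend, N::Integer, f, args...)
    Threads.@threads for i in 1:N
        f(i, args...)
    end
    return :threads
end

function sketch_parallel_for(::SketchFakeGPUBackend, N::Integer, f, args...)
    foreach(i -> f(i, args...), 1:N)
    return :fake_gpu
end

# Backend-free method: forwards to whatever the default currently is.
sketch_parallel_for(N::Integer, f, args...) =
    sketch_parallel_for(default_backend(), N, f, args...)

function set_one(i, x)
    @inbounds x[i] = 1.0
end

x = zeros(4)
tag1 = sketch_parallel_for(4, set_one, x)          # uses the initial default
set_default_backend!(SketchFakeGPUBackend())       # e.g. what loading an extension might do
tag2 = sketch_parallel_for(4, set_one, x)          # same call, different backend
```

Since the backend-free method is defined once and only the stored value changes, no method has to be redefined when another backend package is loaded. The last package to set the slot wins, which matches the caveat above about multiple GPU packages.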

@michel2323
Contributor Author

And now with Preferences support too. So the breaking change is that JACC.Array is gone. That is still the difficult bit as you cannot dispatch on JACC.Array with all backends and have precompilation working.

@williamfgc
Collaborator

@michel2323 thanks, see discussion in #53 . I am asking @PhilipFackler how to reproduce the error as it's not showing in the current CI. I'd rather keep the public API as simple as possible since back ends can be handled internally and weak dependencies should provide the desired separation.

@williamfgc
Collaborator

Ideally users should not deal with any detail in the code other than memory allocation and parallel_for and parallel_reduce. Otherwise, there is little advantage in using JACC if the programming model is not that simple (even adapt is too complex for end-users). Today, it works like this:

# Using CUDA triggers weak dependencies JACCCUDA and must match LocalPreferences.toml
using CUDA # the code should work just fine on CPU without this line 
using JACC # I don't know if there is a good way to just import a back end here (e.g. CUDA, AMDGPU, etc.)

function axpy(i, alpha, x, y)
    if i <= length(x)
        @inbounds x[i] += alpha * y[i]
    end
end

x = JACC.Array(round.(rand(Float32, N) * 100))
y = JACC.Array(round.(rand(Float32, N) * 100))
alpha = 2.5

for i in 1:11
    @time JACC.parallel_for(N, axpy, alpha, x, y)
end

# Copy to host...perhaps implement JACC.to_host(x) to avoid deep copies on CPU host and device
x_h = Array(x)
y_h = Array(y)

@PhilipFackler
Collaborator

Subsumed by #123


3 participants