Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GPU docs #2510

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
208 changes: 109 additions & 99 deletions docs/src/guide/gpu.md
Original file line number Diff line number Diff line change
@@ -1,140 +1,150 @@
# GPU Support

Starting with v0.14, Flux doesn't force a specific GPU backend and the corresponding package dependencies on the users.
Thanks to the [package extension mechanism](https://pkgdocs.julialang.org/v1/creating-packages/#Conditional-loading-of-code-in-packages-(Extensions)) introduced in julia v1.9, Flux conditionally loads GPU specific code once a GPU package is made available (e.g. through `using CUDA`).
Most work on neural networks involves the use of GPUs, as they can typically perform the required computation much faster.
This page describes how Flux co-operates with various other packages, which talk to GPU hardware.

NVIDIA GPU support requires the packages `CUDA.jl` and `cuDNN.jl` to be installed in the environment. In the julia REPL, type `] add CUDA, cuDNN` to install them. For more details see the [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) readme.
## Basic GPU use: from `Array` to `CuArray` with `cu`

AMD GPU support is available since Julia 1.9 on systems with ROCm and MIOpen installed. For more details refer to the [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl) repository.
Julia's GPU packages work with special array types, in place of the built-in `Array`.
The most used is `CuArray` provided by [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl), for GPUs made by NVIDIA.
That package provides a function `cu` which converts an ordinary `Array` (living in CPu memory) to a `CuArray` (living in GPU memory).
Functions like `*` and broadcasting specialise so that, when given `CuArray`s, all the computation happens on the GPU:

Metal GPU acceleration is available on Apple Silicon hardware. For more details refer to the [Metal.jl](https://github.com/JuliaGPU/Metal.jl) repository. Metal support in Flux is experimental and many features are not yet available.
```julia
W = randn(3, 4) # some weights, on CPU: 3×4 Array{Float64, 2}
x = randn(4) # fake data
y = tanh.(W * x) # computation on the CPU

In order to trigger GPU support in Flux, you need to call `using CUDA`, `using AMDGPU` or `using Metal`
in your code. Notice that for CUDA, explicitly loading also `cuDNN` is not required, but the package has to be installed in the environment.
using CUDA

cu(W) isa CuArray{Float32}
(cW, cx) = (W, x) |> cu # move both to GPU
cy = tanh.(cW * cx) # computation on the GPU
```

!!! compat "Flux ≤ 0.13"
Old versions of Flux automatically installed CUDA.jl to provide GPU support. Starting from Flux v0.14, CUDA.jl is not a dependency anymore and has to be installed manually.
Notice that `cu` doesn't only move arrays, it also recurses into many structures, such as the tuple `(W, x)` above.
(Notice also that it converts Julia's default `Float64` numbers to `Float32`, as this is what most GPUs support efficiently -- it calls itself "opinionated". Flux defaults to `Float32` in all cases.)

To use CUDA with Flux, you can simply use `cu` to move both the model, and the data.
It will create a copy of the Flux model, with all of its parameter arrays moved to the GPU:

## Basic GPU Usage
```julia
using Pkg; Pkg.add(["CUDA", "cuDNN"]) # do this once

Support for array operations on other hardware backends, like GPUs, is provided by external packages like [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl), [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl), and [Metal.jl](https://github.com/JuliaGPU/Metal.jl).
Flux is agnostic to array types, so we simply need to move model weights and data to the GPU and Flux will handle it.
using Flux, CUDA
CUDA.allowscalar(false) # recommended

For example, we can use `CUDA.CuArray` (with the `CUDA.cu` converter) to run our [basic example](@ref man-basics) on an NVIDIA GPU.
model = Dense(W, true, tanh) # wrap the same matrix W in a Flux layer
model(x) ≈ y # same result, still on CPU

(Note that you need to have CUDA available to use CUDA.CuArray – please see the [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) instructions for more details.)
c_model = cu(model) # move all the arrays within model to the GPU
c_model(cx) # computation on the GPU
```

```julia
using CUDA
Notice that you need `using CUDA` (every time) but also `] add cuDNN` (once, when installing packages).
This is a quirk of how these packages are set up.
(The [`cuDNN.jl`](https://github.com/JuliaGPU/CUDA.jl/tree/master/lib/cudnn) sub-package handles operations such as convolutions, called by Flux via [NNlib.jl](https://github.com/FluxML/NNlib.jl).)

W = cu(rand(2, 5)) # a 2×5 CuArray
b = cu(rand(2))
Flux's `gradient`, and training functions like `setup`, `update!`, and `train!`, are all equally happy to accept GPU arrays and GPU models, and then perform all computations on the GPU.
It is recommended that you move the model to the GPU before calling `setup`.

predict(x) = W*x .+ b
loss(x, y) = sum((predict(x) .- y).^2)
```julia
grads = Flux.gradient((f,x) -> sum(abs2, f(x)), model, x) # on CPU
c_grads = Flux.gradient((f,x) -> sum(abs2, f(x)), c_model, cx) # same result, all on GPU

x, y = cu(rand(5)), cu(rand(2)) # Dummy data
loss(x, y) # ~ 3
```
c_opt = Flux.setup(Adam(), c_model) # setup optimiser after moving model to GPU

Note that we convert both the parameters (`W`, `b`) and the data set (`x`, `y`) to cuda arrays. Taking derivatives and training works exactly as before.
Flux.update!(c_opt, c_model, c_grads[1]) # mutates c_model but not model
```

If you define a structured model, like a `Dense` layer or `Chain`, you just need to convert the internal parameters. Flux provides `fmap`, which allows you to alter all parameters of a model at once.
To move arrays and other objects back to the CPU, Flux provides a function `cpu`.
This is recommended when saving models, `Flux.state(c_model |> cpu)`, see below.

```julia
d = Dense(10 => 5, σ)
d = fmap(cu, d)
d.weight # CuArray
d(cu(rand(10))) # CuArray output

m = Chain(Dense(10 => 5, σ), Dense(5 => 2), softmax)
m = fmap(cu, m)
m(cu(rand(10)))
cpu(cW) isa Array{Float32, 2}

model2 = cpu(c_model) # copy model back to CPU
model2(x)
```

As a convenience, Flux provides the `gpu` function to convert models and data to the GPU if one is available. By default, it'll do nothing. So, you can safely call `gpu` on some data or model (as shown below), and the code will not error, regardless of whether the GPU is available or not. If a GPU library (e.g. CUDA) loads successfully, `gpu` will move data from the CPU to the GPU. As is shown below, this will change the type of something like a regular array to a `CuArray`.
!!! compat "Flux ≤ 0.13"
Old versions of Flux automatically loaded CUDA.jl to provide GPU support. Starting from Flux v0.14, it has to be loaded separately. Julia's [package extensions](https://pkgdocs.julialang.org/v1/creating-packages/#Conditional-loading-of-code-in-packages-(Extensions)) allow Flux to automatically load some GPU-specific code when needed.
Comment on lines +70 to +71
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
!!! compat "Flux ≤ 0.13"
Old versions of Flux automatically loaded CUDA.jl to provide GPU support. Starting from Flux v0.14, it has to be loaded separately. Julia's [package extensions](https://pkgdocs.julialang.org/v1/creating-packages/#Conditional-loading-of-code-in-packages-(Extensions)) allow Flux to automatically load some GPU-specific code when needed.


## Other GPU packages for AMD & Apple

Non-NVIDIA graphics cards are supported by other packages. Each provides its own function which behaves like `cu`.
AMD GPU support provided by [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl), on systems with ROCm and MIOpen installed.
This package has a function `roc` which converts `Array` to `ROCArray`:

```julia
julia> using Flux, CUDA

julia> m = Dense(10, 5) |> gpu
Dense(10 => 5) # 55 parameters

julia> x = rand(10) |> gpu
10-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.066846445
0.76706964

julia> m(x)
5-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
-0.99992573
-0.547261
using Flux, AMDGPU
AMDGPU.allowscalar(false)

r_model = roc(model)
r_model(roc(x))

Flux.gradient((f,x) -> sum(abs2, f(x)), r_model, roc(x))
```

The analogue `cpu` is also available for moving models and data back off of the GPU.
Experimental support for Apple devices with M-series chips is provided by [Metal.jl](https://github.com/JuliaGPU/Metal.jl). This has a function [`mtl`](https://metal.juliagpu.org/stable/api/array/#Metal.mtl) which works like `cu`, converting `Array` to `MtlArray`:

```julia
julia> x = rand(10) |> gpu
10-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.8019236
0.7766742

julia> x |> cpu
10-element Vector{Float32}:
0.8019236
0.7766742
```
using Flux, Metal
Metal.allowscalar(false)

## Using device objects
m_model = mtl(model)
m_y = m_model(mtl(x))

In Flux, you can create `device` objects which can be used to easily transfer models and data to GPUs (and defaulting to using the CPU if no GPU backend is available).
These features are provided by [MLDataDevices.jl](https://github.com/LuxDL/MLDataDevices.jl) package, that Flux uses internally and re-exports.
Flux.gradient((f,x) -> sum(abs2, f(x)), m_model, mtl(x))
```

Device objects can be automatically created using the [`cpu_device`](@ref MLDataDevices.cpu_device) and [`gpu_device`](@ref MLDataDevices.gpu_device) functions. For instance, the `gpu` and `cpu` functions are just convenience functions defined as
!!! danger "Experimental"
Metal support in Flux is experimental and many features are not yet available.
AMD support is improving, but likely to have more rough edges than CUDA.

If you want your model to work with any brand of GPU, or none, then you may not wish to write `cu` everywhere.
One simple way to be generic is, at the top of the file, to un-comment one of several lines which import a package and assign its "adaptor" to the same name:

```julia
cpu(x) = cpu_device()(x)
gpu(x) = gpu_device()(x)
```
using CUDA: cu as device # after this, `device === cu`
# using AMDGPU: roc as device
# device = identity # do-nothing, for CPU

`gpu_device` performs automatic GPU device selection and returns a device object:
- If no GPU is available, it returns a `CPUDevice` object.
- If a LocalPreferences file is present, then the backend specified in the file is used. To set a backend, use `Flux.gpu_backend!(<backend_name>)`. If the trigger package corresponding to the device is not loaded (e.g. with `using CUDA`), then a warning is displayed.
- If no LocalPreferences option is present, then the first working GPU with loaded trigger package is used.
using Flux
model = Chain(...) |> device
```

Consider the following example, where we load the [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) package to use an NVIDIA GPU (`"CUDA"` is the default preference):
!!! note "Adapt.jl"
The functions `cu`, `mtl`, `roc` all use [Adapt.jl](https://github.com/JuliaGPU/Adapt.jl), to work within various wrappers.
The reason they work on Flux models is that `Flux.@layer Layer` defines methods of `Adapt.adapt_structure(to, lay::Layer)`.

```julia-repl
julia> using Flux, CUDA;

julia> device = gpu_device() # returns handle to an NVIDIA GPU if available
(::CUDADevice{Nothing}) (generic function with 4 methods)
## Automatic GPU choice with `gpu`

julia> model = Dense(2 => 3);
Flux also provides a more automatic way of choosing which GPU (or none) to use. This is the function `gpu`:
* By default it does nothing.
* If the package CUDA is loaded, and `CUDA.functional() === true`, then it behaves like `cu`.
* If the package AMDGPU is loaded, and `AMDGPU.functional() === true`, then it behaves like `roc`.
* If the package Metal is loaded, and `Metal.functional() === true`, then it behaves like `mtl`.
* If two differnet GPU packages are loaded, the first one takes priority.

julia> model.weight # the model initially lives in CPU memory
3×2 Matrix{Float32}:
-0.984794 -0.904345
0.720379 -0.486398
0.851011 -0.586942
For the most part, this means that a script which says `model |> gpu` and `data |> gpu` will just work.
It should always run, and if a GPU package is loaded (and finds the correct hardware) then that will be used.

julia> model = model |> device # transfer model to the GPU
Dense(2 => 3) # 9 parameters
The function `gpu` uses a lower-level function called `get_device()` from [MLDataDevices.jl](https://github.com/LuxDL/MLDataDevices.jl),
which checks what to do & then returns some device object. In fact, the entire implementation is just this:

julia> model.weight
3×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
-0.984794 -0.904345
0.720379 -0.486398
0.851011 -0.586942
```julia
gpu(x) = gpu_device()(x)
cpu(x) = cpu_device()(x)
```


## Manually selecting devices

I thought there was a whole `Flux.gpu_backend!` and Preferences.jl story we had to tell??
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gpu_backend! affects the return from gpu_device like this:

  • If no GPU is available, it returns a CPUDevice object.
  • If a LocalPreferences file is present, then the backend specified in the file is used. If the trigger package corresponding to the device is not loaded, then a warning is displayed.
  • If no LocalPreferences file is present, then the first working GPU with loaded trigger package is used.

This is already described in the docstring of gpu_device. I think we shouldn't mention gpu_backend! at all in this guide because it is useless in practice.



## Transferring Training Data

In order to train the model using the GPU both model and the training data have to be transferred to GPU memory. Moving the data can be done in two different ways:
Expand Down Expand Up @@ -178,7 +188,7 @@ In order to train the model using the GPU both model and the training data have

## Saving GPU-Trained Models

After the training process is done, one must always transfer the trained model back to the `cpu` memory scope before serializing or saving to disk. This can be done, as described in the previous section, with:
After the training process is done, we must always transfer the trained model back to the CPU memory before serializing or saving to disk. This can be done with `cpu`:
```julia
model = cpu(model) # or model = model |> cpu
```
Expand Down Expand Up @@ -280,11 +290,11 @@ Due to a limitation in `Metal.jl`, currently this kind of data movement across d
## Distributed data parallel training

!!! danger "Experimental"

Distributed support is experimental and could change in the future.


Flux supports now distributed data parallel training with `DistributedUtils` module.
Flux supports now distributed data parallel training with `DistributedUtils` module.
If you want to run your code on multiple GPUs, you have to install `MPI.jl` (see [docs](https://juliaparallel.org/MPI.jl/stable/usage/) for more info).

```julia-repl
Expand Down Expand Up @@ -352,7 +362,7 @@ DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(
julia> st_opt = Optimisers.setup(opt, model)
(layers = ((weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0; 0.0; … ; 0.0; 0.0;;], Float32[0.0; 0.0; … ; 0.0; 0.0;;], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], (0.9, 0.999))), σ = ()), (weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0 0.0 … 0.0 0.0], Float32[0.0 0.0 … 0.0 0.0], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0], Float32[0.0], (0.9, 0.999))), σ = ())),)

julia> st_opt = DistributedUtils.synchronize!!(backend, st_opt; root=0)
julia> st_opt = DistributedUtils.synchronize!!(backend, st_opt; root=0)
(layers = ((weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0; 0.0; … ; 0.0; 0.0;;], Float32[0.0; 0.0; … ; 0.0; 0.0;;], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], (0.9, 0.999))), σ = ()), (weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0 0.0 … 0.0 0.0], Float32[0.0 0.0 … 0.0 0.0], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0], Float32[0.0], (0.9, 0.999))), σ = ())),)
```

Expand All @@ -376,7 +386,7 @@ Epoch 3: Loss 0.012763695
Remember that in order to run it on multiple GPUs you have to run from CLI `mpiexecjl --project=. -n <np> julia <filename>.jl`,
where `<np>` is the number of processes that you want to use. The number of processes usually corresponds to the number of gpus.

By default `MPI.jl` MPI installation is CUDA-unaware so if you want to run it in CUDA-aware mode, read more [here](https://juliaparallel.org/MPI.jl/stable/usage/#CUDA-aware-MPI-support) on custom installation and rebuilding `MPI.jl`.
By default `MPI.jl` MPI installation is CUDA-unaware so if you want to run it in CUDA-aware mode, read more [here](https://juliaparallel.org/MPI.jl/stable/usage/#CUDA-aware-MPI-support) on custom installation and rebuilding `MPI.jl`.
Then test if your MPI is CUDA-aware by
```julia-repl
julia> import Pkg
Expand All @@ -390,7 +400,7 @@ julia> set_preferences!("Flux", "FluxDistributedMPICUDAAware" => true)
```

!!! warning "Known shortcomings"

We don't run CUDA-aware tests so you're running it at own risk.


Expand Down Expand Up @@ -424,4 +434,4 @@ julia> using Metal

julia> Metal.functional()
true
```
```
Loading