Add blog post on CUDA.jl 5.3 and 5.3. (#42)

JuliaGPU · Apr 26, 2024 · 95dcd9c · 95dcd9c
1 parent 0f3b777
commit 95dcd9c
Show file tree

Hide file tree

Showing 2 changed files with 213 additions and 5 deletions.
diff --git a/post/2023-11-07-cuda_5.1.md b/post/2023-11-07-cuda_5.1.md
@@ -10,11 +10,6 @@ abstract = """
 
 {{abstract}}
 
-# CUDA.jl 5.1: Unified memory and cooperative groups
-
-CUDA.jl 5.1 greatly improves the support of two important parts of the CUDA toolkit: unified
-memory, for accessing GPU memory on the CPU and vice-versa, and cooperative groups which
-offer a more modular approach to kernel programming.
 
 ## Unified memory
 

diff --git a/post/2024-04-26-cuda_5.2_5.3.md b/post/2024-04-26-cuda_5.2_5.3.md
@@ -0,0 +1,213 @@
++++
+title = "CUDA.jl 5.2 and 5.3: Maintenance releases"
+author = "Tim Besard"
+external = true
+abstract = """
+  CUDA.jl 5.2 and 5.3 are two minor release of CUDA.jl that mostly focus on bug
+  fixes and minor improvements, but also come with a number of interesting new
+  features. This blog post summarizes the changes in these releases."""
++++
+
+{{abstract}}
+
+
+## Profiler improvements
+
+CUDA.jl 5.1 introduced a new native profiler, which can be used to profile Julia
+GPU applications without having to use NSight Systems or other external tools.
+The tool has seen continued development, mostly improving its robustness, but
+CUDA.jl now also provides a `@bprofile` equivalent that runs your application
+multiple times and reports on the time distribution of individual events:
+
+```julia-repl
+julia> CUDA.@bprofile CuArray([1]) .+ 1
+Profiler ran for 1.0 s, capturing 1427349 events.
+
+Host-side activity: calling CUDA APIs took 792.95 ms (79.29% of the trace)
+┌──────────┬────────────┬────────┬───────────────────────────────────────┬─────────────────────────┐
+│ Time (%) │ Total time │  Calls │ Time distribution                     │ Name                    │
+├──────────┼────────────┼────────┼───────────────────────────────────────┼─────────────────────────┤
+│   19.27% │  192.67 ms │ 109796 │   1.75 µs ± 10.19  (  0.95 ‥ 1279.83) │ cuMemAllocFromPoolAsync │
+│   17.08% │   170.8 ms │  54898 │   3.11 µs ± 0.27   (  2.15 ‥ 23.84)   │ cuLaunchKernel          │
+│   16.77% │  167.67 ms │  54898 │   3.05 µs ± 0.24   (  0.48 ‥ 16.69)   │ cuCtxSynchronize        │
+│   14.11% │  141.12 ms │  54898 │   2.57 µs ± 0.79   (  1.67 ‥ 70.57)   │ cuMemcpyHtoDAsync       │
+│    1.70% │   17.04 ms │  54898 │ 310.36 ns ± 132.89 (238.42 ‥ 5483.63) │ cuStreamSynchronize     │
+└──────────┴────────────┴────────┴───────────────────────────────────────┴─────────────────────────┘
+
+Device-side activity: GPU was busy for 87.38 ms (8.74% of the trace)
+┌──────────┬────────────┬───────┬───────────────────────────────────────┬────────────────────┐
+│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name               │
+├──────────┼────────────┼───────┼───────────────────────────────────────┼────────────────────┤
+│    6.66% │   66.61 ms │ 54898 │   1.21 µs ± 0.16   (  0.95 ‥ 1.67)    │ kernel             │
+│    2.08% │   20.77 ms │ 54898 │ 378.42 ns ± 147.66 (238.42 ‥ 1192.09) │ [copy to device]   │
+└──────────┴────────────┴───────┴───────────────────────────────────────┴────────────────────┘
+
+NVTX ranges:
+┌──────────┬────────────┬───────┬────────────────────────────────────────┬─────────────────────┐
+│ Time (%) │ Total time │ Calls │ Time distribution                      │ Name                │
+├──────────┼────────────┼───────┼────────────────────────────────────────┼─────────────────────┤
+│   98.99% │  989.94 ms │ 54898 │  18.03 µs ± 49.88  ( 15.26 ‥ 10731.22) │ @bprofile.iteration │
+└──────────┴────────────┴───────┴────────────────────────────────────────┴─────────────────────┘
+```
+
+By default, `CUDA.@bprofile` runs the application for 1 second, but this can be
+adjusted using the `time` keyword argument.
+
+Display of the time distribution isn't limited to `CUDA.@bprofile`, and will
+also be used by `CUDA.@profile` when any operation is called more than once. For
+example, with the broadcasting example from above we allocate both the input
+`CuArray` and the broadcast result, which results in two calls to the allocator:
+
+```julia-repl
+julia> CUDA.@profile CuArray([1]) .+ 1
+
+Host-side activity:
+┌──────────┬────────────┬───────┬─────────────────────────────────────┬─────────────────────────┐
+│ Time (%) │ Total time │ Calls │ Time distribution                   │ Name                    │
+├──────────┼────────────┼───────┼─────────────────────────────────────┼─────────────────────────┤
+│   99.92% │   99.42 ms │     1 │                                     │ cuMemcpyHtoDAsync       │
+│    0.02% │   21.22 µs │     2 │  10.61 µs ± 6.57   (  5.96 ‥ 15.26) │ cuMemAllocFromPoolAsync │
+│    0.02% │   17.88 µs │     1 │                                     │ cuLaunchKernel          │
+│    0.00% │  953.67 ns │     1 │                                     │ cuStreamSynchronize     │
+└──────────┴────────────┴───────┴─────────────────────────────────────┴─────────────────────────┘
+```
+
+
+## Kernel launch debugging
+
+A common issue with CUDA programming is that kernel launches may fail when
+exhausting certain resources, such as shared memory or registers. This typically
+results in a cryptic error message, but CUDA.jl will now try to diagnose launch
+failures and provide a more helpful error message, as suggested by
+[@simonbyrne](https://github.com/simonbyrne):
+
+For example, when using more parameter memory than allowed by the architecture:
+
+```julia-repl
+julia> kernel(x) = nothing
+julia> @cuda kernel(ntuple(_->UInt64(1), 2^13))
+ERROR: Kernel invocation uses too much parameter memory.
+64.016 KiB exceeds the 31.996 KiB limit imposed by sm_89 / PTX v8.2.
+```
+
+Or when using an invalid launch configuration, violating a device limit:
+
+```julia-repl
+julia> @cuda threads=2000 identity(nothing)
+ERROR: Number of threads in x-dimension exceeds device limit (2000 > 1024).
+caused by: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
+```
+
+We also diagnose launch failures that involve kernel-specific limits, such as
+exceeding the number of threads that are allowed in a block (e.g., because of
+register use):
+
+```julia-repl
+julia> @cuda threads=1024 heavy_kernel()
+ERROR: Number of threads per block exceeds kernel limit (1024 > 512).
+caused by: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
+```
+
+
+## Sorting improvements
+
+Thanks to [@xaellison](https://github.com/xaellison), our bitonic sorting
+implementation now supports sorting specific dimensions, making it possible to
+implement `sortperm` for multi-dimensional arrays:
+
+```julia-repl
+julia> A = cu([8 7; 5 6])
+2×2 CuArray{Int64, 2, Mem.DeviceBuffer}:
+ 8  7
+ 5  6
+
+julia> sortperm(A, dims = 1)
+2×2 CuArray{Int64, 2, Mem.DeviceBuffer}:
+ 2  4
+ 1  3
+
+julia> sortperm(A, dims = 2)
+2×2 CuArray{Int64, 2, Mem.DeviceBuffer}:
+ 3  1
+ 2  4
+```
+
+The bitonic kernel is now used for all sorting operations, in favor of the often
+slower quicksort implementation:
+
+```julia-repl
+# before (quicksort)
+julia> @btime CUDA.@sync sort($(CUDA.rand(1024, 1024)); dims=1)
+  2.760 ms (30 allocations: 1.02 KiB)
+
+# after (bitonic sort)
+julia> @btime CUDA.@sync sort($(CUDA.rand(1024, 1024)); dims=1)
+  246.386 μs (567 allocations: 13.66 KiB)
+
+# reference CPU time
+julia> @btime sort($(rand(Float32, 1024, 1024)); dims=1)
+  4.795 ms (1030 allocations: 5.07 MiB)
+```
+
+
+## Unified memory fixes
+
+CUDA.jl 5.1 greatly improved support for unified memory, and this has continued
+in CUDA.jl 5.2 and 5.3. Most notably, when broadcasting `CuArray`s we now
+correctly preserve the memory type of the input arrays. This means that if you
+broadcast a `CuArray` that is allocated as unified memory, the result will also
+be allocated as unified memory. In case of a conflict, e.g. broadcasting a
+unified `CuArray` with one backed by device memory, we will prefer unified
+memory:
+
+```julia-repl
+julia> cu([1]; host=true) .+ 1
+1-element CuArray{Int64, 1, Mem.HostBuffer}:
+ 2
+
+julia> cu([1]; host=true) .+ cu([2]; device=true)
+1-element CuArray{Int64, 1, Mem.UnifiedBuffer}:
+ 3
+```
+
+
+## Software updates
+
+Finally, we also did routine updates of the software stack, support the latest
+and greatest by NVIDIA. This includes support for **CUDA 12.4** (Update 1),
+**cuDNN 9**, and **cuTENSOR 2.0**. This latest release of cuTENSOR is noteworthy
+as it revamps the API in a backwards-incompatible way, and CUDA.jl has opted to
+follow this change. For more details, refer to the [cuTENSOR 2 migration
+guide](https://docs.nvidia.com/cuda/cutensor/latest/api_transition.html) by
+NVIDIA.
+
+Of course, cuTENSOR.jl also provides a high-level Julia API which has been
+mostly unaffected by these changes:
+
+```julia
+using CUDA
+A = CUDA.rand(7, 8, 3, 2)
+B = CUDA.rand(3, 2, 2, 8)
+C = CUDA.rand(3, 3, 7, 2)
+
+using cuTENSOR
+tA = CuTensor(A, ['a', 'f', 'b', 'e'])
+tB = CuTensor(B, ['c', 'e', 'd', 'f'])
+tC = CuTensor(C, ['b', 'c', 'a', 'd'])
+
+using LinearAlgebra
+mul!(tC, tA, tB)
+```
+
+This API is still quite underdeveloped, so if you are a user of cuTENSOR.jl and
+have to adapt to the new API, now is a good time to consider improving the
+high-level interface instead!
+
+
+## Future releases
+
+The next release of CUDA.jl is gearing up to be a much larger release, with
+significant changes to both the API and internals of the package. Although the
+intent is to keep these changes non-breaking, it is always possible that some
+code will be affected in unexpected ways, so we encourage users to test the
+upcoming release by simply running `] add CUDA#master` and report any issues.