diff --git a/post/2023-11-07-cuda_5.1.md b/post/2023-11-07-cuda_5.1.md
index b299168..8df48de 100644
--- a/post/2023-11-07-cuda_5.1.md
+++ b/post/2023-11-07-cuda_5.1.md
@@ -10,11 +10,6 @@ abstract = """
 
 {{abstract}}
 
-# CUDA.jl 5.1: Unified memory and cooperative groups
-
-CUDA.jl 5.1 greatly improves the support of two important parts of the CUDA toolkit: unified
-memory, for accessing GPU memory on the CPU and vice-versa, and cooperative groups which
-offer a more modular approach to kernel programming.
 
 ## Unified memory
 
diff --git a/post/2024-04-26-cuda_5.2_5.3.md b/post/2024-04-26-cuda_5.2_5.3.md
new file mode 100644
index 0000000..32d1ada
--- /dev/null
+++ b/post/2024-04-26-cuda_5.2_5.3.md
@@ -0,0 +1,213 @@
+++
title = "CUDA.jl 5.2 and 5.3: Maintenance releases"
author = "Tim Besard"
external = true
abstract = """
  CUDA.jl 5.2 and 5.3 are two minor releases of CUDA.jl that mostly focus on
  bug fixes and minor improvements, but also come with a number of interesting
  new features. This blog post summarizes the changes in these releases."""
+++

{{abstract}}


## Profiler improvements

CUDA.jl 5.1 introduced a new native profiler, which can be used to profile Julia
GPU applications without having to use Nsight Systems or other external tools.
The tool has seen continued development, mostly to improve its robustness, and
CUDA.jl now also provides `CUDA.@bprofile`, a benchmarking counterpart of
`CUDA.@profile` that runs your application multiple times and reports on the
time distribution of individual events:

```julia-repl
julia> CUDA.@bprofile CuArray([1]) .+ 1
Profiler ran for 1.0 s, capturing 1427349 events.

Host-side activity: calling CUDA APIs took 792.95 ms (79.29% of the trace)
┌──────────┬────────────┬────────┬───────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls  │ Time distribution                     │ Name                    │
├──────────┼────────────┼────────┼───────────────────────────────────────┼─────────────────────────┤
│   19.27% │  192.67 ms │ 109796 │   1.75 µs ± 10.19  (  0.95 ‥ 1279.83) │ cuMemAllocFromPoolAsync │
│   17.08% │   170.8 ms │  54898 │   3.11 µs ± 0.27   (  2.15 ‥ 23.84)   │ cuLaunchKernel          │
│   16.77% │  167.67 ms │  54898 │   3.05 µs ± 0.24   (  0.48 ‥ 16.69)   │ cuCtxSynchronize        │
│   14.11% │  141.12 ms │  54898 │   2.57 µs ± 0.79   (  1.67 ‥ 70.57)   │ cuMemcpyHtoDAsync       │
│    1.70% │   17.04 ms │  54898 │ 310.36 ns ± 132.89 (238.42 ‥ 5483.63) │ cuStreamSynchronize     │
└──────────┴────────────┴────────┴───────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 87.38 ms (8.74% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name               │
├──────────┼────────────┼───────┼───────────────────────────────────────┼────────────────────┤
│    6.66% │   66.61 ms │ 54898 │   1.21 µs ± 0.16   (  0.95 ‥ 1.67)    │ kernel             │
│    2.08% │   20.77 ms │ 54898 │ 378.42 ns ± 147.66 (238.42 ‥ 1192.09) │ [copy to device]   │
└──────────┴────────────┴───────┴───────────────────────────────────────┴────────────────────┘

NVTX ranges:
┌──────────┬────────────┬───────┬────────────────────────────────────────┬─────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                      │ Name                │
├──────────┼────────────┼───────┼────────────────────────────────────────┼─────────────────────┤
│   98.99% │  989.94 ms │ 54898 │  18.03 µs ± 49.88  ( 15.26 ‥ 10731.22) │ @bprofile.iteration │
└──────────┴────────────┴───────┴────────────────────────────────────────┴─────────────────────┘
```

By default, `CUDA.@bprofile` runs the application for 1 second, but this can be
adjusted using the `time` keyword argument.
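For example, to profile for a tenth of a second instead, a minimal sketch based
on the `time` keyword described above (profiler output elided):

```julia
using CUDA

# run the benchmark loop for 0.1 s instead of the default 1 s
CUDA.@bprofile time=0.1 CuArray([1]) .+ 1
```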
Display of the time distribution isn't limited to `CUDA.@bprofile`; it is also
used by `CUDA.@profile` when any operation is called more than once. For
example, in the broadcast expression from above we allocate both the input
`CuArray` and the broadcast result, which results in two calls to the
allocator:

```julia-repl
julia> CUDA.@profile CuArray([1]) .+ 1

Host-side activity:
┌──────────┬────────────┬───────┬─────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                   │ Name                    │
├──────────┼────────────┼───────┼─────────────────────────────────────┼─────────────────────────┤
│   99.92% │   99.42 ms │     1 │                                     │ cuMemcpyHtoDAsync       │
│    0.02% │   21.22 µs │     2 │  10.61 µs ± 6.57   (  5.96 ‥ 15.26) │ cuMemAllocFromPoolAsync │
│    0.02% │   17.88 µs │     1 │                                     │ cuLaunchKernel          │
│    0.00% │  953.67 ns │     1 │                                     │ cuStreamSynchronize     │
└──────────┴────────────┴───────┴─────────────────────────────────────┴─────────────────────────┘
```


## Kernel launch debugging

A common issue with CUDA programming is that kernel launches may fail when
exhausting certain resources, such as shared memory or registers. This
typically results in a cryptic error message, but thanks to a suggestion by
[@simonbyrne](https://github.com/simonbyrne), CUDA.jl now tries to diagnose
launch failures and provide a more helpful error message.

For example, when using more parameter memory than allowed by the architecture:

```julia-repl
julia> kernel(x) = nothing

julia> @cuda kernel(ntuple(_->UInt64(1), 2^13))
ERROR: Kernel invocation uses too much parameter memory.
64.016 KiB exceeds the 31.996 KiB limit imposed by sm_89 / PTX v8.2.
```

Or when using an invalid launch configuration, violating a device limit:

```julia-repl
julia> @cuda threads=2000 identity(nothing)
ERROR: Number of threads in x-dimension exceeds device limit (2000 > 1024).
caused by: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
```

We also diagnose launch failures that involve kernel-specific limits, such as
exceeding the number of threads allowed in a block (e.g., because of register
use):

```julia-repl
julia> @cuda threads=1024 heavy_kernel()
ERROR: Number of threads per block exceeds kernel limit (1024 > 512).
caused by: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
```
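To avoid running into such kernel-specific limits in the first place, you can
query a suitable configuration up front using the occupancy API. A minimal
sketch, where the `vadd!` kernel is just a made-up example:

```julia
using CUDA

function vadd!(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

a = CUDA.rand(4096); b = CUDA.rand(4096); c = similar(a)

# compile without launching, then ask the occupancy API for a
# configuration that respects this kernel's resource limits
kernel = @cuda launch=false vadd!(c, a, b)
config = launch_configuration(kernel.fun)
threads = min(length(c), config.threads)
blocks = cld(length(c), threads)
kernel(c, a, b; threads, blocks)
```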

## Sorting improvements

Thanks to [@xaellison](https://github.com/xaellison), our bitonic sorting
implementation now supports sorting along specific dimensions, making it
possible to implement `sortperm` for multi-dimensional arrays:

```julia-repl
julia> A = cu([8 7; 5 6])
2×2 CuArray{Int64, 2, Mem.DeviceBuffer}:
 8  7
 5  6

julia> sortperm(A, dims = 1)
2×2 CuArray{Int64, 2, Mem.DeviceBuffer}:
 2  4
 1  3

julia> sortperm(A, dims = 2)
2×2 CuArray{Int64, 2, Mem.DeviceBuffer}:
 3  1
 2  4
```

The bitonic kernel is now used for all sorting operations, replacing the often
slower quicksort implementation:

```julia-repl
# before (quicksort)
julia> @btime CUDA.@sync sort($(CUDA.rand(1024, 1024)); dims=1)
  2.760 ms (30 allocations: 1.02 KiB)

# after (bitonic sort)
julia> @btime CUDA.@sync sort($(CUDA.rand(1024, 1024)); dims=1)
  246.386 μs (567 allocations: 13.66 KiB)

# reference CPU time
julia> @btime sort($(rand(Float32, 1024, 1024)); dims=1)
  4.795 ms (1030 allocations: 5.07 MiB)
```


## Unified memory fixes

CUDA.jl 5.1 greatly improved support for unified memory, and this work has
continued in CUDA.jl 5.2 and 5.3. Most notably, when broadcasting `CuArray`s we
now correctly preserve the memory type of the input arrays: if you broadcast a
`CuArray` that is allocated as unified memory, the result will also be
allocated as unified memory. In case of a conflict, e.g. when broadcasting a
unified `CuArray` together with one backed by device memory, we prefer unified
memory:

```julia-repl
julia> cu([1]; host=true) .+ 1
1-element CuArray{Int64, 1, Mem.HostBuffer}:
 2

julia> cu([1]; host=true) .+ cu([2]; device=true)
1-element CuArray{Int64, 1, Mem.UnifiedBuffer}:
 3
```


## Software updates

Finally, we also did routine updates of the software stack to support the
latest and greatest by NVIDIA. This includes support for **CUDA 12.4**
(Update 1), **cuDNN 9**, and **cuTENSOR 2.0**. This latest release of cuTENSOR
is noteworthy as it revamps the API in a backwards-incompatible way, and
CUDA.jl has opted to follow this change. For more details, refer to the
[cuTENSOR 2 migration
guide](https://docs.nvidia.com/cuda/cutensor/latest/api_transition.html) by
NVIDIA.

Of course, cuTENSOR.jl also provides a high-level Julia API, which has been
mostly unaffected by these changes:

```julia
using CUDA
A = CUDA.rand(7, 8, 3, 2)
B = CUDA.rand(3, 2, 2, 8)
C = CUDA.rand(3, 3, 7, 2)

using cuTENSOR
tA = CuTensor(A, ['a', 'f', 'b', 'e'])
tB = CuTensor(B, ['c', 'e', 'd', 'f'])
tC = CuTensor(C, ['b', 'c', 'a', 'd'])

using LinearAlgebra
mul!(tC, tA, tB)
```

This API is still quite underdeveloped, so if you are a user of cuTENSOR.jl and
have to adapt to the new API anyway, now is a good time to consider improving
the high-level interface instead!


## Future releases

The next release of CUDA.jl is gearing up to be a much larger one, with
significant changes to both the API and the internals of the package. Although
the intent is to keep these changes non-breaking, it is always possible that
some code will be affected in unexpected ways, so we encourage users to test
the upcoming release by simply running `] add CUDA#master`, and to report any
issues.
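As a sketch, such a test drive could look as follows, using the Pkg REPL mode
(entered by typing `]`); the `CUDA.versioninfo()` call simply confirms which
toolkit and driver were picked up:

```julia-repl
pkg> add CUDA#master

julia> using CUDA

julia> CUDA.versioninfo()

pkg> test CUDA
```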