Deploying to gh-pages from @ 0f3b777 🚀
maleadt committed Apr 21, 2024
1 parent df2cd9a commit 17a23d2
Showing 144 changed files with 387 additions and 11,301 deletions.
94 changes: 92 additions & 2 deletions post/2023-09-19-cuda_5.0/index.html
@@ -9,7 +9,8 @@

<link rel="stylesheet" href="/css/bootstrap.min.css">


<link rel="stylesheet" href="/libs/highlight/github.min.css">


<style>
.hljs {
@@ -148,14 +149,103 @@ <h1>CUDA.jl 5.0: Integrated profiler and task synchronization changes</h1>
<!-- Content appended here -->

<p>CUDA.jl 5.0 is a major release that adds an integrated profiler to CUDA.jl, and reworks how tasks are synchronized. The release is slightly breaking, as it changes how local toolkits are handled and raises the minimum Julia and CUDA versions.</p>
<p>This post is located at <a href="https://info.juliahub.com/cuda-jl-5-0-changes">https://info.juliahub.com/cuda-jl-5-0-changes</a></p>
<h2 id="integrated_profiler"><a href="#integrated_profiler" class="header-anchor">Integrated profiler</a></h2>
<p>The most exciting new feature in CUDA.jl 5.0 is <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2024">the new integrated profiler</a>, which is similar to the <code>@profile</code> macro from the Julia standard library. The profiler can be used by simply prefixing any code that uses the CUDA libraries with <code>CUDA.@profile</code>:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.@profile CUDA.rand&#40;1&#41;.&#43;1
Profiler ran for 268.46 µs, capturing 21 events.

Host-side activity: calling CUDA APIs took 230.79 µs &#40;85.97&#37; of the trace&#41;
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time &#40;&#37;&#41; │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│   76.47&#37; │ 205.28 µs │     1 │ 205.28 µs │ 205.28 µs │ 205.28 µs │ cudaLaunchKernel        │
│    5.42&#37; │  14.54 µs │     2 │   7.27 µs │   5.01 µs │   9.54 µs │ cuMemAllocFromPoolAsync │
│    2.93&#37; │   7.87 µs │     1 │   7.87 µs │   7.87 µs │   7.87 µs │ cuLaunchKernel          │
│    0.36&#37; │ 953.67 ns │     2 │ 476.84 ns │    0.0 ns │ 953.67 ns │ cudaGetLastError        │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 2.15 µs &#40;0.80&#37; of the trace&#41;
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬──────────────────────────────
│ Time &#40;&#37;&#41; │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                        ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼──────────────────────────────
│    0.44&#37; │   1.19 µs │     1 │   1.19 µs │   1.19 µs │   1.19 µs │ _Z13gen_sequencedI17curandS ⋯
│    0.36&#37; │ 953.67 ns │     1 │ 953.67 ns │ 953.67 ns │ 953.67 ns │ _Z16broadcast_kernel15CuKer ⋯
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴──────────────────────────────
1 column omitted
1-element CuArray&#123;Float32, 1, CUDA.Mem.DeviceBuffer&#125;:
1.7242923</code></pre>
<p>The output shown above is a summary of what happened during the execution of the code. It is split into two sections: <strong>host-side activity</strong>, i.e., API calls to the CUDA libraries, and the resulting <strong>device-side activity</strong>. For each section, the output shows the time spent and its share of the total execution time. These ratios are a quick way to assess the performance of your code. For example, in the above output, most of the time is spent on the host calling the CUDA libraries, and very little time is actually spent computing things on the GPU. This indicates that the GPU is severely underutilized, which can often be addressed by increasing the problem size.</p>
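<p>As a rough sketch of that advice, the same broadcast can be profiled on a larger array. The array size below is an arbitrary value chosen for illustration, not something taken from the release, and the reported timings will depend on your hardware:</p>
<pre><code class="language-julia">using CUDA

# Hypothetical larger input: 4096 x 4096 Float32 values instead of a single element.
a = CUDA.rand(4096, 4096)

# Run the operation once up front so that compilation is not part of the trace.
a .+ 1

# Profile the same broadcast; at the REPL this prints the host-side and device-side summaries.
CUDA.@profile a .+ 1</code></pre>
<p>With more work per kernel launch, the device-side share of the trace should grow relative to the host-side API overhead.</p>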
<p>Instead of a summary, it is also possible to view a <strong>chronological trace</strong> by passing the <code>trace&#61;true</code> keyword argument:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.@profile trace&#61;true CUDA.rand&#40;1&#41;.&#43;1;
Profiler ran for 262.98 µs, capturing 21 events.

Host-side activity: calling CUDA APIs took 227.21 µs &#40;86.40&#37; of the trace&#41;
┌────┬───────────┬───────────┬─────────────────────────┬────────────────────────┐
│ ID │     Start │      Time │ Name                    │ Details                │
├────┼───────────┼───────────┼─────────────────────────┼────────────────────────┤
│  5 │   6.44 µs │   9.06 µs │ cuMemAllocFromPoolAsync │ 4 bytes, device memory │
│  7 │  19.31 µs │ 715.26 ns │ cudaGetLastError        │ -                      │
│  8 │  22.41 µs │ 204.09 µs │ cudaLaunchKernel        │ -                      │
│  9 │ 227.21 µs │    0.0 ns │ cudaGetLastError        │ -                      │
│ 14 │  232.7 µs │   3.58 µs │ cuMemAllocFromPoolAsync │ 4 bytes, device memory │
│ 18 │ 250.34 µs │   7.39 µs │ cuLaunchKernel          │ -                      │
└────┴───────────┴───────────┴─────────────────────────┴────────────────────────┘

Device-side activity: GPU was busy for 2.38 µs &#40;0.91&#37; of the trace&#41;
┌────┬───────────┬─────────┬─────────┬────────┬──────┬────────────────────────────────────────────
│ ID │     Start │    Time │ Threads │ Blocks │ Regs │ Name                                      ⋯
├────┼───────────┼─────────┼─────────┼────────┼──────┼────────────────────────────────────────────
│  8 │ 225.31 µs │ 1.19 µs │      64 │     64 │   38 │ _Z13gen_sequencedI17curandStateXORWOWfiXa ⋯
│ 18 │ 257.73 µs │ 1.19 µs │       1 │      1 │   18 │ _Z16broadcast_kernel15CuKernelContext13Cu ⋯
└────┴───────────┴─────────┴─────────┴────────┴──────┴────────────────────────────────────────────
1 column omitted</code></pre>
<p>Here, we can see a list of events that the profiler captured. Each event has a unique ID, which can be used to correlate host-side and device-side events. For example, we can see that event 8 on the host is a call to <code>cudaLaunchKernel</code>, which corresponds to the execution of a CURAND kernel on the device.</p>
<p>The integrated profiler is a great tool to quickly assess the performance of your GPU application, identify bottlenecks, and find opportunities for optimization. For complex applications, however, it is still recommended to use NVIDIA&#39;s Nsight Systems or Nsight Compute profilers, which provide a more detailed, graphical view of what is happening on the GPU.</p>
<h2 id="synchronization_on_worker_threads"><a href="#synchronization_on_worker_threads" class="header-anchor">Synchronization on worker threads</a></h2>
<p>Another noteworthy change affects how tasks are synchronized. To enable concurrent execution, i.e., to make it possible for other Julia tasks to execute while waiting for the GPU to finish, CUDA.jl used to rely on so-called stream callbacks. These callbacks were a significant source of latency, at least 25 µs per invocation but sometimes <em>much</em> longer, and have also been slated for deprecation and eventual removal from the CUDA toolkit.</p>
<p>Instead, on Julia 1.9 and later, CUDA.jl <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2025">now uses</a> worker threads to wait for GPU operations to finish. This mechanism is significantly faster, taking around 5 µs per invocation, but more importantly offers a much more reliable and predictable latency. You can observe this mechanism using the integrated profiler:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; CUDA.rand&#40;1024, 1024, 1024&#41;
julia&gt; CUDA.@profile trace&#61;true CUDA.@sync a .&#43; a
Profiler ran for 12.29 ms, capturing 527 events.

Host-side activity: calling CUDA APIs took 11.75 ms &#40;95.64&#37; of the trace&#41;
┌─────┬───────────┬───────────┬────────┬─────────────────────────┐
│  ID │     Start │      Time │ Thread │ Name                    │
├─────┼───────────┼───────────┼────────┼─────────────────────────┤
│   5 │   6.91 µs │  13.59 µs │      1 │ cuMemAllocFromPoolAsync │
│   9 │  36.72 µs │ 199.56 µs │      1 │ cuLaunchKernel          │
│ 525 │ 510.69 µs │  11.75 ms │      2 │ cuStreamSynchronize     │
└─────┴───────────┴───────────┴────────┴─────────────────────────┘</code></pre>
<p>For some users, this may still be too slow, so we have added two mechanisms that disable nonblocking synchronization and simply block the calling thread until the GPU operation finishes. The first is a global setting: set the <code>nonblocking_synchronization</code> preference to <code>false</code> using Preferences.jl. <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2060">The second</a> is a fine-grained flag to pass to synchronization functions: <code>synchronize&#40;x; blocking&#61;true&#41;</code>, <code>CUDA.@sync blocking&#61;true ...</code>, etc. Both mechanisms should <em>not</em> be used widely, and are intended only for latency-critical code, e.g., when benchmarking or profiling.</p>
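<p>A minimal sketch of both mechanisms follows. It assumes that Preferences.jl and CUDA.jl are both dependencies of the active project; the array size is an arbitrary illustration value, and the preference is expected to take effect only after restarting Julia:</p>
<pre><code class="language-julia">using CUDA, Preferences

a = CUDA.rand(1024, 1024)

# Fine-grained: block the calling thread for this one synchronization only.
b = CUDA.@sync blocking=true a .+ a

# The same flag on the function form, here applied to the current task's stream.
synchronize(stream(); blocking=true)

# Global: store the preference for CUDA.jl in LocalPreferences.toml
# (assumption: a Julia restart is needed before CUDA.jl picks it up).
set_preferences!(CUDA, "nonblocking_synchronization" =&gt; false; force=true)</code></pre>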
<h2 id="local_toolkit_discovery"><a href="#local_toolkit_discovery" class="header-anchor">Local toolkit discovery</a></h2>
<p>One of the breaking changes involves <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2058">how local toolkits are discovered</a>, when opting out of the use of artifacts. Previously, this could be enabled by calling <code>CUDA.set_runtime_version&#33;&#40;&quot;local&quot;&#41;</code>, which generated a <code>version &#61; &quot;local&quot;</code> preference. We are now changing this into two separate preferences, <code>version</code> and <code>local</code>, where the <code>version</code> preference overrides the version of the CUDA toolkit, and the <code>local</code> preference independently indicates whether to use a local CUDA toolkit or not.</p>
<p>Concretely, this means that you will now need to call <code>CUDA.set_runtime_version&#33;&#40;local_toolkit&#61;true&#41;</code> to enable the use of a local toolkit. The toolkit version will be auto-detected, but can be overridden by also passing a version: <code>CUDA.set_runtime_version&#33;&#40;version; local_toolkit&#61;true&#41;</code>. This may be necessary when CUDA is not available during precompilation, e.g., on the log-in node of a cluster, or when building a container image.</p>
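<p>In practice, the two calls described above might be used as follows. The version number is only an example value, and as with other preference changes, a restart of Julia is assumed to be needed afterwards:</p>
<pre><code class="language-julia">using CUDA

# Use the locally installed toolkit and let CUDA.jl auto-detect its version.
CUDA.set_runtime_version!(local_toolkit=true)

# Alternatively, state the version explicitly, e.g. when no GPU is visible
# during precompilation (v"12.2" is an example, not a recommendation).
CUDA.set_runtime_version!(v"12.2"; local_toolkit=true)</code></pre>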
<h2 id="raised_minimum_requirements"><a href="#raised_minimum_requirements" class="header-anchor">Raised minimum requirements</a></h2>
<p>Finally, CUDA.jl 5.0 raises the minimum Julia and CUDA versions. The minimum Julia version is now 1.8, which should be enforced by the Julia package manager. The minimum CUDA toolkit version is now 11.4, but this cannot be enforced by the package manager. As a result, if you need to use an older version of the CUDA toolkit, you will need to pin CUDA.jl to v4.4 or below. <a href="https://github.com/JuliaGPU/CUDA.jl/blob/master/README.md">The README</a> will maintain a table of supported CUDA toolkit versions.</p>
<p>Most users will not be affected by this change: If you use the artifact-provided CUDA toolkit, you will automatically get the latest version supported by your CUDA driver.</p>
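<p>For those who do need an older toolkit, one possible way to stay on the 4.4 series is to install that version with the package manager and pin it, sketched here with the standard Pkg API:</p>
<pre><code class="language-julia">using Pkg

# Install the latest release in the 4.4 series into the active environment.
Pkg.add(name="CUDA", version="4.4")

# Pin the package so that later updates do not move it to 5.x.
Pkg.pin("CUDA")</code></pre>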
<h2 id="other_changes"><a href="#other_changes" class="header-anchor">Other changes</a></h2>
<ul>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/2034">Support for CUDA 12.2</a>;</p>
</li>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/2040">Memory limits</a> are now enforced by CUDA, resulting in better performance;</p>
</li>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/1946">Support for Julia 1.10</a> &#40;with help from <a href="https://github.com/dkarrasch">@dkarrasch</a>&#41;;</p>
</li>
<li><p>Support for batched <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1975"><code>gemm</code></a>, <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1981"><code>gemv</code></a> and <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2063"><code>svd</code></a> &#40;by <a href="https://github.com/lpawela">@lpawela</a> and <a href="https://github.com/nikopj">@nikopj</a>&#41;.</p>
</li>
</ul>
<!-- CONTENT ENDS HERE -->
</main>
</div> <!-- class="container" -->




<script src="/libs/highlight/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();hljs.configure({tabReplace: ' '});</script>



<footer id=footer class="mt-auto text-center text-muted">
<div class=container>Made with <a href=https://franklinjl.org>Franklin.jl</a> and <a href=https://julialang.org>the Julia programming language</a>.</div>