
A6000 Ada benchmark results and thoughts #43

Open
classner opened this issue Dec 30, 2024 · 2 comments

@classner

Hi @charles-r-earp!

Thanks for this amazing project! This 'issue' is meant more as a conversation than a request to fix a specific thing. I just tested krnl on an A6000 Ada and thought I'd share some findings and thoughts.

The following are the timings I got comparing its performance with CUDA:

[Screenshot 2024-12-29: table of benchmark timings comparing krnl and CUDA for malloc, upload, download, zeroing, and SAXPY at various buffer sizes]

(I marked cells with up to a 2x slowdown in yellow, and anything above that in red.)

  1. Clearly, the malloc operation benefits from a very different implementation (agreeing with your findings in the project README). It seems similar to the OpenCL implementation. Does the CUDA interface here use an inefficient allocator? Either way, this performance is great!
  2. The upload performance is competitive across the range tested! The 10M range seems not ideal, but I assume a bit of tweaking could get it there!
  3. The download performance really breaks down here. Do you have ideas on why that could be? I looked at the implementation in the engine file and saw that it uses chunked downloading - what do you think is happening here? And what could CUDA be doing better?
  4. Zeroing is nearly the same for the 64M element test, but much slower for fewer elements. Looking at the implementation, it appears to be a kernel call that sets individual elements. I assume CUDA uses an intrinsic to zero the memory, which makes this a special case.
  5. SAXPY again approaches similar performance at 64M elements, yet is much slower below that. What could be happening here? Bad work group sizes? Generally more kernel-call overhead? It does stand out that krnl consistently never reaches as low a time for any kernel call as CUDA does. (For reference, a sketch of the benchmarked kernel follows this list.)
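
For reference, the SAXPY being timed is just an element-wise kernel. A minimal version, paraphrased from memory from the saxpy example in the krnl README (exact signatures and builder calls may differ slightly from what I actually ran):

```rust
use krnl::{anyhow::Result, buffer::Buffer, device::Device, macros::module};

#[module]
mod kernels {
    // Re-export needed when compiling for the host rather than SPIR-V.
    #[cfg(not(target_arch = "spirv"))]
    use krnl::krnl_core;
    use krnl_core::macros::kernel;

    // Item kernel: one invocation per element, y[i] += alpha * x[i].
    #[kernel]
    pub fn saxpy(alpha: f32, #[item] x: f32, #[item] y: &mut f32) {
        *y += alpha * x;
    }
}

fn main() -> Result<()> {
    let device = Device::builder().build()?;
    let n = 64 * 1024 * 1024;
    let x = Buffer::from(vec![1f32; n]).into_device(device.clone())?;
    let mut y = Buffer::from(vec![0f32; n]).into_device(device.clone())?;
    // Each dispatch records a command buffer and submits it to the device;
    // that fixed per-call cost is what dominates at small sizes.
    kernels::saxpy::builder()?
        .build(device.clone())?
        .dispatch(2f32, x.as_slice(), y.as_slice_mut())?;
    println!("{:?}", &y.into_vec()?[..4]);
    Ok(())
}
```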

What do you think?

@charles-r-earp
Owner

Hello @classner,

Thanks for the data and the feedback.

  1. The benchmarks don't provide their own allocator; they just create a buffer via krnl, ocl, or cust. So it can be inferred that the OpenCL driver tested provides a finer-grained allocator than the CUDA driver. I wouldn't interpret this as a deficiency of CUDA, just a reflection of the fact that an allocator is something extra to implement, whereas in krnl one is included.

  2. krnl uses two different transfer modes: direct copy and staging buffers. Staging buffers require host-visible device memory. Because this is often limited, krnl is very conservative and allocates only two 32 MB staging buffers (see the first sketch after this list). Where the source and/or destination buffer is host visible, it copies directly, because this bypasses the GPU, meaning it can be done in parallel CPU threads (where the GPU can typically only do one copy at a time). Unfortunately there is no way to opt in to either strategy. Adding the ability to explicitly allocate memory that is or isn't host visible was something I considered, but it ends up being quite difficult once you add in the layers of abstraction on top, plus the fact that this depends on the hardware / driver.

  3. Yes, transfers via a staging buffer use two 32 MB buffers. As in 2, I wonder if it's actually doing a direct copy, which can be slower. Direct copies could be done in parallel, but I decided against spawning more threads than necessary (it only uses a single background thread to run the device).

  4. Correct. There is a builtin command in Vulkan (vkCmdFillBuffer), but it's slower than a compute shader (see the second sketch after this list). See 5.

  5. Vulkan has higher overhead than CUDA in command buffer recording, submission, and waiting for completion. With smaller workloads, this overhead can exceed the execution time. krnl and vulkano add additional overhead on top of that, in part because they provide greater safety guarantees, as well as for portability and simplicity. The work group size is potentially something that can be tuned.
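
To make the staging-buffer strategy in 2 and 3 concrete, here is a simplified host-side sketch of the chunking pattern; this is illustrative only, not the actual engine code:

```rust
// Illustrative sketch of chunking a download through two fixed-size
// host-visible staging buffers; not krnl's actual engine code.
const STAGING_SIZE: usize = 32 * 1024 * 1024;
const NUM_STAGING: usize = 2;

/// Stand-in for a host-visible staging buffer the GPU copies into.
struct StagingBuffer {
    mapped: Vec<u8>, // models mapped, host-visible memory
}

fn download(device_data_len: usize) -> Vec<u8> {
    let staging: Vec<StagingBuffer> = (0..NUM_STAGING)
        .map(|_| StagingBuffer { mapped: vec![0u8; STAGING_SIZE] })
        .collect();
    let mut out = Vec::with_capacity(device_data_len);
    for (i, offset) in (0..device_data_len).step_by(STAGING_SIZE).enumerate() {
        let len = STAGING_SIZE.min(device_data_len - offset);
        let buf = &staging[i % NUM_STAGING];
        // 1. Record + submit a GPU copy of device[offset..offset + len]
        //    into this staging buffer.
        // 2. Wait for that copy; because the buffers alternate, the GPU
        //    can already be filling the other one.
        // 3. memcpy the chunk out of mapped memory on the host:
        out.extend_from_slice(&buf.mapped[..len]);
    }
    out
}
```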
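And regarding 4, zeroing as a compute shader is just a trivial item kernel along these lines (a sketch; the actual fill kernel may differ):

```rust
use krnl::macros::module;

#[module]
mod fill {
    #[cfg(not(target_arch = "spirv"))]
    use krnl::krnl_core;
    use krnl_core::macros::kernel;

    // One invocation per element; dispatching a kernel like this turned
    // out faster than the builtin vkCmdFillBuffer in my testing.
    #[kernel]
    pub fn zero(#[item] y: &mut u32) {
        *y = 0;
    }
}
```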

@classner
Author

classner commented Jan 6, 2025

Hi @charles-r-earp!

Thanks for the detailed response! Wow, impressive that the builtin for zeroing in Vulkan is slower than a compute shader! And I take it optimizing the staging-buffer copy process would likely involve device-specific benchmarking?
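
For what it's worth, a rough round-trip harness like the one below, using the `Buffer::from(..).into_device(..)` / `into_vec()` API from the README, is what I'd reach for to benchmark a given device. Treat it as a sketch rather than a rigorous benchmark; it doesn't separate upload from download or do warmup runs:

```rust
use krnl::{anyhow::Result, buffer::Buffer, device::Device};
use std::time::Instant;

fn main() -> Result<()> {
    // Fall back to the host if no Vulkan device is available.
    let device = Device::builder().build().unwrap_or(Device::host());
    for &n in &[1usize << 20, 1 << 23, 1 << 26] {
        let host = vec![1u32; n];
        let start = Instant::now();
        // Upload, then download; into_vec() blocks until the data is
        // back on the host, giving a synchronized timing point.
        let buf = Buffer::from(host).into_device(device.clone())?;
        let _host = buf.into_vec()?;
        println!("{n} elements, round trip: {:?}", start.elapsed());
    }
    Ok(())
}
```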
