
A6000 Ada benchmark results and thoughts #43

Open
classner opened this issue Dec 30, 2024 · 2 comments

@classner

Hi @charles-r-earp!

Thanks for this amazing project! This 'issue' is meant more as a conversation than a request to fix a specific thing. I just tested krnl on an A6000 Ada and thought I'd share some findings and thoughts.

The following are the timings I got comparing its performance with CUDA:

[Screenshot 2024-12-29: table of benchmark timings comparing krnl and CUDA for malloc, upload, download, zeroing, and SAXPY at various buffer sizes]

(I marked cells with up to a 2x slowdown in yellow, and anything above that in red.)

  1. Clearly, the malloc operation benefits from a very different implementation (agreeing with your findings in the project README). It seems similar to the OpenCL implementation. Does the CUDA interface here use an inefficient allocator? Either way, this performance is great!
  2. The upload performance is competitive across the range tested! The 10M range seems not ideal, but I assume a bit of tweaking could get it there!
  3. The download performance really breaks down here. Do you have ideas on why that could be? I looked at the implementation in the engine file and saw that it uses chunked downloading - what do you think is happening here? And what could CUDA be doing better?
  4. Zeroing is nearly the same for the 64M element test, but much slower for fewer elements. Looking at the implementation, it appears to be a kernel call that sets individual elements. I assume CUDA uses an intrinsic to zero the memory, which makes this a special case.
  5. SAXPY again approaches similar performance at 64M elements, yet is much slower below that. What could be happening here? Bad work group sizes? Generally more kernel-call overhead? It does stand out that krnl consistently never reaches as low a time for any kernel call as CUDA does. (For reference, a sketch of the benchmarked kernel follows this list.)
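
For reference, the SAXPY being timed is just an element-wise kernel. A minimal version, paraphrased from memory from the saxpy example in the krnl README (exact signatures and builder calls may differ slightly from what I actually ran):

```rust
use krnl::{anyhow::Result, buffer::Buffer, device::Device, macros::module};

#[module]
mod kernels {
    // Re-export needed when compiling for the host rather than SPIR-V.
    #[cfg(not(target_arch = "spirv"))]
    use krnl::krnl_core;
    use krnl_core::macros::kernel;

    // Item kernel: one invocation per element, y[i] += alpha * x[i].
    #[kernel]
    pub fn saxpy(alpha: f32, #[item] x: f32, #[item] y: &mut f32) {
        *y += alpha * x;
    }
}

fn main() -> Result<()> {
    let device = Device::builder().build()?;
    let n = 64 * 1024 * 1024;
    let x = Buffer::from(vec![1f32; n]).into_device(device.clone())?;
    let mut y = Buffer::from(vec![0f32; n]).into_device(device.clone())?;
    // Each dispatch records a command buffer and submits it to the device;
    // that fixed per-call cost is what dominates at small sizes.
    kernels::saxpy::builder()?
        .build(device.clone())?
        .dispatch(2f32, x.as_slice(), y.as_slice_mut())?;
    println!("{:?}", &y.into_vec()?[..4]);
    Ok(())
}
```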

What do you think?

@charles-r-earp
Owner

Hello @classner,

Thanks for the data and the feedback.

  1. The benchmarks don't provide their own allocator; they just create a buffer via krnl, ocl, or cust. So it can be inferred that the OpenCL driver tested provides a finer-grained allocator than the CUDA driver. I wouldn't interpret this as a deficiency of CUDA, just a reflection of the fact that an allocator is something extra to implement, whereas in krnl one is included.

  2. krnl uses two different transfer modes: direct copy and staging buffers. Staging buffers require host-visible device memory. Because this is often limited, krnl is very conservative and allocates only two 32 MB staging buffers (see the first sketch after this list). Where the source and/or destination buffer is host visible, it copies directly, because this bypasses the GPU, meaning it can be done in parallel CPU threads (where the GPU can typically only do one copy at a time). Unfortunately there is no way to opt in to either strategy. Adding the ability to explicitly allocate memory that is or isn't host visible was something I considered, but it ends up being quite difficult once you add in the layers of abstraction on top, plus the fact that this depends on the hardware / driver.

  3. Yes, transfers via a staging buffer use two 32 MB buffers. As in 2, I wonder if it's actually doing a direct copy, which can be slower. Direct copies could be done in parallel, but I decided against spawning more threads than necessary (it only uses a single background thread to run the device).

  4. Correct. There is a builtin command in Vulkan (vkCmdFillBuffer), but it's slower than a compute shader (see the second sketch after this list). See 5.

  5. Vulkan has higher overhead than CUDA in command buffer recording, submission, and waiting for completion. With smaller workloads, this overhead can exceed the execution time. krnl and vulkano add additional overhead on top of that, in part because they provide greater safety guarantees, as well as for portability and simplicity. The work group size is potentially something that can be tuned.
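
To make the staging-buffer strategy in 2 and 3 concrete, here is a simplified host-side sketch of the chunking pattern; this is illustrative only, not the actual engine code:

```rust
// Illustrative sketch of chunking a download through two fixed-size
// host-visible staging buffers; not krnl's actual engine code.
const STAGING_SIZE: usize = 32 * 1024 * 1024;
const NUM_STAGING: usize = 2;

/// Stand-in for a host-visible staging buffer the GPU copies into.
struct StagingBuffer {
    mapped: Vec<u8>, // models mapped, host-visible memory
}

fn download(device_data_len: usize) -> Vec<u8> {
    let staging: Vec<StagingBuffer> = (0..NUM_STAGING)
        .map(|_| StagingBuffer { mapped: vec![0u8; STAGING_SIZE] })
        .collect();
    let mut out = Vec::with_capacity(device_data_len);
    for (i, offset) in (0..device_data_len).step_by(STAGING_SIZE).enumerate() {
        let len = STAGING_SIZE.min(device_data_len - offset);
        let buf = &staging[i % NUM_STAGING];
        // 1. Record + submit a GPU copy of device[offset..offset + len]
        //    into this staging buffer.
        // 2. Wait for that copy; because the buffers alternate, the GPU
        //    can already be filling the other one.
        // 3. memcpy the chunk out of mapped memory on the host:
        out.extend_from_slice(&buf.mapped[..len]);
    }
    out
}
```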
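And regarding 4, zeroing as a compute shader is just a trivial item kernel along these lines (a sketch; the actual fill kernel may differ):

```rust
use krnl::macros::module;

#[module]
mod fill {
    #[cfg(not(target_arch = "spirv"))]
    use krnl::krnl_core;
    use krnl_core::macros::kernel;

    // One invocation per element; dispatching a kernel like this turned
    // out faster than the builtin vkCmdFillBuffer in my testing.
    #[kernel]
    pub fn zero(#[item] y: &mut u32) {
        *y = 0;
    }
}
```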

@classner
Author

classner commented Jan 6, 2025

Hi @charles-r-earp!

Thanks for the detailed response! Wow, impressive that the builtin for zeroing in Vulkan is slower than a compute shader! And I take it optimizing the staging-buffer copy process would likely involve device-specific benchmarking?
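
For what it's worth, a rough round-trip harness like the one below, using the `Buffer::from(..).into_device(..)` / `into_vec()` API from the README, is what I'd reach for to benchmark a given device. Treat it as a sketch rather than a rigorous benchmark; it doesn't separate upload from download or do warmup runs:

```rust
use krnl::{anyhow::Result, buffer::Buffer, device::Device};
use std::time::Instant;

fn main() -> Result<()> {
    // Fall back to the host if no Vulkan device is available.
    let device = Device::builder().build().unwrap_or(Device::host());
    for &n in &[1usize << 20, 1 << 23, 1 << 26] {
        let host = vec![1u32; n];
        let start = Instant::now();
        // Upload, then download; into_vec() blocks until the data is
        // back on the host, giving a synchronized timing point.
        let buf = Buffer::from(host).into_device(device.clone())?;
        let _host = buf.into_vec()?;
        println!("{n} elements, round trip: {:?}", start.elapsed());
    }
    Ok(())
}
```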
