perf(CuBLAS): explore reduction in launch overhead via CUDA graphs #1192

Closed · jon-chuang opened this issue Apr 26, 2023 · 5 comments
@jon-chuang (Contributor)

See https://developer.nvidia.com/blog/cuda-graphs/ for reference.

One can take one of two approaches (a capture sketch follows this list):

  1. Within a single operator.
  2. Spanning multiple operators (operator fusion).
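
As a rough illustration of the within-operator approach, here is a minimal stream-capture sketch following the API from the blog post linked above. The `scale_kernel`/`add_kernel` placeholders and `run_with_graph` are hypothetical, not kernels from this codebase:

```cpp
// Minimal sketch: record a short kernel sequence into a CUDA graph via
// stream capture, then replay it. Kernels here are stand-ins, not ggml's.
#include <cuda_runtime.h>

__global__ void scale_kernel(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

__global__ void add_kernel(float *x, const float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += y[i];
}

void run_with_graph(float *d_x, float *d_y, int n, cudaStream_t stream) {
    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;
    int blocks = (n + 255) / 256;

    // Record the launches into a graph instead of executing them.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    scale_kernel<<<blocks, 256, 0, stream>>>(d_x, 2.0f, n);
    add_kernel<<<blocks, 256, 0, stream>>>(d_x, d_y, n);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once; each replay then costs a single launch instead of
    // one per kernel. (On CUDA >= 12 the signature is
    // cudaGraphInstantiate(&graph_exec, graph, 0).)
    cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);
    for (int iter = 0; iter < 100; ++iter) {
        cudaGraphLaunch(graph_exec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graph_exec);
    cudaGraphDestroy(graph);
}
```

The point of the sketch: `cudaGraphLaunch` replays the whole recorded sequence with a single CPU-side submission, so per-kernel launch overhead is paid only once, at capture time.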
@jon-chuang (Contributor, Author)

Any thoughts, @slaren?

@jon-chuang (Contributor, Author)

Looking at #1129 (comment), it seems that inter-operator fusion is required.

This means we would need a concept of a device tensor. It looks like we are slowly reimplementing PyTorch...

@slaren (Member) commented Apr 26, 2023

I don't think that we launch enough kernels for this to make a meaningful difference.

@dfyz (Collaborator) commented Apr 27, 2023

Using CUDA graphs would make sense if the duration of our kernels were comparable to the launch overhead (a couple of microseconds). As far as I understand, we intentionally use the GPU only for large GEMMs that take at least a couple of milliseconds.
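
To put rough numbers on that: with ~2 µs of launch overhead against a ~2 ms GEMM, graphs would recover on the order of 0.1% of runtime. For anyone who wants to measure the per-launch cost on their own machine, a self-contained sketch (the `noop` kernel is illustrative only):

```cpp
// Rough micro-benchmark: time a burst of empty kernel launches. With an
// empty kernel, the GPU timeline is dominated by submission cost, so the
// average approximates the per-launch overhead.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
    const int iters = 10000;

    // Warm up so one-time initialization costs don't skew the measurement.
    noop<<<1, 1>>>();
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        noop<<<1, 1>>>();
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg time per launch: %.2f us\n", 1000.0f * ms / iters);
    return 0;
}
```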

@github-actions (bot) added the stale label Mar 25, 2024

@github-actions (bot) commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
