perf(CuBLAS): explore reduction in launch overhead via CUDA graphs #1192
Comments
Any thoughts, @slaren?
Looking at #1129 (comment), it seems that inter-operator fusion is required. This means we need a concept of a device tensor. Looks like we are slowly reimplementing PyTorch...
I don't think that we launch enough kernels for this to make a meaningful difference.
Using CUDA graphs would make sense if the duration of our kernels were comparable to the launch overhead (a couple of microseconds). As far as I understand, we intentionally use the GPU only for large GEMMs that take at least a couple of milliseconds.
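To put rough numbers on that argument, here is a back-of-the-envelope sketch. The figures are assumptions read off "a couple of microseconds" and "a couple of milliseconds" above (2 µs and 2 ms), not measurements. The model is deliberately pessimistic: it assumes each launch is fully serialized with its kernel and that a graph would remove the launch cost entirely. `launch_overhead_fraction` is a hypothetical helper, not part of any library.

```python
def launch_overhead_fraction(kernel_s: float, launch_s: float) -> float:
    """Share of wall time spent on launch overhead, assuming each
    launch is followed serially by its kernel (no overlap)."""
    return launch_s / (kernel_s + launch_s)


# Assumed, illustrative figures (not measured):
LAUNCH_S = 2e-6        # "a couple of microseconds" per kernel launch
LARGE_GEMM_S = 2e-3    # "a couple of milliseconds" per large GEMM
SMALL_KERNEL_S = 2e-6  # a kernel as short as the launch itself

large = launch_overhead_fraction(LARGE_GEMM_S, LAUNCH_S)
small = launch_overhead_fraction(SMALL_KERNEL_S, LAUNCH_S)

print(f"large GEMM:   {large:.2%} of time is launch overhead")
print(f"small kernel: {small:.2%} of time is launch overhead")
```

Under these assumptions, launch overhead is about 0.1% of the time for a large GEMM, so even removing it completely could not yield a visible speedup. It reaches 50% only when the kernel is as short as the launch itself, which is the regime where CUDA graphs tend to pay off.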
This issue was closed because it has been inactive for 14 days since being marked as stale.
See https://developer.nvidia.com/blog/cuda-graphs/ for reference.
One can take one of two approaches: