Turing Tensor Core operations must be run on a machine with compute capability at least 75 #13
Hi there! While we don't officially support Volta GPUs, you might find success replacing FullyFusedMLP with CutlassMLP in the network config. (If no config is specified on the command line, the testbed uses a default one.)
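The swap suggested above happens in the `network` section of the JSON config. A minimal sketch (the `otype` values are tiny-cuda-nn's component names; the other field values here are illustrative, not taken from the thread):

```json
{
	"network": {
		"otype": "CutlassMLP",
		"activation": "ReLU",
		"n_neurons": 64,
		"n_hidden_layers": 2
	}
}
```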
That's unfortunate. Apologies, but it's currently not in our scope to add dedicated implementations for older GPUs, in part because their performance would benefit much less from a fully fused implementation: on the small-MLP workload, older GPUs are more compute-bound than memory-bound compared with newer ones. All that said, I would still be more than happy to merge code contributions that improve compatibility.
Well, the V100 also has 32 GB of memory, which is not too bad, and it should also support Tensor Cores, albeit with compute capability 70. I am not sure which part restricts things to arch >= 75.
I have no idea what caused the issue. Not sure where to start; it would be nice to learn more from you, @Tom94.
tiny-cuda-nn's CutlassMLP supported V100 tensor core ops once in the past (hence my suggestion above), and I am not 100% sure where it might have broken in the interim. I stopped explicitly supporting it after all (few, at the time) colleagues who depended on the framework moved on to newer GPUs.

A shot in the dark, which would be amazing if it works: does the codebase compile when you replace

`using SmArch = cutlass::arch::Sm75;`

with

`using SmArch = cutlass::arch::Sm70;`

? Perhaps (though unlikely) my old maze of compile-time conditionals survived the last year intact enough to still support compute capability 70.

Note: you likely also have to comment out lines 78/79 in the CMakeLists, as well as set the corresponding environment variable.

If it turns out to be this simple, I'm of course happy to automate all of these manual steps upstream.
Thanks for the very detailed reply!
Confirmed that the above change (using `SmArch = cutlass::arch::Sm70;`) worked, and commenting out `list(REMOVE_ITEM CMAKE_CUDA_ARCHITECTURES 53 70 "86+PTX")` seems not necessary. I tested the speed of both CutlassMLP and FullyFusedMLP on the NeRF fox scene (and also on the image fitting task).

I also tested the tiny-cuda-nn repo on its own. The example (https://github.com/NVlabs/tiny-cuda-nn#example-learning-a-2d-image) worked for CutlassMLP; however, the loss diverged for FullyFusedMLP. So something is still different between CutlassMLP and FullyFusedMLP, and the latter is still not working normally.
Will it work on a GTX 1650 Max-Q (Turing architecture)?
@MultiPath wow, that's great, especially that the CMake change wasn't even necessary. I'll head over to tiny-cuda-nn and automate this upon CMake detecting a 70-arch.

Regarding the failure in the 2D learning example: may I ask you to test one more thing? Could you go into `fully_fused_mlp.cu` and make the corresponding replacement at lines 284 and 586? In the off-chance that this helps, how far (in powers of two) can you increase that value?
Thanks for checking! I'll configure this accordingly upstream.

Regarding the PyTorch ETA: apologies for not being more forthcoming, but all I can say is "it's done when it's done". Could be quick, or it could be a while; hard for me to predict.
Thanks for the response! |
Thanks again for the help with troubleshooting. I pushed a fix which should add compatibility with V100 GPUs as well as earlier ones (through regular full-precision matmuls powered by CUTLASS): #22

Even though CI successfully builds these and I can run them locally, I don't have GTX 1000-series, K80, or V100 GPUs available to test with. I'd appreciate a confirm/deny on whether instant-ngp actually runs on any of these now.
Can confirm it runs, see #33. |
The newest version runs successfully on my V100 GPU.
Hi
Thanks for sharing the great work. Although the code can be compiled, it fails when running the testbed.
Does this mean that there is no way to run this code on some older GPUs, e.g. the Tesla V100, which seems to be architecture 70?
Are there any alternatives?
Thanks