Turing Tensor Core operations must be run on a machine with compute capability at least 75 #13

Closed
MultiPath opened this issue Jan 15, 2022 · 15 comments

Comments

@MultiPath

Hi

Thanks for sharing the great work. Although the code can be compiled, it fails when running the testbed.
Does this mean there is no way to run this code on some older GPUs, e.g. the Tesla V100, which seems to be architecture 70?
Are there any alternatives?

Thanks

@Tom94
Collaborator

Tom94 commented Jan 15, 2022

Hi there! While we don't officially support Volta GPUs, you might find success replacing FullyFusedMLP (which relies on Turing+ TensorCores) with the more general CutlassMLP in the network config.

(If no config is specified on the command line, the testbed uses configs/<mode>/base.json.)

@MultiPath
Author

It runs with CutlassMLP once I comment out the error raised in the testbed.
However, I found that the loss was completely frozen. I also tried the tiny-cuda-nn repo and ran the same test. The results look similar.

[Screenshot: training output with the loss stuck at 0.206]
For example, in the screenshot above, loss=0.206 and it never decreased.
I ran python scripts/run.py --scene data/image/ --mode image --network base.json

@Tom94
Collaborator

Tom94 commented Jan 16, 2022

That's unfortunate. Apologies, but it's currently not in our scope to add dedicated implementations for older GPUs.

In part, this is because earlier GPUs would benefit much less from a fully fused implementation: on the small-MLP workload they are more compute-bound than memory-bound compared to newer GPUs.

All that said, I would still be more than happy to merge code contributions that improve compatibility.

@Tom94 Tom94 closed this as completed Jan 16, 2022
@MultiPath
Author

MultiPath commented Jan 16, 2022

Well, the V100 (Volta) also has 32GB of memory, which is not too bad, and it should also support Tensor Cores, just with compute capability 70. I am not sure which part restricts it to arch >= 75.

> All that said, I would still be more than happy to merge code contributions that improve compatibility.

I have no idea what caused the issue. I am not sure where to start, and it would be nice to learn more from you @Tom94

@Tom94
Collaborator

Tom94 commented Jan 16, 2022

tiny-cuda-nn's CutlassMLP did support V100 tensor core ops at one point (hence my suggestion above), and I am not 100% sure where it might have broken in the interim. I stopped explicitly supporting it once all the colleagues who depended on the framework (few at the time) had moved on to newer GPUs.

The top of dependencies/tiny-cuda-nn/include/tiny-cuda-nn/cutlass_matmul.h contains a number of template configurations that are conditional on compute capability. The reason is that the tensor cores on different GPU generations behave subtly differently and support differently sized operands.

A shot in the dark, which would be amazing if it works: does the codebase compile when you replace

#ifdef TCNN_AMPERE
using SmArch = typename std::conditional<std::is_same<network_precision_t, float>::value, cutlass::arch::Sm75, cutlass::arch::Sm80>::type;
#else
using SmArch = cutlass::arch::Sm75;
#endif

with

using SmArch = cutlass::arch::Sm70;

? Perhaps (though unlikely) my old maze of compile-time conditionals survived the last year intact enough to still support Sm70.

Note: you likely also have to comment out lines 78/79 in dependencies/tiny-cuda-nn/CMakeLists.txt

# Remove unsupported architectures
list(REMOVE_ITEM CMAKE_CUDA_ARCHITECTURES 53 60 61 70 "86+PTX")

as well as set the environment variable TCNN_CUDA_ARCHITECTURES=70 if your arch isn't detected right automatically. Fingers crossed!

If it turns out to be this simple, I'm of course happy to automate all of these manual steps upstream.
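
For readers unfamiliar with the pattern, here is a minimal standalone sketch of selecting an architecture tag at compile time. The Sm70/Sm75/Sm80 structs are stand-ins for the cutlass::arch tags, and SelectSmArch is a hypothetical helper, not something that exists in tiny-cuda-nn. Unlike the snippet above, it ignores network_precision_t and keys only on the target architecture.

```cpp
#include <type_traits>

// Stand-in tags for illustration only; the real ones live in cutlass::arch.
struct Sm70 { static constexpr int kMinComputeCapability = 70; };
struct Sm75 { static constexpr int kMinComputeCapability = 75; };
struct Sm80 { static constexpr int kMinComputeCapability = 80; };

// Hypothetical helper: pick the newest tag whose minimum compute
// capability does not exceed the target architecture.
template <int TargetArch>
struct SelectSmArch {
    static_assert(TargetArch >= 70, "this sketch only covers arch 70 and newer");
    using type = typename std::conditional<
        (TargetArch >= 80), Sm80,
        typename std::conditional<(TargetArch >= 75), Sm75, Sm70>::type
    >::type;
};

// Compile-time checks: V100 (70), Turing (75), RTX 3090 (86).
static_assert(std::is_same<SelectSmArch<70>::type, Sm70>::value, "70 -> Sm70");
static_assert(std::is_same<SelectSmArch<75>::type, Sm75>::value, "75 -> Sm75");
static_assert(std::is_same<SelectSmArch<86>::type, Sm80>::value, "86 -> Sm80");
```

The manual edit suggested above corresponds to forcing the Sm70 branch by hand.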

@MultiPath
Author

Thanks for the very detailed reply!

@MultiPath
Author

MultiPath commented Jan 17, 2022

@Tom94

Confirmed that the above change (using SmArch = cutlass::arch::Sm70;) worked, and commenting out list(REMOVE_ITEM CMAKE_CUDA_ARCHITECTURES 53 60 61 70 "86+PTX") does not seem to be necessary.

I tested both CutlassMLP and FullyFusedMLP on the NeRF fox scene (and also on the image fitting task). The speeds on my server are
96.55step/s and 110.26step/s, respectively.
Both losses decreased and both networks learned reasonable output.
However, FullyFusedMLP ends with a roughly 2x larger loss than CutlassMLP, and its output image is also noticeably worse.

I also tested the tiny-cuda-nn repo on its own. The example (https://github.com/NVlabs/tiny-cuda-nn#example-learning-a-2d-image) worked with CutlassMLP; however, the loss diverged with FullyFusedMLP.

So there is still some difference between CutlassMLP and FullyFusedMLP, and the latter is still not working correctly.

@1kaiser

1kaiser commented Jan 17, 2022

Will it work on a GTX 1650 Max-Q (Turing architecture)?

@Tom94
Collaborator

Tom94 commented Jan 17, 2022

@MultiPath wow that's great. Especially the fact that FullyFusedMLP works at least for instant-ngp out-of-the-box! I'm also positively surprised about the overall speed -- on my 3090 I get 169.03step/s, which is not even 2x faster. Very glad that this works!
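
To put the step rates quoted in this thread side by side, here is a small arithmetic sketch. The constant and function names are illustrative, and it assumes the 3090 figure was measured on a comparable scene and configuration, which the thread does not state.

```cpp
// Step rates quoted in this thread, in steps per second.
constexpr double kCutlassV100 = 96.55;      // CutlassMLP on the V100 server
constexpr double kFullyFusedV100 = 110.26;  // FullyFusedMLP on the V100 server
constexpr double kFullyFused3090 = 169.03;  // figure reported for the RTX 3090

// Ratio of two throughputs; values above 1 mean `fast` is quicker.
constexpr double speedup(double fast, double slow) { return fast / slow; }

// The 3090 comes out at roughly 1.53x the V100 FullyFusedMLP rate.
constexpr double k3090OverV100 = speedup(kFullyFused3090, kFullyFusedV100);

// On the V100, FullyFusedMLP comes out at roughly 1.14x the CutlassMLP rate.
constexpr double kFusedOverCutlass = speedup(kFullyFusedV100, kCutlassV100);

static_assert(k3090OverV100 < 2.0, "3090 is less than 2x faster");
```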

I'll head over to tiny-cuda-nn and automate this upon CMake detecting a 70-arch.

Regarding the failure in the 2D learning example: may I ask you to test one more thing? Could you go into dependencies/tiny-cuda-nn/src/fully_fused_mlp.cu and replace the two occurrences of

	const int N_ITERS = WIDTH >= 256 ? 2 : 8;
	const uint32_t BLOCK_DIM_Z = (INFERENCE && WIDTH == 128) ? 2 : 1;

with

	const int N_ITERS = 1;
	const uint32_t BLOCK_DIM_Z = 1;

(lines 284 and 586)? On the off chance that this helps, how far (in powers of two) can you increase N_ITERS until things stop working? Much appreciated!
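
As a rough illustration of what the defaults evaluate to, the sketch below restates the two expressions as standalone constexpr functions. The function names and the power-of-two helper are hypothetical; in fully_fused_mlp.cu these are local constants that depend on the WIDTH and INFERENCE template parameters.

```cpp
#include <cstdint>

// Same expressions as the two constants quoted above, as free functions.
constexpr int default_n_iters(int width) { return width >= 256 ? 2 : 8; }

constexpr std::uint32_t default_block_dim_z(bool inference, int width) {
    return (inference && width == 128) ? 2u : 1u;
}

// Hypothetical helper for the experiment: the k-th N_ITERS value to try
// when doubling from 1 (1, 2, 4, 8, ...).
constexpr int n_iters_candidate(int k) { return 1 << k; }

// A 64-wide network defaults to N_ITERS = 8, i.e. candidate k = 3.
static_assert(default_n_iters(64) == n_iters_candidate(3), "64-wide default is 8");
// Only 128-wide inference uses BLOCK_DIM_Z = 2.
static_assert(default_block_dim_z(true, 128) == 2u, "128-wide inference uses 2");
```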

@MultiPath
Author

Hi, I tried this and changed both places.
Unfortunately, it does not solve the issue I hit in the 2D image learning example. Here is the screenshot:
[Screenshot: FullyFusedMLP training output]
The loss did not go down, and it diverged after 5k steps.

On the other hand, using CutlassMLP is fine:
[Screenshot: CutlassMLP training output]

Also, it would be great to know the ETA you mentioned for the PyTorch binding for tiny-cuda-nn (even just the CutlassMLP forward/backward alone would be super nice)!
I think it would be very helpful for many existing NeRF-based systems. I tried to read through the code this weekend. Unfortunately, code at this level is well beyond my experience...

@Tom94
Collaborator

Tom94 commented Jan 17, 2022

Thanks for checking! I'll configure FullyFusedMLP to automatically fall back to CutlassMLP for arch 70 and lower in that case. This is unfortunately ~20% slower according to your numbers from above, but imho much better than sporadic failure. At least until we have a better grasp on the precise conditions of how and why FullyFusedMLP fails.
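
The fallback described here could look roughly like the sketch below. MlpType, compute_capability, and resolve_mlp are hypothetical names for illustration, not the tiny-cuda-nn API; the only rule taken from this thread is that FullyFusedMLP is swapped for CutlassMLP on arch 70 and lower.

```cpp
// Hypothetical enum standing in for the network type named in the config.
enum class MlpType { FullyFusedMLP, CutlassMLP };

// Encode major.minor the way this thread writes it: 7.0 -> 70, 7.5 -> 75.
constexpr int compute_capability(int major, int minor) {
    return major * 10 + minor;
}

// Fall back to CutlassMLP when FullyFusedMLP is requested on arch <= 70.
constexpr MlpType resolve_mlp(MlpType requested, int arch) {
    return (requested == MlpType::FullyFusedMLP && arch <= 70)
        ? MlpType::CutlassMLP
        : requested;
}

// A V100 (7.0) request for FullyFusedMLP resolves to CutlassMLP.
static_assert(resolve_mlp(MlpType::FullyFusedMLP, compute_capability(7, 0))
                  == MlpType::CutlassMLP, "V100 falls back");
// A Turing GPU (7.5) keeps FullyFusedMLP.
static_assert(resolve_mlp(MlpType::FullyFusedMLP, compute_capability(7, 5))
                  == MlpType::FullyFusedMLP, "Turing keeps the fused path");
```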

Regarding the PyTorch ETA: apologies for not being more forthcoming, but all I can say is "it's done when it's done". Could be quick -- or it could be a while -- hard for me to predict.

@MultiPath
Author

Thanks for the response!

@Tom94
Collaborator

Tom94 commented Jan 17, 2022

Thanks again for the help with troubleshooting. I pushed a fix which should add compatibility to V100 GPUs as well as earlier ones (through regular full-precision matmuls powered by CUTLASS): #22

Even though CI successfully builds these and I can run them locally, I don't have GTX 1000-series, K80, or V100 GPUs available to test with. I'd appreciate a confirm/deny whether instant-ngp actually runs on any of these now.

@Tom94 Tom94 reopened this Jan 17, 2022
@myagues
Contributor

myagues commented Jan 17, 2022

Can confirm it runs, see #33.

@Tom94 Tom94 closed this as completed Jan 17, 2022
@endlesswho

> Thanks again for the help with troubleshooting. I've got the following PR, which should add compatibility to V100 GPUs as well as earlier ones (through regular full-precision matmuls powered by CUTLASS): #22
>
> Even though CI successfully builds these and I can run them locally, I don't have GTX 1000-series, K80, or V100 GPUs available to test with. I'd appreciate a confirm/deny whether instant-ngp actually runs on any of these now.

The newest version runs successfully on my V100 GPU.
