Update GPU Compute Capacity support to match tensorflow #200


Closed
rnett opened this issue Jan 30, 2021 · 10 comments

Comments

@rnett
Contributor

rnett commented Jan 30, 2021

When trying to test things on GPU (on Linux) with 0.3.0-SNAPSHOT, it takes a while to initialize before giving me:

2021-01-29 19:15:28.239332: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-01-29 19:15:28.239537: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Exception in thread "main" org.tensorflow.exceptions.TensorFlowException: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid
	at org.tensorflow.internal.c_api.AbstractTF_Status.throwExceptionIfNotOK(AbstractTF_Status.java:101)
	at org.tensorflow.EagerSession.allocate(EagerSession.java:357)
	at org.tensorflow.EagerSession.<init>(EagerSession.java:327)

This is with a 1070 (compute 6.1) that was successfully recognized earlier:

2021-01-29 19:15:28.229419: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.797GHz coreCount: 15 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 238.66GiB/s

After some digging, I found tensorflow/tensorflow#41990, tensorflow/tensorflow#41132 (comment), and tensorflow/tensorflow#41892 (comment).

The last two in particular imply that the issue is that our binaries aren't being built with support for compute capability 6.1, and sure enough, they aren't: https://github.com/tensorflow/java/blob/master/tensorflow-core/tensorflow-core-api/build.sh#L25-L32

export TF_CUDA_COMPUTE_CAPABILITIES="3.5,7.0"

As per the 2nd and 3rd links, and https://www.tensorflow.org/install/gpu#hardware_requirements, the other TensorFlow binaries (Python, C, C++, etc.) support 3.5, 5.0, 6.0, 7.0, 7.5, and 8.0 and higher. IMO, we should do the same, ideally in a way that doesn't need updating when the list changes (would simply not exporting the variable work? The defaults are specified in https://github.com/tensorflow/tensorflow/blob/master/.bazelrc#L600). This will likely increase build times, though, which I think is already a problem for us.
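For reference, if the build were changed to match the wider list, the exported value might look something like this (a sketch; the exact entries and the sm_ vs. compute_ spelling follow upstream's .bazelrc):

```shell
# Sketch: a capability list matching the official TensorFlow binaries.
# sm_XY entries produce cubins for that architecture; the trailing compute_XY
# entry also embeds PTX, so newer GPUs (e.g. 8.6) can JIT-compile at load time.
export TF_CUDA_COMPUTE_CAPABILITIES="sm_35,sm_50,sm_60,sm_70,sm_75,compute_80"
```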

@rnett
Contributor Author

rnett commented Jan 30, 2021

As per #149 (comment), this is not possible due to build times. I'm leaving the issue up as a reminder to do it when possible.

@mikaelhg

mikaelhg commented Feb 2, 2021

One part of the solution should be to provide a much better error message, which accurately describes what's happening and how to fix the issue.

@mikaelhg

mikaelhg commented Feb 2, 2021

Also, would it be possible to change build.sh to allow the passthrough of the TF_CUDA_COMPUTE_CAPABILITIES environment variable value, rather than explicitly overwriting it? This would allow parametrized custom builds without patching the build.sh file for each build.
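A sketch of what that passthrough could look like in build.sh, using shell parameter expansion so a caller-supplied value wins when set (the fallback shown is the current hard-coded default):

```shell
# Sketch: honor a caller-supplied TF_CUDA_COMPUTE_CAPABILITIES, falling back
# to the current hard-coded default when the variable is unset or empty.
export TF_CUDA_COMPUTE_CAPABILITIES="${TF_CUDA_COMPUTE_CAPABILITIES:-3.5,7.0}"
```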

@rnett
Contributor Author

rnett commented Feb 2, 2021

I don't think we can do much about the error message, since it comes from the native code. But not overwriting the variable sounds easy enough. You can set it in a bazelrc file, too, which takes precedence over this.
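For example, a user-level bazelrc override might look like this (a hypothetical entry; the `--action_env` spelling mirrors how the configure script passes this variable through):

```
# Hypothetical user bazelrc entry: override the capability list
# without patching build.sh.
build --action_env TF_CUDA_COMPUTE_CAPABILITIES="6.1"
```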

@mikaelhg

When you get more build resources, please add compute capability 3.7 for the K80 to the default support list, as it's very widely used.

@karllessard
Collaborator

karllessard commented Mar 13, 2021

@rnett , I've replaced the CUDA capabilities with this line that I picked from the .bazelrc, is that right? It is still building, but it looks successful so far; if you agree, then I'll merge this change to master and ask you to run some tests on your hardware with the latest snapshots.

I think at some point we'll need to make better use of the configuration in this .bazelrc file, but right now, since I'm just about to release 0.3.0, I prefer making minimal changes. I might do a 0.3.1 release right after just to align our build options with the official ones.

@rnett
Contributor Author

rnett commented Mar 13, 2021

Seems fine to me. We may have to change it eventually, once the next generation of NVIDIA GPUs launches, but we should be good for a while. It would be nice if we could get SIG Build, or whoever's responsible for it, to publicize their setup and build arguments.

@karllessard
Collaborator

karllessard commented Mar 13, 2021

I guess they use the configs in the main repo; check the release_*_* ones (which we copy into our repo, btw). I quickly tried to enable them in our build but hit a bunch of errors, so I'll keep working on this later.

@rnett
Contributor Author

rnett commented Mar 13, 2021

Huh, TIL. I would've expected that to come up in the issues I was looking at.

@karllessard
Collaborator

The list of capabilities has been updated to sm_35,sm_50,sm_60,sm_70,sm_75,compute_80 in 0.3.0, which should cover pretty much everything we expect to run on CUDA 11.

Closing this issue for now, please reopen if some capabilities are still not supported.
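As a quick sanity check, a small helper like the one below (a sketch, assuming the prefixed sm_/compute_ list format above; plain "3.5,7.0"-style entries are not handled) can tell whether a given device capability is covered. It follows the usual CUDA rules: a cubin (sm_XY) is binary-compatible within the same major architecture for equal or higher minor revisions, while a PTX entry (compute_XY) covers that capability and anything newer via JIT.

```shell
#!/usr/bin/env bash
# Sketch: check whether a device compute capability (e.g. "6.1") is covered
# by a TF_CUDA_COMPUTE_CAPABILITIES-style list (two-digit capabilities only).
covers() {
  local dev="$1" list="$2" entry n
  local major="${dev%%.*}" minor="${dev##*.}"
  for entry in ${list//,/ }; do
    case "$entry" in
      sm_*)
        # cubin: same major architecture, equal or higher minor revision
        n="${entry#sm_}"
        [ "${n:0:1}" = "$major" ] && [ "${n:1}" -le "$minor" ] && return 0
        ;;
      compute_*)
        # PTX: JIT-compiled at load time for this capability and anything newer
        n="${entry#compute_}"
        [ "${dev/./}" -ge "$n" ] && return 0
        ;;
    esac
  done
  return 1
}

covers 6.1 "sm_35,sm_50,sm_60,sm_70,sm_75,compute_80" && echo "6.1 covered"
```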
