libdevice not found during training using default conda environment on Ubuntu 22.04.2 with a RTX A4000 #61
Comments
Does this happen only on very specific machines? Is it considered a bug in |
I don't know how specific this would be to Ubuntu or to the GPU I was running on, since the Stackoverflow question doesn't specify which GPU they had when that problem started. However, I saw that the only CI job that uses TF 2.11 installs the CPU version of Tensorflow:
Which might be the reason why the CI pipeline isn't catching the problem. I understand that it might not be practical to run a CI job on a machine with a GPU, but, if you have the resources for it, it might be something to consider, since the code in this repo is most likely going to be run with a GPU and it is exactly my GPU setup that failed. Doing so would let you catch these nasty CUDA-related bugs, or at least make people aware of which versions might fail in the future. For now, I think just having this thread might be enough to help anyone who stumbles upon this issue, but it'd be even better to add a warning to the readme, as some people might not look up past issues before opening a new one.
Yes, I guess that's expected; I wouldn't expect the standard CI agents to have a GPU. That being said, I noticed that the Python 3.8 build seems to be installing CUDA libraries and GPU-enabled Tensorflow... I need to take a closer look to understand what's going on here. Coming back to your issue, the Tensorflow install guide has a section called "Ubuntu 22.04" which seems to talk about the exact problem you're having? They mention a way to fix it which does not involve downgrading Tensorflow. If this is indeed a fix, then maybe I can mention that in the |
(For me the instructions from the Tensorflow website resolve the issue, and training under |
This PR expands on the Tensorflow troubleshooting section in `README.md`, taking into account how installing the newest versions on Ubuntu 22.04 requires extra care (fixes #61). On top of this, I also relaxed the pin on `protobuf` version in `setup.py`; I'm not sure why the lower bound was introduced, but some of the `tensorflow` versions actually conflict with it.
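As a quick diagnostic to go alongside the troubleshooting notes, the sketch below looks for libdevice bitcode files under a directory such as the active conda environment. The helper name `find_libdevice` and the choice of `CONDA_PREFIX` as the search root are assumptions made for illustration; neither comes from this repository or from TensorFlow. The only fact it relies on is that the CUDA toolkit ships the file as `libdevice.10.bc`.

```python
import os
from pathlib import Path


def find_libdevice(root):
    """Return the sorted paths of every libdevice*.bc file found below `root`.

    Hypothetical helper for illustration only. An empty list means no
    libdevice file exists under `root` (or `root` is not a directory).
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    # rglob walks the tree recursively; keep regular files only.
    return sorted(p for p in root_path.rglob("libdevice*.bc") if p.is_file())


if __name__ == "__main__":
    # CONDA_PREFIX is set by `conda activate`; fall back to the current directory.
    search_root = os.environ.get("CONDA_PREFIX", ".")
    matches = find_libdevice(search_root)
    if matches:
        for path in matches:
            print(path)
    else:
        print(f"No libdevice*.bc found under {search_root}")
```

If the script prints a path, the file is installed and the problem is more likely that XLA is not looking in the right place. If it prints nothing, the CUDA pieces that provide libdevice are probably missing from the environment.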
Hello, just to let you know that running `molecule-generation train` following the `Readme.md`, with the default conda environment, on Ubuntu 22.04.2 with an RTX A4000, fails because libdevice cannot be found (log below). I've found that pinning Tensorflow to version 2.10 instead of 2.11 (the latest version, installed automatically at the time of writing), as per this stackoverflow question, fixes it.
If you wish, I can open a PR to pin the TF version to 2.10 or lower until this is fixed upstream, as that was also cited as a solution for #56. Otherwise, I'm at least posting this here so that other people can find this error and solution more easily.
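To tell whether an environment falls in the range reported here, a minimal sketch like the one below compares the installed version against 2.11. The function names and the `first_bad=(2, 11)` threshold are assumptions for illustration: the report above only establishes that 2.10 worked and 2.11 failed on this one setup, so treating every later version as affected is a guess.

```python
from importlib import metadata


def parse_version(version):
    """Return the (major, minor) integers of a version string such as '2.11.0'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def is_affected(version, first_bad=(2, 11)):
    """True if `version` is at or above `first_bad`.

    The default threshold is an assumption based on this issue report,
    not a statement about TensorFlow itself.
    """
    return parse_version(version) >= first_bad


if __name__ == "__main__":
    try:
        installed = metadata.version("tensorflow")
    except metadata.PackageNotFoundError:
        print("tensorflow is not installed in this environment")
    else:
        if is_affected(installed):
            print(f"tensorflow {installed}: in the range where libdevice was not found")
        else:
            print(f"tensorflow {installed}: older than the version reported to fail")
```

Only the major and minor components are compared, so patch releases and pre-release suffixes such as `2.11.0rc1` are handled the same way as the plain release.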
Error Log
Conda Environment before pip install
When I re-created the environment without the restriction, this is the dependency list shown before installing `molecule-generation`:
Conda environment after pip install
And after running `pip install molecule-generation`:
Tasks