
Tensorflow CPU vs GPU #68

Closed
mikegerber opened this issue Feb 25, 2020 · 19 comments
@mikegerber
Contributor

mikegerber commented Feb 25, 2020

  1. https://github.com/OCR-D/ocrd_all#conflicting-requirements states that ocrd_calamari would depend on tensorflow-gpu 1.14.x, but it has recently switched to 1.15.2.

  2. There is also still some solvable(!) problem/confusion about the different TensorFlow flavours. For tensorflow 1.15.*, one can simply depend on tensorflow-gpu == 1.15.* for CPU and GPU support. I am not aware of any issues using tensorflow-gpu's CPU fallback on CPU, I use it every day. (There was some source of additional confusion because TF changed their recommendation for 1.15 only.)

  3. I just recently discovered that one can depend on an approximate version, e.g. tensorflow-gpu ~= 1.15.2 or tensorflow == 1.15.*

TL;DR: My recommendation would be that our TF1 projects just use tensorflow-gpu == 1.15.* for CPU and GPU support and be done with this problem.
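(For reference, PEP 440's "compatible release" operator `~=` mentioned in point 3 means ">= the given version, within the same series". A minimal, simplified checker — numeric release segments only, no pre-releases; real resolution is of course pip's job — illustrates why `~= 1.15.2` is roughly `== 1.15.*` plus a lower bound:

```python
# Simplified illustration of PEP 440's compatible-release operator:
#   ~= X.Y.Z   is equivalent to   >= X.Y.Z, == X.Y.*
# (numeric release segments only; this is a sketch, not a pip replacement)
def satisfies_compatible(version: str, spec: str) -> bool:
    v = [int(part) for part in version.split(".")]
    s = [int(part) for part in spec.split(".")]
    # at least the pinned version, and all but the last component identical
    return v >= s and v[: len(s) - 1] == s[:-1]

print(satisfies_compatible("1.15.3", "1.15.2"))  # True: still in the 1.15 series
print(satisfies_compatible("1.16.0", "1.15.2"))  # False: outside the 1.15 series
```

So `tensorflow-gpu ~= 1.15.2` accepts future 1.15.x patch releases but never 1.16 or 2.x.)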

@bertsky
Collaborator

bertsky commented Feb 25, 2020

1. https://github.com/OCR-D/ocrd_all#conflicting-requirements states that

Yes, that section needs to be updated (cf. #35). But the real problem is that TF2 dependencies are lurking everywhere, so we will very soon have the unacceptable state that no catch-all venv (satisfying both TF1 and TF2 modules) is possible anymore. By then, a new solution needs to be in place, which (at least partially) isolates venvs from each other again.

2. For tensorflow 1.15.*, one can simply depend on `tensorflow-gpu == 1.15.*` _for CPU **and** GPU_ support. I am not aware of any issues using `tensorflow-gpu`'s CPU fallback on CPU

But isn't that equally true for using tensorflow == 1.15.*? It is the variant with a -gpu suffix that is going to be dropped eventually IIUC.

@mikegerber
Contributor Author

  1. For tensorflow 1.15.*, one can simply depend on tensorflow-gpu == 1.15.* for CPU and GPU support. I am not aware of any issues using tensorflow-gpu's CPU fallback on CPU

But isn't that equally true for using tensorflow == 1.15.*? It is the variant with a -gpu suffix that is going to be dropped eventually IIUC.

Nah, they had recommended tensorflow-gpu for TF2 CPU+GPU but changed it again to just tensorflow 🤣 So if tensorflow == 1.15.* has GPU support I am happy with that convention, too.

@stweil
Collaborator

stweil commented Feb 25, 2020

Is there a chance to upgrade everything to Tensorflow 2?

@bertsky
Collaborator

bertsky commented Feb 25, 2020

Is there a chance to upgrade everything to Tensorflow 2?

Code migration is not so difficult – yes, that could be streamlined in a coordinated PR effort. But IIRC the hard problem is that models will be incompatible and thus have to be retrained. Whether and when that is prudent is something the module providers have to decide for themselves. And it's highly unlikely the time frames will converge.

@mikegerber
Contributor Author

mikegerber commented Feb 25, 2020

Of course there is a chance, it just involves quite a bit of work. For maintained software like
ocrd_calamari:

  • Training a new model for a week (done)
  • Updating
  • Testing
  • Proper evaluation (no regression?)

This work is (a) not super high on priority lists because of effort vs. benefit, (b) takes time and (c) sometimes depends on the other software involved. ocrd_all will always have to deal with version conflicts.

And I imagine there are research projects that are no longer maintained, or maybe just by some poor PhD student with other priorities.

@mikegerber
Contributor Author

But isn't that equally true for using tensorflow == 1.15.*?

I do not get GPU support with that, only CPU. With tensorflow-gpu == 1.15.* I have no issues. But I'll try again after lunch, to make sure.

@mikegerber mikegerber changed the title Outdated info about conflicting requirements Tensorflow CPU vs GPU Feb 25, 2020
@stweil
Collaborator

stweil commented Feb 25, 2020

But IIRC the hard problem is that models will be incompatible and thus have to be retrained.

Maybe existing models can be converted, too?

@mikegerber
Contributor Author

But IIRC the hard problem is that models will be incompatible and thus have to be retrained.
Maybe existing models can be converted, too?

In some cases this is possible. But not for e.g. Calamari 0.3.5 → 1.0, unless they support it.

@mikegerber
Contributor Author

But isn't that equally true for using tensorflow == 1.15.*?

I do not get GPU support with that, only CPU. With tensorflow-gpu == 1.15.* I have no issues. But I'll try again after lunch, to make sure.

Alright, these are my results using the below script:

== tensorflow==1.15.*, CUDA_VISIBLE_DEVICES='0'
Already using interpreter /usr/bin/python3
2020-02-25 17:21:35.205395: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-25 17:21:35.220274: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2799925000 Hz
2020-02-25 17:21:35.220640: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x564f5da0e220 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-25 17:21:35.220655: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
GPU available: False
== tensorflow==1.15.*, CUDA_VISIBLE_DEVICES=''
Already using interpreter /usr/bin/python3
2020-02-25 17:21:55.577941: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-25 17:21:55.593243: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2799925000 Hz
2020-02-25 17:21:55.593497: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5594505bb720 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-25 17:21:55.593532: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
GPU available: False
== tensorflow-gpu==1.15.*, CUDA_VISIBLE_DEVICES='0'
Already using interpreter /usr/bin/python3
2020-02-25 17:22:27.264675: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-25 17:22:27.281148: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2799925000 Hz
2020-02-25 17:22:27.281383: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b5f6815f70 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-25 17:22:27.281398: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-25 17:22:27.282909: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-25 17:22:27.424313: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b5f68a56b0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-25 17:22:27.424336: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2080, Compute Capability 7.5
2020-02-25 17:22:27.424711: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: 
name: GeForce RTX 2080 major: 7 minor: 5 memoryClockRate(GHz): 1.86
pciBusID: 0000:01:00.0
2020-02-25 17:22:27.424872: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-02-25 17:22:27.425769: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-02-25 17:22:27.426610: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-02-25 17:22:27.426867: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-02-25 17:22:27.428707: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-02-25 17:22:27.430106: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-02-25 17:22:27.433060: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-25 17:22:27.433717: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2020-02-25 17:22:27.433752: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-02-25 17:22:27.434268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-25 17:22:27.434279: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0 
2020-02-25 17:22:27.434284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N 
2020-02-25 17:22:27.434897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:0 with 6786 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080, pci bus id: 0000:01:00.0, compute capability: 7.5)
GPU available: True
== tensorflow-gpu==1.15.*, CUDA_VISIBLE_DEVICES=''
Already using interpreter /usr/bin/python3
2020-02-25 17:22:58.971329: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-25 17:22:58.987226: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2799925000 Hz
2020-02-25 17:22:58.987497: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558cc0be40d0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-25 17:22:58.987526: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-25 17:22:58.989005: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-25 17:22:58.992375: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-02-25 17:22:58.992396: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: b-pc30533
2020-02-25 17:22:58.992402: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: b-pc30533
2020-02-25 17:22:58.992431: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 440.59.0
2020-02-25 17:22:58.992449: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 440.59.0
2020-02-25 17:22:58.992455: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 440.59.0
GPU available: False

Script:

#!/bin/sh
for package in "tensorflow==1.15.*" "tensorflow-gpu==1.15.*"; do
  for CUDA_VISIBLE_DEVICES in "0" ""; do

    echo "== $package, CUDA_VISIBLE_DEVICES='$CUDA_VISIBLE_DEVICES'"

    export CUDA_VISIBLE_DEVICES

    # mktemp is portable; $RANDOM is a bashism and expands empty in plain sh
    venv=$(mktemp -d)
    virtualenv --quiet -p /usr/bin/python3 "$venv"
    . "$venv/bin/activate"

    pip3 install --quiet --upgrade pip
    pip3 install --quiet "$package"

    python3 -c 'import tensorflow as tf; print("GPU available:", tf.test.is_gpu_available())'

    deactivate
    rm -rf "$venv"

  done
done

@mikegerber
Contributor Author

mikegerber commented Feb 25, 2020

So, tensorflow-gpu==1.15.* is the right choice for TF1, it gives GPU and CPU support. (The script does not check for CPU support, I know that -gpu works for CPU too)

@bertsky
Collaborator

bertsky commented Feb 26, 2020

So, tensorflow-gpu==1.15.* is the right choice for TF1, it gives GPU and CPU support. (The script does not check for CPU support, I know that -gpu works for CPU too)

Indeed! We should open issues/PRs to all directly or indirectly affected module repos.

(Strange though, I have a clear memory of getting GPU support out of a tensorflow PyPI release. But maybe that was in an Nvidia Docker image, or TF 2.)

@mikegerber
Contributor Author

mikegerber commented Mar 4, 2020

(Strange though, I have a clear memory of getting GPU support out of a tensorflow PyPI release. But maybe that was in an Nvidia Docker image, or TF 2.)

Behaviour changed between releases, so that explains it:

https://web.archive.org/web/diff/20191015141958/20191208214348/https://www.tensorflow.org/install/pip

[Screenshot of the changed install instructions. Left: October 2019, right: February 2020]

kba added a commit to OCR-D/OLD_ocrd_anybaseocr that referenced this issue Mar 6, 2020
@stweil
Collaborator

stweil commented Apr 23, 2020

With tensorflow-gpu == 1.15.* I have no issues.

Bad news: With tensorflow-gpu==1.15.* I have issues because it does not work on macOS. tensorflow==1.15.* works fine there.

@bertsky
Collaborator

bertsky commented Apr 23, 2020

With tensorflow-gpu == 1.15.* I have no issues.

Bad news: With tensorflow-gpu==1.15.* I have issues because it does not work on macOS. tensorflow==1.15.* works fine there.

These TF devs keep driving me mad. I thought we had this solved by now.

Okay, can you re-label the prebuilt tensorflow as tensorflow-gpu somehow?
Or should we build our own TF wheels under the correct name for macOS and include them in the supply chain?

@stweil
Collaborator

stweil commented Apr 23, 2020

Okay, can you re-label the prebuilt tensorflow as tensorflow-gpu somehow?

Yes, that is possible. Of course there remains the conflict between TF1 and TF2, so the resulting installation won't work.

@stweil
Collaborator

stweil commented Apr 23, 2020

Building TF is a nightmare. It takes days for ARM, and I expect many hours for macOS.

@bertsky
Collaborator

bertsky commented Apr 23, 2020

Okay, can you re-label the prebuilt tensorflow as tensorflow-gpu somehow?

Yes, that is possible. Of course there remains the conflict between TF1 and TF2, so the resulting installation won't work.

I don't think this is the right approach. First of all, you don't discriminate which version you are delegating to. And second, it requires installing tensorflow at the same base version (which, yes, then makes it impossible to have both TF1 and TF2 installed at the same time).

I was thinking along the lines of modifying the name in the official wheel.
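A wheel is just a zip archive, so the patching could look roughly like the sketch below. This is an untested assumption of what would need touching, not a finished supply-chain tool: the distribution name appears in the wheel's filename, in the `*.dist-info/` directory, in the `Name:` field of METADATA, and in the paths listed in RECORD, and per PEP 427 dashes in the name are escaped to underscores in file and directory names.

```python
# Hypothetical sketch: copy a wheel while re-labelling its distribution name
# (e.g. tensorflow -> tensorflow-gpu). Function name and scope are assumptions.
import os
import re
import zipfile

def relabel_wheel(src_path: str, old: str, new: str) -> str:
    safe = new.replace("-", "_")  # escaped form used in file/directory names
    dst_path = os.path.join(os.path.dirname(src_path),
                            os.path.basename(src_path).replace(old, safe, 1))
    # matches e.g. "tensorflow-1.15.2.dist-info/" at the start of a member path
    dist_info = re.compile(re.escape(old) + r"(-[^/]+\.dist-info/)")
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            name = dist_info.sub(safe + r"\1", item.filename)
            if name.endswith(".dist-info/METADATA"):
                # METADATA keeps the dashed (unescaped) distribution name
                data = data.replace(b"Name: " + old.encode(),
                                    b"Name: " + new.encode(), 1)
            elif name.endswith(".dist-info/RECORD"):
                # RECORD lists member paths, so rewrite the dist-info prefix too
                data = dist_info.sub(safe + r"\1", data.decode()).encode()
            dst.writestr(name, data)
    return dst_path
```

The import package directory (`tensorflow/`) is deliberately left untouched, since the Python import name should stay the same; only pip's view of the distribution changes.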

@bertsky
Collaborator

bertsky commented Apr 23, 2020

Building TF is a nightmare. It takes days for ARM, and I expect many hours for macOS.

I know. And it never quite works out of the box as documented (at least for me). Too fast to die, too slow to live.

But building from scratch trivially gives you whatever package name you want. (So we could have tensorflow for TF2 and tensorflow-gpu for TF1 – even if it does not have actual GPU support on macOS.) But I am still more inclined to the wheel patching approach.

@kba your thoughts?

@bertsky
Collaborator

bertsky commented Aug 20, 2020

So, except for ARM and macOS and Python 3.8 support (it just keeps growing) – which we should probably discuss in #147 – I think this has been solved by #118. @mikegerber can we close?

@bertsky bertsky closed this as completed Aug 21, 2020