
Tensorboard Callback profile_batch causes Segmentation Fault #3149

Open · rodyt opened this issue Jan 18, 2020 · 8 comments
rodyt commented Jan 18, 2020

Environment information

Suggestion: Fix conflicting installations

Namely:
pip uninstall tensorboard tensorflow tensorflow-estimator tensorflow-gpu
pip install tensorflow # or tensorflow-gpu, or tf-nightly, ...

Issue description

System:

  • Ubuntu 18.04
  • TF2.0

Error:

I am using tf.keras and the model.fit() function to train my model. I have added the tf.keras.callbacks.TensorBoard callback to my .fit() call.

I am encountering an issue where:

  • profile_batch=2 --> Segmentation Fault (core dumped) when a call is made to .fit()
  • profile_batch=0 --> No segfault (disabled profiling)

Note: this error only occurs sometimes; run model.fit() a couple of times to reproduce it. It happens regardless of the directory being logged to (see the sketch below).
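For reference, a minimal sketch of the setup described above (toy model and data, hypothetical log directory; the real model and input pipeline differ):

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for the real model and tf.data pipeline.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")

ds = tf.data.Dataset.from_tensor_slices(
    (np.random.rand(64, 4).astype("float32"),
     np.random.rand(64, 1).astype("float32"))
).batch(8)

tb_cb = tf.keras.callbacks.TensorBoard(
    log_dir="logs",      # any writable directory; the crash is independent of this path
    profile_batch=2,     # profile the second batch -> intermittent segfault
    # profile_batch=0,   # disables profiling -> no segfault
)
model.fit(ds, epochs=2, callbacks=[tb_cb])
```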

Background Info:

Currently, the input to my model is in the form of a tf.data.Dataset. The documentation for the callback says:

> profile_batch: Profile the batch to sample compute characteristics. By default, it will profile the second batch. Set profile_batch=0 to disable profiling. Must run in TensorFlow eager mode.

I use the dataset.map() function in my input pipeline to transform my input data. However, since .map() does not execute eagerly, I wrap my transform in tf.py_function, which should make it execute eagerly (I've verified that the map function does in fact run eagerly when using py_function).
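Roughly, the pipeline looks like this (minimal sketch with a toy transform; the real map function is more involved):

```python
import tensorflow as tf

def _eager_transform(x):
    # Runs eagerly inside tf.py_function, so plain Python/NumPy code works here.
    return x * 2.0

def map_fn(x):
    y = tf.py_function(func=_eager_transform, inp=[x], Tout=tf.float32)
    y.set_shape(x.shape)  # py_function drops static shape information
    return y

ds = tf.data.Dataset.from_tensor_slices(tf.range(8, dtype=tf.float32))
ds = ds.map(map_fn).batch(4)
```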

However, sometimes I still get a Segmentation Fault, as described above. On the occasions when the Segmentation Fault does not occur, the profile logged to TensorBoard is a .profile-empty file.

#2084 may be related.


wolleric commented Jun 24, 2021

Not sure if I should open my own ticket, but I am experiencing a very similar issue.

RHEL 7.9
TF 2.4.1
CUDA 11.0
CUDNN 8.0.2
4x NVIDIA TITAN V, 12GB RAM each
MirroredStrategy

The TF Profiler segfaults immediately after the last batch in the profile_batch range, right after logging "Profiler session collecting data." This happens on every training run with batch_size >= 32, but not with batch_size 16, no matter which range I select in profile_batch (configuration sketch below). If I disable profiling, everything is fine.

If I run the entire script in gdb, it does not segfault but simply hangs. If I then interrupt it, the backtrace shows that it gets stuck waiting in libpthread, shortly after a call to cuptiActivityFlushAll. I read somewhere that this can happen if you try to flush an (already) empty buffer. Is this perhaps a synchronization issue?

EDIT1: It does happen sometimes with batch_size 16. It seems entirely gone with batch_size 8. FYI, I usually train with batch_size 256. A single sample is 360 000 bytes (a 200x200x9 uint8 image).
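For reference, the configuration described above looks roughly like this (minimal sketch; the model, log directory, and batch range are placeholders):

```python
import tensorflow as tf

# MirroredStrategy across the visible GPUs; on a CPU-only machine this
# silently falls back to a single device.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="adam", loss="mse")

tb_cb = tf.keras.callbacks.TensorBoard(
    log_dir="logs",
    profile_batch=(10, 20),  # recent TF releases accept a (start, stop) batch range
)
# The segfault is observed right after the last batch of this range is profiled.
```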


183amir commented Jun 24, 2021

I have this issue as well. It happens with TensorFlow 2.5 and also with the nightly build. If I run with python -q -X faulthandler ..., here is what I get:

 14/400 [>.............................] - ETA: 9:13 - loss: 7.0166
2021-06-24 19:06:02.117170: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
 15/400 [>.............................] - ETA: 10:38 - loss: 7.0113
 16/400 [>.............................] - ETA: 10:33 - loss: 7.0191
 17/400 [>.............................] - ETA: 10:28 - loss: 7.0009
 18/400 [>.............................] - ETA: 10:24 - loss: 6.9822
 19/400 [>.............................] - ETA: 10:20 - loss: 6.9481
 20/400 [>.............................] - ETA: 10:17 - loss: 6.9383
 21/400 [>.............................] - ETA: 10:13 - loss: 6.9172
 22/400 [>.............................] - ETA: 10:10 - loss: 6.8959
 23/400 [>.............................] - ETA: 10:09 - loss: 6.8701
 24/400 [>.............................] - ETA: 10:06 - loss: 6.8507
 25/400 [>.............................] - ETA: 10:03 - loss: 6.8337
 26/400 [>.............................] - ETA: 10:04 - loss: 6.8356
 27/400 [=>............................] - ETA: 10:01 - loss: 6.8093
 28/400 [=>............................] - ETA: 9:59 - loss: 6.8078
 29/400 [=>............................] - ETA: 9:56 - loss: 6.7863
 30/400 [=>............................] - ETA: 9:53 - loss: 6.7700
2021-06-24 19:06:30.062064: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1748] CUPTI activity buffer flushed
Fatal Python error: Segmentation fault

Thread 0x00007f258b7fe700 (most recent call first):
  File "/conda/envs/hardening/lib/python3.8/multiprocessing/pool.py", line 576 in _handle_results
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 870 in run
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f258bfff700 (most recent call first):
  File "/conda/envs/hardening/lib/python3.8/multiprocessing/pool.py", line 528 in _handle_tasks
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 870 in run
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f41c8ff9700 (most recent call first):
  File "/conda/envs/hardening/lib/python3.8/selectors.py", line 415 in select
  File "/conda/envs/hardening/lib/python3.8/multiprocessing/connection.py", line 931 in wait
  File "/conda/envs/hardening/lib/python3.8/multiprocessing/pool.py", line 499 in _wait_for_updates
  File "/conda/envs/hardening/lib/python3.8/multiprocessing/pool.py", line 519 in _handle_workers
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 870 in run
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f41c97fa700 (most recent call first):
  File "/conda/envs/hardening/lib/python3.8/multiprocessing/pool.py", line 114 in worker
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 870 in run
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f41c9ffb700 (most recent call first):
  File "/conda/envs/hardening/lib/python3.8/multiprocessing/pool.py", line 114 in worker
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 870 in run
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f41ca7fc700 (most recent call first):
  File "/conda/envs/hardening/lib/python3.8/multiprocessing/pool.py", line 114 in worker
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 870 in run
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f41caffd700 (most recent call first):
  File "/conda/envs/hardening/lib/python3.8/multiprocessing/pool.py", line 114 in worker
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 870 in run
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f41cb7fe700 (most recent call first):
  File "/conda/envs/hardening/lib/python3.8/multiprocessing/pool.py", line 114 in worker
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 870 in run
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/conda/envs/hardening/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x00007f486e404740 (most recent call first):
  File "/conda/envs/hardening/lib/python3.8/site-packages/tensorflow/python/profiler/profiler_v2.py", line 154 in stop
  File "/conda/envs/hardening/lib/python3.8/site-packages/keras/callbacks.py", line 2575 in _stop_profiler
  File "/conda/envs/hardening/lib/python3.8/site-packages/keras/callbacks.py", line 2468 in _stop_trace
  File "/conda/envs/hardening/lib/python3.8/site-packages/keras/callbacks.py", line 2437 in on_train_batch_end
  File "/conda/envs/hardening/lib/python3.8/site-packages/keras/callbacks.py", line 353 in _call_batch_hook_helper
  File "/conda/envs/hardening/lib/python3.8/site-packages/keras/callbacks.py", line 315 in _call_batch_end_hook
  File "/conda/envs/hardening/lib/python3.8/site-packages/keras/callbacks.py", line 295 in _call_batch_hook
  File "/conda/envs/hardening/lib/python3.8/site-packages/keras/callbacks.py", line 435 in on_train_batch_end
  File "/conda/envs/hardening/lib/python3.8/site-packages/keras/engine/training.py", line 1194 in fit

More info:

dpkg -l | grep cuda
ii  custom-pack-cuda        2019.01.10a  all    Custom Configuration Framework - nVidia CUDA stack
ii  libcuda1:amd64          455.32.00-1  amd64  NVIDIA CUDA Driver Library
ii  libcudart11.0:amd64     11.1.0-1     amd64  NVIDIA CUDA Runtime Library
ii  nvidia-cuda-dev:amd64   11.1.0-1     amd64  NVIDIA CUDA development files
ii  nvidia-cuda-toolkit     11.1.0-1     amd64  NVIDIA CUDA development toolkit
ii  python-pycuda           2018.1.1-3   amd64  Python module to access Nvidia's CUDA parallel computation API
ii  python3-pycuda          2018.1.1-3   amd64  Python 3 module to access Nvidia's CUDA parallel computation API

dpkg -l | grep cupti
ii  libcupti-dev:amd64      11.1.0-1     amd64  NVIDIA CUDA Profiler Tools Interface development files
ii  libcupti11.1:amd64      11.1.0-1     amd64  NVIDIA CUDA Profiler Tools Interface runtime library

@tanzhenyu

This is happening with CUDA 11.1 with TF 2.4, 2.5, and 2.6.

@trisolaran (Contributor)

I believe that's a CUDA 11.1 problem, specifically CUPTI 11.1. I think that has been confirmed and fixed by NVIDIA. Internally we use CUDA 11.3.

On the other hand, whether the profiler should be enabled by default is debatable.
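As a sanity check, the CUDA/cuDNN versions a given TF wheel was built against can be inspected from Python (the CUPTI actually loaded at runtime comes from the locally installed toolkit); a small sketch:

```python
import tensorflow as tf

# Versions the wheel was compiled against (keys may be absent on CPU-only builds).
build = tf.sysconfig.get_build_info()
print(tf.__version__)
print(build.get("cuda_version"), build.get("cudnn_version"))
```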

@tanzhenyu

> I believe that's a CUDA 11.1 problem, specifically CUPTI 11.1. I think that has been confirmed and fixed by NVIDIA. Internally we use CUDA 11.3.
>
> On the other hand, whether the profiler should be enabled by default is debatable.

You're right, upgrading the CUDA version helps. Thanks.


wolleric commented Sep 5, 2021

> I believe that's a CUDA 11.1 problem, specifically CUPTI 11.1. I think that has been confirmed and fixed by NVIDIA. Internally we use CUDA 11.3. On the other hand, whether the profiler should be enabled by default is debatable.
>
> You're right, upgrading the CUDA version helps. Thanks.

Are you saying that one can combine TF 2.4 with CUDA 11.3? I am still on CUDA 11.0, as recommended on TF's GPU support page.

@ggosiang

> I believe that's a CUDA 11.1 problem, specifically CUPTI 11.1. I think that has been confirmed and fixed by NVIDIA. Internally we use CUDA 11.3. On the other hand, whether the profiler should be enabled by default is debatable.
>
> You're right, upgrading the CUDA version helps. Thanks.

I've got the same problem with TF 2.5.0 + CUDA 11.2; changing profile_batch to 0 works.

However, the suggested solution is to upgrade CUDA to 11.3.

According to https://www.tensorflow.org/install/source?hl=sk#gpu, it seems the combination of TF 2.5.0 + CUDA 11.3 does not work; am I right?

Artem-B (Member) commented Sep 10, 2021

> According to https://www.tensorflow.org/install/source?hl=sk#gpu, it seems the combination of TF 2.5.0 + CUDA 11.3 does not work; am I right?

No, the list does not say that other combinations do not work.
As mentioned above, our internal TF builds have been using CUDA-11.3 for a while now.
