Tensorboard Callback profile_batch causes Segmentation Fault #3149
Comments
Not sure if I should open my own ticket, but I am experiencing a very similar issue on RHEL 7.9. The TF Profiler segfaults immediately after the last batch in the profile_batch range, right after logging "Profiler session collecting data." This happens on every training run with batch_size >= 32, but not with batch_size 16, no matter which range I select in profile_batch. If I disable profiling, everything is fine.
If I run the entire script in gdb, it does not segfault but simply hangs. If I interrupt it then, the backtrace shows that it is stuck waiting in libpthread, shortly after a call to cuptiActivityFlushAll. I read somewhere that this can happen if you try to flush an already empty buffer. Is this perhaps a synchronization issue?
EDIT1: It does sometimes happen with batch_size 16. It seems entirely gone with batch_size 8. FYI, I usually train with batch_size 256. A single sample is 360,000 bytes (image 200x200x9, uint8). |
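For scale, the input sizes quoted in the comment above can be worked out per batch. This is only a sketch of the arithmetic using the figures given there (200x200x9 uint8 images); the function names are made up for illustration, and nothing here shows that input size is what triggers the crash.

```python
# Figures taken from the comment above: one sample is a 200x200x9 uint8 image.
HEIGHT, WIDTH, CHANNELS = 200, 200, 9
BYTES_PER_ELEMENT = 1  # uint8


def sample_bytes():
    """Size of one input sample in bytes."""
    return HEIGHT * WIDTH * CHANNELS * BYTES_PER_ELEMENT


def batch_bytes(batch_size):
    """Size of the raw input data for one batch, in bytes."""
    return batch_size * sample_bytes()


if __name__ == "__main__":
    # Batch sizes mentioned in the comment.
    for bs in (8, 16, 32, 256):
        print(f"batch_size={bs:>3}: {batch_bytes(bs):>11,} bytes")
```

With these numbers, a batch of 8 is about 2.9 MB of raw input and a batch of 256 is about 92 MB.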
I have this issue as well. It happens with tensorflow 2.5 and also nightly. If I run with
more info
|
This is happening with cuda-11.1 with tf2.4, 2.5 and 2.6 |
I believe that's a cuda-11.1 problem, specifically cupti 11.1. I think that has been confirmed and fixed by NVIDIA. On the other hand, whether profiling should be enabled by default is debatable. |
You're right, upgrading cuda version helps. Thanks. |
Are you saying that one can combine tf2.4 with cuda-11.3? I am still on cuda-11.0, as recommended on TF's GPU support page. |
I've got the same problem with TF 2.5.0 + CUDA 11.2, changed the However, the suggested solution is to upgrade CUDA to 11.3. According to https://www.tensorflow.org/install/source?hl=sk#gpu, it seems the combination of TF 2.5.0 + CUDA 11.3 does not work, am I right? |
No, the list does not say that other combinations do not work. |
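The CUDA versions named in the comments above can be collected into a small check. This is a hypothetical helper, not part of TensorFlow or CUDA: the set contains only the versions that commenters in this thread reported the crash with (11.1 and 11.2), so a version missing from it is not thereby known to be safe.

```python
# CUDA (major, minor) versions that commenters in this thread reported the
# segfault with. This is an assumption drawn from the thread, not an
# official compatibility list.
REPORTED_IN_THREAD = {(11, 1), (11, 2)}


def parse_major_minor(version):
    """Turn a version string such as '11.1' or '11.2.152' into (11, 1) or (11, 2)."""
    parts = version.strip().split(".")
    return int(parts[0]), int(parts[1])


def reported_in_thread(cuda_version):
    """True if this CUDA version was reported as crashing in the comments above."""
    return parse_major_minor(cuda_version) in REPORTED_IN_THREAD


if __name__ == "__main__":
    for v in ("11.0", "11.1", "11.2", "11.3"):
        print(v, reported_in_thread(v))
```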
Environment information
Suggestion: Fix conflicting installations
Namely:
pip uninstall tensorboard tensorflow tensorflow-estimator tensorflow-gpu
pip install tensorflow  # or tensorflow-gpu, or tf-nightly, ...
Issue description
System:
Error:
I am using tf.keras and the model.fit() function to train my model. I have added the tf.keras.callbacks.TensorBoard callback to my .fit() call.
I am encountering an issue where if:
.fit()
Note: This error only occurs sometimes; run model.fit() a couple of times to reproduce it. The error happens independently of the directory being logged to.
Background Info:
Currently, the input to my model is in the form of a tf.data.Dataset. The documentation for the callback says:
I use the dataset.map() function in my input pipeline to transform my input data. However, since .map() does not execute eagerly, I wrap the mapped function in a tf.py_function, which should make it execute eagerly (I've verified that the map function does in fact run eagerly after using py_function).
However, sometimes, I still get a Segmentation Fault error, as described above. On the occasions where the Segmentation Fault does not occur, the profile logged to TensorBoard is a .profile-empty file.
#2084 may be related.
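For anyone who wants to try the workaround mentioned in the comments (disabling profiling), here is a minimal sketch of how the callback's profile_batch argument could be set. The helper names tensorboard_kwargs and make_tensorboard_callback, the "logs" directory, and the (10, 20) default range are made up for illustration; only tf.keras.callbacks.TensorBoard and its log_dir and profile_batch arguments come from TensorFlow. Setting profile_batch=0 turns profiling off, and a (start, stop) tuple profiles that range of batches.

```python
def tensorboard_kwargs(log_dir, profile=False, batch_range=(10, 20)):
    """Build keyword arguments for tf.keras.callbacks.TensorBoard.

    With profile=False the result has profile_batch=0, which disables
    profiling. With profile=True it carries a (start, stop) batch range.
    """
    if not profile:
        return {"log_dir": log_dir, "profile_batch": 0}
    start, stop = batch_range
    if not (0 < start <= stop):
        raise ValueError("batch_range must satisfy 0 < start <= stop")
    return {"log_dir": log_dir, "profile_batch": (start, stop)}


def make_tensorboard_callback(log_dir, **options):
    """Create the callback; TensorFlow is imported only when this is called."""
    import tensorflow as tf

    return tf.keras.callbacks.TensorBoard(**tensorboard_kwargs(log_dir, **options))
```

It could then be passed to training as model.fit(dataset, callbacks=[make_tensorboard_callback("logs")]). This only avoids the profiler; it does not address the underlying crash.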