
Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence warning when iterating over a dataset #62963

Closed
p-s-p-s opened this issue Feb 14, 2024 · 28 comments
Labels: comp:ops OPs related issues · type:bug Bug

Comments

@p-s-p-s

p-s-p-s commented Feb 14, 2024

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

binary

TensorFlow version

tf 2.16

Custom code

Yes

OS platform and distribution

Linux Ubuntu 22.04

Mobile device

No response

Python version

3.10

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

A warning appears after the last iteration over a dataset:
W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence

This warning was introduced by this commit: 04fb826
I believe that simply iterating over a dataset shouldn't cause such behavior.

Standalone code to reproduce the issue

import tensorflow as tf

range_ds = tf.data.Dataset.range(10)

for d in range_ds:
   print(d)

Relevant log output

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)
2024-02-15 08:27:36.782604: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
@sushreebarsa
Contributor

@p-s-p-s I tried to replicate the issue on Colab and could not reproduce the reported error. Could you check this gist and let us know?
Thank you!

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Feb 19, 2024
@p-s-p-s
Author

p-s-p-s commented Feb 19, 2024

@sushreebarsa tf 2.15 is not affected, but 2.16 and 2.17 are.

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Feb 19, 2024
@p-s-p-s
Author

p-s-p-s commented Feb 20, 2024

@sushreebarsa The reason you couldn't reproduce the error in Colab is that the warnings are suppressed by default there. Could you please check this colab https://colab.research.google.com/drive/1JuQriKXe-aJBAbValQK-8BFGtktzw4IW?usp=sharing ?
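
(A minimal sketch for surfacing, or hiding, the native log line locally, assuming the standard TF_CPP_MIN_LOG_LEVEL environment variable, which controls TensorFlow's C++ log filtering and must be set before the import:)

import os

# 0 = show all native logs, 1 = hide INFO, 2 = hide INFO and WARNING, 3 = hide INFO, WARNING and ERROR.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"  # surface the warning; "2" would hide it

import tensorflow as tf

for d in tf.data.Dataset.range(10):
    print(d)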

@sushreebarsa
Contributor

@p-s-p-s TF v2.15 is the latest stable version, so the error does not appear there.
We recommend using the stable TF version.
Thank you!

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Feb 20, 2024
@p-s-p-s
Author

p-s-p-s commented Feb 21, 2024

@sushreebarsa I reported this issue so that it could be fixed before the 2.16 release. Moreover, tf 2.15 with 04fb826 applied is also affected by this issue internally.

>>> import tensorflow as tf
2024-02-21 10:47:33.109381: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-21 10:47:33.109409: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-21 10:47:33.110044: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-21 10:47:33.113595: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> range_ds = tf.data.Dataset.range(10)
2024-02-21 10:47:54.344759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22462 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6
>>> 
>>> for d in range_ds:
...    print(d)
... 
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)
2024-02-21 10:47:56.048435: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
>>> tf.__version__
'2.15.0'

I am not sure what causes the problem, but as a symptomatic fix it is possible to suppress this particular warning like this:

  if (!absl::StrContains(status.message(), "End of sequence")) {
    LOG(WARNING) << "Local rendezvous is aborting with status: " << status;
  }
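
(For context, this check targets tensorflow/core/framework/local_rendezvous.cc; absl::StrContains is declared in absl/strings/match.h, which would need to be included if it isn't already.)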

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Feb 21, 2024
@sushreebarsa
Contributor

@sachinprasadhs I was able to replicate the issue reported here; please have a look. Thank you!

@SomeUserName1

SomeUserName1 commented Mar 4, 2024

Can confirm this issue with tf-nightly '2.17.0-dev20240210'

2024-02-26 04:03:26.379054: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence  
	 [[{{node IteratorGetNext}}]]   
	 [[IteratorGetNext/_4]]  
2024-02-26 04:03:26.379063: I tensorflow/core/framework/local_rendezvous.cc:422] Local rendezvous recv item cancelled. Key hash: 381694510697024129
2024-02-26 04:03:26.379073: I tensorflow/core/framework/local_rendezvous.cc:422] Local rendezvous recv item cancelled. Key hash: 6451170228096927380

@pedro-curto

Hello, I'd like to look into this issue and try to fix it, if that is possible.

@obriensystems

obriensystems commented Mar 10, 2024

Issue running the default TensorFlow training job after a Docker rebuild, only on the RTX-A4500

from tensorflow/tensorflow:latest-gpu


[+] Building 137.8s (9/9) FINISHED                                                                           docker:default
 => [internal] load build definition from Dockerfile                                                                   0.0s
 => => transferring dockerfile: 285B                                                                                   0.0s
 => [internal] load .dockerignore                                                                                      0.0s
 => => transferring context: 2B                                                                                        0.0s
 => [internal] load metadata for docker.io/tensorflow/tensorflow:latest-gpu                                            1.2s
 => [auth] tensorflow/tensorflow:pull token for registry-1.docker.io                                                   0.0s
 => [1/3] FROM docker.io/tensorflow/tensorflow:latest-gpu@sha256:4ab9ffddd6ffacc9251ac6439f431eb38d66200d3f52397b5d  135.7s


         [[RemoteCall]]
25/25 ━━━━━━━━━━━━━━━━━━━━ 8s 316ms/step - accuracy: 0.1601 - loss: 8.1802
Epoch 3/100
24/25 ━━━━━━━━━━━━━━━━━━━━ 0s 310ms/step - accuracy: 0.2835 - loss: 7.41432024-03-10 04:24:44.885256: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
2024-03-10 04:24:44.885302: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
2024-03-10 04:24:44.896965: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
2024-03-10 04:24:44.897036: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
25/25 ━━━━━━━━━━━━━━━━━━━━ 8s 307ms/step - accuracy: 0.2795 - loss: 7.2638

ObrienlabsDev/machine-learning#16

@salaki

salaki commented Mar 12, 2024

My understanding is that the error does not affect execution, but is the iterator no longer usable after the error? https://stackoverflow.com/questions/53930242/how-to-fix-a-outofrangeerror-end-of-sequence-error-when-training-a-cnn-with-t
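
(For reference, exhaustion itself is expected behaviour for an explicit iterator; looping over the dataset object again builds a fresh iterator. A minimal sketch:)

import tensorflow as tf

ds = tf.data.Dataset.range(3)

it = iter(ds)
print([int(x) for x in it])   # [0, 1, 2]; the iterator is now exhausted
print(next(it, "exhausted"))  # further next() calls raise StopIteration

# Iterating over the dataset object again creates a new iterator and starts over.
print([int(x) for x in ds])   # [0, 1, 2]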

@p-s-p-s
Author

p-s-p-s commented Mar 13, 2024

@salaki
I didn't observe any negative impact, except that it is quite annoying to receive this warning every time you iterate over a dataset. As a temporary fix for 2.16.1, I commented out this line in tensorflow/core/framework/local_rendezvous.cc
// LOG(WARNING) << "Local rendezvous is aborting with status: " << status;
and recompiled TF from source.

@ManfredLange

ManfredLange commented Mar 31, 2024

Another way to reproduce this issue with Python 3.12.2 and TensorFlow 2.16.1 is the fourth installment of the introductory video series "TensorFlow ML Zero to Hero", which uses this notebook. When training, every second epoch falls over with the issue reported here.

When I switch to TensorFlow 2.15.1, I also have to downgrade to Python 3.11.8, which I'd like to avoid. Ideally TensorFlow 2.15.1 should be made available for the most recent stable release of Python, at least until a newer stable version of TensorFlow becomes available. The combination of Python 3.11.8 and TensorFlow 2.15.1 works for the given notebook.

Here is the link to the notebook I mentioned. It runs fine online, but not locally with Python 3.12.2 and TensorFlow 2.16.1.

https://colab.research.google.com/github/lmoroney/dlaicourse/blob/master/Course%202%20-%20Part%208%20-%20Lesson%202%20-%20Notebook%20(RockPaperScissors).ipynb

This link is also accessible from the description of the video at https://www.youtube.com/watch?v=u2TjZzNuly8

I hope having another example to reproduce the problem helps with resolving this issue. Keep up the good work!

@mcourteaux

mcourteaux commented Apr 12, 2024

@google-admin @goolge Just please fire all these "issue triagers". They are a waste of our time and of your money. All they do is blindly copy-paste the code into Colab, get it wrong 90% of the time, and tell you you're wrong.

@AmmarkoV

Same issue on a larger training project: [screenshot: screen215]

I don't know if it helps or if it is related, but I think one of the recent additions to the code was to use strategies and scopes:
strategy = tf.distribute.OneDeviceStrategy(device='/gpu:0')
with strategy.scope():
    ...  # model is built and compiled inside the scope

@mpcallanan
Contributor

Hi all, this didn't make its way to us (the tf.data team) until just now, when an internal user flagged it. This should be fixed with 4924ec6.


@bast0320

The error even shows up in the output on the official TF website, so hopefully it will be fixed soon.
https://www.tensorflow.org/tutorials/quickstart/advanced

@latexalpha

Same issue on a larger training project: [screenshot: screen215]

I don't know if it helps or if it is related, but I think one of the recent additions to the code was to use strategies and scopes: strategy = tf.distribute.OneDeviceStrategy(device='/gpu:0') with strategy.scope():

In my case, after double-checking, the distribution strategy is not related to this problem.

@Bchi1994

Bchi1994 commented Apr 22, 2024

Similar error. I fixed it by removing the steps_per_epoch argument from model.fit() and model.evaluate():

import sys
from matplotlib import pyplot
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Dense
from keras.layers import Flatten
from keras.optimizers import SGD
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import tensorflow as tf
import numpy as np

physical_devices = tf.config.list_physical_devices('GPU')
try:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
except:
    # Invalid device or cannot modify virtual devices once initialized.
    pass

# define cnn model
def define_model():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same', input_shape=(200, 200, 3)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(1, activation='sigmoid'))
    # compile model
    opt = SGD(learning_rate=0.001, momentum=0.9)
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
    return model

# create data generator
datagen = ImageDataGenerator(rescale=1.0/255.0)
model = define_model()

# prepare iterators
train_it = datagen.flow_from_directory('/workspace/workspace/cats_and_dogs_data/dogs-vs-cats/train/',
                                       class_mode='binary', batch_size=64, target_size=(200, 200))
test_it = datagen.flow_from_directory('/workspace/workspace/cats_and_dogs_data/dogs-vs-cats/test1/',
                                      class_mode='binary', batch_size=64, target_size=(200, 200))

# fit model
history = model.fit(train_it, validation_data=test_it, epochs=20, verbose=1)

# evaluate model
_, acc = model.evaluate(test_it, verbose=1)
print('> %.3f' % (acc * 100.0))

@rytis-paskauskas

rytis-paskauskas commented May 23, 2024

I can reproduce the warning on Python 3.12 and TF 2.16. In addition, when my (custom) dataset has this 'issue', I also get messages when calling model.evaluate(ds). That doesn't look like something that's safe to ignore. Example:

    919/Unknown 1s 2ms/step - loss: 1.05462024-05-23 10:15:02.410200: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
	 [[{{node IteratorGetNext}}]]
/usr/lib/python3.12/contextlib.py:158: UserWarning: Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches. You may need to use the `.repeat()` function when building your dataset.
  self.gen.throw(value)
927/927 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 1.0546

PS: I DO have enough data.

@flacle

flacle commented Jun 21, 2024

I made this disappear by simply using .repeat() and not using .cache() on my training and validation data batches.

My script has a very generic input pipeline based on the tensorflow semantic segmentation tutorial with tf 2.16 and python 3.10

Why would cache() be related to the issue? Did you see a difference with and without cache()? See https://stackoverflow.com/a/78583999; it is likely a combination of .repeat() and setting steps_per_epoch correctly.
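
For reference, the pattern that answer describes looks roughly like this (a minimal sketch; the dataset, batch size, and epoch count are hypothetical):

import numpy as np
import tensorflow as tf

batch_size = 32
num_examples = 1000  # hypothetical dataset size
steps_per_epoch = num_examples // batch_size

x = np.random.rand(num_examples, 4).astype("float32")
y = np.random.rand(num_examples, 1).astype("float32")

train_ds = (
    tf.data.Dataset.from_tensor_slices((x, y))
    .batch(batch_size)
    .repeat()  # repeat so the iterator is never exhausted mid-training
)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

# With .repeat(), steps_per_epoch tells Keras where each epoch ends.
model.fit(train_ds, epochs=2, steps_per_epoch=steps_per_epoch)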

@andremfreitas

I made this disappear by simply using .repeat() and not using .cache() on my training and validation data batches.
My script has a very generic input pipeline based on the tensorflow semantic segmentation tutorial with tf 2.16 and python 3.10

Why would cache() be related to the issue? Did you see a difference with and without cache()? See https://stackoverflow.com/a/78583999; it is likely a combination of .repeat() and setting steps_per_epoch correctly.

In my case I am using neither .cache() nor .repeat(), and I still see this error.

@luvwinnie

Same here. Even when using repeat(), I still see this error.

@miticollo

I can reproduce the warning on Python 3.12 and TF 2.16. In addition, when my (custom) dataset has this 'issue', I also get messages when calling model.evaluate(ds). That doesn't look like something that's safe to ignore. Example:

    919/Unknown 1s 2ms/step - loss: 1.05462024-05-23 10:15:02.410200: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
	 [[{{node IteratorGetNext}}]]
/usr/lib/python3.12/contextlib.py:158: UserWarning: Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches. You may need to use the `.repeat()` function when building your dataset.
  self.gen.throw(value)
927/927 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 1.0546

PS: I DO have enough data.

@rytis-paskauskas That warning is not related to TensorFlow, but to Keras. During training and evaluation your code is wrapped in a with statement (see here and here). From here on I will consider only training; the same happens for evaluation. In the first epoch, if your datasets (train, valid and test) are tf.data.Dataset objects, Keras can't establish how many steps (batches) are required to complete an epoch. But don't worry: Keras "counts" the batches during the first epoch and uses that count for the following epochs.

In particular, enumerate_epoch uses num_batches to know how many batches there are in an epoch. But in the first epoch this property returns None, because dataset.cardinality < 0. Indeed, at the beginning your output shows 919/Unknown and later 927/927; this is possible because the ProgBar callback is updated in on_train_batch_end.
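
A quick way to see the unknown-cardinality case described above (a minimal sketch):

import tensorflow as tf

# A generator-backed dataset has unknown cardinality, so Keras cannot know
# the number of steps in the first epoch and shows "Unknown".
gen_ds = tf.data.Dataset.from_generator(
    lambda: iter(range(10)),
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int64),
)
print(bool(gen_ds.cardinality() == tf.data.UNKNOWN_CARDINALITY))  # True

# A range dataset has a known, finite cardinality.
print(int(tf.data.Dataset.range(10).cardinality()))  # 10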

@arianmaghsoudnia

That warning is not related to TensorFlow, but to Keras.

@miticollo I don't think that's the case. Please take a look at the same catch_stop_iteration function code in older Keras versions, even in Keras 2. It looks almost the same, with only attribute-name changes and no functional modifications.

@miticollo

@arianmaghsoudnia In that comment I was referring only to this warning:

Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches. You may need to use the `.repeat()` function when building your dataset.

and not to

W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence

The first warning comes from Keras and, as I explained above, can be ignored, while the latter comes from TensorFlow.

@arianmaghsoudnia

arianmaghsoudnia commented Oct 23, 2024

@miticollo You're correct about distinguishing between the logs from TensorFlow and Keras. However, the Keras warning appears because there's an underlying issue on the TensorFlow side. The core problem is that the OUT_OF_RANGE exception should not have been triggered in the first place, as it wasn't in earlier versions of TensorFlow. The changes that closed this issue merely treat the symptoms by downgrading the warning to an info log in TensorFlow. Unfortunately, they don't resolve the root cause of the problem.
That said, I understand this is a bit out of scope for this issue. I hope the related open issue will get attention.

@tashrifbillah

tashrifbillah commented Oct 27, 2024

I am getting this issue on CPU with Python 3.11 and TensorFlow 2.16.1:

Epoch 1/5
1/9 ━━━━━━━━━━━━━━━━━━━━ 1:54 14s/step - loss: 3.76182024-10-27 17:32:14.388601: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: INVALID_ARGUMENT: indices[0,49] = 43 is not in [0, 43)
         [[{{function_node __inference_one_step_on_data_18465}}{{node functional_1/embedding_1/GatherV2}}]]
Traceback (most recent call last):
...
...
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

Training crashes during epoch 1.


Edit:
I had an index mismatch in my training labels. It is all good now.
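
(For anyone hitting this INVALID_ARGUMENT variant: it usually means an index fed to a GatherV2/Embedding op is >= the declared size. A minimal, hypothetical reproduction on CPU, with the vocabulary size 43 mirroring the log above:)

import numpy as np
import tensorflow as tf

vocab_size = 43  # hypothetical Embedding input_dim
embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=8)

ok = np.array([[0, 1, 42]])   # all indices within [0, 43): fine
bad = np.array([[0, 1, 43]])  # 43 is out of range -> InvalidArgumentError on CPU

print(embedding(ok).shape)    # (1, 3, 8)
try:
    embedding(bad)
except tf.errors.InvalidArgumentError as e:
    print(e.message)          # "... indices[0,2] = 43 is not in [0, 43) ..."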
