- Have I specified the code to reproduce the issue (Yes, No): No
- Environment in which the code is executed (e.g., Local (Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc.): Kubeflow
- TensorFlow version: 2.5
- TFX version: 1.0.0rc1
- Python version: 3.7
- Python dependencies (from `pip freeze` output): apache-beam 2.30.0
When running ExampleGen, the Beam job always fails with ENHANCE_YOUR_CALM when the runner is DirectRunner with multi_processing or multi_threading. The job sometimes fails with the Flink runner as well, and the risk seems to increase with larger workloads.
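For reference, here is a minimal sketch of the kind of pipeline that hits this. The actual repro code was not included above, so the component choice, paths, and pipeline name are illustrative assumptions, and LocalDagRunner stands in for our Kubeflow orchestration to keep the sketch self-contained; the relevant part is the beam_pipeline_args.

```python
# Minimal sketch only: component, paths, and pipeline name are illustrative,
# not the actual failing pipeline. The beam_pipeline_args are what select
# DirectRunner with multi_processing.
from tfx.components import CsvExampleGen
from tfx.orchestration import pipeline
from tfx.orchestration.local.local_dag_runner import LocalDagRunner

beam_args = [
    '--runner=DirectRunner',
    '--direct_running_mode=multi_processing',  # failure also occurs with 'multi_threading'
    '--direct_num_workers=0',                  # 0 = use all available cores
]

p = pipeline.Pipeline(
    pipeline_name='example_gen_repro',              # illustrative name
    pipeline_root='/tmp/pipeline_root',             # illustrative path
    components=[CsvExampleGen(input_base='/tmp/data')],  # illustrative input
    beam_pipeline_args=beam_args,
)
LocalDagRunner().run(p)
```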
These are the logs from a failure:
```
2021/06/23 14:47:06 Initializing python harness: /opt/apache/beam/boot --id=4-2 --logging_endpoint=localhost:36357 --artifact_endpoint=localhost:44659 --provision_endpoint=localhost:37689 --control_endpoint=localhost:43073
2021/06/23 14:47:06 Executing: python -m apache_beam.runners.worker.sdk_worker_main
2021/06/23 14:47:06 Executing: python -m apache_beam.runners.worker.sdk_worker_main
2021-06-23 14:47:08.237727: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-23 14:47:08.237789: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-06-23 14:47:08.279632: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-23 14:47:08.279686: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-06-23 14:47:08.368395: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-23 14:47:08.368450: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-06-23 14:47:08.541890: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-23 14:47:08.541945: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
E0623 15:03:00.000649888 1693 chttp2_transport.cc:1081] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0623 15:13:22.421257499 1774 chttp2_transport.cc:1081] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0623 15:16:39.047971445 1746 chttp2_transport.cc:1081] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0623 15:18:15.583498355 1738 chttp2_transport.cc:1081] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
Exception in thread read_grpc_client_inputs:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 598, in <lambda>
target=lambda: self._read_inputs(elements_iterator),
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 581, in _read_inputs
for elements in elements_iterator:
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 416, in __next__
return self._next()
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 803, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "{"created":"@1624461495.583975730","description":"Error received from peer ipv4:127.0.0.1:42335","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Socket closed","grpc_status":14}"
>
2021/06/24 08:12:35 Python exited: <nil>
Exception in thread read_grpc_client_inputs:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 598, in <lambda>
target=lambda: self._read_inputs(elements_iterator),
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 581, in _read_inputs
for elements in elements_iterator:
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 416, in __next__
return self._next()
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 803, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "{"created":"@1624461399.048610081","description":"Error received from peer ipv4:127.0.0.1:33077","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Socket closed","grpc_status":14}"
>
Exception in thread read_grpc_client_inputs:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 598, in <lambda>
target=lambda: self._read_inputs(elements_iterator),
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 581, in _read_inputs
for elements in elements_iterator:
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 416, in __next__
return self._next()
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 803, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "{"created":"@1624461202.421775372","description":"Error received from peer ipv4:127.0.0.1:35309","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Socket closed","grpc_status":14}"
>
Exception in thread read_grpc_client_inputs:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 598, in <lambda>
target=lambda: self._read_inputs(elements_iterator),
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 581, in _read_inputs
for elements in elements_iterator:
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 416, in __next__
return self._next()
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 803, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "{"created":"@1624460580.001188999","description":"Error received from peer ipv4:127.0.0.1:39125","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Socket closed","grpc_status":14}"
>
```
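For context: a GOAWAY with ENHANCE_YOUR_CALM and "too_many_pings" is what a gRPC server sends when a client issues HTTP/2 keepalive pings more often than the server permits. The sketch below shows the standard gRPC Python channel options that govern this behavior; it is illustrative only, since Beam constructs the SDK-harness channels internally (so this is not a drop-in fix) and the endpoint address is just copied from the logs above.

```python
# Illustrative only: standard gRPC Python keepalive channel options.
# This shows which knobs control the "too_many_pings" behavior; Beam builds
# its harness channels itself, so this snippet is not a workaround.
import grpc

channel = grpc.insecure_channel(
    'localhost:43073',  # control endpoint taken from the harness logs above
    options=[
        ('grpc.keepalive_time_ms', 60_000),          # ping at most once per minute
        ('grpc.http2.max_pings_without_data', 0),    # unlimited pings without payload
        ('grpc.keepalive_permit_without_calls', 1),  # allow pings on idle channels
    ],
)
```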
This is becoming a blocker for migrating to newer TFX versions.