Beam worker fails with "ENHANCE_YOUR_CALM" in TFX 1.0.0rc1 #3961

@ConverJens

Description

  • Have I specified the code to reproduce the issue (Yes, No): No
  • Environment in which the code is executed (e.g., Local(Linux/MacOS/Windows),
    Interactive Notebook, Google Cloud, etc): KubeFlow
  • TensorFlow version: 2.5
  • TFX Version: 1.0.0rc1
  • Python version: 3.7
  • Python dependencies (from pip freeze output): beam 2.30.0

When running ExampleGen, the Beam job always fails with ENHANCE_YOUR_CALM when the runner is DirectRunner with multi_processing or multi_threading. The job also sometimes fails with the Flink runner, and the risk seems to increase with larger workloads.
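For anyone trying to reproduce this, the DirectRunner setup described above corresponds to Beam's `--direct_running_mode` and `--direct_num_workers` pipeline options. The sketch below only builds the flag list. The helper name `direct_runner_args` and the worker count of 4 are assumptions for illustration and are not taken from the failing pipeline. In TFX, flags like these are normally passed to the pipeline through `beam_pipeline_args`.

```python
from typing import List

# Values accepted by Beam's --direct_running_mode pipeline option.
DIRECT_RUNNING_MODES = ("in_memory", "multi_threading", "multi_processing")


def direct_runner_args(running_mode: str, num_workers: int) -> List[str]:
    """Build DirectRunner flags (hypothetical helper, not part of TFX or Beam)."""
    if running_mode not in DIRECT_RUNNING_MODES:
        raise ValueError(f"unknown direct_running_mode: {running_mode!r}")
    if num_workers < 0:
        raise ValueError("num_workers must be >= 0")
    return [
        "--runner=DirectRunner",
        f"--direct_running_mode={running_mode}",
        f"--direct_num_workers={num_workers}",
    ]


# Assumed worker count; the issue does not state the value that was used.
failing_args = direct_runner_args("multi_processing", 4)
print(failing_args)
```

Switching the first argument to `"in_memory"` gives a single-process baseline to compare against, since the report only mentions failures with the two multi-worker modes.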

These are the logs from a failed run:

2021/06/23 14:47:06 Initializing python harness: /opt/apache/beam/boot --id=4-2 --logging_endpoint=localhost:36357 --artifact_endpoint=localhost:44659 --provision_endpoint=localhost:37689 --control_endpoint=localhost:43073
2021/06/23 14:47:06 Executing: python -m apache_beam.runners.worker.sdk_worker_main
2021/06/23 14:47:06 Executing: python -m apache_beam.runners.worker.sdk_worker_main
2021-06-23 14:47:08.237727: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-23 14:47:08.237789: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-06-23 14:47:08.279632: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-23 14:47:08.279686: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-06-23 14:47:08.368395: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-23 14:47:08.368450: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-06-23 14:47:08.541890: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-23 14:47:08.541945: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
E0623 15:03:00.000649888    1693 chttp2_transport.cc:1081]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0623 15:13:22.421257499    1774 chttp2_transport.cc:1081]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0623 15:16:39.047971445    1746 chttp2_transport.cc:1081]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0623 15:18:15.583498355    1738 chttp2_transport.cc:1081]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
Exception in thread read_grpc_client_inputs:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 598, in <lambda>
    target=lambda: self._read_inputs(elements_iterator),
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 581, in _read_inputs
    for elements in elements_iterator:
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 416, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 803, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1624461495.583975730","description":"Error received from peer ipv4:127.0.0.1:42335","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Socket closed","grpc_status":14}"
>

2021/06/24 08:12:35 Python exited: <nil>
Exception in thread read_grpc_client_inputs:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 598, in <lambda>
    target=lambda: self._read_inputs(elements_iterator),
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 581, in _read_inputs
    for elements in elements_iterator:
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 416, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 803, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1624461399.048610081","description":"Error received from peer ipv4:127.0.0.1:33077","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Socket closed","grpc_status":14}"
>

Exception in thread read_grpc_client_inputs:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 598, in <lambda>
    target=lambda: self._read_inputs(elements_iterator),
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 581, in _read_inputs
    for elements in elements_iterator:
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 416, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 803, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1624461202.421775372","description":"Error received from peer ipv4:127.0.0.1:35309","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Socket closed","grpc_status":14}"
>

Exception in thread read_grpc_client_inputs:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 598, in <lambda>
    target=lambda: self._read_inputs(elements_iterator),
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 581, in _read_inputs
    for elements in elements_iterator:
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 416, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 803, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1624460580.001188999","description":"Error received from peer ipv4:127.0.0.1:39125","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Socket closed","grpc_status":14}"

This is becoming a blocker for migrating to newer TFX versions.
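To make it easier to compare runs, the GOAWAY lines in the logs above can be pulled out with a small script. This is a rough sketch: the regular expression is written against the four lines shown in this issue and is an assumption about the gRPC log layout in general, and `find_goaways` is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# Matches lines such as:
# E0623 15:03:00.000649888    1693 chttp2_transport.cc:1081]   Received a GOAWAY ...
GOAWAY_RE = re.compile(
    r'^E\d{4} (?P<time>\d{2}:\d{2}:\d{2})\.\d+\s+'
    r'(?P<tid>\d+) \S+\]\s+Received a GOAWAY with error code (?P<code>\w+)'
    r' and debug data equal to "(?P<debug>[^"]*)"'
)


def find_goaways(log_text: str) -> List[Tuple[str, int, str, str]]:
    """Return (time, numeric id field, error code, debug data) per GOAWAY line."""
    events = []
    for line in log_text.splitlines():
        match = GOAWAY_RE.match(line.strip())
        if match:
            events.append(
                (match["time"], int(match["tid"]), match["code"], match["debug"])
            )
    return events


# The four GOAWAY lines copied from the logs in this issue.
sample_log = '''
E0623 15:03:00.000649888    1693 chttp2_transport.cc:1081]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0623 15:13:22.421257499    1774 chttp2_transport.cc:1081]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0623 15:16:39.047971445    1746 chttp2_transport.cc:1081]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0623 15:18:15.583498355    1738 chttp2_transport.cc:1081]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
'''

events = find_goaways(sample_log)
for event in events:
    print(event)
```

In this run, each of the four GOAWAY lines carries a different numeric id, and the first one appears about 16 minutes after the harness started at 14:47:06.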
