@stefanistrate
Contributor

R: @aaltay

I find myself running longer jobs locally via the DirectRunner and I keep getting errors like this:

E0810 10:28:22.716036000 123145578897408 chttp2_transport.cc:1081]     Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
    target=lambda: self._read_inputs(elements_iterator),
  File "/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/apache_beam/runners/worker/data_plane.py", line 570, in _read_inputs
    for elements in elements_iterator:
  File "/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/grpc/_server.py", line 382, in __next__
    return self._next()
  File "/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/grpc/_server.py", line 374, in _next
    request = self._look_for_request()
  File "/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/grpc/_server.py", line 358, in _look_for_request
    _raise_rpc_error(self._state)
  File "/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/grpc/_server.py", line 113, in _raise_rpc_error
    raise rpc_error
grpc.RpcError
Exception in thread read_grpc_client_inputs:
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 973, in _bootstrap_inner
I0810 10:28:22.788511 123145827430400 local_job_service.py:340] Worker: severity: ERROR timestamp {   seconds: 1628587702   nanos: 724023103 } message: "Failed to read inputs in the data plane.\nTraceback (most recent call last):\n  File \"/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/apache_beam/runners/worker/data_plane.py\", line 570, in _read_inputs\n    for elements in elements_iterator:\n  File \"/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/grpc/_channel.py\", line 416, in __next__\n    return self._next()\n  File \"/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/grpc/_channel.py\", line 803, in _next\n    raise self\ngrpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"Too many pings\"\n\tdebug_error_string = \"{\"created\":\"@1628587702.717555000\",\"description\":\"Error received from peer ipv6:[::1]:57093\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1063,\"grpc_message\":\"Too many pings\",\"grpc_status\":14}\"\n>" trace: "Traceback (most recent call last):\n  File \"/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/apache_beam/runners/worker/data_plane.py\", line 570, in _read_inputs\n    for elements in elements_iterator:\n  File \"/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/grpc/_channel.py\", line 416, in __next__\n    return self._next()\n  File \"/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/grpc/_channel.py\", line 803, in _next\n    raise self\ngrpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"Too many pings\"\n\tdebug_error_string = \"{\"created\":\"@1628587702.717555000\",\"description\":\"Error received from peer ipv6:[::1]:57093\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1063,\"grpc_message\":\"Too many pings\",\"grpc_status\":14}\"\n>\n" log_location: "/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/apache_beam/runners/worker/data_plane.py:577" thread: "read_grpc_client_inputs"
    self.run()
  File "/usr/local/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/apache_beam/runners/worker/data_plane.py", line 587, in <lambda>
    target=lambda: self._read_inputs(elements_iterator),
  File "/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/apache_beam/runners/worker/data_plane.py", line 570, in _read_inputs
    for elements in elements_iterator:
  File "/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/grpc/_channel.py", line 416, in __next__
    return self._next()
  File "/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/grpc/_channel.py", line 803, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Too many pings"
        debug_error_string = "{"created":"@1628587702.717555000","description":"Error received from peer ipv6:[::1]:57093","file":"src/core/lib/surface/call.cc","file_line":1063,"grpc_message":"Too many pings","grpc_status":14}"

I understand that the DirectRunner has its limitations and, in my case, I'm not using it for production jobs. However, it would be nice if some of the ping limits were relaxed a bit. Currently, the gRPC server has fairly strict limits:

GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA = 2
GRPC_ARG_HTTP2_MAX_PING_STRIKES = 2

My PR removes these two limits completely wherever grpc.server() is invoked (Python SDK only). I'm not sure this is the best approach, but I'm happy to adjust it based on your feedback.
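
For readers unfamiliar with gRPC server options, here is a minimal sketch of what lifting the two limits could look like. It is an illustration, not the PR's actual diff. The keys `grpc.http2.max_pings_without_data` and `grpc.http2.max_ping_strikes` are the string names behind the two constants above, and a value of 0 tells gRPC not to enforce the limit. The helper names `relaxed_ping_options` and `make_server`, and the `max_workers` default, are hypothetical.

```python
from concurrent import futures


def relaxed_ping_options():
    """Return gRPC server options that lift both ping limits (0 = no limit)."""
    return [
        # String key behind GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA.
        ("grpc.http2.max_pings_without_data", 0),
        # String key behind GRPC_ARG_HTTP2_MAX_PING_STRIKES.
        ("grpc.http2.max_ping_strikes", 0),
    ]


def make_server(max_workers=10):
    """Hypothetical helper: build a grpc.server() with the relaxed options."""
    import grpc  # imported lazily so the options helper works without grpcio

    return grpc.server(
        futures.ThreadPoolExecutor(max_workers=max_workers),
        options=relaxed_ping_options(),
    )
```

With these options the server should no longer answer frequent keepalive pings with a GOAWAY carrying "too_many_pings", at the cost of no longer protecting itself against a client that pings excessively.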

Bonus request: it would be even nicer if Beam could allow passing options to the gRPC client & server via some flags.
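
To make the bonus request concrete, one possible shape is a repeatable flag whose values are turned into gRPC option tuples. Everything here is an assumption for illustration only: the flag name `--grpc_server_option`, its `KEY=VALUE` format, and the helper `parse_grpc_options` do not exist in Beam.

```python
import argparse


def parse_grpc_options(argv):
    """Parse repeated --grpc_server_option KEY=VALUE flags into gRPC options.

    The flag name and format are hypothetical, not an existing Beam option.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--grpc_server_option",
        action="append",
        default=[],
        metavar="KEY=VALUE",
    )
    # Ignore unrelated pipeline flags instead of failing on them.
    known, _ = parser.parse_known_args(argv)

    options = []
    for item in known.grpc_server_option:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {item!r}")
        # gRPC options take integer or string values: use an int when the
        # value parses as one, otherwise keep the raw string.
        try:
            parsed = int(value)
        except ValueError:
            parsed = value
        options.append((key, parsed))
    return options
```

The resulting list could then be passed as the `options` argument of `grpc.server()`, for example after running with `--grpc_server_option grpc.http2.max_ping_strikes=0`.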


@codecov

codecov bot commented Aug 11, 2021

Codecov Report

Merging #15314 (1d0a16f) into master (2fd9875) will decrease coverage by 0.00%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master   #15314      +/-   ##
==========================================
- Coverage   83.81%   83.80%   -0.01%     
==========================================
  Files         441      444       +3     
  Lines       59745    60477     +732     
==========================================
+ Hits        50075    50683     +608     
- Misses       9670     9794     +124     
Impacted Files Coverage Δ
...ache_beam/runners/portability/local_job_service.py 80.00% <50.00%> (-0.36%) ⬇️
...e_beam/runners/portability/abstract_job_service.py 83.70% <100.00%> (+0.09%) ⬆️
...nners/portability/fn_api_runner/worker_handlers.py 79.44% <100.00%> (ø)
...thon/apache_beam/runners/worker/channel_factory.py 75.00% <100.00%> (ø)
...hon/apache_beam/runners/worker/worker_pool_main.py 59.25% <100.00%> (+0.50%) ⬆️
...n/apache_beam/ml/gcp/recommendations_ai_test_it.py 56.81% <0.00%> (-12.95%) ⬇️
...ks/python/apache_beam/runners/worker/data_plane.py 87.50% <0.00%> (-3.11%) ⬇️
...ache_beam/runners/interactive/recording_manager.py 96.55% <0.00%> (-2.43%) ⬇️
...hon/apache_beam/runners/direct/test_stream_impl.py 94.02% <0.00%> (-2.24%) ⬇️
sdks/python/apache_beam/internal/metrics/metric.py 90.00% <0.00%> (-1.49%) ⬇️
... and 87 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@pabloem pabloem requested a review from tvalentyn August 19, 2021 17:44
@tvalentyn
Contributor

Hi @angoenka, could you please take a look at this change?
@stefanistrate, you'll eventually need to fix the autoformatter warnings on this change; see: https://cwiki.apache.org/confluence/display/BEAM/Python+Tips#PythonTips-LintandFormattingChecks

@aaltay aaltay requested a review from angoenka August 31, 2021 22:58
@aaltay
Member

aaltay commented Aug 31, 2021

I missed this review request earlier. Sorry for the delay and thank you for the contribution.

@ibzib ibzib changed the title Relax ping limits on the gRPC server [BEAM-12448] Relax ping limits on the gRPC server Sep 10, 2021
@ibzib

ibzib commented Sep 10, 2021

R: @ibzib

@aaltay aaltay requested a review from ibzib September 23, 2021 19:04
@stefanistrate
Contributor Author

I've just run yapf on my code to fix the formatting issues. The formatting checks are passing now.

@tvalentyn
Contributor

07:38:58 The command '/bin/sh -c mkdir -p /usr/local/gcloud &&     cd /usr/local/gcloud &&     curl -s -O https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz &&     tar -xf google-cloud-sdk.tar.gz &&     /usr/local/gcloud/google-cloud-sdk/install.sh &&     rm google-cloud-sdk.tar.gz' returned a non-zero code: 1
07:38:58 
07:38:58 > Task :sdks:python:container:py36:docker FAILED
07:39:00 

Perhaps it's a flake; let's retry.

@tvalentyn
Contributor

Run Portable_Python PreCommit

@ibzib ibzib left a comment

LGTM
