Conversation

@liferoad
Contributor

@liferoad liferoad commented Oct 15, 2025

Adjust GRPC channel settings to reduce ping frequency and allow more flexible keepalive behavior. This improves performance by reducing unnecessary network traffic while maintaining connection stability.

Ref: https://github.com/grpc/grpc/blob/master/doc/keepalive.md


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

Adjust GRPC channel settings to reduce ping frequency and allow more flexible keepalive behavior. This improves performance by reducing unnecessary network traffic while maintaining connection stability.
@gemini-code-assist
Contributor

Summary of Changes

Hello @liferoad, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request optimizes GRPC channel settings within the Apache Beam Python SDK. The primary goal is to enhance performance and maintain connection stability by reducing the frequency of keepalive pings and adjusting various ping-related parameters. These changes are expected to minimize network overhead without compromising the reliability of GRPC communication.

Highlights

  • GRPC Keepalive Time Adjustment: The grpc.keepalive_time_ms setting has been increased from 20 seconds to 30 seconds. This change aims to reduce the frequency of keepalive pings, thereby decreasing unnecessary network traffic.
  • GRPC Ping Behavior Configuration: New GRPC options have been introduced to fine-tune ping behavior. These include allowing unlimited pings without data (grpc.http2.max_pings_without_data set to 0), permitting keepalive pings even when no active calls are present (grpc.keepalive_permit_without_calls set to True), and establishing minimum intervals for receiving and sending pings without data (grpc.http2.min_recv_ping_interval_without_data_ms to 5 minutes and grpc.http2.min_sent_ping_interval_without_data_ms to 10 seconds).
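The highlights above translate into channel options like the following. This is a sketch built from the values quoted in the summary, not the SDK's exact code; passing them via grpc.insecure_channel is shown only in a comment.

```python
# Channel options mirroring the highlights above. gRPC accepts options as a
# list of (name, value) tuples; the values are those described in the summary.
CHANNEL_OPTIONS = [
    ("grpc.keepalive_time_ms", 30_000),  # raised from 20s to 30s
    ("grpc.http2.max_pings_without_data", 0),  # 0 = unlimited data-less pings
    ("grpc.keepalive_permit_without_calls", 1),  # ping even with no active RPCs
    ("grpc.http2.min_recv_ping_interval_without_data_ms", 300_000),  # 5 min
    ("grpc.http2.min_sent_ping_interval_without_data_ms", 10_000),  # 10 s
]

# In real code these would be supplied at channel creation, e.g.:
#   channel = grpc.insecure_channel(target, options=CHANNEL_OPTIONS)
opts = dict(CHANNEL_OPTIONS)
print(opts["grpc.keepalive_time_ms"])  # 30000
```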
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize your Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

…tion stability

Add various grpc keepalive and ping-related options to prevent connection drops during long-running operations. The new settings help maintain active connections and detect failures faster.
Increase grpc.keepalive_time_ms from 30s to 60s and grpc.http2.min_sent_ping_interval_without_data_ms from 10s to 30s to reduce network overhead and improve performance
@codecov

codecov bot commented Oct 15, 2025

Codecov Report

❌ Patch coverage is 19.35484% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.92%. Comparing base (af748d0) to head (aec1d2a).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
...hon/apache_beam/ml/rag/enrichment/milvus_search.py 12.50% 21 Missing ⚠️
sdks/python/apache_beam/io/filebasedsink.py 33.33% 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #36528      +/-   ##
============================================
- Coverage     56.93%   56.92%   -0.01%     
  Complexity     3393     3393              
============================================
  Files          1222     1222              
  Lines        186815   186843      +28     
  Branches       3544     3544              
============================================
+ Hits         106365   106369       +4     
- Misses        77078    77102      +24     
  Partials       3372     3372              
Flag Coverage Δ
python 80.98% <19.35%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions
Contributor

Assigning reviewers:

R: @jrmccluskey for label python.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

… temp dir

Add fallback logic when initialization result is EmptySideInput to create a temporary directory instead. This prevents potential issues when the pipeline initialization phase returns an empty collection.
@github-actions github-actions bot added io and removed io labels Oct 16, 2025
@github-actions github-actions bot added the io label Oct 17, 2025
DEFAULT_OPTIONS = [
    ("grpc.keepalive_time_ms", 20000),
    ("grpc.keepalive_timeout_ms", 300000),
    # Default: 20000ms (20s), increased to 10 minutes for stability
Contributor Author

removed grpc.keepalive_time_ms since https://github.com/grpc/grpc/blob/master/doc/keepalive.md#defaults-values

INT_MAX (disabled) on client

Contributor

@sergiitk sergiitk Nov 7, 2025

It's disabled by default, and unless you enable it, all other keepalive settings you have (grpc.keepalive_timeout_ms, grpc.http2.max_pings_without_data, grpc.keepalive_permit_without_calls) have no effect.

Contributor Author

I see. I just added keepalive_time_ms back as before. Since both values are the same as before, I think this should be fine for our case.
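The point of this thread can be sketched in plain Python: gRPC channel options are (name, value) tuples, and the client-side keepalive timer stays off (INT_MAX) unless grpc.keepalive_time_ms is set, so the dependent options only matter when it is present. The `keepalive_enabled` helper below is hypothetical, not part of the SDK:

```python
# Values match the DEFAULT_OPTIONS snippet quoted in this thread.
DEFAULT_OPTIONS = [
    ("grpc.keepalive_time_ms", 20000),  # enables the client keepalive timer
    ("grpc.keepalive_timeout_ms", 300000),  # only meaningful once enabled
    ("grpc.http2.max_pings_without_data", 0),  # only meaningful once enabled
    ("grpc.keepalive_permit_without_calls", 1),  # only meaningful once enabled
]

def keepalive_enabled(options):
    """Hypothetical check: the other keepalive options have no effect unless
    grpc.keepalive_time_ms overrides the INT_MAX (disabled) client default."""
    return any(name == "grpc.keepalive_time_ms" for name, _ in options)

print(keepalive_enabled(DEFAULT_OPTIONS))  # True
# Dropping the option silently disables all keepalive behavior:
trimmed = [o for o in DEFAULT_OPTIONS if o[0] != "grpc.keepalive_time_ms"]
print(keepalive_enabled(trimmed))  # False
```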

@tvalentyn
Contributor

tvalentyn commented Nov 4, 2025

Do we have any periodic messages sent from SDK to runner that would otherwise detect a dead channel?

I tried launching a pipeline, using an SDK with @liferoad 's changes patched, SSHing to the VM and restarting the 'harness' container to simulate the crash. SDK detected Socket closed error, and restarted within a few seconds. Logs:

NOTICE 2025-11-04T22:54:50.975484Z valentyn : TTY=pts/1 ; PWD=/home/valentyn ; USER=root ; COMMAND=/var/lib/toolbox/nerdctl -n k8s.io restart 4a25ec1329e0
...
DEFAULT 2025-11-04T22:54:52.114094879Z raise self
DEFAULT 2025-11-04T22:54:52.114100315Z grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
DEFAULT 2025-11-04T22:54:52.114105833Z status = StatusCode.UNAVAILABLE
DEFAULT 2025-11-04T22:54:52.114111214Z details = "Socket closed"
DEFAULT 2025-11-04T22:54:52.114119261Z debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:12371 {created_time:"2025-11-04T22:54:51.032035467+00:00", grpc_status:14, grpc_message:"Socket closed"}"
DEFAULT 2025-11-04T22:54:52.114145002Z >
DEFAULT 2025-11-04T22:54:52.114150544Z {"stream":"stderr"}
DEFAULT 2025-11-04T22:54:52.114155948Z 2025/11/04 22:54:52 boot.go: error logging message over FnAPI. endpoint localhost:12370 error: EOF message follows
DEFAULT 2025-11-04T22:54:52.114161451Z 2025/11/04 22:54:52 WARN Python (worker sdk-0-0_sibling_1) exited 2 times: exit status 1
DEFAULT 2025-11-04T22:54:52.114167074Z restarting SDK process
...
INFO 2025-11-04T22:55:08.835663318Z Python sdk harness starting.
...
INFO 2025-11-04T22:55:10.050536Z All SDK Harnesses registered!


@scwhittle
Contributor

Do we have any periodic messages sent from SDK to runner that would otherwise detect a dead channel?

I tried launching a pipeline, using an SDK with @liferoad 's changes patched, SSHing to the VM and restarting the 'harness' container to simulate the crash. SDK detected Socket closed error, and restarted within a few seconds. Logs: (quoted in full above)

Thanks Valentyn. Can we clarify the motivation for this better in the PR description? If it is just the perceived overhead of heartbeats, I can't imagine it is much, and that doesn't seem worth the risk of adding additional latency in some cases. If it is to resolve unnecessary failures when we're CPU-pegged, that seems like better motivation, and given the testing it seems safe enough to try.

@liferoad liferoad requested a review from scwhittle November 6, 2025 16:11
Contributor

@sergiitk sergiitk left a comment

The problem with the client side: keepalive pings are disabled, therefore none of the other options apply.

    # Default: 2, set to 0 to allow unlimited ping strikes
    ("grpc.http2.max_ping_strikes", 0),
    # Default: 0 (disabled), enable socket reuse for better handling
    ("grpc.so_reuseport", 1),
Contributor

Great! With this option, you don't need to close the socket for the found port anymore, as you'll be able to bind and serve on it:

# Close sockets only now to avoid the same port being chosen twice
for s in sockets:
  s.close()

You'll need to bind the initial socket with SO_REUSEPORT, ideally with SO_REUSEADDR as well.

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

This approach addresses the race condition where one process finds an unused port and closes the socket, but before it starts listening on that port, the port is acquired by another process, resulting in EADDRINUSE.

By not closing the socket until the server stops listening, you'll prevent other processes from seeing that port as unused.

Note that this only applies to systems where SO_REUSEPORT is supported.
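A minimal sketch of the bind-and-hold approach described above, assuming a POSIX system where SO_REUSEPORT is available (the function name is illustrative, not Beam's actual helper):

```python
import socket

def reserve_port():
    """Bind an ephemeral port with SO_REUSEADDR/SO_REUSEPORT and keep the
    socket open. Holding the socket bound (instead of closing it before the
    gRPC server binds) closes the window in which another process could grab
    the port and cause EADDRINUSE."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if hasattr(socket, "SO_REUSEPORT"):  # not available on all platforms
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("localhost", 0))  # port 0: let the OS pick a free port
    return s, s.getsockname()[1]

sock, port = reserve_port()
# A gRPC server started with ("grpc.so_reuseport", 1) could now bind this
# same port while `sock` stays open, so no other process sees it as unused.
print(port)
sock.close()
```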

Comment on lines 193 to 194
    # Default: 5000ms (5s), increased to 10 minutes for stability
    ("grpc.keepalive_timeout_ms", 600000),
Contributor

@sergiitk sergiitk Nov 7, 2025

As discussed in another thread, this should be OK for your usage.

Note that without setting grpc.keepalive_time_ms in the server channel args, the server will send a keepalive ping every 2 hours.

So in the current setup, the server sends a ping every two hours, then waits 10 minutes for the client to return the ping.
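Back-of-the-envelope arithmetic for that setup (numbers taken from the comment above; this is illustrative, not SDK code):

```python
# Server-side keepalive interval when grpc.keepalive_time_ms is unset: 2 hours.
# This PR sets the server's ping-ack timeout (grpc.keepalive_timeout_ms) to
# 10 minutes.
keepalive_time_ms = 2 * 60 * 60 * 1000  # 2 h default interval
keepalive_timeout_ms = 600_000          # 10 min ack timeout from the PR

# Worst case, a dead peer is noticed one full interval plus one timeout after
# the last successful ping exchange:
worst_case_min = (keepalive_time_ms + keepalive_timeout_ms) / 1000 / 60
print(worst_case_min)  # 130.0 minutes
```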

@liferoad liferoad requested a review from sergiitk November 7, 2025 14:26
Contributor

@scwhittle scwhittle left a comment

channel_factory.py changes look good, deferring to other reviewers for the server

tvalentyn and others added 2 commits November 13, 2025 19:23
Co-authored-by: Sergii Tkachenko <sergiitk@google.com>
@tvalentyn tvalentyn merged commit 9288814 into apache:master Nov 14, 2025
97 of 101 checks passed
@tvalentyn
Contributor

thanks @liferoad ! let's see if it helps with the test flakiness.

6 participants