Conversation

@liferoad
Contributor

@liferoad liferoad commented Oct 15, 2025

Adjust GRPC channel settings to reduce ping frequency and allow more flexible keepalive behavior. This improves performance by reducing unnecessary network traffic while maintaining connection stability.

Ref: https://github.com/grpc/grpc/blob/master/doc/keepalive.md


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

Adjust GRPC channel settings to reduce ping frequency and allow more flexible keepalive behavior. This improves performance by reducing unnecessary network traffic while maintaining connection stability.
@gemini-code-assist
Contributor

Summary of Changes

Hello @liferoad, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request optimizes GRPC channel settings within the Apache Beam Python SDK. The primary goal is to enhance performance and maintain connection stability by reducing the frequency of keepalive pings and adjusting various ping-related parameters. These changes are expected to minimize network overhead without compromising the reliability of GRPC communication.

Highlights

  • GRPC Keepalive Time Adjustment: The grpc.keepalive_time_ms setting has been increased from 20 seconds to 30 seconds. This change aims to reduce the frequency of keepalive pings, thereby decreasing unnecessary network traffic.
  • GRPC Ping Behavior Configuration: New GRPC options have been introduced to fine-tune ping behavior. These include allowing unlimited pings without data (grpc.http2.max_pings_without_data set to 0), permitting keepalive pings even when no active calls are present (grpc.keepalive_permit_without_calls set to True), and establishing minimum intervals for receiving and sending pings without data (grpc.http2.min_recv_ping_interval_without_data_ms to 5 minutes and grpc.http2.min_sent_ping_interval_without_data_ms to 10 seconds).
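The highlights above translate into channel options like the following. This is a sketch built from the values quoted in the summary, not the SDK's exact code; passing them via grpc.insecure_channel is shown only in a comment.

```python
# Channel options mirroring the highlights above. gRPC accepts options as a
# list of (name, value) tuples; the values are those described in the summary.
CHANNEL_OPTIONS = [
    ("grpc.keepalive_time_ms", 30_000),  # raised from 20s to 30s
    ("grpc.http2.max_pings_without_data", 0),  # 0 = unlimited data-less pings
    ("grpc.keepalive_permit_without_calls", 1),  # ping even with no active RPCs
    ("grpc.http2.min_recv_ping_interval_without_data_ms", 300_000),  # 5 min
    ("grpc.http2.min_sent_ping_interval_without_data_ms", 10_000),  # 10 s
]

# In real code these would be supplied at channel creation, e.g.:
#   channel = grpc.insecure_channel(target, options=CHANNEL_OPTIONS)
opts = dict(CHANNEL_OPTIONS)
print(opts["grpc.keepalive_time_ms"])  # 30000
```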
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize your Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

…tion stability

Add various grpc keepalive and ping-related options to prevent connection drops during long-running operations. The new settings help maintain active connections and detect failures faster.
Increase grpc.keepalive_time_ms from 30s to 60s and grpc.http2.min_sent_ping_interval_without_data_ms from 10s to 30s to reduce network overhead and improve performance
@codecov

codecov bot commented Oct 15, 2025

Codecov Report

❌ Patch coverage is 19.35484% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.92%. Comparing base (af748d0) to head (aec1d2a).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
...hon/apache_beam/ml/rag/enrichment/milvus_search.py 12.50% 21 Missing ⚠️
sdks/python/apache_beam/io/filebasedsink.py 33.33% 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #36528      +/-   ##
============================================
- Coverage     56.93%   56.92%   -0.01%     
  Complexity     3393     3393              
============================================
  Files          1222     1222              
  Lines        186815   186843      +28     
  Branches       3544     3544              
============================================
+ Hits         106365   106369       +4     
- Misses        77078    77102      +24     
  Partials       3372     3372              
Flag Coverage Δ
python 80.98% <19.35%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions
Contributor

Assigning reviewers:

R: @jrmccluskey for label python.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

… temp dir

Add fallback logic when initialization result is EmptySideInput to create a temporary directory instead. This prevents potential issues when the pipeline initialization phase returns an empty collection.
@github-actions github-actions bot added io and removed io labels Oct 16, 2025
@github-actions github-actions bot added the io label Oct 17, 2025
DEFAULT_OPTIONS = [
    ("grpc.keepalive_time_ms", 20000),
    ("grpc.keepalive_timeout_ms", 300000),
    # Default: 20000ms (20s), increased to 10 minutes for stability
Contributor Author

removed grpc.keepalive_time_ms since https://github.com/grpc/grpc/blob/master/doc/keepalive.md#defaults-values

INT_MAX (disabled) on client

Contributor

@sergiitk sergiitk Nov 7, 2025

It's disabled by default, and unless you enable it, all other keepalive settings you have (grpc.keepalive_timeout_ms, grpc.http2.max_pings_without_data, grpc.keepalive_permit_without_calls) have no effect.

Contributor Author

I see. I just added keepalive_time_ms back as before. Since both values are the same as before, I think this should be fine for our case.
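The point of this thread can be sketched in plain Python: gRPC channel options are (name, value) tuples, and the client-side keepalive timer stays off (INT_MAX) unless grpc.keepalive_time_ms is set, so the dependent options only matter when it is present. The `keepalive_enabled` helper below is hypothetical, not part of the SDK:

```python
# Values match the DEFAULT_OPTIONS snippet quoted in this thread.
DEFAULT_OPTIONS = [
    ("grpc.keepalive_time_ms", 20000),  # enables the client keepalive timer
    ("grpc.keepalive_timeout_ms", 300000),  # only meaningful once enabled
    ("grpc.http2.max_pings_without_data", 0),  # only meaningful once enabled
    ("grpc.keepalive_permit_without_calls", 1),  # only meaningful once enabled
]

def keepalive_enabled(options):
    """Hypothetical check: the other keepalive options have no effect unless
    grpc.keepalive_time_ms overrides the INT_MAX (disabled) client default."""
    return any(name == "grpc.keepalive_time_ms" for name, _ in options)

print(keepalive_enabled(DEFAULT_OPTIONS))  # True
# Dropping the option silently disables all keepalive behavior:
trimmed = [o for o in DEFAULT_OPTIONS if o[0] != "grpc.keepalive_time_ms"]
print(keepalive_enabled(trimmed))  # False
```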

@tvalentyn
Contributor

tvalentyn commented Nov 4, 2025

Do we have any periodic messages sent from SDK to runner that would otherwise detect a dead channel?

I tried launching a pipeline, using an SDK with @liferoad 's changes patched, SSHing to the VM and restarting the 'harness' container to simulate the crash. SDK detected Socket closed error, and restarted within a few seconds. Logs:

NOTICE 2025-11-04T22:54:50.975484Z valentyn : TTY=pts/1 ; PWD=/home/valentyn ; USER=root ; COMMAND=/var/lib/toolbox/nerdctl -n k8s.io restart 4a25ec1329e0
...
DEFAULT 2025-11-04T22:54:52.114094879Z raise self
DEFAULT 2025-11-04T22:54:52.114100315Z grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
DEFAULT 2025-11-04T22:54:52.114105833Z status = StatusCode.UNAVAILABLE
DEFAULT 2025-11-04T22:54:52.114111214Z details = "Socket closed"
DEFAULT 2025-11-04T22:54:52.114119261Z debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:12371 {created_time:"2025-11-04T22:54:51.032035467+00:00", grpc_status:14, grpc_message:"Socket closed"}"
DEFAULT 2025-11-04T22:54:52.114145002Z >
DEFAULT 2025-11-04T22:54:52.114150544Z {"stream":"stderr"}
DEFAULT 2025-11-04T22:54:52.114155948Z 2025/11/04 22:54:52 boot.go: error logging message over FnAPI. endpoint localhost:12370 error: EOF message follows
DEFAULT 2025-11-04T22:54:52.114161451Z 2025/11/04 22:54:52 WARN Python (worker sdk-0-0_sibling_1) exited 2 times: exit status 1
DEFAULT 2025-11-04T22:54:52.114167074Z restarting SDK process
...
INFO 2025-11-04T22:55:08.835663318Z Python sdk harness starting.
...
INFO 2025-11-04T22:55:10.050536Z All SDK Harnesses registered!


@scwhittle
Contributor

Do we have any periodic messages sent from SDK to runner that would otherwise detect a dead channel?

I tried launching a pipeline, using an SDK with @liferoad 's changes patched, SSHing to the VM and restarting the 'harness' container to simulate the crash. SDK detected Socket closed error, and restarted within a few seconds. Logs: (quoted in full above)

Thanks Valentyn. Can we clarify the motivation for this better in the PR description? If it is just the perceived overhead of heartbeats, I can't imagine it is much, and that doesn't seem worth the risk of adding additional latency in some cases. If it is to resolve unnecessary failures when we're CPU-pegged, that seems like better motivation, and given the testing it seems safe enough to try.

@liferoad liferoad requested a review from scwhittle November 6, 2025 16:11
Contributor

@sergiitk sergiitk left a comment

The problem with the client side: keepalive pings are disabled, therefore none of the other options apply.

    # Default: 2, set to 0 to allow unlimited ping strikes
    ("grpc.http2.max_ping_strikes", 0),
    # Default: 0 (disabled), enable socket reuse for better handling
    ("grpc.so_reuseport", 1),
Contributor

Great! With this option, you don't need to close the socket for the found port anymore, as you'll be able to bind and serve on it:

# Close sockets only now to avoid the same port being chosen twice
for s in sockets:
  s.close()

You'll need to bind the initial socket with SO_REUSEPORT, ideally with SO_REUSEADDR as well.

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

This approach addresses the race condition where one process finds an unused port and closes the socket, but before it starts listening on that port, the port is acquired by another process, resulting in EADDRINUSE.

By not closing the socket until the server stops listening, you'll prevent other processes from seeing that port as unused.

Note that this only applies to systems where SO_REUSEPORT is supported.
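A minimal sketch of the bind-and-hold approach described above, assuming a POSIX system where SO_REUSEPORT is available (the function name is illustrative, not Beam's actual helper):

```python
import socket

def reserve_port():
    """Bind an ephemeral port with SO_REUSEADDR/SO_REUSEPORT and keep the
    socket open. Holding the socket bound (instead of closing it before the
    gRPC server binds) closes the window in which another process could grab
    the port and cause EADDRINUSE."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if hasattr(socket, "SO_REUSEPORT"):  # not available on all platforms
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("localhost", 0))  # port 0: let the OS pick a free port
    return s, s.getsockname()[1]

sock, port = reserve_port()
# A gRPC server started with ("grpc.so_reuseport", 1) could now bind this
# same port while `sock` stays open, so no other process sees it as unused.
print(port)
sock.close()
```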

Comment on lines 193 to 194
    # Default: 5000ms (5s), increased to 10 minutes for stability
    ("grpc.keepalive_timeout_ms", 600000),
Contributor

@sergiitk sergiitk Nov 7, 2025

As discussed in another thread, this should be OK for your usage.

Note that without setting grpc.keepalive_time_ms in the server channel args, the server will send a keepalive ping every 2 hours.

So in the current setup, the server sends a ping every two hours, then waits 10 minutes for the client to return the ping.
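Back-of-the-envelope arithmetic for that setup (numbers taken from the comment above; this is illustrative, not SDK code):

```python
# Server-side keepalive interval when grpc.keepalive_time_ms is unset: 2 hours.
# This PR sets the server's ping-ack timeout (grpc.keepalive_timeout_ms) to
# 10 minutes.
keepalive_time_ms = 2 * 60 * 60 * 1000  # 2 h default interval
keepalive_timeout_ms = 600_000          # 10 min ack timeout from the PR

# Worst case, a dead peer is noticed one full interval plus one timeout after
# the last successful ping exchange:
worst_case_min = (keepalive_time_ms + keepalive_timeout_ms) / 1000 / 60
print(worst_case_min)  # 130.0 minutes
```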

@liferoad liferoad requested a review from sergiitk November 7, 2025 14:26
Contributor

@scwhittle scwhittle left a comment

channel_factory.py changes look good, deferring to other reviewers for the server

tvalentyn and others added 2 commits November 13, 2025 19:23
Co-authored-by: Sergii Tkachenko <sergiitk@google.com>
@tvalentyn tvalentyn merged commit 9288814 into apache:master Nov 14, 2025
97 of 101 checks passed
@tvalentyn
Contributor

thanks @liferoad ! let's see if it helps with the test flakiness.

6 participants