-
Notifications
You must be signed in to change notification settings - Fork 4.5k
optimize grpc settings #36528
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
optimize grpc settings #36528
Conversation
Adjust GRPC channel settings to reduce ping frequency and allow more flexible keepalive behavior. This improves performance by reducing unnecessary network traffic while maintaining connection stability.
Summary of ChangesHello @liferoad, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request optimizes GRPC channel settings within the Apache Beam Python SDK. The primary goal is to enhance performance and maintain connection stability by reducing the frequency of keepalive pings and adjusting various ping-related parameters. These changes are expected to minimize network overhead without compromising the reliability of GRPC communication. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment |
…tion stability Add various grpc keepalive and ping-related options to prevent connection drops during long-running operations. The new settings help maintain active connections and detect failures faster.
Increase grpc.keepalive_time_ms from 30s to 60s and grpc.http2.min_sent_ping_interval_without_data_ms from 10s to 30s to reduce network overhead and improve performance
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #36528 +/- ##
============================================
- Coverage 56.93% 56.92% -0.01%
Complexity 3393 3393
============================================
Files 1222 1222
Lines 186815 186843 +28
Branches 3544 3544
============================================
+ Hits 106365 106369 +4
- Misses 77078 77102 +24
Partials 3372 3372
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Assigning reviewers: R: @jrmccluskey for label python. Note: If you would like to opt out of this review, comment Available commands:
The PR bot will only process comments in the main thread (not review comments). |
… temp dir Add fallback logic when initialization result is EmptySideInput to create a temporary directory instead. This prevents potential issues when the pipeline initialization phase returns an empty collection.
| DEFAULT_OPTIONS = [ | ||
| ("grpc.keepalive_time_ms", 20000), | ||
| ("grpc.keepalive_timeout_ms", 300000), | ||
| # Default: 20000ms (20s), increased to 10 minutes for stability |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed grpc.keepalive_time_ms since https://github.com/grpc/grpc/blob/master/doc/keepalive.md#defaults-values
INT_MAX (disabled) on client
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's disabled by default, and unless you enable it, all other keepalive settings you have (grpc.keepalive_timeout_ms, grpc.http2.max_pings_without_data, grpc.keepalive_permit_without_calls) have no effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I just added keepalive_time_ms back as before. Since both values are same as before, I think it should be fine for our case.
I tried launching a pipeline, using an SDK with @liferoad 's changes patched, SSHing to the VM and restarting the 'harness' container to simulate the crash. SDK detected |
Thanks Valentyn. Can we clarify the motivation for this in the PR better? If it is just perceived overhead of heartbeats, I can't imagine it is much and this doesn't seem worth risk of adding that additional latency in some cases. If it is to resolve unnecessary failures when we're CPU pegged that seems like better motivation and given the testing seems safe enough to try. |
sergiitk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with the client side: keepalive pings are disabled, therefore non of other options apply.
| # Default: 2, set to 0 to allow unlimited ping strikes | ||
| ("grpc.http2.max_ping_strikes", 0), | ||
| # Default: 0 (disabled), enable socket reuse for better handling | ||
| ("grpc.so_reuseport", 1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! With this option, you don't need to close the socket for the found port anymore, as you'll be able to bind and serve on it:
beam/sdks/python/apache_beam/utils/subprocess_server.py
Lines 610 to 612 in eba04b2
| # Close sockets only now to avoid the same port to be chosen twice | |
| for s in sockets: | |
| s.close() |
You'll need to bind the initial socket with SO_REUSEPORT, ideally with SO_REUSEADDR as well.
| s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) |
This approach addresses the race condition where an unused port was found by one process, closes the socket, but before this process starts listening on the found port, it's acquired by another process, resulting in EADDRINUSE.
By not closing the socket until the server stops listening, you'll prevent other processes from seeing that port as unused.
Note that this only applies to systems where SO_REUSEPORT is supported.
| DEFAULT_OPTIONS = [ | ||
| ("grpc.keepalive_time_ms", 20000), | ||
| ("grpc.keepalive_timeout_ms", 300000), | ||
| # Default: 20000ms (20s), increased to 10 minutes for stability |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's disabled by default, and unless you enable it, all other keepalive settings you have (grpc.keepalive_timeout_ms, grpc.http2.max_pings_without_data, grpc.keepalive_permit_without_calls) have no effect.
| # Default: 5000ms (5s), increased to 10 minutes for stability | ||
| ("grpc.keepalive_timeout_ms", 600000), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed in another thread, this should be ok for your usage.
Note that without setting grpc.keepalive_time_ms in server channel ars, the server will send a keepalive ping every 2 hours.
So in the current setup, the server sends a ping every two hours, then waits for 10 minutes for client to return the ping
scwhittle
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
channel_factory.py changes look good, deferring to other reviewers for the server
Co-authored-by: Sergii Tkachenko <sergiitk@google.com>
|
thanks @liferoad ! let's see if it helps with the test flakiness. |
Adjust GRPC channel settings to reduce ping frequency and allow more flexible keepalive behavior. This improves performance by reducing unnecessary network traffic while maintaining connection stability.
Ref: https://github.com/grpc/grpc/blob/master/doc/keepalive.md
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>instead.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.