-
Notifications
You must be signed in to change notification settings - Fork 4.5k
Reenable prism as default #35621
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reenable prism as default #35621
Conversation
|
Running the timing out tests locally, I got: This isn't showing up in CI because stdout is only logged when a suite fails, but we're timing out before that can happen. |
|
And iin a different test (which runs first) I see the following (looks like same root cause): |
|
Run Yaml_Xlang_Direct PreCommit |
|
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment |
|
R: @shunping |
|
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment |
shunping
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
|
||
| IdGenerator idGenerator = IdGenerators.decrementingLongs(); | ||
| ShortIdMap metricsShortIds = new ShortIdMap(); | ||
| ExecutorService executorService = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why this is needed now. Does this suggest current ScheduledExecutorServiceFactory not working as expected?
beam/sdks/java/core/src/main/java/org/apache/beam/sdk/options/ExecutorOptions.java
Line 46 in 7c2b57e
| class ScheduledExecutorServiceFactory implements DefaultValueFactory<ScheduledExecutorService> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for calling this out - was in the PR description originally, and then I lost it when I updated to include information for folks coming from CHANGES. From the original description:
This fixes a thread safety issue which was causing tests to time out and fail.
The race condition comes from spinning up Java environments - when we do this today (before this PR), we:
- Get a singleton ExecutorService object -
options.as(ExecutorOptions.class).getScheduledExecutorService(); - Execute a pipeline
- Shutdown the ExecutorService -
executorService.shutdown();
When this is all handled in different processes (as the current FnApiRunner implementation does), its not a problem since the ExecutorService doesn't need to outlive the pipeline. In multi-pipeline processes, however, the same ExecutorService object gets reused and teardown can get called while it is executing or before it starts. This leads to a long execution and an eventual timeout (see comments in PR below for details on error thrown).
This PR fixes the issue by creating a new ExecutorService object per worker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see #35621 (comment), could be due to main() get called multiple times, and the Executor embedded in pipeline option already shutdown
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh racing condition comment. This attracted attention at first glance because there was a thread leak bug due to Executor created every time spin up a runner - #32272 . As long as shutdown is called when Executor no longer used I think it's fine. Thanks again for the details!
* Reenable prism as default * Fix race condition * Spotless * Portable test * Fix a few new test issues * Update CHANGES
Changes the default Python runner to be prism (when prism is supported).
There are a bunch of use cases automatically excluded, but this also required updating a bunch of tests. The 2 main breaking changes this introduces are:
Instead of raising Python-specific exceptions, when Prism jobs fail a RuntimeError is always raised. To fix any tests relying on the old error, change your exception catching logic to look for broader errors and use regexes to verify the contents of the error. There are lots of examples of this in this PR.
Prism names its metrics differently than the local runner. For this PR in Beam, I just kept all of those on the old FnApiRunner for now, but you could also just change the metric name you're looking for (this will be the eventual fix for these tests in this repo).
If this change breaks you in unexpected ways, please:
Part of #34549
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>instead.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.