@akashorabek
Collaborator

Fixes: #34397 and #34396
Successful PostCommit Go example - https://github.com/akashorabek/beam/actions/runs/14200817248
Successful PostCommit Go Dataflow ARM example - https://github.com/akashorabek/beam/actions/runs/14200833343


@akashorabek
Collaborator Author

akashorabek commented Apr 1, 2025

After investigating the issue, it turns out that when a failure occurs due to OOM, the worker sometimes shuts down immediately and never reaches the code in boot.go that generates the dump file. Adding timeouts before and after reading the file, preallocating additional memory in boot.go, and using parameters like dumpHeapOnOom and saveHeapDumpsToGcsPath didn’t help. I temporarily disabled this test so that the PostCommit Go Dataflow ARM and PostCommit Go workflows pass, and created a separate issue for further investigation.
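
For readers unfamiliar with temporarily disabling a flaky test, here is a minimal sketch of one common approach: a list of name patterns that a test consults before running. It is not the actual change in this PR. The names `skipPatterns`, `shouldSkip`, and `TestOomHeapDump` are hypothetical and used only for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// skipPatterns is a hypothetical list of test-name patterns to disable.
// Each entry should point at its tracking issue so it can be re-enabled later.
var skipPatterns = []string{
	// Hypothetical test name: flaky because the worker can be killed on OOM
	// before the heap dump file is written.
	"TestOomHeapDump.*",
}

// shouldSkip reports whether testName fully matches any of the patterns.
func shouldSkip(testName string, patterns []string) bool {
	for _, p := range patterns {
		// Anchor the pattern so that it must match the whole test name.
		re, err := regexp.Compile("^(?:" + p + ")$")
		if err != nil {
			continue // ignore malformed patterns instead of failing the run
		}
		if re.MatchString(testName) {
			return true
		}
	}
	return false
}

func main() {
	for _, name := range []string{"TestOomHeapDump", "TestWordCount"} {
		fmt.Printf("%s skip=%v\n", name, shouldSkip(name, skipPatterns))
	}
}
```

Inside a real test, the check would typically sit at the top of the test function, for example `if shouldSkip(t.Name(), skipPatterns) { t.Skip("disabled, see tracking issue") }`, so the test is reported as skipped rather than silently removed.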

@akashorabek akashorabek marked this pull request as draft April 1, 2025 20:10
@akashorabek akashorabek marked this pull request as ready for review April 1, 2025 20:23
@akashorabek akashorabek requested review from damccorm and lostluck and removed request for lostluck April 1, 2025 20:23
@github-actions
Contributor

github-actions bot commented Apr 1, 2025

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @lostluck for label go.
R: @Abacn for label build.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@damccorm
Contributor

damccorm commented Apr 1, 2025

Thanks for looking into this. How often does this fail? We can merge this, but I'm curious about the impact, i.e. how often the test does or doesn't work.

Contributor

@damccorm damccorm left a comment

Thanks!

@damccorm damccorm merged commit 984e875 into apache:master Apr 1, 2025
5 checks passed
@akashorabek
Collaborator Author

These workflows fail around 60-70% of the time due to this error. Interestingly, the failures started occurring around March 19–20, and rolling back to previous PRs didn’t help, so it’s possible that some changes on the GCP side might be the cause.

liferoad pushed a commit to liferoad/beam that referenced this pull request Apr 4, 2025

Development

Successfully merging this pull request may close these issues.

The PostCommit Go job is flaky
