use direct executor to deflake tests #33187

Merged Nov 26, 2024 (2 commits)

Conversation

@m-trieu m-trieu commented Nov 21, 2024

MoreExecutors.directExecutor() and MoreExecutors.newDirectExecutorService() run every task on the calling thread (without offloading to another thread for async work), so calls to submit and execute block until the submitted task returns (i.e., until Runnable.run() completes).

Use this in the test implementations of ChannelCache and FanOutStreamingEngineWorkerHarness to prevent threads from waiting on each other. The old implementation seems to work locally, but has become increasingly flaky in the test runner environment.

Flakiness is referenced in #28957
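
The direct-executor behavior described above can be illustrated with a minimal sketch. It assumes a plain `Runnable::run` lambda as a stand-in for Guava's `MoreExecutors.directExecutor()` so that the example needs only the JDK; the class and method names (`DirectExecutorSketch`, `threadThatRan`) are hypothetical and do not appear in the PR.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicReference;

public class DirectExecutorSketch {
  // Stand-in (assumption) for MoreExecutors.directExecutor(): run the task inline.
  static final Executor DIRECT = Runnable::run;

  // Submits a task and returns the name of the thread that ran it.
  static String threadThatRan(Executor executor) {
    AtomicReference<String> ranOn = new AtomicReference<>();
    executor.execute(() -> ranOn.set(Thread.currentThread().getName()));
    // With a direct executor the task has already finished by this point,
    // so the result can be read without any waiting or synchronization.
    return ranOn.get();
  }

  public static void main(String[] args) {
    String caller = Thread.currentThread().getName();
    // The task ran on the same thread that called execute().
    System.out.println(caller.equals(threadThatRan(DIRECT)));
  }
}
```

With a thread-pool executor, `ranOn.get()` could still be null when `threadThatRan` returns, which is the kind of timing dependence the PR description is trying to remove from the tests.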


codecov bot commented Nov 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.93%. Comparing base (a06454a) to head (48a048e).
Report is 19 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##             master   #33187   +/-   ##
=========================================
  Coverage     58.93%   58.93%           
  Complexity     3112     3112           
=========================================
  Files          1133     1133           
  Lines        174989   174989           
  Branches       3343     3343           
=========================================
  Hits         103136   103136           
  Misses        68508    68508           
  Partials       3345     3345           

m-trieu commented Nov 26, 2024

R: @Abacn

@Abacn Abacn left a comment

Thanks for the fix. I think we need to understand whether the test failure was purely a testing issue or could also happen in production.

getDataMetricTracker);
getDataMetricTracker,
// Run the workerMetadataConsumer on the direct calling thread to make testing more
// deterministic.
Contributor

"to make testing more deterministic" gives the impression that the change just fixes tests; however, the test code path then diverges from the real one.

Please explain in this comment why the race observed in the test does not affect production, for future reference.

If this could indeed happen in production, then we should fix the code.

@m-trieu m-trieu Nov 26, 2024

In prod, the executor hands the task off from a thread that may perform network IO; we do not want the task to block that thread, since the task acquires a lock to do its work. The hand-off is not needed in testing, where the task can logically be called inline.

Added comment.
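
The reply above describes a common pattern: the executor is injected, so production can hand work off to another thread while tests run it inline. Below is a minimal sketch of that pattern, not the Beam code itself. `MetadataHandler`, `forProduction`, and `forTesting` are hypothetical names, and `Runnable::run` again stands in for `MoreExecutors.directExecutor()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;

public class InjectedExecutorSketch {
  // Hypothetical handler whose consumer takes a lock to do its work.
  static final class MetadataHandler {
    private final Executor consumerExecutor;
    private final Object lock = new Object();
    private final List<String> consumed = new ArrayList<>();

    MetadataHandler(Executor consumerExecutor) {
      this.consumerExecutor = consumerExecutor;
    }

    // Called from the thread that receives metadata (which may be doing network IO).
    void onNewMetadata(String metadata) {
      consumerExecutor.execute(() -> consume(metadata));
    }

    private void consume(String metadata) {
      synchronized (lock) {
        consumed.add(metadata);
      }
    }

    List<String> consumed() {
      synchronized (lock) {
        return new ArrayList<>(consumed);
      }
    }
  }

  // Production: pass a real executor so the receiving thread never waits on the lock.
  static MetadataHandler forProduction(Executor handOffExecutor) {
    return new MetadataHandler(handOffExecutor);
  }

  // Tests: run the consumer inline so assertions can follow the call directly.
  static MetadataHandler forTesting() {
    return new MetadataHandler(Runnable::run);
  }

  public static void main(String[] args) {
    MetadataHandler handler = forTesting();
    handler.onNewMetadata("endpoints-v1");
    System.out.println(handler.consumed());
  }
}
```

The consumer logic is the same in both configurations; only the thread it runs on differs, which is the argument made in the reply for why the test path does not hide a production race.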

@@ -85,7 +86,9 @@ static ChannelCache forTesting(
notification -> {
shutdownChannel(notification.getValue());
onChannelShutdown.run();
});
},
// Run the removal on the calling thread for better determinism in tests.
Contributor

same here

Contributor Author

Added. This doesn't change any behavior; we just want the removal to run synchronously so we don't have to rely on waiting in tests.
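
To illustrate what "rely on waiting in tests" means, here is a hedged sketch contrasting the two setups. `TinyCache` is a hypothetical stand-in for a cache that runs a removal callback on an executor; it is not the ChannelCache implementation.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class RemovalListenerSketch {
  // Hypothetical cache that runs its removal callback on the given executor.
  static final class TinyCache {
    private final Executor listenerExecutor;
    private final Runnable onShutdown;

    TinyCache(Executor listenerExecutor, Runnable onShutdown) {
      this.listenerExecutor = listenerExecutor;
      this.onShutdown = onShutdown;
    }

    void remove() {
      listenerExecutor.execute(onShutdown);
    }
  }

  // Checks the flag immediately after remove(). This is only guaranteed to
  // return true when the executor runs the callback on the calling thread.
  static boolean shutdownObservedImmediately(Executor executor) {
    AtomicBoolean shutdown = new AtomicBoolean(false);
    new TinyCache(executor, () -> shutdown.set(true)).remove();
    return shutdown.get();
  }

  // With an asynchronous executor the test has to wait for the callback,
  // here with a latch and a timeout.
  static boolean shutdownObservedAfterWaiting() {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      CountDownLatch done = new CountDownLatch(1);
      new TinyCache(pool, done::countDown).remove();
      return done.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(shutdownObservedImmediately(Runnable::run));
    System.out.println(shutdownObservedAfterWaiting());
  }
}
```

The latch-based variant depends on the callback being scheduled within the timeout, which is the sort of assumption that tends to break on a heavily loaded test runner.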

Abacn commented Nov 26, 2024

> The old implementation seems to work locally but in the test runner environment has increased in flakiness.

We've seen similar scenarios in other tests. CI/CD machines are often busier, with heavier CPU and thread pressure, which arguably makes them resemble production workers more closely.

m-trieu commented Nov 26, 2024

> Thanks for the fix. I think we need to understand the test failure was purely testing issue or could also happen in production.

This has only shown up in these test suites (we haven't run into it in load testing). I wonder if it's because the threads are waiting to be scheduled while resources are consumed by other tests that are executing.

m-trieu commented Nov 26, 2024

done, back to you @Abacn thanks!

@Abacn Abacn left a comment

Thank you!

@Abacn Abacn merged commit 720b824 into apache:master Nov 26, 2024
17 checks passed