Conversation

@andrross
Member

@andrross andrross commented Apr 18, 2025

testResizeQueueDown() had a race condition: depending on random
parameters, it could submit up to 1002 tasks into an executor with a
queue size of 900. If the tasks didn't execute fast enough, a rejected
execution exception could occur and fail the test. The fix is to resize
down to a queue size of 1500 instead, which ensures there is enough
capacity even if all tasks are submitted before any can be executed.
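
To make the capacity arithmetic concrete, here is a minimal sketch that uses only `java.util.concurrent`, not the OpenSearch executor under test. The class name `QueueCapacityDemo`, the helper `countRejections`, and the single worker thread are assumptions for illustration; the real test picks its parameters randomly. The worker is held on a latch so that every task is submitted before any can finish, which is the worst case described above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; it does not use QueueResizableOpenSearchThreadPoolExecutor.
class QueueCapacityDemo {

    /** Submits taskCount tasks while the only worker is blocked and returns how many were rejected. */
    static int countRejections(int queueCapacity, int taskCount) {
        // One worker thread (an assumption) and a bounded queue of the given capacity.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(queueCapacity));
        CountDownLatch release = new CountDownLatch(1);
        // Every task waits on the latch, so nothing completes during submission.
        Runnable blockedTask = () -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        int rejected = 0;
        try {
            for (int i = 0; i < taskCount; i++) {
                try {
                    executor.execute(blockedTask);
                } catch (RejectedExecutionException e) {
                    // The default AbortPolicy throws once the queue is full.
                    rejected++;
                }
            }
        } finally {
            // Unblock the tasks and shut down, even if something above failed.
            release.countDown();
            executor.shutdown();
            try {
                executor.awaitTermination(10, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("capacity 900:  " + countRejections(900, 1002) + " rejected");
        System.out.println("capacity 1500: " + countRejections(1500, 1002) + " rejected");
    }
}
```

With one blocked worker, a queue of 900 accepts 901 of the 1002 tasks (one running, 900 queued) and rejects the other 101, while a queue of 1500 rejects none.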

I also refactored the tests to reduce code duplication and ensure the
executor is shut down properly even when a test fails. This avoids the
spurious thread leak failure that occurs when a test case exits because
of a failure.
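
The shutdown guarantee can be sketched with a try/finally wrapper. This is an illustration under assumptions, not the refactoring in this PR: `ExecutorCleanupDemo`, `runWithExecutor`, and `terminatedAfterFailure` are hypothetical names, and a plain fixed thread pool stands in for the executor under test.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical demo class; the names here are not taken from the OpenSearch test code.
class ExecutorCleanupDemo {

    /** Runs the test body against a fresh executor and always shuts the executor down. */
    static void runWithExecutor(Consumer<ExecutorService> testBody) {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            testBody.accept(executor);
        } finally {
            // Runs whether the body returned normally or threw an assertion failure.
            executor.shutdownNow();
            try {
                executor.awaitTermination(10, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Simulates a failing test and reports whether its executor was terminated anyway. */
    static boolean terminatedAfterFailure() {
        ExecutorService[] seen = new ExecutorService[1];
        try {
            runWithExecutor(executor -> {
                seen[0] = executor;
                executor.execute(() -> {});
                throw new AssertionError("simulated test failure");
            });
        } catch (AssertionError expected) {
            // The failure still propagates to the caller; only the cleanup is guaranteed.
        }
        return seen[0].isTerminated();
    }

    public static void main(String[] args) {
        System.out.println("terminated after failure: " + terminatedAfterFailure());
    }
}
```

Because the cleanup lives in `finally`, a failing assertion still propagates, but no worker threads are left behind for a leak detector to report on top of the real failure.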

Related Issues

Resolves #14297

Check List

  • Functionality includes testing.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

❌ Gradle check result for 7e471e1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❌ Gradle check result for a6990d9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

✅ Gradle check result for a6990d9: SUCCESS

@codecov

codecov bot commented Apr 23, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.50%. Comparing base (c5e55b0) to head (7ba1155).
Report is 93 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18006      +/-   ##
============================================
- Coverage     72.52%   72.50%   -0.02%     
+ Complexity    67163    67138      -25     
============================================
  Files          5473     5473              
  Lines        310092   310094       +2     
  Branches      45060    45061       +1     
============================================
- Hits         224899   224840      -59     
- Misses        66814    66891      +77     
+ Partials      18379    18363      -16     


Member

@ashking94 ashking94 left a comment


> There was a race condition in testResizeQueueDown() where depending on random parameters we could submit up to 1002 tasks into an executor with a queue size of 900. That introduced a race condition where if the tasks didn't execute fast enough then a rejected execution exception could happen and fail the test.

I am unable to see what fixed the flakiness in testResizeQueueDown. Could you point to the change?

@andrross
Member Author

> > There was a race condition in testResizeQueueDown() where depending on random parameters we could submit up to 1002 tasks into an executor with a queue size of 900. That introduced a race condition where if the tasks didn't execute fast enough then a rejected execution exception could happen and fail the test.
>
> I am unable to see what fixed the flakiness in testResizeQueueDown. Could you point to the change?

Instead of resizing down to 900, the test now resizes down to 1500, which guarantees that the executor has enough capacity to accept every task even if all of them are submitted before any can be executed.

@andrross andrross force-pushed the fix-QueueResizableOpenSearchThreadPoolExecutorTests branch from a6990d9 to 210949d Compare April 24, 2025 16:35
@github-actions
Contributor

❌ Gradle check result for 210949d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Andrew Ross <andrross@amazon.com>
@andrross andrross force-pushed the fix-QueueResizableOpenSearchThreadPoolExecutorTests branch from 210949d to 7ba1155 Compare April 24, 2025 20:29
@github-actions
Contributor

❌ Gradle check result for 7ba1155: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❌ Gradle check result for 7ba1155: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❌ Gradle check result for 7ba1155: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

✅ Gradle check result for 7ba1155: SUCCESS

@andrross
Member Author

@ashking94 Can you take another look at this? Thanks!

@andrross andrross dismissed ashking94’s stale review May 17, 2025 00:16

Addressed concerns. No response after a few weeks.

@andrross andrross merged commit a82336b into opensearch-project:main May 17, 2025
30 checks passed
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in Cluster Manager Project Board May 17, 2025
@andrross andrross deleted the fix-QueueResizableOpenSearchThreadPoolExecutorTests branch May 17, 2025 00:16
tandonks pushed a commit to tandonks/OpenSearch that referenced this pull request Jun 1, 2025
…ct#18006)

neuenfeldttj pushed a commit to neuenfeldttj/OpenSearch that referenced this pull request Jun 26, 2025
…ct#18006)

Signed-off-by: Andrew Ross <andrross@amazon.com>
Signed-off-by: TJ Neuenfeldt <tjneu@amazon.com>
neuenfeldttj pushed a commit to neuenfeldttj/OpenSearch that referenced this pull request Jun 26, 2025
…ct#18006)

Labels

  • autocut
  • Cluster Manager
  • flaky-test: Random test failure that succeeds on second run
  • Other
  • skip-changelog
  • >test-failure: Test failure from CI, local build, etc.

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for QueueResizableOpenSearchThreadPoolExecutorTests

4 participants