Fix QueueResizableOpenSearchThreadPoolExecutorTests #18006

andrross · 2025-04-18T17:56:13Z

testResizeQueueDown() had a race condition: depending on random
parameters, it could submit up to 1002 tasks to an executor with a
queue size of 900. If the tasks didn't execute fast enough, a rejected
execution exception could occur and fail the test. The fix is to resize
down to a queue size of 1500 instead, which ensures there is enough
capacity even if all tasks are submitted before any can be executed.
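
To illustrate the failure mode, here is a minimal sketch using a plain `java.util.concurrent.ThreadPoolExecutor` with a bounded queue as a stand-in for the resizable executor under test. The class name `QueueCapacityDemo`, the single worker thread, and the latch that holds every task are assumptions chosen to make the worst case (all tasks submitted before any finishes) deterministic. It is not the code from the test itself.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class QueueCapacityDemo {

    /**
     * Submits taskCount tasks that all block until released, so none can
     * complete while submission is in progress. Returns the number of
     * submissions rejected because the bounded queue was full.
     */
    static int countRejections(int queueCapacity, int taskCount) {
        CountDownLatch release = new CountDownLatch(1);
        // One worker thread (assumption): it picks up the first task directly,
        // and every later task has to wait in the bounded queue.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(queueCapacity));
        int rejected = 0;
        try {
            for (int i = 0; i < taskCount; i++) {
                try {
                    executor.execute(() -> {
                        try {
                            release.await(); // simulate tasks that have not run yet
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // the default AbortPolicy throws when the queue is full
                }
            }
        } finally {
            release.countDown(); // let the queued tasks drain
            executor.shutdown();
            try {
                executor.awaitTermination(10, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("capacity 900:  " + countRejections(900, 1002) + " rejected");
        System.out.println("capacity 1500: " + countRejections(1500, 1002) + " rejected");
    }
}
```

In this sketch, a capacity of 900 rejects 101 of the 1002 submissions (one task is held by the worker and 900 fit in the queue), while a capacity of 1500 rejects none. In the real test the tasks normally drain concurrently, which is why the failure only appeared intermittently.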

I also refactored the tests to reduce code duplication and to ensure
the executor is shut down properly even when a test fails. This avoids
the spurious thread leak failure that occurs when a test case exits
because of a failure.
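
The shutdown guarantee can be sketched as a shared helper that wraps the test body in try/finally. The names `ExecutorTestSupport` and `runWithExecutor` are hypothetical and a plain JDK `ThreadPoolExecutor` stands in for the executor under test. The actual refactoring in this PR may be structured differently.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class ExecutorTestSupport {

    /**
     * Hypothetical helper: creates an executor, runs the test body against it,
     * and always shuts the executor down, even if the body throws.
     */
    static void runWithExecutor(int queueCapacity, Consumer<ThreadPoolExecutor> testBody) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(queueCapacity));
        try {
            testBody.accept(executor);
        } finally {
            // Runs on success and on failure, so no worker thread outlives the test.
            executor.shutdown();
            try {
                if (!executor.awaitTermination(10, TimeUnit.SECONDS)) {
                    executor.shutdownNow(); // give up waiting and interrupt workers
                }
            } catch (InterruptedException e) {
                executor.shutdownNow();
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        try {
            runWithExecutor(10, executor -> {
                executor.execute(() -> {});
                throw new IllegalStateException("simulated assertion failure");
            });
        } catch (IllegalStateException e) {
            System.out.println("test body failed, executor was still shut down");
        }
    }
}
```

Because the exception from the test body propagates after the finally block completes, the original failure is still reported, but without an additional thread leak error on top of it.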

Related Issues

Resolves #14297

Check List

Functionality includes testing.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions · 2025-04-18T18:12:26Z

❌ Gradle check result for 7e471e1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-04-23T01:20:05Z

❌ Gradle check result for a6990d9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-04-23T03:55:05Z

✅ Gradle check result for a6990d9: SUCCESS

codecov · 2025-04-23T03:55:32Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.50%. Comparing base (c5e55b0) to head (7ba1155).
Report is 93 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #18006      +/-   ##
============================================
- Coverage     72.52%   72.50%   -0.02%     
+ Complexity    67163    67138      -25     
============================================
  Files          5473     5473              
  Lines        310092   310094       +2     
  Branches      45060    45061       +1     
============================================
- Hits         224899   224840      -59     
- Misses        66814    66891      +77     
+ Partials      18379    18363      -16

ashking94

There was a race condition in testResizeQueueDown() where depending on random parameters we could submit up to 1002 tasks into an executor with a queue size of 900. That introduced a race condition where if the tasks didn't execute fast enough then a rejected execution exception could happen and fail the test.

I am unable to see what fixed the flakiness in testResizeQueueDown?

...a/org/opensearch/common/util/concurrent/QueueResizableOpenSearchThreadPoolExecutorTests.java

andrross · 2025-04-24T15:06:12Z

Instead of resizing down to 900, it now resizes down to 1500, which guarantees that the executor has enough capacity to not reject anything even if all tasks are submitted before any are able to execute.

github-actions · 2025-04-24T16:50:40Z

❌ Gradle check result for 210949d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

There was a race condition in testResizeQueueDown() where depending on random parameters we could submit up to 1002 tasks into an executor with a queue size of 900. That introduced a race condition where if the tasks didn't execute fast enough then a rejected execution exception could happen and fail the test. The fix is to resize down to a queue size of 1500 to ensure there is enough capacity even if all tasks are submitted before any can be executed.

And finally I refactored the tests to reduce duplication of code and ensure the executor gets shutdown properly even in case of a test failure. This will avoid the spurious thread leak failure if a test case exits because of a failure.

Signed-off-by: Andrew Ross <andrross@amazon.com>

github-actions · 2025-04-24T21:33:37Z

❌ Gradle check result for 7ba1155: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-04-24T23:14:26Z

❌ Gradle check result for 7ba1155: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-04-28T21:19:27Z

❌ Gradle check result for 7ba1155: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-04-28T23:50:12Z

✅ Gradle check result for 7ba1155: SUCCESS

andrross · 2025-04-29T00:37:23Z

@ashking94 Can you take another look at this? Thanks!

Addressed concerns. No response after a few weeks.

…ct#18006) There was a race condition in testResizeQueueDown() where depending on random parameters we could submit up to 1002 tasks into an executor with a queue size of 900. That introduced a race condition where if the tasks didn't execute fast enough then a rejected execution exception could happen and fail the test. The fix is to resize down to a queue size of 1500 to ensure there is enough capacity even if all tasks are submitted before any can be executed.

And finally I refactored the tests to reduce duplication of code and ensure the executor gets shutdown properly even in case of a test failure. This will avoid the spurious thread leak failure if a test case exits because of a failure.

Signed-off-by: Andrew Ross <andrross@amazon.com>
Signed-off-by: TJ Neuenfeldt <tjneu@amazon.com>

andrross added the skip-changelog label Apr 18, 2025

github-actions bot added >test-failure Test failure from CI, local build, etc. autocut Cluster Manager flaky-test Random test failure that succeeds on second run Other labels Apr 18, 2025

github-project-automation bot added this to Cluster Manager Project Board Apr 18, 2025

opensearch-ci-bot mentioned this pull request Apr 23, 2025

[AUTOCUT] Gradle Check Flaky Test Report for MinimumClusterManagerNodesIT #14289

Open

ashking94 previously requested changes Apr 24, 2025

github-project-automation bot moved this to 👀 In review in Cluster Manager Project Board Apr 24, 2025

andrross force-pushed the fix-QueueResizableOpenSearchThreadPoolExecutorTests branch from a6990d9 to 210949d Compare April 24, 2025 16:35

andrross force-pushed the fix-QueueResizableOpenSearchThreadPoolExecutorTests branch from 210949d to 7ba1155 Compare April 24, 2025 20:29

opensearch-ci-bot mentioned this pull request Apr 25, 2025

[AUTOCUT] Gradle Check Flaky Test Report for S3BlobContainerRetriesTests #17551

Open

This was referenced Apr 28, 2025

[AUTOCUT] Gradle Check Flaky Test Report for RecoveryWhileUnderLoadIT #14509

Open

[AUTOCUT] Gradle Check Flaky Test Report for DedicatedClusterSnapshotRestoreIT #15806

Open

opensearch-ci-bot mentioned this pull request May 9, 2025

[AUTOCUT] Gradle Check Flaky Test Report for CloneSnapshotIT #16115

Open

mch2 approved these changes May 17, 2025

andrross merged commit a82336b into opensearch-project:main May 17, 2025
30 checks passed

github-project-automation bot moved this from 👀 In review to ✅ Done in Cluster Manager Project Board May 17, 2025

andrross deleted the fix-QueueResizableOpenSearchThreadPoolExecutorTests branch May 17, 2025 00:16

opensearch-ci-bot mentioned this pull request Apr 30, 2025

[AUTOCUT] Gradle Check Flaky Test Report for SharedClusterSnapshotRestoreIT #15845

Open

opensearch-ci-bot mentioned this pull request Nov 20, 2025

[AUTOCUT] Gradle Check Flaky Test Report for CloneSnapshotIT #20058

Open
