Skip to content

Comments

[grid] Distributor retry session when RemoteNode executor shutting down#17109

Merged
VietND96 merged 2 commits intotrunkfrom
retry-session
Feb 18, 2026
Merged

[grid] Distributor retry session when RemoteNode executor shutting down#17109
VietND96 merged 2 commits intotrunkfrom
retry-session

Conversation

@VietND96
Copy link
Member

🔗 Related Issues

Fixes #17044 - intermittently fails to create new sessions (HTTP 500): SequentialScheduler task rejected / ThreadPoolExecutor “Shutting down”

💥 What does this PR do?

When a Node restarts or drains while a newSession request is in flight, the JDK HTTP client on the Hub throws a RejectedExecutionException (executor shutting down), which propagates as:

UncheckedIOException ← JdkHttpClient.execute0()
Caused by: IOException ← HttpClientImpl.send()
Caused by: RejectedExecutionException ← ThreadPoolExecutor$AbortPolicy

This propagated uncaught through RemoteNode.newSession() and was wrapped by LocalDistributor.startSession() into a SessionNotCreatedException with the cryptic internal JDK message surfaced directly to the client:

Could not start a new session. java.io.IOException: Task
jdk.internal.net.http.common.SequentialScheduler$SchedulableTask@5affbf0
rejected from java.util.concurrent.ThreadPoolExecutor@60c48c82
[Shutting down, pool size = 2, active threads = 2, queued tasks = 0, completed tasks = 84]

Update RemoteNode.newSession() now catches UncheckedIOException and walks the cause chain. If a RejectedExecutionException is found, it returns RetrySessionRequestException with a clear message - signalling the
distributor to retry on a healthy node instead of immediately failing the client.

🔧 Implementation Notes

💡 Additional Considerations

🔄 Types of changes

  • Cleanup (formatting, renaming)
  • Bug fix (backwards compatible)
  • New feature (non-breaking change which adds functionality and tests!)
  • Breaking change (fix or feature that would cause existing functionality to change)

Signed-off-by: Viet Nguyen Duc <nguyenducviet4496@gmail.com>
Copilot AI review requested due to automatic review settings February 18, 2026 17:51
@selenium-ci selenium-ci added B-grid Everything grid and server related C-java Java Bindings labels Feb 18, 2026
@qodo-code-review
Copy link
Contributor

PR Type

Bug fix


Description

  • Catch UncheckedIOException in RemoteNode.newSession() and detect RejectedExecutionException

  • Return RetrySessionRequestException when node executor is shutting down

  • Allows distributor to retry on healthy node instead of failing client

  • Add comprehensive regression tests for executor shutdown scenario


File Walkthrough

Relevant files
Bug fix
RemoteNode.java
Handle executor shutdown in newSession exception handling

java/src/org/openqa/selenium/grid/node/remote/RemoteNode.java

  • Added import for RejectedExecutionException
  • Added catch block for UncheckedIOException in newSession() method
  • Walks cause chain to detect RejectedExecutionException from JDK HTTP
    client
  • Returns RetrySessionRequestException with clear message when executor
    is shutting down
  • Re-throws unrelated UncheckedIOExceptions unchanged
+19/-0   
Tests
RemoteNodeTest.java
Add regression tests for executor shutdown handling           

java/test/org/openqa/selenium/grid/node/RemoteNodeTest.java

  • New test file with comprehensive regression tests for RemoteNode
  • Test verifies exact exception chain from production traces
    (UncheckedIOException → IOException → RejectedExecutionException)
  • Test confirms conversion to RetrySessionRequestException for
    distributor retry
  • Test ensures unrelated UncheckedIOExceptions are not silently
    swallowed
  • Includes helper methods for creating RemoteNode with mock HttpClient
+153/-0 

@qodo-code-review
Copy link
Contributor

qodo-code-review bot commented Feb 18, 2026

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
User-facing details: The new RetrySessionRequestException message includes externalUri and node state hints
which may be surfaced to end users and could reveal internal infrastructure details
depending on how this exception is propagated.

Referred Code
return Either.left(
    new RetrySessionRequestException(
        "Node "
            + externalUri
            + " rejected the execution (possibly restarting or shutting down)",
        e));

Learn more about managing compliance generic rules or creating your own custom rules

  • Update
Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review
Copy link
Contributor

qodo-code-review bot commented Feb 18, 2026

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Learned
best practice
Remove raw types and casts

Type the Either result and avoid casting result.left() to Exception by assigning
it to a properly typed variable before assertions.

java/test/org/openqa/selenium/grid/node/RemoteNodeTest.java [87-93]

-Either<?, ?> result = node.newSession(request);
+Either<org.openqa.selenium.WebDriverException, org.openqa.selenium.grid.data.CreateSessionResponse> result =
+    node.newSession(request);
 
 assertThat(result.isLeft()).isTrue();
-assertThat(result.left()).isInstanceOf(RetrySessionRequestException.class);
-assertThat(((Exception) result.left()).getMessage())
+org.openqa.selenium.WebDriverException error = result.left();
+assertThat(error).isInstanceOf(RetrySessionRequestException.class);
+assertThat(error.getMessage())
     .contains(NODE_URI.toString())
     .contains("restarting or shutting down");
  • Apply / Chat
Suggestion importance[1-10]: 5

__

Why:
Relevant best practice - Avoid raw generics and unsafe casts in tests; use concrete types for compile-time safety and clarity.

Low
General
Simplify exception cause chain traversal
Suggestion Impact:The while-loop traversal of the exception cause chain was converted to the suggested for-loop form, removing the manual cause update.

code diff:

-      Throwable cause = e;
-      while (cause != null) {
+      for (Throwable cause = e; cause != null; cause = cause.getCause()) {
         if (cause instanceof RejectedExecutionException) {
           return Either.left(
               new RetrySessionRequestException(
@@ -167,7 +166,6 @@
                       + " rejected the execution (possibly restarting or shutting down)",
                   e));
         }
-        cause = cause.getCause();
       }

Replace the while loop used for traversing the exception cause chain with an
equivalent for loop for conciseness.

java/src/org/openqa/selenium/grid/node/remote/RemoteNode.java [160-172]

-Throwable cause = e;
-while (cause != null) {
+for (Throwable cause = e; cause != null; cause = cause.getCause()) {
   if (cause instanceof RejectedExecutionException) {
     return Either.left(
         new RetrySessionRequestException(
             "Node "
                 + externalUri
                 + " rejected the execution (possibly restarting or shutting down)",
             e));
   }
-  cause = cause.getCause();
 }
 throw e;

[Suggestion processed]

Suggestion importance[1-10]: 2

__

Why: This is a minor stylistic suggestion that replaces a while loop with a for loop. The logic remains identical, and the readability improvement is subjective and minimal.

Low
  • Update

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes an intermittent session creation failure (#17044) that occurs when a Grid Node restarts or drains while a session request is in flight. The JDK HTTP client throws a RejectedExecutionException wrapped in nested exceptions, which was previously surfacing to clients as a cryptic error. The fix detects this specific scenario and signals the distributor to retry on a healthy node instead.

Changes:

  • Added exception handling in RemoteNode.newSession() to walk the cause chain and detect RejectedExecutionException
  • When detected, returns RetrySessionRequestException to signal the distributor to retry
  • Added comprehensive test coverage for both the fix scenario and to ensure other IOExceptions are not incorrectly caught

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
java/src/org/openqa/selenium/grid/node/remote/RemoteNode.java Added UncheckedIOException catch block that walks exception cause chain to detect RejectedExecutionException and convert to RetrySessionRequestException
java/test/org/openqa/selenium/grid/node/RemoteNodeTest.java Added new test class with two tests: one validating the fix converts executor shutdown to retry, and one ensuring other IOExceptions are not affected

Signed-off-by: Viet Nguyen Duc <nguyenducviet4496@gmail.com>
@VietND96 VietND96 merged commit 527a40b into trunk Feb 18, 2026
53 of 54 checks passed
@VietND96 VietND96 deleted the retry-session branch February 18, 2026 18:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

B-grid Everything grid and server related C-java Java Bindings Review effort 2/5

Projects

None yet

2 participants