Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent grid from creating sessions that are about to timeout in queue (corrects issue #11881) #12014

Merged
merged 13 commits into from
May 22, 2023

Conversation

bhecquet
Copy link
Contributor

@bhecquet bhecquet commented May 9, 2023

Description

As stated in issue #11881, when Selenium grid receives more session requests than it can handle, some sessions may be created whereas the client that requests it receives a timeout response.
This bug only shows up when --session-request-timeout option is set

Motivation and Context

On our grid setup, for some reasons, many grid nodes may become unavailable (upgrade, electrical failure, network failure, ...)
In this case the remaining nodes are not able to handle all the incoming session requests. These requests goes to queue, but we observed that at some point, a browser may be created on node and never used (it remains in this state 5 minutes) before session is killed by grid.

This is due to the fact (details and logs can be found in mentionned issue: #11881 (comment))

  • LocalNewSessionQueue waits at most 30 seconds (the configured session request timeout) before sending timeout to client
  • nothing prevents the distributor to create the requested session just before that (e.g: at 29 secs)
  • session creation takes about 2 seconds so when session si created, client has already received the timeout exception and will never use the created session

Correction takes place in LocalNewSessionQueue. When session request timeout expires, we first check if the session is still in queue or not (session not in queue means distributor is about to create it). In the latter case we wait a bit more so that distributor has time to create the session.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • I have read the contributing document.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Remark: added tests are a bit long to run to test timeout behaviour

@diemol diemol added the C-grid label May 11, 2023
@diemol diemol added this to the 4.10 milestone May 11, 2023
Copy link
Member

@diemol diemol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR!

I think you have a good point, where the current logic allows for a window where the timeout is triggered but still the Distributor can create a session that will end up being orphaned.

However, I do not think the approach in the PR is completely correct, or clean at least. This PR is basically waiting 5 seconds more for the session request to resolve.

A nicer approach, I think, is to put the logic you are putting in isRequestInQueueas a filter in timeoutSessions. With that, we timeout only session requests that are present in the queue.

Something around the lines of:

  private void timeoutSessions() {
    Instant now = Instant.now();

    Lock readLock = lock.readLock();
    readLock.lock();
    Set<RequestId> ids;
    try {
      ids =
          requests.entrySet().stream()
              .filter(
                  entry ->
                      queue.stream()
                          .anyMatch(
                              sessionRequest ->
                                  sessionRequest.getRequestId().equals(entry.getKey())))
              .filter(entry -> isTimedOut(now, entry.getValue()))
              .map(Map.Entry::getKey)
              .collect(ImmutableSet.toImmutableSet());
    } finally {
      readLock.unlock();
    }
    ids.forEach(this::failDueToTimeout);
  }

(Might need testing, I just quickly tried this).

What do you think?

@bhecquet
Copy link
Contributor Author

bhecquet commented May 19, 2023

Hello

I think that my PR needed to be improved but, correct me if I'm wrong, the timeoutSessions method is only called once every 10 seconds by default and would not prevent the current problematic behaviour
Maybe we should (Or I misunderstand your exemple) combine both approaches:

  • keep my change in 'addToQueue' method because there is no way to tell the latch not to timeout if session is being created
  • take your suggestion for 'timeoutSessions' method which is way better than increasing the whole timeout

@bhecquet bhecquet requested a review from diemol May 22, 2023 07:31
@bhecquet
Copy link
Contributor Author

I changed the code as suggested

@diemol
Copy link
Member

diemol commented May 22, 2023

Hello

I think that my PR needed to be improved but, correct me if I'm wrong, the timeoutSessions method is only called once every 10 seconds by default and would not prevent the current problematic behaviour Maybe we should (Or I misunderstand your exemple) combine both approaches:

* keep my change in 'addToQueue' method because there is no way to tell the latch not to timeout if session is being created

* take your suggestion for 'timeoutSessions' method which is way better than increasing the whole timeout

I think you are right to point out that we were timing out session requests that were being processed by the Distributor. Now that this PR is taking that into account, it should not happen anymore. Thank you.

@codecov-commenter
Copy link

Codecov Report

Patch and project coverage have no change.

Comparison is base (0f8a922) 54.85% compared to head (9533703) 54.85%.

❗ Current head 9533703 differs from pull request most recent head be65511. Consider uploading reports for the commit be65511 to get more accurate results

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files
@@           Coverage Diff           @@
##            trunk   #12014   +/-   ##
=======================================
  Coverage   54.85%   54.85%           
=======================================
  Files          86       86           
  Lines        5728     5728           
  Branches      233      233           
=======================================
  Hits         3142     3142           
  Misses       2353     2353           
  Partials      233      233           

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@diemol diemol merged commit 9b42f09 into SeleniumHQ:trunk May 22, 2023
@diemol
Copy link
Member

diemol commented May 22, 2023

Oh wait, somehow when I reviewed the change set I only saw the filter added, not the initial code. I will revert the piece that adds the 5 seconds and go with the filter only. If the bug is still present, we can consider the other part of the PR.

@diemol
Copy link
Member

diemol commented May 22, 2023

Well, the tests you added are now failing. So maybe all the code you added is needed 😄

I will leave it as it is, but hopefully you can be around in case bugs are raised.

@bhecquet
Copy link
Contributor Author

Yes, I'll be happy to contribute improve it if necessary

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants