Fix race condition in AdaptiveServerSelection and misc fixes #13104

Merged
merged 1 commit into apache:master on May 7, 2024

Conversation

@vvivekiyer (Contributor) commented May 7, 2024

Contains 2 fixes:

1. Adaptive Server Selection - race condition:

A race condition between Jetty threads and Netty threads can set numInFlightRequests to a negative value for a server. This can result in that server being overloaded compared to the others.

It's difficult to reproduce, but the race condition is apparent from reading the code.

The race condition is explained below. Let's say a query is routed to 2 servers, S1 and S2, and the query has a timeout of 1s. The race condition timeline is as follows:
T1: Query is routed to S1 and S2. The ADSS stats will look as follows:
S1 Stats = { numInFlightRequests = 1 } S2 Stats = { numInFlightRequests = 1 }

T2: S1 responds with the results (dataTable). The ADSS stats will be updated to look as follows. Note that this update is made by the Netty thread that receives the response.
S1 Stats = { numInFlightRequests = 0 } S2 Stats = { numInFlightRequests = 1 }

T3: Let's say the query times out. Per the existing code, the Jetty thread will update the ADSS stats for S2 to look as follows:
S1 Stats = { numInFlightRequests = 0 } S2 Stats = { numInFlightRequests = 0 }

T4: Before the Jetty thread removes the QueryResponse object for the request, server S2 could respond, and the corresponding Netty thread would incorrectly update the ADSS stats again to look as follows (see the sketch after the list of fixes below for an illustration):
S1 Stats = { numInFlightRequests = 0 } S2 Stats = { numInFlightRequests = -1 }

2. Updates the client error list to add a few more exceptions.
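
As an illustration of fix 1, here is a minimal sketch of a once-only guard around the in-flight counter. The names (QueryTracker, onQueryRouted, onQueryFinished) are hypothetical and not the actual Pinot AsyncQueryResponse/ADSS code; the idea is simply that whichever thread reaches the guard first (the Netty response handler or the Jetty timeout path) records the completion, so the counter can never be decremented twice and go negative.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not Pinot code: record query completion for a server exactly once.
public class QueryTracker {
  // numInFlightRequests per server, as tracked by the adaptive server selection stats.
  private final ConcurrentMap<String, AtomicInteger> _numInFlightRequests = new ConcurrentHashMap<>();
  // One flag per routed server for the current query; the first thread to flip it wins.
  private final ConcurrentMap<String, AtomicBoolean> _completionRecorded = new ConcurrentHashMap<>();

  // Called when the query is routed to a server (T1 in the timeline above).
  public void onQueryRouted(String server) {
    _numInFlightRequests.computeIfAbsent(server, s -> new AtomicInteger()).incrementAndGet();
    _completionRecorded.put(server, new AtomicBoolean(false));
  }

  // Called by the Netty thread on a response (T2/T4) or by the Jetty thread on a timeout (T3).
  // Without the compareAndSet guard, both threads can decrement and the count can reach -1.
  public void onQueryFinished(String server) {
    AtomicBoolean recorded = _completionRecorded.get(server);
    if (recorded != null && recorded.compareAndSet(false, true)) {
      _numInFlightRequests.get(server).decrementAndGet();
    }
  }
}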

@codecov-commenter commented May 7, 2024

Codecov Report

Attention: Patch coverage is 50.00000%, with 1 line in your changes missing coverage. Please review.

Project coverage is 62.11%. Comparing base (59551e4) to head (b2724d6).
Report is 416 commits behind head on master.

Files | Patch % | Lines
...pache/pinot/core/transport/AsyncQueryResponse.java | 50.00% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13104      +/-   ##
============================================
+ Coverage     61.75%   62.11%   +0.36%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2514      +78     
  Lines        133233   137786    +4553     
  Branches      20636    21319     +683     
============================================
+ Hits          82274    85583    +3309     
- Misses        44911    45787     +876     
- Partials       6048     6416     +368     
Flag | Coverage Δ
custom-integration1 | <0.01% <0.00%> (-0.01%) ⬇️
integration | <0.01% <0.00%> (-0.01%) ⬇️
integration1 | <0.01% <0.00%> (-0.01%) ⬇️
integration2 | 0.00% <0.00%> (ø)
java-11 | 62.11% <50.00%> (+0.40%) ⬆️
java-21 | <0.01% <0.00%> (-61.63%) ⬇️
skip-bytebuffers-false | 62.11% <50.00%> (+0.36%) ⬆️
skip-bytebuffers-true | 0.00% <0.00%> (-27.73%) ⬇️
temurin | 62.11% <50.00%> (+0.36%) ⬆️
unittests | 62.10% <50.00%> (+0.36%) ⬆️
unittests1 | 46.77% <50.00%> (-0.12%) ⬇️
unittests2 | 27.74% <0.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.


@jackjlli (Member) left a comment

LGTM. Thanks for making the hotfix!

@somandal (Contributor) left a comment

lgtm

@@ -152,12 +153,6 @@ void receiveDataTable(ServerRoutingInstance serverRoutingInstance, DataTable dat
ServerResponse response = _responseMap.get(serverRoutingInstance);
response.receiveDataTable(dataTable, responseSize, deserializationTimeMs);

// Record query completion stats immediately after receiving the response from the server instead of waiting
@jasperjiaguo (Contributor) commented May 7, 2024

IIUC, if we remove this discount upon each receiveDataTable and rely only on the one in getFinalResponses, the performance of all fanned-out servers is determined by the slowest one among them. Do you think it would cause inaccuracy where we overestimate the load on some servers, whether or not a timeout happens?

@vvivekiyer (Contributor, Author) replied

Good observation.

I see this resulting in more time taken to warm up/ramp up - that's the reason we had this piece of code earlier. With this approach, we'll be more conservative and avoid overloading servers (because we assume that no server has responded until the last server responds).

Achieving both would be a hairier change, considering the interaction between Netty/Jetty threads. We can revisit this logic depending on the behavior we see in our environment.
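
To make the trade-off above concrete, here is a toy timeline (hypothetical counters only, not the actual AsyncQueryResponse logic) showing that with accounting deferred to the end of the query, a fast server's in-flight count stays elevated until the slowest server responds or the query times out:

import java.util.concurrent.atomic.AtomicInteger;

public class DeferredAccountingTimeline {
  public static void main(String[] args) {
    // Hypothetical counters mirroring the T1-T4 timeline from the PR description.
    AtomicInteger s1InFlight = new AtomicInteger(0);
    AtomicInteger s2InFlight = new AtomicInteger(0);

    // T1: query routed to both servers.
    s1InFlight.incrementAndGet();
    s2InFlight.incrementAndGet();

    // T2: S1 responds early. With accounting deferred to the end of the query,
    // nothing is decremented yet, so S1 still looks busy for this window -- the
    // over-estimation the reviewer is asking about.

    // T3: the query completes (last response or timeout); every routed server is
    // decremented exactly once, so no counter can ever go negative.
    s1InFlight.decrementAndGet();
    s2InFlight.decrementAndGet();

    System.out.println("S1 = " + s1InFlight.get() + ", S2 = " + s2InFlight.get()); // S1 = 0, S2 = 0
  }
}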

@vvivekiyer merged commit 6a73450 into apache:master May 7, 2024
20 checks passed
@jadami10 (Contributor) commented

@vvivekiyer thank you for working on this fix! This actually bit us a month ago (only time in ~2 years), but we restarted the brokers before grabbing the routing stats, so we couldn't root cause.
