Fix race condition in AdaptiveServerSelection and misc fixes #13104
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #13104 +/- ##
============================================
+ Coverage 61.75% 62.11% +0.36%
+ Complexity 207 198 -9
============================================
Files 2436 2514 +78
Lines 133233 137786 +4553
Branches 20636 21319 +683
============================================
+ Hits 82274 85583 +3309
- Misses 44911 45787 +876
- Partials 6048 6416 +368
LGTM. Thanks for making the hotfix!
lgtm
@@ -152,12 +153,6 @@ void receiveDataTable(ServerRoutingInstance serverRoutingInstance, DataTable dat
ServerResponse response = _responseMap.get(serverRoutingInstance);
response.receiveDataTable(dataTable, responseSize, deserializationTimeMs);

// Record query completion stats immediately after receiving the response from the server instead of waiting
IIUC, if we remove this discount upon each receiveDataTable and rely only on the one in getFinalResponses, the performance of all fan-out servers is determined by the slowest one among them. Do you think it would cause inaccuracy, where we overestimate the load on some servers, whether or not a timeout happens?
Good observation.
I see this resulting in more time taken to warm up/ramp up, which is why we had this piece of code earlier. With this approach, we'll be more conservative about not overloading servers (because we assume that no server has responded until the last server responds).
Achieving both would be a hairier change, considering the interaction between netty and jetty. We can revisit this logic depending on the behavior we see in our environment.
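To make the trade-off in this thread concrete, here is a minimal sketch, not Pinot's actual code. It assumes the stats are decremented in exactly one place (the role this thread attributes to getFinalResponses) and that the response path no longer touches them. The class `DeferredCompletionSketch` and its methods are hypothetical stand-ins.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the broker-side stats; names are illustrative only.
class DeferredCompletionSketch {
  private final Map<String, Integer> numInFlightRequests = new HashMap<>();

  // Query is routed: every target server gets one in-flight request.
  void submit(List<String> servers) {
    for (String server : servers) {
      numInFlightRequests.merge(server, 1, Integer::sum);
    }
  }

  // Response path: assumed to leave the stats untouched after the fix.
  void receiveDataTable(String server) {
    // Intentionally no stats update here.
  }

  // Single decrement point, run once per query (all responses in, or timeout).
  void finalizeQuery(List<String> servers) {
    for (String server : servers) {
      numInFlightRequests.merge(server, -1, Integer::sum);
    }
  }

  int inFlight(String server) {
    return numInFlightRequests.getOrDefault(server, 0);
  }

  // Returns {S1 count right after S1 responds, final S1 count, final S2 count}.
  static int[] replay() {
    List<String> servers = Arrays.asList("S1", "S2");
    DeferredCompletionSketch stats = new DeferredCompletionSketch();
    stats.submit(servers);
    stats.receiveDataTable("S1");           // S1 responds early
    int s1AfterResponse = stats.inFlight("S1");
    stats.finalizeQuery(servers);           // query times out or completes
    stats.receiveDataTable("S2");           // late response from S2
    return new int[] {s1AfterResponse, stats.inFlight("S1"), stats.inFlight("S2")};
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(replay()));
  }
}
```

In this sketch S1 is still counted as in flight after it has responded, which is the conservative over-estimate raised in the review, while a late response from S2 can no longer push its count below zero.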
@vvivekiyer thank you for working on this fix! This actually bit us a month ago (the only time in ~2 years), but we restarted the brokers before grabbing the routing stats, so we couldn't root-cause it.
Contains 2 fixes:
1. Adaptive Server Selection - race condition:
A race condition between jetty threads and netty threads can set numInFlightRequests to a negative value for a server. That server then appears less loaded than it is, so it can end up overloaded compared to the others.
It's difficult to reproduce, but the race condition is evident from reading the code.
The race condition is explained below.
Let's say a query is routed to 2 servers, S1 and S2, with a timeout of 1s. The race condition timeline is as follows:
T1: Query is routed to S1 and S2. The ADSS stats will look as follows:
S1 Stats = { numInFlightRequests = 1 } S2 Stats = { numInFlightRequests = 1 }
T2: S1 responds with the results (dataTable). The ADSS stats will be updated to look as follows. Note that this update is by the netty thread that receives the response.
S1 Stats = { numInFlightRequests = 0 } S2 Stats = { numInFlightRequests = 1 }
T3: Let's say the query times out. The jetty thread will update the ADSS stats for S2, as the code currently does, to look as follows:
S1 Stats = { numInFlightRequests = 0 } S2 Stats = { numInFlightRequests = 0 }
T4: Before the jetty thread removes the QueryResponse object for the request, server S2 could respond, and the corresponding netty thread would update the ADSS stats incorrectly to look as follows:
S1 Stats = { numInFlightRequests = 0 } S2 Stats = { numInFlightRequests = -1 }
2. Updates client error list to add a few more exceptions.
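The T1 to T4 timeline from fix 1 can be replayed deterministically with a small sketch. This is not Pinot's code: `InFlightRaceSketch` and its methods are hypothetical, and the two threads are modelled as sequential calls in the interleaving described above, so that the double decrement for S2 is easy to see.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the per-server ADSS stats; names are illustrative only.
class InFlightRaceSketch {
  private final Map<String, Integer> numInFlightRequests = new HashMap<>();

  void onQuerySubmitted(String server) {
    numInFlightRequests.merge(server, 1, Integer::sum);
  }

  // In the timeline, both the response path (netty) and the timeout path (jetty)
  // end up calling an update like this for the same server.
  void onQueryCompleted(String server) {
    numInFlightRequests.merge(server, -1, Integer::sum);
  }

  int inFlight(String server) {
    return numInFlightRequests.getOrDefault(server, 0);
  }

  // Replays T1..T4 in order and returns the final count for S2.
  static int replayTimeline() {
    InFlightRaceSketch stats = new InFlightRaceSketch();
    stats.onQuerySubmitted("S1"); // T1: query routed to S1
    stats.onQuerySubmitted("S2"); // T1: query routed to S2
    stats.onQueryCompleted("S1"); // T2: netty thread, S1 responded
    stats.onQueryCompleted("S2"); // T3: jetty thread, query timed out
    stats.onQueryCompleted("S2"); // T4: netty thread, late response from S2
    return stats.inFlight("S2");
  }

  public static void main(String[] args) {
    System.out.println("S2 numInFlightRequests = " + replayTimeline());
  }
}
```

The second decrement for S2 is what drives the count negative; any fix has to ensure each routed request is decremented at most once.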