-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HBASE-28359 Improve quota RateLimiter synchronization #5683
Conversation
hbase-server/src/main/java/org/apache/hadoop/hbase/quotas/RateLimiter.java
Outdated
Show resolved
Hide resolved
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
return readAvailable; | ||
// at this point we've grabbed some quota, so we should use at least that | ||
return Math.max(readAvailable, readConsumed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this change we've experienced erroneous MultiActionResultTooLarge when riding close to quotas. As readAvailable
can be negative, this can even affect otherwise irrelevant operation types. For example, riding close to a read quota and entering negative read availability would cause multiPuts, which do no reading, to throw MultiActionResultTooLarge. The logic for throwing a MultiActionResultTooLarge can be found here. Worse, this exception is retried immediately, so this can snowball into a DDOS of your RPC layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we move this Math.max to caller? Theoretically readAvailable could be negative, the caller should decide how to deal with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I wonder whether it's worth having a separate method, OperationQuota#getMaxResultSize
, which encapsulates this logic and is not hiding business logic in what appears to be a simple getter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep this here. I have three reasons:
- There are only 2 callers to this method, and both would need the Math.max() added
- We don't currently expose readConsumed
- Logically, I think it makes sense to consider readAvailable in this way for all cases. I could even see it being
Math.max(readAvailable + readConsumed, readConsumed)
. This is because of the nuanced quota logic in checkQuota. Details on this one below:
In checkQuota, it iterates all limiters twice. Once to check for available (or throw throttle exception). Once it gets through all limiters, it loops them again to grab the quota. This loop can over consume the quota. At this point we've grabbed the quota and there's no easy way to return it. So we might as well consider this as "available" for us.
The quotas are complicated because there can be many limiters and checkQuota is not synchronized. It'd be more accurate to synchronize the entire checkQuota, but I think it's better for performance to let it be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then better rename the method, or at least add more comments to describe the logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry Bryan, I didn't see your comment until I pushed a change. I think the only difference though would be to remove the readAvailable and readConsumed getters, and move the getMaxResultSize logic out of the interface. Do you think that's worth doing, or do you like the changeset in its current form?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I prefer the original name, which has the javadoc:
/** Returns the number of bytes available to read to avoid exceeding the quota *
To me "read available" makes sense to be "how much did we grab, plus how much is left". But I can see that might be subjective. So I don't have a strong opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see both sides. I think it's ok for us to be more specific with the naming here because, like you said, the current usage scope is small enough for us to be specific
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
return readAvailable; | ||
// at this point we've grabbed some quota, so we should use at least that | ||
return Math.max(readAvailable, readConsumed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we move this Math.max to caller? Theoretically readAvailable could be negative, the caller should decide how to deal with this.
hbase-server/src/main/java/org/apache/hadoop/hbase/quotas/RateLimiter.java
Outdated
Show resolved
Hide resolved
hbase-server/src/main/java/org/apache/hadoop/hbase/quotas/RateLimiter.java
Outdated
Show resolved
Hide resolved
hbase-server/src/main/java/org/apache/hadoop/hbase/quotas/RateLimiter.java
Outdated
Show resolved
Hide resolved
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
Signed-off-by: Bryan Beaudreault <bbeaudreault@apache.org> Signed-off-by: Duo Zhang <zhangduo@apache.org>
Signed-off-by: Bryan Beaudreault <bbeaudreault@apache.org> Signed-off-by: Duo Zhang <zhangduo@apache.org>
Signed-off-by: Bryan Beaudreault <bbeaudreault@apache.org> Signed-off-by: Duo Zhang <zhangduo@apache.org>
Signed-off-by: Bryan Beaudreault <bbeaudreault@apache.org> Signed-off-by: Duo Zhang <zhangduo@apache.org>
We've been experiencing RpcThrottlingException with 0ms waitInterval. This seems odd and wasteful, since the client side will immediately retry without backoff. I think the problem is related to the synchronization of RateLimiter.
The TimeBasedLimiter checkQuota method does the following:
Both canExecute and waitInterval are synchronized, but we're calling them independently. So it's possible under high concurrency for canExecute to return false, but then waitInterval returns 0 (would have been true).
This simplifies the API by having canExecute return the waitInterval, it being greater than 0 if we should throttle the client.
This also implicitly fixes a bug with request number quotas — we were returning a waitInterval assuming a single resource consumption, regardless of the resources consumed by the operation:
hbase/hbase-server/src/main/java/org/apache/hadoop/hbase/quotas/TimeBasedLimiter.java
Lines 144 to 146 in 9c8c9e7
I'm marking this as a draft while I deploy this onto a test cluster to confirm that it resolves our 0ms throttle backoffs
cc @hgromer @eab148 @bozzkar