[HBASE-24956] ConnectionManager#locateRegionInMeta waits for user region lock indefinitely. #2322
Conversation
Nice.
Minor nit below that you might want to change. If not, just say so and I will merge. Thanks.
 * Thrown whenever we are not able to get the lock within the specified wait time.
 */
@InterfaceAudience.Public
public class LockTimeoutException extends IOException {
Make this subclass HBaseIOException rather than IOException?
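Presumably the change would look something like the sketch below (the constructor and imports are illustrative, not the exact committed code):

import org.apache.hadoop.hbase.HBaseIOException;
import org.apache.yetus.audience.InterfaceAudience;

/**
 * Thrown whenever we are not able to get the lock within the specified wait time.
 */
@InterfaceAudience.Public
public class LockTimeoutException extends HBaseIOException {
  // HBaseIOException already extends IOException, so existing catch blocks
  // for IOException continue to work unchanged.
  public LockTimeoutException(String message) {
    super(message);
  }
}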
Fixed in latest commit. Please review again.
private void takeUserRegionLock() throws IOException {
  try {
    long waitTime = connectionConfig.getScannerTimeoutPeriod();
    if (!userRegionLock.tryLock(waitTime, TimeUnit.MILLISECONDS)) {
@saintstack Rushabh and I had a quick offline chat about this PR. We were wondering what is the right timeout to use for this lock. In the client code path there are a bunch of timeout configurations depending on the path we take, sometimes layered on top of each other.
Specifically I was wondering if hbase.client.operation.timeout would be the right one to use for this. I understand that we are using the scanner timeout here because the call wraps a scanner with the same timeout. From a client standpoint, though, the scanner is just an implementation detail of locateRegion (the root caller in this case), and that root caller should be wrapped with a general operation timeout rather than a timeout that is specific to the scanner.
Not a big deal, but I was just curious and would like to know your thoughts. Thanks.
We were wondering what is the right timeout to use for this lock.
+1. The scanner timeout is not a good choice here.
@bharathv @infraio Thank you for your feedback.
Specifically I was wondering if hbase.client.operation.timeout would be the right one to use for this.
I don't feel hbase.client.operation.timeout is the right choice here either. This config is meant as the whole end-to-end operation timeout, which includes all layers of retries, and the default value is 20 mins. If we use this timeout then we are not gaining anything.
We can introduce a new config property (something like hbase.client.lock.timeout.period) and default it to something like 10 seconds. That way we don't depend on the existing scanner/operation timeout periods. Let me know what you guys think. Thank you!
This is really difficult to operate already, just look at us discussing it now. Scanner timeout? Operation timeout? Both! Neither! Make another! Let's not introduce another config option.
Good concern @bharathv.
The operation timeout is for the 'whole operation end-to-end' per @shahrs87. Here we are doing a sub-task, so the operation timeout doesn't seem right. The scanner timeout seems good; when it expires the scan will throw and we'll run the finally block anyways?
What would you suggest @infraio ?
I agree w/ @apurtell that the last thing we need is a new timeout; client timeouts are fraught as is.
bq. There is one retry loop here which will retry if exception is not TNFE or retries are not exhausted.
Please check the code. The LockTimeoutException will be thrown out and terminate the loop. It is not in the try...catch code block.
bq. IIUC the code, callable.prepare will never throw LockTimeoutException.
callable.prepare calls locateRegionInMeta inside. So if locateRegionInMeta throws LockTimeoutException, prepare will throw this exception out.
Thanks.
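For readers following along, here is a simplified sketch of the control flow being described. This is illustrative only, not the actual RpcRetryingCallerImpl source; RetryingCallable, maxAttempts, and callTimeout stand in for the real types and fields:

// Sketch: why an exception from prepare() escapes the retry loop.
public <T> T callWithRetriesSketch(RetryingCallable<T> callable, int callTimeout)
    throws IOException {
  for (int tries = 0;; tries++) {
    // prepare() runs outside the try/catch that drives retries, so a
    // LockTimeoutException thrown by locateRegionInMeta propagates out of
    // the loop immediately instead of being retried.
    callable.prepare(tries != 0);
    try {
      return callable.call(callTimeout);
    } catch (IOException e) {
      if (tries >= maxAttempts - 1) {
        throw e; // retries exhausted
      }
      // otherwise: record the failure, back off, and loop around to retry
    }
  }
}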
Please check the code. The LockTimeoutException will be thrown out and terminate the loop. It is not in the try...catch code block.
@infraio you are right. Fixed that in latest commit. Please review again.
@infraio could you please review again?
I left a comment about the notion of introducing yet another configuration setting, to which I am opposed, but the proposed change does not introduce one; it uses the scanner timeout period setting.
if (locations != null && locations.getRegionLocation(replicaId) != null) {
  return locations;
}
// We don't need to check if useCache is enabled or not. Even if useCache is false
On my local system I created the patch for this on top of the 1.3 branch, but while porting the patch to branch-2 I missed applying this hunk.
@saintstack @apurtell Since you guys already +1'd the previous diff, I wanted to bring this change to your attention. Sorry for the confusion.
Cc @bharathv @infraio
Couple of nits but lgtm otherwise.
@@ -117,6 +119,11 @@
this.writeRpcTimeout = conf.getInt(HConstants.HBASE_RPC_WRITE_TIMEOUT_KEY,
    conf.getInt(HConstants.HBASE_RPC_TIMEOUT_KEY, HConstants.DEFAULT_HBASE_RPC_TIMEOUT));

this.scannerTimeoutPeriod = HBaseConfiguration.getInt(conf,
static getInt() is deprecated, switch to conf.getInt()?
Looks like the HConstants.HBASE_REGIONSERVER_LEASE_PERIOD_KEY config property has been deprecated since the 0.96 release, so I removed the deprecated config property altogether.
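A sketch of what the switch might look like, using the non-deprecated client-side constants (an assumption about the final hunk, not a copy of it):

// Read the scanner timeout via the instance method, with the deprecated
// HBASE_REGIONSERVER_LEASE_PERIOD_KEY fallback removed.
this.scannerTimeoutPeriod = conf.getInt(
    HConstants.HBASE_CLIENT_SCANNER_TIMEOUT_PERIOD,
    HConstants.DEFAULT_HBASE_CLIENT_SCANNER_TIMEOUT_PERIOD);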
private void takeUserRegionLock() throws IOException {
  try {
    long waitTime = connectionConfig.getScannerTimeoutPeriod();
    if (!userRegionLock.tryLock(waitTime, TimeUnit.MILLISECONDS)) {
operation timeout is for 'whole operation end-to-end'
In that case IMO the right fix is to extend the operation timeout to all public methods of HTable (may or may not be a part of this change). Looking at the code, we do it for some and skip it for others. One of those methods is locateRegion(), which is the root caller in this case.
My concern was from a user configuration POV: there are 3 or 4 different timeouts a user has to configure (depending on which API they are using) to get proper timeout behavior. Ideally there would be a single end-to-end timeout at the root level, plus other timeouts like the scanner timeout for finer control depending on the need.
Don't mean to block this PR; we can probably implement that as a separate change.
/*
 * Thrown whenever we are not able to get the lock within the specified wait time.
 */
@InterfaceAudience.Public
need to be public?
Actually I don't know. Since this will be thrown back all the way to the client, I thought to make it public. But open for suggestions.
I just realized there is an exact same class LockTimeoutException under org.apache.hadoop.hbase.exceptions, switch to that?
But open for suggestions.
I was hoping that clients would rely on a generic HBaseIOException, and marking this as private would give us more flexibility to remove/update it in the future. But I think it doesn't matter if we switch to the LockTimeoutException I was referring to above.
I just realized there is an exact same class LockTimeoutException under org.apache.hadoop.hbase.exceptions, switch to that?
This class only exists in branch-1 and not in master/branch-2. :(
Maybe when I create a new PR for branch-1, I can re-use the existing exception class. Would that work?
+1 for the overall change, though I am also confused about the scanner vs. operation timeout. If we use the scanner timeout, its purpose is not justified; if the operation timeout, the default value is considerably high. My vote is for the operation timeout, but good to proceed with consensus.
There is one outstanding discussion above IIUC... need @infraio buy-off on this patch. HBASE-24983 is for concerns raised here as a follow-on.
@@ -968,6 +974,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool
  }
}

void takeUserRegionLock() throws IOException {
  try {
    long waitTime = connectionConfig.getScannerTimeoutPeriod();
Seems you have not pushed the latest commit?
@infraio I moved the acquiring of the lock inside the try/catch block; see takeUserRegionLock(); at line 868 of hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java (commit 58cb1cd).
Also added a test case for that.
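A hypothetical shape for such a test; the getUserRegionLock() accessor and the other names below are invented for illustration and are not necessarily what the commit adds:

// Hold the user region lock from another thread, then verify a region
// lookup fails with LockTimeoutException instead of blocking forever.
@Test
public void testUserRegionLockTimeout() throws Exception {
  final ReentrantLock userRegionLock = connection.getUserRegionLock(); // hypothetical accessor
  ExecutorService pool = Executors.newSingleThreadExecutor();
  final CountDownLatch locked = new CountDownLatch(1);
  pool.submit(() -> {
    userRegionLock.lock(); // another thread holds the lock...
    locked.countDown();
    try {
      Thread.sleep(60_000); // ...long enough for the lookup to time out
    } catch (InterruptedException ignored) {
      // interrupted during shutdown
    } finally {
      userRegionLock.unlock();
    }
  });
  locked.await();
  try {
    connection.locateRegion(TABLE_NAME, Bytes.toBytes("row"));
    fail("Expected LockTimeoutException");
  } catch (LockTimeoutException expected) {
    // the lookup gave up after the configured wait instead of hanging
  } finally {
    pool.shutdownNow();
  }
}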
Why move takeUserRegionLock to inside the try/catch block? The right fix is to use the operation timeout...
I am -1 on using the scanner timeout. It is too weird and confusing. The operation timeout is not the best choice either, but it is better. Thanks.
Reread the RpcRetryingCallerImpl#callWithRetries code. If callable.prepare throws the LockTimeoutException, it will not retry. There is a check on the call duration against callTimeout at line 156.
My point here is that your 15-second SLA case is not right. You still meet your SLA even if you use the operation timeout. I didn't mean 'move takeUserRegionLock to the try catch block'. Thanks.
Operation timeout is not the best choice either, but it is better
@infraio In a scan operation there are two waits: one to acquire the lock and another for the rpc to complete, with retries layered on top. The problem we are trying to solve here is which timeout to use for the lock. If we wait for the full operation timeout period and still can't get the lock, no time remains for subsequent attempts. So I am confused: when you suggest using the operation timeout, are you suggesting we wait for the whole operation timeout period while trying to get the lock?
@saintstack Could you please chime in with your inputs? I think we are going back and forth on which timeout to use. Also, I have created https://issues.apache.org/jira/browse/HBASE-24983 to wrap the whole scan operation within the operation timeout, but that is outside the scope of this jira. Thank you! Cc @SukumarMaddineni
If we wait for the full operation timeout period and still can't get the lock, no time remains for subsequent attempts.
Yes. The guarantee is that the operation will fail or succeed within the operation timeout. Having no remaining time to retry, and failing the operation, is acceptable.
are you suggesting we wait for the whole operation timeout period while trying to get the lock
Yes. Use the operation timeout period when waiting for the lock, instead of the scanner timeout used now.
I think we are going back and forth on which timeout to use.
I thought my point has been clear since we started this discussion. I suggested using the operation timeout instead of the scanner timeout. Then you gave me a 15-second SLA example. Then I checked the code: using the operation timeout can meet your SLA requirements too. So why not use the operation timeout?
@infraio Thank you for your comment. Now I understand better what you mean. Let me update the PR by today. Thank you for being so patient with me.
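For the record, the agreed direction looks roughly like the sketch below, assuming ConnectionConfiguration exposes the operation timeout via getOperationTimeout(); the committed code may differ in details:

// Bound the wait on the user region lock by the client operation timeout
// rather than the scanner timeout, failing fast with LockTimeoutException.
private void takeUserRegionLock() throws IOException {
  try {
    long waitTime = connectionConfig.getOperationTimeout();
    if (!userRegionLock.tryLock(waitTime, TimeUnit.MILLISECONDS)) {
      throw new LockTimeoutException("Failed to get user region lock in "
          + waitTime + " ms.");
    }
  } catch (InterruptedException ie) {
    LOG.error("Interrupted while waiting for a lock", ie);
    throw ExceptionUtil.asInterrupt(ie);
  }
}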
+1
Please speak up if anyone has any objections to the current state of the patch, else I'll merge this by EOD today. Thanks.
HBASE-24956 ConnectionManager#locateRegionInMeta waits for user region lock indefinitely. (#2322) Closes #2427 Signed-off-by: Bharath Vissapragada <bharathv@apache.org> Signed-off-by: Andrew Purtell <apurtell@apache.org> Signed-off-by: Viraj Jasani <vjasani@apache.org> Signed-off-by: Michael Stack <stack@apache.org> Signed-off-by: Guanghao Zhang <zghao@apache.org>