[HBASE-24956] ConnectionManager#locateRegionInMeta waits for user region lock indefinitely. #2322
Changes from 5 commits
3a795db
2e72cff
94f2de8
541b33b
d184c89
f3210c8
4cd6f95
b7c7114
57d680a
58cb1cd
37e306b
be6cda6
@@ -863,13 +863,15 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool
       }
       // Query the meta region
       long pauseBase = this.pause;
-      userRegionLock.lock();
+      takeUserRegionLock();
       try {
-        if (useCache) {// re-check cache after get lock
-          RegionLocations locations = getCachedLocation(tableName, row);
-          if (locations != null && locations.getRegionLocation(replicaId) != null) {
-            return locations;
-          }
+        // We don't need to check if useCache is enabled or not. Even if useCache is false

On my local system I created the patch for this on top of the 1.3 branch, but while porting it to branch-2 I missed applying this hunk.

+        // we already cleared the cache for this row before acquiring userRegion lock so if this
+        // row is present in cache that means some other thread has populated it while we were
+        // waiting to acquire user region lock.
+        RegionLocations locations = getCachedLocation(tableName, row);
+        if (locations != null && locations.getRegionLocation(replicaId) != null) {
+          return locations;
         }
         if (relocateMeta) {
           relocateRegion(TableName.META_TABLE_NAME, HConstants.EMPTY_START_ROW,
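The re-check described in the new code comment can be illustrated outside HBase. The sketch below is a minimal, self-contained example that assumes a plain `ConcurrentHashMap` as the cache and a single `ReentrantLock` guarding the expensive lookup; `RecheckCacheSketch`, `locate`, and `expensiveLookup` are hypothetical names and not part of the HBase client.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the client's location cache; not HBase code.
class RecheckCacheSketch {
  private final Map<String, String> cache = new ConcurrentHashMap<>();
  private final ReentrantLock lookupLock = new ReentrantLock();
  // Counts how many times the expensive lookup actually ran.
  final AtomicInteger lookups = new AtomicInteger();

  String locate(String row) {
    String cached = cache.get(row);
    if (cached != null) {
      return cached;
    }
    lookupLock.lock();
    try {
      // Re-check: another thread may have filled the entry while we waited.
      cached = cache.get(row);
      if (cached != null) {
        return cached;
      }
      String location = expensiveLookup(row);
      cache.put(row, location);
      return location;
    } finally {
      lookupLock.unlock();
    }
  }

  // Placeholder for the meta scan; returns a made-up server name.
  private String expensiveLookup(String row) {
    lookups.incrementAndGet();
    return "server-for-" + row;
  }
}
```

The second cache read inside the lock is what keeps threads that queued behind the first caller from repeating the lookup.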
@@ -892,7 +894,7 @@ rpcControllerFactory, getMetaLookupPool(), metaReplicaCallTimeoutScanInMicroSeco
           }
           tableNotFound = false;
           // convert the row result into the HRegionLocation we need!
-          RegionLocations locations = MetaTableAccessor.getRegionLocations(regionInfoRow);
+          locations = MetaTableAccessor.getRegionLocations(regionInfoRow);
           if (locations == null || locations.getRegionLocation(replicaId) == null) {
             throw new IOException("RegionInfo null in " + tableName + ", row=" + regionInfoRow);
           }
@@ -968,6 +970,19 @@ rpcControllerFactory, getMetaLookupPool(), metaReplicaCallTimeoutScanInMicroSeco
       }
     }

+  private void takeUserRegionLock() throws IOException {
+    try {
+      long waitTime = connectionConfig.getScannerTimeoutPeriod();

It seems the latest commit was not pushed?

@infraio I moved the acquiring of the lock inside the try/catch block (hbase/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java, line 868 in 58cb1cd). I also added a test case for that.

Why move takeUserRegionLock inside the "try catch block"? The right fix is to use the operation timeout...

I am -1 on using the scanner timeout. It is too weird and confusing. The operation timeout is not the best choice either, but it is better. Thanks.

My point here is that your 15-second SLA case is not right. You still meet your SLA even if you use the operation timeout. I didn't mean "move takeUserRegionLock into the try/catch block". Thanks.

@infraio A scan involves two waits: one for the lock and one for the RPC to complete. On top of that we have retries. The problem we are trying to solve here is which timeout to use for the lock. If we wait for the full operation timeout and still can't get the lock, no time remains for further attempts. So I am confused: when you suggest using the operation timeout, are you suggesting that we wait for the whole operation timeout period while trying to get the lock?

@saintstack Could you please chime in with your input? I think we are going back and forth on which timeout to use. I have also created https://issues.apache.org/jira/browse/HBASE-24983 to wrap the whole scan operation within the operation timeout, but that is outside the scope of this JIRA. Thank you! Cc @SukumarMaddineni

Yes. The guarantee is that the operation will fail or succeed within the "operation timeout". Having no time left to retry and failing the operation is acceptable.
Yes. Use the operation timeout period when waiting for the lock, instead of the scanner timeout used now.
I thought my point was clear from the start of this discussion. I suggested using the operation timeout instead of the scanner timeout. Then you gave me a 15-second SLA example. Then I checked the code: using the operation timeout can meet your SLA requirements, too. So why not use the operation timeout?

@infraio Thank you for your comment. Now I understand better what you mean. Let me update the PR by today. Thank you for being so patient with me.

+      if (!userRegionLock.tryLock(waitTime, TimeUnit.MILLISECONDS)) {

@saintstack Rushabh and I had a quick offline chat about this PR. We were wondering what the right timeout for this lock is. In the client code path there are a bunch of timeout configurations depending on the path we take, sometimes layered on top of each other. Specifically, I was wondering if hbase.client.operation.timeout would be the right one to use here. I understand that we are using the scanner timeout because the call wraps a scanner with the same timeout. From a client standpoint, though, the scanner is just an implementation detail of locateRegion (the root caller in this case), and that root caller should be wrapped with a general operation timeout rather than a timeout that is specific to the scanner. Not a big deal, but I was curious and would like to know your thoughts. Thanks.

+1. The scanner timeout is not a good choice here.

@bharathv @infraio Thank you for your feedback.
I don't feel hbase.client.operation.timeout is the right choice here either. This config is meant for the whole end-to-end operation timeout, which includes all layers of retries, and the default value is 20 mins. If we use this timeout, we are not gaining anything. We could introduce a new config property (something like hbase.client.lock.timeout.period) and default it to something like 10 seconds. That way we don't depend on the existing scanner/operation timeout periods. Let me know what you guys think. Thank you!

This is really difficult to operate already, just look at us discussing it now. Scanner timeout? Operation timeout? Both! Neither! Make another! Let's not introduce another config option.

Good concern, @bharathv. Operation timeout is for the 'whole operation end-to-end', per @shahrs87. Here we are doing a sub-task, so operation timeout doesn't seem right. Scanner timeout seems good; when it expires the scan will throw and we'll run the finally block anyway? What would you suggest, @infraio? I agree with @apurtell that the last thing we need is a new timeout; client timeout is fraught as is.

bq. There is one retry loop here which will retry if exception is not TNFE or retries are not exhausted. Thanks.

@infraio you are right. Fixed that in the latest commit. Please review again.

@infraio could you please review again?

+        throw new LockTimeoutException("Failed to get user region lock in"
+          + waitTime + " ms. " + " for accessing meta region server.");
+      }
+    } catch (InterruptedException ie) {
+      LOG.error("Interrupted while waiting for a lock", ie);
+      throw ExceptionUtil.asInterrupt(ie);
+    }
+  }
+
   /**
    * Put a newly discovered HRegionLocation into the cache.
    * @param tableName The table name.
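For readers unfamiliar with bounded lock acquisition, the pattern in `takeUserRegionLock` can be tried in isolation. The following is a rough sketch using only the JDK; `TimedLockSketch`, `LockTimeout`, `take`, and `contendedTakeTimesOut` are hypothetical names, and the interrupt handling (restoring the flag and throwing `InterruptedIOException`) is an assumption chosen for the example rather than a description of what `ExceptionUtil.asInterrupt` does.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; class and method names are not from the HBase code base.
class TimedLockSketch {
  // Local stand-in for the PR's LockTimeoutException.
  static class LockTimeout extends IOException {
    LockTimeout(String message) {
      super(message);
    }
  }

  // Waits at most waitMillis for the lock, then gives up with an exception.
  static void take(ReentrantLock lock, long waitMillis) throws IOException {
    try {
      if (!lock.tryLock(waitMillis, TimeUnit.MILLISECONDS)) {
        throw new LockTimeout("Failed to get lock in " + waitMillis + " ms");
      }
    } catch (InterruptedException ie) {
      // Restore the interrupt flag so callers can still observe it.
      Thread.currentThread().interrupt();
      InterruptedIOException iioe =
          new InterruptedIOException("Interrupted while waiting for a lock");
      iioe.initCause(ie);
      throw iioe;
    }
  }

  // Returns true if a second thread times out while this thread holds the lock.
  static boolean contendedTakeTimesOut(long waitMillis) {
    ReentrantLock lock = new ReentrantLock();
    boolean[] timedOut = new boolean[1];
    lock.lock();
    try {
      Thread waiter = new Thread(() -> {
        try {
          take(lock, waitMillis);
          lock.unlock();
        } catch (LockTimeout e) {
          timedOut[0] = true;
        } catch (IOException e) {
          // Interrupted while waiting: leave timedOut as false.
        }
      });
      waiter.start();
      waiter.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      lock.unlock();
    }
    return timedOut[0];
  }
}
```

With a plain `lock()` the waiting thread in `contendedTakeTimesOut` would block until the holder released the lock; with `tryLock` it fails after roughly `waitMillis`, which is the behavior the PR title asks for.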
@@ -0,0 +1,32 @@
+/**
+ *
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.client;
+
+import org.apache.hadoop.hbase.HBaseIOException;
+import org.apache.yetus.audience.InterfaceAudience;
+
+/*
+  Thrown whenever we are not able to get the lock within the specified wait time.
+ */
+@InterfaceAudience.Public

Does this need to be public?

Actually, I don't know. Since this will be thrown all the way back to the client, I thought to make it public. But I am open to suggestions.

I just realized there is an identical class, LockTimeoutException, under org.apache.hadoop.hbase.exceptions. Switch to that?
I was hoping that clients would rely on a generic HBaseIOException; marking this as private would give us more flexibility to remove or update it in the future. But I think it doesn't matter if we switch to the LockTimeoutException I referred to above.

That class only exists in branch-1 and not in master/branch-2. :(

Maybe when I create a new PR for branch-1, I can reuse the existing exception class. Would that work?

+public class LockTimeoutException extends HBaseIOException {
+  public LockTimeoutException(String message) {
+    super(message);
+  }
+}
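The visibility question in the review thread comes down to what callers need to catch. As a small illustration, the sketch below uses local stand-in classes (`BaseIOException` playing the role of `HBaseIOException`, `LockTimeout` playing the role of `LockTimeoutException`; all names here are hypothetical) to show that a caller written against `IOException` still handles the timeout without referring to the specific type.

```java
import java.io.IOException;

// Hypothetical stand-ins; the real classes live in HBase and are not reproduced here.
class ExceptionHandlingSketch {
  // Plays the role of HBaseIOException.
  static class BaseIOException extends IOException {
    BaseIOException(String message) {
      super(message);
    }
  }

  // Plays the role of LockTimeoutException.
  static class LockTimeout extends BaseIOException {
    LockTimeout(String message) {
      super(message);
    }
  }

  // Simulates a region lookup that fails because the lock wait expired.
  static void locate() throws IOException {
    throw new LockTimeout("Failed to get user region lock in 100 ms");
  }

  // A caller that only knows about IOException still handles the timeout.
  static String callAndDescribe() {
    try {
      locate();
      return "located";
    } catch (IOException e) {
      return e.getClass().getSimpleName() + ": " + e.getMessage();
    }
  }
}
```

Callers that want to react to the timeout specifically would need the concrete type to be visible to them, which is the trade-off the reviewers are weighing.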
static getInt() is deprecated, switch to conf.getInt()?
Looks like HConstants.HBASE_REGIONSERVER_LEASE_PERIOD_KEY config property is deprecated after 0.96 release. So remove the deprecated config property altogether.