-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HBASE-28146: Make ServerManager rsAdmins map thread safe #5461
Conversation
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
This problem only affects 2.x? Master and branch-3 are safe? |
It looks like master was fixed via your commit 5d872d3. Should we make a similar change here? Otherwise if we intend to just fix this bug here, then looking at ServerManager.getRsAdmin, it is not thread safe. We should probably use computeIfAbsent if the get returns null. |
Ah, OK, we change to use async connection for master and branch-3... |
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
Outdated
Show resolved
Hide resolved
try { | ||
return this.connection.getAdmin(server); | ||
} catch (IOException e) { | ||
throw new RuntimeException(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should throw this IOException out? This is a problem of computeIfAbsent, as the functional interface does not throw IOException...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As written, it seems we need some synchronization. A concurrent map is pointless if you do unsyncrhonized writes on it. You could use ConcurrentMapUtils.computeIfAbsentEx to get around the exception issue.
That said, looking at the impls of getRSRpcServices and getAdmin, I don't think we really need to cache these at all. getRSRpcServices directly returns an instance variable. getAdmin already does it's own caching.
So it might make sense to simply remove the rsAdmins map. @Apache9 what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
computeIfAbsentEx
seems like a good idea here. I don't have an informed opinion regarding whether the caching is really necessary, but I'm happy to rip it out if we think that's worth doing now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, checked the code, as @bbeaudreault mentioned, we already have a cache in ConnectionImplementation
,
hbase/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java
Line 1434 in 18a38ae
return (AdminProtos.AdminService.BlockingInterface) computeIfAbsentEx(stubs, key, () -> { |
So I think we can just remove the cache here.
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
admin = this.connection.getAdmin(sn); | ||
} | ||
this.rsAdmins.put(sn, admin); | ||
LOG.debug("New admin connection to " + sn.toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you change this to use a {}
, and just pass in sn
instead of sn.toString()
? Not sure how useful the log is now, we could remove it maybe, but at the least we should avoid the minor bad logging practice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice idea. agreed it could be useless, but I want to keep this log in so that we've got some ability to investigate if the removal of caching causes more work than we've anticipated here
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
… safe (apache#5461) Co-authored-by: Ray Mattingly <rmattingly@hubspot.com> Signed-off-by: Duo Zhang <zhangduo@apache.org> Signed-off-by: Bryan Beaudreault <bbeaudreault@apache.org>
Co-authored-by: Ray Mattingly <rmattingly@hubspot.com> Signed-off-by: Duo Zhang <zhangduo@apache.org> Signed-off-by: Bryan Beaudreault <bbeaudreault@apache.org> (cherry picked from commit 1641a4a)
Co-authored-by: Ray Mattingly <rmattingly@hubspot.com> Signed-off-by: Duo Zhang <zhangduo@apache.org> Signed-off-by: Bryan Beaudreault <bbeaudreault@apache.org> (cherry picked from commit 1641a4a)
On 2.x the ServerManager registers admins in a HashMap. This can result in thread safety issues — we recently observed an exception which caused a region to be indefinitely stuck in transition until we could manually intervene. We saw the following exception in the HMaster logs:
This error is not a particularly clear indication of concurrency issues, but that's more an issue with HashMap error handling. Here's a SO thread which discusses the error and why it occurs. This is a good explanation imo.
cc @bbeaudreault @hgromer @eab148 @bozzkar