-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HBASE-28690 Aborting Active HMaster is not rejecting reportRegionStateTransition if procedure is initialised by next Active master #6129
Conversation
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, just some logging and comments issues.
Thanks!
@@ -2567,6 +2573,17 @@ public ReportProcedureDoneResponse reportProcedureDone(RpcController controller, | |||
return ReportProcedureDoneResponse.getDefaultInstance(); | |||
} | |||
|
|||
private void throwOnOldMasterStartCode(Stream<Long> masterStartCodeStream) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use LongStream here, but for me I prefer we log more about the procedures here, like which procedure's master start code is greater than current master's start code, and also log the start codes for better debugging.
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
Outdated
Show resolved
Hide resolved
@@ -93,11 +93,14 @@ class PostOpenDeployContext { | |||
private final HRegion region; | |||
private final long openProcId; | |||
private final long masterSystemTime; | |||
private final long masterStartCode; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better add some comments or give it a more clear name so developers will know that this is the start code of the master which scheduled this procedure, for fencing.
hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java
Outdated
Show resolved
Hide resolved
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
Outdated
Show resolved
Hide resolved
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
Outdated
Show resolved
Hide resolved
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java
Outdated
Show resolved
Hide resolved
Had a offline discussion with @Umeshkumar9414 and this logic has a flaw. Instead of using master start code we should be using master active time for fencing otherwise we will end up in a deadly loop of old master not processing rpc reequests. |
I do not get the point... The problem we want to solve here is to not allow old master to process rpc requests? |
@Apache9 - the problem with master start code is that if we are running 5 masters and one of the master which have highest start code creates the procedure and aborts. Now since all other masters have start code smaller than last active(which has aborted now), this check
will always pass and we will always throw exception. So instead of using the master start code we should be using master active time for this check. The misunderstanding i believe was that start code changes when a master becomes active which is false. Only master active time would update when a master becomes active. |
OK, so the problem here is that, a backup master may have lower start code then the current active master, and once the backup master becomes the active master, it can not accept the procedure's report which was scheduled by the previous active master. Makes sense. We should use the timestamp when the master becomes the active master. |
Yeah, I will change this to use masterActiveTime |
2a4203f
to
e201096
Compare
@Apache9 , @mnpoonia , @ranganathg , @virajjasani I try to resolve all the concerns. Please review. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!!
💔 -1 overall
This message was automatically generated. |
For branch-2 - #6136 |
f5167cd
to
587e08c
Compare
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java
Outdated
Show resolved
Hide resolved
public UnassignRegionHandler(HRegionServer server, String encodedName, long closeProcId, | ||
boolean abort, @Nullable ServerName destination, EventType eventType) { | ||
this(server, encodedName, closeProcId, abort, destination, eventType, false); | ||
this(server, encodedName, closeProcId, abort, destination, eventType, -1, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we call this constructor? Is it safe to just use -1 here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if anyone is using this. At least in the code, I didn't see any usage of this. As I was not aware I didn't change the signature. -1 is safe as it will not block the procedure reporting as -1 will be the smallest active code.
Should I follow the aggressive approach and ask for an argument and pass that further? I think that would be good. Let me change to that approch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think best would be just remove it
@@ -471,9 +474,10 @@ public ServerOperation(RemoteProcedure remoteProcedure, long procId, Class<?> rs | |||
this.rsProcData = rsProcData; | |||
} | |||
|
|||
public RemoteProcedureRequest buildRequest() { | |||
public RemoteProcedureRequest buildRequest(long initiatingMasterActiveTime) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer we add initiatingMasterActiveTime as a field of the operation, instead of a method parameter.We can add a initiatingMasterActiveTime in ServerOperation and RegionOperation class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Umeshkumar9414 FYI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At that time I had some thoughts in my mind -
- Including this in operation will introduce too many changes.
- Should I add this to the Procedure class or in the Operation? But now I am clear in this perspective. I am adding this into operation as this is something related to dispatching not related to the procedure itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were other reasons also like there is a difference in how we create requests for OpenRegion and CloseRegions. For OpenRegionRequest we create only one request for all the OpenRegionOperations but for close we create CloseRegoinRequest for each operation.
It would be all good in close but for OpenRegionRequest here, we need to get the activeCode from one of the entry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be like this
builder.setInitiatingMasterActiveTime(operations.stream().map(
RemoteOperation::getInitiatingMasterActiveTime).findAny().get());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed it as well. Added that in RemoteOperations
...e-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/AssignRegionHandler.java
Outdated
Show resolved
Hide resolved
587e08c
to
1a1b78d
Compare
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
f73196e
to
cf101ab
Compare
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
@Apache9 , @virajjasani kindly review and merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. Only a small nits.
Thanks @Umeshkumar9414 !
@@ -441,7 +441,9 @@ protected final void remoteCallFailed(final MasterProcedureEnv env, final IOExce | |||
private static OpenRegionRequest buildOpenRegionRequest(final MasterProcedureEnv env, | |||
final ServerName serverName, final List<RegionOpenOperation> operations) { | |||
final OpenRegionRequest.Builder builder = OpenRegionRequest.newBuilder(); | |||
builder.setServerStartCode(serverName.getStartcode()); | |||
builder.setServerStartCode(serverName.getStartCode()); | |||
builder.setInitiatingMasterActiveTime(operations.stream() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this field is optional in the pb message, better to use stream().findAny().IfPresent(builder::setInitiatingMasterActiveTime)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
PR for branch-2 #6136 |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
@Apache9, @virajjasani can you please help me in merging this PR as well as #6136 |
Good work! @Umeshkumar9414 |
…eTransition if procedure is initialised by next Active master (apache#6129) Added masterActiveTime as fencing token for remote procedures Signed-off-by: Duo Zhang <zhangduo@apache.org> Reviewed-by: Aman Poonia <aman.poonia.29@gmail.com> (cherry picked from commit d2a1f19)
masterStartCode as a fencing token for remote procedures
There are 3 flows that I checked regionOpen, regionClose, and other remote Procedures. In might not be needed in the third flow but I added masterStartCode in all three.