HDFS-17514: RBF: Routers should unset cached stateID when namenode does not set stateID in RPC response header. #6804
Changes from all commits
ca832c3
f6c3404
1ff6698
e4e3c55
c245292
9a344c1
afb800f
14ab9f7
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -64,7 +64,12 @@ public void updateResponseState(RpcHeaderProtos.RpcResponseHeaderProto.Builder h` |
| 64 | 64 | `   */` |
| 65 | 65 | `  @Override` |
| 66 | 66 | `  public void receiveResponseState(RpcHeaderProtos.RpcResponseHeaderProto header) {` |
| 67 | | `-    sharedGlobalStateId.accumulate(header.getStateId());` |
| | 67 | `+    if (header.getStateId() == 0 && sharedGlobalStateId.get() > 0) {` |
| | 68 | `+      sharedGlobalStateId.reset();` |
| | 69 | `+      poolLocalStateId.reset();` |
| | 70 | `+    } else {` |
| | 71 | `+      sharedGlobalStateId.accumulate(header.getStateId());` |
| | 72 | `+    }` |
| 68 | 73 | `  }` |
| 69 | 74 | |
| 70 | 75 | `  /**` |

Review thread on the added `sharedGlobalStateId.accumulate(header.getStateId());` line:

Contributor
@simbadzina I have a naive question: What protects us here from the state where It seems like if this case were to occur then sharedGlobalStateId would go backwards.

Member, Author
The sharedGlobalStateID is created as

Contributor
Ah, makes sense. Thanks.
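The patched `receiveResponseState` logic can be sketched in isolation as follows. This is a minimal stand-in, not the Hadoop class: the class name `StateIdResetSketch`, the `long` parameter (standing in for `header.getStateId()`), and the `cachedStateId()` accessor are hypothetical. It also assumes both counters are `java.util.concurrent.atomic.LongAccumulator` instances built with `Math::max` and an identity of `Long.MIN_VALUE`, which the diff itself does not show.

```java
import java.util.concurrent.atomic.LongAccumulator;

/** Illustrative stand-in for the router-side alignment context; not the Hadoop class. */
public class StateIdResetSketch {
  // Assumption: max-accumulators whose identity (Long.MIN_VALUE) means "no state ID cached".
  private final LongAccumulator sharedGlobalStateId =
      new LongAccumulator(Math::max, Long.MIN_VALUE);
  // In this sketch the pool-local counter is only ever reset, never advanced.
  private final LongAccumulator poolLocalStateId =
      new LongAccumulator(Math::max, Long.MIN_VALUE);

  /** Mirrors the patched branch; headerStateId stands in for header.getStateId(). */
  public void receiveResponseState(long headerStateId) {
    if (headerStateId == 0 && sharedGlobalStateId.get() > 0) {
      // The namenode stopped sending a state ID: drop the cached values.
      sharedGlobalStateId.reset();
      poolLocalStateId.reset();
    } else {
      // Max-accumulation: a smaller non-zero ID never moves the cached value backwards.
      sharedGlobalStateId.accumulate(headerStateId);
    }
  }

  public long cachedStateId() {
    return sharedGlobalStateId.get();
  }

  public static void main(String[] args) {
    StateIdResetSketch ctx = new StateIdResetSketch();
    ctx.receiveResponseState(42);
    ctx.receiveResponseState(17); // lower than 42, so the cached ID is unchanged
    System.out.println("cached after 42, 17: " + ctx.cachedStateId());
    ctx.receiveResponseState(0);  // unset state ID after a positive one triggers the reset
    System.out.println("reset to identity: " + (ctx.cachedStateId() == Long.MIN_VALUE));
  }
}
```

Under these assumptions, the only way the cached value moves backwards is the explicit reset in the first branch.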
A stateId of 0 means no state ID is set, right?
Is this different from msync being 0?

I assume stateId is an integer and protobuf will return 0 for an integer if it is not set?

Yes, a stateID of `0` means no stateId is set. Msync doesn't have a return value.

There is another bug in the router in that it accepts 0 as a value to advance its cachedStateID to.

Ideally `sharedGlobalStateId.get() > 0` should not be necessary here. For now it captures namenodes that actually had STATE_ID_CONTEXT enabled to begin with. But stale reads could happen with a namenode that has never had STATE_ID_CONTEXT enabled.

Fixing this will touch other tests, so I'm debating whether to try to fix that in this PR or separately. I'm leaning towards a separate PR.

We can remove the `sharedGlobalStateId.get() > 0`, right?

If `sharedGlobalStateId.get() < 0`, routers already fall back to active and there is no need to reset. If it is > 0 and then we see a request without StateID, we will reset this counter and routers will fall back to active.

Adding `sharedGlobalStateId.get() > 0` doesn't seem to make a difference.

The tests in `TestNoNamenodesAvailableLongTime` rely on the router allowing a stateId of 0. So having `sharedGlobalStateId.get() > 0` allows this behavior while guarding against the case where the sharedGlobalStateId has advanced beyond zero.

The tests in `TestNoNamenodesAvailableLongTime` do need to be fixed, but I would like to limit the scope of this PR.

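To make the effect of the `sharedGlobalStateId.get() > 0` guard concrete, the sketch below replays a sequence of response state IDs with and without it. Everything here is hypothetical scaffolding: the class `GuardComparisonSketch`, the `replay` helper, and its boolean flag are invented for illustration, and the cached ID is assumed to be a `LongAccumulator` with `Math::max` and identity `Long.MIN_VALUE`.

```java
import java.util.concurrent.atomic.LongAccumulator;

/** Hypothetical comparison of the reset condition with and without the "> 0" guard. */
public class GuardComparisonSketch {

  /**
   * Replays response state IDs and returns the final cached value.
   * requirePositiveCachedId = true models the check in this PR;
   * false models resetting whenever the header carries 0.
   */
  static long replay(long[] headerStateIds, boolean requirePositiveCachedId) {
    // Assumption: identity Long.MIN_VALUE stands for "nothing cached".
    LongAccumulator cached = new LongAccumulator(Math::max, Long.MIN_VALUE);
    for (long id : headerStateIds) {
      boolean cachedAllowsReset = !requirePositiveCachedId || cached.get() > 0;
      if (id == 0 && cachedAllowsReset) {
        cached.reset();
      } else {
        cached.accumulate(id);
      }
    }
    return cached.get();
  }

  public static void main(String[] args) {
    long[] neverEnabled = {0, 0, 0}; // namenode that never sends a state ID
    long[] disabledLater = {5, 9, 0}; // namenode that stops sending a state ID

    // With the guard, 0 is accepted as a cached value (the behavior the tests rely on).
    System.out.println("never enabled, guard:    " + replay(neverEnabled, true));
    // Without it, the cached value stays at the identity.
    System.out.println("never enabled, no guard: "
        + (replay(neverEnabled, false) == Long.MIN_VALUE));
    // Both variants reset once a positive ID is followed by 0.
    System.out.println("disabled later, same result: "
        + (replay(disabledLater, true) == replay(disabledLater, false)));
  }
}
```

In this model the two variants differ only for a namenode that has never returned a non-zero state ID, which matches the case the discussion defers to a follow-up PR.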
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to fixing the tests and associated check in follow on PR.