-
Notifications
You must be signed in to change notification settings - Fork 9.2k
HDFS-17514: RBF: Routers should unset cached stateID when namenode does not set stateID in RPC response header. #6804
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…havior of old filesystem object.
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
| @Override | ||
| public void receiveResponseState(RpcHeaderProtos.RpcResponseHeaderProto header) { | ||
| sharedGlobalStateId.accumulate(header.getStateId()); | ||
| if (header.getStateId() == 0 && sharedGlobalStateId.get() > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stateId 0 means no state id right?
This is different than msync being 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume stateId is an integer and protobuf will return 0 for an integer if it is not set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, stateID 0 means no stateId is set. Msync doesn't have a return value.
There is another bug in the router in that it accepts 0 as a value to advance it's cachedStateID to.
Ideally sharedGlobalStateId.get() > 0 should not be necessary here. For now it captures namenodes that actually had STATE_ID_CONTEXT enabled to begin with. But stale reads could happen with a namenode that has never had STATE_ID_CONTEXT enabled.
Fixing this will touch other tests so I'm debating whether to try fix that in this PR or separately. I'm leaning towards a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove the sharedGlobalStateId.get() > 0, right?
If sharedGlobalStateId.get() < 0, routers already fallback to active and no need to reset. If it is > 0 and then we see a request without StateID, we will reset this counter and routers will fallback to active.
Adding sharedGlobalStateId.get() > 0 doesn't seem to make a difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests in TestNoNamenodesAvailableLongTime rely on the router allowing a stateId of 0. So having sharedGlobalStateId.get() > 0 allows this behavior while guarding against when the sharedGlobalStateId has advances beyond zero.
The tests in TestNoNamenodesAvailableLongTime do need to be fixed but I would like to limit the scope of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to fixing the tests and associated check in follow on PR.
xinglin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the quick fix!
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
| if (header.getStateId() == 0 && sharedGlobalStateId.get() > 0) { | ||
| sharedGlobalStateId.reset(); | ||
| } else { | ||
| sharedGlobalStateId.accumulate(header.getStateId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simbadzina I have a naive question: What protects us here from the state where header.getStateId() > 0 && header.getStateId() < sharedGlobalStateId?
It seems like if this case were to occur then sharedGlobalStateId would go backwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sharedGlobalStateID is created as new LongAccumulator(Math::max, Long.MIN_VALUE)
So accumulate either keeps the current value or moves it forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah makes sense. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simbadzina Should we also add a test case to TestPoolAlignmentContext.java to ensure the sharedGlobalStateId moves under the conditions we expect it to?
|
🎊 +1 overall
This message was automatically generated. |
Good call. Let me add that. |
|
💔 -1 overall
This message was automatically generated. |
ctrezzo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 LGTM. Thanks @simbadzina !
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
Failing tests in continuous-integration unrelated to my changes |
…es not set stateID in RPC response header. (apache#6804)
…es not set stateID in RPC response header. (apache#6804)
…es not set stateID in RPC response header. (apache#6804) (Cherry-picked from 6a4f0be)
HDFS-17514
Description of PR
When a namenode that had "dfs.namenode.state.context.enabled" set to true is restarted with the configuration set to false, routers will keep using a previously cached state ID.
Without RBF
With RBF
New clients that are created after the restart should not fetch the stale state ID.
How was this patch tested?
New unit test
Unit test fails with the patch is PoolAlignmentContext in the second commit.
Unit test pass with the patch.
For code changes:
LICENSE,LICENSE-binary,NOTICE-binaryfiles?