CORE-5766 Validate target node id when collecting health report #22811
Conversation
Introduced an error code indicating that the node the request was sent to is not the one that received it. Signed-off-by: Michał Maślanka <michal@redpanda.com>
Added validation that checks whether the node replying to the request is the one the request was sent to. The validation is important as the receiving node's id might have changed while the RPC endpoint address stays the same. Signed-off-by: Michał Maślanka <michal@redpanda.com>
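To make the mechanism concrete, a minimal sketch of the idea behind these two commits follows, with illustrative names only (the error code, the reply struct and the handler below are placeholders, not the actual Redpanda types): the receiving node compares its own id with the id the request was addressed to and replies with the dedicated error code on a mismatch.

// Minimal illustrative sketch; all names are placeholders, not Redpanda's API.
#include <cstdint>

enum class errc {
    success = 0,
    // new error code: the request reached a node other than the addressed one
    invalid_target_node_id,
};

struct health_report_reply {
    errc error{errc::success};
    // ... report payload omitted ...
};

health_report_reply collect_node_health_report(
  int32_t self_node_id, int32_t requested_node_id) {
    // The node id can change (e.g. after the data folder is wiped and the node
    // re-joins with a new id) while the RPC endpoint address stays the same,
    // so the id from the request must be checked against the id of the node
    // actually serving it.
    if (requested_node_id != self_node_id) {
        return {.error = errc::invalid_target_node_id};
    }
    // ... collect the local health report and return it ...
    return {.error = errc::success};
}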
Is this problem specific to health reports, or can we hit the same problem with any inter-node RPCs? I'm just thinking maybe the check should be included in the RPC mechanism, so that an RPC call is not even processed by the remote node if it is not the original addressee?
You are right, this is a problem with the RPC mechanism in general. We were thinking about adding a handshake to perform the validation; that should definitely be scheduled as future work. I didn't do it right now as the solution would be much more complex and not easy to backport. The problem with the health report is real and this PR solves it in isolation.
if (!reply) {
    return {reply.error()};
}
if (!reply.value().report.has_value()) {
    return {reply.value().error};
}
if (reply.value().report->id != target_node_id) {
What's the point of rechecking it here, if we have already checked it on the server side in service::do_collect_node_health_report? I guess it only makes sense for some corner cases when nodes run different versions of Redpanda?
exactly
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52703#019136d9-1f11-4b31-b4d3-341497bdcf08
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52718#019137f3-b300-440f-a498-586e2c4611a7
Added a field indicating what node the request was targeted to. If present, the `target_node_id` will be validated when processing the request. Signed-off-by: Michał Maślanka <michal@redpanda.com>
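A minimal sketch of what such an optional target field might look like, again with assumed names rather than the actual Redpanda definitions; validating only when the field is present keeps requests from older, not-yet-upgraded senders working in a mixed-version cluster.

// Illustrative sketch only; field and type names are assumptions, not the
// actual Redpanda definitions.
#include <cstdint>
#include <optional>

struct get_node_health_request {
    // Newly added and optional for wire compatibility: older peers omit it.
    std::optional<int32_t> target_node_id;
};

// Validation is skipped when the field is absent, so a request coming from a
// not-yet-upgraded node is still served.
bool is_request_addressed_to_me(
  const get_node_health_request& req, int32_t self_node_id) {
    return !req.target_node_id.has_value()
           || *req.target_node_id == self_node_id;
}

Keeping the field optional means the check simply does not fire for peers that predate the change.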
The health report is used to determine if a cluster node is online and available. When a node id changes but the RPC endpoint does not change, the requester may incorrectly assume that the node with the previous node_id but the same endpoint is still operational. Added validation of the node that the request was sent to before collecting the health report. This way a sender will have correct information about node availability, as only a request targeted at the node with the correct node id will receive a successful reply. Fixes: CORE-5766 Signed-off-by: Michał Maślanka <michal@redpanda.com>
The node folder deletion test checks if a node joins the cluster with a new node id after its data folder was deleted. Introduced a new validation checking that, in this case, the node with the old node_id is reported as offline. Signed-off-by: Michał Maślanka <michal@redpanda.com>
Added validation of the node_id of the reply received from the node. The report is not considered valid if the reply node id doesn't match the id of the node the request was sent to. Signed-off-by: Michał Maślanka <michal@redpanda.com>
Branch updated from 81313db to 08de93d.
Since you know you actually need it to work with only some nodes upgraded, I'm approving it. But I'm still curious why in real life someone would upgrade only some nodes for a long time, or, if the cluster is not going to stay mixed for long, why anyone would care whether the bug is fixed when the first node is upgraded or the last one.
/backport v24.2.x
/backport v24.1.x
/backport v23.3.x
Failed to create a backport PR to v23.3.x branch. I tried:
Failed to create a backport PR to v24.1.x branch. I tried:
The health report is used to determine if a cluster node is online and available. When a node id changes but the RPC endpoint does not change, the requester may incorrectly assume that the node with the previous node_id but the same endpoint is still operational. Added validation of the node that the request was sent to before collecting the health report. This way a sender will have correct information about node availability, as only a request targeted at the node with the correct node id will receive a successful reply.
Fixes: CORE-5766
Backports Required
Release Notes
Bug Fixes