HDDS-10875. XceiverRatisServer#getRaftPeersInPipeline should be called before XceiverRatisServer#removeGroup #6696
@ivandika3, thanks for working on this! How about catching GroupMismatchException in the existing try-catch?
+++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/ClosePipelineCommandHandler.java
@@ -105,7 +105,6 @@ public void handle(SCMCommand command, OzoneContainer ozoneContainer,
     try {
       XceiverServerSpi server = ozoneContainer.getWriteChannel();
       if (server.isExist(pipelineIdProto)) {
-        server.removeGroup(pipelineIdProto);
         if (server instanceof XceiverServerRatis) {
           // TODO: Refactor Ratis logic to XceiverServerRatis
           // Propagate the group remove to the other Raft peers in the pipeline
@@ -127,12 +126,18 @@ public void handle(SCMCommand command, OzoneContainer ozoneContainer,
                 }
               });
         }
+        server.removeGroup(pipelineIdProto);
         LOG.info("Close Pipeline {} command on datanode {}.", pipelineID,
             dn.getUuidString());
       } else {
         LOG.debug("Ignoring close pipeline command for pipeline {} " +
             "as it does not exist", pipelineID);
       }
+    } catch (GroupMismatchException gme) {
+      // ignore silently since this means that the group has been closed by earlier close pipeline
+      // command in another datanode
+      LOG.debug("The Ratis group for the pipeline {} has been removed by earlier close pipeline command from " +
+          "other datanodes", pipelineID.getId());
     } catch (IOException e) {
       LOG.error("Can't close pipeline {}", pipelineID, e);
     } finally {
Thank you for the review @szetszwo.
Sure, updated. Was using a nested try-catch since
+1 the change looks good.
In
    RaftClientReply reply;
    try {
      reply = server.groupManagement(request);
    } catch (Exception e) {
      throw new IOException(e.getMessage(), e);
    }
Currently, I'm using
Found
Added a new
@szetszwo I have made some changes on the patch. Could you help review this again? Thank you.
    } catch (GroupMismatchException gme) {
      // ignore silently since this means that the group has been closed by earlier close pipeline
      // command in another datanode
      LOG.debug("The group for pipeline {} on datanode {} has been removed by earlier close " +
          "pipeline command handled in another datanode", pipelineID, dn.getUuidString());
@ivandika3, since HddsClientUtils.containsException below will cover this case, let's remove catch (GroupMismatchException gme) {...}?
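To illustrate why a dedicated catch block becomes redundant once the cause chain is inspected, here is a minimal sketch. It is not the Ozone implementation of HddsClientUtils.containsException; the helper findCause and the nested GroupMismatchException class are hypothetical stand-ins so that the example compiles without Ratis or Ozone on the classpath.

```java
import java.io.IOException;

public class CauseChainSketch {
  /** Hypothetical stand-in for the Ratis GroupMismatchException. */
  static class GroupMismatchException extends IOException {
    GroupMismatchException(String message) {
      super(message);
    }
  }

  /**
   * Hypothetical helper: returns the first throwable in the cause chain of e
   * (including e itself) that is an instance of expected, or null if none is.
   */
  static Throwable findCause(Throwable e, Class<? extends Throwable> expected) {
    Throwable t = e;
    while (t != null) {
      if (expected.isInstance(t)) {
        return t;
      }
      t = t.getCause();
    }
    return null;
  }

  public static void main(String[] args) {
    // The group-management call wraps any failure in an IOException,
    // so the interesting exception sits in the cause chain.
    IOException wrapped = new IOException("remove failed",
        new GroupMismatchException("group not found"));
    IOException unrelated = new IOException("disk error");

    System.out.println(findCause(wrapped, GroupMismatchException.class) != null);
    System.out.println(findCause(unrelated, GroupMismatchException.class) != null);
  }
}
```

With a check like this inside the generic IOException handling, both a directly thrown and a wrapped GroupMismatchException take the same path, which is presumably why the separate catch clause was dropped.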
Thank you for the review. Updated.
+1 to the latest change.
Thank you for the reviews @szetszwo, merged.
…d before XceiverRatisServer#removeGroup (apache#6696)
…d before XceiverRatisServer#removeGroup (apache#6696) (cherry picked from commit 87c3945)
What changes were proposed in this pull request?
From the comment at https://issues.apache.org/jira/browse/HDDS-10750?focusedCommentId=17847435&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17847435 in HDDS-10750, it was found that GroupMismatchException is thrown during ClosePipelineCommandHandler.
This is because XceiverRatisServer#removeGroup is called before XceiverRatisServer#getRaftPeersInPipeline, which causes XceiverRatisServer#getRaftPeersInPipeline to throw GroupMismatchException when it tries to call RaftServerProxy#getDivision, since the group has already been removed.
Therefore, we need to call XceiverRatisServer#getRaftPeersInPipeline before calling XceiverRatisServer#removeGroup.
This patch also catches the GroupMismatchException in case the group has already been removed by an earlier ClosePipelineCommandHandler on another datanode for the same pipeline. The datanode will also try to remove the Ratis group from the other datanodes (ignoring the GroupMismatchException) before removing its own Ratis group.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10875
How was this patch tested?
Clean CI: https://github.com/ivandika3/ozone/actions/runs/9145754999
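The ordering described under "What changes were proposed in this pull request?" can be sketched with a toy model. This is only an illustration under stated assumptions: ToyServer, newCluster, closePipeline and the nested GroupMismatchException are hypothetical names that do not exist in Ozone or Ratis, and the in-memory map stands in for the real Ratis group state.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClosePipelineOrderSketch {
  /** Hypothetical stand-in for the Ratis GroupMismatchException. */
  static class GroupMismatchException extends IOException {
    GroupMismatchException(String message) {
      super(message);
    }
  }

  /** Toy datanode: maps a group id to the names of the peers in that group. */
  static class ToyServer {
    final String name;
    final Map<String, List<String>> groups = new HashMap<>();

    ToyServer(String name) {
      this.name = name;
    }

    // Fails once the group is gone, like getRaftPeersInPipeline after removeGroup.
    List<String> getPeers(String groupId) throws GroupMismatchException {
      List<String> peers = groups.get(groupId);
      if (peers == null) {
        throw new GroupMismatchException(name + ": group " + groupId + " not found");
      }
      return new ArrayList<>(peers);
    }

    void removeGroup(String groupId) throws GroupMismatchException {
      if (groups.remove(groupId) == null) {
        throw new GroupMismatchException(name + ": group " + groupId + " not found");
      }
    }
  }

  /** Builds a toy cluster in which every named node is a member of groupId. */
  static Map<String, ToyServer> newCluster(String groupId, String... names) {
    Map<String, ToyServer> cluster = new HashMap<>();
    for (String n : names) {
      ToyServer s = new ToyServer(n);
      s.groups.put(groupId, Arrays.asList(names));
      cluster.put(n, s);
    }
    return cluster;
  }

  static void closePipeline(ToyServer local, Map<String, ToyServer> cluster,
      String groupId) throws IOException {
    if (!local.groups.containsKey(groupId)) {
      return; // group already gone, nothing to do
    }
    // 1. Read the peers while the local group still exists.
    List<String> peers = local.getPeers(groupId);
    // 2. Propagate the removal to the other peers.
    for (String peer : peers) {
      if (peer.equals(local.name)) {
        continue;
      }
      try {
        cluster.get(peer).removeGroup(groupId);
      } catch (GroupMismatchException e) {
        // The peer already removed the group through an earlier close command.
      }
    }
    // 3. Only now remove the local group.
    local.removeGroup(groupId);
  }

  public static void main(String[] args) throws IOException {
    Map<String, ToyServer> cluster = newCluster("g1", "dn1", "dn2", "dn3");
    closePipeline(cluster.get("dn1"), cluster, "g1");
    closePipeline(cluster.get("dn2"), cluster, "g1"); // later command is a no-op
    for (ToyServer s : cluster.values()) {
      System.out.println(s.name + " groups left: " + s.groups.size());
    }
  }
}
```

Swapping steps 1 and 3 in closePipeline reproduces the failure mode from the description: getPeers would throw because the local group no longer exists.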