HDDS-10859. Improve error messages when decommission and maintenance fail-early #6678

Merged May 16, 2024 (1 commit)

@Tejaskriya Tejaskriya commented May 15, 2024

What changes were proposed in this pull request?

The error message currently shown when decommission or maintenance fails early is not detailed enough:
Error: AllHosts: Sufficient nodes are not available.
The message should be self-explanatory and state the reason for the failure.
In this PR, the error messages have been changed to show how many nodes the operator tried to decommission or put into maintenance, and how many nodes are needed to maintain the required replication.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10859

How was this patch tested?

Tested locally in a Docker cluster with 5 nodes; the maximum replication factor of the keys was RATIS-THREE:

bash-4.2$ ozone admin datanode decommission ozone-datanode-1 ozone-datanode-2 ozone-datanode-3
Started decommissioning datanode(s):
ozone-datanode-1
ozone-datanode-2
ozone-datanode-3
Error: AllHosts: Insufficient nodes. Tried to decommission 3 nodes of which 3 nodes were valid. Cluster has 5 IN-SERVICE nodes, 3 of which are required for minimum replication. 
Some nodes could not enter the decommission workflow
bash-4.2$ 
bash-4.2$ ozone admin datanode maintenance ozone-datanode-1 ozone-datanode-2 ozone-datanode-3 ozone-datanode-4
Entering maintenance mode on datanode(s):
ozone-datanode-1
ozone-datanode-2
ozone-datanode-3
ozone-datanode-4
Error: AllHosts: Insufficient nodes. Tried to start maintenance for 4 nodes of which 4 nodes were valid. Cluster has 5 IN-SERVICE nodes, 2 of which are required for minimum replication. 
Some nodes could not start the maintenance workflow
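
The arithmetic behind the two messages above can be sketched roughly as follows. This is an illustrative reconstruction from the sample output only, not the code merged in this PR: the class `FailEarlyCheck`, the method `check`, and all of its parameters are hypothetical names, and the rule "fail when fewer IN-SERVICE nodes would remain than minimum replication requires" is an assumption inferred from the numbers shown.

```java
import java.util.Optional;

/** Hypothetical sketch; NOT the actual Ozone SCM implementation. */
final class FailEarlyCheck {

  private FailEarlyCheck() { }

  /**
   * Returns an error message if taking validNodes out of service would leave
   * fewer IN-SERVICE nodes than minRequired; otherwise returns empty.
   * The threshold rule is an assumption inferred from the sample output.
   */
  static Optional<String> check(String action, int requested,
      int validNodes, int inService, int minRequired) {
    int remaining = inService - validNodes;
    if (remaining >= minRequired) {
      return Optional.empty();
    }
    return Optional.of(String.format(
        "Insufficient nodes. Tried to %s %d nodes of which %d nodes were valid. "
            + "Cluster has %d IN-SERVICE nodes, %d of which are required for "
            + "minimum replication.",
        action, requested, validNodes, inService, minRequired));
  }

  public static void main(String[] args) {
    // Numbers from the decommission run above: 5 in service, 3 requested, 3 required.
    System.out.println(check("decommission", 3, 3, 5, 3).orElse("OK"));
    // Numbers from the maintenance run above: 5 in service, 4 requested, 2 required.
    System.out.println(check("start maintenance for", 4, 4, 5, 2).orElse("OK"));
  }
}
```

With the decommission numbers, 5 - 3 = 2 nodes would remain while 3 are required, so the check fails. With the maintenance numbers, 5 - 4 = 1 node would remain while 2 are required, so it fails as well.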

@Tejaskriya Tejaskriya marked this pull request as ready for review May 15, 2024 08:23

@sodonnel sodonnel left a comment

LGTM pending the CI run.

@Tejaskriya

@sodonnel thanks for the review and approval!
The CI failures don't seem to be related to my changes. Could you please re-trigger the failed ones?

@siddhantsangwan siddhantsangwan left a comment

LGTM. Thanks @Tejaskriya for the improvement! Pending green CI.

@siddhantsangwan siddhantsangwan merged commit 366d074 into apache:master May 16, 2024
51 of 56 checks passed
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request May 23, 2024
errose28 added a commit to errose28/ozone that referenced this pull request May 28, 2024
…concile-cli

* HDDS-10239-container-reconciliation: (296 commits)
  HDDS-10897. Refactor OzoneQuota (apache#6714)
  HDDS-10422. Fix some warnings about exposing internal representation in hdds-common (apache#6351)
  HDDS-10899. Refactor Lease callbacks (apache#6715)
  HDDS-10890. Increase default value for hdds.container.ratis.log.appender.queue.num-elements (apache#6711)
  HDDS-10832. Client should switch to streaming based on OpenKeySession replication (apache#6683)
  HDDS-10435. Support S3 object tags for existing requests (apache#6607)
  HDDS-10883. Improve logging in Recon for finalising DN logic. (apache#6704)
  HDDS-8752. Enable TestOzoneRpcClientAbstract#testOverWriteKeyWithAndWithOutVersioning (apache#6702)
  HDDS-10875. XceiverRatisServer#getRaftPeersInPipeline should be called before XceiverRatisServer#removeGroup (apache#6696)
  HDDS-10514. Recon - Provide DN decommissioning detailed status and info inline with current CLI command output. (apache#6376)
  HDDS-10878. Bump zstd-jni to 1.5.6-3 (apache#6701)
  HDDS-10877. Bump Dropwizard metrics to 3.2.6 (apache#6699)
  HDDS-10876. Bump jackson to 2.16.2 (apache#6697)
  HDDS-6116. Remove flaky tag from TestSCMInstallSnapshot (apache#6695)
  HDDS-2643. TestOzoneDelegationTokenSecretManager#testRenewTokenFailureRenewalTime fails intermittently.
  HDDS-10699. Refactor ContainerBalancerTask and TestContainerBalancerTask (apache#6537)
  HDDS-10861. Ozone cli supports default ozone.om.service.id (apache#6680)
  HDDS-10859. Improve error messages when decommission and maintenance fail-early (apache#6678)
  HDDS-9031. Upgrade acceptance tests to Docker Compose v2 (apache#6667)
  HDDS-10559. Add a warning or a check to run repair tool as System user (apache#6574)
  ...

Conflicts:
    hadoop-ozone/dist/src/main/smoketest/admincli/container.robot