HDDS-11843. Enhance DataNodeSafeModeRule to Use Pipeline DN List. #7541
Conversation
@sadanand48 @siddhantsangwan @adoroszlai Could you help review the code? Thank you very much! In this PR, we have enhanced the DataNodeSafeModeRule to determine how many DataNodes need to be registered based on the list of DataNodes stored in the pipeline, multiplied by a certain ratio. From my perspective, this approach is better than directly setting the number of DataNodes.
Hi @slfan1989, can you elaborate more on the use case for this feature?
I can see how the default value of 1 may not be too helpful, but what are the practical issues that the current safemode configurations are causing in your cluster? The other rules that check for container replicas and pipelines should ensure that the cluster is sufficiently operational when safemode is exited.

I have a few concerns with adding this. The pipeline list is not a definitive list of cluster membership, especially if new nodes are added along with the restart, so the rule may be more or less an approximation depending on the situation. It also adds more configuration and complexity to safemode. FWIW, HDFS does not have such a feature because it does not have fixed cluster membership by design. Ozone does not really have fixed cluster membership either. Persistent pipelines actually add more complexity to the system than they are worth IMO, and I don't think we should tie more features into them.

I feel like a dashboard as described in HDDS-11525 would better address the concern about nodes not registering, and that the container and pipeline rules provide stronger guarantees about the cluster's readiness than a node count or percentage.
Thanks @errose28 for the input. I agree with Ethan's point. We should try to simplify the safemode rules, not add complexity to them. Even with the additional complexity, this doesn't add any value to the safemode logic. The container and pipeline safemode rules already ensure that the cluster is operational when it comes out of safemode.
Thank you very much for your response! However, please allow me to provide some clarification regarding this PR. My intention is not to increase the complexity of SafeMode, but rather to address some issues that have arisen during its use.

Currently, my approach is to manually compare the DataNode lists from the two SCMs, identify the DataNodes that failed to register, track them, and take appropriate action (in most cases, restarting the DataNodes). The purpose of retrieving the DataNode list from the pipeline list is to identify any unregistered DataNodes, as shown in my screenshot.

So, what value should we set for the parameter? A fixed number is hard to keep correct as the cluster changes. If we set a proportion instead, the rule becomes more flexible: for example, if we expect 60% of the DataNodes to register, the required count would change dynamically over time.

I would really appreciate any suggestions you may have. Thank you once again! To add a bit more: HDDS-11525 is indeed a viable solution, and we do need a monitoring dashboard. I will continue to follow up on this.
Thanks @slfan1989 for sharing details.
We could implement new (or improve existing)
I think the sentiment here is that other safemode rules should help with that.
Related to this change, @nandakumar131 recently filed HDDS-11904. Maybe once that is fixed
Probably
I see the datanode safemode rule as a sort of smoke test that cluster-wide configs are set up correctly so that SCM and DNs can communicate. Security, DNS, hostnames, etc. are all working, such that if at least one datanode registers, then we have reason to believe that later safemode rules are failing for different reasons. I don't think the intent of this rule is really to enforce a cluster membership requirement before bringing the cluster out of safemode.
I don't have much experience in this area, but it seems like the system being monitored should not be bound by the system being used to monitor it. If Ozone is being scaled horizontally, then in theory the metrics collection system should be able to scale horizontally to match. If the datanodes are doing something egregious that is causing the size of metrics data to explode, then we should try to optimize that, but we shouldn't remove information that could otherwise be useful.
@errose28 @adoroszlai Thank you very much for your response! The feature I submitted is relatively small, and I really appreciate your attention and the detailed explanation. I agree with your point that integrating some of the functionality into DataNodeSafeModeRule may not be ideal. However, we have already rolled out this feature internally, and for my part, it meets our needs. I understand that different users may have different requirements and perspectives on the system.

I plan to close this PR and set the JIRA status to "Works for me." If other community members search for this JIRA, read our discussion, and find it helpful, that would be great. HDDS-11525 can solve our issue, so I will focus on that JIRA and do my best to contribute. I look forward to HDDS-11904 bringing better results. If a PR is submitted for the related JIRA, I will also take a look.

I will also consider this idea, as it's a good approach. However, as @errose28 mentioned, we would need to store this information elsewhere, which would still add complexity to the system. Thank you all again for your time!
What changes were proposed in this pull request?
We enhanced the DataNodeSafeModeRule to validate registered DataNodes based on the DN list stored in the pipeline.

Currently, the DataNodeSafeModeRule in SCM requires users to manually configure the number of DataNodes in the cluster via hdds.scm.safemode.min.datanode, which defaults to 1. However, since clusters are frequently scaled and the number of active DataNodes is uncertain, relying on a fixed value to decide whether SafeMode conditions are met is not reliable.

To address this, we propose allowing DataNodeSafeModeRule to retrieve the DataNode list from the registered pipelines and use a configurable ratio to dynamically calculate the required number of DataNodes. This provides a more accurate way to determine when the system can exit SafeMode.

Configuration 1: Enable the pipeline-based rule.
Configuration 2: Set the hdds.scm.safemode.reported.datanode.pct value.
What is the link to the Apache JIRA
JIRA: HDDS-11843. Enhance DataNodeSafeModeRule to Use Pipeline DN List.
How was this patch tested?
Manual Testing.
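For illustration, the proposed check (take the distinct DataNodes across all persisted pipelines, multiply by the configured ratio, and compare against the registered count) can be sketched as below. This is a hypothetical sketch, not the actual Ozone implementation: the class and method names are invented, and the real rule would read DataNode details from SCM's pipeline manager rather than plain strings.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a ratio-based DataNode safemode check.
// Names are illustrative only, not Ozone's actual API.
public class PipelineDatanodeSafeModeCheck {

  /**
   * Number of DataNodes that must register before the rule passes:
   * the count of distinct DNs across all pipeline DN lists,
   * multiplied by the configured ratio (e.g. 0.6 for 60%),
   * rounded up.
   */
  public static int requiredDatanodes(List<List<String>> pipelineDnLists,
                                      double reportedDatanodePct) {
    Set<String> distinctDns = new HashSet<>();
    for (List<String> dns : pipelineDnLists) {
      distinctDns.addAll(dns);
    }
    return (int) Math.ceil(distinctDns.size() * reportedDatanodePct);
  }

  /** True once enough DataNodes have registered with SCM. */
  public static boolean validate(int registeredDnCount,
                                 List<List<String>> pipelineDnLists,
                                 double reportedDatanodePct) {
    return registeredDnCount
        >= requiredDatanodes(pipelineDnLists, reportedDatanodePct);
  }
}
```

For example, with two pipelines covering four distinct DataNodes and a ratio of 0.6, three DataNodes must register before the rule passes; the threshold grows automatically as pipelines covering new DataNodes are persisted.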