HDDS-11843. Enhance DataNodeSafeModeRule to Use Pipeline DN List. #7541

Closed

Conversation

slfan1989
Contributor

@slfan1989 slfan1989 commented Dec 8, 2024

What changes were proposed in this pull request?

Background

We enhanced the DataNodeSafeModeRule to validate the registered DataNodes against the DN list stored in the Pipelines.

Currently, the DataNodeSafeModeRule in SCM requires users to manually configure the number of DataNodes in the cluster using the setting hdds.scm.safemode.min.datanode, which defaults to 1 if not configured. However, since clusters are frequently scaled and the number of active DataNodes is uncertain, relying on a fixed value to determine whether SafeMode conditions are met is not reliable.

To address this issue, we propose a new rule: allow DataNodeSafeModeRule to retrieve the DataNode list from the registered Pipelines and use a configurable ratio to dynamically calculate the required number of DataNodes. This provides a more accurate way to determine when the system can exit SafeMode.
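
For illustration, here is a minimal, hypothetical sketch of the proposed calculation, assuming a helper that receives the distinct DataNodes referenced by all pipelines; the class and method names are illustrative, not the actual Ozone code:

import java.util.Set;

// Sketch: derive the registration threshold from the pipeline DN list and the
// configured ratio, instead of a fixed hdds.scm.safemode.min.datanode value.
public final class PipelineBasedDnThreshold {

  public static int requiredDatanodes(Set<String> pipelineDatanodes,
                                      double reportedPct) {
    // e.g. 100 distinct DNs in pipelines * 0.1 => 10 DNs must register
    return (int) Math.ceil(pipelineDatanodes.size() * reportedPct);
  }
}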

Usage of the new rule

  • Both of the following configurations need to be set:

Configuration 1: Enable the Pipeline-based DataNode rule.

<property>
    <name>hdds.scm.safemode.datanode.use.pipeline.enabled</name>
    <value>true</value>
</property>

Configuration 2: Set the required percentage of reported DataNodes.

<property>
    <name>hdds.scm.safemode.reported.datanode.pct</name>
    <value>0.1</value>
</property>

Execution Flow of the New Rule

  • We will use the DN list from the Pipelines as the total list of DataNodes expected to register.
  • The required number of registered DataNodes is derived from this total list and the hdds.scm.safemode.reported.datanode.pct value.
  • As DataNodes continue to report and the number of registered DataNodes reaches the calculated threshold, the DataNodeSafeModeRule exits automatically.
  • At the same time, we list up to 5 DataNodes that have not yet registered, to facilitate troubleshooting of DataNodes that failed to register (a rough sketch of this flow follows the list).
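
The following is a rough, hypothetical sketch of the flow above; the class name, method signature, and logging are illustrative and not taken from the actual patch:

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative only: a simplified check matching the flow described above.
final class DatanodeRuleFlowSketch {

  // Returns true once enough of the pipeline DataNodes have registered.
  static boolean validate(Set<String> pipelineDns, Set<String> registeredDns,
                          double reportedPct) {
    int required = (int) Math.ceil(pipelineDns.size() * reportedPct);
    if (registeredDns.size() >= required) {
      return true; // rule satisfied; SafeMode can consider exiting
    }
    // List up to 5 DataNodes that have not registered yet, to aid troubleshooting.
    List<String> unregistered = pipelineDns.stream()
        .filter(dn -> !registeredDns.contains(dn))
        .limit(5)
        .collect(Collectors.toList());
    System.out.println("Waiting for " + (required - registeredDns.size())
        + " more DataNodes; sample of unregistered DNs: " + unregistered);
    return false;
  }
}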

Final Result Display

(screenshot: SafeMode status output listing the unregistered DataNodes)

What is the link to the Apache JIRA

JIRA: HDDS-11843. Enhance DataNodeSafeModeRule to Use Pipeline DN List.

How was this patch tested?

Manual Testing.

@slfan1989 slfan1989 marked this pull request as ready for review December 8, 2024 08:18
@slfan1989
Contributor Author

@sadanand48 @siddhantsangwan @adoroszlai Could you help review the code? Thank you very much!

In this PR, we have enhanced the DataNodeSafeModeRule to determine how many DataNodes need to register based on the list of DataNodes stored in the pipelines, multiplied by a configurable ratio. From my perspective, this approach is better than directly setting the number of DataNodes.

@ChenSammi
Contributor

cc @nandakumar131

@errose28
Contributor

errose28 commented Dec 9, 2024

Hi @slfan1989 can you elaborate more on the use case for this feature?

However, since clusters are frequently scaled and the number of active DataNodes is uncertain, relying on a fixed value to determine whether SafeMode conditions are met is not reliable.

I can see how the default value of 1 may not be too helpful, but what are the practical issues that the current safemode configurations are causing in your cluster? The other rules that check for container replicas and pipelines should ensure that the cluster is sufficiently operational when safemode is exited.

I have a few concerns with adding this. The pipeline list is not a definitive list of cluster membership, especially if new nodes are added along with a restart, so the rule may be more or less an approximation depending on the situation. It also adds more configuration and complexity to safemode. FWIW, HDFS does not have such a feature because it does not have fixed cluster membership by design, and Ozone does not really have fixed cluster membership either. Persistent pipelines actually add more complexity to the system than they are worth, IMO, and I don't think we should tie more features into them.

I feel like a dashboard as described in HDDS-11525 would better address the concern about nodes not registering, and that the container and pipeline rules provide stronger guarantees about the cluster's readiness than node count or percentage.

@nandakumar131
Contributor

Thanks @errose28 for the input. I agree with Ethan's point. We should try to simplify the Safemode rules, not add complexity to them. Even with the additional complexity, this doesn't add any value to the Safemode logic.

The Container and Pipeline Safemode rules ensure that the cluster is operational when it comes out of Safemode.

@slfan1989
Contributor Author

slfan1989 commented Dec 11, 2024

@errose28 @nandakumar131

Thank you very much for your response! However, please allow me to provide some clarification regarding this PR. My intention is not to increase the complexity of SafeMode, but rather to address some issues that have arisen during its use:

The reasons I added this feature are as follows:

  1. The Ozone SCM does not maintain a complete list of DataNodes, which can lead to a problem. In large clusters, when we restart the SCM, some DataNodes may fail to register, and we won't be able to locate these DataNodes (since they are not included in the DataNode list).

Currently, my approach is to manually compare the DataNode lists from the two SCMs, identify the DataNodes that failed to register, track these DataNodes, and take appropriate actions (in most cases, this involves restarting the DataNodes).

The purpose of retrieving the DataNode list from the Pipeline List is to identify any unregistered DataNodes, as shown in my screenshot.

  2. The reason for improving the DataNodeSafeModeRule is that this rule is difficult to apply effectively in real-world usage. Let me provide a specific example:
  • Adding DataNodes:
    When our cluster reaches the 75% threshold, we need to expand the number of DataNodes, typically scaling from 100 machines to 120 or 130.

  • Reducing DataNodes:
    Our cluster may include various types of machines, some of which have poor performance. In such cases, we may need to take some machines offline for replacement, which could result in a reduction in the number of DataNodes.

So, what value should we set for the parameter hdds.scm.safemode.min.datanode? It's quite difficult to assess. A default value of 1 is certainly not ideal, but when we consider a cluster with 100 machines, should we set it to 40, 50, or 60? This is also hard to determine.

However, if we set a proportion, the rule would become more flexible. For example, if we expect 60% of the DataNodes to register, this value would dynamically change over time.

  3. Grafana is a good solution, but it also has some issues in large-scale cluster environments. The problem is that there are too many metrics for each DataNode. Even with a 30-second collection interval, a single DataNode generates a large number of metrics. Currently, we have 5 clusters with over 3,000 DataNodes, which puts significant pressure on our collection system. At the moment, we collect DataNode metrics every 5 minutes, which results in delays in reflecting the actual situation in real time.

I would really appreciate any suggestions you may have. Thank you once again!

To add a bit more: HDDS-11525 is indeed a viable solution, and we do need a monitoring dashboard. I will continue to follow up on this.

@adoroszlai
Contributor

Thanks @slfan1989 for sharing details.

Currently, my approach is to manually compare the DataNode lists from the two SCMs, identify the DataNodes that failed to register, track these DataNodes, and take appropriate actions (in most cases, this involves restarting the DataNodes).

The purpose of retrieving the DataNode list from the Pipeline List is to identify any unregistered DataNodes, as shown in my screenshot.

We could implement a new (or improve an existing) ozone admin command to help with that. It may need backend changes, but not DataNodeSafeModeRule.

  2. The reason for improving the DataNodeSafeModeRule is that this rule is difficult to apply effectively in real-world usage.

I think the sentiment here is that other safemode rules should help with that. DataNodeSafeModeRule is kind of a "minimum bar" for the cluster to pass.

@errose28
Contributor

Related to this change, @nandakumar131 recently filed HDDS-11904. Maybe once that is fixed, hdds.scm.safemode.healthy.pipeline.pct will better map to the percentage of nodes that need to register before safemode is exited.

We could implement a new (or improve an existing) ozone admin command to help with that. It may need backend changes, but not DataNodeSafeModeRule.

Probably ozone admin datanode list is the command we are looking for here. This does not check against a master list of datanodes because Ozone does not attempt to maintain such a thing (nor does HDFS). If this is desired, it must be stored externally.

So, what value should we set for the parameter hdds.scm.safemode.min.datanode? It's quite difficult to assess. A default value of 1 is certainly not ideal, but when we consider a cluster with 100 machines, should we set it to 40, 50, or 60? This is also hard to determine.

I see the datanode safemode rule as a sort of smoke test that cluster-wide configs are set up correctly so SCM and DNs can communicate. Security, DNS, hostnames, etc. are all working, so if at least one datanode registers, we have reason to believe that any later safemode rule failures are happening for different reasons. I don't think the intent of this rule is really to enforce a cluster membership requirement before bringing the cluster out of safemode.

Even with a 30-second collection interval, a single DataNode generates a large number of metrics. Currently, we have 5 clusters with over 3,000 DataNodes, which puts significant pressure on our collection system.

I don't have much experience in this area but it seems like the system being monitored should not be bound by the system being used to monitor it. If Ozone is being scaled horizontally then in theory the metrics collection system should be able to scale horizontally to match. If the datanodes are doing something egregious that is causing the size of metrics data to explode then we should try to optimize that, but we shouldn't remove information that could otherwise be useful.

@slfan1989
Contributor Author

@errose28 @adoroszlai Thank you very much for your response! The feature I submitted is relatively small, and I really appreciate your attention and the detailed explanation. I agree with your point that integrating some of the functionality into DataNodeSafeModeRule may not be ideal. However, we have already rolled out this feature internally, and for my part, it meets our needs. I understand that different users may have different requirements and perspectives on the system. I plan to close this PR and set the JIRA status to "Works for me." If other community members search for this JIRA and see our discussion, and find it helpful, that would be great.

HDDS-11525 can solve our issue, so I will focus on this JIRA and do my best to contribute. I look forward to HDDS-11904 bringing better results. If a PR is submitted for the related JIRA, I will also take a look.

We could implement a new (or improve an existing) ozone admin command to help with that. It may need backend changes, but not DataNodeSafeModeRule.

I will also consider this idea, as it's a good approach. However, as @errose28 mentioned, we would need to store this information elsewhere, which would still add complexity to the system.

Thank you all again for your time!

cc: @nandakumar131 @ChenSammi

@slfan1989 slfan1989 closed this Dec 12, 2024