HDDS-11843. Enhance DataNodeSafeModeRule to Use Pipeline DN List. #7541

Closed

Conversation

slfan1989
Contributor

@slfan1989 slfan1989 commented Dec 8, 2024

What changes were proposed in this pull request?

Background

We enhanced the DataNodeSafeModeRule to validate the registered DataNodes against the DN list stored in the Pipelines.

Currently, the DataNodeSafeModeRule in SCM requires users to manually configure the number of DataNodes in the cluster using the setting hdds.scm.safemode.min.datanode, which defaults to 1 if not configured. However, since clusters are frequently scaled and the number of active DataNodes is uncertain, relying on a fixed value to determine whether SafeMode conditions are met is not reliable.

To address this issue, we propose a new rule: allow DataNodeSafeModeRule to retrieve the DataNode list from the registered Pipelines and use a configurable ratio to dynamically calculate the required number of DataNodes. This provides a more accurate way to determine when the system can exit SafeMode.
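
For illustration, here is a minimal, hypothetical sketch of the proposed calculation, assuming a helper that receives the distinct DataNodes referenced by all pipelines; the class and method names are illustrative, not the actual Ozone code:

import java.util.Set;

// Sketch: derive the registration threshold from the pipeline DN list and the
// configured ratio, instead of a fixed hdds.scm.safemode.min.datanode value.
public final class PipelineBasedDnThreshold {

  public static int requiredDatanodes(Set<String> pipelineDatanodes,
                                      double reportedPct) {
    // e.g. 100 distinct DNs in pipelines * 0.1 => 10 DNs must register
    return (int) Math.ceil(pipelineDatanodes.size() * reportedPct);
  }
}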

Usage of the new rule

  • Both of the following configurations need to be set:

Configuration 1: Enable the Pipeline-based DataNode rule.

<property>
    <name>hdds.scm.safemode.datanode.use.pipeline.enabled</name>
    <value>true</value>
</property>

Configuration 2: Set the required percentage of reported DataNodes.

<property>
    <name>hdds.scm.safemode.reported.datanode.pct</name>
    <value>0.1</value>
</property>

Execution Flow of the New Rule

  • We will use the DN list from the Pipelines as the total list of DataNodes expected to register.
  • The required number of registered DataNodes is derived from this total list and the hdds.scm.safemode.reported.datanode.pct value.
  • As DataNodes continue to report and the number of registered DataNodes reaches the calculated threshold, the DataNodeSafeModeRule exits automatically.
  • At the same time, we list up to 5 DataNodes that have not yet registered, to facilitate troubleshooting of DataNodes that failed to register (a rough sketch of this flow follows the list).
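
The following is a rough, hypothetical sketch of the flow above; the class name, method signature, and logging are illustrative and not taken from the actual patch:

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative only: a simplified check matching the flow described above.
final class DatanodeRuleFlowSketch {

  // Returns true once enough of the pipeline DataNodes have registered.
  static boolean validate(Set<String> pipelineDns, Set<String> registeredDns,
                          double reportedPct) {
    int required = (int) Math.ceil(pipelineDns.size() * reportedPct);
    if (registeredDns.size() >= required) {
      return true; // rule satisfied; SafeMode can consider exiting
    }
    // List up to 5 DataNodes that have not registered yet, to aid troubleshooting.
    List<String> unregistered = pipelineDns.stream()
        .filter(dn -> !registeredDns.contains(dn))
        .limit(5)
        .collect(Collectors.toList());
    System.out.println("Waiting for " + (required - registeredDns.size())
        + " more DataNodes; sample of unregistered DNs: " + unregistered);
    return false;
  }
}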

Final Result Display

(screenshot: SafeMode status output listing the unregistered DataNodes)

What is the link to the Apache JIRA

JIRA: HDDS-11843. Enhance DataNodeSafeModeRule to Use Pipeline DN List.

How was this patch tested?

Manual Testing.

@slfan1989 slfan1989 marked this pull request as ready for review December 8, 2024 08:18
@slfan1989
Contributor Author

@sadanand48 @siddhantsangwan @adoroszlai Could you help review the code? Thank you very much!

In this PR, we have enhanced the DataNodeSafeModeRule to determine how many DataNodes need to register based on the list of DataNodes stored in the pipelines, multiplied by a configurable ratio. From my perspective, this approach is better than directly setting the number of DataNodes.

@ChenSammi
Contributor

cc @nandakumar131

@errose28
Contributor

errose28 commented Dec 9, 2024

Hi @slfan1989 can you elaborate more on the use case for this feature?

However, since clusters are frequently scaled and the number of active DataNodes is uncertain, relying on a fixed value to determine whether SafeMode conditions are met is not reliable.

I can see how the default value of 1 may not be too helpful, but what are the practical issues that the current safemode configurations are causing in your cluster? The other rules that check for container replicas and pipelines should ensure that the cluster is sufficiently operational when safemode is exited.

I have a few concerns with adding this. The pipeline list is not a definitive list of cluster membership, especially if new nodes are added along with a restart, so the rule may be more or less an approximation depending on the situation. It also adds more configuration and complexity to safemode. FWIW, HDFS does not have such a feature because it does not have fixed cluster membership by design, and Ozone does not really have fixed cluster membership either. Persistent pipelines actually add more complexity to the system than they are worth, IMO, and I don't think we should tie more features into them.

I feel like a dashboard as described in HDDS-11525 would better address the concern about nodes not registering, and that the container and pipeline rules provide stronger guarantees about the cluster's readiness than node count or percentage.

@nandakumar131
Contributor

Thanks @errose28 for the input. I agree with Ethan's point. We should try to simplify the Safemode rules, not add complexity to them. Even with the additional complexity, this doesn't add any value to the Safemode logic.

The Container and Pipeline Safemode rules ensure that the cluster is operational when it comes out of Safemode.

@slfan1989
Contributor Author

slfan1989 commented Dec 11, 2024

@errose28 @nandakumar131

Thank you very much for your response! However, please allow me to provide some clarification regarding this PR. My intention is not to increase the complexity of SafeMode, but rather to address some issues that have arisen during its use:

The reasons I added this feature are as follows:

  1. The Ozone SCM does not maintain a complete list of DataNodes, which can lead to a problem. In large clusters, when we restart the SCM, some DataNodes may fail to register, and we won't be able to locate these DataNodes (since they are not included in the DataNode list).

Currently, my approach is to manually compare the DataNode lists from the two SCMs, identify the DataNodes that failed to register, track these DataNodes, and take appropriate actions (in most cases, this involves restarting the DataNodes).

The purpose of retrieving the DataNode list from the Pipeline List is to identify any unregistered DataNodes, as shown in my screenshot.

  2. The reason for improving the DataNodeSafeModeRule is that this rule is difficult to apply effectively in real-world usage. Let me provide a specific example:
  • Adding DataNodes:
    When our cluster reaches the 75% threshold, we need to expand the number of DataNodes, typically scaling from 100 machines to 120 or 130.

  • Reducing DataNodes:
    Our cluster may include various types of machines, some of which have poor performance. In such cases, we may need to take some machines offline for replacement, which could result in a reduction in the number of DataNodes.

So, what value should we set for the parameter hdds.scm.safemode.min.datanode? It's quite difficult to assess. A default value of 1 is certainly not ideal, but when we consider a cluster with 100 machines, should we set it to 40, 50, or 60? This is also hard to determine.

However, if we set a proportion, the rule would become more flexible. For example, if we expect 60% of the DataNodes to register, this value would dynamically change over time.

  3. Grafana is a good solution, but it also has some issues in large-scale cluster environments. The problem is that there are too many metrics for each DataNode. Even with a 30-second collection interval, a single DataNode generates a large number of metrics. Currently, we have 5 clusters with over 3,000 DataNodes, which puts significant pressure on our collection system. At the moment, we collect DataNode metrics every 5 minutes, which results in delays in reflecting the actual situation in real time.

I would really appreciate any suggestions you may have. Thank you once again!

To add a bit more: HDDS-11525 is indeed a viable solution, and we do need a monitoring dashboard. I will continue to follow up on this.

@adoroszlai
Contributor

Thanks @slfan1989 for sharing details.

Currently, my approach is to manually compare the DataNode lists from the two SCMs, identify the DataNodes that failed to register, track these DataNodes, and take appropriate actions (in most cases, this involves restarting the DataNodes).

The purpose of retrieving the DataNode list from the Pipeline List is to identify any unregistered DataNodes, as shown in my screenshot.

We could implement a new (or improve an existing) ozone admin command to help with that. It may need backend changes, but not DataNodeSafeModeRule.

  2. The reason for improving the DataNodeSafeModeRule is that this rule is difficult to apply effectively in real-world usage.

I think the sentiment here is that other safemode rules should help with that. DataNodeSafeModeRule is kind of a "minimum bar" for the cluster to pass.

@errose28
Contributor

Related to this change, @nandakumar131 recently filed HDDS-11904. Maybe once that is fixed, hdds.scm.safemode.healthy.pipeline.pct will better map to the percentage of nodes that need to register before safemode is exited.

We could implement a new (or improve an existing) ozone admin command to help with that. It may need backend changes, but not DataNodeSafeModeRule.

Probably ozone admin datanode list is the command we are looking for here. This does not check against a master list of datanodes because Ozone does not attempt to maintain such a thing (nor does HDFS). If this is desired, it must be stored externally.

So, what value should we set for the parameter hdds.scm.safemode.min.datanode? It's quite difficult to assess. A default value of 1 is certainly not ideal, but when we consider a cluster with 100 machines, should we set it to 40, 50, or 60? This is also hard to determine.

I see the datanode safemode rule as a sort of smoke test that cluster-wide configs are set up correctly so SCM and DNs can communicate. Security, DNS, hostnames, etc. are all working, so if at least one datanode registers, we have reason to believe that any later safemode rule failures are happening for different reasons. I don't think the intent of this rule is really to enforce a cluster membership requirement before bringing the cluster out of safemode.

Even with a 30-second collection interval, a single DataNode generates a large number of metrics. Currently, we have 5 clusters with over 3,000 DataNodes, which puts significant pressure on our collection system.

I don't have much experience in this area but it seems like the system being monitored should not be bound by the system being used to monitor it. If Ozone is being scaled horizontally then in theory the metrics collection system should be able to scale horizontally to match. If the datanodes are doing something egregious that is causing the size of metrics data to explode then we should try to optimize that, but we shouldn't remove information that could otherwise be useful.

@slfan1989
Contributor Author

@errose28 @adoroszlai Thank you very much for your response! The feature I submitted is relatively small, and I really appreciate your attention and the detailed explanation. I agree with your point that integrating some of the functionality into DataNodeSafeModeRule may not be ideal. However, we have already rolled out this feature internally, and for my part, it meets our needs. I understand that different users may have different requirements and perspectives on the system. I plan to close this PR and set the JIRA status to "Works for me." If other community members search for this JIRA and see our discussion, and find it helpful, that would be great.

HDDS-11525 can solve our issue, so I will focus on this JIRA and do my best to contribute. I look forward to HDDS-11904 bringing better results. If a PR is submitted for the related JIRA, I will also take a look.

We could implement a new (or improve an existing) ozone admin command to help with that. It may need backend changes, but not DataNodeSafeModeRule.

I will also consider this idea, as it's a good approach. However, as @errose28 mentioned, we would need to store this information elsewhere, which would still add complexity to the system.

Thank you all again for your time!

cc: @nandakumar131 @ChenSammi

@slfan1989 slfan1989 closed this Dec 12, 2024