
HDDS-11881. Simplify DatanodeAdminMonitorImpl Code Structure. #7542

Draft
wants to merge 2 commits into master

Conversation

slfan1989
Contributor

@slfan1989 slfan1989 commented Dec 8, 2024

What changes were proposed in this pull request?

Background

Recently, while working with the Datanode maintenance mode and Datanode decommission features, we spent some time reading the code in DatanodeAdminMonitorImpl. Although the code quality was not a problem, I made some improvements to enhance its readability, including encapsulating some classes and adding detailed comments. I hope these changes make the code easier to understand and maintain, and that they will be recognized by the team.

Improvements include the following:

  • Fixed typos in the code: I corrected several spelling errors. These were minor issues, but I thought it would be best to address them alongside the other changes. I didn't submit a separate PR for typos, as doing so would consume the reviewer's time unnecessarily.

  • Consolidated counting variables within the logic: I abstracted a TrackedNodeContainers object to track the replication status of the relevant containers (a rough sketch follows this list). This change maintains the idempotency of the original code.

  • Extracted TrackedNode to a separate class: I moved the TrackedNode logic outside of DatanodeAdminMonitorImpl to make the class more focused on business logic and cleaner overall.

  • Added necessary comments: I included additional comments to improve code readability for other team members.
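
To give a rough idea of the direction, here is a minimal sketch of the counting abstraction mentioned above; the class and method names are illustrative and not necessarily the ones used in this PR:

public final class TrackedNodeContainers {
  // Per-node counters, matching the summary the monitor already logs:
  // "<n> sufficientlyReplicated, <n> deleting, <n> underReplicated and <n> unclosed containers".
  private int sufficientlyReplicated;
  private int deleting;
  private int underReplicated;
  private int unclosed;

  public void incrSufficientlyReplicated() { sufficientlyReplicated++; }
  public void incrDeleting() { deleting++; }
  public void incrUnderReplicated() { underReplicated++; }
  public void incrUnclosed() { unclosed++; }

  // Illustrative completion check: no under-replicated and no unclosed containers remain.
  public boolean isReplicationComplete() {
    return underReplicated == 0 && unclosed == 0;
  }

  @Override
  public String toString() {
    return sufficientlyReplicated + " sufficientlyReplicated, " + deleting
        + " deleting, " + underReplicated + " underReplicated and "
        + unclosed + " unclosed containers";
  }
}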

What is the link to the Apache JIRA

JIRA: HDDS-11881. Simplify DatanodeAdminMonitorImpl Code Structure.

How was this patch tested?

CI test.

@slfan1989 slfan1989 marked this pull request as ready for review December 8, 2024 08:18
@slfan1989
Contributor Author

@adoroszlai @sodonnel Could you help review the code? Thank you very much!

@ChenSammi
Contributor

ChenSammi commented Dec 9, 2024

cc @nandakumar131 .

@sodonnel
Contributor

sodonnel commented Dec 9, 2024

Is there any plan to build on these changes to add features? In general, I don't like refactoring things "just to make them nicer" unless things are already very bad. It creates churn in the code, risks introducing bugs, and makes backports more difficult if this code later needs a bug fix that has to be backported to an earlier version.

Note I haven't looked at the changes in detail. I only quickly looked to see what the scope of the change was.

@slfan1989
Copy link
Contributor Author

Is there any plan to build on these changes to add features? In general, I don't like refactoring things "just to make them nicer" unless things are already very bad. It creates churn in the code, risks introducing bugs, and makes backports more difficult if this code later needs a bug fix that has to be backported to an earlier version.

Note I haven't looked at the changes in detail. I only quickly looked to see what the scope of the change was.

@sodonnel Thank you very much for your response! I do agree with your viewpoint to some extent. We indeed want to contribute a new feature to the community, and I have named this feature "Fast Decommission / Maintenance." However, before proceeding, we do need to carry out some preparatory work, and I feel that this PR is one of those necessary steps.

Background

Currently, both decommissioning and maintenance rely on under-replication handling, and replication in our cluster is very slow. As shown in the monitoring screenshot below, after I started decommissioning one DataNode, there were about 38,000 containers to be replicated. After 18 hours, only 5,768 containers had been replicated (from 38,228 down to 32,460), and the process was accompanied by a significant amount of I/O.

[monitoring screenshot]

New Solution

We have developed the V1 version of the Fast Decommission feature. After testing, we have confirmed that we can decommission at least 10 machines simultaneously, with each machine containing 140TB of data, and the process takes about 2 days. The general approach of this solution is as follows:

  1. The SCM creates a decommissioning execution plan based on the decommissioning DataNodes and the global data storage state (which containers should be transferred to which target DataNodes). However, the SCM is not responsible for executing this plan.
  2. The SCM sends the execution plan generated in step 1 to the DataNodes that need to be decommissioned. The decommissioning DataNodes are then responsible for transferring the containers to the corresponding target DataNodes (a rough sketch follows below).
  • Monitoring Screenshot


This feature allows us to make better use of the system bandwidth and reduce system I/O wait.

  • DataNode Transfer Screenshot


The current time tracking metrics may have some issues, and I will fix this problem in the V2 version.
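
To make the two-phase split described above a bit more concrete, here is a rough, hypothetical sketch of the plan object the SCM could hand to a decommissioning DataNode; all names are illustrative and do not reflect the actual V1 implementation:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hdds.protocol.DatanodeDetails;

// Hypothetical sketch only: the SCM builds the plan (step 1), and the
// decommissioning DataNode iterates over it and pushes its local containers
// to the chosen targets (step 2).
final class DecommissionPlan {
  // containerId -> target DataNode chosen from the global storage state.
  private final Map<Long, DatanodeDetails> containerToTarget = new HashMap<>();

  void addMove(long containerId, DatanodeDetails target) {
    containerToTarget.put(containerId, target);
  }

  Map<Long, DatanodeDetails> getMoves() {
    return Collections.unmodifiableMap(containerToTarget);
  }
}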

@sodonnel
Contributor

I'm curious about why it is so slow to replicate. Are you using the new or legacy replication manager? In the new RM, we did try to make things better, and try to balance the replication across all the DNs with replicas. The default settings may not be aggressive enough.

Did you try tuning anything there or dig into why the replication is slow? This is important, as while your solution may help speed up decommission, it does not help with the more serious case of a rack or several DNs going down and a large amount of replication being required to avoid data loss.

@slfan1989
Contributor Author

I'm curious about why it is so slow to replicate. Are you using the new or legacy replication manager? In the new RM, we did try to make things better, and try to balance the replication across all the DNs with replicas. The default settings may not be aggressive enough.

Did you try tuning anything there or dig into why the replication is slow? This is important, as while your solution may help speed up decommission, it does not help with the more serious case of a rack or several DNs going down and a large amount of replication being required to avoid data loss.

@sodonnel Thank you very much for your reply! I will conduct a deeper study of the Legacy Replication Manager and then follow up with you for further discussion.

@sodonnel
Contributor

I will conduct a deeper study of the Legacy Replication Manager and then follow up with you for further discussion.

The Legacy RM is set to be removed. I'd suggest getting onto the new RM which solves a lot of problems with the original design and see how it performs. We have scope to make it better if it is not working much faster.

@slfan1989
Contributor Author

I will conduct a deeper study of the Legacy Replication Manager and then follow up with you for further discussion.

The Legacy RM is set to be removed. I'd suggest getting onto the new RM which solves a lot of problems with the original design and see how it performs. We have scope to make it better if it is not working much faster.

@sodonnel

Thank you for the suggestion! I reviewed the ReplicationManager-related code and configuration and identified two key parameters:

  • hdds.scm.replication.inflight.limit.factor
  • hdds.scm.replication.datanode.replication.limit

We have made appropriate adjustments to these parameters based on our cluster's situation, and they have proven helpful for us.
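
For reference, the corresponding ozone-site.xml entries look roughly like this; the values below are only examples, and the right numbers depend on the cluster:

<property>
  <name>hdds.scm.replication.datanode.replication.limit</name>
  <!-- Example value only; raise it gradually and watch DataNode I/O. -->
  <value>40</value>
</property>
<property>
  <name>hdds.scm.replication.inflight.limit.factor</name>
  <!-- Example value only. -->
  <value>0.9</value>
</property>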

However, I have encountered a new issue that may require further optimization.

I found that some machines are unable to enter the decommissioned state. Taking bigdata-ozone071 as an example, the analysis process is as follows:

  • Step 1. Analyze the SCM logs.

tail -f ozone-hadoop-scm-bigdata-ozonemaster.log|grep 'sufficientlyReplicated'|grep 'bigdata-ozone071'

2024-12-27 09:10:49,930 [scm20-DatanodeAdminManager-0] INFO org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl: 87f51602-740b-4658-aade-0d01525460b7(bigdata-ozone071/xx.xx.xxx.xx) has 35604 sufficientlyReplicated, 0 deleting, 
65 underReplicated and 18 unclosed containers
  • Step 2. Enable DEBUG logging for DatanodeAdminMonitorImpl on the SCM.
2024-12-26 19:48:49,834 [scm20-DatanodeAdminManager-0] DEBUG org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl: 
87f51602-740b-4658-aade-0d01525460b7(bigdata-ozone071/xx.xx.xxx.xx) 
has 65 underReplicated [#2880767, #1047176, #2882549, #196344, #1048593, #2884552, #1049945, #1049979, #1050457, #1050467, #1050603, #1050772, #2951936, #1051434, #1051630, #1051875, #1052169, #1052236, #200415, #1052722, #1052786, #3477663, #1053006, #1053322, #2953943, #2889201, #2954850, #1055567, #1055648, #1056624, #1056767, #1057345, #2892835, #1058334, #1058391, #1058539, #2567657, #81316, #2965464, #82312, #2966641, #2967125, #2772683, #86423, #89872, #2908872, #2976269, #2910782, #92979, #2978544, #162430, #2915684, #2983752, #2987173, #3316167, #2988603, #107475, #173256, #2992090, #2929304, #3001030, #2870577, #2870715, #3002129, #2875862] 
and 18 unclosed [#3536258, #3539106, #3539454, #3541506, #3546006, #3547041, #3549110, #2511453, #3561382, #2454668, #3513040, #3515030, #3516597, #3516577, #3516647, #3523054, #3529943, #3533790] containers
  • Step 3. On the SCM, run ozone admin container info 3533790.

I found that ReplicaIndex: 8 has 35 unhealthy replicas.

Container id: 3533790
Pipeline id: fc36f759-9ba2-426b-aa0e-31ef36c1e90b
Write PipelineId: 7ae9e4bd-a60c-46ce-919f-8988f4b90771
Write Pipeline State: CLOSED
Container State: CLOSED
Datanodes: [d15329a6-fa24-49b6-ab7a-f4b8ec879032/bigdata-ozone3385,
0467a91b-9370-4ce8-a072-a43be3d357cd/bigdata-ozone3358,
ba7af4ce-60b8-49ac-89c7-7566723d9c80/bigdata-ozone3040,
e65cdf8b-56ee-4f33-992e-e9efaeeb83d1/bigdata-ozone3458,
1d526c91-150c-45e4-938f-e1d0fef397d9/bigdata-ozone3185,
b1001902-3e93-4ce0-9682-a95fc760c38b/bigdata-ozone3540,
......
87f51602-740b-4658-aade-0d01525460b7/bigdata-ozone071]
Replicas: [State: CLOSED; ReplicaIndex: 1; Origin: bd44548c-b261-4450-9132-ba356799cddd; Location: bd44548c-b261-4450-9132-ba356799cddd/bigdata-ozone2610,
State: CLOSED; ReplicaIndex: 2; Origin: 35fe9bbb-369d-44c5-b35e-c0cda545d64f; Location: 35fe9bbb-369d-44c5-b35e-c0cda545d64f/bigdata-ozone3316,
State: CLOSED; ReplicaIndex: 3; Origin: 11af1de2-bf2e-49a6-bc78-37ff4bc0c8b9; Location: 11af1de2-bf2e-49a6-bc78-37ff4bc0c8b9/bigdata-ozone3342,
State: CLOSED; ReplicaIndex: 4; Origin: 2e42b239-eda4-43fb-ad3b-4af5191fd496; Location: 2e42b239-eda4-43fb-ad3b-4af5191fd496/bigdata-ozone2721,
State: CLOSED; ReplicaIndex: 5; Origin: 06b0bfdd-d7c4-427a-ab90-58e680af21d1; Location: 06b0bfdd-d7c4-427a-ab90-58e680af21d1/bigdata-ozone3006,
State: CLOSED; ReplicaIndex: 6; Origin: b50ce961-3425-43b6-b174-ece51b76f088; Location: b50ce961-3425-43b6-b174-ece51b76f088/bigdata-ozone3351,
State: CLOSED; ReplicaIndex: 7; Origin: b2a2658e-d61c-4eab-a701-6defb5ce6f52; Location: b2a2658e-d61c-4eab-a701-6defb5ce6f52/bigdata-ozone3552,
State: UNHEALTHY; ReplicaIndex: 8; Origin: d15329a6-fa24-49b6-ab7a-f4b8ec879032; Location: d15329a6-fa24-49b6-ab7a-f4b8ec879032/bigdata-ozone3385,
State: UNHEALTHY; ReplicaIndex: 8; Origin: 0467a91b-9370-4ce8-a072-a43be3d357cd; Location: 0467a91b-9370-4ce8-a072-a43be3d357cd/bigdata-ozone3358,
State: UNHEALTHY; ReplicaIndex: 8; Origin: ba7af4ce-60b8-49ac-89c7-7566723d9c80; Location: ba7af4ce-60b8-49ac-89c7-7566723d9c80/bigdata-ozone3040,
State: UNHEALTHY; ReplicaIndex: 8; Origin: e65cdf8b-56ee-4f33-992e-e9efaeeb83d1; Location: e65cdf8b-56ee-4f33-992e-e9efaeeb83d1/bigdata-ozone3458,
State: UNHEALTHY; ReplicaIndex: 8; Origin: 1d526c91-150c-45e4-938f-e1d0fef397d9; Location: 1d526c91-150c-45e4-938f-e1d0fef397d9/bigdata-ozone3185,
State: UNHEALTHY; ReplicaIndex: 8; Origin: b1001902-3e93-4ce0-9682-a95fc760c38b; Location: b1001902-3e93-4ce0-9682-a95fc760c38b/bigdata-ozone3540,
.......
State: UNHEALTHY; ReplicaIndex: 8; Origin: 87f51602-740b-4658-aade-0d01525460b7; Location: 87f51602-740b-4658-aade-0d01525460b7/bigdata-ozone071,
State: UNHEALTHY; ReplicaIndex: 8; Origin: 881d3593-d6d1-4010-8b41-d1205c0551e3; Location: 881d3593-d6d1-4010-8b41-d1205c0551e3/bigdata-ozone3315,
State: UNHEALTHY; ReplicaIndex: 8; Origin: e3b3b130-9241-4ad7-beee-9a8015aeccd7; Location: e3b3b130-9241-4ad7-beee-9a8015aeccd7/bigdata-ozone2746,
State: UNHEALTHY; ReplicaIndex: 8; Origin: 9d9b4a76-664d-4aa3-9b8e-60ed7016002b; Location: 9d9b4a76-664d-4aa3-9b8e-60ed7016002b/bigdata-ozone3197,
State: UNHEALTHY; ReplicaIndex: 8; Origin: 335ee024-79d4-4ec5-bd5c-46b4c90aa2d6; Location: 335ee024-79d4-4ec5-bd5c-46b4c90aa2d6/bigdata-ozone3399,
State: UNHEALTHY; ReplicaIndex: 8; Origin: fef6e5df-6742-4615-827c-3ebf18136541; Location: fef6e5df-6742-4615-827c-3ebf18136541/bigdata-ozone3420,
State: UNHEALTHY; ReplicaIndex: 8; Origin: c5cf4fb3-dc3c-4587-984c-8124d52f895a; Location: c5cf4fb3-dc3c-4587-984c-8124d52f895a/bigdata-ozone2780,
State: CLOSED; ReplicaIndex: 9; Origin: d9fc6568-28ea-4070-b12c-ec1f7b3f1e0c; Location: d9fc6568-28ea-4070-b12c-ec1f7b3f1e0c/bigdata-ozone3080
  • Step 4. Select a specific DN to continue tracking; I continued checking the issue on bigdata-ozone071.

  • DN audit log

2024-12-16 09:51:39,121 | INFO  | DNAudit | op=CREATE_CONTAINER | {containerID=3533790, containerType=KeyValueContainer} | ret=SUCCESS |
  • DN log
2024-12-16 09:51:38,710 [nullContainerReplicationThread-1] INFO org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask: IN_PROGRESS reconstructECContainersCommand: containerID=3533790, replication=rs-6-3-1024k, missingIndexes=[8], 
sources={1=bd44548c-b261-4450-9132-ba356799cddd(bigdata-ozone2610/xx.xx.xxx.xx), 2=35fe9bbb-369d-44c5-b35e-c0cda545d64f(bigdata-ozone3316/xx.xx.xxx.xx), 3=11af1de2-bf2e-49a6-bc78-37ff4bc0c8b9(bigdata-ozone3342/xx.xx.xxx.xx), 4=2e42b239-eda4-43fb-ad3b-4af5191fd496(bigdata-ozone2721/xx.xx.xxx.xx), 5=06b0bfdd-d7c4-427a-ab90-58e680af21d1(bigdata-ozone3006/xx.xx.xx.xx), 6=b50ce961-3425-43b6-b174-ece51b76f088(bigdata-ozone3351/xx.xx.xx.xx), 7=b2a2658e-d61c-4eab-a701-6defb5ce6f52(bigdata-ozone3552/10.93.41.17), 9=d9fc6568-28ea-4070-b12c-ec1f7b3f1e0c(bigdata-ozone3080/xx.xx.xx.xx)}, targets={8=87f51602-740b-4658-aade-0d01525460b7(bigdata-ozone071/xx.xx.xxx.xx)}
2024-12-16 09:51:39,113 [nullContainerReplicationThread-1] ERROR org.apache.hadoop.hdds.scm.XceiverClientGrpc: Failed to execute command CreateContainer on the pipeline Pipeline[ Id: 87f51602-740b-4658-aade-0d01525460b7, Nodes: 87f51602-740b-4658-aade-0d01525460b7(bigdata-ozone071/xx.xx.xxx.xx), excludedSet: 87f51602-740b-4658-aade-0d01525460b7(bigdata-ozone071/xx.xx.xxx.xx), ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:, CreationTimestamp2024-12-16T09:51:39.070209248+08:00].
2024-12-16 09:51:39,113 [nullContainerReplicationThread-1] WARN org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator: Exception while reconstructing the container 3533790. Cleaning up all the recovering containers in the reconstruction process.
java.io.IOException: java.util.concurrent.ExecutionException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 0.029807841s. [closed=[], open=[[buffered_nanos=331949, remote_addr=xx.xx.xxx.xx/xx.xx.xxx.xx:9859]]]
        at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:443)
        at org.apache.hadoop.hdds.scm.XceiverClientGrpc.lambda$sendCommandWithTraceIDAndRetry$0(XceiverClientGrpc.java:342)
        at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:169)
        at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
        at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:337)
        at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:318)
        at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.createContainer(ContainerProtocolCalls.java:542)
        at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.createRecoveringContainer(ContainerProtocolCalls.java:495)
        at org.apache.hadoop.ozone.container.ec.reconstruction.ECContainerOperationClient.createRecoveringContainer(ECContainerOperationClient.java:178)
        at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECContainerGroup(ECReconstructionCoordinator.java:190)
        at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask.runTask(ECReconstructionCoordinatorTask.java:68)
        at org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:369)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.util.concurrent.ExecutionException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 0.029807841s. 
[closed=[], open=[[buffered_nanos=331949, remote_addr=xx.xx.xxx.xx/xx.xx.xx.xx:9859]]]
        at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
        at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
        at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:412)
        ... 14 more
  • Step 5. Locate container 3533790 on disk.

ll /data*/ozonedata/hddsdata/hdds/CID-7e95e103-a270-4948-a4a3-4f63a8a60d0c/current/containerDir*|grep "3533790"

/data5/ozonedata/hddsdata/hdds/CID-7e95e103-a270-4948-a4a3-4f63a8a60d0c/current/containerDir245:
total 0
drwxrwxr-x 4 hadoop hadoop 48 Jun 16  2024 125476
.....
drwxrwxr-x 4 hadoop hadoop 48 Dec 16 09:51 3533790
  • Step 6. Check the on-disk contents of container 3533790.
    tree 3533790
./3533790
├── chunks
└── metadata
    ├── 3533790.container
    └── db
        ├── block_data.data
        ├── deleted_blocks.data
        ├── delete_txns.data
        └── metadata.data

I checked the contents of 3533790.container and confirmed that this container is unhealthy. The root cause was a failure while reconstructing the EC container, which left behind an empty container. We identified the failure, but were unable to clean up the container successfully.

Further optimization is still needed here.

cc: @adoroszlai

@sodonnel
Contributor

sodonnel commented Jan 8, 2025

Have all the unhealthy replicas used up all available DNs on the cluster, so that there are no other free hosts to create a healthy copy?

What error did you see when the reconstruction failed? There have been some bugs fixed around reconstruction over time, which could cause a container to not get recovered.

If the container is not under-replicated (i.e. all the replicas are present and healthy), then the unhealthy ones should be cleaned up by the ClosedWithUnhealthyReplicasHandler, which should mark the container over-replicated so that the unhealthy replicas are then deleted. From the javadoc:

   * Handles a closed EC container with unhealthy replicas. Note that if we
   * reach here, there is no over or under replication. This handler
   * will just send commands to delete the unhealthy replicas.
   *
   * <p>
   * Consider the following set of replicas for a closed EC 3-2 container:
   * Replica Index 1: Closed
   * Replica Index 2: Closed
   * Replica Index 3: Closed replica, Unhealthy replica (2 replicas)
   * Replica Index 4: Closed
   * Replica Index 5: Closed
   *
   * In this case, the unhealthy replica of index 3 should be deleted. The
   * container will be marked over replicated as the unhealthy replicas need
   * to be removed.
   * </p>

@slfan1989
Contributor Author

Have all the unhealthy replicas used up all available DNs on the cluster, so that there are no other free hosts to create a healthy copy?

What error did you see when the reconstruction failed? There have been some bugs fixed around reconstruction over time, which could cause a container to not get recovered.

If the container is not under-replicated (i.e. all the replicas are present and healthy), then the unhealthy ones should be cleaned up by the ClosedWithUnhealthyReplicasHandler, which should mark the container over-replicated so that the unhealthy replicas are then deleted. From the javadoc:

@sodonnel Thank you very much for your response! Due to some necessary upgrades and modifications to the server room racks, we had to take a large number of machines offline. After our discussion, we adjusted some of the SCM and DN configuration, and so far everything seems to be meeting expectations.

The main cause of the issue I described earlier was some of our custom modifications, which led to EC timing out during reconstruction. As a result, SCM kept selecting new DNs as the target for reconstruction. However, the positive aspect is that we identified the issue before all DN machines had been retried, and we have already implemented some fixes.

I think we should set a limit on the number of reconstruction attempts for EC containers, perhaps 3 times. If the attempts exceed 3, we should stop further reconstruction of that container until the cluster administrator intervenes. I would like to hear your thoughts on this.
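
A minimal sketch of the idea, assuming an attempt counter kept per container on the SCM side; the class and method names below are hypothetical, not existing Ozone APIs:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of capping EC reconstruction retries per container.
final class ReconstructionAttemptTracker {
  private static final int MAX_ATTEMPTS = 3;
  private final Map<Long, Integer> attempts = new ConcurrentHashMap<>();

  // Returns true if another reconstruction may still be scheduled for this container.
  boolean tryScheduleReconstruction(long containerId) {
    int used = attempts.merge(containerId, 1, Integer::sum);
    return used <= MAX_ATTEMPTS;
  }

  // Clear the counter once the container has been reconstructed successfully,
  // or after an administrator has intervened.
  void onReconstructionSuccess(long containerId) {
    attempts.remove(containerId);
  }
}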

@adoroszlai adoroszlai marked this pull request as draft January 9, 2025 15:35