HDDS-11881. Simplify DatanodeAdminMonitorImpl Code Structure. #7542
Conversation
@adoroszlai @sodonnel Could you help review the code? Thank you very much!
cc @nandakumar131.
Is there any plan to build on these changes to add features? In general, I don't like refactoring things "just to make them nicer" unless the existing code is already very bad. Refactoring creates churn in the code, risks introducing bugs, and makes backports more difficult: if a bug fix later touches this code and needs to be backported to an earlier version, the refactoring gets in the way. Note I haven't looked at the changes in detail; I only glanced at them to see the scope of the change.
@sodonnel Thank you very much for your response! I agree with your viewpoint to some extent. We do want to contribute a new feature to the community, which I have named "Fast Decommission / Maintenance." Before proceeding, however, we need to carry out some preparatory work, and I feel this PR is one of those necessary steps.
Currently, both decommissioning and maintenance rely on UnderReplication, and replication in our system is very slow. As shown in the monitoring screenshot below, after I started decommissioning one DataNode, there were 38,000 containers to be replicated. After 18 hours, only 5,768 containers had been replicated (from 38,228 down to 32,460), and the process was accompanied by a significant amount of I/O.
We have developed the V1 version of the Fast Decommission feature. After testing, we have confirmed that we can decommission at least 10 machines simultaneously, with each machine containing 140TB of data, and the process takes about 2 days. The general approach of this solution is as follows:
This feature allows us to make better use of the system bandwidth and reduce system I/O wait.
The current time tracking metrics may have some issues, and I will fix this problem in the V2 version.
I'm curious about why replication is so slow. Are you using the new or the legacy replication manager? In the new RM we did try to make things better, and to balance replication across all the DNs holding replicas. The default settings may not be aggressive enough. Did you try tuning anything there, or dig into why the replication is slow? This is important because, while your solution may help speed up decommission, it does not help with the more serious case of a rack or several DNs going down and a large amount of replication being required to avoid data loss.
@sodonnel Thank you very much for your reply! I will conduct a deeper study of the Legacy Replication Manager and then follow up with you for further discussion.
The Legacy RM is set to be removed. I'd suggest getting onto the new RM, which solves a lot of problems with the original design, and seeing how it performs. We have scope to make it better if it is not working much faster.
Thank you for the suggestion! I reviewed the new Replication Manager's configuration parameters.
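For illustration, this kind of tuning goes in ozone-site.xml; the sketch below is illustrative only, and both the key names and the defaults should be verified against ozone-default.xml for your Ozone version:

```xml
<!-- Illustrative tuning sketch only: verify key names and defaults
     against ozone-default.xml for your Ozone version. -->
<property>
  <!-- How often the ReplicationManager re-examines containers
       (assumed default 300s); a shorter interval reacts faster. -->
  <name>hdds.scm.replication.thread.interval</name>
  <value>120s</value>
</property>
<property>
  <!-- Concurrent replication streams a datanode will serve (assumed
       default 10); raising it increases replication throughput at
       the cost of more I/O. -->
  <name>hdds.datanode.replication.streams.limit</name>
  <value>20</value>
</property>
```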
We have made appropriate adjustments to these parameters based on our cluster's situation, and they have proven helpful for us. However, I have encountered a new issue that may require further optimization. I found that some machines are unable to enter the decommissioned state. Taking bigdata-ozone071 as an example, the analysis process is as follows:
```
tail -f ozone-hadoop-scm-bigdata-ozonemaster.log | grep 'sufficientlyReplicated' | grep 'bigdata-ozone071'
```
I found that ReplicaIndex: 8 has 35 unhealthy replicas.
```
ll /data*/ozonedata/hddsdata/hdds/CID-7e95e103-a270-4948-a4a3-4f63a8a60d0c/current/containerDir* | grep "3533790"
```
I checked the contents of 3533790.container and confirmed that this container is unhealthy. The cause was a failure while reconstructing the EC block, which left an empty container. We identified the failure, but were unable to clean up the container successfully; a cleanup mechanism for this case still needs to be added. cc: @adoroszlai
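For context, a .container descriptor is a small YAML file along these lines. This is an abridged, hypothetical illustration rather than the actual file from this incident, and the exact fields vary by Ozone version:

```yaml
!<KeyValueContainerData>
containerID: 3533790
containerType: KeyValueContainer
state: UNHEALTHY          # a good closed container would show CLOSED
layOutVersion: 1
metadataPath: /data1/ozonedata/hddsdata/hdds/.../metadata
chunksPath: /data1/ozonedata/hddsdata/hdds/.../chunks
```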
Have all the unhealthy replicas used up all available DNs on the cluster, so that there are no other free hosts to create a healthy copy? What error did you see when the reconstruction failed? There have been some bugs fixed around reconstruction over time which could cause a container to not get recovered. If the container is not under-replicated (i.e. all the replicas are there and healthy), then the unhealthy ones should be cleaned up by the ClosedWithUnhealthyReplicasHandler, which should mark the container over-replicated, and then the unhealthy ones should be deleted. From the javadoc:
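To make that flow concrete, here is a minimal, self-contained sketch of the cleanup rule just described. Only ClosedWithUnhealthyReplicasHandler is a real Ozone class name; everything below is an illustrative stand-in, not the actual implementation:

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative stand-in for the cleanup rule described above: once all
// required healthy copies of a CLOSED container exist, any UNHEALTHY
// replicas are surplus and can be queued for deletion.
class UnhealthyReplicaCleanupSketch {

  enum ReplicaState { CLOSED, UNHEALTHY }

  record Replica(String datanodeId, ReplicaState state) {}

  /** Returns the replicas to delete, or an empty list if none. */
  static List<Replica> replicasToDelete(List<Replica> replicas,
                                        int requiredHealthyCount) {
    long healthy = replicas.stream()
        .filter(r -> r.state() == ReplicaState.CLOSED)
        .count();
    if (healthy < requiredHealthyCount) {
      // Under-replicated: leave recovery to the replication handlers.
      return List.of();
    }
    // Not under-replicated, so the unhealthy copies make the container
    // over-replicated; they are the ones to remove.
    return replicas.stream()
        .filter(r -> r.state() == ReplicaState.UNHEALTHY)
        .collect(Collectors.toList());
  }
}
```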
@sodonnel Thank you very much for your response! Due to some necessary upgrades and modifications to the server-room racks, we had to take a large number of machines offline. After our discussion, we adjusted some of the SCM and DN configs, and so far everything seems to be meeting expectations. The main cause of the issue I described earlier was some of our custom modifications, which caused EC reconstruction to time out; as a result, SCM kept selecting new DNs as reconstruction targets. On the positive side, we identified the issue before all DN machines had been tried, and we have already implemented some fixes. I think we should set a limit on the number of reconstruction attempts for an EC container, perhaps three; if that limit is exceeded, we should stop further reconstruction of the container until a cluster administrator intervenes. I would like to hear your thoughts on this.
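A minimal sketch of the proposed safeguard, assuming a per-container retry counter; all names below are hypothetical, and this is not existing Ozone code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: cap EC reconstruction retries per container so
// SCM stops picking fresh target DNs after repeated failures.
class EcReconstructionAttemptLimiter {

  static final int MAX_ATTEMPTS = 3;

  private final Map<Long, Integer> attempts = new ConcurrentHashMap<>();

  /** Returns true if another reconstruction attempt may be scheduled. */
  boolean tryAcquire(long containerId) {
    // Increment the per-container counter and compare against the cap.
    return attempts.merge(containerId, 1, Integer::sum) <= MAX_ATTEMPTS;
  }

  /** Reset the counter once reconstruction succeeds or an admin clears it. */
  void reset(long containerId) {
    attempts.remove(containerId);
  }
}
```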
What changes were proposed in this pull request?
Recently, while working with the Datanode maintenance mode and Datanode decommission features, we spent time reading the code in DatanodeAdminMonitorImpl. The code quality was not a problem, but I made some improvements to enhance its readability, including encapsulating some logic into classes and adding detailed comments. I hope these changes make the code easier to understand and maintain.
Fixed typos in the code: I corrected several spelling errors. These were minor issues, but it seemed best to address them alongside the other changes rather than consume reviewers' time with a separate typo-only PR.
Consolidated counting variables within the logic: I abstracted a TrackedNodeContainers object to track the replication status of the relevant containers, while preserving the idempotency of the original code (a minimal sketch of the idea follows this list).
Extracted TrackedNode to a separate class: I moved the TrackedNode logic outside of DatanodeAdminMonitorImpl to make the class more focused on business logic and cleaner overall.
Added necessary comments: I included additional comments to improve code readability for other team members.
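As referenced above, a minimal sketch of the counter-consolidation idea; the field and method names are illustrative and may differ from what the PR actually introduces:

```java
// Illustrative sketch of consolidating per-node replication counters
// into one value object; names are hypothetical, not necessarily the
// PR's actual TrackedNodeContainers API.
class TrackedNodeContainers {

  private int sufficientlyReplicated;
  private int underReplicated;
  private int unhealthy;

  void incrementSufficientlyReplicated() { sufficientlyReplicated++; }
  void incrementUnderReplicated() { underReplicated++; }
  void incrementUnhealthy() { unhealthy++; }

  /** A node's containers are fully handled only when nothing is pending. */
  boolean isReplicationComplete() {
    return underReplicated == 0 && unhealthy == 0;
  }

  @Override
  public String toString() {
    return String.format(
        "sufficientlyReplicated=%d, underReplicated=%d, unhealthy=%d",
        sufficientlyReplicated, underReplicated, unhealthy);
  }
}
```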
What is the link to the Apache JIRA
JIRA: HDDS-11881. Simplify DatanodeAdminMonitorImpl Code Structure (https://issues.apache.org/jira/browse/HDDS-11881).
How was this patch tested?
CI test.