HDDS-9683. Containers present on decommission nodes are reported as mis-replicated #5726
Conversation
…counted as mis-replicated
cc @kerneltime
Thanks @xBis7 for the patch. The change makes sense to me.
@adoroszlai Thanks for the review. I've addressed your comments. Can you take a look?
@@ -445,6 +445,7 @@ public ContainerPlacementStatus validateContainerPlacement(
      }
    }
    List<Integer> currentRackCount = new ArrayList<>(dns.stream()
        .filter(d -> !(d.isDecommissioned()))
Again here, I wonder about maintenance replicas. It would be possible for there to be extra copies of maintenance replicas in the cluster, like with decommission, and until the node goes offline, there would be too many replicas.
The problem here is not the number of racks, but that we are violating the "max nodes per rack" limit. The reason is that we pass in the expected number of nodes (repFactor) and the list of replicas, and the list of replicas is larger than the expected number.
In this case, we pass in 4 replicas (3xin_service 1xdecommissioning) and say we expect 3 (repFactor).
Then the calculation of max nodes per rack says:
protected int getMaxReplicasPerRack(int numReplicas, int numberOfRacks) {
return numReplicas / numberOfRacks
+ Math.min(numReplicas % numberOfRacks, 1);
}
So if we have 3 replicas on 2 racks, we get:
3 / 2 + min(3 % 2, 1) = 1 + 1 = 2.
But as the container is effectively over-replicated, we have to adjust the max nodes per rack upwards by the delta between the actual replica count and expected replicas.
So if we add to that max per rack Math.max(0, dns.size() - replicas) = 1, then we get 3 allowed per rack, which means the policy would not be violated any more.
Do you think that change would fix it in the general case for any type of excess replica count?
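For illustration, here is a minimal runnable sketch of the adjustment being described (the class and the adjustedMaxPerRack helper are made up for this example, not the actual patch):

// Sketch only: relax the max-per-rack limit by the number of excess
// replicas, so extra copies on decommissioning nodes do not trip the check.
class MaxPerRackSketch {

  // Mirrors the formula quoted above.
  static int getMaxReplicasPerRack(int numReplicas, int numberOfRacks) {
    return numReplicas / numberOfRacks
        + Math.min(numReplicas % numberOfRacks, 1);
  }

  // Hypothetical helper showing the suggested delta adjustment.
  static int adjustedMaxPerRack(int repFactor, int actualReplicaCount,
      int numberOfRacks) {
    int base = getMaxReplicasPerRack(repFactor, numberOfRacks);
    return base + Math.max(0, actualReplicaCount - repFactor);
  }

  public static void main(String[] args) {
    // repFactor 3 on 2 racks: base = 3 / 2 + min(3 % 2, 1) = 2.
    // With 4 replicas present (3x in_service + 1x decommissioning) the
    // excess is 1, so the allowed max per rack becomes 3.
    System.out.println(adjustedMaxPerRack(3, 4, 2)); // prints 3
  }
}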
> Again here, I wonder about maintenance replicas. It would be possible for there to be extra copies of maintenance replicas in the cluster, like with decommission, and until the node goes offline, there would be too many replicas.
Are you saying that we should sometimes filter maintenance replicas as well? We skip them, since we expect a maintenance node not to be stopped.
> But as the container is effectively over-replicated, we have to adjust the max nodes per rack upwards by the delta between the actual replica count and expected replicas.
> So if we add to that max per rack Math.max(0, dns.size() - replicas) = 1, then we get 3 allowed per rack, which means the policy would not be violated any more.
I don't understand this part, can you explain it more? Where can I find it in the code?
In your example (3x in_service, 1x decommissioning), the container isn't over-replicated for SCM. The replica on the decommissioning node isn't counted, but the placement policy will say that the container is mis-replicated. As long as SCM considers the container properly replicated, it makes sense to me to filter any replicas on offline nodes in the policy as well.
What I am saying is that we should not filter nodes here, but adjust the max-nodes-per-rack higher if there are more replicas passed in. There is the decommission scenario, maintenance, plus there are scenarios in replication manager where we maintain extra copies of unhealthy replicas too, so you could get 3 x in_service plus 1 or more unhealthy.
If you have 3 replicas, then the expectation is 2 racks and a max of 2 per rack. But what if there are 4 replicas for rep-factor 3? Then you need 2 or 3 racks, with at most 3 per rack.
So if you adjust the max per rack higher based on the number of replicas available, then I think it would fix the problem.
> So if you adjust the max per rack higher based on the number of replicas available, then I think it would fix the problem.
ok, I get it. Let me look into that. Thanks!
@sodonnel I was thinking that we need to stop reporting mis-replication in case of excess replicas due to offline nodes. Are you saying that we should include unhealthy replicas as well? Stop reporting mis-replication if we have proper replication and the excess replicas are unavailable, either due to offline nodes or being unhealthy?
@sodonnel Thanks for the explanation. I agree. I'll add all that as a comment because other people might have the same questions and confusion.
Since over-replication shouldn't be considered as mis-replication, we should adjust the max replicas per rack whenever there are excess replicas, right? We don't need to check for decommission or maintenance to perform the operation.
I'm referring to something like this
int maxReplicasPerRack = getMaxReplicasPerRack(replicas,
Math.min(requiredRacks, numRacks));
boolean offlineNodeReplicas = dns.stream()
.anyMatch(d -> d.isMaintenance() || d.isDecommissioned());
if (offlineNodeReplicas) {
// Adjust max replicas per rack for excessive replicas belonging to offline nodes.
maxReplicasPerRack += Math.max(0, dns.size() - replicas);
}
> Since over-replication shouldn't be considered as mis-replication, we should adjust the max replicas per rack whenever there are excess replicas, right?
Yea, I think so. We shouldn't need to consider decommission or maintenance in here. If some caller wanted to exclude decom / maint for some reason, it can filter the replicas before it passes them into the method.
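For illustration, a minimal sketch of that simplification, building on the snippet proposed above (the helper name and parameters are assumptions, not the merged code):

// Sketch: apply the adjustment unconditionally. Any excess replicas,
// whatever the cause (decommission, maintenance, extra unhealthy copies),
// raise the allowed max per rack by the same delta. A caller that wants to
// ignore offline replicas can filter the dn list before calling the policy.
private int adjustedMaxReplicasPerRack(int replicas, int requiredRacks,
    int numRacks, List<DatanodeDetails> dns) {
  int maxReplicasPerRack = getMaxReplicasPerRack(replicas,
      Math.min(requiredRacks, numRacks));
  return maxReplicasPerRack + Math.max(0, dns.size() - replicas);
}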
ok, thanks. I'll add some tests and push the changes.
@sodonnel Since we are allowing over-replication, should we also extend the check in ContainerPlacementStatusDefault#isPolicySatisfied()
to something like this
cnt <= maxReplicasPerRack && maxReplicasPerRack - cnt <= 1);
so that we can report mis-replication if there is too much uneven distribution?
In the above example:
> I think that one reason the max-per-rack was brought in was for EC + Rack Scatter. Say you have 3 racks and EC 3-2. The ideal distribution is 2, 2, 1. However 3, 1, 1 is also possible.
3, 1, 1 won't be reported as mis-replication with the new code addition. But if we check that the replica difference between racks isn't more than 1, we can identify such cases.
What do you think? This check might make sense only for EC containers.
3, 1, 1 should still be reported as mis-replicated for EC-3-2, as there should be 5 replicas. There is no excess / over replication in this example.
If we had 4, 1, 1 - this has an excess of 1. It also has an ideal max-per-rack of 2, but we adjust it 1 higher to 3 due to the excess, and hence it should be reported as mis-replicated. However, 3, 2, 1 would not be.
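Working through those numbers with the quoted formula plus the excess adjustment (a standalone sketch, not the production code):

// EC-3-2 expects 5 replicas spread over 3 racks.
class EcRackCases {

  // Base max-per-rack for the expected count, raised by any excess replicas.
  static int maxPerRack(int expected, int racks, int actual) {
    int base = expected / racks + Math.min(expected % racks, 1);
    return base + Math.max(0, actual - expected);
  }

  public static void main(String[] args) {
    // Base = 5 / 3 + min(5 % 3, 1) = 2.
    // 3,1,1 -> 5 replicas, no excess, max stays 2: the rack holding 3 violates it.
    System.out.println(3 > maxPerRack(5, 3, 5)); // true  -> mis-replicated
    // 4,1,1 -> 6 replicas, excess 1, max becomes 3: the rack holding 4 violates it.
    System.out.println(4 > maxPerRack(5, 3, 6)); // true  -> mis-replicated
    // 3,2,1 -> 6 replicas, excess 1, max becomes 3: no rack exceeds it.
    System.out.println(3 > maxPerRack(5, 3, 6)); // false -> acceptable
  }
}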
The latest revision LGTM. Sorry for being slow to come back, but I was out of the office for the past 2 weeks.
Thanks @xBis7 for the patch, @kerneltime, @sodonnel for the review.
…is-replicated (apache#5726) (cherry picked from commit f3872a7)
…eported as mis-replicated (apache#5726) (cherry picked from commit f3872a7) Change-Id: I8c9332bc4183026f72776df7e5ec07dfe8a372dd
What changes were proposed in this pull request?
When a node is decommissioning, new replicas are copied to other nodes, and once this process has finished, the node goes into decommission. After the copies are made, the container appears as mis-replicated due to the excess replicas. These replicas are unavailable and the decommissioned node is expected to be stopped. For that reason, containers that belong to decommissioning or decommissioned nodes shouldn't be counted as mis-replicated.
Nodes in maintenance won't be filtered, because such nodes are expected to come back and no replica copies are made while entering that state. Mis-replication on a node in maintenance is the same as mis-replication on a healthy and active node.
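For illustration, a minimal sketch of the filtering described above, mirroring the diff shown earlier in the conversation (the surrounding variable names are assumptions):

// Sketch: ignore replicas that sit on decommissioning/decommissioned nodes
// before the per-rack counting, so their temporary extra copies are not
// reported as mis-replication. Nodes in maintenance are deliberately kept.
List<DatanodeDetails> countedNodes = dns.stream()
    .filter(d -> !d.isDecommissioned())
    .collect(Collectors.toList());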
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-9683
How was this patch tested?
New unit tests are added. It can also be tested manually with the ozone-topology docker env like so:
- update docker-config (to enable the RackScatter policy; easier to reproduce with RackScatter) and network-config as needed
- ozone sh key put /vol1/bucket1/key1 /etc/hosts -t=RATIS -r=THREE
- ozone admin container report and check the Recon container page.