
HDDS-9683. Containers present on decommission nodes are reported as mis-replicated #5726

Merged (5 commits) on Jan 2, 2024

Conversation

@xBis7 (Contributor) commented Dec 4, 2023

What changes were proposed in this pull request?

When a node is decommissioning, its replicas are copied to other nodes, and once this process finishes, the node becomes decommissioned. After the copies are made, the container appears mis-replicated because of the excess replicas. These replicas are unavailable, and the decommissioned node is expected to be stopped. For that reason, containers with replicas on decommissioning or decommissioned nodes shouldn't be counted as mis-replicated.

Nodes in maintenance won't be filtered, because such nodes are expected to come back and no replica copies are made when a node enters that state. Mis-replication on a node in maintenance is the same as mis-replication on a healthy, active node.
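
For illustration only, a minimal, self-contained sketch of the idea. All names here (ReplicaNode, rack(), decommissioned()) are hypothetical; the real change lives in the placement policy's validateContainerPlacement, shown in the review diff below.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class RackCountSketch {

  // Hypothetical replacement for the real datanode/replica types.
  record ReplicaNode(String rack, boolean decommissioned) { }

  // Count replicas per rack, ignoring replicas on decommissioned nodes.
  static Map<String, Long> replicasPerRack(List<ReplicaNode> nodes) {
    return nodes.stream()
        .filter(n -> !n.decommissioned())
        .collect(Collectors.groupingBy(ReplicaNode::rack, Collectors.counting()));
  }

  public static void main(String[] args) {
    // Three healthy replicas plus one extra copy left on a decommissioning node.
    List<ReplicaNode> nodes = List.of(
        new ReplicaNode("/rack1", false),
        new ReplicaNode("/rack2", false),
        new ReplicaNode("/rack3", false),
        new ReplicaNode("/rack1", true));
    // Each rack ends up with one counted replica; the stale copy no longer skews the counts.
    System.out.println(replicasPerRack(nodes));
  }
}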

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9683

How was this patch tested?

New unit tests have been added. The change can also be tested manually with the ozone-topology docker environment as follows:

  • Edit docker-config to enable the RackScatter policy (easier to reproduce with RackScatter):

      - OZONE-SITE.XML_ozone.scm.container.placement.impl=org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware
      + #OZONE-SITE.XML_ozone.scm.container.placement.impl=org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware
      +
      + OZONE-SITE.XML_ozone.scm.container.placement.impl=org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter
      + OZONE-SITE.XML_ozone.scm.pipeline.placement.impl=org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter
      +
      + # For decommission
      + OZONE-SITE.XML_ozone.scm.nodes.scmservice=scm
      + OZONE-SITE.XML_ozone.scm.address.scmservice.scm=scm
      +
      + # Expedite the container replication checking
      + OZONE-SITE.XML_hdds.scm.replication.thread.interval=15s
  • Edit network-config:

      - 10.5.0.6	/rack1
      + 10.5.0.6	/rack2
        10.5.0.7	/rack2
      - 10.5.0.8	/rack2
      + 10.5.0.8	/rack3
      - 10.5.0.9	/rack2
      + 10.5.0.9	/rack3
  • Create a key with replication Ratis THREE:

      ozone sh key put /vol1/bucket1/key1 /etc/hosts -t=RATIS -r=THREE
  • Find the replicas for container 1 and decommission one of the replica nodes:

      bash-4.2$ ozone admin container info 1
          (note a datanode that holds a replica)
      bash-4.2$ ozone admin scm roles
          (copy the SCM IP)
      bash-4.2$ ozone admin datanode list
          (copy the datanode IP)
      bash-4.2$ ozone admin datanode decommission -id=scmservice --scm=172.23.0.2:9894 172.23.0.8/ozone-datanode-2.ozone_default
      Started decommissioning datanode(s):
      172.23.0.8/ozone-datanode-2.ozone_default
  • Once the node is decommissioned, check the SCM container report (ozone admin container report) and the Recon container page.

@xBis7 xBis7 requested review from sodonnel and adoroszlai December 4, 2023 14:50
@kerneltime (Contributor) commented:

cc @kerneltime

@kerneltime kerneltime self-requested a review December 11, 2023 17:27
@adoroszlai (Contributor) left a comment

Thanks @xBis7 for the patch. The change makes sense to me.

@adoroszlai adoroszlai changed the title HDDS-9683. Containers belonging to decommissioning or decommissioned nodes, are counted as mis-replicated HDDS-9683. Containers present on decommission nodes are reported as mis-replicated Dec 12, 2023
@xBis7 (Contributor, Author) commented Dec 12, 2023

@adoroszlai Thanks for the review. I've addressed your comments. Can you take a look?

@xBis7 xBis7 requested a review from adoroszlai December 12, 2023 14:22
@@ -445,6 +445,7 @@ public ContainerPlacementStatus validateContainerPlacement(
      }
    }
    List<Integer> currentRackCount = new ArrayList<>(dns.stream()
+       .filter(d -> !(d.isDecommissioned()))

@sodonnel (Contributor) commented Dec 12, 2023

Again here, I wonder about maintenance replicas. It would be possible for there to be extra copies of maintenance replicas in the cluster, like with decommission, and until the node goes offline, there would be too many replicas.

The problem here is not the number of racks, but that we are violating the "max nodes per rack" limit. The reason is that we pass in the expected number of nodes (repFactor) and the list of replicas, and the list of replicas is larger than the expected number.

In this case, we pass in 4 replicas (3 x in_service, 1 x decommissioning) and say we expect 3 (repFactor).

Then the calculation of max nodes per rack says:

protected int getMaxReplicasPerRack(int numReplicas, int numberOfRacks) {
    return numReplicas / numberOfRacks
            + Math.min(numReplicas % numberOfRacks, 1);
  }

So if we have 3 replicas and 2 racks, we get:

3 / 2 + min(3 % 2, 1) = 1 + min(1, 1) = 1 + 1 = 2.

But as the container is effectively over-replicated, we have to adjust the max nodes per rack upwards by the delta between the actual replica count and expected replicas.

So if we add Math.max(0, dns.size() - replicas) = 1 to that max per rack, then we get 3 allowed per rack, which means the policy would no longer be violated.

Do you think that change would fix it in the general case for any type of excess replica count?
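
To make that arithmetic concrete, here is a small self-contained sketch. The method body mirrors the snippet quoted above; the adjustment and the example numbers come from this comment, not from the merged patch.

public class MaxPerRackExample {

  // Same formula as the getMaxReplicasPerRack() snippet quoted above.
  static int getMaxReplicasPerRack(int numReplicas, int numberOfRacks) {
    return numReplicas / numberOfRacks
        + Math.min(numReplicas % numberOfRacks, 1);
  }

  public static void main(String[] args) {
    int repFactor = 3;      // expected number of replicas
    int actualReplicas = 4; // 3 x in_service + 1 x decommissioning
    int numRacks = 2;

    int maxPerRack = getMaxReplicasPerRack(repFactor, numRacks); // 3 / 2 + min(1, 1) = 2
    // Raise the limit by the excess between actual and expected replicas.
    maxPerRack += Math.max(0, actualReplicas - repFactor);       // 2 + 1 = 3

    System.out.println("Adjusted max replicas per rack: " + maxPerRack); // 3
  }
}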

@xBis7 (Contributor, Author) commented Dec 13, 2023

Again here, I wonder about maintenance replicas. It would be possible for there to be extra copies of maintenance replicas in the cluster, like with decommission, and until the node goes offline, there would be too many replicas.

Are you saying that we should occasionally filter maintenance replicas as well? We skip them, since we expect a maintenance node to not be stopped.

But as the container is effectively over-replicated, we have to adjust the max nodes per rack upwards by the delta between the actual replica count and expected replicas.

So if we add to that max per rack Math.max(0, dns.size() - replicas) = 1, then we get 3 allowed per rack, which means the policy would not be violated any more.

I don't understand this part, can you explain it more? Where can I find it in the code?

In your example (3 x in_service, 1 x decommissioning), the container isn't over-replicated as far as SCM is concerned; the replica on the decommissioned node isn't counted. But the placement policy will say that the container is mis-replicated. As long as SCM considers the container properly replicated, it makes sense to me to filter any replicas on offline nodes in the policy as well.

Contributor:

What I am saying is that we should not filter nodes here, but adjust the max-nodes-per-rack higher if there are more replicas passed in. There is the decommission scenario, maintenance, plus there are scenarios in replication manager where we maintain extra copies of unhealthy replicas too, so you could get 3 x in_service plus 1 or more unhealthy.

If you have 3 replicas, then the expectation is 2 racks and a max of 2 per rack. But what if there are 4 replicas for rep-factor 3? Then you need 2 or 3 racks, with at most 3 per rack.

So if you adjust the max per rack higher based on the number of replicas available, then I think it would fix the problem.

Contributor Author:

So if you adjust the max per rack higher based on the number of replicas available, then I think it would fix the problem.

ok, I get it. Let me look into that. Thanks!

Contributor Author:

@sodonnel I was thinking that we need to stop reporting mis-replication in the case of excess replicas due to offline nodes. Are you saying that we should include unhealthy replicas as well, and stop reporting mis-replication if we have proper replication and the excess replicas are unavailable, either because they're on offline nodes or because they're unhealthy?

@xBis7 (Contributor, Author) commented Dec 19, 2023

@sodonnel Thanks for the explanation. I agree. I'll add all that as a comment because other people might have the same questions and confusion.

Since over-replication shouldn't be considered mis-replication, we should adjust the max replicas per rack every time there are excess replicas, right? We don't need to check for decommission or maintenance to perform the adjustment.

I'm referring to something like this

int maxReplicasPerRack = getMaxReplicasPerRack(replicas,
        Math.min(requiredRacks, numRacks));

boolean offlineNodeReplicas = dns.stream()
    .anyMatch(d -> d.isMaintenance() || d.isDecommissioned());

if (offlineNodeReplicas) {
  // Adjust max replicas per rack for excessive replicas belonging to offline nodes.
  maxReplicasPerRack += Math.max(0, dns.size() - replicas);
}

Contributor:

Since over-replication shouldn't be considered mis-replication, we should adjust the max replicas per rack every time there are excess replicas, right?

Yea, I think so. We shouldn't need to consider decommission or maintenance in here. If some caller wanted to exclude decom / maint for some reason, it can filter the replicas before it passes them into the method.
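
Put together, a possible shape of the adjustment being agreed on here. This reuses the same assumed variable names as the sketch above and is not necessarily the final patch.

int maxReplicasPerRack = getMaxReplicasPerRack(replicas,
    Math.min(requiredRacks, numRacks));
// Always allow the per-rack limit to grow with the excess replica count,
// regardless of why the excess exists (decommission, maintenance, unhealthy copies).
maxReplicasPerRack += Math.max(0, dns.size() - replicas);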

Contributor Author:

ok, thanks. I'll add some tests and push the changes.

@xBis7 (Contributor, Author) commented Dec 19, 2023

@sodonnel Since we are allowing over-replication, should we also extend the check in ContainerPlacementStatusDefault#isPolicySatisfied() to something like this

cnt <= maxReplicasPerRack && maxReplicasPerRack - cnt <= 1);

so that we can report mis-replication if there is too much uneven distribution?

In the above example,

I think that one reason the max-per-rack was brought in was for EC + Rack Scatter. Say you have 3 racks and EC 3-2. The ideal distribution is 2, 2, 1. However 3, 1, 1 is also possible.

3, 1, 1 won't be reported as mis-replication with the new code addition. But if we check that the replica difference between racks isn't more than 1, we can identify such cases.

What do you think? This check might make sense only for EC containers.

Contributor:

3, 1, 1 should still be reported as mis-replicated for EC-3-2, as there should be 5 replicas. There is no excess / over replication in this example.

If we had 4, 1, 1, this has an excess of 1. It also has an ideal max-per-rack of 2, but we adjust it 1 higher to 3 due to the excess, and hence it should be reported as mis-replicated. However, 3, 2, 1 would not be.
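
To spell out that EC-3-2 arithmetic, here is a self-contained illustration using the same formula and excess adjustment discussed above; the layout numbers are the ones from this comment.

public class EcMaxPerRackExample {
  public static void main(String[] args) {
    int expectedReplicas = 5; // EC-3-2 expects 5 replicas
    int actualReplicas = 6;   // e.g. a 4, 1, 1 layout: one excess copy
    int numRacks = 3;

    int idealMax = expectedReplicas / numRacks
        + Math.min(expectedReplicas % numRacks, 1);        // 1 + 1 = 2
    int adjustedMax = idealMax
        + Math.max(0, actualReplicas - expectedReplicas);  // 2 + 1 = 3

    // A rack holding 4 replicas exceeds 3, so 4, 1, 1 is still reported as
    // mis-replicated; a 3, 2, 1 layout stays within the adjusted limit.
    System.out.println("Adjusted max replicas per rack: " + adjustedMax);
  }
}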

@xBis7 xBis7 requested a review from sodonnel December 20, 2023 21:30
@sodonnel (Contributor) left a comment

The latest revision LGTM. Sorry for being slow to come back, but I was out of the office for the past 2 weeks.

@adoroszlai adoroszlai merged commit f3872a7 into apache:master Jan 2, 2024
34 checks passed
@adoroszlai (Contributor) commented:

Thanks @xBis7 for the patch, @kerneltime, @sodonnel for the review.

adoroszlai pushed a commit to adoroszlai/ozone that referenced this pull request Jan 25, 2024
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Feb 1, 2024
…eported as mis-replicated (apache#5726)

(cherry picked from commit f3872a7)
Change-Id: I8c9332bc4183026f72776df7e5ec07dfe8a372dd