
HDDS-9683. Containers present on decommission nodes are reported as mis-replicated #5726

Merged (5 commits) on Jan 2, 2024

Conversation

@xBis7 (Contributor) commented Dec 4, 2023

What changes were proposed in this pull request?

When a node is decommissioning, its replicas are copied to other nodes, and once this process finishes, the node becomes decommissioned. After the copies are made, the container appears mis-replicated because of the excess replicas. These replicas are unavailable, and the decommissioned node is expected to be stopped. For that reason, containers with replicas on decommissioning or decommissioned nodes shouldn't be counted as mis-replicated.

Nodes in maintenance won't be filtered, because such nodes are expected to come back and no replica copies are made when a node enters that state. Mis-replication on a node in maintenance is the same as mis-replication on a healthy, active node.
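
For illustration only, a minimal, self-contained sketch of the idea. All names here (ReplicaNode, rack(), decommissioned()) are hypothetical; the real change lives in the placement policy's validateContainerPlacement, shown in the review diff below.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class RackCountSketch {

  // Hypothetical replacement for the real datanode/replica types.
  record ReplicaNode(String rack, boolean decommissioned) { }

  // Count replicas per rack, ignoring replicas on decommissioned nodes.
  static Map<String, Long> replicasPerRack(List<ReplicaNode> nodes) {
    return nodes.stream()
        .filter(n -> !n.decommissioned())
        .collect(Collectors.groupingBy(ReplicaNode::rack, Collectors.counting()));
  }

  public static void main(String[] args) {
    // Three healthy replicas plus one extra copy left on a decommissioning node.
    List<ReplicaNode> nodes = List.of(
        new ReplicaNode("/rack1", false),
        new ReplicaNode("/rack2", false),
        new ReplicaNode("/rack3", false),
        new ReplicaNode("/rack1", true));
    // Each rack ends up with one counted replica; the stale copy no longer skews the counts.
    System.out.println(replicasPerRack(nodes));
  }
}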

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9683

How was this patch tested?

New unit tests have been added. The change can also be tested manually with the ozone-topology docker environment as follows:

  • Edit docker-config to enable the RackScatter policy (easier to reproduce with RackScatter):

      - OZONE-SITE.XML_ozone.scm.container.placement.impl=org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware
      + #OZONE-SITE.XML_ozone.scm.container.placement.impl=org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware
      +
      + OZONE-SITE.XML_ozone.scm.container.placement.impl=org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter
      + OZONE-SITE.XML_ozone.scm.pipeline.placement.impl=org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter
      +
      + # For decommission
      + OZONE-SITE.XML_ozone.scm.nodes.scmservice=scm
      + OZONE-SITE.XML_ozone.scm.address.scmservice.scm=scm
      +
      + # Expedite the container replication checking
      + OZONE-SITE.XML_hdds.scm.replication.thread.interval=15s
  • Edit network-config:

      - 10.5.0.6	/rack1
      + 10.5.0.6	/rack2
        10.5.0.7	/rack2
      - 10.5.0.8	/rack2
      + 10.5.0.8	/rack3
      - 10.5.0.9	/rack2
      + 10.5.0.9	/rack3
  • Create a key with replication Ratis THREE:

      ozone sh key put /vol1/bucket1/key1 /etc/hosts -t=RATIS -r=THREE
  • Find the replicas for container 1 and decommission one of the replica nodes:

      bash-4.2$ ozone admin container info 1
          (note a datanode that holds a replica)
      bash-4.2$ ozone admin scm roles
          (copy the SCM IP)
      bash-4.2$ ozone admin datanode list
          (copy the datanode IP)
      bash-4.2$ ozone admin datanode decommission -id=scmservice --scm=172.23.0.2:9894 172.23.0.8/ozone-datanode-2.ozone_default
      Started decommissioning datanode(s):
      172.23.0.8/ozone-datanode-2.ozone_default
  • Once the node is decommissioned, check the SCM container report (ozone admin container report) and the Recon container page.

@xBis7 xBis7 requested review from sodonnel and adoroszlai December 4, 2023 14:50
@kerneltime (Contributor) commented:

cc @kerneltime

@kerneltime kerneltime self-requested a review December 11, 2023 17:27
@adoroszlai (Contributor) left a comment

Thanks @xBis7 for the patch. The change makes sense to me.

@adoroszlai adoroszlai changed the title HDDS-9683. Containers belonging to decommissioning or decommissioned nodes, are counted as mis-replicated HDDS-9683. Containers present on decommission nodes are reported as mis-replicated Dec 12, 2023
@xBis7 (Contributor, Author) commented Dec 12, 2023

@adoroszlai Thanks for the review. I've addressed your comments. Can you take a look?

@xBis7 xBis7 requested a review from adoroszlai December 12, 2023 14:22
@@ -445,6 +445,7 @@ public ContainerPlacementStatus validateContainerPlacement(
      }
    }
    List<Integer> currentRackCount = new ArrayList<>(dns.stream()
+       .filter(d -> !(d.isDecommissioned()))

@sodonnel (Contributor) commented Dec 12, 2023

Again here, I wonder about maintenance replicas. It would be possible for there to be extra copies of maintenance replicas in the cluster, like with decommission, and until the node goes offline, there would be too many replicas.

The problem here is not the number of racks, but that we are violating the "max nodes per rack" limit. The reason is that we pass in the expected number of nodes (repFactor) and the list of replicas, and the list of replicas is larger than the expected number.

In this case, we pass in 4 replicas (3 x in_service, 1 x decommissioning) and say we expect 3 (repFactor).

Then the calculation of max nodes per rack says:

protected int getMaxReplicasPerRack(int numReplicas, int numberOfRacks) {
    return numReplicas / numberOfRacks
            + Math.min(numReplicas % numberOfRacks, 1);
  }

So if we have 3 replicas and 2 racks, we get:

3 / 2 + min(3 % 2, 1) = 1 + min(1, 1) = 1 + 1 = 2.

But as the container is effectively over-replicated, we have to adjust the max nodes per rack upwards by the delta between the actual replica count and expected replicas.

So if we add Math.max(0, dns.size() - replicas) = 1 to that max per rack, then we get 3 allowed per rack, which means the policy would no longer be violated.

Do you think that change would fix it in the general case for any type of excess replica count?
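
To make that arithmetic concrete, here is a small self-contained sketch. The method body mirrors the snippet quoted above; the adjustment and the example numbers come from this comment, not from the merged patch.

public class MaxPerRackExample {

  // Same formula as the getMaxReplicasPerRack() snippet quoted above.
  static int getMaxReplicasPerRack(int numReplicas, int numberOfRacks) {
    return numReplicas / numberOfRacks
        + Math.min(numReplicas % numberOfRacks, 1);
  }

  public static void main(String[] args) {
    int repFactor = 3;      // expected number of replicas
    int actualReplicas = 4; // 3 x in_service + 1 x decommissioning
    int numRacks = 2;

    int maxPerRack = getMaxReplicasPerRack(repFactor, numRacks); // 3 / 2 + min(1, 1) = 2
    // Raise the limit by the excess between actual and expected replicas.
    maxPerRack += Math.max(0, actualReplicas - repFactor);       // 2 + 1 = 3

    System.out.println("Adjusted max replicas per rack: " + maxPerRack); // 3
  }
}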

@xBis7 (Contributor, Author) commented Dec 13, 2023

Again here, I wonder about maintenance replicas. It would be possible for there to be extra copies of maintenance replicas in the cluster, like with decommission, and until the node goes offline, there would be too many replicas.

Are you saying that we should occasionally filter maintenance replicas as well? We skip them, since we expect a maintenance node to not be stopped.

But as the container is effectively over-replicated, we have to adjust the max nodes per rack upwards by the delta between the actual replica count and expected replicas.

So if we add to that max per rack Math.max(0, dns.size() - replicas) = 1, then we get 3 allowed per rack, which means the policy would not be violated any more.

I don't understand this part, can you explain it more? Where can I find it in the code?

In your example (3 x in_service, 1 x decommissioning), the container isn't over-replicated as far as SCM is concerned; the replica on the decommissioned node isn't counted. But the placement policy will say that the container is mis-replicated. As long as SCM considers the container properly replicated, it makes sense to me to filter any replicas on offline nodes in the policy as well.

Contributor:

What I am saying is that we should not filter nodes here, but adjust the max-nodes-per-rack higher if there are more replicas passed in. There is the decommission scenario, maintenance, plus there are scenarios in replication manager where we maintain extra copies of unhealthy replicas too, so you could get 3 x in_service plus 1 or more unhealthy.

If you have 3 replicas, then the expectation is 2 racks and a max of 2 per rack. But what if there are 4 replicas for rep-factor 3? Then you need 2 or 3 racks, with at most 3 per rack.

So if you adjust the max per rack higher based on the number of replicas available, then I think it would fix the problem.

Contributor Author:

So if you adjust the max per rack higher based on the number of replicas available, then I think it would fix the problem.

ok, I get it. Let me look into that. Thanks!

Contributor Author:

@sodonnel I was thinking that we need to stop reporting mis-replication in the case of excess replicas due to offline nodes. Are you saying that we should include unhealthy replicas as well, and stop reporting mis-replication if we have proper replication and the excess replicas are unavailable, either because they're on offline nodes or because they're unhealthy?

@xBis7 (Contributor, Author) commented Dec 19, 2023

@sodonnel Thanks for the explanation. I agree. I'll add all that as a comment because other people might have the same questions and confusion.

Since over-replication shouldn't be considered mis-replication, we should adjust the max replicas per rack every time there are excess replicas, right? We don't need to check for decommission or maintenance to perform the adjustment.

I'm referring to something like this

int maxReplicasPerRack = getMaxReplicasPerRack(replicas,
        Math.min(requiredRacks, numRacks));

boolean offlineNodeReplicas = dns.stream()
    .anyMatch(d -> d.isMaintenance() || d.isDecommissioned());

if (offlineNodeReplicas) {
  // Adjust max replicas per rack for excessive replicas belonging to offline nodes.
  maxReplicasPerRack += Math.max(0, dns.size() - replicas);
}

Contributor:

Since over-replication shouldn't be considered mis-replication, we should adjust the max replicas per rack every time there are excess replicas, right?

Yea, I think so. We shouldn't need to consider decommission or maintenance in here. If some caller wanted to exclude decom / maint for some reason, it can filter the replicas before it passes them into the method.
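
Put together, a possible shape of the adjustment being agreed on here. This reuses the same assumed variable names as the sketch above and is not necessarily the final patch.

int maxReplicasPerRack = getMaxReplicasPerRack(replicas,
    Math.min(requiredRacks, numRacks));
// Always allow the per-rack limit to grow with the excess replica count,
// regardless of why the excess exists (decommission, maintenance, unhealthy copies).
maxReplicasPerRack += Math.max(0, dns.size() - replicas);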

Contributor Author:

ok, thanks. I'll add some tests and push the changes.

@xBis7 (Contributor, Author) commented Dec 19, 2023

@sodonnel Since we are allowing over-replication, should we also extend the check in ContainerPlacementStatusDefault#isPolicySatisfied() to something like this

cnt <= maxReplicasPerRack && maxReplicasPerRack - cnt <= 1);

so that we can report mis-replication if there is too much uneven distribution?

In the above example,

I think that one reason the max-per-rack was brought in was for EC + Rack Scatter. Say you have 3 racks and EC 3-2. The ideal distribution is 2, 2, 1. However 3, 1, 1 is also possible.

3, 1, 1 won't be reported as mis-replication with the new code addition. But if we check that the replica difference between racks isn't more than 1, we can identify such cases.

What do you think? This check might make sense only for EC containers.

Contributor:

3, 1, 1 should still be reported as mis-replicated for EC-3-2, as there should be 5 replicas. There is no excess / over replication in this example.

If we had 4, 1, 1, this has an excess of 1. It also has an ideal max-per-rack of 2, but we adjust it 1 higher to 3 due to the excess, and hence it should be reported as mis-replicated. However, 3, 2, 1 would not be.
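
To spell out that EC-3-2 arithmetic, here is a self-contained illustration using the same formula and excess adjustment discussed above; the layout numbers are the ones from this comment.

public class EcMaxPerRackExample {
  public static void main(String[] args) {
    int expectedReplicas = 5; // EC-3-2 expects 5 replicas
    int actualReplicas = 6;   // e.g. a 4, 1, 1 layout: one excess copy
    int numRacks = 3;

    int idealMax = expectedReplicas / numRacks
        + Math.min(expectedReplicas % numRacks, 1);        // 1 + 1 = 2
    int adjustedMax = idealMax
        + Math.max(0, actualReplicas - expectedReplicas);  // 2 + 1 = 3

    // A rack holding 4 replicas exceeds 3, so 4, 1, 1 is still reported as
    // mis-replicated; a 3, 2, 1 layout stays within the adjusted limit.
    System.out.println("Adjusted max replicas per rack: " + adjustedMax);
  }
}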

@xBis7 xBis7 requested a review from sodonnel December 20, 2023 21:30
@sodonnel (Contributor) left a comment

The latest revision LGTM. Sorry for being slow to come back, but I was out of the office for the past 2 weeks.

@adoroszlai adoroszlai merged commit f3872a7 into apache:master Jan 2, 2024
34 checks passed
@adoroszlai (Contributor) commented:

Thanks @xBis7 for the patch, @kerneltime, @sodonnel for the review.

adoroszlai pushed a commit to adoroszlai/ozone that referenced this pull request Jan 25, 2024
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Feb 1, 2024
…eported as mis-replicated (apache#5726)

(cherry picked from commit f3872a7)
Change-Id: I8c9332bc4183026f72776df7e5ec07dfe8a372dd