HDDS-7210. Missing open containers show up as "Closing" on the container report. #4207
Conversation
Commits:
- Take Master (13 Jan 23)
- …NG or QUASI_CLOSED
...src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java
@sodonnel Could you review this PR again? Thank you.
report.incrementAndSample(HealthState.UNDER_REPLICATED,
    container.containerID());
report.incrementAndSample(HealthState.MIS_REPLICATED,
    container.containerID());
Why is the container categorized in all of these states? Isn't MISSING enough?
In theory, MISSING could be enough, but we wanted the same behavior as we have in Recon; that is why we set all the states.
Isn't MISSING enough?
This is also consistent with the replication manager detecting and reporting "MISSING" when the container is in the CLOSED state.
Yeah, I see that we're also setting the container as mis-replicated, under-replicated and missing further down in that method for closed containers.
Looks like over time we've diverged from the original intention of the replication manager report:
* This class is used by ReplicationManager. Each time ReplicationManager runs,
* it creates a new instance of this class and increments the various counters
* to allow for creating a report on the various container states within the
* system. There is a counter for each LifeCycleState (open, closing, closed
* etc) and the sum of each of the lifecycle state counters should equal the
* total number of containers in SCM. Ie, each container can only be in one of
* the Lifecycle states at any time.
It specifies that each container should only be in one state at a time.
I think we need to decide what will best help with debugging. For example, if a container is missing, it's naturally also mis replicated and under replicated. We can choose to count it only once as missing or we can count it in all three categories, but that needs to be done consistently everywhere.
The new RM does not count a missing container as mis replicated, but it does count it as under replicated in RatisReplicationCheckHandler. This is because it considers mis replication only when there is no under/over replication.
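To make the trade-off concrete, here is a small, self-contained Java sketch contrasting the two counting policies discussed above: exclusive (each container counted in exactly one health state, as the new RM does) versus cumulative (a missing container also counted as under- and mis-replicated, as the legacy report does). The class, enum, and parameter names are simplified stand-ins for illustration, not the actual Ozone SCM classes.

```java
import java.util.EnumSet;
import java.util.Set;

public class HealthClassificationSketch {

  enum HealthState { MISSING, UNDER_REPLICATED, MIS_REPLICATED, HEALTHY }

  // Exclusive policy: evaluate the most severe condition first, so a
  // missing container is counted only as MISSING.
  static HealthState classifyExclusive(int liveReplicas, int required,
      boolean placementOk) {
    if (liveReplicas == 0) {
      return HealthState.MISSING;
    }
    if (liveReplicas < required) {
      return HealthState.UNDER_REPLICATED;
    }
    if (!placementOk) {
      return HealthState.MIS_REPLICATED;
    }
    return HealthState.HEALTHY;
  }

  // Cumulative policy: record every condition that applies, so a
  // missing container also shows up as under- and mis-replicated.
  static Set<HealthState> classifyCumulative(int liveReplicas, int required,
      boolean placementOk) {
    Set<HealthState> states = EnumSet.noneOf(HealthState.class);
    if (liveReplicas == 0) {
      states.add(HealthState.MISSING);
    }
    if (liveReplicas < required) {
      states.add(HealthState.UNDER_REPLICATED);
    }
    if (!placementOk) {
      states.add(HealthState.MIS_REPLICATED);
    }
    if (states.isEmpty()) {
      states.add(HealthState.HEALTHY);
    }
    return states;
  }

  public static void main(String[] args) {
    // A container with no live replicas out of 3 required:
    System.out.println(classifyExclusive(0, 3, false));   // MISSING
    System.out.println(classifyCumulative(0, 3, false));  // [MISSING, UNDER_REPLICATED, MIS_REPLICATED]
  }
}
```

Whichever policy is chosen, the key point from the discussion is that it must be applied consistently, since otherwise the per-state counters can no longer be summed or compared meaningfully.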
@siddhantsangwan, would you mind taking a look at this PR?
Mostly looks good. We will also need this fix in the new RM in ClosingContainerHandler. That can be done in a new PR or in this one.
Thanks @siddhantsangwan, filed a jira to handle closing containers and check for MISSING for the
This looks good to me. For now, we can count in all places for the container report in the legacy RM. Going forward, in the new RM, we're only counting them once: #4313
Thanks @djordje-mijatovic for providing this well-done patch. Thanks @siddhantsangwan, @sodonnel, @adoroszlai for your comments and for reviewing this PR.
Will be merging this shortly. Thanks!
@djordje-mijatovic can you request an Apache jira account?
What changes were proposed in this pull request?
When the container goes into the CLOSING state, if the datanode is down, the container is stuck in the CLOSING state forever, and the user does not know that this container is missing. In this PR we recalculate the HealthState of the CLOSING container to inform the user that the container is also MISSING. The goal was to set the same HealthState as for a CLOSED container.
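The core of the change described above can be sketched as follows: treat a CLOSING container with no live replicas the same way as a CLOSED one for the missing check, so it surfaces as MISSING instead of staying stuck in CLOSING. This is a hedged illustration only; the enum values and method names are simplified stand-ins, not the actual SCM replication manager code.

```java
import java.util.List;

public class ClosingContainerHealthSketch {

  enum LifeCycleState { OPEN, CLOSING, CLOSED }
  enum HealthState { MISSING, HEALTHY }

  // If every datanode holding a replica is down or gone, a CLOSING
  // container can never transition to CLOSED, so report it MISSING
  // just as a CLOSED container with no replicas would be.
  static HealthState recalculate(LifeCycleState state,
      List<String> liveReplicaDatanodes) {
    if ((state == LifeCycleState.CLOSING || state == LifeCycleState.CLOSED)
        && liveReplicaDatanodes.isEmpty()) {
      return HealthState.MISSING;
    }
    return HealthState.HEALTHY;
  }

  public static void main(String[] args) {
    // The datanode holding the only replica went down mid-close:
    System.out.println(recalculate(LifeCycleState.CLOSING, List.of()));       // MISSING
    System.out.println(recalculate(LifeCycleState.CLOSING, List.of("dn1")));  // HEALTHY
  }
}
```

The design choice mirrored here is the one stated in the PR description: reuse the existing CLOSED-container health evaluation for CLOSING containers rather than inventing a new state.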
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-7210
How was this patch tested?
Unit test and manual dev tests.