Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-7210. Missing open containers show up as "Closing" on the container report. #4207

Merged
merged 11 commits into from
Mar 3, 2023

Conversation

djordje-mijatovic
Copy link
Contributor

What changes were proposed in this pull request?

When the container goes into CLOSING state, if the data node is down than the container is stuck in CLOSING state forever, and the user does not know that this container is missing. In this PR we recalculate the HealthState of the CLOSING container to inform the user that container is also MISSING. The goal was to set the same HealthState as for CLOSED container.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7210

How was this patch tested?

Unit test and manual dev tests.

@neils-dev neils-dev added the gr label Jan 26, 2023
@djordje-mijatovic
Copy link
Contributor Author

@sodonnel Could you review this PR again? Thank you.

Comment on lines +1622 to +1625
report.incrementAndSample(HealthState.UNDER_REPLICATED,
container.containerID());
report.incrementAndSample(HealthState.MIS_REPLICATED,
container.containerID());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the container categorized in all of these states? Isn't MISSING enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, MISSING could be enough but we wanted to have the same behavior as we have in Recon - this is way we have set all the states.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't MISSING enough?

This is also consistent with the replication manager detecting and reporting "MISSING" when the container is in the CLOSED state.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I see that we're also setting the container as mis replicated, under replicated and missing further down in that method for closed containers.

Looks like over time we've diverged from the original intention of the replication manager report:

 * This class is used by ReplicationManager. Each time ReplicationManager runs,
 * it creates a new instance of this class and increments the various counters
 * to allow for creating a report on the various container states within the
 * system. There is a counter for each LifeCycleState (open, closing, closed
 * etc) and the sum of each of the lifecycle state counters should equal the
 * total number of containers in SCM. Ie, each container can only be in one of
 * the Lifecycle states at any time.

It specifies that each container should only be in one state at a time.

I think we need to decide what will best help with debugging. For example, if a container is missing, it's naturally also mis replicated and under replicated. We can choose to count it only once as missing or we can count it in all three categories, but that needs to be done consistently everywhere.

The new RM does not count a missing container as mis replicated, but it does count it as under replicated in RatisReplicationCheckHandler. This is because it considers mis replication only when there is no under/over replication.

@neils-dev
Copy link
Contributor

@siddhantsangwan , would you mind taking a look at this PR

Copy link
Contributor

@siddhantsangwan siddhantsangwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good. We will also need this fix in the new RM in ClosingContainerHandler. That can be done in a new PR or in this one.

Comment on lines +1622 to +1625
report.incrementAndSample(HealthState.UNDER_REPLICATED,
container.containerID());
report.incrementAndSample(HealthState.MIS_REPLICATED,
container.containerID());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I see that we're also setting the container as mis replicated, under replicated and missing further down in that method for closed containers.

Looks like over time we've diverged from the original intention of the replication manager report:

 * This class is used by ReplicationManager. Each time ReplicationManager runs,
 * it creates a new instance of this class and increments the various counters
 * to allow for creating a report on the various container states within the
 * system. There is a counter for each LifeCycleState (open, closing, closed
 * etc) and the sum of each of the lifecycle state counters should equal the
 * total number of containers in SCM. Ie, each container can only be in one of
 * the Lifecycle states at any time.

It specifies that each container should only be in one state at a time.

I think we need to decide what will best help with debugging. For example, if a container is missing, it's naturally also mis replicated and under replicated. We can choose to count it only once as missing or we can count it in all three categories, but that needs to be done consistently everywhere.

The new RM does not count a missing container as mis replicated, but it does count it as under replicated in RatisReplicationCheckHandler. This is because it considers mis replication only when there is no under/over replication.

@neils-dev
Copy link
Contributor

Thanks @siddhantsangwan , filed a jira to handle closing containers and check for MISSING for the ClosingContainerHandler, HDDS-8017. https://issues.apache.org/jira/browse/HDDS-8017

We will also need this fix in the new RM in ClosingContainerHandler. That can be done in a new PR or in this one.

Copy link
Contributor

@siddhantsangwan siddhantsangwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. For now, we can count in all places for the container report in the legacy RM. Going forward, in the new RM, we're only counting them once: #4313

Copy link
Contributor

@neils-dev neils-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @djordje-mijatovic for providing this well done patch. Thanks @siddhantsangwan , @sodonnel , @adoroszlai for your comments and for reviewing this PR.

Will be merging this shortly. Thanks!

@neils-dev neils-dev merged commit a432438 into apache:master Mar 3, 2023
@arp7
Copy link
Contributor

arp7 commented May 30, 2023

@djordje-mijatovic can you request an Apache jira account?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants