HDDS-7210. Missing open containers show up as "Closing" on the container report. #4207
Conversation
Commits:
- Take Master (13 Jan 23)
- …NG or QUASI_CLOSED
...src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java
@sodonnel Could you review this PR again? Thank you.
report.incrementAndSample(HealthState.UNDER_REPLICATED,
    container.containerID());
report.incrementAndSample(HealthState.MIS_REPLICATED,
    container.containerID());
Why is the container categorized in all of these states? Isn't MISSING enough?
In theory, MISSING could be enough, but we wanted the same behavior as we have in Recon; that is why we set all the states.
Isn't MISSING enough?
This is also consistent with the replication manager detecting and reporting "MISSING" when the container is in the CLOSED state.
Yeah, I see that we're also setting the container as mis-replicated, under-replicated and missing further down in that method for closed containers.
Looks like over time we've diverged from the original intention of the replication manager report:
* This class is used by ReplicationManager. Each time ReplicationManager runs,
* it creates a new instance of this class and increments the various counters
* to allow for creating a report on the various container states within the
* system. There is a counter for each LifeCycleState (open, closing, closed
* etc) and the sum of each of the lifecycle state counters should equal the
* total number of containers in SCM. Ie, each container can only be in one of
* the Lifecycle states at any time.
It specifies that each container should only be in one state at a time.
I think we need to decide what will best help with debugging. For example, if a container is missing, it's naturally also mis replicated and under replicated. We can choose to count it only once as missing or we can count it in all three categories, but that needs to be done consistently everywhere.
The new RM does not count a missing container as mis replicated, but it does count it as under replicated in RatisReplicationCheckHandler. This is because it considers mis replication only when there is no under/over replication.
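To make the trade-off concrete, here is a small, self-contained Java sketch contrasting the two counting policies discussed above: exclusive (each container counted in exactly one health state, as the new RM does) versus cumulative (a missing container also counted as under- and mis-replicated, as the legacy report does). The class, enum, and parameter names are simplified stand-ins for illustration, not the actual Ozone SCM classes.

```java
import java.util.EnumSet;
import java.util.Set;

public class HealthClassificationSketch {

  enum HealthState { MISSING, UNDER_REPLICATED, MIS_REPLICATED, HEALTHY }

  // Exclusive policy: evaluate the most severe condition first, so a
  // missing container is counted only as MISSING.
  static HealthState classifyExclusive(int liveReplicas, int required,
      boolean placementOk) {
    if (liveReplicas == 0) {
      return HealthState.MISSING;
    }
    if (liveReplicas < required) {
      return HealthState.UNDER_REPLICATED;
    }
    if (!placementOk) {
      return HealthState.MIS_REPLICATED;
    }
    return HealthState.HEALTHY;
  }

  // Cumulative policy: record every condition that applies, so a
  // missing container also shows up as under- and mis-replicated.
  static Set<HealthState> classifyCumulative(int liveReplicas, int required,
      boolean placementOk) {
    Set<HealthState> states = EnumSet.noneOf(HealthState.class);
    if (liveReplicas == 0) {
      states.add(HealthState.MISSING);
    }
    if (liveReplicas < required) {
      states.add(HealthState.UNDER_REPLICATED);
    }
    if (!placementOk) {
      states.add(HealthState.MIS_REPLICATED);
    }
    if (states.isEmpty()) {
      states.add(HealthState.HEALTHY);
    }
    return states;
  }

  public static void main(String[] args) {
    // A container with no live replicas out of 3 required:
    System.out.println(classifyExclusive(0, 3, false));   // MISSING
    System.out.println(classifyCumulative(0, 3, false));  // [MISSING, UNDER_REPLICATED, MIS_REPLICATED]
  }
}
```

Whichever policy is chosen, the key point from the discussion is that it must be applied consistently, since otherwise the per-state counters can no longer be summed or compared meaningfully.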
@siddhantsangwan, would you mind taking a look at this PR?
Mostly looks good. We will also need this fix in the new RM in ClosingContainerHandler. That can be done in a new PR or in this one.
Thanks @siddhantsangwan, filed a jira to handle closing containers and check for MISSING for the
This looks good to me. For now, we can count in all places for the container report in the legacy RM. Going forward, in the new RM, we're only counting them once: #4313
Thanks @djordje-mijatovic for providing this well-done patch. Thanks @siddhantsangwan, @sodonnel, @adoroszlai for your comments and for reviewing this PR.
Will be merging this shortly. Thanks!
@djordje-mijatovic can you request an Apache jira account?
What changes were proposed in this pull request?
When the container goes into the CLOSING state, if the datanode is down, the container is stuck in the CLOSING state forever, and the user does not know that this container is missing. In this PR we recalculate the HealthState of the CLOSING container to inform the user that the container is also MISSING. The goal was to set the same HealthState as for a CLOSED container.
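The core of the change described above can be sketched as follows: treat a CLOSING container with no live replicas the same way as a CLOSED one for the missing check, so it surfaces as MISSING instead of staying stuck in CLOSING. This is a hedged illustration only; the enum values and method names are simplified stand-ins, not the actual SCM replication manager code.

```java
import java.util.List;

public class ClosingContainerHealthSketch {

  enum LifeCycleState { OPEN, CLOSING, CLOSED }
  enum HealthState { MISSING, HEALTHY }

  // If every datanode holding a replica is down or gone, a CLOSING
  // container can never transition to CLOSED, so report it MISSING
  // just as a CLOSED container with no replicas would be.
  static HealthState recalculate(LifeCycleState state,
      List<String> liveReplicaDatanodes) {
    if ((state == LifeCycleState.CLOSING || state == LifeCycleState.CLOSED)
        && liveReplicaDatanodes.isEmpty()) {
      return HealthState.MISSING;
    }
    return HealthState.HEALTHY;
  }

  public static void main(String[] args) {
    // The datanode holding the only replica went down mid-close:
    System.out.println(recalculate(LifeCycleState.CLOSING, List.of()));       // MISSING
    System.out.println(recalculate(LifeCycleState.CLOSING, List.of("dn1")));  // HEALTHY
  }
}
```

The design choice mirrored here is the one stated in the PR description: reuse the existing CLOSED-container health evaluation for CLOSING containers rather than inventing a new state.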
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-7210
How was this patch tested?
Unit test and manual dev tests.