Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-8683. Container balancer thread interrupt may not work #6179

Merged
merged 4 commits into from
Feb 28, 2024

Conversation

Tejaskriya
Copy link
Contributor

@Tejaskriya Tejaskriya commented Feb 6, 2024

What changes were proposed in this pull request?

Container balancer tries to interrupt current balancing thread upon being stopped. This may fail when the interrupt is before the thread even starts i.e., it hasn't been picked up by the scheduler. As a result, the balancer stop is delayed by the interval for which the thread goes to sleep next, which may be 5m by default for delayStart.
In this PR, a check is added in run() to check whether the containerBalancer is running on the scm. If it is running, only then it goes to sleep for delayStart. Otherwise the sleep is skipped.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8683

How was this patch tested?

Existing tests. No regression is caused by the refactoring in this PR.
Successful CI run on my fork: https://github.com/Tejaskriya/ozone/actions/runs/7796903702

@Tejaskriya Tejaskriya marked this pull request as ready for review February 6, 2024 08:43
LOG.info("Container Balancer waiting for {} to stop", balancingThread);
try {
balancingThread.join();
while (balancingThread.isAlive()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you interrupt the thread, as soon as it hits the Thread.sleep call, it should throw an interrupted exception and exit, so this retry logic should not be needed. I suspect there is some other bug in the interrupt handling that is causing this problem and we should try to get to the bottom of that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sodonnel
yes, thread will get interrupted as as soon it goes for sleep. Tested.
Here, issue is,

  • before thread actually schedule to run, and if there is interrupt, that gets ignored.

@Tejaskriya we can revert loop, but can add a check in run() before going for sleep for the delay, isBalancerRunning(). That will be case where it can ignore interrupt and go for sleep for 5 min as startup delay.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the reviews. I have added a check in run() to check if balancer is running. Could you please review it again?

Comment on lines 176 to 178
if (isBalancerRunning()) {
Thread.sleep(Duration.ofSeconds(delayDuration).toMillis());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the bug report is not about this sleep, rather the one in:

long sleepTime = 3 * nodeReportInterval;
LOG.info("ContainerBalancer will sleep for {} ms while waiting " +
"for updated usage information from Datanodes.", sleepTime);
Thread.sleep(sleepTime);
} catch (InterruptedException e) {
LOG.info("Container Balancer was interrupted while waiting for" +
"datanodes refreshing volume usage info");
Thread.currentThread().interrupt();
return;

@Tejaskriya Tejaskriya marked this pull request as draft February 13, 2024 09:39
@Tejaskriya
Copy link
Contributor Author

Through some testing on a branch on my fork, we saw that in certain environments, the thread ignores the interrupted flag when it is yet to go to sleep. CI run in which test runs take 1-7mins: https://github.com/Tejaskriya/ozone/actions/runs/8004080183/job/21860744227
So if the Thread.interrupt() is called before Thread.sleep(), then the interrupt is ignored.
This is fixed by retrying the interrupt (as if there is an interrupt when the thread is already sleeping, the interrupt is not ignored). CI run in which fix was implemented and all tests take <1min): https://github.com/Tejaskriya/ozone/actions/runs/8015385744
This fixes the delays caused in both scenarios:

  1. Thread sleeping in delayStart before starting
  2. Thread sleeping to wait for updated usage information from datanodes
    @sumitagrawl @sodonnel @adoroszlai please do review the latest changes, thank you!

@Tejaskriya Tejaskriya marked this pull request as ready for review February 26, 2024 08:41
Copy link
Contributor

@sumitagrawl sumitagrawl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Tejaskriya Thanks for verifying and experimenting, interrupt seems getting missed so retry seems better working here.
LGTM.

@adoroszlai
Copy link
Contributor

@siddhantsangwan @sodonnel would you like to take another look?

@sodonnel
Copy link
Contributor

I am happy for this to be committed. I feel there is still something we don't understand, as these retries should not be needed, but its probably not worth spending any more time on.

@siddhantsangwan
Copy link
Contributor

@siddhantsangwan @sodonnel would you like to take another look?

Thanks for asking, yes I'm taking another look.

Copy link
Contributor

@siddhantsangwan siddhantsangwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The root cause may turn out to be something else, but this looks good to me. Merging this. Thanks everyone!

@siddhantsangwan siddhantsangwan merged commit 8c4ab8e into apache:master Feb 28, 2024
35 checks passed
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Mar 15, 2024
)

(cherry picked from commit 8c4ab8e)
Change-Id: Ib5da84e949a6a655aaf6d1f8c3e0f03de2560b7f
@adoroszlai
Copy link
Contributor

adoroszlai commented Mar 19, 2024

The root cause may turn out to be something else,

The good news is that this is specific to unit tests. The bad one is it also affects other tests that exercise similar threading code.

Root cause: SUREFIRE-1815

  1. Created repro with unit test, including some logging
  2. Changing to NOP logger (which just ignores all log messages) restores correct behavior
  3. Upgrading Surefire version to 3.0.0-M6 restores correct behavior
  4. Using a simple Java program instead of unit test restores correct behavior

Unfortunately upgrading Surefire from 3.0.0-M5 breaks other behavior (#6075).

@adoroszlai
Copy link
Contributor

Root cause: SUREFIRE-1815

Created #6406 to downgrade to 3.0.0-M4 work around this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants