HDDS-8683. Container balancer thread interrupt may not work #6179

Tejaskriya · 2024-02-06T08:40:34Z

What changes were proposed in this pull request?

Container balancer tries to interrupt the current balancing thread when it is stopped. This may fail if the interrupt arrives before the thread has even started, i.e., before the scheduler has picked it up. As a result, stopping the balancer is delayed by however long the thread sleeps next, which may be 5 minutes by default for delayStart.
This PR adds a check in run() for whether the containerBalancer is running on the SCM. The thread sleeps for delayStart only if the balancer is running; otherwise the sleep is skipped.
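
The guard described above can be sketched roughly as follows. This is a minimal illustration, not the Ozone code: `DelayedStartTask`, its `running` flag, and `stopsPromptly()` are hypothetical names standing in for the balancer task and `isBalancerRunning()`. The point is that a stop request recorded in a flag is honoured even if no interrupt ever reaches the thread.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the balancer task; not the actual Ozone class.
class DelayedStartTask implements Runnable {
  private final AtomicBoolean running = new AtomicBoolean(true);
  private final Duration delayStart;
  private volatile boolean didWork = false;

  DelayedStartTask(Duration delayStart) {
    this.delayStart = delayStart;
  }

  /** Records the stop request; a real caller would also interrupt the thread. */
  void stop() {
    running.set(false);
  }

  boolean didWork() {
    return didWork;
  }

  @Override
  public void run() {
    try {
      // Sleep for the start delay only if nobody has stopped us already.
      if (running.get()) {
        Thread.sleep(delayStart.toMillis());
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return;
    }
    if (running.get()) {
      didWork = true; // balancing iterations would go here
    }
  }

  /** Stops the task before its thread starts and reports whether it exited promptly. */
  static boolean stopsPromptly() {
    DelayedStartTask task = new DelayedStartTask(Duration.ofMinutes(5));
    Thread thread = new Thread(task, "balancer-sketch");
    thread.setDaemon(true);
    task.stop();   // stop is requested before the thread runs; no interrupt is sent
    thread.start();
    try {
      thread.join(10_000); // bounded wait so the demo cannot hang for 5 minutes
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !thread.isAlive() && !task.didWork();
  }

  public static void main(String[] args) {
    System.out.println("exited without sleeping: " + stopsPromptly());
  }
}
```

The flag check covers only the window before the sleep begins; a stop request that arrives during the sleep still relies on the interrupt.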

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8683

How was this patch tested?

Existing tests. No regression is caused by the refactoring in this PR.
Successful CI run on my fork: https://github.com/Tejaskriya/ozone/actions/runs/7796903702

sodonnel · 2024-02-06T11:38:17Z

...erver-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java

    LOG.info("Container Balancer waiting for {} to stop", balancingThread);
    try {
-      balancingThread.join();
+      while (balancingThread.isAlive()) {


When you interrupt the thread, as soon as it hits the Thread.sleep call, it should throw an interrupted exception and exit, so this retry logic should not be needed. I suspect there is some other bug in the interrupt handling that is causing this problem and we should try to get to the bottom of that.
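
The expectation in this comment matches the documented behaviour of `Thread.sleep`: if the interrupt flag of a running thread is already set, `sleep` throws `InterruptedException` instead of blocking, and clears the flag. A small sketch (the class and method names are made up for illustration) that exercises this on the current thread:

```java
class PendingInterruptDemo {
  /** Returns true if sleep() throws when the interrupt flag is already set. */
  static boolean sleepThrowsWhenFlagAlreadySet() {
    Thread.currentThread().interrupt(); // set the flag before sleeping
    try {
      Thread.sleep(60_000);
      return false; // not expected: the sleep should not run to completion
    } catch (InterruptedException e) {
      // sleep() clears the interrupted status when it throws
      return !Thread.currentThread().isInterrupted();
    }
  }

  public static void main(String[] args) {
    System.out.println("sleep threw with pending interrupt: "
        + sleepThrowsWhenFlagAlreadySet());
  }
}
```

This only covers a thread that is already alive when the flag is set; it says nothing about interrupts sent before `start()`.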

sumitagrawl · 2024-02-07

@sodonnel
Yes, the thread gets interrupted as soon as it goes to sleep. Tested.
The issue here is that if the interrupt arrives before the thread is actually scheduled to run, it gets ignored.

@Tejaskriya we can revert the loop and instead add an isBalancerRunning() check in run() before sleeping for the start delay. That is the case where the thread can miss the interrupt and sleep for the 5-minute startup delay.

Tejaskriya · 2024-02-09

Thank you for the reviews. I have added a check in run() to verify that the balancer is running. Could you please review it again?

adoroszlai · 2024-02-09T09:01:52Z

...r-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancerTask.java

+        if (isBalancerRunning()) {
+          Thread.sleep(Duration.ofSeconds(delayDuration).toMillis());
+        }


I think the bug report is not about this sleep, but rather the one in:

ozone/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancerTask.java

Lines 229 to 237 in 60fa8dd

      long sleepTime = 3 * nodeReportInterval;
      LOG.info("ContainerBalancer will sleep for {} ms while waiting " +
          "for updated usage information from Datanodes.", sleepTime);
      Thread.sleep(sleepTime);
    } catch (InterruptedException e) {
      LOG.info("Container Balancer was interrupted while waiting for" +
          "datanodes refreshing volume usage info");
      Thread.currentThread().interrupt();
      return;
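
The catch block quoted above follows the usual pattern of restoring the interrupt flag before returning, so that code further up the stack can still observe the interrupt. A self-contained sketch of that pattern, with hypothetical names (`RestoreFlagDemo`, `workerRestoresFlag`) and a latch added only so the demo interrupts a thread that is known to be running:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

class RestoreFlagDemo {
  /** Interrupts a sleeping worker and reports whether it exited with the flag restored. */
  static boolean workerRestoresFlag() {
    CountDownLatch started = new CountDownLatch(1);
    AtomicBoolean flagSeenAfterRestore = new AtomicBoolean(false);

    Thread worker = new Thread(() -> {
      started.countDown(); // signal that run() has begun
      try {
        Thread.sleep(60_000);
      } catch (InterruptedException e) {
        // sleep() cleared the flag; set it again before leaving
        Thread.currentThread().interrupt();
        flagSeenAfterRestore.set(Thread.currentThread().isInterrupted());
      }
    });
    worker.setDaemon(true);
    worker.start();

    try {
      started.await();     // wait until the worker is actually running
      worker.interrupt();
      worker.join(10_000); // bounded wait
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !worker.isAlive() && flagSeenAfterRestore.get();
  }

  public static void main(String[] args) {
    System.out.println("worker exited with flag restored: " + workerRestoresFlag());
  }
}
```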


Tejaskriya · 2024-02-26T08:41:41Z

Through some testing on a branch of my fork, we saw that in certain environments the thread ignores the interrupted flag if it has not yet gone to sleep. CI run in which the test runs take 1-7 minutes: https://github.com/Tejaskriya/ozone/actions/runs/8004080183/job/21860744227
So if Thread.interrupt() is called before Thread.sleep(), the interrupt is ignored.
This is fixed by retrying the interrupt (an interrupt that arrives while the thread is already sleeping is not ignored). CI run in which the fix was implemented and all tests take less than 1 minute: https://github.com/Tejaskriya/ozone/actions/runs/8015385744
This fixes the delays caused in both scenarios:

Thread sleeping in delayStart before starting
Thread sleeping to wait for updated usage information from datanodes

@sumitagrawl @sodonnel @adoroszlai please review the latest changes, thank you!
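
The retry approach described in this comment can be sketched as a loop that keeps interrupting until the thread is no longer alive. This is an illustration under assumptions, not the merged change: `InterruptRetryDemo`, `stopWithRetry`, the poll interval, and the attempt limit are all hypothetical.

```java
class InterruptRetryDemo {
  /** Interrupts the thread repeatedly until it exits or the attempts run out. */
  static boolean stopWithRetry(Thread thread, long pollMillis, int maxAttempts) {
    try {
      for (int attempt = 0; attempt < maxAttempts && thread.isAlive(); attempt++) {
        thread.interrupt();
        thread.join(pollMillis); // wait briefly, then re-check and retry
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // the stopping thread was itself interrupted
    }
    return !thread.isAlive();
  }

  /** Starts a thread that sleeps for 5 minutes and stops it with the retry loop. */
  static boolean demo() {
    Thread sleeper = new Thread(() -> {
      try {
        Thread.sleep(300_000);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    sleeper.setDaemon(true);
    sleeper.start();
    return stopWithRetry(sleeper, 100, 50);
  }

  public static void main(String[] args) {
    System.out.println("stopped: " + demo());
  }
}
```

A bounded number of attempts keeps the stopping thread from spinning forever if the target never reacts; whether the real code bounds the loop is not stated in this thread.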

sumitagrawl

@Tejaskriya Thanks for verifying and experimenting. The interrupt seems to be getting missed, so retrying seems to work better here.
LGTM.

adoroszlai · 2024-02-27T09:30:04Z

@siddhantsangwan @sodonnel would you like to take another look?

sodonnel · 2024-02-27T12:17:24Z

I am happy for this to be committed. I feel there is still something we don't understand, as these retries should not be needed, but it's probably not worth spending any more time on.

siddhantsangwan · 2024-02-28T04:36:36Z

> @siddhantsangwan @sodonnel would you like to take another look?

Thanks for asking, yes I'm taking another look.

siddhantsangwan

The root cause may turn out to be something else, but this looks good to me. Merging this. Thanks everyone!

adoroszlai · 2024-03-19T08:43:13Z

> The root cause may turn out to be something else,

The good news is that this is specific to unit tests. The bad news is that it also affects other tests that exercise similar threading code.

Root cause: SUREFIRE-1815

Created repro with unit test, including some logging
Changing to NOP logger (which just ignores all log messages) restores correct behavior
Upgrading Surefire version to 3.0.0-M6 restores correct behavior
Using a simple Java program instead of unit test restores correct behavior

Unfortunately upgrading Surefire from 3.0.0-M5 breaks other behavior (#6075).

adoroszlai · 2024-03-19T14:53:20Z

> Root cause: SUREFIRE-1815

Created #6406 to downgrade to 3.0.0-M4 to work around this.

tejaskriya added 2 commits February 6, 2024 11:53

HDDS-8683. Container balancer thread interrupt may not work

800b237

Remove duplicate line

3bcfb20

Tejaskriya marked this pull request as ready for review February 6, 2024 08:43

adoroszlai requested review from sumitagrawl and siddhantsangwan February 6, 2024 10:17

sodonnel reviewed Feb 6, 2024

Check isBalancerRunning before sleep for delayStart

60fa8dd

adoroszlai reviewed Feb 9, 2024

Tejaskriya marked this pull request as draft February 13, 2024 09:39

Revert "Check isBalancerRunning before sleep for delayStart"

025a9c5

This reverts commit 60fa8dd.

Tejaskriya marked this pull request as ready for review February 26, 2024 08:41

sumitagrawl approved these changes Feb 27, 2024

siddhantsangwan approved these changes Feb 28, 2024

siddhantsangwan merged commit 8c4ab8e into apache:master Feb 28, 2024
35 checks passed

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Mar 15, 2024

HDDS-8683. Container balancer thread interrupt may not work (apache#6179) (cherry picked from commit 8c4ab8e) Change-Id: Ib5da84e949a6a655aaf6d1f8c3e0f03de2560b7f

92c3d87
