-
Notifications
You must be signed in to change notification settings - Fork 509
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDDS-8683. Container balancer thread interrupt may not work #6179
Conversation
LOG.info("Container Balancer waiting for {} to stop", balancingThread); | ||
try { | ||
balancingThread.join(); | ||
while (balancingThread.isAlive()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you interrupt the thread, as soon as it hits the Thread.sleep call, it should throw an interrupted exception and exit, so this retry logic should not be needed. I suspect there is some other bug in the interrupt handling that is causing this problem and we should try to get to the bottom of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sodonnel
yes, thread will get interrupted as as soon it goes for sleep. Tested.
Here, issue is,
- before thread actually schedule to run, and if there is interrupt, that gets ignored.
@Tejaskriya we can revert loop, but can add a check in run() before going for sleep for the delay, isBalancerRunning(). That will be case where it can ignore interrupt and go for sleep for 5 min as startup delay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the reviews. I have added a check in run() to check if balancer is running. Could you please review it again?
if (isBalancerRunning()) { | ||
Thread.sleep(Duration.ofSeconds(delayDuration).toMillis()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the bug report is not about this sleep, rather the one in:
Lines 229 to 237 in 60fa8dd
long sleepTime = 3 * nodeReportInterval; | |
LOG.info("ContainerBalancer will sleep for {} ms while waiting " + | |
"for updated usage information from Datanodes.", sleepTime); | |
Thread.sleep(sleepTime); | |
} catch (InterruptedException e) { | |
LOG.info("Container Balancer was interrupted while waiting for" + | |
"datanodes refreshing volume usage info"); | |
Thread.currentThread().interrupt(); | |
return; |
This reverts commit 60fa8dd.
Through some testing on a branch on my fork, we saw that in certain environments, the thread ignores the interrupted flag when it is yet to go to sleep. CI run in which test runs take 1-7mins: https://github.com/Tejaskriya/ozone/actions/runs/8004080183/job/21860744227
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Tejaskriya Thanks for verifying and experimenting, interrupt seems getting missed so retry seems better working here.
LGTM.
@siddhantsangwan @sodonnel would you like to take another look? |
I am happy for this to be committed. I feel there is still something we don't understand, as these retries should not be needed, but its probably not worth spending any more time on. |
Thanks for asking, yes I'm taking another look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The root cause may turn out to be something else, but this looks good to me. Merging this. Thanks everyone!
) (cherry picked from commit 8c4ab8e) Change-Id: Ib5da84e949a6a655aaf6d1f8c3e0f03de2560b7f
The good news is that this is specific to unit tests. The bad one is it also affects other tests that exercise similar threading code. Root cause: SUREFIRE-1815
Unfortunately upgrading Surefire from 3.0.0-M5 breaks other behavior (#6075). |
Created #6406 to downgrade to 3.0.0-M4 work around this. |
What changes were proposed in this pull request?
Container balancer tries to interrupt current balancing thread upon being stopped. This may fail when the interrupt is before the thread even starts i.e., it hasn't been picked up by the scheduler. As a result, the balancer stop is delayed by the interval for which the thread goes to sleep next, which may be 5m by default for delayStart.
In this PR, a check is added in run() to check whether the containerBalancer is running on the scm. If it is running, only then it goes to sleep for delayStart. Otherwise the sleep is skipped.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8683
How was this patch tested?
Existing tests. No regression is caused by the refactoring in this PR.
Successful CI run on my fork: https://github.com/Tejaskriya/ozone/actions/runs/7796903702