[SPARK-4134] [SPARK-7835] [WIP] Dynamic allocation: Tone down kill error messages #6310
We should not call askWithReply in a synchronized block. In this particular case, the CoarseGrainedSchedulerBackend actor tried to acquire the same lock when replying, leading to a deadlock.
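A minimal sketch of the pattern described here, assuming a hypothetical `ExecutorKiller` class and a `blockingAsk` callback that stand in for the scheduler backend and its blocking ask (neither is Spark's actual API). Shared state is updated under the lock, and the blocking call is made only after the lock is released, so a replier that needs the same lock can make progress:

```scala
import scala.collection.mutable

// Hypothetical stand-in for a scheduler backend; not Spark's actual class.
class ExecutorKiller(blockingAsk: Int => Boolean) {
  private val pendingToRemove = mutable.HashSet.empty[String]
  private val numExisting = 10 // assumed fixed executor count, for illustration only

  def pendingCount: Int = synchronized { pendingToRemove.size }

  def killExecutors(ids: Seq[String]): Boolean = {
    // Step 1: compute the new target and update shared state while holding the lock.
    val newTotal = synchronized {
      val filtered = ids.filterNot(pendingToRemove.contains)
      val total = numExisting - pendingToRemove.size - filtered.size
      pendingToRemove ++= filtered
      total
    }
    // Step 2: block on the reply with the lock released.
    blockingAsk(newTotal)
  }
}

object LockDemo {
  def run(): (Boolean, Int) = {
    var killer: ExecutorKiller = null
    val ask: Int => Boolean = { total =>
      // The "reply" is produced on another thread that needs the same lock.
      var seen = -1
      val replier = new Thread(new Runnable {
        def run(): Unit = { seen = killer.pendingCount }
      })
      replier.start()
      replier.join(2000)
      seen >= 0 && total >= 0
    }
    killer = new ExecutorKiller(ask)
    (killer.killExecutors(Seq("1", "2")), killer.pendingCount)
  }
}
```

If `blockingAsk` were called inside the `synchronized` block instead, the replier thread in `LockDemo` would wait on the lock held by the caller, which is the shape of the deadlock reported above.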
|
Not sure the whole |
|
From what I see here: https://github.com/akka/akka/blob/d9db42b75715cb2ca98c84efbdb8666cb046bfa2/akka-remote/src/main/scala/akka/remote/Endpoint.scala ignoring logging messages below |
|
Test build #33214 has finished for PR 6310 at commit
|
|
@harishreedharan yeah I'm just going to bump it to |
|
By the way I updated the screenshot and fixed a heartbeat receiver race condition that was introduced in this patch previously. |
@vanzin I believe these two lines are functionally equivalent to what you had in the old code (L384). I believe the only difference is that I moved the askWithReply out of the synchronized block. Please let me know if this is not the case.
What it used to look like:
val newTotal = (numExistingExecutors + numPendingExecutors - executorsPendingToRemove.size
- filteredExecutorIds.size)
...
executorsPendingToRemove ++= filteredExecutorIds
|
Looks sane to me. |
|
Test build #33279 has finished for PR 6310 at commit
|
|
Test build #33283 has finished for PR 6310 at commit
|
|
Test build #33277 timed out for PR 6310 at commit |
Now the heartbeat receiver will never create a new entry for an executor that has not registered.
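As a rough illustration of this rule, here is a simplified tracker, assuming hypothetical names (`HeartbeatTracker`, `register`, `expireDead`) that are not taken from Spark's `HeartbeatReceiver`. A heartbeat only refreshes an entry that registration already created; it never creates one:

```scala
import scala.collection.mutable

// Hypothetical, simplified tracker; names are illustrative only.
class HeartbeatTracker(timeoutMs: Long) {
  private val lastSeen = mutable.HashMap.empty[String, Long]

  // Only registration may create an entry.
  def register(executorId: String, nowMs: Long): Unit = synchronized {
    lastSeen(executorId) = nowMs
  }

  // Returns true if the heartbeat was accepted; unknown executors are ignored.
  def heartbeat(executorId: String, nowMs: Long): Boolean = synchronized {
    if (lastSeen.contains(executorId)) {
      lastSeen(executorId) = nowMs
      true
    } else {
      false
    }
  }

  // Drops and returns executors whose last heartbeat is older than the timeout.
  def expireDead(nowMs: Long): List[String] = synchronized {
    val dead = lastSeen.collect { case (id, t) if nowMs - t > timeoutMs => id }.toList
    dead.foreach(id => lastSeen.remove(id))
    dead
  }

  def tracked: Set[String] = synchronized { lastSeen.keySet.toSet }
}
```

Passing the current time in as `nowMs` keeps the sketch deterministic, which is the same property the tests discussed in this thread are after.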
|
After testing further I found that the race condition wasn't fully resolved. It should actually be fixed in the latest commit. (By the way, all the race condition does is print out more error messages; it doesn't change execution behavior in any way.) |
|
Test build #33311 has finished for PR 6310 at commit
|
This covers not only the change in behavior introduced in this patch, but also existing behavior (e.g. expire dead hosts) that was simply not tested. This commit also refactors the test in a way that eliminates the existing duplicate code.
I moved these here to increase test determinism
It turns out that `send` doesn't actually transmit messages to `receiveAndReply`. This means we need to duplicate a few messages in both `receive` and `receiveAndReply` if we want both `send` and `askWithRetry` to work.
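The routing issue can be pictured with a toy model. `ToyEndpoint`, its `send` and `ask` methods, and the `ExecutorRegistered` message below are assumptions made for illustration; they mimic the `receive` / `receiveAndReply` split but are not Spark's RPC classes. One-way messages only reach `receive`, asks only reach `receiveAndReply`, so a message that must work both ways is matched in both, with the real work kept in one shared helper:

```scala
import scala.collection.mutable

// Toy model of an endpoint with separate one-way and request-reply handlers.
trait ToyEndpoint {
  def receive: PartialFunction[Any, Unit]
  def receiveAndReply(reply: Any => Unit): PartialFunction[Any, Unit]

  // One-way delivery only consults `receive`.
  def send(msg: Any): Boolean = {
    val handler = receive
    if (handler.isDefinedAt(msg)) { handler(msg); true } else false
  }

  // Request-reply delivery only consults `receiveAndReply`.
  def ask(msg: Any): Option[Any] = {
    var answer: Option[Any] = None
    val handler = receiveAndReply(a => answer = Some(a))
    if (handler.isDefinedAt(msg)) handler(msg)
    answer
  }
}

// Hypothetical message type.
case class ExecutorRegistered(id: String)

class ToyHeartbeatEndpoint extends ToyEndpoint {
  val registered = mutable.HashSet.empty[String]

  // Shared logic, so the duplicated cases stay one line each.
  private def onRegistered(id: String): Unit = { registered += id }

  override def receive: PartialFunction[Any, Unit] = {
    case ExecutorRegistered(id) => onRegistered(id)
  }

  override def receiveAndReply(reply: Any => Unit): PartialFunction[Any, Unit] = {
    case ExecutorRegistered(id) =>
      onRegistered(id)
      reply(true)
  }
}
```

Keeping the handling in `onRegistered` means the duplication is limited to the pattern match itself.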
|
Test build #33374 has finished for PR 6310 at commit
|
Could you make SystemClock the default value for the clock parameter, to save one constructor, please?
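A sketch of what this suggestion could look like, assuming a locally defined `Clock` trait with `SystemClock` and `ManualClock` implementations and a hypothetical `Receiver` class (all declared here for illustration rather than imported from Spark). A default argument replaces the auxiliary constructor, while tests can still inject their own clock:

```scala
// Local stand-ins for a clock abstraction; declared here so the sketch is self-contained.
trait Clock { def getTimeMillis(): Long }

class SystemClock extends Clock {
  def getTimeMillis(): Long = System.currentTimeMillis()
}

class ManualClock(private var now: Long) extends Clock {
  def getTimeMillis(): Long = now
  def advance(ms: Long): Unit = { now += ms }
}

// One primary constructor; the default argument covers the production case.
class Receiver(clock: Clock = new SystemClock) {
  def timestamp(): Long = clock.getTimeMillis()
}
```

Production code would call `new Receiver()`, and a test would call `new Receiver(new ManualClock(0L))` to control time explicitly.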
|
Test build #33368 has finished for PR 6310 at commit
|
|
Test build #33366 has finished for PR 6310 at commit
|
|
Test build #33376 has finished for PR 6310 at commit
|
|
I discovered a few more issues with this patch. Putting this on hold for now. |
I'm going to factor the changes in this class out to a separate patch
|
I'm going to open a new patch later once I spend some time to figure out the issue with this one. In the meantime let's not block the refactoring of the
Note to self: DO NOT delete this branch! |
[SPARK-4134]: We should not log `ERROR` or `WARNING` if executors are killed by the user.
[SPARK-7835]: Refactor `HeartbeatReceiverSuite` to increase coverage (needed by SPARK-4134).
~60% of this patch is test code.
Right now we get a bunch of scary error messages if the user kills executors manually (or uses dynamic allocation to do so):
This patch tones down these error messages, since these are really not errors. Now the output looks like:
I tested this on a real YARN cluster with #6301.
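One way to picture the SPARK-4134 idea is the following sketch, which assumes a hypothetical `ExecutorLossLogger` class and a two-value `Level` enumeration; it is not the patch's actual code. Executors whose removal was requested are remembered, and when one of them is later reported lost, the event is classified as informational rather than as an error:

```scala
import scala.collection.mutable

// Hypothetical severity levels for the sketch.
object Level extends Enumeration {
  val Info, Error = Value
}

// Hypothetical helper; decides how loudly to report a lost executor.
class ExecutorLossLogger {
  private val pendingToRemove = mutable.HashSet.empty[String]

  // Called when the user or dynamic allocation asks to kill an executor.
  def markKillRequested(id: String): Unit = synchronized { pendingToRemove += id }

  // An expected loss is informational; an unexpected one is still an error.
  def levelForLoss(id: String): Level.Value = synchronized {
    if (pendingToRemove.remove(id)) Level.Info else Level.Error
  }
}
```

Removing the id on first use means a second, unexpected loss report for the same id is treated as an error again.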