Skip to content

Conversation

@andrewor14
Copy link
Contributor

[SPARK-4134]: We should not log ERROR or WARNING if executors are killed by the user.
[SPARK-7835]: Refactor HeartbeatReceiverSuite to increase coverage (needed by SPARK-4134)

~60% of this patch is test code.


Right now we get a bunch of scary error messages if the user kills executors manually (or uses dynamic allocation to do so):

screen shot 2015-05-20 at 3 42 38 pm

This patch tones down these error messages, since these are really not errors. Now the output looks like:

screen shot 2015-05-21 at 1 17 02 pm

I tested this on a real YARN cluster with #6301.

Andrew Or added 3 commits May 20, 2015 15:41
We should not askWithReplay in a synchronized block. In this
particular case, the CoarseGrainedSchedulerBackend actor tried
to acquire the same lock when replying, leading to a deadlock.
@andrewor14 andrewor14 changed the title [SPARK-4134] Dynamic allocation: Tone down kill error messages [SPARK-4134] [WIP] Dynamic allocation: Tone down kill error messages May 21, 2015
@harishreedharan
Copy link
Contributor

Not sure the whole Logging trait etc will allow us to hide the logging per class, but if it does you could change the configuration to log only ERROR level messages for ReliableDeliverySupervisor

@harishreedharan
Copy link
Contributor

From what I see here: https://github.com/akka/akka/blob/d9db42b75715cb2ca98c84efbdb8666cb046bfa2/akka-remote/src/main/scala/akka/remote/Endpoint.scala ignoring logging messages below ERROR seems ok for that class.

@SparkQA
Copy link

SparkQA commented May 21, 2015

Test build #33214 has finished for PR 6310 at commit 677df8a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorLossReason(val message: String, val isError: Boolean = true)
    • case class RemoveExecutor(executorId: String, reason: String, isError: Boolean)

@andrewor14 andrewor14 changed the title [SPARK-4134] [WIP] Dynamic allocation: Tone down kill error messages [SPARK-4134] Dynamic allocation: Tone down kill error messages May 21, 2015
@andrewor14
Copy link
Contributor Author

@harishreedharan yeah I'm just going to bump it to ERROR in the log4j conf. After an offline discussion with @zsxwing we decided that it's safe to do so.

@andrewor14
Copy link
Contributor Author

By the way I updated the screenshot and fixed a heartbeat receiver race condition that was introduced in this patch previously.

@andrewor14
Copy link
Contributor Author

It would be good for @zsxwing and @vanzin to have a look, since this modifies some code both of you have touched in the recent past.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vanzin I believe these two lines are functionally equivalent to what you had in the old code (L384). I believe the only difference is that I moved the askWithReply out of the synchronized block. Please let me me know if this is not the case.

What it used to look like:

val newTotal = (numExistingExecutors + numPendingExecutors - executorsPendingToRemove.size
  - filteredExecutorIds.size)
...
executorsPendingToRemove ++= filteredExecutorIds

@vanzin
Copy link
Contributor

vanzin commented May 21, 2015

Looks sane to me.

@SparkQA
Copy link

SparkQA commented May 21, 2015

Test build #33279 has finished for PR 6310 at commit 95b26b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorLossReason(val message: String, val isError: Boolean = true)
    • case class RemoveExecutor(executorId: String, reason: String, isError: Boolean)

@SparkQA
Copy link

SparkQA commented May 21, 2015

Test build #33283 has finished for PR 6310 at commit d5ab486.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorLossReason(val message: String, val isError: Boolean = true)
    • case class RemoveExecutor(executorId: String, reason: String, isError: Boolean)

@SparkQA
Copy link

SparkQA commented May 21, 2015

Test build #33277 timed out for PR 6310 at commit 3bf89d4 after a configured wait of 150m.

Now the heartbeat receiver will never create a new entry for an
executor that has not registered.
@andrewor14
Copy link
Contributor Author

After testing further I found that the race condition wasn't fully resolved. This should be actually fixed in the latest commit. (By the way all the race condition does is print out more error messages; it doesn't actually change execution behavior in any way).

@SparkQA
Copy link

SparkQA commented May 22, 2015

Test build #33311 has finished for PR 6310 at commit 17af2ee.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorLossReason(val message: String, val isError: Boolean = true)
    • case class RemoveExecutor(executorId: String, reason: String, isError: Boolean)

Andrew Or added 2 commits May 22, 2015 14:52
This covers not only the change in behavior introduced in this
patch, but also existing behavior (e.g. expire dead hosts) that
was simply not tested. This commit also refactors the test in a
way that eliminates the existing duplicate code.
@andrewor14 andrewor14 changed the title [SPARK-4134] Dynamic allocation: Tone down kill error messages [SPARK-4134] [SPARK-7835] Dynamic allocation: Tone down kill error messages May 22, 2015
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved these here to increase test determinism

Andrew Or added 2 commits May 22, 2015 15:39
It turns out that `send` doesn't actually transmit messages to
`receiveAndReply`. This means we need to duplicate a few messages
in both `receive` and `receiveAndReply` if we want both `send`
and `askWithRetry` to work.
It's a method, not a value.
@SparkQA
Copy link

SparkQA commented May 22, 2015

Test build #33374 has finished for PR 6310 at commit 45b134b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make SystemClock as a default value to save one constructor, please?

@SparkQA
Copy link

SparkQA commented May 22, 2015

Test build #33368 has finished for PR 6310 at commit 44f0617.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 22, 2015

Test build #33366 has finished for PR 6310 at commit ec1cbf5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorLossReason(val message: String, val isError: Boolean = true)
    • case class RemoveExecutor(executorId: String, reason: String, isError: Boolean)

@SparkQA
Copy link

SparkQA commented May 23, 2015

Test build #33376 has finished for PR 6310 at commit 9337e9d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorLossReason(val message: String, val isError: Boolean = true)
    • case class RemoveExecutor(executorId: String, reason: String, isError: Boolean)

@andrewor14 andrewor14 changed the title [SPARK-4134] [SPARK-7835] Dynamic allocation: Tone down kill error messages [SPARK-4134] [SPARK-7835] [WIP] Dynamic allocation: Tone down kill error messages May 24, 2015
@andrewor14
Copy link
Contributor Author

I discovered a few more issues with this patch. Putting this on hold for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to factor the changes in this class out to a separate patch

@andrewor14
Copy link
Contributor Author

I'm going to open a new patch later once I spend some time to figure out the issue with this one. In the mean time let's not block the refactoring of the HeartbeatReceiverSuite on this (see #7173).

Note to self: DO NOT delete this branch!

@andrewor14 andrewor14 closed this Jul 2, 2015
@andrewor14 andrewor14 deleted the da-tone-down branch September 23, 2015 22:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants