
Conversation

@andrewor14
Contributor

This is based on @KaiXinXiaoLei's changes in #7716.

The issue is that if someone calls sc.killExecutor("1") on the same executor twice in quick succession, the executor target is adjusted downwards by 2 instead of 1, even though only one executor is actually being killed. In cases where we don't adjust the target back upwards quickly enough, jobs can end up hanging.

This is a common danger because there are many places where this is called:

  • HeartbeatReceiver kills an executor that has not been sending heartbeats
  • ExecutorAllocationManager kills an executor that has been idle
  • The user code might call this, which may interfere with the previous callers

While it's not clear whether this fixes SPARK-9745, fixing this potential race condition seems like a strict improvement. I've added a regression test to illustrate the issue.
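
The fix amounts to not decrementing the target again for an executor that is already pending removal. Below is a minimal sketch of that idea, using hypothetical names rather than the actual CoarseGrainedSchedulerBackend internals:

import scala.collection.mutable

// Hypothetical stand-in for the backend's kill bookkeeping. The point is that
// a second kill request for the same executor must not lower the target again.
class KillTracker(initialTarget: Int) {
  private val pendingToRemove = mutable.HashSet[String]()
  private var executorTarget = initialTarget

  def killExecutors(ids: Seq[String]): Unit = {
    // Only executors not already pending removal count towards the adjustment
    val newKills = ids.filterNot(pendingToRemove.contains)
    pendingToRemove ++= newKills
    executorTarget -= newKills.size
  }

  def target: Int = executorTarget
}

object KillTrackerDemo extends App {
  val tracker = new KillTracker(initialTarget = 10)
  tracker.killExecutors(Seq("1"))
  tracker.killExecutors(Seq("1"))  // duplicate request: not counted again
  assert(tracker.target == 9)      // without the filter this would be 8
}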

@andrewor14
Contributor Author

@vanzin @srowen could you have a look?

@andrewor14 andrewor14 changed the title [SPARK-9795] Dynamic allocation: avoid double counting when killing same executor [SPARK-9795] Dynamic allocation: avoid double counting when killing same executor twice Aug 10, 2015
Contributor

Nit: It looks like unknownExecutors is useful only for that one log statement. If we don't need that log line, we could reduce the number of set copies and traversals. How about something like this (this is less scala-like, but reduces the number of traversals and copies):

import scala.collection.mutable.HashSet

// Single pass: collect only the executor ids the backend actually knows about
val knownExecutors = new HashSet[String]
executorIds.foreach { id =>
  if (executorDataMap.contains(id)) {
    knownExecutors += id
  }
}

This also makes the other changes in this file unnecessary.
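
For context, the more scala-like shape this nit is comparing against would be a single partition call; this is a rough sketch, with the warning message wording assumed rather than taken from the patch:

val (knownExecutors, unknownExecutors) =
  executorIds.partition(id => executorDataMap.contains(id))
unknownExecutors.foreach { id =>
  logWarning(s"Executor to kill $id does not exist!")
}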

Contributor Author

That looks fine, but in this patch I wanted to limit the scope of the changes so I'm going to leave this as is.

@harishreedharan
Contributor

LGTM

@vanzin
Contributor

vanzin commented Aug 10, 2015

LGTM

@SparkQA

SparkQA commented Aug 10, 2015

Test build #40318 has finished for PR 8078 at commit fb149da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 11, 2015

Test build #1430 timed out for PR 8078 at commit fb149da after a configured wait of 175m.

@SparkQA

SparkQA commented Aug 11, 2015

Test build #1429 timed out for PR 8078 at commit fb149da after a configured wait of 175m.

@andrewor14
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 11, 2015

Test build #40479 has finished for PR 8078 at commit fb149da.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor Author

retest this please

@andrewor14
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 11, 2015

Test build #40508 has finished for PR 8078 at commit fb149da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 12, 2015

Test build #1460 has finished for PR 8078 at commit fb149da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Jenkins, retest this please.

@andrewor14
Contributor Author

retest this please, just in case

@SparkQA

SparkQA commented Aug 12, 2015

Test build #40593 has finished for PR 8078 at commit fb149da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor Author

retest this please. Pretty sure I didn't change any SQL...

@SparkQA

SparkQA commented Aug 12, 2015

Test build #40596 has finished for PR 8078 at commit fb149da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 12, 2015

Test build #40608 has finished for PR 8078 at commit fb149da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor Author

Alright, the latest commit actually passed tests, so I'm going to merge this into master and 1.5.

asfgit pushed a commit that referenced this pull request Aug 12, 2015
[SPARK-9795] Dynamic allocation: avoid double counting when killing same executor twice

This is based on KaiXinXiaoLei's changes in #7716.

The issue is that when someone calls `sc.killExecutor("1")` on the same executor twice quickly, then the executor target will be adjusted downwards by 2 instead of 1 even though we're only actually killing one executor. In certain cases where we don't adjust the target back upwards quickly, we'll end up with jobs hanging.

This is a common danger because there are many places where this is called:
- `HeartbeatReceiver` kills an executor that has not been sending heartbeats
- `ExecutorAllocationManager` kills an executor that has been idle
- The user code might call this, which may interfere with the previous callers

While it's not clear whether this fixes SPARK-9745, fixing this potential race condition seems like a strict improvement. I've added a regression test to illustrate the issue.

Author: Andrew Or <andrew@databricks.com>

Closes #8078 from andrewor14/da-double-kill.

(cherry picked from commit be5d191)
Signed-off-by: Andrew Or <andrew@databricks.com>
@asfgit asfgit closed this in be5d191 Aug 12, 2015
@andrewor14 andrewor14 deleted the da-double-kill branch August 12, 2015 16:26
CodingCat pushed a commit to CodingCat/spark that referenced this pull request Aug 17, 2015
[SPARK-9795] Dynamic allocation: avoid double counting when killing same executor twice

This is based on KaiXinXiaoLei's changes in apache#7716.

The issue is that when someone calls `sc.killExecutor("1")` on the same executor twice quickly, then the executor target will be adjusted downwards by 2 instead of 1 even though we're only actually killing one executor. In certain cases where we don't adjust the target back upwards quickly, we'll end up with jobs hanging.

This is a common danger because there are many places where this is called:
- `HeartbeatReceiver` kills an executor that has not been sending heartbeats
- `ExecutorAllocationManager` kills an executor that has been idle
- The user code might call this, which may interfere with the previous callers

While it's not clear whether this fixes SPARK-9745, fixing this potential race condition seems like a strict improvement. I've added a regression test to illustrate the issue.

Author: Andrew Or <andrew@databricks.com>

Closes apache#8078 from andrewor14/da-double-kill.