Skip to content

Conversation

@andrewor14
Copy link
Contributor

The default add time of 5s is still too slow for small jobs. Also, the current default remove time of 10 minutes seem rather high. This patch lowers both and rephrases a few log messages.

@andrewor14
Copy link
Contributor Author

@sryza

@sryza
Copy link
Contributor

sryza commented May 20, 2015

This seems like a good idea to me. I could even support going lower (e.g. hundreds of ms) on the add time, given that we now cap at the number of executors needed to satisfy tasks and cancel pending requests.

@SparkQA
Copy link

SparkQA commented May 21, 2015

Test build #33186 has finished for PR 6301 at commit 3320710.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14 andrewor14 changed the title [SPARK-7771] Dynamic allocation: lower default timeouts further [SPARK-7771] [SPARK-7779] Dynamic allocation: lower default timeouts further May 21, 2015
@andrewor14
Copy link
Contributor Author

I just tried this out on a real cluster and it works well. The output from ExecutorAllocationManager are now more understandable:

INFO ExecutorAllocationManager: Lowering target number of executors to 305 because not all requests are actually needed (previously 307)
INFO ExecutorAllocationManager: Lowering target number of executors to 301 because not all requests are actually needed (previously 305)
INFO ExecutorAllocationManager: Lowering target number of executors to 298 because not all requests are actually needed (previously 301)
...

Before this patch, we were getting a bunch of Requesting 0 new executor(s) (see SPARK-7779).

@SparkQA
Copy link

SparkQA commented May 21, 2015

Test build #33204 has finished for PR 6301 at commit 5fcd3eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 21, 2015

Test build #33205 has finished for PR 6301 at commit 2811492.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Copy link
Contributor Author

@sryza It's good that we support it being that low but I'd rather be a little conservative for default values.

EDIT: actually we don't support it being in milliseconds currently because we cast the time to seconds, so this is the lowest it can ever get without some refactoring.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the rationale behind bumping this up to info? It seems like there will be common situations (any time that there are sustained pending tasks but we've hit our max) where this will end up printing a log every second. Which is a lot of noise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops I'm not sure that was an intentional change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though at some point it might be good to log it in a limited way. It will introduce another variable but at least the user will know when s/he hits the cap.

@SparkQA
Copy link

SparkQA commented May 21, 2015

Test build #33280 has finished for PR 6301 at commit 6d614a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pwendell
Copy link
Contributor

LGTM - makes sense to lower these thresholds.

@andrewor14
Copy link
Contributor Author

Thanks @pwendell @sryza. Merging into master and 1.4.

asfgit pushed a commit that referenced this pull request May 23, 2015
…further

The default add time of 5s is still too slow for small jobs. Also, the current default remove time of 10 minutes seem rather high. This patch lowers both and rephrases a few log messages.

Author: Andrew Or <andrew@databricks.com>

Closes #6301 from andrewor14/da-minor and squashes the following commits:

6d614a6 [Andrew Or] Lower log level
2811492 [Andrew Or] Log information when requests are canceled
5fcd3eb [Andrew Or] Fix tests
3320710 [Andrew Or] Lower timeouts + rephrase a few log messages

(cherry picked from commit 3d8760d)
Signed-off-by: Andrew Or <andrew@databricks.com>
@asfgit asfgit closed this in 3d8760d May 23, 2015
@andrewor14 andrewor14 deleted the da-minor branch May 23, 2015 00:40
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
…further

The default add time of 5s is still too slow for small jobs. Also, the current default remove time of 10 minutes seem rather high. This patch lowers both and rephrases a few log messages.

Author: Andrew Or <andrew@databricks.com>

Closes apache#6301 from andrewor14/da-minor and squashes the following commits:

6d614a6 [Andrew Or] Lower log level
2811492 [Andrew Or] Log information when requests are canceled
5fcd3eb [Andrew Or] Fix tests
3320710 [Andrew Or] Lower timeouts + rephrase a few log messages
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
…further

The default add time of 5s is still too slow for small jobs. Also, the current default remove time of 10 minutes seem rather high. This patch lowers both and rephrases a few log messages.

Author: Andrew Or <andrew@databricks.com>

Closes apache#6301 from andrewor14/da-minor and squashes the following commits:

6d614a6 [Andrew Or] Lower log level
2811492 [Andrew Or] Log information when requests are canceled
5fcd3eb [Andrew Or] Fix tests
3320710 [Andrew Or] Lower timeouts + rephrase a few log messages
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
…further

The default add time of 5s is still too slow for small jobs. Also, the current default remove time of 10 minutes seem rather high. This patch lowers both and rephrases a few log messages.

Author: Andrew Or <andrew@databricks.com>

Closes apache#6301 from andrewor14/da-minor and squashes the following commits:

6d614a6 [Andrew Or] Lower log level
2811492 [Andrew Or] Log information when requests are canceled
5fcd3eb [Andrew Or] Fix tests
3320710 [Andrew Or] Lower timeouts + rephrase a few log messages
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants