[spark] ray on spark creates spark job using stage scheduling #31397

WeichenXu123 · 2023-01-03T10:04:40Z

Why are these changes needed?

- Ray on Spark now creates its Spark job using stage-level scheduling, so that the Ray cluster's Spark job can use its own task resource configuration (spark.task.cpus / spark.task.resource.gpu.amount). Otherwise it has to use the Spark application-level configuration, which is inconvenient on Databricks. Two new arguments are added: num_cpus_per_node and num_gpus_per_node.
- Improve the Ray worker memory allocation computation.
- Refactor the _init_ray_cluster interface so that it is better suited to instrumentation logging patching: arguments are now keyword-only, some arguments are adjusted, and all arguments are validated values.
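A minimal sketch of the idea described above, not the code in this PR: per-node values are validated through a keyword-only helper and then attached to a single stage via PySpark's stage-level scheduling API instead of application-level config. The helper names `task_resource_config` and `with_stage_level_resources` are hypothetical; `TaskResourceRequests`, `ResourceProfileBuilder` and `RDD.withResources` are existing PySpark APIs, imported lazily so the validation part runs without Spark installed.

```python
from typing import Dict, Optional


def task_resource_config(
    *, num_cpus_per_node: int, num_gpus_per_node: Optional[int] = None
) -> Dict[str, int]:
    """Validate per-node resources and map them to Spark task-level keys.

    Hypothetical helper: arguments are keyword-only, mirroring the refactor
    described in the PR text, but this is not the PR's implementation.
    """
    if not isinstance(num_cpus_per_node, int) or num_cpus_per_node <= 0:
        raise ValueError("num_cpus_per_node must be a positive integer")
    if num_gpus_per_node is not None and (
        not isinstance(num_gpus_per_node, int) or num_gpus_per_node < 0
    ):
        raise ValueError("num_gpus_per_node must be a non-negative integer")

    config = {"spark.task.cpus": num_cpus_per_node}
    if num_gpus_per_node:
        config["spark.task.resource.gpu.amount"] = num_gpus_per_node
    return config


def with_stage_level_resources(
    rdd, *, num_cpus_per_node: int, num_gpus_per_node: Optional[int] = None
):
    """Attach a stage-level resource profile to `rdd` (assumes pyspark is installed)."""
    # Imported lazily so the validation helper above works without pyspark.
    from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests

    config = task_resource_config(
        num_cpus_per_node=num_cpus_per_node, num_gpus_per_node=num_gpus_per_node
    )
    task_reqs = TaskResourceRequests().cpus(config["spark.task.cpus"])
    if "spark.task.resource.gpu.amount" in config:
        task_reqs = task_reqs.resource(
            "gpu", config["spark.task.resource.gpu.amount"]
        )
    # `build` is a property on ResourceProfileBuilder, not a method call.
    profile = ResourceProfileBuilder().require(task_reqs).build
    return rdd.withResources(profile)
```

Because the arguments are keyword-only, a positional call such as `task_resource_config(4)` raises `TypeError`, which keeps call sites explicit and makes the function easier to wrap for logging.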

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 · 2023-01-09T13:13:38Z

python/requirements_test.txt

+# TODO: Replace this with pyspark==3.4 once it is released.
+https://ml-team-public-read.s3.us-west-2.amazonaws.com/pyspark-3.4.0.dev0.tar.gz


Note:

The Spark stage scheduling feature (on standalone-mode Spark clusters) is introduced in apache/spark 3.4 (which will be released this month).

So, for testing purposes, I built a package from apache/spark master.
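Since the feature depends on Spark 3.4, a caller could gate on the Spark version before relying on it. The following is a small sketch under that assumption; `parse_major_minor` and `supports_stage_scheduling` are hypothetical names and this is not necessarily how the PR performs the check.

```python
import re
from typing import Tuple

# Assumption taken from the note above: standalone-mode stage scheduling needs 3.4+.
MIN_SPARK_VERSION = (3, 4)


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '3.4.0.dev0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized Spark version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def supports_stage_scheduling(version: str) -> bool:
    """Return True if the given Spark version is at least MIN_SPARK_VERSION."""
    return parse_major_minor(version) >= MIN_SPARK_VERSION
```

In practice the version string would come from `pyspark.__version__`; dev builds such as the one linked above parse the same way as release versions.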

ericl · 2023-01-09T19:06:58Z

@amogkam @jjyao can you review this?

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

doc/source/ray-core/package-ref.rst

python/ray/util/spark/cluster_init.py

python/ray/util/spark/utils.py

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

amogkam

Thanks @weichen123! Overall LGTM, left some small comments.

Will leave to @jjyao for final approval and merge.

doc/source/ray-core/package-ref.rst

python/ray/util/spark/__init__.py

python/ray/util/spark/cluster_init.py

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

doc/source/cluster/vms/user-guides/community/spark.rst

python/ray/util/spark/cluster_init.py

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

python/ray/util/spark/cluster_init.py

jjyao · 2023-01-17T13:32:57Z

python/ray/util/spark/utils.py

+    )
+
+
+def get_avail_mem_per_ray_worker_node(


Can be addressed in a follow-up PR: we should add type annotations even to private methods.

jjyao · 2023-01-17T13:37:57Z

You will also need @ericl approval for doc changes.

python/ray/util/spark/cluster_init.py

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 · 2023-01-19T01:01:12Z

CC @ericl Could you approve the doc part of the PR ? Thank you!

WeichenXu123 · 2023-01-20T14:12:23Z

CC @ericl Would you help approve the doc changes in the PR? Thank you!

maxpumperla

approval for docs changes

jjyao · 2023-01-24T12:45:42Z

Failed tests are unrelated.

WeichenXu123 added 6 commits December 26, 2022 20:06

init

c382364

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

b86444f

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

82ae131

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

86e6acd

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

d1f1f35

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

b344320

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 requested review from architkulkarni, wuisawesome, DmitriGekhtman, maxpumperla, pcmoritz and a team as code owners January 3, 2023 10:04

WeichenXu123 added 5 commits January 3, 2023 20:25

update

eb00816

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

55b4fc1

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

fix lint

422473e

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update tests

9de7cdc

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

fix tests

36ae2ef

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

ericl assigned jjyao and amogkam Jan 6, 2023

WeichenXu123 added 2 commits January 9, 2023 20:33

update tests

2ec6fbf

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update test-req

3f4a648

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 commented Jan 9, 2023


WeichenXu123 added 3 commits January 11, 2023 16:19

set object mem mimimum

caafcdd

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update tests

1b9e481

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

improve hook

ef535e2

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

jjyao reviewed Jan 11, 2023


WeichenXu123 added 3 commits January 12, 2023 16:07

fix get_max_num_concurrent_tasks

ece7321

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update test

98e7ff5

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

add mem calc warning

01a78fc

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 added 4 commits January 12, 2023 17:59

fix

9ea2986

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

ea029dc

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update api

017fcde

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update tests

c287a3b

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

amogkam reviewed Jan 14, 2023

doc/source/ray-core/package-ref.rst (outdated, resolved)

python/ray/util/spark/__init__.py (resolved)

python/ray/util/spark/cluster_init.py (outdated, resolved)

python/ray/util/spark/cluster_init.py (resolved)

WeichenXu123 added 3 commits January 16, 2023 11:38

address comments

9a73a0a

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

Merge branch 'master' into stage-scheduling

1936024

add node_options valiation

e73d36d

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 mentioned this pull request Jan 16, 2023

[spark] allow ray on spark to disable dashboard #31348

Closed

7 tasks

update

109ae7e

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

jjyao reviewed Jan 16, 2023


WeichenXu123 added 4 commits January 17, 2023 11:11

address comments

d1d4bda

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

nit update

5182dd6

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

fix lint

bfa8f51

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

fix lint

4c6e301

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

jjyao approved these changes Jan 17, 2023


harupy reviewed Jan 18, 2023

python/ray/util/spark/cluster_init.py (outdated, resolved)

WeichenXu123 added 4 commits January 18, 2023 11:37

address comments

5463e05

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

nit updates

4fcb2ff

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

nit update

300fb2b

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

fix lint

20df1f3

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

maxpumperla approved these changes Jan 23, 2023


Merge branch 'master' into stage-scheduling

985cf97

jjyao merged commit aa7d5d9 into ray-project:master Jan 24, 2023
