[SPARK-20327][CORE][YARN] Add CLI support for YARN custom resources, like GPUs #20761

szilard-nemeth · 2018-03-07T17:52:05Z

What changes were proposed in this pull request?

This PR adds CLI support for YARN custom resources, e.g. GPUs and any other resources YARN defines.
The custom resources are defined with Spark properties; no additional CLI arguments were introduced.

The properties can be defined in the following form:

AM resources, client mode:
Format: spark.yarn.am.resource.<resource-name>
The property name follows the naming convention of the YARN AM cores / memory properties: spark.yarn.am.memory and spark.yarn.am.cores.

Driver resources, cluster mode:
Format: spark.yarn.driver.resource.<resource-name>
The property name follows the naming convention of driver cores / memory properties: spark.driver.memory and spark.driver.cores.

Executor resources:
Format: spark.yarn.executor.resource.<resource-name>
The property name follows the naming convention of executor cores / memory properties: spark.executor.memory / spark.executor.cores.

For the driver resources (cluster mode) and executor resources properties, we use the yarn prefix here as custom resource types are specific to YARN, currently.
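
As a rough illustration of the naming scheme above, the following Python sketch groups custom resource requests by their property prefix. The function name `custom_resources`, the dictionary-based config, and the `fpga` resource name are hypothetical and only serve the example; the patch itself is written in Scala and may be structured differently.

```python
# Hypothetical sketch: collect custom resource requests per role, based on
# the three property prefixes described above. Not the patch's actual code.
PREFIXES = {
    "am": "spark.yarn.am.resource.",
    "driver": "spark.yarn.driver.resource.",
    "executor": "spark.yarn.executor.resource.",
}


def custom_resources(conf, role):
    """Return {resource-name: amount} for role 'am', 'driver' or 'executor'."""
    prefix = PREFIXES[role]
    return {
        key[len(prefix):]: value
        for key, value in conf.items()
        # Skip keys that are exactly the prefix (empty resource name).
        if key.startswith(prefix) and len(key) > len(prefix)
    }


conf = {
    "spark.executor.memory": "1G",            # standard setting, ignored here
    "spark.yarn.executor.resource.gpu": "3G",
    "spark.yarn.am.resource.fpga": "1",       # 'fpga' is an assumed name
}
print(custom_resources(conf, "executor"))  # {'gpu': '3G'}
print(custom_resources(conf, "driver"))    # {}
```

Standard settings such as spark.executor.memory do not match any of the prefixes, so they are left to the existing Spark code paths.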

Validation:
Please note that validation logic was added to prevent the same resource from being requested in two ways. For example, defining the following configs:

"--conf", "spark.driver.memory=2G",
"--conf", "spark.yarn.driver.resource.memory=1G"

will not start execution and will print an error message.
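
The check described above can be sketched roughly as follows. This is a minimal Python illustration, assuming a plain dictionary of properties; the `FORBIDDEN` table, the `validate` function and the error text are hypothetical and are not taken from the patch.

```python
# Hypothetical sketch: reject memory requests made through the new
# custom-resource properties and point to the existing Spark property.
FORBIDDEN = {
    "spark.yarn.am.resource.memory": "spark.yarn.am.memory",
    "spark.yarn.driver.resource.memory": "spark.driver.memory",
    "spark.yarn.executor.resource.memory": "spark.executor.memory",
}


def validate(conf):
    """Raise ValueError if memory is requested via a custom-resource key."""
    offending = sorted(key for key in FORBIDDEN if key in conf)
    if offending:
        hints = ", ".join(f"{key} (use {FORBIDDEN[key]})" for key in offending)
        raise ValueError(f"Standard resources must not be set via: {hints}")


try:
    validate({
        "spark.driver.memory": "2G",
        "spark.yarn.driver.resource.memory": "1G",
    })
except ValueError as err:
    print(err)
```

Failing before submission, rather than letting one of the two values win silently, matches the behavior described above: the job does not start and an error is printed.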

How was this patch tested?

Unit tests + manual execution with Hadoop 2 and Hadoop 3 builds.

Testing was performed on a real cluster with Spark and YARN configured:
- Cluster and client mode
- Requesting resource types with lowercase and uppercase units
- Starting a Spark job that requests only standard resources (mem / cpu)
- Error handling cases:
  - Requesting an unknown resource type.
  - Requesting a resource type (either memory / cpu) with duplicate configs at the same time, e.g. with this config:

--conf spark.yarn.am.resource.memory=1G \
  --conf spark.yarn.driver.resource.memory=2G \
  --conf spark.yarn.executor.resource.memory=3G

    ResourceTypeValidator detects these cases, so the request is not permitted.
  - Requesting a standard resource (memory / cpu) with the new-style configs, e.g. --conf spark.yarn.am.resource.memory=1G. This is not permitted and is handled cleanly.

An example of how I ran the test cases:

cd ~;export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop/;
./spark-2.4.0-SNAPSHOT-bin-custom-spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 1G \
  --driver-cores 1 \
  --executor-memory 1G \
  --executor-cores 1 \
  --conf spark.logConf=true \
  --conf spark.yarn.executor.resource.gpu=3G \
  --verbose \
  ./spark-2.4.0-SNAPSHOT-bin-custom-spark/examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar \
  10;
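
The test list above mentions amounts with lowercase and uppercase units, and the example requests gpu=3G. A minimal Python sketch of splitting such an amount into its numeric part and unit suffix could look like this. The helper name `split_amount` and the accepted pattern are assumptions for illustration; how YARN interprets each unit is outside this sketch.

```python
import re

# Assumed format: an integer followed by an optional alphabetic unit,
# e.g. "3G", "3g" or "2". The real parsing rules may differ.
_AMOUNT = re.compile(r"(\d+)([A-Za-z]*)")


def split_amount(raw):
    """Split a resource amount such as '3G' into (3, 'G')."""
    match = _AMOUNT.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"Invalid resource amount: {raw!r}")
    # The unit is returned exactly as typed; case handling is left to the caller.
    return int(match.group(1)), match.group(2)


print(split_amount("3G"))  # (3, 'G')
print(split_amount("2"))   # (2, '')
```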

vanzin · 2018-03-07T19:00:23Z

Hi, please take some time to look at the how to contribute page:
https://spark.apache.org/contributing.html

You'll find the code conventions we use there. You have many style discrepancies in your code.

Also, given this is a YARN feature, there should be zero changes in core. All you're trying to do should be possible by just parsing user-provided config options in YARN's Client.scala.

vanzin · 2018-03-14T17:42:33Z

ok to test

SparkQA · 2018-03-14T18:06:59Z

Test build #88235 has finished for PR 20761 at commit 11aaa69.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin

The documentation needs to be clear that this is not available in all Hadoop versions.

I'm also confused about how cores and memory are handled; what if the user sets both?

Optimally that would not be allowed, and the existing Spark configs should be used, and the new code would only apply to other resources.

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeValidator.scala

resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala

szilard-nemeth · 2018-03-22T19:02:43Z

Thanks for your comments @vanzin !
The earliest I can resolve these and push the fixes is the week after next, as I will not be available until then (vacation).

jerryshao · 2018-05-03T06:49:08Z

Hi @szyszy are you still going to work on this PR?

szilard-nemeth · 2018-05-03T07:08:06Z

Hey @jerryshao! Yes, I have addressed almost all of Marcelo's comments, so please expect a PR update in the coming days.

jerryshao · 2018-05-03T08:08:04Z

Cool, thanks!

szilard-nemeth · 2018-05-13T17:52:49Z

@vanzin :
Please check my updates!

Reply to your main comment from above:
For the documentation: please advise which documentation I should update, and where.
For your question about the case when the user sets both: that's why I created the ResourceTypeValidator class. As per our discussion, ResourceTypeValidator will not accept memory / vcores resource requests made with the new YARN configs, e.g. spark.yarn.executor.resource.memory will be rejected with an error message.
I mentioned this in another comment, but it's worth stating here as well: I refactored ResourceTypeValidator and added a bunch of javadoc comments.
I also described how validation works in the PR's main description.
Do you agree with this kind of validation, or do you have something else in mind?

SparkQA · 2018-05-13T18:08:10Z

Test build #90553 has finished for PR 20761 at commit e0f40e0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin

Only a partial review. I need to pick up again from the validator code.

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala

galv

Thanks for this work. It's very valuable for me. I'm not a spark maintainer, but would be happy to finish reviewing this for you shortly. Do you need help with testing?

docs/running-on-yarn.md

galv · 2018-05-19T05:57:35Z

~~I may be missing something, but doesn't this change depend on Spark's yarn client being made hadoop 3.0 compatible?~~

Edit: Never mind, didn't realize you were using reflection to get around it.

...urce-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ResourceTypeHelperSuite.scala

galv

I hope this was helpful to you. Let me know if you have any questions. I'm thinking that you may be able to remove the ResourceTypeHelper class, which could make this commit quite a bit smaller, depending on what you think.

...urce-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ResourceTypeHelperSuite.scala

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala

resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala

SparkQA · 2018-05-22T12:19:41Z

Test build #90955 has finished for PR 20761 at commit d1c7674.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-05-22T12:24:41Z

Test build #90958 has finished for PR 20761 at commit 6aab82f.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-05-22T14:35:16Z

Test build #90968 has finished for PR 20761 at commit ec549f0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

szilard-nemeth · 2018-05-22T19:34:46Z

@vanzin , @galv : I think all of your comments are either addressed or answered, please check once again!

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala

galv

Have you integration tested this against a running yarn cluster?

I can probably finish up sometime tomorrow.

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeValidator.scala

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala

docs/running-on-yarn.md

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeValidator.scala

galv · 2018-06-15T04:44:26Z

I would like to see this merged, though I got derailed by Spark Summit last week and other things. I will look this patch over again soon. @szyszy, if you're busy lately, perhaps I can take over the rest of the code changes suggested by @vanzin, if necessary (I get the impression that this PR is just about ready to merge).

@vanzin I appreciate your detailed responses. I'm curious whether you have any overarching serious concerns about this patch, e.g., about its design. I think the interface is fairly appropriate, but I thought I should check whether you think this PR will be ready to merge soon.

szilard-nemeth · 2018-06-15T08:46:48Z

Hi @galv!
Thanks for reaching out.
Actually, I'm quite busy with other things, which is why I haven't been working on this lately.
It would be great, and I would be really happy, if you could take this over.
Please confirm whether you can do this; the soonest I could work on this again is next week, approximately.

vanzin · 2018-06-15T16:14:14Z

I don't have issues with the design - I think the main two things I was concerned about were:

not adding another way to set existing Spark options like mem and cores, which has been addressed
the seemingly unnecessary complexity in certain parts of the code like the validator

SparkQA · 2018-10-10T18:45:16Z

Test build #97212 has finished for PR 20761 at commit 7e7a55a.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-10T19:16:15Z

Test build #97213 has finished for PR 20761 at commit 55c7be9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceRequestHelper.scala

resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala

...e-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ResourceRequestHelperSuite.scala

resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala

resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala

SparkQA · 2018-10-10T20:36:23Z

Test build #97215 has finished for PR 20761 at commit adddf6e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-10T21:07:39Z

Test build #97216 has finished for PR 20761 at commit f1c2e41.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-10T21:50:32Z

Test build #97218 has finished for PR 20761 at commit f509153.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin

Almost there, one odd thing left in the validator.

vanzin · 2018-10-10T23:13:27Z

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceRequestHelper.scala

I went and looked at the documentation because I remember this being confusing. The documentation mentions both memory and memory-mb as being valid, with the latter being preferred. So it sounds to me like you can use either, and that this code should disallow both.

You even initialize memory-mb in your tests, instead of memory.

Still waiting for a word on this.

Sure!
Did you mean this documentation?
https://hadoop.apache.org/docs/r3.0.1/hadoop-yarn/hadoop-yarn-site/ResourceModel.html
I think we need to check all the keys for memory / vcore that YARN deprecates, as those will flow through Spark and eventually reach YARN's ResourceInformation, which will just blow up because only memory-mb and vcores are not deprecated. This hasn't caused a problem with the current Spark code because it uses the Resource object and does not use ResourceInformation at all.
So we need to disallow these:

cpu-vcores

memory

mb

What do you think?
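
For illustration, the rejection discussed in this thread might be sketched like this in Python. The set contains only the resource names mentioned in this conversation; `rejected_keys` and the dictionary-based config are hypothetical, and the list of names is not claimed to be complete.

```python
# Hypothetical sketch: find properties that try to set a standard resource
# (memory / CPU) through the custom-resource prefixes.
STANDARD_NAMES = {"memory", "mb", "memory-mb", "vcores", "cpu-vcores"}
ROLE_PREFIXES = (
    "spark.yarn.am.resource.",
    "spark.yarn.driver.resource.",
    "spark.yarn.executor.resource.",
)


def rejected_keys(conf):
    """Return the sorted config keys that name a standard resource."""
    return sorted(
        key
        for key in conf
        for prefix in ROLE_PREFIXES
        if key.startswith(prefix) and key[len(prefix):] in STANDARD_NAMES
    )


conf = {
    "spark.yarn.executor.resource.memory-mb": "1024",
    "spark.yarn.executor.resource.gpu": "2",
    "spark.yarn.am.resource.cpu-vcores": "1",
}
print(rejected_keys(conf))
# ['spark.yarn.am.resource.cpu-vcores', 'spark.yarn.executor.resource.memory-mb']
```

Checking against every known spelling, rather than only the deprecated ones, keeps memory and cores configurable solely through the existing Spark settings.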

I'm not familiar with the YARN code or what it does here.

I'm just worried about users setting cpu/memory resources outside of the proper Spark settings, and also the inconsistency in your code (using both memory and memory-mb).

These are two separate things:

One is that I don't yet reject all the deprecated standard resource names known to YARN (explained in my previous comment), which I will address soon. To be exact, I need to reject not just the deprecated names, but all possible ways to define the standard resources for memory and CPU cores.

Using memory-mb is the only way to initialize the memory resource with the YARN client, with the method ResourceUtils.reinitializeResources.
I played around with this a bit: if I omit the standard resources, specify only custom resources, and then call ResourceUtils.reinitializeResources, an internal YARN exception is thrown, because the method relies on the standard resources always being specified when it is invoked.
Unfortunately, invoking this method is the simplest way to build tests on custom resource types, to the best of my knowledge, so I can't really do much about this.

> and also the inconsistency in your code (using both memory and memory-mb).

What did you mean by this? The only use of "memory" anywhere in the change is to prevent it from being used with the new resource configs.

> What did you mean by this?

I meant you were initializing memory-mb in tests but checking only memory here. That smells like you should be checking memory-mb here.

These kinds of things should have comments in the code so in the future we know why they are that way.

Please see my last commit with the updates.
I only added some tests, so they are not exhaustive for every combination of Spark resources and YARN standard resources. If you think it's needed, I can add more test cases, but I think this is fine as it is.

Sure, adding some explanatory comments with my next commit.

I think the code is now complete, please check!

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceRequestHelper.scala

...e-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ResourceRequestHelperSuite.scala

...ce-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ResourceRequestTestHelper.scala

SparkQA · 2018-10-11T21:07:57Z

Test build #97282 has finished for PR 20761 at commit f3955f8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.


SparkQA · 2018-10-12T21:26:35Z

Test build #97320 has finished for PR 20761 at commit 72e014b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-12T21:46:36Z

Test build #97321 has finished for PR 20761 at commit dc2e382.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2018-10-13T01:13:50Z

Merging to master.

felixcheung · 2018-10-22T06:28:13Z

docs/running-on-yarn.md

+  <td><code>(none)</code></td>
+  <td>
+    Amount of resource to use for the YARN Application Master in client mode.
+    In cluster mode, use <code>spark.yarn.driver.resource.&lt;resource-type&gt;</code> instead.


nit: looks like spark.yarn.driver.resource.<resource-type> should be spark.yarn.driver.resource.{resource-type}

(yes, I realize resource-type is to be replaced with)

…like GPUs

Closes apache#20761 from szyszy/SPARK-20327.

Authored-by: Szilard Nemeth <snemeth@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

szilard-nemeth force-pushed the SPARK-20327 branch from 48b6baa to 7c3f186 on March 14, 2018 17:09

vanzin reviewed Mar 19, 2018

szilard-nemeth force-pushed the SPARK-20327 branch from 11aaa69 to e0f40e0 on May 13, 2018 17:43

vanzin reviewed May 19, 2018

galv reviewed May 19, 2018

docs/running-on-yarn.md

docs/running-on-yarn.md

galv reviewed May 20, 2018

...urce-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ResourceTypeHelperSuite.scala

galv reviewed May 20, 2018

resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala

szilard-nemeth force-pushed the SPARK-20327 branch from e0f40e0 to d1c7674 on May 22, 2018 12:13

szilard-nemeth force-pushed the SPARK-20327 branch from d1c7674 to 6aab82f on May 22, 2018 12:20

galv reviewed May 23, 2018

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala

galv reviewed May 23, 2018

vanzin reviewed May 23, 2018

szilard-nemeth force-pushed the SPARK-20327 branch from f360e61 to 7e7a55a on October 10, 2018 18:41

szilard-nemeth force-pushed the SPARK-20327 branch from 7e7a55a to 55c7be9 on October 10, 2018 18:48

szilard-nemeth force-pushed the SPARK-20327 branch from 55c7be9 to adddf6e on October 10, 2018 20:11

vanzin reviewed Oct 10, 2018

resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala

szilard-nemeth force-pushed the SPARK-20327 branch from adddf6e to f1c2e41 on October 10, 2018 20:41

vanzin reviewed Oct 10, 2018

Szilard Nemeth added 9 commits October 12, 2018 13:59

a1749af SPARK-20327. Introduce custom resource type configs and use new yarn resource api
52ef7d3 SPARK-20327. Fix review comments
62d9727 SPARK-20327. Fix YarnAllocatorSuite test
47258b4 SPARK-20327. Fix review comments
91ccddf SPARK-20327. fix review comments: removed testcase from ClientSuite
e4f1991 SPARK-20327. Fix review comments
7e62d40 SPARK-20327. Fix review comments
f85c922 SPARK-20327. Fix review comments
72e014b SPARK-20327. fix review comments

szilard-nemeth force-pushed the SPARK-20327 branch from f3955f8 to 72e014b on October 12, 2018 20:59

dc2e382 SPARK-20327. Add explanatory comment for resource names

asfgit closed this in 3946de7 Oct 13, 2018

felixcheung reviewed Oct 22, 2018

