-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-20327][CORE][YARN] Add CLI support for YARN custom resources, like GPUs #20761
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi, please take some time to look at the how to contribute page: You'll find the code conventions we use there. You have many style discrepancies in your code. Also, given this is a YARN feature, there should be zero changes in core. All you're trying to do should be possible by just parsing user-provided config options in YARN's |
48b6baa to
7c3f186
Compare
|
ok to test |
|
Test build #88235 has finished for PR 20761 at commit
|
vanzin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation needs to be clear that this is not available in all Hadoop versions.
I'm also confused about how cores and memory are handled; what if the user sets both?
Optimally that would not be allowed, and the existing Spark configs should be used, and the new code would only apply to other resources.
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeValidator.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeValidator.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeValidator.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala
Outdated
Show resolved
Hide resolved
|
Thanks for your comments @vanzin ! |
|
Hi @szyszy are you still going to work on this PR? |
|
Hey @jerryshao! Exactly, I almost addressed all of Marcelo's comments so please expect a PR update in the coming days. |
|
Cool, thanks! |
11aaa69 to
e0f40e0
Compare
|
@vanzin : Reply to your main comment from above: |
|
Test build #90553 has finished for PR 20761 at commit
|
vanzin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only a partial review. I need to pick up again from the validator code.
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala
Outdated
Show resolved
Hide resolved
galv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this work. It's very valuable for me. I'm not a spark maintainer, but would be happy to finish reviewing this for you shortly. Do you need help with testing?
|
Edit: Never mind, didn't realize you were using reflection to get around it. |
...urce-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ResourceTypeHelperSuite.scala
Outdated
Show resolved
Hide resolved
galv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope this was helpful to you. Let me know if you have any questions. I'm thinking that you may be able to remove the ResourceTypeHelper class, which can make this commit quite a bit of smaller, depending on what you think.
...urce-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ResourceTypeHelperSuite.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala
Outdated
Show resolved
Hide resolved
e0f40e0 to
d1c7674
Compare
|
Test build #90955 has finished for PR 20761 at commit
|
d1c7674 to
6aab82f
Compare
|
Test build #90958 has finished for PR 20761 at commit
|
|
Test build #90968 has finished for PR 20761 at commit
|
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
Outdated
Show resolved
Hide resolved
galv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you integration tested this against a running yarn cluster?
I can probably finish up sometime tomorrow.
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeValidator.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeValidator.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeValidator.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeValidator.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeValidator.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeValidator.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceTypeValidator.scala
Outdated
Show resolved
Hide resolved
|
I would like to see this merged, though I got derailed by spark summit last week and other things. I will look this patch over again soon @szyszy If you're busy lately, perhaps I can take over the rest of the code changes suggested by @vanzin, if necessary (I get the impression that this PR is just about ready to merge). @vanzin I appreciate your detailed responses. I'm curious whether you have any overarching serious concerns about this patch, e.g., about its design. I think the interface is fairly appropriate, but I thought I should check whether you think this PR will be ready to merge soon. |
|
Hi @galv! |
|
I don't have issues with the design - I think the main two things I was concerned about were:
|
f360e61 to
7e7a55a
Compare
|
Test build #97212 has finished for PR 20761 at commit
|
7e7a55a to
55c7be9
Compare
|
Test build #97213 has finished for PR 20761 at commit
|
55c7be9 to
adddf6e
Compare
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceRequestHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceRequestHelper.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala
Outdated
Show resolved
Hide resolved
...e-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ResourceRequestHelperSuite.scala
Outdated
Show resolved
Hide resolved
...e-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ResourceRequestHelperSuite.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala
Outdated
Show resolved
Hide resolved
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala
Outdated
Show resolved
Hide resolved
|
Test build #97215 has finished for PR 20761 at commit
|
adddf6e to
f1c2e41
Compare
|
Test build #97216 has finished for PR 20761 at commit
|
|
Test build #97218 has finished for PR 20761 at commit
|
vanzin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there, one odd thing left in the validator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went and looked at the documentation because I remember this being confusing. The documentation mentions both memory and memory-mb as being valid, with the latter being preferred. So it sounds to me like you can use either, and that this code should disallow both.
You even initialize memory-mb in your tests, instead of memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still waiting for a word on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure!
Did you mean this documentation?
https://hadoop.apache.org/docs/r3.0.1/hadoop-yarn/hadoop-yarn-site/ResourceModel.html
I think it's required to check all the keys for memory / vcore that YARN deprecates, as those will flow trough Spark and eventually reach YARN's ResourceInformation and it will just blow up as only memory-mb and vcores are the ones that are not deprecated. The reason why it haven't caused a problem with current Spark code as it is using the Resource object and not using ResourceInformation at all.
So we need to disallow these:
- cpu-vcores
- memory
- mb
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with the YARN code or what it does here.
I'm just worried about users setting cpu/memory resources outside of the proper Spark settings, and also the inconsistency in your code (using both memory and memory-mb).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are two separate things:
- One is that I don't reject all the deprecated standard resources has been known to YARN (explained in previous comment) which I will address soon. To be exact, I need to reject not just the deprecateds, but all possible ways to define standard resources for the memory and CPU cores.
- Using
memory-mbis the only way to initialize the memory resource with the YARN client, with the methodResourceUtils.reinitializeResources.
I played around with this a bit, if I omit the standard resources and try to specify custom resources and then callResourceUtils.reinitializeResources, an internal YARN exception will be thrown as it relies on the fact that when you invoke this method, you always specify the standard resources, too.
Unfortunately, invoking this method is the most simple way to build tests upon custom resource types, to my best knowledge, so I can't really do much about this.
and also the inconsistency in your code (using both memory and memory-mb).
What did you mean with this? The only use of "memory" all around the change is to prevent it from being used with the new resource configs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What did you mean with this?
I meant you were initializing memory-mb in tests but checking only memory here. That smells like you should be checking memory-mb here.
There kinds of things should have comments in the code so in the future we know why they are that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my last commit with the updates.
I only added some tests, so they are not extensive for every combination of spark resources and YARN standard resources. If you think I can add more testcases but I think this is fine as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, adding some explanatory comments with my next commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the code is now complete, please check!
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ResourceRequestHelper.scala
Outdated
Show resolved
Hide resolved
...e-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ResourceRequestHelperSuite.scala
Outdated
Show resolved
Hide resolved
...e-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ResourceRequestHelperSuite.scala
Outdated
Show resolved
Hide resolved
...e-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ResourceRequestHelperSuite.scala
Outdated
Show resolved
Hide resolved
...e-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ResourceRequestHelperSuite.scala
Outdated
Show resolved
Hide resolved
...ce-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ResourceRequestTestHelper.scala
Outdated
Show resolved
Hide resolved
|
Test build #97282 has finished for PR 20761 at commit
|
f3955f8 to
72e014b
Compare
|
Test build #97320 has finished for PR 20761 at commit
|
|
Test build #97321 has finished for PR 20761 at commit
|
|
Merging to master. |
| <td><code>(none)</code></td> | ||
| <td> | ||
| Amount of resource to use for the YARN Application Master in client mode. | ||
| In cluster mode, use <code>spark.yarn.driver.resource.<resource-type></code> instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: looks like spark.yarn.driver.resource.<resource-type> should be spark.yarn.driver.resource.{resource-type}
(yes, I realize resource-type is to be replaced with)
…like GPUs ## What changes were proposed in this pull request? This PR adds CLI support for YARN custom resources, e.g. GPUs and any other resources YARN defines. The custom resources are defined with Spark properties, no additional CLI arguments were introduced. The properties can be defined in the following form: **AM resources, client mode:** Format: `spark.yarn.am.resource.<resource-name>` The property name follows the naming convention of YARN AM cores / memory properties: `spark.yarn.am.memory and spark.yarn.am.cores ` **Driver resources, cluster mode:** Format: `spark.yarn.driver.resource.<resource-name>` The property name follows the naming convention of driver cores / memory properties: `spark.driver.memory and spark.driver.cores.` **Executor resources:** Format: `spark.yarn.executor.resource.<resource-name>` The property name follows the naming convention of executor cores / memory properties: `spark.executor.memory / spark.executor.cores`. For the driver resources (cluster mode) and executor resources properties, we use the `yarn` prefix here as custom resource types are specific to YARN, currently. **Validation:** Please note that a validation logic is added to avoid having requested resources defined in 2 ways, for example defining the following configs: ``` "--conf", "spark.driver.memory=2G", "--conf", "spark.yarn.driver.resource.memory=1G" ``` will not start execution and will print an error message. ## How was this patch tested? Unit tests + manual execution with Hadoop2 and Hadoop 3 builds. Testing have been performed on a real cluster with Spark and YARN configured: Cluster and client mode Request Resource Types with lowercase and uppercase units Start Spark job with only requesting standard resources (mem / cpu) Error handling cases: - Request unknown resource type - Request Resource type (either memory / cpu) with duplicate configs at the same time (e.g. with this config: ``` --conf spark.yarn.am.resource.memory=1G \ --conf spark.yarn.driver.resource.memory=2G \ --conf spark.yarn.executor.resource.memory=3G \ ``` ), ResourceTypeValidator handles these cases well, so it is not permitted - Request standard resource (memory / cpu) with the new style configs, e.g. --conf spark.yarn.am.resource.memory=1G, this is not permitted and handled well. An example about how I ran the testcases: ``` cd ~;export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop/; ./spark-2.4.0-SNAPSHOT-bin-custom-spark/bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master yarn \ --deploy-mode cluster \ --driver-memory 1G \ --driver-cores 1 \ --executor-memory 1G \ --executor-cores 1 \ --conf spark.logConf=true \ --conf spark.yarn.executor.resource.gpu=3G \ --verbose \ ./spark-2.4.0-SNAPSHOT-bin-custom-spark/examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar \ 10; ``` Closes apache#20761 from szyszy/SPARK-20327. Authored-by: Szilard Nemeth <snemeth@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
What changes were proposed in this pull request?
This PR adds CLI support for YARN custom resources, e.g. GPUs and any other resources YARN defines.
The custom resources are defined with Spark properties, no additional CLI arguments were introduced.
The properties can be defined in the following form:
AM resources, client mode:
Format:
spark.yarn.am.resource.<resource-name>The property name follows the naming convention of YARN AM cores / memory properties:
spark.yarn.am.memory and spark.yarn.am.coresDriver resources, cluster mode:
Format:
spark.yarn.driver.resource.<resource-name>The property name follows the naming convention of driver cores / memory properties:
spark.driver.memory and spark.driver.cores.Executor resources:
Format:
spark.yarn.executor.resource.<resource-name>The property name follows the naming convention of executor cores / memory properties:
spark.executor.memory / spark.executor.cores.For the driver resources (cluster mode) and executor resources properties, we use the
yarnprefix here as custom resource types are specific to YARN, currently.Validation:
Please note that a validation logic is added to avoid having requested resources defined in 2 ways, for example defining the following configs:
will not start execution and will print an error message.
How was this patch tested?
Unit tests + manual execution with Hadoop2 and Hadoop 3 builds.
Testing have been performed on a real cluster with Spark and YARN configured:
Cluster and client mode
Request Resource Types with lowercase and uppercase units
Start Spark job with only requesting standard resources (mem / cpu)
Error handling cases:
), ResourceTypeValidator handles these cases well, so it is not permitted
An example about how I ran the testcases: