Conversation


@szilard-nemeth szilard-nemeth commented Mar 7, 2018

What changes were proposed in this pull request?

This PR adds CLI support for YARN custom resources, e.g. GPUs and any other resources YARN defines.
The custom resources are defined with Spark properties; no additional CLI arguments were introduced.

The properties can be defined in the following form:

AM resources, client mode:
Format: spark.yarn.am.resource.<resource-name>
The property name follows the naming convention of YARN AM cores / memory properties: spark.yarn.am.memory and spark.yarn.am.cores

Driver resources, cluster mode:
Format: spark.yarn.driver.resource.<resource-name>
The property name follows the naming convention of driver cores / memory properties: spark.driver.memory and spark.driver.cores.

Executor resources:
Format: spark.yarn.executor.resource.<resource-name>
The property name follows the naming convention of executor cores / memory properties: spark.executor.memory / spark.executor.cores.

For the driver resources (cluster mode) and executor resources properties, we use the yarn prefix here, as custom resource types are currently specific to YARN.

Validation:
Please note that validation logic was added to prevent a requested resource from being defined in two ways; for example, defining the following configs:

"--conf", "spark.driver.memory=2G",
"--conf", "spark.yarn.driver.resource.memory=1G"

will not start execution and will print an error message.

How was this patch tested?

Unit tests + manual execution with Hadoop 2 and Hadoop 3 builds.

Testing has been performed on a real cluster with Spark and YARN configured:

  • Cluster and client mode
  • Request resource types with lowercase and uppercase units
  • Start a Spark job requesting only standard resources (memory / CPU)

Error handling cases:

  • Request unknown resource type
  • Request a resource type (either memory or CPU) with duplicate configs at the same time, e.g. with this config:
--conf spark.yarn.am.resource.memory=1G \
  --conf spark.yarn.driver.resource.memory=2G \
  --conf spark.yarn.executor.resource.memory=3G \

ResourceTypeValidator handles these cases, so this is not permitted.

  • Request a standard resource (memory / CPU) with the new-style configs, e.g. --conf spark.yarn.am.resource.memory=1G; this is not permitted and handled well.

An example of how I ran the test cases:

cd ~;export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop/;
./spark-2.4.0-SNAPSHOT-bin-custom-spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 1G \
  --driver-cores 1 \
  --executor-memory 1G \
  --executor-cores 1 \
  --conf spark.logConf=true \
  --conf spark.yarn.executor.resource.gpu=3G \
  --verbose \
  ./spark-2.4.0-SNAPSHOT-bin-custom-spark/examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar \
  10;

Contributor

vanzin commented Mar 7, 2018

Hi, please take some time to look at the how to contribute page:
https://spark.apache.org/contributing.html

You'll find the code conventions we use there. You have many style discrepancies in your code.

Also, given this is a YARN feature, there should be zero changes in core. All you're trying to do should be possible by just parsing user-provided config options in YARN's Client.scala.

Contributor

vanzin commented Mar 14, 2018

ok to test


SparkQA commented Mar 14, 2018

Test build #88235 has finished for PR 20761 at commit 11aaa69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@vanzin vanzin left a comment

The documentation needs to be clear that this is not available in all Hadoop versions.

I'm also confused about how cores and memory are handled; what if the user sets both?

Optimally that would not be allowed; the existing Spark configs should be used, and the new code would only apply to other resources.

@szilard-nemeth
Author

Thanks for your comments @vanzin !
The earliest I can resolve these and push the fixes is the week after next, as I will not be available until then (vacation).

@jerryshao
Contributor

Hi @szyszy, are you still going to work on this PR?

Author

szilard-nemeth commented May 3, 2018

Hey @jerryshao! Yes, I have addressed almost all of Marcelo's comments, so please expect a PR update in the coming days.

@jerryshao
Contributor

Cool, thanks!

Author

szilard-nemeth commented May 13, 2018

@vanzin: please check my updates!

Reply to your main comment from above:
For the documentation: please advise which documentation I should update, and where.
For your question about the case when the user sets both: that's why I created the ResourceTypeValidator class. As per our discussion, ResourceTypeValidator will not accept memory / vcores resource requests with the new YARN configs; e.g. spark.yarn.executor.resource.memory will be denied with an error message.
I mentioned it in another comment, but it's worth stating here as well: I refactored ResourceTypeValidator and added a bunch of javadoc comments.
I also included how validation works in the PR's main description.
Do you agree with this kind of validation, or do you have something else in mind?


SparkQA commented May 13, 2018

Test build #90553 has finished for PR 20761 at commit e0f40e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@vanzin vanzin left a comment

Only a partial review. I need to pick up again from the validator code.

@galv galv left a comment

Thanks for this work. It's very valuable for me. I'm not a Spark maintainer, but I would be happy to finish reviewing this for you shortly. Do you need help with testing?


galv commented May 19, 2018

I may be missing something, but doesn't this change depend on Spark's YARN client being made Hadoop 3.0 compatible?

Edit: Never mind, didn't realize you were using reflection to get around it.

@galv galv left a comment

I hope this was helpful to you. Let me know if you have any questions. I'm thinking that you may be able to remove the ResourceTypeHelper class, which could make this commit quite a bit smaller, depending on what you think.


SparkQA commented May 22, 2018

Test build #90955 has finished for PR 20761 at commit d1c7674.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented May 22, 2018

Test build #90958 has finished for PR 20761 at commit 6aab82f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented May 22, 2018

Test build #90968 has finished for PR 20761 at commit ec549f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@szilard-nemeth
Author

@vanzin , @galv : I think all of your comments are either addressed or answered, please check once again!

@galv galv left a comment

Have you integration-tested this against a running YARN cluster?

I can probably finish up sometime tomorrow.


galv commented Jun 15, 2018

I would like to see this merged, though I got derailed by Spark Summit last week and other things. I will look this patch over again soon. @szyszy, if you're busy lately, perhaps I can take over the rest of the code changes suggested by @vanzin, if necessary (I get the impression that this PR is just about ready to merge).

@vanzin I appreciate your detailed responses. I'm curious whether you have any overarching serious concerns about this patch, e.g., about its design. I think the interface is fairly appropriate, but I thought I should check whether you think this PR will be ready to merge soon.

@szilard-nemeth
Author

Hi @galv!
Thanks for reaching out.
Actually, I'm quite busy with other things, which is why I haven't been working on this lately.
I would be really happy if you could take this over.
Please confirm whether you can do this; the soonest I could work on this again is approximately next week.

Contributor

vanzin commented Jun 15, 2018

I don't have issues with the design. I think the two main things I was concerned about were:

  • not adding another way to set existing Spark options like mem and cores, which has been addressed
  • the seemingly unnecessary complexity in certain parts of the code like the validator


SparkQA commented Oct 10, 2018

Test build #97212 has finished for PR 20761 at commit 7e7a55a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 10, 2018

Test build #97213 has finished for PR 20761 at commit 55c7be9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 10, 2018

Test build #97215 has finished for PR 20761 at commit adddf6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 10, 2018

Test build #97216 has finished for PR 20761 at commit f1c2e41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 10, 2018

Test build #97218 has finished for PR 20761 at commit f509153.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@vanzin vanzin left a comment

Almost there, one odd thing left in the validator.

Contributor

I went and looked at the documentation because I remember this being confusing. The documentation mentions both memory and memory-mb as being valid, with the latter being preferred. So it sounds to me like you can use either, and that this code should disallow both.

You even initialize memory-mb in your tests, instead of memory.

Contributor

Still waiting for a word on this.

Author

Sure!
Did you mean this documentation?
https://hadoop.apache.org/docs/r3.0.1/hadoop-yarn/hadoop-yarn-site/ResourceModel.html
I think it's required to check all the keys for memory / vcores that YARN deprecates, as those will flow through Spark and eventually reach YARN's ResourceInformation, where things will just blow up, since memory-mb and vcores are the only names that are not deprecated. The reason it hasn't caused a problem with the current Spark code is that it uses the Resource object and doesn't use ResourceInformation at all.
So we need to disallow these:

  • cpu-vcores
  • memory
  • mb

What do you think?

Contributor

I'm not familiar with the YARN code or what it does here.

I'm just worried about users setting cpu/memory resources outside of the proper Spark settings, and also the inconsistency in your code (using both memory and memory-mb).

Author

@szilard-nemeth szilard-nemeth Oct 12, 2018

These are two separate things:

  1. One is that I don't reject all the deprecated standard resource names known to YARN (explained in the previous comment), which I will address soon. To be exact, I need to reject not just the deprecated names but every possible way to define the standard memory and CPU core resources.
  2. Using memory-mb is the only way to initialize the memory resource with the YARN client, via the method ResourceUtils.reinitializeResources (see the sketch after this list).
    I played around with this a bit: if I omit the standard resources, specify only custom resources, and then call ResourceUtils.reinitializeResources, an internal YARN exception is thrown, as the method relies on the standard resources always being specified too.
    Unfortunately, invoking this method is the simplest way to build tests upon custom resource types, to the best of my knowledge, so I can't really do much about this.

and also the inconsistency in your code (using both memory and memory-mb).

What did you mean by this? The only use of "memory" anywhere in the change is to prevent it from being used with the new resource configs.

Contributor

What did you mean by this?

I meant you were initializing memory-mb in tests but checking only memory here. That smells like you should be checking memory-mb here.

These kinds of things should have comments in the code so that in the future we know why they are that way.

Author

Please see my last commit with the updates.
I only added some tests, so they are not exhaustive for every combination of Spark resources and YARN standard resources. I can add more test cases if you think it's needed, but I think this is fine as it is.

Author

Sure, adding some explanatory comments with my next commit.

Author

I think the code is now complete, please check!


SparkQA commented Oct 11, 2018

Test build #97282 has finished for PR 20761 at commit f3955f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 12, 2018

Test build #97320 has finished for PR 20761 at commit 72e014b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 12, 2018

Test build #97321 has finished for PR 20761 at commit dc2e382.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

vanzin commented Oct 13, 2018

Merging to master.

@asfgit asfgit closed this in 3946de7 Oct 13, 2018
<td><code>(none)</code></td>
<td>
Amount of resource to use for the YARN Application Master in client mode.
In cluster mode, use <code>spark.yarn.driver.resource.&lt;resource-type&gt;</code> instead.
Member

nit: looks like spark.yarn.driver.resource.&lt;resource-type&gt; should be spark.yarn.driver.resource.{resource-type}

(yes, I realize resource-type is to be replaced with an actual resource type)

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…like GPUs

Closes apache#20761 from szyszy/SPARK-20327.

Authored-by: Szilard Nemeth <snemeth@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
