@Yikun
Member

@Yikun Yikun commented Nov 1, 2021

What changes were proposed in this pull request?

A PodGroup is a group of strongly associated pods, used mainly in batch scheduling. It is defined as a Custom Resource Definition (CRD) in Kubernetes. The PodGroup concept was approved by the Kubernetes community in KEP-583 Coscheduling.

This patch adds PodGroup support for Kubernetes:

  • Add PodGroup configuration: introduce spark.kubernetes.enablePodGroup to enable PodGroup support, plus two configurations (spark.kubernetes.podgroup.min.[cpu|memory]) that let users specify the minimum CPU and memory for a PodGroup.
  • Add Volcano implementation: if the user specifies volcano as the Spark K8s scheduler, the PodGroup is created automatically in Volcano with the minResource requirement. If the available resources in the cluster cannot satisfy the requirement, no pod in the PodGroup will be scheduled.
  • Driver/Executor pods are labeled with the key scheduling.k8s.io/group-name and the value s"${kubernetesConf.resourceNamePrefix}-podgroup".
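
As a rough illustration of the labeling bullet, the sketch below derives the PodGroup name from a resource name prefix and attaches it to a pod's labels. `PodMeta`, `PodGroupLabeling`, and `withPodGroupLabel` are hypothetical stand-ins invented for this example, not Spark or fabric8 classes. Only the label key and the `-podgroup` suffix come from the description; the prefix value is made up.

```scala
// Hypothetical stand-in for pod metadata; not a Spark or fabric8 type.
final case class PodMeta(name: String, labels: Map[String, String])

object PodGroupLabeling {
  // Label key taken from the PR description.
  val GroupNameKey = "scheduling.k8s.io/group-name"

  // Mirrors s"${kubernetesConf.resourceNamePrefix}-podgroup" from the description.
  def podGroupName(resourceNamePrefix: String): String =
    s"$resourceNamePrefix-podgroup"

  // Returns a copy of the pod metadata that carries the group-name label.
  def withPodGroupLabel(pod: PodMeta, resourceNamePrefix: String): PodMeta =
    pod.copy(labels = pod.labels + (GroupNameKey -> podGroupName(resourceNamePrefix)))
}

object PodGroupLabelingDemo extends App {
  // The prefix "spark-pi-1234" is an invented example value.
  val driver = PodMeta("spark-pi-driver", Map("spark-role" -> "driver"))
  val labeled = PodGroupLabeling.withPodGroupLabel(driver, "spark-pi-1234")
  println(labeled.labels(PodGroupLabeling.GroupNameKey)) // prints spark-pi-1234-podgroup
}
```

Because driver and executor pods share the same prefix in this model, they would end up with the same group name, which is what lets a scheduler treat them as one group.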

For example, the configuration below requests a group of pods with 4 CPU / 8G memory as the minimum requirement. Volcano will help create these pods if the cluster meets the minimum requirement (4 CPU, 8G memory); if the available resources in the cluster cannot satisfy it, no pod in the PodGroup will be scheduled.

  --conf spark.kubernetes.driver.scheduler.name=volcano \
  --conf spark.kubernetes.enablePodGroup=true \
  --conf spark.kubernetes.podgroup.min.cpu=4 \
  --conf spark.kubernetes.podgroup.min.memory=8G \
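
The all-or-nothing behaviour described above can be modelled as a simple admission check. This is only a sketch of the semantics, not Volcano's actual scheduler logic. `Resources`, `parseMemoryMiB`, and `canScheduleGroup` are hypothetical names, and the memory parser is deliberately simplified: it accepts only plain `G`/`M` suffixes and treats 1G as 1024 MiB.

```scala
// Hypothetical resource bundle: CPU cores and memory in MiB.
final case class Resources(cpuCores: Double, memoryMiB: Long)

object GangAdmission {
  // Simplified parser (assumption): accepts values such as "8G" or "512M" only,
  // and treats 1G as 1024 MiB.
  def parseMemoryMiB(s: String): Long = {
    val (num, unit) = s.trim.span(_.isDigit)
    require(num.nonEmpty, s"missing numeric part in '$s'")
    unit.toUpperCase match {
      case "G"   => num.toLong * 1024L
      case "M"   => num.toLong
      case other => throw new IllegalArgumentException(s"unsupported unit '$other'")
    }
  }

  // The group is admitted only if the whole minimum fits;
  // otherwise no pod in the group is scheduled.
  def canScheduleGroup(available: Resources, min: Resources): Boolean =
    available.cpuCores >= min.cpuCores && available.memoryMiB >= min.memoryMiB
}

object GangAdmissionDemo extends App {
  import GangAdmission._
  // Minimum taken from the configuration above: 4 CPU, 8G memory.
  val min = Resources(4.0, parseMemoryMiB("8G"))
  println(canScheduleGroup(Resources(3.0, 4096L), min))  // prints false
  println(canScheduleGroup(Resources(8.0, 16384L), min)) // prints true
}
```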

Why are the changes needed?

Provide a feature for requesting minimum resources before jobs are scheduled.

Does this PR introduce any user-facing change?

Yes, it adds PodGroup-related configurations.

How was this patch tested?

  • UT
  • e2e test:
# Setup K8S
minikube start --cpus 3 --memory 4096
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
# Setup Volcano
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
# Submit job
bin/spark-submit \
  --master k8s://https://127.0.0.1:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.driver.scheduler.name=volcano \
  --conf spark.kubernetes.enablePodGroup=true \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.namespace=default \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=spark:latest \
  --class org.apache.spark.examples.SparkPi \
  --name spark-pi \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0-SNAPSHOT.jar

@Yikun
Member Author

Yikun commented Nov 1, 2021

@SparkQA

SparkQA commented Nov 1, 2021

Test build #144808 has finished for PR 34456 at commit c066dd2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class PodGroup extends CustomResource[PodGroupSpec, PodGroupStatus] with Namespaced
  • class PodGroupSpec extends KubernetesResource
  • class PodGroupStatus extends KubernetesResource

@SparkQA

SparkQA commented Nov 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49278/

@SparkQA

SparkQA commented Nov 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49278/

@SparkQA

SparkQA commented Nov 10, 2021

Test build #145065 has finished for PR 34456 at commit c066dd2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class PodGroup extends CustomResource[PodGroupSpec, PodGroupStatus] with Namespaced
  • class PodGroupSpec extends KubernetesResource
  • class PodGroupStatus extends KubernetesResource

@Yikun
Member Author

Yikun commented Nov 15, 2021

I moved the PodGroup-related API to a more appropriate place, as an extension of k8s-client: fabric8io/kubernetes-client#3580.

@github-actions github-actions bot added the BUILD label Nov 27, 2021
@Yikun Yikun marked this pull request as draft November 27, 2021 10:35
@Yikun Yikun changed the title [WIP][SPARK-36061][K8S] Add support for PodGroup [SPARK-36061][K8S] Add support for PodGroup Nov 27, 2021
<type>test-jar</type>
</dependency>

<dependency>
Member Author

@Yikun Yikun Nov 27, 2021

Volcano support in the k8s client will be released in kubernetes-client v5.11:
fabric8io/kubernetes-client#3580

TODO: need to bump the kubernetes-client version to the latest once it is published.

spark/pom.xml

Line 207 in 7b50cf0

<kubernetes-client.version>5.10.1</kubernetes-client.version>

* Pod creation.
*/
def getAdditionalKubernetesResources(): Seq[HasMetadata] = Seq.empty

Member Author

@Yikun Yikun Nov 27, 2021

This is submitted as a separate PR in #34599 (the 1st and 2nd commits).

You can look at only the 3rd commit to understand this more clearly; that is, this PR only adds the pod feature step.
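
To illustrate the `getAdditionalKubernetesResources` hook shown in the review context above, here is a minimal sketch of a feature step that contributes one extra resource alongside the pod. All types here (`Resource`, `PodGroupResource`, `FeatureStep`, `PodGroupFeatureStep`) are simplified stand-ins written for this example. The real hook returns `Seq[HasMetadata]`, and the actual step in the PR may look quite different.

```scala
// Stand-ins for illustration only; the real hook returns HasMetadata objects.
trait Resource {
  def kind: String
  def name: String
}

final case class PodGroupResource(name: String, minCpu: String, minMemory: String)
    extends Resource {
  val kind = "PodGroup"
}

// Simplified stand-in for a feature step; only the hook from the diff is modelled.
trait FeatureStep {
  def getAdditionalKubernetesResources(): Seq[Resource] = Seq.empty
}

// Hypothetical step that contributes one PodGroup next to the pod.
class PodGroupFeatureStep(prefix: String, minCpu: String, minMemory: String)
    extends FeatureStep {
  override def getAdditionalKubernetesResources(): Seq[Resource] =
    Seq(PodGroupResource(s"$prefix-podgroup", minCpu, minMemory))
}

object FeatureStepDemo extends App {
  val steps: Seq[FeatureStep] =
    Seq(new FeatureStep {}, new PodGroupFeatureStep("spark-pi", "4", "8G"))
  // A builder would collect the extra resources from every step.
  val extra = steps.flatMap(_.getAdditionalKubernetesResources())
  extra.foreach(r => println(s"${r.kind}/${r.name}")) // prints PodGroup/spark-pi-podgroup
}
```

Steps that do not override the hook keep the empty default, so only the PodGroup step adds anything to the collected list.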

@codecov-commenter

codecov-commenter commented Nov 27, 2021

Codecov Report

Merging #34456 (7b50cf0) into master (d57f1bb) will decrease coverage by 7.98%.
The diff coverage is 74.35%.

❗ Current head 7b50cf0 differs from pull request most recent head c6a6688. Consider uploading reports for the commit c6a6688 to get more accurate results

@@            Coverage Diff             @@
##           master   #34456      +/-   ##
==========================================
- Coverage   90.15%   82.17%   -7.99%     
==========================================
  Files         290      251      -39     
  Lines       62515    56994    -5521     
  Branches     9104     9281     +177     
==========================================
- Hits        56362    46833    -9529     
- Misses       4784     8947    +4163     
+ Partials     1369     1214     -155     
Flag | Coverage Δ
unittests | 82.15% <74.35%> (-7.99%) ⬇️

Impacted Files | Coverage Δ
python/pyspark/context.py | 57.86% <ø> (-27.67%) ⬇️
python/pyspark/shuffle.py | 18.73% <0.00%> (-53.68%) ⬇️
python/pyspark/rdd.py | 40.40% <33.33%> (-52.10%) ⬇️
python/pyspark/cloudpickle/cloudpickle.py | 50.78% <39.68%> (-4.60%) ⬇️
python/pyspark/ml/common.py | 51.35% <40.90%> (-24.37%) ⬇️
python/pyspark/cloudpickle/cloudpickle_fast.py | 66.25% <61.90%> (-3.24%) ⬇️
python/pyspark/pandas/sql_formatter.py | 89.02% <89.02%> (ø)
python/pyspark/serializers.py | 60.86% <91.11%> (-20.85%) ⬇️
python/pyspark/mllib/common.py | 87.20% <92.85%> (-1.82%) ⬇️
python/pyspark/__init__.py | 90.69% <100.00%> (-4.66%) ⬇️
... and 99 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 04671bd...c6a6688.

@SparkQA

SparkQA commented Nov 27, 2021

Test build #145675 has finished for PR 34456 at commit c6a6688.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 27, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50145/

Member

@dongjoon-hyun dongjoon-hyun left a comment

I'm closing this PR since the artifact doesn't exist yet. Please reopen it when you are ready.

[error] sbt.librarymanagement.ResolveException: Error downloading io.fabric8:volcano-model-v1beta1:5.10.1
[error]   Not found
[error]   Not found
[error]   not found: https://maven-central.storage-download.googleapis.com/maven2/io/fabric8/volcano-model-v1beta1/5.10.1/volcano-model-v1beta1-5.10.1.pom
[error]   not found: https://repo1.maven.org/maven2/io/fabric8/volcano-model-v1beta1/5.10.1/volcano-model-v1beta1-5.10.1.pom
[error]   not found: /home/jenkins/sparkivy/per-executor-caches/11/.m2/repository/io/fabric8/volcano-model-v1beta1/5.10.1/volcano-model-v1beta1-5.10.1.pom
[error]   not found: /home/jenkins/sparkivy/per-executor-caches/11/.ivy2/localio.fabric8/volcano-model-v1beta1/5.10.1/ivys/ivy.xml
[error] Error downloading io.fabric8:volcano-client:5.10.1
[error]   Not found
[error]   Not found
[error]   not found: https://maven-central.storage-download.googleapis.com/maven2/io/fabric8/volcano-client/5.10.1/volcano-client-5.10.1.pom
[error]   not found: https://repo1.maven.org/maven2/io/fabric8/volcano-client/5.10.1/volcano-client-5.10.1.pom
[error]   not found: /home/jenkins/sparkivy/per-executor-caches/11/.m2/repository/io/fabric8/volcano-client/5.10.1/volcano-client-5.10.1.pom
[error]   not found: /home/jenkins/sparkivy/per-executor-caches/11/.ivy2/localio.fabric8/volcano-client/5.10.1/ivys/ivy.xml

Member

@martin-g martin-g left a comment

I see that it is/will be possible to set up Driver and/or Executor configurations with their own PodGroups.
What about a PodGroup that reserves resources (pods) for all actors (driver + executors)?

<type>test-jar</type>
</dependency>

<dependency>

@Yikun
Member Author

Yikun commented Feb 7, 2022

@martin-g Thanks for the review. Because the implementation is very different from before, I have replaced this PR with #35422.

5 participants