
[feature] Add ability to specify node affinity & toleration using KFP V2 #9682

Open
AlexandreBrown opened this issue Jun 24, 2023 · 20 comments


@AlexandreBrown

AlexandreBrown commented Jun 24, 2023

Feature Area

/area sdk

What feature would you like to see?

A core production use case of KFP is running CPU and GPU workloads on dedicated nodegroups that are more powerful than, and separate from, the nodegroup where Kubeflow itself is installed, and that usually have autoscaling enabled.
To achieve this, we used to be able to simply specify which nodes a component would run on using node affinity + tolerations. This is no longer possible in KFP v2, yet such a core feature should be supported. A minimal KFP v1 example is sketched below.
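
For reference, this is roughly what the KFP v1 SDK allowed (a minimal sketch; the image, nodegroup label, and taint key are placeholders):

```python
from kfp import dsl
from kubernetes.client import (
    V1Affinity, V1NodeAffinity, V1NodeSelector,
    V1NodeSelectorRequirement, V1NodeSelectorTerm, V1Toleration,
)


@dsl.pipeline(name='train-pipeline')
def pipeline():
    train_op = dsl.ContainerOp(name='train', image='my-training-image')  # placeholder image

    # Pin this step to a dedicated GPU nodegroup via required node affinity.
    train_op.add_affinity(V1Affinity(
        node_affinity=V1NodeAffinity(
            required_during_scheduling_ignored_during_execution=V1NodeSelector(
                node_selector_terms=[V1NodeSelectorTerm(match_expressions=[
                    V1NodeSelectorRequirement(
                        key='nodegroup',           # placeholder label key
                        operator='In',
                        values=['gpu-training'],   # placeholder label value
                    )
                ])]
            )
        )
    ))

    # Tolerate the taint that keeps other workloads off that nodegroup.
    train_op.add_toleration(V1Toleration(
        key='dedicated', operator='Equal', value='gpu-training', effect='NoSchedule'))
```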

What is the use case or pain point?

The existing set_accelerator_type is not nearly flexible enough to cover this use case. Here are a few examples showing why it falls short for production use cases (a usage sketch follows the list):

  • It does not work if the GPU is not one of the few (3) supported accelerators: NVIDIA_TESLA_K80, TPU_V3 or cloud-tpus.google.com/v3. Otherwise we must fall back to the generic nvidia.com/gpu, which is not precise and defeats the purpose of selecting an accelerator.
  • If you have 2 nodegroups with the same GPU, one reserved for inference and one reserved for pipeline execution (e.g. training), there is no way to express that distinction purely with set_accelerator_type('nvidia.com/gpu').
  • The method is only meant for GPUs, but it is common to want to run CPU workloads on specific nodegroups as well, for example for nodegroup isolation (so workloads don't affect the nodegroup where the Kubeflow core pods run) or to run pipelines on more powerful CPU nodegroups while Kubeflow stays on cheaper instances.
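
For context, this is roughly how the accelerator setters are used in KFP v2 today (a minimal sketch; the component body and base image are placeholders), which shows that the only available knob is an accelerator resource name and count:

```python
from kfp import dsl


@dsl.component(base_image='python:3.10')  # placeholder base image
def train():
    print('training...')


@dsl.pipeline(name='train-pipeline')
def pipeline():
    train_task = train()
    # The only scheduling hint exposed here is the accelerator name and count;
    # it cannot distinguish between two nodegroups that carry the same GPU type.
    train_task.set_accelerator_type('nvidia.com/gpu')
    train_task.set_accelerator_limit(1)
```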

Is there a workaround currently?

Users can try external tools such as Kyverno to create mutating webhook rules that add a toleration and/or node affinity/node selector based on predefined criteria such as a label name and value.
This is still painful, since it is far more involved than calling .add_node_affinity() and .add_toleration() on a component. In fact, we can't even add a label using the KFP SDK anymore, so matching has to rely on labels that happen to be present (we have no way to explicitly ensure their presence).
Even with Kyverno, some cases are hard or impossible to cover. For instance, if two pipeline components carry the same labels but one should run on a less expensive GPU nodegroup and only the other on a more powerful one, then, because the pods have identical labels, the only place to express which nodegroup each should run on is at component definition time (via the KFP SDK), which KFP v2 does not currently support.
Given that Kubeflow's main goal is to lower the barrier to running ML on Kubernetes, this workaround goes against that goal and should not be the only available solution. It would be in everyone's best interest for the KFP SDK to add back add_node_affinity() and add_toleration() so that data scientists/ML specialists can easily specify where each component runs, instead of relying on more advanced MLOps solutions that demand ever more Kubernetes knowledge. A hypothetical sketch of such an API follows.
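
To make the request concrete, here is a hypothetical sketch of what the task-level API could look like; add_node_affinity and add_toleration do not exist in the KFP v2 SDK today, so the signatures below are illustrative only:

```python
from kfp import dsl


@dsl.component
def train():
    print('training...')


@dsl.pipeline(name='train-pipeline')
def pipeline():
    train_task = train()

    # Hypothetical methods mirroring the KFP v1 behaviour; not part of the current SDK.
    train_task.add_node_affinity(
        label_key='nodegroup', operator='In', values=['gpu-training'])  # illustrative signature
    train_task.add_toleration(
        key='dedicated', operator='Equal', value='gpu-training', effect='NoSchedule')  # illustrative signature
```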

Love this idea? Give it a 👍.

@cjidboon94
Contributor

Additionally, it would be great to have the ability to set requests/limits for custom resources. cpu, memory and nvidia.com/gpu are obviously staples and cover most of the necessary resource requests/limits, but being able to use and experiment with other custom resources (e.g. to make GPU sharing between containers possible) is a big plus too. So in addition to the above, I would like to see add_resource_request() and add_resource_limit() back in the new versions of the KFP SDK.
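
For reference, a minimal sketch of what KFP v2 offers today for resources versus what is being asked for here; the `train` component is a placeholder, and the add_resource_limit call is the requested, not an existing, API:

```python
from kfp import dsl


@dsl.component
def train():
    print('training...')


@dsl.pipeline(name='train-pipeline')
def pipeline():
    train_task = train()

    # Available today: fixed setters for the built-in resources only.
    train_task.set_cpu_limit('4')
    train_task.set_memory_limit('16G')
    train_task.set_accelerator_type('nvidia.com/gpu')
    train_task.set_accelerator_limit(1)

    # Requested: generic setters for arbitrary resource names, e.g. a shared or
    # fractional GPU resource exposed by a device plugin.
    # train_task.add_resource_limit('nvidia.com/mig-1g.5gb', '1')  # hypothetical API
```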

@jlyaoyuli
Collaborator

Hello @AlexandreBrown, thanks for proposing this. Node selectors are already supported: https://www.kubeflow.org/docs/components/pipelines/v2/platform-specific-features/
Node affinity and toleration support is awaiting contributors!
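
For reference, the node selector support linked above comes from the kfp-kubernetes extension package; a minimal sketch, assuming kfp-kubernetes is installed and using placeholder label key/value:

```python
from kfp import dsl, kubernetes


@dsl.component
def train():
    print('training...')


@dsl.pipeline(name='train-pipeline')
def pipeline():
    train_task = train()
    # Supported today: a plain node selector via the kfp-kubernetes extension.
    kubernetes.add_node_selector(
        train_task,
        label_key='nodegroup',       # placeholder label key
        label_value='gpu-training',  # placeholder label value
    )
```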

@AlexandreBrown AlexandreBrown changed the title [feature] Add ability to specify node selector/node affinity & toleration using KFP V2 [feature] Add ability to specify node affinity & toleration using KFP V2 Jul 1, 2023
@mcrisafu

Hello @connor-mccarthy, as I have a high need for this feature, I already have an implementation. I would love to contribute it. How should we proceed? Can I open a PR or do we need to do a Design Review first? (CLA is already submitted)

@connor-mccarthy
Member

Hi, @mcrisafu! Thanks for your interest in contributing this.

I think this feature is large enough to deserve a design. Please feel free to start there. You can add the doc to this issue. From there, we can decide whether it makes sense to discuss it at an upcoming KFP community meeting [Meeting, Agenda].

@mcrisafu

Hi @connor-mccarthy, thank you for your feedback. Here is the requested design doc.

@connor-mccarthy
Member

@Linchin, when you have the chance, could you please take a look at this FR for the KFP BE?

@Linchin
Contributor

Linchin commented Aug 9, 2023

@mcrisafu Thank you for writing the design doc, which includes the general idea. Could you expand it to include more implementation details, ambiguities, potential challenges etc.?

@schrodervictor

I would like to draw some attention to this topic. There is another issue referencing this one (#9768) where the implementation of tolerations and affinity is mentioned as part of a bigger plan. @cjidboon94 and @Linchin interacted with that one, but I think @connor-mccarthy and @AlexandreBrown have not yet.

We really need this feature and I believe a lot of people do. Node selection, toleration and affinity settings are essential parts of effective pod scheduling in Kubernetes.

I offered to help in that thread and could go ahead and start implementing these features; however, I found this thread, where @mcrisafu mentioned: "...I already have an implementation".

I would love to contribute, but of course there's no point in doing duplicate work. Therefore, I ask:

  • Is there anything blocking the progress of this feature?
  • Is the implementation fully done or, if not, how advanced is it?
  • Can we help you in any way to accelerate the process?

We are happy to jump into code review, testing or the implementation itself, should it still be missing some (or several) parts.

@mcrisafu

Thank you, @schrodervictor. I haven't had the time yet to update the document. I also believe that the suggestion from @cjidboon94 in #9768 is much better than my own "implementation."

We have decided not to migrate to KFP v2 just yet, as we have several issues beyond just toleration. We would greatly appreciate having #9768 implemented. Unfortunately, I don't understand the code base (and Go) well enough to do this myself.

From my perspective, it would be better to prioritize pushing the other feature instead of going with my sub-optimal hack. However, if you're still interested in the code, please take a look at this commit.

@Linchin
Contributor

Linchin commented Aug 23, 2023

A related PR: #9913

@pythonking6

This is a critical feature...

@droctothorpe
Contributor

droctothorpe commented Jan 10, 2024

Any updates on this? We really need this in order to migrate users to V2. Also, we'd be down to contribute to implementation.


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Mar 17, 2024
@strickvl

Commenting so this doesn’t get closed. The feature is still needed.


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label May 17, 2024
@krzysztofkropatwa

Commenting so this doesn’t get closed. The feature is still very needed.

@stale stale bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label May 17, 2024
@strickvl

+1

@rimolive
Member

/lifecycle frozen

@droctothorpe
Contributor

Happy to tackle affinity support, bandwidth permitting.

@rimolive
Member

rimolive commented Jun 4, 2024

@droctothorpe Thank you for volunteering! There is a draft PR covering this issue; I pinged you there so you can coordinate with the PR author to team up and finish the implementation.
