
[backend] ml-pipeline-visualizationserver and ml-pipeline-ui-artifact per user namespace resource allocation #9555

Open
andre-lx opened this issue Jun 2, 2023 · 14 comments


@andre-lx

andre-lx commented Jun 2, 2023

Environment

  • How did you deploy Kubeflow Pipelines (KFP)?
    Kubeflow deployment
  • KFP version:
    1.8.2

Steps to reproduce

I didn't find any existing reference to this issue.

At the moment, two pods are created for each user namespace: ml-pipeline-visualizationserver and ml-pipeline-ui-artifact. The pipelines are running smoothly, but these two pods per namespace make our Kubernetes cluster a lot more expensive.

Imagine the following scenario:

  • 500 users, each one with their own namespace, 2 pods per namespace = 1000 pods
  • each node runs up to 100 pods

That is up to 10 nodes' worth of pod capacity consumed only by these two pods per user, even if the user never runs a pipeline.

A practical example:

> kubectl get pods --no-headers -A -o wide | grep ip-xx-xx-xx-xx.xx-west-2.compute.internal | wc -l
110
> kubectl get pods --no-headers -A -o wide | grep ip-xx-xx-xx-xx.xx-west-2.compute.internal | grep ml-pipeline | wc -l
100
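
As a hedged variant of the commands above (the pod name prefixes are assumed from the two deployments named in the title), a cluster-wide count of these per-namespace pods would look like:

> kubectl get pods --no-headers -A | grep -cE 'ml-pipeline-(visualizationserver|ui-artifact)'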

My question is: what can we do to reduce these costs? For example, is there any way of not creating the pods at all, or of creating them only when they are actually needed?

Expected result

Since this consumes a lot of unnecessary resources, there should be a way to improve it.

Thanks

Impacted by this bug? Give it a 👍.

@connor-mccarthy
Member

/assign @zijianjoy

@zijianjoy
Collaborator

Thank you @andre-lx, the concern makes sense in the case of a high number of namespaces. I am reading the past design decision in https://docs.google.com/document/d/1YNxKUbJLnBRL7DbPn76fsShkQx5Q5jTc-iXfLmLt1FU/edit. The concern is over-granting permission to a single service account. If you would like, I think the current workaround is to avoid creating the visualization server and the artifact fetcher by modifying the profile controller. It will remove the ability to download artifacts and use TensorBoard, but I think it can mitigate the issue in the short term.
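
For anyone who wants to try that, a minimal sketch of the workaround, assuming the stock multi-user manifests where a Metacontroller-based profile controller renders the per-namespace deployments from a sync script shipped in a ConfigMap. The resource names below are assumptions from the upstream manifests, and the ConfigMap name may carry a kustomize hash suffix in your installation:

# Dump the sync script that declares the per-namespace resources.
kubectl -n kubeflow get configmap kubeflow-pipelines-profile-controller-code -o yaml > sync.yaml

# Hand-edit sync.yaml: drop the ml-pipeline-visualizationserver and
# ml-pipeline-ui-artifact entries from the list of desired resources.
# Then re-apply and restart the controller so existing namespaces re-sync.
kubectl -n kubeflow apply -f sync.yaml
kubectl -n kubeflow rollout restart deployment/kubeflow-pipelines-profile-controller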

@andre-lx
Author

andre-lx commented Jun 23, 2023

Hi @zijianjoy

Thanks for the quick turnaround.

This is indeed a possible short-term solution, and we will be using it.

Unfortunately, as you mentioned, this removes the ability to download artifacts, and the artifacts page no longer loads, so I hope this gets solved in a future release.

Thanks again, André

@juliusvonkohout
Member

> Thank you @andre-lx, the concern makes sense in the case of a high number of namespaces. I am reading the past design decision in https://docs.google.com/document/d/1YNxKUbJLnBRL7DbPn76fsShkQx5Q5jTc-iXfLmLt1FU/edit. The concern is over-granting permission to a single service account. If you would like, I think the current workaround is to avoid creating the visualization server and the artifact fetcher by modifying the profile controller. It will remove the ability to download artifacts and use TensorBoard, but I think it can mitigate the issue in the short term.

@zijianjoy Luckily, that is not true. You can easily disable the deprecated visualization server and switch ml-pipeline-ui so that it does not use the resource-hogging artifact proxy; it can use MinIO directly by changing one environment variable. So both components are unnecessary.

To make this secure, only the namespace parameter has to be enforced in the UI, as explained in #8406 (comment).
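
For completeness, a sketch of that environment-variable switch. The comment does not name the variable; ARTIFACTS_SERVICE_PROXY_ENABLED below is an assumption to verify against your KFP frontend version:

# Assumption: ARTIFACTS_SERVICE_PROXY_ENABLED controls whether ml-pipeline-ui
# proxies artifact reads through the per-namespace ml-pipeline-ui-artifact pod.
# Setting it to false lets the UI read from the object store (MinIO) directly.
kubectl -n kubeflow set env deployment/ml-pipeline-ui ARTIFACTS_SERVICE_PROXY_ENABLED=false

# Confirm the pod rolled out with the changed variable.
kubectl -n kubeflow rollout status deployment/ml-pipeline-ui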

@andre-lx
Author

> > Thank you @andre-lx, the concern makes sense in the case of a high number of namespaces. I am reading the past design decision in docs.google.com/document/d/1YNxKUbJLnBRL7DbPn76fsShkQx5Q5jTc-iXfLmLt1FU/edit. The concern is over-granting permission to a single service account. If you would like, I think the current workaround is to avoid creating the visualization server and the artifact fetcher by modifying the profile controller. It will remove the ability to download artifacts and use TensorBoard, but I think it can mitigate the issue in the short term.
>
> @zijianjoy Luckily, that is not true. You can easily disable the deprecated visualization server and switch ml-pipeline-ui so that it does not use the resource-hogging artifact proxy; it can use MinIO directly by changing one environment variable. So both components are unnecessary.
>
> To make this secure, only the namespace parameter has to be enforced in the UI, as explained in #8406 (comment).

Hi @juliusvonkohout. This makes sense.

For now, on version 1.8.5, is there any workaround to fix this issue, i.e. to use the artifacts without the two pods per namespace?

Thanks

@juliusvonkohout
Member

@andre-lx I can help with the open-source implementation, but solving this for a single user is more of a paid consulting question ;-). If you want that, reach out on Slack. As a hint: it is doable in Kubeflow 1.7, but it is still as insecure as the current situation. You can put this on the agenda for the next KFP meeting or order consulting.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the lifecycle/stale label on Oct 19, 2023
@juliusvonkohout
Member

This issue is only becoming more relevant and is definitely not stale.

stale bot removed the lifecycle/stale label on Oct 19, 2023
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the lifecycle/stale label on Jan 18, 2024
@juliusvonkohout
Member

not stale

github-actions bot removed the lifecycle/stale label on Jan 19, 2024
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the lifecycle/stale label on Mar 20, 2024
@juliusvonkohout
Member

Not stale.

github-actions bot removed the lifecycle/stale label on Mar 29, 2024
@juliusvonkohout
Member

@zijianjoy @rimolive can you freeze the lifecycle of this issue? It is still relevant.

@rimolive
Member

rimolive commented Jun 4, 2024

Sure, @juliusvonkohout

/lifecycle frozen
