
[backend] ml-pipeline-visualizationserver and ml-pipeline-ui-artifact per user namespace resource allocation #9555

Open
andre-lx opened this issue Jun 2, 2023 · 14 comments


@andre-lx

andre-lx commented Jun 2, 2023

Environment

  • How did you deploy Kubeflow Pipelines (KFP)?
    Kubeflow deployment
  • KFP version:
    1.8.2

Steps to reproduce

I didn't find any existing reference to this issue.

At the moment, two pods are created for each user namespace: ml-pipeline-visualizationserver and ml-pipeline-ui-artifact. The pipelines are running smoothly, but these two pods per namespace make our Kubernetes cluster a lot more expensive.

Imagine the following scenario:

  • 500 users, each one with their own namespace, 2 pods per namespace = 1000 pods
  • each node runs up to 100 pods

That is up to 10 nodes' worth of pod capacity consumed only by these two pods per user, even if the user never runs a pipeline.

A practical example:

> kubectl get pods --no-headers -A -o wide | grep ip-xx-xx-xx-xx.xx-west-2.compute.internal | wc -l
110
> kubectl get pods --no-headers -A -o wide | grep ip-xx-xx-xx-xx.xx-west-2.compute.internal | grep ml-pipeline | wc -l
100
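
As a hedged variant of the commands above (the pod name prefixes are assumed from the two deployments named in the title), a cluster-wide count of these per-namespace pods would look like:

> kubectl get pods --no-headers -A | grep -cE 'ml-pipeline-(visualizationserver|ui-artifact)'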

My question is: what can we do to reduce these costs? For example, is there any way of not creating the pods at all, or of creating them only when they are actually needed?

Expected result

Since this consumes a lot of unnecessary resources, there should be a way to improve it.

Thanks

Impacted by this bug? Give it a 👍.

@connor-mccarthy
Member

/assign @zijianjoy

@zijianjoy
Collaborator

Thank you @andre-lx, the concern makes sense in the case of a high number of namespaces. I am reading the past design decision in https://docs.google.com/document/d/1YNxKUbJLnBRL7DbPn76fsShkQx5Q5jTc-iXfLmLt1FU/edit. The concern is over-granting permission to a single service account. If you would like, I think the current workaround is to avoid creating the visualization server and the artifact fetcher by modifying the profile controller. It will remove the ability to download artifacts and use TensorBoard, but I think it can mitigate the issue in the short term.
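
For anyone who wants to try that, a minimal sketch of the workaround, assuming the stock multi-user manifests where a Metacontroller-based profile controller renders the per-namespace deployments from a sync script shipped in a ConfigMap. The resource names below are assumptions from the upstream manifests, and the ConfigMap name may carry a kustomize hash suffix in your installation:

# Dump the sync script that declares the per-namespace resources.
kubectl -n kubeflow get configmap kubeflow-pipelines-profile-controller-code -o yaml > sync.yaml

# Hand-edit sync.yaml: drop the ml-pipeline-visualizationserver and
# ml-pipeline-ui-artifact entries from the list of desired resources.
# Then re-apply and restart the controller so existing namespaces re-sync.
kubectl -n kubeflow apply -f sync.yaml
kubectl -n kubeflow rollout restart deployment/kubeflow-pipelines-profile-controller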

@andre-lx
Author

andre-lx commented Jun 23, 2023

Hi @zijianjoy

Thanks for the quick turnaround.

This is indeed a possible short-term solution, and we will be using it.

Unfortunately, as you mentioned, this removes the ability to download artifacts, and the artifacts page no longer loads, so I hope this gets solved in a future release.

Thanks again, André

@juliusvonkohout
Member

> Thank you @andre-lx, the concern makes sense in the case of a high number of namespaces. I am reading the past design decision in https://docs.google.com/document/d/1YNxKUbJLnBRL7DbPn76fsShkQx5Q5jTc-iXfLmLt1FU/edit. The concern is over-granting permission to a single service account. If you would like, I think the current workaround is to avoid creating the visualization server and the artifact fetcher by modifying the profile controller. It will remove the ability to download artifacts and use TensorBoard, but I think it can mitigate the issue in the short term.

@zijianjoy Luckily, that is not true. You can easily disable the deprecated visualization server and switch ml-pipeline-ui so that it does not use the resource-hogging artifact proxy; it can use MinIO directly by changing one environment variable. So both components are unnecessary.

To make this secure, only the namespace parameter has to be enforced in the UI, as explained in #8406 (comment).
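
For completeness, a sketch of that environment-variable switch. The comment does not name the variable; ARTIFACTS_SERVICE_PROXY_ENABLED below is an assumption to verify against your KFP frontend version:

# Assumption: ARTIFACTS_SERVICE_PROXY_ENABLED controls whether ml-pipeline-ui
# proxies artifact reads through the per-namespace ml-pipeline-ui-artifact pod.
# Setting it to false lets the UI read from the object store (MinIO) directly.
kubectl -n kubeflow set env deployment/ml-pipeline-ui ARTIFACTS_SERVICE_PROXY_ENABLED=false

# Confirm the pod rolled out with the changed variable.
kubectl -n kubeflow rollout status deployment/ml-pipeline-ui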

@andre-lx
Author

> > Thank you @andre-lx, the concern makes sense in the case of a high number of namespaces. I am reading the past design decision in docs.google.com/document/d/1YNxKUbJLnBRL7DbPn76fsShkQx5Q5jTc-iXfLmLt1FU/edit. The concern is over-granting permission to a single service account. If you would like, I think the current workaround is to avoid creating the visualization server and the artifact fetcher by modifying the profile controller. It will remove the ability to download artifacts and use TensorBoard, but I think it can mitigate the issue in the short term.
>
> @zijianjoy Luckily, that is not true. You can easily disable the deprecated visualization server and switch ml-pipeline-ui so that it does not use the resource-hogging artifact proxy; it can use MinIO directly by changing one environment variable. So both components are unnecessary.
>
> To make this secure, only the namespace parameter has to be enforced in the UI, as explained in #8406 (comment).

Hi @juliusvonkohout. This makes sense.

For now, on version 1.8.5, is there any workaround to fix this issue, i.e. to use the artifacts without the two pods per namespace?

Thanks

@juliusvonkohout
Member

@andre-lx I can help with the open-source implementation, but solving this for a single user is more of a paid consulting question ;-). If you want that, reach out on Slack. As a hint: it is doable in Kubeflow 1.7, but it is still as insecure as the current situation. You can put this on the agenda for the next KFP meeting or order consulting.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the lifecycle/stale label on Oct 19, 2023
@juliusvonkohout
Member

This issue is only becoming more relevant and is definitely not stale.

stale bot removed the lifecycle/stale label on Oct 19, 2023
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the lifecycle/stale label on Jan 18, 2024
@juliusvonkohout
Member

not stale

github-actions bot removed the lifecycle/stale label on Jan 19, 2024
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the lifecycle/stale label on Mar 20, 2024
@juliusvonkohout
Member

Not stale.

github-actions bot removed the lifecycle/stale label on Mar 29, 2024
@juliusvonkohout
Member

@zijianjoy @rimolive can you freeze the lifecycle of this issue? It is still relevant.

@rimolive
Member

rimolive commented Jun 4, 2024

Sure, @juliusvonkohout

/lifecycle frozen
