
[Feature] Helm charts needs to have minimal resource requirements specified in default values for all containers #643

Open
kad opened this issue Dec 13, 2024 · 9 comments
Labels
feature New feature or request

Comments

@kad

kad commented Dec 13, 2024

Priority

Undecided

OS type

Ubuntu

Hardware type

Xeon-GNR

Running nodes

Multiple Nodes

Description

Currently, all the containers in the helm charts are missing resource requests/limits by default. This makes their Pod QoS class in Kubernetes BestEffort, the lowest-priority workload class on a node, which leads to unstable performance.

Default values for all containers should request some minimal amount of CPU and memory resources.
We don't need to specify limits for now, but minimal requests are mandatory to move the containers into the Burstable Pod QoS class.

Minimal values can easily be collected from any benchmark run during the "steady" state of execution. The values don't need to be precise, but some minimal viable numbers should be added.
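For illustration, a minimal sketch of what such defaults could look like in a chart's `values.yaml` (the exact key layout varies per chart, and the numbers below are placeholders, not measured values):

```yaml
# Hypothetical values.yaml fragment: requests only, no limits,
# so the Pod lands in the Burstable QoS class instead of BestEffort.
resources:
  requests:
    cpu: 100m       # placeholder; replace with a value observed during benchmarks
    memory: 128Mi   # placeholder
  # limits intentionally omitted for now
```

Observed steady-state usage can be read with `kubectl top pod` while a benchmark is running, and rounded down to a minimal viable request.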

@kad kad added the feature New feature or request label Dec 13, 2024
@lianhao
Collaborator

lianhao commented Dec 16, 2024

The minimum resource requests will be very different for different models for those components that directly use LLM models. For the other components that don't directly involve models, that could be done.

@kad
Author

kad commented Dec 16, 2024

Even for different LLM models, we can have some minimal requirements: e.g. 1 vCPU, 100 MB of RAM.
For the model which we consider the default in the helm chart, we can add more realistic minimal requirements.

@eero-t
Contributor

eero-t commented Dec 16, 2024

@kad There's some discussion about how to best handle the problem with (model) limits here: #431

@kad
Author

kad commented Dec 16, 2024

@kad There's some discussion about how to best handle the problem with (model) limits here: #431

that discussion in #431 seems to be stalled, but yes, my point is along the lines that is mentioned in some of the comments:

  • We have the worst Pod QoS now without specifying anything: BestEffort. We need to get all containers to Burstable QoS.
  • Limits are really model-specific and might be trickier to manage. However, they are not needed for this issue currently; only requests need to be specified.
  • Too-big requests can be a problem, but we need to start with some minimal viable values, based on current observations.
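As a sketch of the QoS mechanics being discussed: a container that sets requests but no limits is assigned the Burstable class, which Kubernetes reports in the Pod status (the names below are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example            # illustrative name
spec:
  containers:
  - name: app
    image: example:latest  # illustrative image
    resources:
      requests:            # requests set, limits omitted -> Burstable QoS
        cpu: 100m
        memory: 128Mi
# After creation, `kubectl get pod example -o jsonpath='{.status.qosClass}'`
# prints "Burstable".
```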

@yongfengdu
Collaborator

Do we need to set the minimal resource request for all helm charts, or just the compute-intensive ones like TEI/TGI/vLLM?

@eero-t
Contributor

eero-t commented Dec 19, 2024

Do we need to set the minimal resource request for all helm charts, or just the compute-intensive ones like TEI/TGI/vLLM?

All should have them. The application is not much use if backend inferencing is idling because the services in front of it are throttled, or even evicted.

Suitable (minimum) requests should be much easier to specify for those other services, as their resource usage does not depend on which model the user has set in the Helm values.

@yongfengdu
Collaborator

Two different scenarios.
The application (e2e Examples) should have these resource settings based on experience and the HW, models, etc. This is performance tuning, and we can have recommendation profiles for different flavors.
But here for this issue, there is no best setting, as @lianhao said; we are not really optimizing the app, we just set the minimal requests to make it Burstable QoS. (I'm hesitant to add the same settings to dozens of helm charts without an obvious benefit.)
Another option is to leave the common helm charts unchanged, and add the resource settings for the e2e examples.

Do we need to set the minimal resource request for all helm charts, or just the compute-intensive ones like TEI/TGI/vLLM?

All should have them. The application is not much use if backend inferencing is idling because the services in front of it are throttled, or even evicted.

Suitable (minimum) requests should be much easier to specify for those other services, as their resource usage does not depend on which model the user has set in the Helm values.

@eero-t
Contributor

eero-t commented Dec 20, 2024

According to k8s HPA docs: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#how-does-a-horizontalpodautoscaler-work

CPU/mem resource requests are also needed for, and affect, the CPU/mem utilization targets for autoscaling:

For per-pod resource metrics (like CPU), the controller fetches the metrics from the resource metrics API for each Pod targeted by the HorizontalPodAutoscaler. Then, if a target utilization value is set, the controller calculates the utilization value as a percentage of the equivalent resource request on the containers in each Pod.
...
Please note that if some of the Pod's containers do not have the relevant resource request set, CPU utilization for the Pod will not be defined and the autoscaler will not take any action for that metric.

I.e. when HPA is used with CPU components, the requests must be fairly accurate, not just some minimum values.

But that is needed anyway for deployments to work reasonably when multiple components are allowed on the same nodes. Scaling just makes the issue more visible.
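To illustrate the quoted behaviour with placeholder numbers: assuming a container whose chart sets a `cpu: 500m` request, an actual usage of 750m yields a utilization of 750m / 500m = 150%, so an 80% target would trigger a scale-up. A sketch of such an HPA (names are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa          # illustrative
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example            # illustrative
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80   # percentage of the container's CPU *request*
```

If the container has no CPU request set, this metric is undefined and the autoscaler takes no action for it, as quoted above.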

@eero-t
Contributor

eero-t commented Dec 20, 2024

Two different scenarios. The application (e2e Examples) should have these resource settings based on experience and the HW, models, etc. This is performance tuning, and we can have recommendation profiles for different flavors. But here for this issue, there is no best setting, as @lianhao said; we are not really optimizing the app, we just set the minimal requests to make it Burstable QoS. (I'm hesitant to add the same settings to dozens of helm charts without an obvious benefit.)

Some reasonable minimum request values are better than no values. It guarantees that the application can make at least some progress when a node is constrained, instead of being completely throttled or even evicted. Or do you see some downside?

More accurate resource requests would naturally be better, so that deployments get (autoscaled +) scheduled correctly and are guaranteed the correct amount of resources even in constrained situations, but that can be optimized later.


4 participants