[Feature] Helm charts need to have minimal resource requirements specified in default values for all containers #643
Comments
The minimum resource requests will be very different for different models, for those components that directly use LLM models. For the other components, which don't directly involve models, that could be done.
Even for different LLM models, we can have some minimal requirements: e.g. 1 vCPU, 100 MB of RAM.
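For illustration, a minimal sketch of what such a floor could look like in a chart's default values.yaml, assuming a requests-only policy (the numbers below just mirror the suggestion above and are not measured):

```yaml
# Hypothetical values.yaml excerpt: requests only, no limits.
# Setting requests (without limits) moves the Pod from BestEffort
# to Burstable QoS, so it is not the first candidate for throttling/eviction.
resources:
  requests:
    cpu: "1"        # ~1 vCPU
    memory: 100Mi   # ~100 MB of RAM
```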
That discussion in #431 seems to have stalled, but yes, my point is along the lines of what is mentioned in some of its comments:
Do we need to set the minimal resource request for all Helm charts, or just the compute-intensive ones like TEI/TGI/vLLM?
All should have them. The application is not much use if backend inferencing is idling because the services in front of it are being throttled, or even evicted. Suitable (minimum) requests should be much easier to specify for those other services, as their resource usage does not depend on which model the user has set in the Helm values.
Two different scenarios.
According to the k8s HPA docs: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#how-does-a-horizontalpodautoscaler-work CPU/mem resource requests are also needed for, and affect, the CPU/mem utilization targets used for autoscaling:
I.e. when HPA is used with CPU components, the requests must be fairly accurate, not just some minimum values. But that is needed anyway for deployments to work reasonably when multiple components are allowed on the same nodes; scaling just makes the issue more visible.
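For context, HPA's resource-utilization targets are computed relative to the container's request, so the configured request directly determines when scale-out happens. A rough illustration with assumed names and numbers (not taken from the actual charts):

```yaml
# Illustrative autoscaling/v2 HPA; "tgi" and all numbers are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tgi
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tgi
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          # Utilization = (actual usage / requested CPU) * 100.
          # E.g. with "cpu: 4" requested, 80% means scale-out starts once
          # average usage per pod exceeds ~3.2 cores, so an unrealistically
          # low request would make HPA scale out far too early.
          averageUtilization: 80
```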
Some reasonable minimum request values are better than no values. It guarantees that the application can make at least some progress when a node is constrained, instead of being completely throttled, or even evicted. Or do you see some downside? More accurate resource requests would naturally be better, so that deployments get (autoscaled and) scheduled correctly, and are guaranteed the correct amount of resources even in constrained situations, but that can be optimized later.
Priority
Undecided
OS type
Ubuntu
Hardware type
Xeon-GNR
Running nodes
Multiple Nodes
Description
Currently, all the containers in the Helm charts are missing resource requests/limits by default. This makes the Pod QoS in Kubernetes BestEffort, which is the lowest-priority workload on the node, leading to unstable performance.
Default values for all containers should request some minimal amount of CPU and memory resources.
We don't need to specify `limits` for now, but minimal `requests` are mandatory to move containers into the Burstable Pod QoS class. Minimal values can easily be collected from any of the benchmark runs, during the "steady" state of execution. The values don't need to be precise, but some minimally viable numbers should be added.
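As a sketch of how this could land in a chart, assuming the usual Helm pattern of exposing resources through .Values.resources (component name and numbers are placeholders, not benchmark results):

```yaml
# values.yaml (placeholder numbers for a non-model microservice)
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  # limits deliberately left unset for now; requests alone give Burstable QoS
```

The deployment template would then render these with the standard pattern:

```yaml
# templates/deployment.yaml excerpt
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
```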