lws generated replicas is incorrect #390
Comments
Hi @lklkxcxc, it seems to be related to HPA as well. Should we also submit an issue to https://github.com/kubeflow/arena?
@yankay thanks
I ran into the same problem. Was this fixed in v0.5.1? @lklkxcxc
Could you please help reproduce this problem directly using the LWS manifest instead of using Kubeflow?
@yankay I didn't use Kubeflow. I just installed LWS with Helm and tried the minimal example. It then got stuck in an infinite loop that only stopped once it had exhausted all the resources of the cluster, which is quite dangerous and definitely unacceptable. The root cause is quite simple: the pod webhook keeps marking each pod as a leader pod, so a StatefulSet (sts) is created for it, which goes on to create another child leader pod, and this repeats until all resources are gone. This patch fixes it, though I'm not sure whether it has any side effects:
By the way, this issue can cause BIG problems and makes the project dangerous to deploy on any cluster! Please fix this with the highest priority.
Hi @WeiZhang555
I tried both 'main' and 'v0.5.0'; neither works. @yankay
This seems to be the same as #391, but I cannot reproduce it yet.
Hi @WeiZhang555
Hi @WeiZhang555 @lklkxcxc
@yankay I think I got your point. lws/pkg/controllers/pod_controller.go Line 337 in 10a741c
LWS relies on the StatefulSet start ordinal feature, which is supported after v1.31. I am using Kubernetes v1.26.5. To make LWS work in my environment, I need to remove the dependency on this feature.
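For anyone who wants to check their cluster before installing, here is a minimal Go sketch of the version comparison described above. It is not part of LWS: `parseMajorMinor` and `atLeast` are hypothetical helper names, and the minimum version is passed in as a parameter, with 1.31 simply taken from the comment above rather than from the LWS documentation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMajorMinor extracts the major and minor numbers from a Kubernetes
// version string such as "v1.26.5". Hypothetical helper, not an LWS API.
func parseMajorMinor(version string) (int, int, error) {
	parts := strings.SplitN(strings.TrimPrefix(version, "v"), ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unexpected version format: %q", version)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, fmt.Errorf("bad major version in %q: %w", version, err)
	}
	// Some distributions report the minor version with a trailing "+", e.g. "26+".
	minor, err := strconv.Atoi(strings.TrimRight(parts[1], "+"))
	if err != nil {
		return 0, 0, fmt.Errorf("bad minor version in %q: %w", version, err)
	}
	return major, minor, nil
}

// atLeast reports whether version is greater than or equal to
// wantMajor.wantMinor. The threshold is supplied by the caller; 1.31 is
// only the value quoted in this thread.
func atLeast(version string, wantMajor, wantMinor int) (bool, error) {
	major, minor, err := parseMajorMinor(version)
	if err != nil {
		return false, err
	}
	if major != wantMajor {
		return major > wantMajor, nil
	}
	return minor >= wantMinor, nil
}
```

Calling `atLeast("v1.26.5", 1, 31)` returns false, which matches the environment described above. The version string itself can be read from the server section of `kubectl version`.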
I used Kubeflow to generate a distributed-serving job with one master and one worker, but 12 pods were created:
kubectl get pod output:
kubectl get LeaderWorkerSet vllm-alpha-distributed-serving -o yaml output: