Scaler fails only when failing to get counts from all the interceptor endpoints #903

Open: 3 commits to be merged into main

@Mizhentaotuo Mizhentaotuo commented Jan 25, 2024

Provide a description of what has been changed
We observed that the scaler fails and exits the loop when it fails to get counts from any one of the interceptor replicas.
I am not sure whether this is the intended behavior, but sometimes a single interceptor replica is down only because it runs on a spot node. When the node goes down and the endpoints of the interceptor service have not been updated yet, the scaler still tries to fetch from an endpoint that no longer exists. Most of the time, the killed interceptor pod recovers on its own.

Checklist

Fixes #
Change the scaler so that it fails only when fetching the counts fails for all of the interceptor endpoints.
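
The intended behavior can be sketched roughly as follows. This is a minimal Python illustration of the idea, not the code in this PR: `aggregate_counts`, `fetch_counts`, and the shape of the per-endpoint result (a mapping from host to pending-request count) are assumptions made for the example.

```python
from typing import Callable, Dict, List


def aggregate_counts(
    endpoints: List[str],
    fetch_counts: Callable[[str], Dict[str, int]],
) -> Dict[str, int]:
    """Sum per-host counts across endpoints, tolerating partial failures.

    Raises RuntimeError only if every endpoint fails (or none are given).
    """
    totals: Dict[str, int] = {}
    errors: List[Exception] = []
    for endpoint in endpoints:
        try:
            counts = fetch_counts(endpoint)
        except Exception as exc:  # assumption: any fetch error marks the endpoint as failed
            errors.append(exc)
            continue
        for host, count in counts.items():
            totals[host] = totals.get(host, 0) + count
    # Fail only when no endpoint at all returned counts.
    if len(errors) == len(endpoints):
        raise RuntimeError(f"all {len(endpoints)} interceptor endpoints failed")
    return totals
```

With three endpoints where one is unreachable, the function returns the sum of the two that answered. It raises only when none of them answer.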

Comment:
I am new to this and not sure whether the existing behavior is intended. Please let me know if there is a better way, or if this can be handled by a config value that I am not aware of. Much appreciated.

@Mizhentaotuo Mizhentaotuo requested a review from a team as a code owner January 25, 2024 18:19
Signed-off-by: mingzhe <whitelmz@hotmail.com>
Signed-off-by: Mizhentaotuo

@JorTurFer
Member

Hey!
Thanks for the PR, but I don't quite get the point (I'm probably missing something). The interceptor cache is updated every second, so I'd expect the scaler status not to be really affected: when you drain a node, the endpoints should be removed from the list of endpoints quite fast. Am I missing something?
We decided to restart the scaler if the metrics aren't available because getting counts from all the endpoints is the only way to ensure that the traffic is measured correctly. Otherwise, the scaler could respond with a wrong value that triggers scaling in, impacting users.
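
To make the concern concrete, here is a small worked example with made-up numbers. It assumes the unreachable interceptor was still holding requests, and `desired_replicas` is a hypothetical simplification (total pending requests divided by a per-replica target, rounded up), not the add-on's actual scaling formula.

```python
import math
from typing import List


def desired_replicas(pending_per_interceptor: List[int], target_per_replica: int) -> int:
    """Hypothetical scaling rule: ceil(total pending / target), at least 1."""
    total = sum(pending_per_interceptor)
    return max(1, math.ceil(total / target_per_replica))


# All three interceptors report: 360 pending requests in total.
full = desired_replicas([120, 110, 130], target_per_replica=100)

# One interceptor is unreachable and its count is silently skipped: 230 pending.
partial = desired_replicas([120, 110], target_per_replica=100)

print(full, partial)  # prints "4 3"
```

In this sketch the partial view asks for one replica fewer than the full view, which is the kind of unintended scale-in described above.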

@JorTurFer
Member

That said, I guess we could try to figure out a better way to hit the interceptors, such as getting the ready pods and calculating the endpoints on the scaler side instead of using the k8s endpoints.
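
A rough sketch of that idea, in Python for brevity: filter the interceptor pods down to those reporting a `Ready` condition and build the URLs from their pod IPs. The pod dictionaries mirror the `status.podIP` and `status.conditions` fields of a Kubernetes Pod object; the function names, the port, and the `/queue` path are assumptions for illustration only.

```python
from typing import Any, Dict, List


def is_ready(pod: Dict[str, Any]) -> bool:
    """True if the pod has an IP and a condition of type Ready with status "True"."""
    status = pod.get("status", {})
    if not status.get("podIP"):
        return False
    return any(
        cond.get("type") == "Ready" and cond.get("status") == "True"
        for cond in status.get("conditions", [])
    )


def ready_pod_endpoints(
    pods: List[Dict[str, Any]], port: int, path: str = "/queue"
) -> List[str]:
    """Build one URL per ready pod; port and path are hypothetical values."""
    return [
        f"http://{pod['status']['podIP']}:{port}{path}"
        for pod in pods
        if is_ready(pod)
    ]
```

A pod that is still starting on a freshly created node would have no `Ready` condition set to `"True"` and would simply be left out of the list, rather than causing a failed request.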

@Mizhentaotuo
Author

Mizhentaotuo commented Jan 26, 2024

Hey! Thanks a lot for the quick reply.

> so I'd expect that the Scaler status isn't really affected because when you drain a node, the endpoints should get removed quite fast from the list of endpoints. Am I missing something?

No, I agree that is the expected behavior. But what we see on our cluster is that the scaler failed because one node was removed by GCP, which then spun up another node, and that takes some time (possibly 1 min). Could it be that the endpoints list is updated, but the pod is not ready yet because the new node is not ready either? I do not know much about this part.
In any case, I think your idea of getting the ready pods first is nice. I will try to update the PR.
Thanks again!
