Scaling ceases to function after a while #870
Comments
I restarted the KEDA external scaler deployment (which had nothing of note in its logs). Now services are scaling; I will leave one up and try to hit the scaler's 9090 port directly if/when it happens again. It appears the service became unresponsive. Perhaps a liveness probe is needed? I'm not sure what the 9091 "health" port is. I can't hit it over HTTP or gRPC.
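For reference, if the 9091 port does expose the standard gRPC health service, it can be queried with a small Go client like the sketch below. This is only an assumption about what that port serves (the add-on's actual wiring isn't shown in this thread); the service address is a placeholder.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Placeholder address: adjust to the scaler service's cluster DNS name and port.
	conn, err := grpc.Dial("external-scaler.example:9091",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Calls grpc.health.v1.Health/Check; an unresponsive server times out here.
	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		panic(err)
	}
	fmt.Println("health status:", resp.GetStatus())
}
```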
So, do you think that the problem is the external scaler? I want to improve its reliability with probes and more than one instance, but I haven't had time yet. Are you willing to contribute to it? The current healthcheck for the scaler is registered here: Lines 163 to 169 in f06fcb9
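The referenced lines aren't reproduced in this thread, so as context only, here is a generic sketch of how a gRPC health service is typically registered in Go. It is an assumption that the scaler uses the standard google.golang.org/grpc/health package; the port is a placeholder.

```go
package main

import (
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Placeholder port for the health endpoint.
	lis, err := net.Listen("tcp", ":9091")
	if err != nil {
		panic(err)
	}

	srv := grpc.NewServer()

	// Register the standard health service so clients and Kubernetes gRPC
	// probes can call grpc.health.v1.Health/Check against this server.
	hs := health.NewServer()
	hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	healthpb.RegisterHealthServer(srv, hs)

	if err := srv.Serve(lis); err != nil {
		panic(err)
	}
}
```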
Hi @JorTurFer. I have no idea, just making guesses. I have a scaled object in this state again; the HPA has a status of
I can call the external-scaler to the degree that I can use Postman to "reflect" the endpoints. I have some events popping up on the HPA after triggering "GetMetrics":
I'm willing but not able to contribute. I'm not a Go programmer and don't really understand how this all comes together.
Could you set 2 replicas for the scaler and check if this happens less?
Based on the logs you sent, it looks like it's the scaler that is failing, because it's rejecting connections from the KEDA operator.
I think that this PR solves the issue. The problem is that I'm not totally sure if it will be compatible with v0.6.0 because there are other important changes, but I'd say that they don't affect the scaler (they principally affect the interceptor and operator).
Hi @JorTurFer, I've plugged that in. It did start. I'll leave it running and see how it goes 👍 Also set it to two replicas.
Doesn't this log
suggest that it made contact but hasn't got a metric for
If I run
I get
But if I run this:
it looks like it works.
Do you see any related error in the KEDA operator logs (not in the HTTP Add-on operator)?
No, nothing related appearing in the KEDA operator log.
mm... interesting...
I encountered this, and sent the offending container a SIGQUIT to get a goroutine dump. Looks like some kind of deadlock, probably. https://gist.github.com/werdnum/284eeba82a94254e261edd8d7e5b57dd
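As a side note, the same goroutine stacks can be obtained without signalling the process if an HTTP pprof endpoint is exposed. This is a generic Go sketch of that approach, not something the add-on necessarily does; the port is an arbitrary choice.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve pprof on a local-only port; a full goroutine dump (equivalent to
	// what SIGQUIT prints) is then available at:
	//   curl localhost:6060/debug/pprof/goroutine?debug=2
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```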
Thanks for the report ❤️, I'll check it when I have some time.
Taking a look at the goroutine dump, at least one problem is that you don't close
Ah, I think it's related to that error handling, but it's a different problem: if any attempt to fetch metrics fails, the queue pinger goroutine just exits and never tries to fetch the data again. We should change that return to a log statement. http-add-on/scaler/queue_pinger.go Line 102 in 15718d1
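For illustration, the kind of change described would look roughly like the sketch below. The loop structure and the names queuePinger, fetchAndSaveCounts, and q.logger are assumptions for the sake of the example, not the actual contents of queue_pinger.go.

```go
package scaler

import (
	"context"
	"time"

	"github.com/go-logr/logr"
)

// queuePinger is a simplified stand-in for the real type; only the fields
// needed for this sketch are included.
type queuePinger struct {
	logger logr.Logger
}

// fetchAndSaveCounts is a placeholder for the real fetch logic.
func (q *queuePinger) fetchAndSaveCounts(ctx context.Context) error { return nil }

func (q *queuePinger) start(ctx context.Context, ticker *time.Ticker) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if err := q.fetchAndSaveCounts(ctx); err != nil {
				// Previously a "return err" here ended the goroutine, so counts
				// were never refreshed again; log and retry on the next tick.
				q.logger.Error(err, "failed to fetch queue counts")
			}
		}
	}
}
```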
Yeah, nice research! I think that you're right!
@worldspawn, do you see any error like
I'll try to send a PR tomorrow, but just for completeness, I was trying to figure out why the
We should additionally set things up such that those RPCs that run forever are cancelled before attempting to stop the gRPC server. In the meantime I'm going to add a liveness probe checking that port 9090 is still open, as
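For illustration, the cancellation-before-shutdown idea could look roughly like the sketch below. The function name, package, and the 10-second drain timeout are assumptions; long-running streaming RPCs are expected to watch their own context and return when it's cancelled, otherwise GracefulStop blocks indefinitely.

```go
package server

import (
	"context"
	"net"
	"time"

	"google.golang.org/grpc"
)

// serveWithGracefulShutdown runs the gRPC server until ctx is cancelled, then
// drains it, force-stopping if long-running RPCs don't exit in time.
func serveWithGracefulShutdown(ctx context.Context, srv *grpc.Server, lis net.Listener) error {
	errCh := make(chan error, 1)
	go func() { errCh <- srv.Serve(lis) }()

	select {
	case err := <-errCh:
		return err
	case <-ctx.Done():
		stopped := make(chan struct{})
		go func() {
			srv.GracefulStop() // blocks until all in-flight RPCs finish
			close(stopped)
		}()
		select {
		case <-stopped:
		case <-time.After(10 * time.Second):
			srv.Stop() // forcibly close connections if draining hangs
		}
		return ctx.Err()
	}
}
```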
Returning the error here cancels the whole queuePinger.start loop, causing the server to (theoretically) crash. Due to a separate bug, instead of crashing, the server can get wedged in an unhealthy state. Fixes kedacore#870.
I put a bunch of fixes into my fork, but need to sort out licensing, unit tests if I can find a good way to test my changes, etc. See branches on https://github.com/werdnum/keda-http-add-on/branches
If you have licensing issues, I can continue from your work. About how to test this... probably a unit test would be complicated, but we can introduce an e2e test for this. Currently we execute all the tests in parallel, so this can't be done as-is because it'd affect other test cases, but we have this already solved in KEDA e2e tests by executing tests in different ways based on folders. I can port that approach here to include this e2e test.
I think it'll be fine licensing-wise, since the project is Apache licensed. Let me take a look at the PR template to see what else I need to do.
Returning the error here cancels the whole queuePinger.start loop, causing the server to (theoretically) crash. Due to a separate bug, instead of crashing, the server can get wedged in an unhealthy state. Fixes kedacore#870. Signed-off-by: Andrew Garrett <andrewgarrett@google.com>
… count metric fetch fails (#876)
* Queue Pinger: Don't return error if fetchAndSaveCounts fails. Returning the error here cancels the whole queuePinger.start loop, causing the server to (theoretically) crash. Due to a separate bug, instead of crashing, the server can get wedged in an unhealthy state. Fixes #870. Signed-off-by: Andrew Garrett <andrewgarrett@google.com>
* Update CHANGELOG Signed-off-by: Andrew Garrett <andrewgarrett@google.com>
* Update CHANGELOG.md Signed-off-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es>
* Update CHANGELOG.md Signed-off-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es>
---------
Signed-off-by: Andrew Garrett <andrewgarrett@google.com>
Signed-off-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es>
Co-authored-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es>
Report
After "some time" scaling ceases to work. The HPA's go to
ScalingLimited
the desired number of replicas is less than the minimum (scale to zero scenario). The interceptor logsExpected Behavior
I expected it to scale up
Actual Behavior
No scaling is performed. Errors are logged
Steps to Reproduce the Problem
I don't have steps. The problem appears inconsistently. I'm hoping the logs mean something to you.
Logs from KEDA HTTP operator
HTTP Add-on Version
0.6.0
Kubernetes Version
1.28
Platform
Microsoft Azure
Anything else?
No response