Scaling ceases to function after a while #870

Closed
worldspawn opened this issue Jan 2, 2024 · 21 comments · Fixed by #876
Labels
bug Something isn't working

Comments

@worldspawn
Contributor

Report

After "some time" scaling ceases to work. The HPA's go to ScalingLimited the desired number of replicas is less than the minimum (scale to zero scenario). The interceptor logs

{"level":"error","ts":1704162481.9762473,"caller":"interceptor/proxy_handlers.go:73","msg":"wait function failed, not forwarding request","error":"context marked done while waiting for deployment teamcity to reach > 0 replicas: context canceled","stacktrace":"main.newForwardingHandler.func1\n\[tgithub.com/kedacore/http-add-on/interceptor/proxy_handlers.go:73](http://tgithub.com/kedacore/http-add-on/interceptor/proxy_handlers.go:73)\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2122\[ngithub.com/kedacore/http-add-on/interceptor/middleware.(*Counting).ServeHTTP](http://ngithub.com/kedacore/http-add-on/interceptor/middleware.(*Counting).ServeHTTP)\n\[tgithub.com/kedacore/http-add-on/interceptor/middleware/counting.go:34](http://tgithub.com/kedacore/http-add-on/interceptor/middleware/counting.go:34)\[ngithub.com/kedacore/http-add-on/interceptor/middleware.(*Routing).ServeHTTP](http://ngithub.com/kedacore/http-add-on/interceptor/middleware.(*Routing).ServeHTTP)\n\[tgithub.com/kedacore/http-add-on/interceptor/middleware/routing.go:61](http://tgithub.com/kedacore/http-add-on/interceptor/middleware/routing.go:61)\[ngithub.com/kedacore/http-add-on/interceptor/middleware.(*Logging).ServeHTTP](http://ngithub.com/kedacore/http-add-on/interceptor/middleware.(*Logging).ServeHTTP)\n\[tgithub.com/kedacore/http-add-on/interceptor/middleware/logging.go:42](http://tgithub.com/kedacore/http-add-on/interceptor/middleware/logging.go:42)\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2936\nnet/http.(*conn).serve\n\tnet/http/server.go:1995"}

Expected Behavior

I expected it to scale up.

Actual Behavior

No scaling is performed. Errors are logged.

Steps to Reproduce the Problem

I don't have steps to reproduce; the problem appears inconsistently. I'm hoping the logs mean something to you.

Logs from KEDA HTTP operator

2024-01-02T02:29:01Z	ERROR	external_scaler	error	{"type": "ScaledObject", "namespace": "keda", "name": "keda-add-ons-http-interceptor", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.0.172.137:9090: connect: connection refused\""}
github.com/kedacore/keda/v2/pkg/scalers.(*externalScaler).GetMetricSpecForScaling
	/workspace/pkg/scalers/external_scaler.go:172
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricSpecForScalingForScaler
	/workspace/pkg/scaling/cache/scalers_cache.go:103
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalerState
	/workspace/pkg/scaling/scale_handler.go:684
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledObjectState
	/workspace/pkg/scaling/scale_handler.go:590
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
	/workspace/pkg/scaling/scale_handler.go:241
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
	/workspace/pkg/scaling/scale_handler.go:180
2024-01-02T02:29:01Z	ERROR	external_scaler	error	{"type": "ScaledObject", "namespace": "keda", "name": "keda-add-ons-http-interceptor", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.0.172.137:9090: connect: connection refused\""}
github.com/kedacore/keda/v2/pkg/scalers.(*externalScaler).GetMetricSpecForScaling
	/workspace/pkg/scalers/external_scaler.go:172
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricSpecForScalingForScaler
	/workspace/pkg/scaling/cache/scalers_cache.go:112
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalerState
	/workspace/pkg/scaling/scale_handler.go:684

HTTP Add-on Version

0.6.0

Kubernetes Version

1.28

Platform

Microsoft Azure

Anything else?

No response

@worldspawn
Contributor Author

I restarted the KEDA external scaler deployment (which had nothing of note in its logs), and now services are scaling. I will leave one up and try to hit the scaler's 9090 port directly if/when it happens again. It appears the service became unresponsive. Perhaps a liveness probe is needed?

I'm not sure what the 9091 "health" port is. I can't hit it over HTTP or gRPC.

@JorTurFer
Member

So, do you think that the problem is the external scaler? I want to improve its reliability with probes and more than one instance, but I haven't had time yet. Are you willing to contribute to it?

The current health check for the scaler is registered here:

http-add-on/scaler/main.go

Lines 163 to 169 in f06fcb9

hs := health.NewServer()
hs.SetServingStatus("liveness", grpc_health_v1.HealthCheckResponse_SERVING)
hs.SetServingStatus("readiness", grpc_health_v1.HealthCheckResponse_SERVING)
grpc_health_v1.RegisterHealthServer(
	grpcServer,
	hs,
)
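
For anyone who wants to poke at that endpoint directly, here is a minimal sketch of querying the registered health service over gRPC. It reuses the "liveness" service name from the snippet above and assumes the scaler's gRPC endpoint is reachable at keda-add-ons-http-external-scaler:9090; the address is an assumption, not something confirmed in this thread.

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	grpc_health_v1 "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Assumed address of the external scaler's gRPC endpoint.
	conn, err := grpc.Dial("keda-add-ons-http-external-scaler:9090",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Query the "liveness" service registered via hs.SetServingStatus above.
	resp, err := grpc_health_v1.NewHealthClient(conn).Check(ctx,
		&grpc_health_v1.HealthCheckRequest{Service: "liveness"})
	if err != nil {
		log.Fatalf("health check RPC failed: %v", err)
	}
	log.Printf("liveness status: %s", resp.GetStatus())
}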

@worldspawn
Contributor Author

worldspawn commented Jan 3, 2024

Hi @JorTurFer. I have no idea, just making guesses.

I have a ScaledObject in this state again: the HPA has a status of ScalingLimited, and the service has 1 replica but should have scaled to 0. I manually scaled it to 0 (it did not scale back up to 1). I then made an HTTP request and it scaled to 1, and the status of the HPA changed to ScalingActive.

I can call the external scaler to the degree that I can use Postman to "reflect" the endpoints. I see some events popping up on the HPA after triggering "GetMetrics":

unable to get external metric metabase/s0-http-metabase_002Fmetabase/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: metabase,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: rpc error: code = Unknown desc = error when getting metric values no matching metrics found for s0-http-metabase_002fmetabase
invalid metrics (1 invalid out of 1), first error is: failed to get s0-http-metabase_002Fmetabase external metric value: failed to get s0-http-metabase_002Fmetabase external metric: unable to get external metric metabase/s0-http-metabase_002Fmetabase/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: metabase,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: rpc error: code = Unknown desc = error when getting metric values no matching metrics found for s0-http-metabase_002fmetabase

I'm willing but not able to contribute; I'm not a Go programmer and don't really understand how this all comes together.

@JorTurFer
Member

Could you set 2 replicas for the scaler and check whether this happens less often?
I'll update the scaler health checks before the next release.

@JorTurFer
Member

Based on the logs you sent, it looks like it's the scaler that is failing, because it's rejecting connections from the KEDA operator.

@JorTurFer
Member

JorTurFer commented Jan 4, 2024

I think that this PR solves the issue. The problem is that I'm not totally sure whether it will be compatible with v0.6.0, because there are other important changes, but I'd say they don't affect the scaler (they mainly affect the interceptor and operator).
Would you be willing to give it a quick try? If you are, use this tag for the scaler: docker.io/jorturfer/http-add-on-scaler:scaler-probes. It has been built from the PR's code, so I can't guarantee that it works 100% due to the changes I mentioned, but if it starts, it will work. I'm asking you to check whether the issue is solved before merging the PR.

@worldspawn
Contributor Author

worldspawn commented Jan 4, 2024

Hi @JorTurFer, I've plugged that in. It did start. I'll leave it running and see how it goes 👍

I've also set it to two replicas.

@worldspawn
Contributor Author

worldspawn commented Jan 4, 2024

Doesn't this log

invalid metrics (1 invalid out of 1), first error is: failed to get s0-http-metabase_002Fmetabase external metric value: failed to get s0-http-metabase_002Fmetabase external metric: unable to get external metric metabase/s0-http-metabase_002Fmetabase/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: metabase,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: rpc error: code = Unknown desc = error when getting metric values no matching metrics found for s0-http-metabase_002fmetabase

suggest that it made contact but hasn't got a metric for s0-http-metabase_002fmetabase?

If I run

kubectl.exe --context drawboard-shared-admin get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/metabase/s0-http-metabase_002Fmetabase/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: metabase,},MatchExpressions:[]LabelSelectorRequirement{},}"

I get Error from server (NotFound): the server could not find the requested resource

But if I run this:

kubectl.exe --context drawboard-shared-admin get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/metabase/s0-http-metabase_002Fmetabase?labelSelector=scaledobject.keda.sh%2Fname%3Dmetabase"

{"kind":"ExternalMetricValueList","apiVersion":"external.metrics.k8s.io/v1beta1","metadata":{},"items":[{"metricName":"s0-http-metabase_002Fmetabase","metricLabels":null,"timestamp":"2024-01-04T02:18:22Z","value":"0"}]}

it looks like it works.

@JorTurFer
Member

JorTurFer commented Jan 4, 2024

Do you see any related errors in the KEDA operator logs (not in the HTTP Add-on operator logs)?

@worldspawn
Contributor Author

No, nothing related is appearing in the KEDA operator log.

@JorTurFer
Member

Hmm... interesting...
Could you share the logs from all the components? (For the interceptor and scaler, just one of the instances is enough.)

@werdnum
Contributor

werdnum commented Jan 9, 2024

I encountered this and sent the offending container a SIGQUIT to get a goroutine dump.

It looks like some kind of deadlock, probably.

https://gist.github.com/werdnum/284eeba82a94254e261edd8d7e5b57dd

@JorTurFer
Member

Thanks for the report ❤️, I'll check it when I have some time.

@werdnum
Contributor

werdnum commented Jan 9, 2024

Taking a look at the goroutine dump, at least one problem is that you don't close countsCh if fetchCounts returns an error, but I'm not sure yet why that would lead to a deadlock instead of a goroutine leak.
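
To illustrate the pattern being described, here is a hedged sketch (produceCounts and the fetch callback are hypothetical stand-ins, not the add-on's actual code): if the producer goroutine returns on an error without closing countsCh, a consumer blocked on the channel never wakes up.

package main

import (
	"context"
	"errors"
	"fmt"
)

// produceCounts closes countsCh on every exit path (including the error
// path) via defer; without that defer, the consumer below would block
// forever once fetch starts failing.
func produceCounts(ctx context.Context, countsCh chan<- int, fetch func(context.Context) (int, error)) {
	defer close(countsCh)
	for {
		n, err := fetch(ctx)
		if err != nil {
			return // error path: the defer still closes the channel
		}
		select {
		case countsCh <- n:
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	countsCh := make(chan int)
	calls := 0
	go produceCounts(context.Background(), countsCh, func(context.Context) (int, error) {
		calls++
		if calls > 3 {
			return 0, errors.New("fetch failed")
		}
		return calls, nil
	})
	for n := range countsCh { // terminates only because countsCh gets closed
		fmt.Println("count:", n)
	}
}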

@werdnum
Contributor

werdnum commented Jan 9, 2024

Ah, I think it's related to that error handling but it's a different problem - if any attempt to fetch metrics fails, the queue pinger goroutine just exits and never tries to fetch the data again.

We should change that return to a log statement:

return fmt.Errorf("error getting request counts: %w", err)
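
Roughly, the proposed change is to keep the pinger's loop alive and only log the fetch error. A hedged sketch of that shape (startPinger, fetchAndSaveCounts, and the logr logger are assumptions based on this discussion, not the add-on's exact code):

package pinger

import (
	"context"
	"time"

	"github.com/go-logr/logr"
)

// startPinger sketches the queue pinger loop: on a fetch error it logs and
// keeps ticking instead of returning, so a single failed poll no longer
// stops count collection for good.
func startPinger(ctx context.Context, interval time.Duration,
	fetchAndSaveCounts func(context.Context) error, lggr logr.Logger) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if err := fetchAndSaveCounts(ctx); err != nil {
				// Previously this returned the wrapped error, ending the loop.
				lggr.Error(err, "error getting request counts")
			}
		}
	}
}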

@JorTurFer
Member

Yeah, nice research! I think that you're right!
Can you open a PR fixing it?

@JorTurFer
Member

@worldspawn, do you see any error like error getting request counts in the scaler logs?

@werdnum
Contributor

werdnum commented Jan 9, 2024

I'll try to send a PR tomorrow, but just for completeness, I was trying to figure out why the errgroup doesn't just make the whole process crash. The answer is that the process wants to call grpcServer.GracefulStop first, but there's an active StreamIsActive RPC.

We should additionally set things up so that RPCs that run forever are cancelled before attempting to stop the gRPC server.

In the meantime I'm going to add a liveness probe checking that port 9090 is still open, since GracefulStop closes it before waiting for other requests to finish.
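
For illustration, a hedged sketch of bounding the graceful shutdown so a long-lived stream such as StreamIsActive can't wedge it indefinitely (stopWithTimeout and its wiring are assumptions, not the add-on's actual shutdown code):

package scaler

import (
	"time"

	"google.golang.org/grpc"
)

// stopWithTimeout attempts a graceful stop but falls back to a hard Stop if
// in-flight RPCs (for example a long-lived StreamIsActive stream) do not
// drain within the deadline. GracefulStop stops accepting new connections
// immediately, which is why a "port 9090 is still open" liveness probe can
// catch a server stuck waiting here.
func stopWithTimeout(srv *grpc.Server, timeout time.Duration) {
	done := make(chan struct{})
	go func() {
		srv.GracefulStop() // blocks until all pending RPCs finish
		close(done)
	}()
	select {
	case <-done:
	case <-time.After(timeout):
		srv.Stop() // forcibly close remaining connections and streams
	}
}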

werdnum added a commit to werdnum/keda-http-add-on that referenced this issue Jan 10, 2024
Returning the error here cancels the whole queuePinger.start loop, causing the server to (theoretically) crash.

Due to a separate bug, instead of crashing, the server can get wedged in an unhealthy state.

Fixes kedacore#870.
@werdnum
Contributor

werdnum commented Jan 10, 2024

I put a bunch of fixes into my fork, but I need to sort out licensing, unit tests (if I can find a good way to test my changes), etc.

See branches on https://github.com/werdnum/keda-http-add-on/branches

@JorTurFer
Member

If you have licensing issues, I can continue from your work. As for how to test this... a unit test can probably be complicated, but we can introduce an e2e test for it.
Maybe an e2e test where we kill/scale to zero the interceptor for a while can reproduce the connectivity issues between the scaler and the interceptor, triggering this case (we can run it before and after fixing the issue to double-check that it works).

Currently we execute all the tests in parallel, so this can't be done because it'd affect other test cases, but we have already solved this in the KEDA e2e tests by executing tests in different ways based on folders. I can port that approach here to include this e2e test.

@werdnum
Contributor

werdnum commented Jan 10, 2024

I think it'll be fine licensing-wise, since the project is Apache licensed. Let me take a look at the PR template to see what else I need to do.

werdnum added a commit to werdnum/keda-http-add-on that referenced this issue Jan 10, 2024
Returning the error here cancels the whole queuePinger.start loop, causing the server to (theoretically) crash.

Due to a separate bug, instead of crashing, the server can get wedged in an unhealthy state.

Fixes kedacore#870.

Signed-off-by: Andrew Garrett <andrewgarrett@google.com>
JorTurFer added a commit that referenced this issue Jan 15, 2024
… count metric fetch fails (#876)

* Queue Pinger: Don't return error if fetchAndSaveCounts fails.

Returning the error here cancels the whole queuePinger.start loop, causing the server to (theoretically) crash.

Due to a separate bug, instead of crashing, the server can get wedged in an unhealthy state.

Fixes #870.

Signed-off-by: Andrew Garrett <andrewgarrett@google.com>

* Update CHANGELOG

Signed-off-by: Andrew Garrett <andrewgarrett@google.com>

* Update CHANGELOG.md

Signed-off-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es>

* Update CHANGELOG.md

Signed-off-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es>

---------

Signed-off-by: Andrew Garrett <andrewgarrett@google.com>
Signed-off-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es>
Co-authored-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es>
@github-project-automation github-project-automation bot moved this from To Triage to Done in Roadmap - KEDA HTTP Add-On Jan 15, 2024