Scaling ceases to function after a while #870

Closed
worldspawn opened this issue Jan 2, 2024 · 21 comments · Fixed by #876
Labels
bug Something isn't working

Comments

@worldspawn
Contributor

Report

After "some time" scaling ceases to work. The HPA's go to ScalingLimited the desired number of replicas is less than the minimum (scale to zero scenario). The interceptor logs

{"level":"error","ts":1704162481.9762473,"caller":"interceptor/proxy_handlers.go:73","msg":"wait function failed, not forwarding request","error":"context marked done while waiting for deployment teamcity to reach > 0 replicas: context canceled","stacktrace":"main.newForwardingHandler.func1\n\[tgithub.com/kedacore/http-add-on/interceptor/proxy_handlers.go:73](http://tgithub.com/kedacore/http-add-on/interceptor/proxy_handlers.go:73)\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2122\[ngithub.com/kedacore/http-add-on/interceptor/middleware.(*Counting).ServeHTTP](http://ngithub.com/kedacore/http-add-on/interceptor/middleware.(*Counting).ServeHTTP)\n\[tgithub.com/kedacore/http-add-on/interceptor/middleware/counting.go:34](http://tgithub.com/kedacore/http-add-on/interceptor/middleware/counting.go:34)\[ngithub.com/kedacore/http-add-on/interceptor/middleware.(*Routing).ServeHTTP](http://ngithub.com/kedacore/http-add-on/interceptor/middleware.(*Routing).ServeHTTP)\n\[tgithub.com/kedacore/http-add-on/interceptor/middleware/routing.go:61](http://tgithub.com/kedacore/http-add-on/interceptor/middleware/routing.go:61)\[ngithub.com/kedacore/http-add-on/interceptor/middleware.(*Logging).ServeHTTP](http://ngithub.com/kedacore/http-add-on/interceptor/middleware.(*Logging).ServeHTTP)\n\[tgithub.com/kedacore/http-add-on/interceptor/middleware/logging.go:42](http://tgithub.com/kedacore/http-add-on/interceptor/middleware/logging.go:42)\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2936\nnet/http.(*conn).serve\n\tnet/http/server.go:1995"}

Expected Behavior

I expected it to scale up.

Actual Behavior

No scaling is performed. Errors are logged.

Steps to Reproduce the Problem

I don't have steps to reproduce; the problem appears inconsistently. I'm hoping the logs mean something to you.

Logs from KEDA HTTP operator

2024-01-02T02:29:01Z	ERROR	external_scaler	error	{"type": "ScaledObject", "namespace": "keda", "name": "keda-add-ons-http-interceptor", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.0.172.137:9090: connect: connection refused\""}
github.com/kedacore/keda/v2/pkg/scalers.(*externalScaler).GetMetricSpecForScaling
	/workspace/pkg/scalers/external_scaler.go:172
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricSpecForScalingForScaler
	/workspace/pkg/scaling/cache/scalers_cache.go:103
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalerState
	/workspace/pkg/scaling/scale_handler.go:684
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledObjectState
	/workspace/pkg/scaling/scale_handler.go:590
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
	/workspace/pkg/scaling/scale_handler.go:241
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
	/workspace/pkg/scaling/scale_handler.go:180
2024-01-02T02:29:01Z	ERROR	external_scaler	error	{"type": "ScaledObject", "namespace": "keda", "name": "keda-add-ons-http-interceptor", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.0.172.137:9090: connect: connection refused\""}
github.com/kedacore/keda/v2/pkg/scalers.(*externalScaler).GetMetricSpecForScaling
	/workspace/pkg/scalers/external_scaler.go:172
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricSpecForScalingForScaler
	/workspace/pkg/scaling/cache/scalers_cache.go:112
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalerState
	/workspace/pkg/scaling/scale_handler.go:684

HTTP Add-on Version

0.6.0

Kubernetes Version

1.28

Platform

Microsoft Azure

Anything else?

No response

@worldspawn
Contributor Author

I restarted the KEDA external scaler deployment (which had nothing of note in its logs), and now services are scaling. I will leave one up and try to hit the scaler's 9090 port directly if/when it happens again. It appears the service became unresponsive. Perhaps a liveness probe is needed?

I'm not sure what the 9091 "health" port is. I can't hit it over HTTP or gRPC.

@JorTurFer
Member

So, do you think that the problem is the external scaler? I want to improve its reliability with probes and more than one instance, but I haven't had time yet. Are you willing to contribute to it?

The current health check for the scaler is registered here:

http-add-on/scaler/main.go

Lines 163 to 169 in f06fcb9

hs := health.NewServer()
hs.SetServingStatus("liveness", grpc_health_v1.HealthCheckResponse_SERVING)
hs.SetServingStatus("readiness", grpc_health_v1.HealthCheckResponse_SERVING)
grpc_health_v1.RegisterHealthServer(
	grpcServer,
	hs,
)
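
For anyone who wants to poke at that endpoint directly, here is a minimal sketch of querying the registered health service over gRPC. It reuses the "liveness" service name from the snippet above and assumes the scaler's gRPC endpoint is reachable at keda-add-ons-http-external-scaler:9090; the address is an assumption, not something confirmed in this thread.

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	grpc_health_v1 "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Assumed address of the external scaler's gRPC endpoint.
	conn, err := grpc.Dial("keda-add-ons-http-external-scaler:9090",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Query the "liveness" service registered via hs.SetServingStatus above.
	resp, err := grpc_health_v1.NewHealthClient(conn).Check(ctx,
		&grpc_health_v1.HealthCheckRequest{Service: "liveness"})
	if err != nil {
		log.Fatalf("health check RPC failed: %v", err)
	}
	log.Printf("liveness status: %s", resp.GetStatus())
}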

@worldspawn
Contributor Author

worldspawn commented Jan 3, 2024

Hi @JorTurFer. I have no idea, just making guesses.

I have a ScaledObject in this state again: the HPA has a status of ScalingLimited, and the service has 1 replica but should have scaled to 0. I manually scaled it to 0 (it did not scale back up to 1). I then made an HTTP request and it scaled to 1, and the status of the HPA changed to ScalingActive.

I can call the external scaler to the degree that I can use Postman to "reflect" the endpoints. I see some events popping up on the HPA after triggering "GetMetrics":

unable to get external metric metabase/s0-http-metabase_002Fmetabase/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: metabase,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: rpc error: code = Unknown desc = error when getting metric values no matching metrics found for s0-http-metabase_002fmetabase
invalid metrics (1 invalid out of 1), first error is: failed to get s0-http-metabase_002Fmetabase external metric value: failed to get s0-http-metabase_002Fmetabase external metric: unable to get external metric metabase/s0-http-metabase_002Fmetabase/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: metabase,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: rpc error: code = Unknown desc = error when getting metric values no matching metrics found for s0-http-metabase_002fmetabase

I'm willing but not able to contribute; I'm not a Go programmer and don't really understand how this all comes together.

@JorTurFer
Member

Could you set 2 replicas for the scaler and check whether this happens less often?
I'll update the scaler health checks before the next release.

@JorTurFer
Member

Based on the logs you sent, it looks like it's the scaler that is failing, because it's rejecting connections from the KEDA operator.

@JorTurFer
Member

JorTurFer commented Jan 4, 2024

I think that this PR solves the issue. The problem is that I'm not totally sure whether it will be compatible with v0.6.0, because there are other important changes, but I'd say they don't affect the scaler (they mainly affect the interceptor and operator).
Would you be willing to give it a quick try? If you are, use this tag for the scaler: docker.io/jorturfer/http-add-on-scaler:scaler-probes. It has been built from the PR's code, so I can't guarantee that it works 100% due to the changes I mentioned, but if it starts, it will work. I'm asking you to check whether the issue is solved before merging the PR.

@worldspawn
Contributor Author

worldspawn commented Jan 4, 2024

Hi @JorTurFer, I've plugged that in. It did start. I'll leave it running and see how it goes 👍

I've also set it to two replicas.

@worldspawn
Contributor Author

worldspawn commented Jan 4, 2024

Doesn't this log

invalid metrics (1 invalid out of 1), first error is: failed to get s0-http-metabase_002Fmetabase external metric value: failed to get s0-http-metabase_002Fmetabase external metric: unable to get external metric metabase/s0-http-metabase_002Fmetabase/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: metabase,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: rpc error: code = Unknown desc = error when getting metric values no matching metrics found for s0-http-metabase_002fmetabase

suggest that it made contact but hasn't got a metric for s0-http-metabase_002fmetabase?

If I run

kubectl.exe --context drawboard-shared-admin get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/metabase/s0-http-metabase_002Fmetabase/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: metabase,},MatchExpressions:[]LabelSelectorRequirement{},}"

I get Error from server (NotFound): the server could not find the requested resource

But if I run this:

kubectl.exe --context drawboard-shared-admin get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/metabase/s0-http-metabase_002Fmetabase?labelSelector=scaledobject.keda.sh%2Fname%3Dmetabase"

{"kind":"ExternalMetricValueList","apiVersion":"external.metrics.k8s.io/v1beta1","metadata":{},"items":[{"metricName":"s0-http-metabase_002Fmetabase","metricLabels":null,"timestamp":"2024-01-04T02:18:22Z","value":"0"}]}

it looks like it works.

@JorTurFer
Member

JorTurFer commented Jan 4, 2024

Do you see any related errors in the KEDA operator logs (not in the HTTP Add-on operator logs)?

@worldspawn
Contributor Author

No, nothing related is appearing in the KEDA operator log.

@JorTurFer
Member

Hmm... interesting...
Could you share the logs from all the components? (For the interceptor and scaler, just one of the instances is enough.)

@werdnum
Contributor

werdnum commented Jan 9, 2024

I encountered this and sent the offending container a SIGQUIT to get a goroutine dump.

It looks like some kind of deadlock, probably.

https://gist.github.com/werdnum/284eeba82a94254e261edd8d7e5b57dd

@JorTurFer
Member

Thanks for the report ❤️, I'll check it when I have some time.

@werdnum
Contributor

werdnum commented Jan 9, 2024

Taking a look at the goroutine dump, at least one problem is that you don't close countsCh if fetchCounts returns an error, but I'm not sure yet why that would lead to a deadlock instead of a goroutine leak.
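
To illustrate the pattern being described, here is a hedged sketch (produceCounts and the fetch callback are hypothetical stand-ins, not the add-on's actual code): if the producer goroutine returns on an error without closing countsCh, a consumer blocked on the channel never wakes up.

package main

import (
	"context"
	"errors"
	"fmt"
)

// produceCounts closes countsCh on every exit path (including the error
// path) via defer; without that defer, the consumer below would block
// forever once fetch starts failing.
func produceCounts(ctx context.Context, countsCh chan<- int, fetch func(context.Context) (int, error)) {
	defer close(countsCh)
	for {
		n, err := fetch(ctx)
		if err != nil {
			return // error path: the defer still closes the channel
		}
		select {
		case countsCh <- n:
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	countsCh := make(chan int)
	calls := 0
	go produceCounts(context.Background(), countsCh, func(context.Context) (int, error) {
		calls++
		if calls > 3 {
			return 0, errors.New("fetch failed")
		}
		return calls, nil
	})
	for n := range countsCh { // terminates only because countsCh gets closed
		fmt.Println("count:", n)
	}
}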

@werdnum
Contributor

werdnum commented Jan 9, 2024

Ah, I think it's related to that error handling but it's a different problem - if any attempt to fetch metrics fails, the queue pinger goroutine just exits and never tries to fetch the data again.

We should change that return to a log statement:

return fmt.Errorf("error getting request counts: %w", err)
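
Roughly, the proposed change is to keep the pinger's loop alive and only log the fetch error. A hedged sketch of that shape (startPinger, fetchAndSaveCounts, and the logr logger are assumptions based on this discussion, not the add-on's exact code):

package pinger

import (
	"context"
	"time"

	"github.com/go-logr/logr"
)

// startPinger sketches the queue pinger loop: on a fetch error it logs and
// keeps ticking instead of returning, so a single failed poll no longer
// stops count collection for good.
func startPinger(ctx context.Context, interval time.Duration,
	fetchAndSaveCounts func(context.Context) error, lggr logr.Logger) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if err := fetchAndSaveCounts(ctx); err != nil {
				// Previously this returned the wrapped error, ending the loop.
				lggr.Error(err, "error getting request counts")
			}
		}
	}
}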

@JorTurFer
Member

Yeah, nice research! I think that you're right!
Can you open a PR fixing it?

@JorTurFer
Member

@worldspawn, do you see any error like error getting request counts in the scaler logs?

@werdnum
Contributor

werdnum commented Jan 9, 2024

I'll try to send a PR tomorrow, but just for completeness, I was trying to figure out why the errgroup doesn't just make the whole process crash. The answer is that the process wants to call grpcServer.GracefulStop first, but there's an active StreamIsActive RPC.

We should additionally set things up so that RPCs that run forever are cancelled before attempting to stop the gRPC server.

In the meantime I'm going to add a liveness probe checking that port 9090 is still open, since GracefulStop closes it before waiting for other requests to finish.
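
For illustration, a hedged sketch of bounding the graceful shutdown so a long-lived stream such as StreamIsActive can't wedge it indefinitely (stopWithTimeout and its wiring are assumptions, not the add-on's actual shutdown code):

package scaler

import (
	"time"

	"google.golang.org/grpc"
)

// stopWithTimeout attempts a graceful stop but falls back to a hard Stop if
// in-flight RPCs (for example a long-lived StreamIsActive stream) do not
// drain within the deadline. GracefulStop stops accepting new connections
// immediately, which is why a "port 9090 is still open" liveness probe can
// catch a server stuck waiting here.
func stopWithTimeout(srv *grpc.Server, timeout time.Duration) {
	done := make(chan struct{})
	go func() {
		srv.GracefulStop() // blocks until all pending RPCs finish
		close(done)
	}()
	select {
	case <-done:
	case <-time.After(timeout):
		srv.Stop() // forcibly close remaining connections and streams
	}
}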

werdnum added a commit to werdnum/keda-http-add-on that referenced this issue Jan 10, 2024
Returning the error here cancels the whole queuePinger.start loop, causing the server to (theoretically) crash.

Due to a separate bug, instead of crashing, the server can get wedged in an unhealthy state.

Fixes kedacore#870.
@werdnum
Contributor

werdnum commented Jan 10, 2024

I put a bunch of fixes into my fork, but I need to sort out licensing, unit tests (if I can find a good way to test my changes), etc.

See branches on https://github.com/werdnum/keda-http-add-on/branches

@JorTurFer
Member

If you have licensing issues, I can continue from your work. As for how to test this... a unit test can probably be complicated, but we can introduce an e2e test for it.
Maybe an e2e test where we kill/scale to zero the interceptor for a while can reproduce the connectivity issues between the scaler and the interceptor, triggering this case (we can run it before and after fixing the issue to double-check that it works).

Currently we execute all the tests in parallel, so this can't be done because it'd affect other test cases, but we have already solved this in the KEDA e2e tests by executing tests in different ways based on folders. I can port that approach here to include this e2e test.

@werdnum
Contributor

werdnum commented Jan 10, 2024

I think it'll be fine licensing-wise, since the project is Apache licensed. Let me take a look at the PR template to see what else I need to do.

werdnum added a commit to werdnum/keda-http-add-on that referenced this issue Jan 10, 2024
Returning the error here cancels the whole queuePinger.start loop, causing the server to (theoretically) crash.

Due to a separate bug, instead of crashing, the server can get wedged in an unhealthy state.

Fixes kedacore#870.

Signed-off-by: Andrew Garrett <andrewgarrett@google.com>
JorTurFer added a commit that referenced this issue Jan 15, 2024
… count metric fetch fails (#876)

* Queue Pinger: Don't return error if fetchAndSaveCounts fails.

Returning the error here cancels the whole queuePinger.start loop, causing the server to (theoretically) crash.

Due to a separate bug, instead of crashing, the server can get wedged in an unhealthy state.

Fixes #870.

Signed-off-by: Andrew Garrett <andrewgarrett@google.com>

* Update CHANGELOG

Signed-off-by: Andrew Garrett <andrewgarrett@google.com>

* Update CHANGELOG.md

Signed-off-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es>

* Update CHANGELOG.md

Signed-off-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es>

---------

Signed-off-by: Andrew Garrett <andrewgarrett@google.com>
Signed-off-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es>
Co-authored-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es>
@github-project-automation github-project-automation bot moved this from To Triage to Done in Roadmap - KEDA HTTP Add-On Jan 15, 2024