
KEDA not scaling the pods with error grpc: addrConn.createTransport failed to connect #5052

Closed
sktemkar opened this issue Oct 4, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@sktemkar commented Oct 4, 2023

Report

I have KEDA (v2.10.1) enabled in an AKS (v1.26.6) cluster, installed using the Helm chart. It created 2 metrics server pods.
But scaling is not working: only 1 worker pod is scaled for the jobs.

The logs of one of the metrics servers show the error "grpc: addrConn.createTransport failed to connect". The other metrics server reports the connection as established.

Err: connection error: desc = "transport: Error while dialing: dial tcp 10.105.162.91:9666: connect: connection timed out"
W1004 22:54:13.289988 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {
"Addr": "keda-operator.kube-system.svc.cluster.local:9666",
"ServerName": "keda-operator.kube-system.svc.cluster.local:9666",
"Attributes": null,
"BalancerAttributes": null,
"Type": 0,
"Metadata": null
}. Err: connection error: desc = "transport: Error while dialing: dial tcp XX.XX.XX.XX:9666: connect: connection timed out"
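A quick way to confirm whether this is a network-level problem (rather than a KEDA bug) is to test TCP reachability of the operator's gRPC endpoint from inside the cluster. This is a hedged sketch: the service name and port are taken from the error log above, and the `busybox` image and pod name are illustrative; it must run against a live cluster.

```shell
# Launch a throwaway pod and test whether the KEDA operator's gRPC
# port is reachable in-cluster. Service name and port 9666 come from
# the error log; the pod name "nettest" and busybox image are arbitrary.
kubectl run nettest --rm -it --restart=Never --image=busybox:1.36 -- \
  nc -zv -w 5 keda-operator.kube-system.svc.cluster.local 9666
```

If this times out from some nodes but succeeds from others, the problem is likely node- or network-level (e.g. a starved system node pool) rather than in KEDA itself.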

Expected Behavior

The worker pods should scale up to multiple pods as the job requests increase.

Actual Behavior

The worker pods do not scale up; only 1 worker pod runs even as the job requests increase.

Steps to Reproduce the Problem

  1. Installed KEDA (v2.10.1) in Azure AKS (v1.26.6) using the Helm chart, deployed via Bicep.
  2. Set up Airflow in the AKS cluster using the Helm chart.

Logs from KEDA operator


2023-10-04T23:30:31Z ERROR cert-rotation Webhook not found. Unable to update certificate. {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "error": "ValidatingWebhookConfiguration.admissionregistration.k8s.io "keda-admission" not found"}
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).ensureCerts
/workspace/vendor/github.com/open-policy-agent/cert-controller/pkg/rotator/rotator.go:731
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).Reconcile
/workspace/vendor/github.com/open-policy-agent/cert-controller/pkg/rotator/rotator.go:700
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235
2023-10-04T23:30:31Z INFO cert-rotation Ensuring CA cert {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}

KEDA Version

2.10.1

Kubernetes Version

1.26

Platform

Microsoft Azure

Scaler Details

No response

Anything else?

No response

@sktemkar sktemkar added the bug Something isn't working label Oct 4, 2023
@v-shenoy (Contributor) commented Oct 5, 2023

@tomkerkhove @JorTurFer This is actually regarding a KEDA installation using the AKS add-on.

You can ignore the missing webhook configuration error; that's an error in the add-on Helm chart that we're already planning to fix. But I need a bit of help here in understanding the constant timeouts the metrics server hits when trying to communicate with the operator.

Let me know what information is required to diagnose this further, and I can provide it and work alongside you.

@zroubalik (Member)

@v-shenoy thanks for the clarification.

Is the timeout message appearing constantly, or only during startup? The latter is okay; in that case the metrics server waits until the operator is up. You should see this message in the logs: https://github.com/kedacore/keda/blob/8adb70e97a08a4690613eef4c4f00391e44e1603/pkg/provider/provider.go#L84C38-L84C97
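To check which replicas actually established the connection, the metrics server logs can be filtered per pod. A hedged sketch: the label selector and namespace below match the defaults of the KEDA Helm chart, but may differ in the AKS add-on installation; the grep pattern is deliberately loose rather than an exact match of the log message.

```shell
# List each metrics-server replica and grep its logs for the
# connection-established message. Label selector and namespace are
# the Helm-chart defaults and may differ in the AKS add-on.
for pod in $(kubectl get pods -n kube-system \
    -l app=keda-operator-metrics-apiserver -o name); do
  echo "== $pod =="
  kubectl logs -n kube-system "$pod" | grep -i "establish"
done
```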

@v-shenoy (Contributor) commented Oct 5, 2023

There are two replicas of the metrics server. One of them is able to connect successfully; the other is continuously timing out. We had multiple clusters face this issue. In some of them, restarting the metrics server deployment was enough, but not in all.

@JorTurFer (Member)

Do you see errors in the KEDA operator pod? That message is printed by the metrics server because it tries to establish the gRPC connection with the operator to get metrics (since KEDA 2.9, the metrics server is just a proxy for the HPA controller; all the work is done by the operator).

@v-shenoy (Contributor) commented Oct 5, 2023

Besides the missing webhook configuration, I don't think we were seeing any other errors in the operator pod. Plus, one of the metrics servers did connect successfully. Correct me if I am missing something, @sktemkar.

@JorTurFer (Member)

Any update?

@v-shenoy (Contributor)

I think the AKS system pods were being throttled because the system pool node size was too small. @sktemkar increased the node size and added the CriticalAddonsOnly=true:NoSchedule taint to the system node pool (the add-on's KEDA pods have the corresponding toleration enabled). It seems to be working for now, but the plan is to monitor for a few more days and see if the issue recurs.
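For anyone hitting the same symptom, the remediation described above can be sketched with the Azure CLI. This is an illustrative sketch, not the exact commands used here: the resource group, cluster, and node pool names are placeholders, and restarting the deployment assumes the default `keda-operator` deployment name in `kube-system`.

```shell
# Add the CriticalAddonsOnly taint to the system node pool so only
# system-critical workloads (such as the KEDA add-on pods, which carry
# the matching toleration) are scheduled there. Names are placeholders.
az aks nodepool update \
  --resource-group my-rg \
  --cluster-name my-aks \
  --name systempool \
  --node-taints CriticalAddonsOnly=true:NoSchedule

# Restart KEDA so its pods get rescheduled onto the system pool.
# Deployment name/namespace assume the default installation layout.
kubectl rollout restart deployment/keda-operator -n kube-system
```

Note that the VM size of an existing node pool cannot be changed in place; increasing the node size, as done in this issue, means creating a larger system node pool (or recreating the existing one).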

@JorTurFer (Member)

Any update on this? Can we close the issue?

@sktemkar (Author) commented Feb 7, 2024

This issue is fixed after increasing the size of the system node pool, adding the critical addons taint, and redeploying the KEDA configuration.

@sktemkar sktemkar closed this as completed Feb 7, 2024