
KEDA not scaling the pods with error grpc: addrConn.createTransport failed to connect #5052

Closed
sktemkar opened this issue Oct 4, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@sktemkar commented Oct 4, 2023

Report

I have KEDA (v2.10.1) enabled in an AKS (v1.26.6) cluster, installed using the Helm chart. It created 2 metrics server pods.
But scaling is not working: only 1 worker pod is scaled for the jobs.

The logs of one of the metrics servers show the error "grpc: addrConn.createTransport failed to connect". The other metrics server reports the connection as established.

Err: connection error: desc = "transport: Error while dialing: dial tcp 10.105.162.91:9666: connect: connection timed out"
W1004 22:54:13.289988 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {
"Addr": "keda-operator.kube-system.svc.cluster.local:9666",
"ServerName": "keda-operator.kube-system.svc.cluster.local:9666",
"Attributes": null,
"BalancerAttributes": null,
"Type": 0,
"Metadata": null
}. Err: connection error: desc = "transport: Error while dialing: dial tcp XX.XX.XX.XX:9666: connect: connection timed out"
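A quick way to confirm whether this is a network-level problem (rather than a KEDA bug) is to test TCP reachability of the operator's gRPC endpoint from inside the cluster. This is a hedged sketch: the service name and port are taken from the error log above, and the `busybox` image and pod name are illustrative; it must run against a live cluster.

```shell
# Launch a throwaway pod and test whether the KEDA operator's gRPC
# port is reachable in-cluster. Service name and port 9666 come from
# the error log; the pod name "nettest" and busybox image are arbitrary.
kubectl run nettest --rm -it --restart=Never --image=busybox:1.36 -- \
  nc -zv -w 5 keda-operator.kube-system.svc.cluster.local 9666
```

If this times out from some nodes but succeeds from others, the problem is likely node- or network-level (e.g. a starved system node pool) rather than in KEDA itself.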

Expected Behavior

The worker pods should scale up to multiple pods as the job requests increase.

Actual Behavior

The worker pods do not scale up; only 1 worker pod runs even as the job requests increase.

Steps to Reproduce the Problem

  1. Installed KEDA (v2.10.1) in Azure AKS (v1.26.6) using the Helm chart, deployed via Bicep.
  2. Set up Airflow in the AKS cluster using the Helm chart.

Logs from KEDA operator


2023-10-04T23:30:31Z ERROR cert-rotation Webhook not found. Unable to update certificate. {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "error": "ValidatingWebhookConfiguration.admissionregistration.k8s.io "keda-admission" not found"}
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).ensureCerts
/workspace/vendor/github.com/open-policy-agent/cert-controller/pkg/rotator/rotator.go:731
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).Reconcile
/workspace/vendor/github.com/open-policy-agent/cert-controller/pkg/rotator/rotator.go:700
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235
2023-10-04T23:30:31Z INFO cert-rotation Ensuring CA cert {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}

KEDA Version

2.10.1

Kubernetes Version

1.26

Platform

Microsoft Azure

Scaler Details

No response

Anything else?

No response

@sktemkar sktemkar added the bug Something isn't working label Oct 4, 2023
@v-shenoy (Contributor) commented Oct 5, 2023

@tomkerkhove @JorTurFer This is actually regarding a KEDA installation using the AKS add-on.

You can ignore the missing webhook configuration error; that's an error in the add-on Helm chart that we're already planning to fix. But I need a bit of help here in understanding the constant timeouts the metrics server hits when trying to communicate with the operator.

Let me know what information is required to diagnose this further, and I can provide it and work alongside you.

@zroubalik (Member)

@v-shenoy thanks for the clarification.

Is the timeout message appearing constantly, or only during startup? The latter is okay; in that case the metrics server waits until the operator is up. You should see this message in the logs: https://github.com/kedacore/keda/blob/8adb70e97a08a4690613eef4c4f00391e44e1603/pkg/provider/provider.go#L84C38-L84C97
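To check which replicas actually established the connection, the metrics server logs can be filtered per pod. A hedged sketch: the label selector and namespace below match the defaults of the KEDA Helm chart, but may differ in the AKS add-on installation; the grep pattern is deliberately loose rather than an exact match of the log message.

```shell
# List each metrics-server replica and grep its logs for the
# connection-established message. Label selector and namespace are
# the Helm-chart defaults and may differ in the AKS add-on.
for pod in $(kubectl get pods -n kube-system \
    -l app=keda-operator-metrics-apiserver -o name); do
  echo "== $pod =="
  kubectl logs -n kube-system "$pod" | grep -i "establish"
done
```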

@v-shenoy (Contributor) commented Oct 5, 2023

There are two replicas of the metrics server. One of them is able to connect successfully; the other is continuously timing out. We had multiple clusters face this issue. In some of them, restarting the metrics server deployment was enough, but not in all.

@JorTurFer (Member)

Do you see errors in the KEDA operator pod? That message is printed by the metrics server because it tries to establish the gRPC connection with the operator to get metrics (since KEDA 2.9, the metrics server is just a proxy for the HPA controller; all the work is done by the operator).

@v-shenoy (Contributor) commented Oct 5, 2023

Besides the missing webhook configuration, I don't think we were seeing any other errors in the operator pod. Plus, one of the metrics servers did connect successfully. Correct me if I am missing something, @sktemkar.

@JorTurFer (Member)

Any update?

@v-shenoy (Contributor)

I think the AKS system pods were being throttled because the system pool node size was too small. @sktemkar increased the node size and added the CriticalAddonsOnly=true:NoSchedule taint to the system node pool (the add-on's KEDA pods have the corresponding toleration enabled). It seems to be working for now, but the plan is to monitor for a few more days and see if the issue recurs.
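For anyone hitting the same symptom, the remediation described above can be sketched with the Azure CLI. This is an illustrative sketch, not the exact commands used here: the resource group, cluster, and node pool names are placeholders, and restarting the deployment assumes the default `keda-operator` deployment name in `kube-system`.

```shell
# Add the CriticalAddonsOnly taint to the system node pool so only
# system-critical workloads (such as the KEDA add-on pods, which carry
# the matching toleration) are scheduled there. Names are placeholders.
az aks nodepool update \
  --resource-group my-rg \
  --cluster-name my-aks \
  --name systempool \
  --node-taints CriticalAddonsOnly=true:NoSchedule

# Restart KEDA so its pods get rescheduled onto the system pool.
# Deployment name/namespace assume the default installation layout.
kubectl rollout restart deployment/keda-operator -n kube-system
```

Note that the VM size of an existing node pool cannot be changed in place; increasing the node size, as done in this issue, means creating a larger system node pool (or recreating the existing one).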

@JorTurFer (Member)

Any update on this? Can we close the issue?

@sktemkar (Author) commented Feb 7, 2024

This issue is fixed after increasing the size of the system node pool, adding the critical addons taint, and redeploying the KEDA configuration.

@sktemkar sktemkar closed this as completed Feb 7, 2024