KEDA not scaling the pods with error grpc: addrConn.createTransport failed to connect #5052
Comments
@tomkerkhove @JorTurFer This is actually regarding a KEDA installation using the AKS add-on. You can ignore the missing webhook configuration error; that's an error in the add-on Helm chart that we're already planning to fix. But I need a bit of help understanding the constant timeout the metrics server hits when trying to communicate with the operator. Let me know what information you need to diagnose this further, and I can provide it and work alongside you.
@v-shenoy thanks for the clarification. Does the timeout message appear constantly, or only during startup? During startup it is expected, because the Metrics Server waits until the operator is up. You should see this message in the logs: https://github.com/kedacore/keda/blob/8adb70e97a08a4690613eef4c4f00391e44e1603/pkg/provider/provider.go#L84C38-L84C97
There are two replicas of the metrics server. One of them connects successfully; the other times out continuously. We have had multiple clusters hit this issue. In some of them, restarting the metrics server deployment was enough, but not in all.
Do you see errors on the KEDA operator pod? That message is printed by the Metrics Server because it tries to establish a gRPC connection with the operator to get metrics (since KEDA 2.9, the Metrics Server is just a proxy for the HPA controller, and all the work is done by the operator).
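A quick way to test the path described above is a plain TCP probe from the same network as the failing Metrics Server replica. This is a minimal sketch, not KEDA tooling: `can_connect` is a hypothetical helper, and it only checks that the port accepts connections, not that gRPC or TLS works.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False
```

For example, `can_connect("keda-operator.kube-system.svc.cluster.local", 9666)` uses the address and port shown in the log from this issue; the namespace may differ in other installations.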
Besides the missing webhook configuration, I don't think we were seeing any other errors in the operator pod. Plus, one of the metrics servers did connect successfully. Correct me if I am missing something, @sktemkar.
Any update?
I think the AKS system pods were being throttled because the system pool nodes were too small. @sktemkar increased the node size as well as added the
Any update on this? Can we close the issue?
This issue was fixed after increasing the size of the system node pool, adding the critical app taint, and redeploying the KEDA configuration.
Report
I have KEDA (v2.10.1) enabled in an AKS (v1.26.6) cluster using the Helm chart. It created 2 metrics server pods.
But scaling is not working, and only 1 worker pod is running for the jobs.
The logs of one metrics server pod show the error "grpc: addrConn.createTransport failed to connect". The other metrics server pod reports that the connection was established.
Err: connection error: desc = "transport: Error while dialing: dial tcp 10.105.162.91:9666: connect: connection timed out"
W1004 22:54:13.289988 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {
"Addr": "keda-operator.kube-system.svc.cluster.local:9666",
"ServerName": "keda-operator.kube-system.svc.cluster.local:9666",
"Attributes": null,
"BalancerAttributes": null,
"Type": 0,
"Metadata": null
}. Err: connection error: desc = "transport: Error while dialing: dial tcp XX.XX.XX.XX:9666: connect: connection timed out"
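To compare the two replicas, it can help to pull the dial target out of warnings like the one above. This is a rough sketch: `dial_timeout_target` and the regular expression are illustrative assumptions based only on the message format shown here, not part of KEDA or grpc-go, and the pattern does not handle bracketed IPv6 addresses.

```python
import re
from typing import Optional, Tuple

# Matches the tail of the warning, e.g.
# 'dial tcp 10.105.162.91:9666: connect: connection timed out'
DIAL_TIMEOUT = re.compile(
    r"dial tcp (?P<addr>[^\s:]+):(?P<port>\d+): connect: connection timed out"
)


def dial_timeout_target(line: str) -> Optional[Tuple[str, int]]:
    """Return (address, port) if the line reports a TCP dial timeout, else None."""
    match = DIAL_TIMEOUT.search(line)
    if match is None:
        return None
    return match.group("addr"), int(match.group("port"))
```

Running it over each replica's log lines shows whether only one pod reports timeouts and which address it was dialing.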
Expected Behavior
The worker pods should scale up to multiple pods as job requests increase.
Actual Behavior
The worker pods do not scale up as job requests increase; only 1 worker pod runs.
Steps to Reproduce the Problem
Logs from KEDA operator
2023-10-04T23:30:31Z ERROR cert-rotation Webhook not found. Unable to update certificate. {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "error": "ValidatingWebhookConfiguration.admissionregistration.k8s.io "keda-admission" not found"}
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).ensureCerts
/workspace/vendor/github.com/open-policy-agent/cert-controller/pkg/rotator/rotator.go:731
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).Reconcile
/workspace/vendor/github.com/open-policy-agent/cert-controller/pkg/rotator/rotator.go:700
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235
2023-10-04T23:30:31Z INFO cert-rotation Ensuring CA cert {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
KEDA Version
2.10.1
Kubernetes Version
1.26
Platform
Microsoft Azure
Scaler Details
No response
Anything else?
No response