[Kafka] keda-operator-metrics-apiserver begins failing SSL handshake #2490

Closed
iAlex97 opened this issue Jan 17, 2022 · 5 comments
Labels
bug Something isn't working

Comments


iAlex97 commented Jan 17, 2022

Report

I've set up a few KEDA scalers for deployments based on some Kafka topics, and they began scaling the deployments up. After some time, I observed the Kafka logs saying:

Jan 17 14:10:22 kafka4 kafka-server-start.sh[127212]: [2022-01-17 14:10:22,653] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1005] Failed authentication with /51.158.103.115 (SSL handshake failed) (org.apache.kafka.common.network.Selector)
Jan 17 14:10:23 kafka4 kafka-server-start.sh[127212]: [2022-01-17 14:10:23,071] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1005] Failed authentication with /51.158.103.115 (SSL handshake failed) (org.apache.kafka.common.network.Selector)
Jan 17 14:10:23 kafka4 kafka-server-start.sh[127212]: [2022-01-17 14:10:23,511] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1005] Failed authentication with /51.158.103.115 (SSL handshake failed) (org.apache.kafka.common.network.Selector)
Jan 17 14:10:23 kafka4 kafka-server-start.sh[127212]: [2022-01-17 14:10:23,944] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1005] Failed authentication with /51.158.103.115 (SSL handshake failed) (org.apache.kafka.common.network.Selector)

On all Kafka nodes I see the same warning about the same IP, so all requests are coming from a single node. I then looked up which node that was:

kubectl get nodes -o wide | grep 51.158.103.115
scw-web-crawler-flannel-pool-proxies-b6da4a90e   Ready      <none>   3d1h   v1.22.3   10.72.90.45     51.158.103.115    Ubuntu 20.04.1 LTS 4f7656de55   5.4.0-80-generic   containerd://1.5.5

Scheduled pods on this node:

Non-terminated Pods:          (7 in total)
  Namespace                   Name                                                CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                ------------  ----------  ---------------  -------------  ---
  keda                        keda-operator-metrics-apiserver-7549b7db99-cp5mr    100m (3%)     1 (35%)     100Mi (3%)       1000Mi (34%)   22h
  kube-system                 csi-node-9w4tz                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d1h
  kube-system                 flannel-shntz                                       100m (3%)     100m (3%)   50Mi (1%)        50Mi (1%)      3d1h
  kube-system                 konnectivity-agent-wkbpj                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d1h
  kube-system                 kube-proxy-wwbk6                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d1h
  kube-system                 metrics-server-74cf4b5cf9-24k7d                     100m (3%)     0 (0%)      300Mi (10%)      0 (0%)         37h
  kube-system                 node-problem-detector-2f59w                         10m (0%)      10m (0%)    80Mi (2%)        80Mi (2%)      3d1h

After seeing this, I concluded that the "culprit" is keda-operator-metrics-apiserver-7549b7db99-cp5mr, but it doesn't make sense to me why the metrics server would try to authenticate with Kafka.
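
For reference, the metrics apiserver's own logs on that pod can be pulled with something like this (pod name and namespace taken from the listing above):

kubectl logs -n keda keda-operator-metrics-apiserver-7549b7db99-cp5mr --tail=100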

Expected Behavior

KEDA should properly and consistently scale deployments based on Kafka topic lag.

Actual Behavior

After those warnings appear on a Kafka node, one of the scaled objects constantly fails, causing it to use the fallback replica count.
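
In KEDA versions that support fallback, this state is visible on the ScaledObject itself, e.g. (field names assumed from the fallback docs; .status.health tracks the failure count per trigger):

kubectl get scaledobject -n default
kubectl get scaledobject kafka-scaledobject-parallel-proxy -n default -o jsonpath='{.status.health}'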

Steps to Reproduce the Problem

  1. Deploy KEDA using the Helm chart and the values provided below
  2. Set up the TLS/auth secrets, the trigger authentication, and the scaled objects
  3. Observe the deployments scale up, then scale down as KEDA falls back to the fallback replica count

Logs from KEDA operator

2022-01-16T14:29:20.663Z	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": ":8080"}
2022-01-16T14:29:20.678Z	INFO	setup	Running on Kubernetes 1.22	{"version": "v1.22.3"}
2022-01-16T14:29:20.678Z	INFO	setup	Starting manager
2022-01-16T14:29:20.678Z	INFO	setup	KEDA Version: 2.5.0
2022-01-16T14:29:20.678Z	INFO	setup	Git Commit: 22d71631c7a6e0e0ea8aef434263321d65975c29
2022-01-16T14:29:20.678Z	INFO	setup	Go Version: go1.17.3
2022-01-16T14:29:20.678Z	INFO	setup	Go OS/Arch: linux/amd64
I0116 14:29:20.679368       1 leaderelection.go:248] attempting to acquire leader lease keda/operator.keda.sh...
2022-01-16T14:29:20.679Z	INFO	Starting metrics server	{"path": "/metrics"}
I0116 14:30:05.162680       1 leaderelection.go:258] successfully acquired lease keda/operator.keda.sh
2022-01-16T14:30:05.163Z	INFO	controller.clustertriggerauthentication	Starting EventSource	{"reconciler group": "keda.sh", "reconciler kind": "ClusterTriggerAuthentication", "source": "kind source: *v1alpha1.ClusterTriggerAuthentication"}
2022-01-16T14:30:05.163Z	INFO	controller.clustertriggerauthentication	Starting Controller	{"reconciler group": "keda.sh", "reconciler kind": "ClusterTriggerAuthentication"}
2022-01-16T14:30:05.164Z	INFO	controller.triggerauthentication	Starting EventSource	{"reconciler group": "keda.sh", "reconciler kind": "TriggerAuthentication", "source": "kind source: *v1alpha1.TriggerAuthentication"}
2022-01-16T14:30:05.164Z	INFO	controller.triggerauthentication	Starting Controller	{"reconciler group": "keda.sh", "reconciler kind": "TriggerAuthentication"}
2022-01-16T14:30:05.165Z	INFO	controller.scaledobject	Starting EventSource	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "source": "kind source: *v1alpha1.ScaledObject"}
2022-01-16T14:30:05.166Z	INFO	controller.scaledobject	Starting EventSource	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "source": "kind source: *v2beta2.HorizontalPodAutoscaler"}
2022-01-16T14:30:05.166Z	INFO	controller.scaledobject	Starting Controller	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject"}
2022-01-16T14:30:05.165Z	INFO	controller.scaledjob	Starting EventSource	{"reconciler group": "keda.sh", "reconciler kind": "ScaledJob", "source": "kind source: *v1alpha1.ScaledJob"}
2022-01-16T14:30:05.166Z	INFO	controller.scaledjob	Starting Controller	{"reconciler group": "keda.sh", "reconciler kind": "ScaledJob"}
2022-01-16T14:30:05.265Z	INFO	controller.clustertriggerauthentication	Starting workers	{"reconciler group": "keda.sh", "reconciler kind": "ClusterTriggerAuthentication", "worker count": 1}
2022-01-16T14:30:05.267Z	INFO	controller.triggerauthentication	Starting workers	{"reconciler group": "keda.sh", "reconciler kind": "TriggerAuthentication", "worker count": 1}
2022-01-16T14:30:05.267Z	INFO	controller.scaledjob	Starting workers	{"reconciler group": "keda.sh", "reconciler kind": "ScaledJob", "worker count": 1}
2022-01-16T14:30:05.268Z	INFO	controller.scaledobject	Starting workers	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "worker count": 5}
2022-01-16T14:30:05.268Z	INFO	controller.scaledobject	Reconciling ScaledObject	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "kafka-scaledobject-content-crawler", "namespace": "default"}
2022-01-16T14:30:05.268Z	INFO	controller.scaledobject	Reconciling ScaledObject	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "kafka-scaledobject-directory-crawler", "namespace": "default"}
2022-01-16T14:30:05.268Z	INFO	controller.scaledobject	Reconciling ScaledObject	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "kafka-scaledobject-head-crawler", "namespace": "default"}
2022-01-16T14:30:05.268Z	INFO	controller.scaledobject	Reconciling ScaledObject	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "kafka-scaledobject-parallel-proxy", "namespace": "default"}
2022-01-16T14:30:05.719Z	INFO	controller.scaledobject	Initializing Scaling logic according to ScaledObject Specification	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "kafka-scaledobject-content-crawler", "namespace": "default"}
2022-01-16T14:30:05.772Z	INFO	controller.scaledobject	Reconciling ScaledObject	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "kafka-scaledobject-content-crawler", "namespace": "default"}
2022-01-16T14:30:05.927Z	INFO	controller.scaledobject	Initializing Scaling logic according to ScaledObject Specification	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "kafka-scaledobject-directory-crawler", "namespace": "default"}
2022-01-16T14:30:06.233Z	INFO	controller.scaledobject	Initializing Scaling logic according to ScaledObject Specification	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "kafka-scaledobject-head-crawler", "namespace": "default"}
2022-01-16T14:30:06.486Z	INFO	kafka_scaler	invalid offset found for topic fsn1.crawler.cmd.process.high.0 in group crawler-group-directory-prod and partition 611, probably no offset is committed yet
2022-01-16T14:30:06.542Z	INFO	controller.scaledobject	Reconciling ScaledObject	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "kafka-scaledobject-directory-crawler", "namespace": "default"}
2022-01-16T14:30:07.146Z	INFO	controller.scaledobject	Initializing Scaling logic according to ScaledObject Specification	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "kafka-scaledobject-parallel-proxy", "namespace": "default"}
2022-01-16T14:30:07.192Z	INFO	controller.scaledobject	Reconciling ScaledObject	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "kafka-scaledobject-parallel-proxy", "namespace": "default"}
2022-01-16T14:30:07.564Z	INFO	kafka_scaler	invalid offset found for topic fsn1.crawler.cmd.process.high.0 in group crawler-group-directory-prod and partition 611, probably no offset is committed yet
2022-01-16T14:30:56.069Z	INFO	kafka_scaler	invalid offset found for topic fsn1.crawler.cmd.process.high.0 in group crawler-group-directory-prod and partition 611, probably no offset is committed yet
2022-01-16T14:31:12.288Z	INFO	kafka_scaler	invalid offset found for topic fsn1.crawler.cmd.process.high.0 in group crawler-group-directory-prod and partition 611, probably no offset is committed yet

There are some more repeating lines about the topic that did not have any offsets committed yet, but that topic is not the one controlling the scaled object that fails.
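
For completeness, the committed offsets for the affected consumer group can be double-checked with the stock Kafka CLI, along these lines (broker address and SSL client config are placeholders):

kafka-consumer-groups.sh --bootstrap-server <broker:9093> --describe --group crawler-group-directory-prod --command-config client-ssl.properties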

KEDA Version

2.5.0

Kubernetes Version

1.22

Platform

Other

Scaler Details

Kafka

Anything else?

I've deployed KEDA using the official Helm chart with the default values, except the HTTP timeout, which I've set to 30 seconds. The install itself was roughly the standard one from the KEDA docs (release name and namespace assumed):
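
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace -f values.yaml

The values.yaml: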

# Default values for keda.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

image:
  keda:
    repository: ghcr.io/kedacore/keda
    # Allows people to override tag if they don't want to use the app version
    tag:
  metricsApiServer:
    repository: ghcr.io/kedacore/keda-metrics-apiserver
    # Allows people to override tag if they don't want to use the app version
    tag:
  pullPolicy: Always

crds:
  install: true

watchNamespace: ""

imagePullSecrets: []
operator:
  name: keda-operator  
  replicaCount: 1

metricsServer:
  # use ClusterFirstWithHostNet if `useHostNetwork: true` https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy
  dnsPolicy: ClusterFirst
  useHostNetwork: false

# -- Custom labels to add into metadata
additionalLabels: {}
  # foo: bar

podAnnotations:
  keda: {}
  metricsAdapter: {}
podLabels:
  keda: {}
  metricsAdapter: {}

## See `kubectl explain poddisruptionbudget.spec` for more
## ref: https://kubernetes.io/docs/tasks/run-application/configure-pdb/
podDisruptionBudget: {}
#  minAvailable: 1
#  maxUnavailable: 1

rbac:
  create: true

serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: keda-operator
  # Annotations to add to the service account
  annotations: {}

# Set to the value of the Azure Active Directory Pod Identity
# This will be set as a label on the KEDA Pod(s)
podIdentity:
  activeDirectory:
    identity: ""

# Set this if you are using an external scaler and want to communicate
# over TLS (recommended). This variable holds the name of the secret that
# will be mounted to the /grpccerts path on the Pod
grpcTLSCertsSecret: ""

# Set this if you are using HashiCorp Vault and want to communicate
# over TLS (recommended). This variable holds the name of the secret that
# will be mounted to the /vault path on the Pod
hashiCorpVaultTLS: ""

logging:
  operator:
    ## Logging level for KEDA Operator
    # allowed values: 'debug', 'info', 'error', or an integer value greater than 0, specified as string
    # default value: info
    level: info
    # allowed values: 'json' or 'console'
    # default value: console
    format: console
  metricServer:
    ## Logging level for Metrics Server
    # allowed values: '0' for info, '4' for debug, or an integer value greater than 0, specified as string
    # default value: 0
    level: 0

podSecurityContext: {}
  # fsGroup: 2000

securityContext: {}
  # capabilities:
  #   drop:
  #   - ALL
  # readOnlyRootFilesystem: true
  # runAsNonRoot: true
  # runAsUser: 1000

service:
  type: ClusterIP
  portHttp: 80
  portHttpTarget: 8080
  portHttps: 443
  portHttpsTarget: 6443

  annotations: {}

# We provide the default values that we describe in our docs:
# https://keda.sh/docs/latest/operate/cluster/
# If you want to specify the resources (or totally remove the defaults), change or comment the following
# lines, adjust them as necessary, or simply add the curly braces after 'operator' and/or 'metricServer'
# and remove/comment the default values
resources: 
  limits:
    cpu: 1
    memory: 1000Mi
  requests:
    cpu: 100m
    memory: 100Mi

nodeSelector: {}

tolerations: []

affinity: {}
  # podAntiAffinity:
  #   requiredDuringSchedulingIgnoredDuringExecution:
  #   - labelSelector:
  #       matchExpressions:
  #       - key: app
  #         operator: In
  #         values:
  #         - keda-operator
  #         - keda-operator-metrics-apiserver
  #     topologyKey: "kubernetes.io/hostname"

## Optional priorityClassName for KEDA Operator and Metrics Adapter
priorityClassName: ""

## The default HTTP timeout in milliseconds that KEDA should use
## when making requests to external services. Removing this defaults to a
## reasonable default
http:
  timeout: 30000

## Extra environment variables that will be passed onto KEDA operator and metrics api service
env:
# - name: ENV_NAME
#   value: 'ENV-VALUE'

# Extra volumes and volume mounts for the deployment. Optional.
volumes:
  keda:
    extraVolumes: []
    extraVolumeMounts: []

  metricsApiServer:
    extraVolumes: []
    extraVolumeMounts: []

prometheus:
  metricServer:
    enabled: false
    port: 9022
    portName: metrics
    path: /metrics
    podMonitor:
      # Enables PodMonitor creation for the Prometheus Operator
      enabled: false
      interval:
      scrapeTimeout:
      namespace:
      additionalLabels: {}
  operator:
    enabled: false
    port: 8080
    path: /metrics
    podMonitor:
      # Enables PodMonitor creation for the Prometheus Operator
      enabled: false
      interval:
      scrapeTimeout:
      namespace:
      additionalLabels: {}
    prometheusRules:
      # Enables PrometheusRules creation for the Prometheus Operator
      enabled: false
      namespace:
      additionalLabels: {}
      alerts: []
        # - alert: KedaScalerErrors
        #   annotations:
        #     description: Keda scaledObject {{ $labels.scaledObject }} is experiencing errors with {{ $labels.scaler }} scaler
        #     summary: Keda Scaler {{ $labels.scaler }} Errors
        #   expr: sum by ( scaledObject , scaler) (rate(keda_metrics_adapter_scaler_errors[2m]))  > 0
        #   for: 2m
        #   labels:

Another strange thing I've just noticed is that operator metrics are not even enabled in this chart.
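
Enabling them looks like it would just be a matter of flipping the flag in the same values file, e.g.:

prometheus:
  operator:
    enabled: true
    port: 8080
    path: /metrics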

iAlex97 added the bug label on Jan 17, 2022

iAlex97 commented Jan 17, 2022

Custom resources used for scaling:

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-trigger-auth-kafka-credential
  namespace: default
spec:
  secretTargetRef:
    - parameter: sasl
      name: keda-kafka-secrets
      key: sasl
    - parameter: username
      name: keda-kafka-secrets
      key: username
    - parameter: password
      name: keda-kafka-secrets
      key: password
    - parameter: tls
      name: keda-kafka-secrets
      key: tls
    - parameter: ca
      name: keda-kafka-secrets
      key: ca
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-scaledobject-content-crawler
  namespace: default
spec:
  scaleTargetRef:
    name: avro-crawler
  pollingInterval: 55
  idleReplicaCount: 0
  minReplicaCount: 1 # Optional Default 0
  maxReplicaCount: 1800 # Optional Default 100
  fallback: # Optional. Section to specify fallback options
    failureThreshold: 10
    replicas: 50
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: broker-svc
        # Make sure that this consumer group name is the same one as the one that is consuming topics
        consumerGroup: crawler-group-normal-prod
        topic: fsn1.crawler.cmd.process.normal.0
        # Optional
        lagThreshold: "10"
        offsetResetPolicy: latest
        version: 3.0.0
      authenticationRef:
        name: keda-trigger-auth-kafka-credential
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-scaledobject-directory-crawler
  namespace: default
spec:
  scaleTargetRef:
    name: avro-high-crawler
  pollingInterval: 50
  idleReplicaCount: 0
  minReplicaCount: 1 # Optional Default 0
  maxReplicaCount: 300 # Optional Default 100
  fallback: # Optional. Section to specify fallback options
    failureThreshold: 10
    replicas: 50
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: broker-svc
        # Make sure that this consumer group name is the same one as the one that is consuming topics
        consumerGroup: crawler-group-directory-prod
        topic: fsn1.crawler.cmd.process.high.0
        # Optional
        lagThreshold: "10"
        offsetResetPolicy: latest
        version: 3.0.0
      authenticationRef:
        name: keda-trigger-auth-kafka-credential
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-scaledobject-parallel-proxy
  namespace: default
spec:
  scaleTargetRef:
    name: parallel-proxy
  pollingInterval: 65
  idleReplicaCount: 0
  minReplicaCount: 1 # Optional Default 0
  maxReplicaCount: 150 # Optional Default 100
  fallback: # Optional. Section to specify fallback options
    failureThreshold: 10
    replicas: 50
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: broker-svc
        # Make sure that this consumer group name is the same one as the one that is consuming topics
        consumerGroup: crawler-group-directory-prod
        topic: fsn1.crawler.cmd.process.high.0
        # Optional
        lagThreshold: "10"
        offsetResetPolicy: latest
        version: 3.0.0
      authenticationRef:
        name: keda-trigger-auth-kafka-credential
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-scaledobject-head-crawler
  namespace: default
spec:
  scaleTargetRef:
    name: avro-head-crawler
  pollingInterval: 60
  idleReplicaCount: 0
  minReplicaCount: 1 # Optional Default 0
  maxReplicaCount: 1200 # Optional Default 100
  fallback: # Optional. Section to specify fallback options
    failureThreshold: 10
    replicas: 50
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: broker-svc
        # Make sure that this consumer group name is the same one as the one that is consuming topics
        consumerGroup: crawler-group-head-prod
        topic: fsn1.crawler.cmd.process.head.0
        # Optional
        lagThreshold: "10"
        offsetResetPolicy: latest
        version: 3.0.0
      authenticationRef:
        name: keda-trigger-auth-kafka-credential
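
For reference, keda-kafka-secrets holds the keys referenced above; a secret with those keys can be created along these lines (all values below are placeholders, not the real ones):

kubectl create secret generic keda-kafka-secrets -n default \
  --from-literal=sasl=plaintext \
  --from-literal=username=<user> \
  --from-literal=password=<pass> \
  --from-literal=tls=enable \
  --from-file=ca=./ca.pem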

@zroubalik
Member

KEDA Metrics Server needs to access Kafka to scrape metrics used for 1<->N scaling.
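
You can see what it serves by listing the external metrics API directly, e.g.:

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"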

There is an open issue related to the uncommitted offset: #2033

Still, I am not sure why the Metrics Server is not able to authenticate 🤷‍♂️


iAlex97 commented Jan 23, 2022

I'll do some more investigating, maybe try different versions of KEDA, as I've been using it for months and this is the first time I've encountered such an error.

@zroubalik
Member

It might be related to the caching bug introduced in 2.5.0. This will be solved in 2.6.0, which should be released soon.


iAlex97 commented Feb 1, 2022

I upgraded KEDA to 2.6.0 last night and the issue hasn't come up since, cheers!
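
For anyone else hitting this, the upgrade is just the usual Helm one, something like the following (chart version assumed to match the app version):

helm repo update
helm upgrade keda kedacore/keda --namespace keda --version 2.6.0 -f values.yaml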

iAlex97 closed this as completed on Feb 1, 2022