First request with 0 replicas sometimes takes ~20s #586

abatilo · 2023-01-29T19:06:50Z

Report

Sometimes when I send a request to an application that I have configured with scale to 0, the first request will take around 20 seconds to respond, even though the pod has been scheduled and ready after only 2-3 seconds.

Expected Behavior

I would expect that if I send a request to an application with 0 replicas, that once the pod is up and healthy that the interceptor will return immediately

Actual Behavior

Instead, it might take between 20-25 seconds before a response is made

Steps to Reproduce the Problem

I have a website up that hosts just a stock boilerplate React application + Golang backed service. Feel free to hit it. It's a domain/codebase that I like to use for testing things.

curl https://friendlyfaces.app

Below is the entire YAML that I use for deploying the site:

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.kubernetes.io/topology-aware-hints: auto
  name: friendlyfaces
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: friendlyfaces
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: friendlyfaces
  name: friendlyfaces
spec:
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app: friendlyfaces
  template:
    metadata:
      labels:
        app: friendlyfaces
    spec:
      containers:
      - command:
        - chamber
        - exec
        - honeycomb
        - --
        - friendlyfaces
        - --static-assets-path
        - /static
        image: 911907402684.dkr.ecr.us-west-2.amazonaws.com/friendlyfaces:03db570bbd7677cc5a45f1939de03dc474ee03da
        lifecycle:
          preStop:
            exec:
              command:
              - sleep
        name: friendlyfaces
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /
            port: 8080
        resources:
          limits:
            memory: 128Mi
          requests:
            cpu: 0.1
            memory: 128Mi
      serviceAccountName: friendlyfaces
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: friendlyfaces
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
      - labelSelector:
          matchLabels:
            app: friendlyfaces
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: friendlyfaces
spec:
  dnsNames:
  - friendlyfaces.app
  - www.friendlyfaces.app
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt
  secretName: friendlyfaces-cert
---
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: friendlyfaces
spec:
  host: friendlyfaces.app
  replicas:
    max: 5
    min: 0
  scaleTargetRef:
    deployment: friendlyfaces
    port: 80
    service: friendlyfaces
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: friendlyfaces
spec:
  entryPoints:
  - websecure
  routes:
  - kind: Rule
    match: Host(`friendlyfaces.app`, `www.friendlyfaces.app`)
    middlewares:
    - name: friendlyfaces
    - name: security-headers
      namespace: traefik
    services:
    - name: keda-add-ons-http-interceptor-proxy
      namespace: keda
      port: 8080
  tls:
    secretName: friendlyfaces-cert
---
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: friendlyfaces
spec:
  headers:
    customRequestHeaders:
      host: friendlyfaces.app

Logs from KEDA HTTP operator

No logs about this as far as I can tell

What version of the KEDA HTTP Add-on are you running?

v0.4.0

Kubernetes Version

1.24

Platform

Amazon Web Services

Anything else?

I'd love to be a contributor and try to debug this on my own. I'm going to do my best to figure out what's happening but if anyone could give me ideas or point me in the right direction that'd be amazing too.

Thank you all for this project. It's really neat and I like it a lot.

The text was updated successfully, but these errors were encountered:

abatilo · 2023-01-29T20:49:35Z

Okay, something that I've noticed is that it appears that if I restart the interceptors, it's fast again:

k -n keda rollout restart deployment/keda-add-ons-http-interceptor

So maybe something with caching? Hmm

JorTurFer · 2023-01-31T18:41:28Z

Hi,
Thanks for reporting :)
Are you willing to contribute with the fix?

abatilo · 2023-01-31T18:55:22Z

Hi @JorTurFer!

Yes, sorry, I should have highlighted that more in my original issue submission:

I'd love to be a contributor and try to debug this on my own. I'm going to do my best to figure out what's happening but if anyone could give me ideas or point me in the right direction that'd be

I'd love to contribute the fix. I'm working through trying to find the problem. If you have any tips or ideas, I'd love to hear them though.

JorTurFer · 2023-01-31T19:01:32Z

oh, sorry, I read it fast and I didn't get that part :(
In this case, I'd check what component needs the restart (In this case, it seems that is the interceptor). Restarting the interceptor, you are removing caches but also enforcing new connections between interceptor and metrics server. I'd start with this communication, maybe adding some logs to the interceptor/metrics server to detect possible dropped connections.

TBH, I'm not sure about what kind of cache could be in the interceptor

The backoff is created in `interceptor/main.go` and `.Step()` is a pointer receiver, so `.Step()` will mutate the underling number of `.Steps` within the struct. When you have 0 replicas, the default 500ms timeout of the Dial is hit which basically guarantees that you will end up calling `.Step()`. The value inside the the struct will never be reset, and so every single time you're scaling from 0, you're guaranteed to reduce the number of `.Steps` by 1 and this will never reset until the application itself has been restarted. This change makes sure that every time we execute `DialContextWithRetry`, we start with a fresh `backoff` which will start the `.Steps` at 5 since it's a clone of the original `backoff`. This `backoff` is available in the context of the function returned by `DialContextWithRetry`, which will be the one that gets decremented, and then will be garbage collected and we get a brand new 5 steps the next time we execute `DialContextWithRetry`. Closes kedacore#586

abatilo · 2023-02-07T15:37:15Z

@JorTurFer We did it!

#598

I found the problem that was causing this and opened a PR to fix the problem. Please take a look when you can!

The backoff is created in `interceptor/main.go` and `.Step()` is a pointer receiver, so `.Step()` will mutate the underling number of `.Steps` within the struct. When you have 0 replicas, the default 500ms timeout of the Dial is hit which basically guarantees that you will end up calling `.Step()`. The value inside the the struct will never be reset, and so every single time you're scaling from 0, you're guaranteed to reduce the number of `.Steps` by 1 and this will never reset until the application itself has been restarted. This change makes sure that every time we execute `DialContextWithRetry`, we start with a fresh `backoff` which will start the `.Steps` at 5 since it's a clone of the original `backoff`. This `backoff` is available in the context of the function returned by `DialContextWithRetry`, which will be the one that gets decremented, and then will be garbage collected and we get a brand new 5 steps the next time we execute `DialContextWithRetry`. Closes kedacore#586 Signed-off-by: Aaron Batilo <AaronBatilo@gmail.com>

Closes #586

abatilo added the bug Something isn't working label Jan 29, 2023

JorTurFer assigned abatilo Jan 31, 2023

abatilo mentioned this issue Feb 7, 2023

fix: Create copy of backoff struct #598

Merged

JorTurFer closed this as completed in #598 Feb 7, 2023

JorTurFer pushed a commit that referenced this issue Feb 7, 2023

fix: Create copy of backoff struct (#598)

8f90980

Closes #586

trondhindenes mentioned this issue Oct 6, 2023

intereceptor service isn't created #755

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

First request with 0 replicas sometimes takes ~20s #586

First request with 0 replicas sometimes takes ~20s #586

abatilo commented Jan 29, 2023

abatilo commented Jan 29, 2023

JorTurFer commented Jan 31, 2023

abatilo commented Jan 31, 2023 •

edited

Loading

JorTurFer commented Jan 31, 2023 •

edited

Loading

abatilo commented Feb 7, 2023

First request with 0 replicas sometimes takes ~20s #586

First request with 0 replicas sometimes takes ~20s #586

Comments

abatilo commented Jan 29, 2023

Report

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

Logs from KEDA HTTP operator

What version of the KEDA HTTP Add-on are you running?

Kubernetes Version

Platform

Anything else?

abatilo commented Jan 29, 2023

JorTurFer commented Jan 31, 2023

abatilo commented Jan 31, 2023 • edited Loading

JorTurFer commented Jan 31, 2023 • edited Loading

abatilo commented Feb 7, 2023

abatilo commented Jan 31, 2023 •

edited

Loading

JorTurFer commented Jan 31, 2023 •

edited

Loading