STAN scaler does not scale from 0 #519

cwoolum · 2019-12-18T20:17:07Z

A clear and concise description of what the bug is.

Expected Behavior

When there messages in the queue and the HPA is scaled down to 0, the scaler should create at least one instance of a pod to create the subscription with STAN.

Actual Behavior

When the loop runs, The logic checks if the pending count of messages is greater than 0 and that the queue names match. It never takes into account if the subscription does not exist.

Steps to Reproduce the Problem

Create a new ScaledObject that can scale to 0

apiVersion: keda.k8s.io/v1alpha1
kind: ScaledObject
metadata:
  name: test-scaledobject
  labels:
    deploymentName: test
spec:
  pollingInterval: 10   # Optional. Default: 30 seconds
  cooldownPeriod: 30   # Optional. Default: 300 seconds
  minReplicaCount: 0   # Optional. Default: 0
  maxReplicaCount: 5  # Optional. Default: 100  
  scaleTargetRef:
    deploymentName: test
  triggers:
  - type: stan
    metadata:
      natsServerMonitoringEndpoint: "stan.default:8222"
      queueGroup: "place_indexer"
      durableName: "ImDurable"
      subject: myQueue
      lagThreshold: "100"

cURL the STAN channel endpoint. It will return a response like:

{
  "name": "myQueue",
  "msgs": 500,
  "bytes": 0,
  "first_seq": 0,
  "last_seq": 0,
  "subscriptions": []

Keda will ignore the msgs field because there are no active subscriptions.
If minReplicaCount is set to 1, scaling works normally.

Specifications

Version: 1.0.0
Platform: Linux
Scaler(s): STAN

The text was updated successfully, but these errors were encountered:

balchua · 2019-12-21T02:59:19Z

Thanks @cwoolum for spotting this one. Its strange that I didn't see this one, sure tested it will min replicas 0. Anyway thanks!

balchua · 2019-12-24T02:51:05Z

I think i know how i missed this one. My initial deployment has a replica of 1. So atleast there is a registered subscriber. Once you get past that it can scale from 0 afterwards.

cwoolum · 2019-12-24T07:18:04Z

If it scales to zero though, the subscription will go away if you are draining and closing the connection. Even if you don't drain and close, the subscription will drop off after a short period(5 minutes?) And that prevents it from scaling back up.

balchua · 2019-12-24T08:10:58Z

I was testing it with this program https://github.com/balchua/gonuts. Subscriber seems to stick around.

cwoolum · 2019-12-24T16:09:43Z

https://github.com/balchua/gonuts/blob/c0ce3d6448fc1dafeaf5695136d008a04a696873/sub/main.go#L150

You are calling sc.Close() but not sc.Unsubscribe() when your subscriber app ends.

If your service scales up to 20 instances, you'll have 20 subscribers in STAN. Now say you scale back down to 1, each of those instance need to unsubscribe so that STAN doesn't keep trying to deliver messages to them, right?

I could be thinking about this the wrong way though.

balchua · 2019-12-25T00:39:39Z

Thanks for the explanation.👍 I remember setting the maxinflight to 1 so it round robins it to different subscribers. If the original subscriber goes down, i think stan will try to deliver it to the same subscriber until certain period of time. But i maybe wrong here too.

cwoolum · 2019-12-25T01:28:19Z

I do use MaxInFlight but set it to 50 since each one of my pods can handle that number of concurrent requests.

I saw that one of the newer versions of STAN/NATS does have a keepalive that it sends periodically to make sure a subscriber is still listening but I'm not sure of the relationship between a connection being open and a subscription existing for it. I'm going to open up an issue in the STAN repo and see if the team can provide some best practices on it.

cwoolum · 2019-12-25T16:47:42Z

I was able to find the docs for this. It looks like I was doing it wrong. For durable subscriptions, you don't want to Unsubscribe but only Close the connection.

The main difference between a queue subscription and a durable one, is that when the last member leaves the group, the state of the group will be maintained by the server. Later, when a member rejoins the group, the delivery will resume.
Note: For a durable queue subscription, the last member to  unsubscribe  (not simply close) causes the group to be removed from the server.

I'll try removing the Unsubscribe call from my code and see how that affects it.

balchua · 2019-12-26T11:52:12Z

@cwoolum thanks for checking the docs. Did it work for your when you removed the Unsubscribe?

cwoolum · 2019-12-27T00:24:39Z

I see the subscriptions are still there but I'm not sure if the messages are getting delivered successfully now. I'm seeing a large number of pending_count on the pods that were taken down as part of the scale down.

{
  "name": "place.inserted",
  "msgs": 653382,
  "bytes": 358537710,
  "first_seq": 1,
  "last_seq": 653382,
  "subscriptions": [
    {
      "client_id": "xxxplaceinserter-5595885df6-v57vq",
      "inbox": "_INBOX.AQ77TV2TOF1DVJUUW109X8",
      "ack_inbox": "_INBOX.INczX1devhkKAX12cfD3HS",
      "queue_name": "UpdatedDurable:place_inserter",
      "is_durable": true,
      "is_offline": false,
      "max_inflight": 100,
      "ack_wait": 60,
      "last_sent": 653379,
      "pending_count": 0,
      "is_stalled": false
    },
    {
      "client_id": "xxxplaceinserter-7b79867b48-zf6js",
      "inbox": "_INBOX.ZQCLS4F06A4AMM1V78KYZ8",
      "ack_inbox": "_INBOX.INczX1devhkKAX12cfD3Qt",
      "queue_name": "UpdatedDurable:place_inserter",
      "is_durable": true,
      "is_offline": false,
      "max_inflight": 100,
      "ack_wait": 60,
      "last_sent": 653382,
      "pending_count": 110,
      "is_stalled": false
    },
    {
      "client_id": "xxxplaceinserter-7b79867b48-tjwzt",
      "inbox": "_INBOX.SD4GQBUAJY4L8SPRIUYNGH",
      "ack_inbox": "_INBOX.INczX1devhkKAX12cfD4Dx",
      "queue_name": "UpdatedDurable:place_inserter",
      "is_durable": true,
      "is_offline": false,
      "max_inflight": 100,
      "ack_wait": 60,
      "last_sent": 653382,
      "pending_count": 110,
      "is_stalled": false
    },
    {
      "client_id": "xxxplaceinserter-7b79867b48-2cd2v",
      "inbox": "_INBOX.XAERJTARSEMGR9N3ZUHZ1J",
      "ack_inbox": "_INBOX.INczX1devhkKAX12cfD44W",
      "queue_name": "UpdatedDurable:place_inserter",
      "is_durable": true,
      "is_offline": false,
      "max_inflight": 100,
      "ack_wait": 60,
      "last_sent": 653382,
      "pending_count": 110,
      "is_stalled": false
    },
    {
      "client_id": "xxxplaceinserter-7b79867b48-mv9fr",
      "inbox": "_INBOX.7PWDZ4ZLPWOA311VJKKC6V",
      "ack_inbox": "_INBOX.INczX1devhkKAX12cfD3yr",
      "queue_name": "UpdatedDurable:place_inserter",
      "is_durable": true,
      "is_offline": false,
      "max_inflight": 100,
      "ack_wait": 60,
      "last_sent": 653376,
      "pending_count": 110,
      "is_stalled": false
    },
    {
      "client_id": "xxxplaceinserter-7b79867b48-xzz2s",
      "inbox": "_INBOX.DSA21NRUGOLG1DBW8CIP52",
      "ack_inbox": "_INBOX.INczX1devhkKAX12cfD40k",
      "queue_name": "UpdatedDurable:place_inserter",
      "is_durable": true,
      "is_offline": false,
      "max_inflight": 100,
      "ack_wait": 60,
      "last_sent": 653378,
      "pending_count": 110,
      "is_stalled": false
    }
  ]
}

cwoolum · 2019-12-27T00:58:13Z

After the scale down occurred, the number of subscribers dropped to 1. I'm not sure the time it took for that to happen though.

I opened an issue to ask the best practices for this.

nats-io/nats-streaming-server#1001

balchua · 2019-12-27T03:14:06Z

As far as i know stan will direct the pending messages to available subscribers. But how long before it takes that decision.

cwoolum · 2019-12-31T18:15:55Z

I'm going to keep testing to make sure none of the messages are getting stuck. I keep running into the error Operation cannot be fulfilled on scaledobjects.keda.k8s.io \"mytest-scaledobject\": the object has been modified; please apply your changes to the latest version and try again. I've tried to delete both the HPA and ScaledObject multiple times but it keeps occurring. Any idea?

balchua · 2020-01-01T00:00:52Z

When this error shows up does it stop scaling? Care to show what kubectl get hpa looks like? And also your scaledobject yaml.
Thanks!

zroubalik · 2020-01-02T22:05:41Z

@cwoolum could you please decribe steps leading to this error + more details on ScaledObject and HPA. thanks

cwoolum · 2020-01-03T16:12:10Z

Okay, the modified issue was due to an old CPU based HPA conflicting with the Keda Scaler but I was able to recreate the issue with the pod not scaling from 0. Here are the requested items.

apiVersion: keda.k8s.io/v1alpha1
kind: ScaledObject
metadata:
  creationTimestamp: '2020-01-03T06:52:29Z'
  finalizers:
    - finalizer.keda.k8s.io
  generation: 3
  labels:
    deploymentName: turnouttplaceindexer
  name: turnouttplaceindexer-scaledobject
  namespace: api
spec:
  cooldownPeriod: 30
  maxReplicaCount: 5
  minReplicaCount: 0
  pollingInterval: 10
  scaleTargetRef:
    deploymentName: turnouttplaceindexer
  triggers:
    - metadata:
        durableName: UpdatedDurable
        lagThreshold: '100'
        natsServerMonitoringEndpoint: 'stan.default:8222'
        queueGroup: place_indexer
        subject: place.updated
      type: stan
status:
  externalMetricNames:
    - lagThreshold
  lastActiveTime: '2020-01-03T15:56:05Z'

kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: keda-hpa-turnouttplaceindexer
  namespace: api
  selfLink: >-
    /apis/autoscaling/v2beta1/namespaces/api/horizontalpodautoscalers/keda-hpa-turnouttplaceindexer
  uid: 8a0bd442-13df-43db-9897-0b96ffae7e4f
  resourceVersion: '25270090'
  creationTimestamp: '2020-01-03T15:52:05Z'
  ownerReferences:
    - apiVersion: keda.k8s.io/v1alpha1
      kind: ScaledObject
      name: turnouttplaceindexer-scaledobject
      uid: 8f9f9d7f-022a-4f95-ba70-c30cb155337c
      controller: true
      blockOwnerDeletion: true
spec:
  scaleTargetRef:
    kind: Deployment
    name: turnouttplaceindexer
    apiVersion: apps/v1
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: External
      external:
        metricName: lagThreshold
        metricSelector:
          matchLabels:
            deploymentName: turnouttplaceindexer
        targetAverageValue: '100'
status:
  currentReplicas: 0
  desiredReplicas: 0
  currentMetrics: null
  conditions:
    - type: AbleToScale
      status: 'True'
      lastTransitionTime: '2020-01-03T15:52:20Z'
      reason: SucceededGetScale
      message: the HPA controller was able to get the target's current scale
    - type: ScalingActive
      status: 'False'
      lastTransitionTime: '2020-01-03T15:57:12Z'
      reason: ScalingDisabled
      message: scaling is disabled since the replica count of the target is zero
    - type: ScalingLimited
      status: 'True'
      lastTransitionTime: '2020-01-03T15:55:40Z'
      reason: TooFewReplicas
      message: >-
        the desired replica count is increasing faster than the maximum scale
        rate

Here is the response from STAN:

{
  "name": "place.updated",
  "msgs": 1000000,
  "bytes": 131655679,
  "first_seq": 1752831,
  "last_seq": 2752830,
  "subscriptions": [
    {
      "client_id": "turnouttplaceindexer-68c66f7545-dhwqc",
      "inbox": "_INBOX.6Z7TV1D0B84HNLMDPIM8IW",
      "ack_inbox": "_INBOX.INczX1devhkKAX12cfuiaw",
      "queue_name": "UpdatedDurable:place_indexer",
      "is_durable": true,
      "is_offline": true,
      "max_inflight": 100,
      "ack_wait": 60,
      "last_sent": 2751730,
      "pending_count": 0,
      "is_stalled": false
    }
  ]
}

And here are the logs from KEDA

2020-01-03T15:54:24.630417316Z {"level":"info","ts":1578066864.6303,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
2020-01-03T15:54:24.730667858Z {"level":"info","ts":1578066864.7304275,"logger":"controller-runtime.controller","msg":"Starting Controller","controller":"scaledobject-controller"}
2020-01-03T15:54:24.830825399Z {"level":"info","ts":1578066864.8306682,"logger":"controller-runtime.controller","msg":"Starting workers","controller":"scaledobject-controller","worker count":1}
2020-01-03T15:54:25.156227482Z {"level":"info","ts":1578066865.1561325,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"api","Request.Name":"turnouttplaceindexer-scaledobject"}
2020-01-03T15:54:25.156241982Z {"level":"info","ts":1578066865.1561651,"logger":"controller_scaledobject","msg":"Detecting ScaleType from ScaledObject","Request.Namespace":"api","Request.Name":"turnouttplaceindexer-scaledobject"}
2020-01-03T15:54:25.156247382Z {"level":"info","ts":1578066865.1561697,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"api","Request.Name":"turnouttplaceindexer-scaledobject"}
2020-01-03T15:54:25.181735746Z {"level":"info","ts":1578066865.1815946,"logger":"controller_scaledobject","msg":"Updated HPA according to ScaledObject","Request.Namespace":"api","Request.Name":"turnouttplaceindexer-scaledobject","HPA.Namespace":"api","HPA.Name":"keda-hpa-turnouttplaceindexer"}

2020-01-03T15:57:10.682897812Z {"level":"info","ts":1578067030.6824272,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"api","Request.Name":"turnouttplaceindexer-scaledobject"}
2020-01-03T15:57:10.684467522Z {"level":"info","ts":1578067030.682496,"logger":"controller_scaledobject","msg":"Detecting ScaleType from ScaledObject","Request.Namespace":"api","Request.Name":"turnouttplaceindexer-scaledobject"}
2020-01-03T15:57:10.684494823Z {"level":"info","ts":1578067030.6825025,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"api","Request.Name":"turnouttplaceindexer-scaledobject"}
2020-01-03T15:57:10.695899496Z {"level":"info","ts":1578067030.6957793,"logger":"controller_scaledobject","msg":"Updated HPA according to ScaledObject","Request.Namespace":"api","Request.Name":"turnouttplaceindexer-scaledobject","HPA.Namespace":"api","HPA.Name":"keda-hpa-turnouttplaceindexer"}
2020-01-03T15:57:10.718530941Z {"level":"info","ts":1578067030.7182388,"logger":"scalehandler","msg":"Successfully scaled deployment to 0 replicas","ScaledObject.Namespace":"api","ScaledObject.Name":"turnouttplaceindexer-scaledobject","ScaledObject.ScaleType":"deployment","Deployment.Namespace":"api","Deployment.Name":"turnouttplaceindexer"}

There are messages pending for this subscription but Keda doesn't seem to be picking them up. The pods do call close on the NATS and STAN connections but do not Unsubscribe.

balchua · 2020-01-03T22:33:13Z

@cwoolum i think i know what the issue is, the stan scaler actually uses the channels' msg attribute and subtract that using the subscriber's last_sent. Unfortunately last_sent is a sequence number.
Looking at the STAN response you provided, msg=1000000 and last_sent=2751730 the resulting pending lag will be negative.
I think the better approach is to use the channel information's last_seq instead of msg.
Wdyt?

cwoolum · 2020-01-06T19:08:31Z

I think that makes sense. I created PR #533 to make the change. It also adds additional logging to help diagnose issues.

balchua · 2020-01-07T23:22:04Z

@cwoolum does PR #519 addresses this issue?

cwoolum · 2020-01-10T00:18:52Z

Yes, I just did a full deploy using the master image tag and scale up/down is working for me.

balchua · 2020-01-10T00:23:12Z

Thanks @cwoolum

cwoolum added the bug Something isn't working label Dec 18, 2019

cwoolum mentioned this issue Dec 18, 2019

Fix STAN scale from 0 [WIP] #520

Closed

cwoolum closed this as completed Jan 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

STAN scaler does not scale from 0 #519

STAN scaler does not scale from 0 #519

cwoolum commented Dec 18, 2019

balchua commented Dec 21, 2019

balchua commented Dec 24, 2019

cwoolum commented Dec 24, 2019 •

edited

Loading

balchua commented Dec 24, 2019

cwoolum commented Dec 24, 2019 •

edited

Loading

balchua commented Dec 25, 2019

cwoolum commented Dec 25, 2019

cwoolum commented Dec 25, 2019

balchua commented Dec 26, 2019

cwoolum commented Dec 27, 2019

cwoolum commented Dec 27, 2019 •

edited

Loading

balchua commented Dec 27, 2019

cwoolum commented Dec 31, 2019

balchua commented Jan 1, 2020 •

edited

Loading

zroubalik commented Jan 2, 2020

cwoolum commented Jan 3, 2020 •

edited

Loading

balchua commented Jan 3, 2020

cwoolum commented Jan 6, 2020

balchua commented Jan 7, 2020

cwoolum commented Jan 10, 2020

balchua commented Jan 10, 2020

STAN scaler does not scale from 0 #519

STAN scaler does not scale from 0 #519

Comments

cwoolum commented Dec 18, 2019

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

Specifications

balchua commented Dec 21, 2019

balchua commented Dec 24, 2019

cwoolum commented Dec 24, 2019 • edited Loading

balchua commented Dec 24, 2019

cwoolum commented Dec 24, 2019 • edited Loading

balchua commented Dec 25, 2019

cwoolum commented Dec 25, 2019

cwoolum commented Dec 25, 2019

balchua commented Dec 26, 2019

cwoolum commented Dec 27, 2019

cwoolum commented Dec 27, 2019 • edited Loading

balchua commented Dec 27, 2019

cwoolum commented Dec 31, 2019

balchua commented Jan 1, 2020 • edited Loading

zroubalik commented Jan 2, 2020

cwoolum commented Jan 3, 2020 • edited Loading

balchua commented Jan 3, 2020

cwoolum commented Jan 6, 2020

balchua commented Jan 7, 2020

cwoolum commented Jan 10, 2020

balchua commented Jan 10, 2020

cwoolum commented Dec 24, 2019 •

edited

Loading

cwoolum commented Dec 24, 2019 •

edited

Loading

cwoolum commented Dec 27, 2019 •

edited

Loading

balchua commented Jan 1, 2020 •

edited

Loading

cwoolum commented Jan 3, 2020 •

edited

Loading