
Reconciliation bug when updating from deployment to statefulset #1127

Closed

leandregagnonlewis opened this issue Oct 8, 2024 · 1 comment

Labels: bug (Something isn't working)

@leandregagnonlewis (Contributor) commented Oct 8, 2024

I just updated vmagent in multiple clusters. The main point of the update was to move from a Deployment with ephemeral storage to a StatefulSet with a PVC.

We went from this:

    resources:
      limits:
        cpu: "500m"
        memory: 2Gi
        ephemeral-storage: 2Gi
      requests:
        cpu: "250m"
        memory: 1Gi
        ephemeral-storage: 1Gi

to this:

    resources:
      limits:
        cpu: "500m"
        memory: 2Gi
      requests:
        cpu: "250m"
        memory: 1Gi

    statefulMode: true
    statefulStorage:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 12Gi

In some clusters, the update went well, but in others, I got the following error:

ReconcilationError
cannot handle rolling-update on sts: vmagent-vmagent-opentsdb, err: cannot sort statefulset pods: cannot parse pod id number: jpxft from name: vmagent-vmagent-opentsdb-7d45f9f8d-jpxft

When I got this error, the StatefulSet was correctly deployed, but the Deployment was not deleted. To resolve the error, I had to delete the Deployment manually.
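This matches pod naming: StatefulSet pods end in an integer ordinal (e.g. `vmagent-vmagent-opentsdb-0`), while Deployment pods end in a ReplicaSet hash plus a random suffix (`-7d45f9f8d-jpxft`), so any sort that parses the trailing token as an ordinal fails on a leftover Deployment pod. A minimal Go sketch of that failure mode (the `podOrdinal` helper is hypothetical, not the operator's actual code):

    package main

    import (
        "fmt"
        "strconv"
        "strings"
    )

    // podOrdinal mimics the failing parse: take the token after the last '-'
    // and convert it to an integer ordinal.
    func podOrdinal(podName string) (int, error) {
        suffix := podName[strings.LastIndex(podName, "-")+1:]
        id, err := strconv.Atoi(suffix)
        if err != nil {
            return 0, fmt.Errorf("cannot parse pod id number: %s from name: %s", suffix, podName)
        }
        return id, nil
    }

    func main() {
        fmt.Println(podOrdinal("vmagent-vmagent-opentsdb-0"))               // 0, <nil>
        fmt.Println(podOrdinal("vmagent-vmagent-opentsdb-7d45f9f8d-jpxft")) // error, as in the issue
    }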

I think I can pinpoint the reason why only some clusters got this error. It only happened in clusters where some of the StatefulSet pods hit a failed-scheduling error. At that point, the autoscaler adds a node to the cluster, and after a few minutes the StatefulSet is deployed correctly, but the reconciliation error in vmagent never gets resolved.

It also seems that a timeout happens in the vmagent reconciliation process in the clusters where a scale-up is needed. This is the first error event logged on the vmagent:

ReconcilationError
origin_Err=cannot wait for statefulSet to become ready: context deadline exceeded,podPhase="Pending",conditions=name="PodScheduled",status="False",message="0/82 nodes are available: 10 Insufficient cpu, 10 node(s) had untolerated taint {dedicated: ingress}, 10 node(s) had untolerated taint {tug.jive.com/reserved: postgres}, 15 node(s) had untolerated taint {critical-addons-only: true}, 25 Insufficient memory, 8 Too many pods, 9 node(s) had untolerated taint {subnet: public}, 9 node(s) were unschedulable. preemption: 0/82 nodes are available: 29 No preemption victims found for incoming pod, 53 Preemption is not helpful for scheduling."

My hypothesis is the following (based only on the logs; I have not dug into the code yet): after the first timeout error, the reconciliation algorithm switches from the Deployment -> STS path to the STS -> STS path, which produces the error.

f41gh7 added the bug label Oct 9, 2024
f41gh7 added a commit that referenced this issue Oct 14, 2024
Previously, during the Deployment -> StatefulSet transition, it was possible to end up in a state where the Deployment was not scheduled successfully. If the `VMAgent` spec was changed to `statefulMode`, the operator incorrectly listed pods that belong to the `Deployment` for the sts rolling update.

This produced an error that could only be fixed by manually deleting the `Deployment`.

This commit filters out all pods that don't have a `StatefulSet` in their `OwnerReferences`. It fixes the incorrect behavior of this kind of transition.

Related issue: #1127

Signed-off-by: f41gh7 <nik@victoriametrics.com>
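In sketch form (a hypothetical `filterStatefulSetPods` helper built on client-go types; the operator's actual function may differ), the fix looks like:

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // filterStatefulSetPods keeps only pods owned by a StatefulSet, so pods
    // still owned by a Deployment's ReplicaSet are ignored during the sts
    // rolling update.
    func filterStatefulSetPods(pods []corev1.Pod) []corev1.Pod {
        var out []corev1.Pod
        for _, p := range pods {
            for _, ref := range p.OwnerReferences {
                if ref.Kind == "StatefulSet" {
                    out = append(out, p)
                    break
                }
            }
        }
        return out
    }

    func main() {
        pods := []corev1.Pod{
            {ObjectMeta: metav1.ObjectMeta{
                Name:            "vmagent-vmagent-opentsdb-0",
                OwnerReferences: []metav1.OwnerReference{{Kind: "StatefulSet"}},
            }},
            {ObjectMeta: metav1.ObjectMeta{
                Name:            "vmagent-vmagent-opentsdb-7d45f9f8d-jpxft",
                OwnerReferences: []metav1.OwnerReference{{Kind: "ReplicaSet"}},
            }},
        }
        for _, p := range filterStatefulSetPods(pods) {
            fmt.Println(p.Name) // prints only the StatefulSet-owned pod
        }
    }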
f41gh7 added a commit that referenced this issue Oct 15, 2024
f41gh7 added the waiting for release label Oct 15, 2024
@f41gh7 (Collaborator) commented Oct 15, 2024

The issue should be fixed in the v0.48.4 release.

f41gh7 closed this as completed Oct 15, 2024
f41gh7 removed the waiting for release label Dec 19, 2024