
Reconciliation bug when updating from deployment to statefulset #1127

Closed

leandregagnonlewis opened this issue Oct 8, 2024 · 1 comment

Labels: bug (Something isn't working)

@leandregagnonlewis (Contributor) commented Oct 8, 2024

I just updated vmagent in multiple clusters. The main point of the update was to move from a Deployment with ephemeral storage to a StatefulSet with a PVC.

We went from this:

    resources:
      limits:
        cpu: "500m"
        memory: 2Gi
        ephemeral-storage: 2Gi
      requests:
        cpu: "250m"
        memory: 1Gi
        ephemeral-storage: 1Gi

to this:

    resources:
      limits:
        cpu: "500m"
        memory: 2Gi
      requests:
        cpu: "250m"
        memory: 1Gi

    statefulMode: true
    statefulStorage:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 12Gi

In some clusters, the update went well, but in others, I got the following error:

ReconcilationError
cannot handle rolling-update on sts: vmagent-vmagent-opentsdb, err: cannot sort statefulset pods: cannot parse pod id number: jpxft from name: vmagent-vmagent-opentsdb-7d45f9f8d-jpxft

When I got this error, the StatefulSet was correctly deployed, but the Deployment was not deleted. To resolve the error, I had to delete the Deployment manually.
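This matches pod naming: StatefulSet pods end in an integer ordinal (e.g. `vmagent-vmagent-opentsdb-0`), while Deployment pods end in a ReplicaSet hash plus a random suffix (`-7d45f9f8d-jpxft`), so any sort that parses the trailing token as an ordinal fails on a leftover Deployment pod. A minimal Go sketch of that failure mode (the `podOrdinal` helper is hypothetical, not the operator's actual code):

    package main

    import (
        "fmt"
        "strconv"
        "strings"
    )

    // podOrdinal mimics the failing parse: take the token after the last '-'
    // and convert it to an integer ordinal.
    func podOrdinal(podName string) (int, error) {
        suffix := podName[strings.LastIndex(podName, "-")+1:]
        id, err := strconv.Atoi(suffix)
        if err != nil {
            return 0, fmt.Errorf("cannot parse pod id number: %s from name: %s", suffix, podName)
        }
        return id, nil
    }

    func main() {
        fmt.Println(podOrdinal("vmagent-vmagent-opentsdb-0"))               // 0, <nil>
        fmt.Println(podOrdinal("vmagent-vmagent-opentsdb-7d45f9f8d-jpxft")) // error, as in the issue
    }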

I think I can pinpoint the reason why only some clusters got this error. It only happened in clusters where some of the StatefulSet pods hit a failed-scheduling error. At that point, the autoscaler adds a node to the cluster, and after a few minutes the StatefulSet is deployed correctly, but the reconciliation error in vmagent never gets resolved.

It also seems that a timeout happens in the vmagent reconciliation process in the clusters where a scale-up is needed. This is the first error event logged on the vmagent:

ReconcilationError
origin_Err=cannot wait for statefulSet to become ready: context deadline exceeded,podPhase="Pending",conditions=name="PodScheduled",status="False",message="0/82 nodes are available: 10 Insufficient cpu, 10 node(s) had untolerated taint {dedicated: ingress}, 10 node(s) had untolerated taint {tug.jive.com/reserved: postgres}, 15 node(s) had untolerated taint {critical-addons-only: true}, 25 Insufficient memory, 8 Too many pods, 9 node(s) had untolerated taint {subnet: public}, 9 node(s) were unschedulable. preemption: 0/82 nodes are available: 29 No preemption victims found for incoming pod, 53 Preemption is not helpful for scheduling."

My hypothesis is the following (based only on the logs; I have not dug into the code yet): after the first timeout error, the reconciliation algorithm switches from the Deployment -> STS path to the STS -> STS path, which produces the error.

f41gh7 added the bug label Oct 9, 2024
f41gh7 added a commit that referenced this issue Oct 14, 2024
Previously, during the Deployment -> StatefulSet transition, it was possible to end up in a state where the Deployment was not scheduled successfully. If the `VMAgent` spec was changed to `statefulMode`, the operator incorrectly listed pods that belong to the `Deployment` for the sts rolling update.

This produced an error that could only be fixed by manually deleting the `Deployment`.

This commit filters out all pods that don't have a `StatefulSet` in their `OwnerReferences`. It fixes the incorrect behavior of this kind of transition.

Related issue: #1127

Signed-off-by: f41gh7 <nik@victoriametrics.com>
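In sketch form (a hypothetical `filterStatefulSetPods` helper built on client-go types; the operator's actual function may differ), the fix looks like:

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // filterStatefulSetPods keeps only pods owned by a StatefulSet, so pods
    // still owned by a Deployment's ReplicaSet are ignored during the sts
    // rolling update.
    func filterStatefulSetPods(pods []corev1.Pod) []corev1.Pod {
        var out []corev1.Pod
        for _, p := range pods {
            for _, ref := range p.OwnerReferences {
                if ref.Kind == "StatefulSet" {
                    out = append(out, p)
                    break
                }
            }
        }
        return out
    }

    func main() {
        pods := []corev1.Pod{
            {ObjectMeta: metav1.ObjectMeta{
                Name:            "vmagent-vmagent-opentsdb-0",
                OwnerReferences: []metav1.OwnerReference{{Kind: "StatefulSet"}},
            }},
            {ObjectMeta: metav1.ObjectMeta{
                Name:            "vmagent-vmagent-opentsdb-7d45f9f8d-jpxft",
                OwnerReferences: []metav1.OwnerReference{{Kind: "ReplicaSet"}},
            }},
        }
        for _, p := range filterStatefulSetPods(pods) {
            fmt.Println(p.Name) // prints only the StatefulSet-owned pod
        }
    }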
f41gh7 added a commit that referenced this issue Oct 15, 2024
f41gh7 added the waiting for release label Oct 15, 2024
@f41gh7 (Collaborator) commented Oct 15, 2024

The issue should be fixed in the v0.48.4 release.

f41gh7 closed this as completed Oct 15, 2024
f41gh7 removed the waiting for release label Dec 19, 2024