Reconciliation bug when updating from deployment to statefulset #1127
Labels: bug
f41gh7 added a commit that referenced this issue on Oct 14, 2024:
Previously, during the Deployment -> StatefulSet transition, it was possible to end up in a state where the deployment was not scheduled successfully. If the `VMAgent` spec was changed into `statefulMode`, the operator incorrectly listed pods that belong to the `Deployment` when performing the sts rolling update. This produced an error and could only be fixed by manual `Deployment` deletion. This commit filters out all pods that don't have a `StatefulSet` in their `OwnerReferences`. It fixes the incorrect behavior of this kind of transition.

Related issue: #1127

Signed-off-by: f41gh7 <nik@victoriametrics.com>
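For context, the distinction the fix relies on is visible in pod metadata: a pod created by a StatefulSet carries a `StatefulSet` owner reference, while a pod created by a Deployment is owned by a ReplicaSet. The fragments below are only an illustration of that difference (names are hypothetical, not taken from this issue):

```yaml
# Pod created by a StatefulSet: kept by the filter described above.
metadata:
  name: vmagent-example-0                  # hypothetical pod name
  ownerReferences:
    - apiVersion: apps/v1
      kind: StatefulSet
      name: vmagent-example
      controller: true
---
# Pod created by a Deployment (via its ReplicaSet): skipped by the filter.
metadata:
  name: vmagent-example-7d9c4b6f5d-abcde   # hypothetical pod name
  ownerReferences:
    - apiVersion: apps/v1
      kind: ReplicaSet
      name: vmagent-example-7d9c4b6f5d
      controller: true
```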
f41gh7 added a commit that referenced this issue on Oct 15, 2024, with the same commit message.
Issue must be fixed in the v0.48.4 release.
I just did an update to vmagent in multiple clusters. The main point of the update was to move from a deployment with ephemeral storage to a stateful set with a PVC.
We went from this:
to this:
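The original before/after manifests are not reproduced here; as a rough sketch, this kind of transition usually amounts to enabling stateful mode and adding a volume claim template in the `VMAgent` spec. The `statefulMode` and `statefulStorage` fields below are assumed from the VMAgent CRD and may differ between operator versions:

```yaml
# Before (illustrative): Deployment-backed vmagent with ephemeral storage.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAgent
metadata:
  name: example            # hypothetical resource name
spec:
  replicaCount: 2
---
# After (illustrative): StatefulSet-backed vmagent with a PVC per replica.
# `statefulMode` and `statefulStorage` are assumed field names.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAgent
metadata:
  name: example
spec:
  replicaCount: 2
  statefulMode: true
  statefulStorage:
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
```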
In some clusters, the update went well, but in others, I got the following error:
When I got this error, the stateful set was correctly deployed, but the deployment was not deleted. To solve the error, I had to manually delete the deployment.
I think I can pinpoint the reason why only some clusters got this error. It only happened in clusters in which I had a failed scheduling error for some of the stateful set pods. At that point, a node gets added to the cluster by an autoscaler and, after a few minutes, the stateful set is deployed correctly, but the reconciliation error in vmagent never gets resolved.
It also seems that a timeout happens in the vmagent reconciliation process in the clusters where a scale-up is needed. This is the first error event logged in the vmagent:
My hypothesis is the following (based only on the logs; I have not looked into the code yet): after the first timeout error, the reconciliation algorithm switches from a deploy -> STS transition to an STS -> STS one, which produces an error.