node-agent is not waiting for DataUpload completion before eviction during node image upgrade #7759
Comments
This can be addressed by #7198
Meanwhile, I created a watchdog script that can run in […]
I agree that the Datamover micro-service will handle the restart of the node-agent more gracefully.
Here, what do you mean by "node image upgrade"? Does it mean it causes an OS restart and you want to block the eviction of the pod until the data movement is complete? It is possible that the data movement takes a long time. Please let us know more details, because this seems like an interesting use case.
Yes, it means a K8s node OS upgrade, i.e. the nodes will be restarted.
Yes, something like that, i.e. wait until the current data movement is completed. In our case I have seen the DataUpload for a backup (a Schedule for a namespace) take 90 minutes, and I'm not sure that is even possible with container lifecycle hooks, because Kubernetes on the cloud only respects the "node drain timeout" (we use AKS, see the docs -> set-node-drain-timeout-value). Another approach is to pick up the "Cancelled" DataUpload or DataDownload after the node restart (i.e. restart the failed backups immediately, a kind of "re-try").
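A rough sketch of that re-try idea (this is not the exact watchdog mentioned above; it assumes the velero CLI and kubectl are configured, Velero runs in the velero namespace, and DataUploads/Backups carry the velero.io/backup-name and velero.io/schedule-name labels — verify these against your Velero version):

```bash
#!/usr/bin/env bash
# Hedged sketch: find backups whose DataUploads were cancelled by a node-agent
# restart and re-run their owning schedules.
set -euo pipefail
NS=velero

# Backups that own at least one Canceled DataUpload
backups=$(kubectl get datauploads.velero.io -n "$NS" \
  -o jsonpath='{range .items[?(@.status.phase=="Canceled")]}{.metadata.labels.velero\.io/backup-name}{"\n"}{end}' \
  | sort -u)

for b in $backups; do
  # If the backup came from a Schedule, trigger a fresh run of that Schedule
  schedule=$(kubectl get backup -n "$NS" "$b" \
    -o jsonpath='{.metadata.labels.velero\.io/schedule-name}' 2>/dev/null || true)
  if [ -n "$schedule" ]; then
    echo "Re-running schedule $schedule (DataUpload of backup $b was cancelled)"
    velero backup create --from-schedule "$schedule" --wait
  fi
done
```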
@veerendra2
In our case, we enabled automatic node image upgrades in our Terraform, so every week on Wednesday Azure automatically does the node OS upgrade, if there is one.
Is it possible to configure the time window in which the reboot due to the upgrade can commence?
Hi, yes, it is possible. For example, see -> https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/kubernetes_cluster#maintenance_window

```hcl
resource "azurerm_kubernetes_cluster" "this" {
  ...
  maintenance_window {
    allowed {
      day   = var.maintenance_window_day
      hours = var.maintenance_window_hours
    }
  }
  ...
}
```
First of all, I don't think delaying the execution of node maintenance is the ultimate solution.
I think the ultimate solution is to define a time window.
At present, Velero doesn't support a backup window, so I suggest you set the maintenance window so that no backup is scheduled during that window.
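For illustration (the schedule name, namespace, and cron expression below are placeholders, not taken from this thread), a schedule can be pinned to times outside the maintenance window:

```bash
# Illustrative only: run the namespace backup at 01:00 UTC daily,
# i.e. well outside a Wednesday-morning maintenance window.
velero schedule create my-ns-nightly \
  --schedule="0 1 * * *" \
  --include-namespaces my-ns \
  --snapshot-move-data
```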
We do backups every 4 hours and the maintenance takes longer than that, so there is no point in changing the maintenance window.
During a node image upgrade (maintenance), the DataUploads are Canceling with the message: found a dataupload with status "InProgress" during the node-agent starting, mark it as cancel. I would expect the node-agent to keep running and complete the on-going backup (DataUpload), and then evacuate once the on-going DataUploads are completed.

What steps did you take and what happened:
- Created a backup with --snapshot-move-data, for example with a command like the one shown below
- The DataUpload gets cancelled with: found a dataupload with status "InProgress" during the node-agent starting, mark it as cancel
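The exact command isn't preserved in this thread; a typical invocation that triggers a DataUpload might look like this (backup and namespace names are hypothetical):

```bash
# Hypothetical example only -- the reporter's actual command is not shown above.
# --snapshot-move-data makes the node-agent move the CSI snapshot data (DataUpload).
velero backup create my-ns-backup \
  --include-namespaces my-ns \
  --snapshot-move-data \
  --wait
```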
What did you expect to happen:
I would expect the node-agent to complete the on-going DataUpload and then evacuate. Maybe use Container Lifecycle Hooks and let Kubernetes wait for DataUpload completion before evacuating the node-agent pod.
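A minimal sketch of that idea, assuming the node-agent DaemonSet could be given a preStop hook, the pod can reach the API server with kubectl, NODE_NAME comes from the Downward API, and DataUpload status records the node handling the upload (all assumptions, not a supported Velero feature). Note that the kubelet still enforces terminationGracePeriodSeconds and the cloud's node drain timeout, so a 90-minute upload cannot simply be waited out this way:

```bash
# Sketch of a preStop-style wait loop (assumptions noted above).
# Blocks while any DataUpload on this node is still InProgress.
while kubectl get datauploads.velero.io -n velero \
    -o jsonpath='{range .items[?(@.status.phase=="InProgress")]}{.status.node}{"\n"}{end}' \
    | grep -qx "${NODE_NAME}"; do
  echo "DataUpload still InProgress on ${NODE_NAME}, waiting..."
  sleep 30
done
```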
Environment:
- OS (e.g. from /etc/os-release): RuntimeOS: ubuntu

Vote on this issue!
This is an invitation to the Velero community to vote on issues. You can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.