
node-agent is not waiting for DataUpload completion before eviction during node image upgrade #7759

Open
veerendra2 opened this issue Apr 30, 2024 · 10 comments

Comments

@veerendra2


During a node image upgrade (maintenance), DataUploads are canceled with the message 'found a dataupload with status "InProgress" during the node-agent starting, mark it as cancel'. I would expect the node-agent to keep running, complete the ongoing backup (DataUpload), and only be evicted once the ongoing DataUploads are completed.


What steps did you take and what happened:

  1. Enable CSI Snapshot Data Movement.
  2. Trigger a new backup with --snapshot-move-data, for example with the command below:
    $ velero backup create backup --include-namespaces [NAMESPACE] --include-resources persistentvolumeclaims --snapshot-move-data
  3. Trigger a node image upgrade (maintenance).
  4. The node-agent pods get restarted and the DataUpload is canceled with the message 'found a dataupload with status "InProgress" during the node-agent starting, mark it as cancel' (this can be observed with the command shown after this list).
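
A hedged way to observe this cancellation, assuming Velero is installed in the velero namespace:

    $ kubectl -n velero get datauploads.velero.io -w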

What did you expect to happen:
I would expect the node-agent to complete the ongoing DataUpload and only then be evicted. Maybe Container Lifecycle Hooks could be used so that Kubernetes waits for DataUpload completion before evicting the node-agent pod (a sketch of the idea follows below).
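
For illustration only, here is a minimal sketch of the kind of preStop script this suggests. It assumes kubectl is available in the container and NODE_NAME is injected via the downward API (neither holds for the stock node-agent image), and that the DataUpload status exposes the node name in status.node:

    #!/bin/sh
    # Hypothetical preStop script: block pod shutdown while any DataUpload
    # on this node is still InProgress. kubectl and NODE_NAME are assumed
    # to be available; the status fields are taken from the DataUpload CRD.
    while kubectl -n velero get datauploads.velero.io \
        -o jsonpath='{range .items[?(@.status.phase=="InProgress")]}{.status.node}{"\n"}{end}' \
        | grep -qx "$NODE_NAME"; do
      echo "DataUpload still in progress on $NODE_NAME, waiting..."
      sleep 30
    done

Note that, as discussed later in this thread, the kubelet still kills the pod once the drain/termination timeout expires, so this only helps for uploads that finish within that timeout.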

Environment:

$ velero version
Client:
	Version: v1.13.2
	Git commit: -
Server:
	Version: v1.13.2

$ k version
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.5
WARNING: version difference between client (1.30) and server (1.28) exceeds the supported minor version skew of +/-1

  • Kubernetes installer & version: Azure Kubernetes Service(AKS)
  • Cloud provider or hardware configuration: Azure
  • OS (e.g. from /etc/os-release): Ubuntu

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Lyndon-Li
Contributor

This can be addressed by #7198

@veerendra2
Author

Meanwhile, I created a watchdog script that can run in CronJob https://github.com/veerendra2/velero-watchdog

@reasonerjt added the "Needs info" label Jun 21, 2024
@reasonerjt
Contributor

I agree that the data mover micro-service will handle the restart of the node-agent more gracefully, but there's one thing I'm a little confused about.

@veerendra2

Trigger a node image upgrade (maintenance)

Here, what do you mean by "node image upgrade"? Does it mean it will cause an OS restart, and you want to block the eviction of the pod until the data movement is complete? It is possible that the data movement takes a long time.

Please let us know more details, because this seems like an interesting use case.

@veerendra2
Author

@reasonerjt

Here, what do you mean by "node image upgrade"? Does it mean it will cause an OS restart...

Yes, it means a K8s node OS upgrade, i.e. the nodes will be restarted.

...and you want to block the eviction of the pod until the data movement is complete? It is possible that the data movement takes a long time.

Yes, something like that, i.e. wait until the current data movement is completed. In our case I have seen the DataUpload for a backup (a Schedule for a namespace) take 90 minutes. I'm not sure it is even possible with container lifecycle hooks, because managed Kubernetes in the cloud only respects the node drain timeout (we use AKS; see the docs -> set-node-drain-timeout-value). A sketch of setting that timeout follows below.
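
For reference, the AKS drain timeout mentioned above is configured per node pool. A sketch, assuming the --drain-timeout flag (in minutes) from the linked AKS docs; my-rg, my-cluster, and mynodepool are placeholders, and the flag should be verified against your Azure CLI version:

    $ az aks nodepool update \
        --resource-group my-rg \
        --cluster-name my-cluster \
        --name mynodepool \
        --drain-timeout 90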

Another approach is to pick up the "Canceled" DataUploads or DataDownloads after the node restart (i.e. restart the failed backups immediately, a kind of retry); see the sketch below.
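
A hedged starting point for such a retry helper: list the canceled DataUploads so the corresponding backups can be re-created. The phase string in the CRD is "Canceled", and since Velero has no built-in DataUpload retry, re-running means creating a new backup:

    $ kubectl -n velero get datauploads.velero.io \
        -o jsonpath='{range .items[?(@.status.phase=="Canceled")]}{.metadata.name}{"\n"}{end}'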

@Lyndon-Li
Contributor

@veerendra2
What does the upgrade plan look like in your case? E.g., what is the frequency? Is it a scheduled task or a burst task?

@veerendra2
Author

In our case, we enabled automatic node image updates in our Terraform, so every week on Wednesday Azure automatically performs the node OS upgrade, if one is available.

@Lyndon-Li
Contributor

Lyndon-Li commented Jun 28, 2024

Is it possible to configure the time window in which the reboot due to the upgrade can commence?

@veerendra2
Author

Hi, yes, it is possible. For example, see -> https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/kubernetes_cluster#maintenance_window

resource "azurerm_kubernetes_cluster" "this" {
...
  maintenance_window {
    allowed {
      day   = var.maintenance_window_day
      hours = var.maintenance_window_hours
    }
  }
...
}

@Lyndon-Li
Contributor

First of all, I don't think delaying the execution of node maintenance is the ultimate solution:

  1. There is no reliable way to prevent Kubernetes from killing the pod's containers after an arbitrary timeout.
  2. Node maintenance is platform specific; there is no reliable way to interact with it in the same way across all platforms.

I think the ultimate solution is to define a time window:

  • Either the backup solution supports a backup window, which prevents backups/restores from running outside of it,
  • Or the platform supports a maintenance window, which prevents maintenance from running outside of it.

At present, Velero doesn't support a backup window, so I suggest you set a maintenance window and make sure no backup is scheduled into that window (see the sketch below).
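
As an example, a hedged sketch of keeping scheduled backups away from a maintenance window, assuming maintenance runs Wednesdays 02:00-06:00 (the schedule name, namespace, and times are illustrative):

    $ velero schedule create ns-backup \
        --schedule "0 8 * * *" \
        --include-namespaces my-namespace \
        --snapshot-move-data

This runs the backup daily at 08:00, safely outside the assumed maintenance window.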

@monotek

monotek commented Jul 1, 2024

We do backups every 4 hours and the maintenance takes longer than that, so there is no point in changing the maintenance window.
