v1.14.0+: Memory leaks on backups jobs (CSI backups + cloud plugin) #7925
Comments
Could you give more information about the two metrics? They seem to have default values.
Could you share more details on the backup workload?
10 volumes via the CSI driver. A working backup typically saves 1513 Kubernetes resources. We see this issue in approximately 10 Kubernetes clusters, all configured identically (10 volumes, ~1500 resources).
I think I found the root cause. For each backup, the Azure plugin is spawned but not terminated.
I see.
In our case, we will remove the Azure plugin for now. It was a leftover from the migration to CSI-based backups, and all old Azure plugin-based backups are out of rotation.
I thought that part was bundled into Velero? https://github.com/vmware-tanzu/velero/blob/main/pkg/util/azure/storage.go
That part of the code is used by filesystem backup to interact with cloud providers.
Okay, thanks for the confirmation. I guess we may be stuck, then.
@blackpiglet Hmm. In my experience, the plugin process typically gets terminated after each backup -- or more specifically, after the controller reconcile exits, since the gRPC connection is terminated. Maybe we should investigate whether 1) something has changed in 1.14 or 2) some issue with one or more plugins is preventing their exit?
We only have the Azure plugin active. We used the CSI plugin in the past, but it is now merged into Velero in 1.14.
@sseago So it seems the plugins do not always exit right after the backup completes, and that behavior was not introduced recently.
Our current workaround is a cronjob that restarts Velero every 2 hours.
We checked a cluster running Velero 1.13 and observed that no plugin instances are left behind.
@jkroepke The risk here is that if you restart Velero while a backup or restore is running, that backup/restore will be marked as failed, since an in-progress action can't be resumed after a restart or crash.
Yep, but without that cronjob, Velero is guaranteed to get OOM-killed during backup. Does Velero not support graceful shutdown? If a backup/restore is running, I would expect the SIGTERM to be delayed until the backup/restore process has finished.
@jkroepke
@blackpiglet velero/velero-plugin-for-microsoft-azure:v1.10.0. In the initial report we used 1.9, but we then upgraded to 1.10 without any difference.
We have been seeing frequent OOMKilled events during restores, where we have the limit set to 1Gi.
Please create a dedicated issue for this.
Sure, that will avoid any confusion here.
The issue isn't limited to the Azure plugin. Total Backup: 21
Did you see that I attached a zip file with all debug logs? Following the issue template, I manually ran everything that velero debug does.
Can you please share the commands? Thanks.
velero/.github/ISSUE_TEMPLATE/bug_report.md, lines 18 to 24 in 255a51f
It reproduced with: Kopia Backup - working fine
We have 2 schedules with only CSI backups enabled, plus Azure Blob Storage for manifests.
@jkroepke @duduvaa That's good information. If Kopia backup doesn't do this, it suggests that we're closing the object store plugin processes properly in the regular backup controller. The backup operation controller only reconciles when we have async operations -- which only happens for CSI and Datamover backups, so it may be in that code where this is occurring. It gives us a more focused area to dig into.
In our case, we do Azure File snapshots via CSI. That might be special.
@jkroepke I'm seeing it in my dev cluster with the AWS object store plugin and CSI snapshots (no datamover). If I use fs-backup, I don't see it.
Some debug progress, but still more to do. We know this happens only for CSI or CSI+datamover backups. I've isolated this to the backup finalizer controller. From the added logging, we see in the other controllers we initialize the needed plugins, then at end of reconcile, the relevant plugins are killed. In the finalizer controller case, though it's a bit different. In my cluster, 3 plugins were initialized (internal, object store, and the OADP openshift BIA/RIA plugin) -- then all 3 of these were killed. However, immediately after killing the aws plugin (again this is due to the |
If it helps, you can provide a build with extensive logging that I will run on my cluster.
@jkroepke @sseago Hi, I have the same issue. I suspect the code here introduces the process leak. Lines 729 to 743 in a91d2cb
On my machine, I deleted this section of the code, and the leak did not occur. However, the exact reason why this code causes a process leak is unclear.
I think I found the reason. The issue appears to be that each
Which versions are impacted by this?
The problem was introduced in #7554, so I think all versions after that commit are impacted.
@anshulahuja98 @yuanqijing Yes, I see the problem. The pattern of getting a plugin manager and deferring cleanup clients is used elsewhere to guarantee that the plugin process exits. However, in this case we're killing the plugin process while at the same time returning the backup store -- so the next time that someone uses that backup store, it re-initializes the plugin, but at this point it's no longer connected to a plugin manager, so it's never cleaned up.
We need to get the plugin manager at a higher level in the call stack so we clean up after we're done with the backup store.
I've submitted #8012, which passes in the already-initialized plugin manager from the finalizer controller, so we don't need to spawn a new plugin process to get/put volume info. I've tested this locally and it resolves the leak.
Once #8012 merges, we should cherry-pick it to release-1.14 for 1.14.1.
Reopen for cherry-pick.
Close as completed.
What steps did you take and what happened:
Set up scheduled backups via the CSI driver. No restic, Kopia, or other file-based copy actions are in use.
Memory leak.
In addition to the failing daily schedule, we have an hourly schedule. In Grafana, I can see a memory increase after each backup.
What did you expect to happen:
No memory leak
The following information will help us better understand what's going on:
If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue. For more options, please refer to velero debug --help
Does work from a pod with in-cluster authentication:
If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)
kubectl logs deployment/velero -n velero
velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
velero backup logs <backupname>
velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
velero restore logs <restorename>
velero.zip
Anything else you would like to add:
Environment:
velero version: v1.14.0
velero client config get features: <NOT SET>
kubectl version: v1.29.4
/etc/os-release: Azure Linux
Vote on this issue!
This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.