backupccl: OOM while restoring backup in 22.2 #103481
Comments
cc @cockroachdb/disaster-recovery
Looks like golang's http2 package is OOMing. I suppose this OOM is not occurring on later releases because they use newer versions of Go.
I'd be a little surprised if we'd be seeing a stdlib leak; seems more likely this is us just not sending as much over an http2-backed RPC (e.g., gRPC).
Summarizing chatter from the Slack thread: the HTTP client is OOMing, and either of the following patches prevents an OOM, at least in the repro above:
One unsatisfying note is that even after applying either patch, memory consumption remains dangerously high, around 12 GiB in this repro. So a backport of either patch would be a band-aid: if we restore a longer chain, even with the patches, we may still OOM. The ultimate solution is to include memory monitoring in the restore data processor (#93324), but that's not backportable. I believe the only further action items are to consider the above backports to 22.2 (i.e., more testing).
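For reference, the longer-term fix tracked in #93324 is to have the restore data processor account for the bytes it buffers against a budget. Below is a minimal sketch of that idea only; the `boundedAccount` type, its budget, and the sizes are hypothetical and are not CockroachDB's actual memory-monitoring API.

```go
// Minimal sketch of byte-accounted buffering, illustrating the idea behind
// adding memory monitoring to the restore data processor (#93324). The
// boundedAccount type and its budget are hypothetical.
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errBudgetExceeded = errors.New("memory budget exceeded")

// boundedAccount tracks bytes reserved for buffered data against a fixed limit.
type boundedAccount struct {
	mu    sync.Mutex
	used  int64
	limit int64
}

// grow reserves n bytes, returning an error instead of over-allocating.
func (a *boundedAccount) grow(n int64) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.used+n > a.limit {
		return errBudgetExceeded
	}
	a.used += n
	return nil
}

// shrink releases n bytes once the buffered data has been flushed.
func (a *boundedAccount) shrink(n int64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.used -= n
}

func main() {
	acct := &boundedAccount{limit: 1 << 30} // e.g. a 1 GiB budget for buffered SST data

	sstSize := int64(64 << 20) // pretend we are about to buffer a 64 MiB SST
	if err := acct.grow(sstSize); err != nil {
		// A real processor would flush or back off here instead of letting
		// the node run out of memory.
		fmt.Println("over budget, would flush first:", err)
		return
	}
	defer acct.shrink(sstSize)
	fmt.Println("buffered within budget; bytes in use:", acct.used)
}
```

The point of the pattern is that allocation failures become explicit errors (or flushes) rather than node-level OOMs, which is why it is the "ultimate solution" rather than a band-aid.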
Should it just be a typical series of backports instead of a cherry-pick? Do v23.1 first, then v22.2?
#98906 is already in 23.1.
(Previous question deleted; never mind, I was reading it wrong.)
Reassigning to @rhu713 to look into the potential backport band-aids and ways to manipulate what we pass to the API.
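For illustration only, one shape such a band-aid could take is capping how many requests are handed to the HTTP/2-backed storage API at once, so the client buffers less response data. This is a hypothetical sketch, not the actual patch discussed above; `fetchSST`, the URL, and the limit are made up.

```go
// Rough sketch of capping concurrent requests to an HTTP/2-backed storage API.
// The fetchSST helper, URL, and maxInFlight limit are hypothetical.
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
)

// At most maxInFlight downloads run concurrently; each slot is a token in a
// buffered channel.
const maxInFlight = 4

var slots = make(chan struct{}, maxInFlight)

// fetchSST downloads one SST file, waiting for a free slot first.
func fetchSST(ctx context.Context, url string) ([]byte, error) {
	select {
	case slots <- struct{}{}: // acquire a slot
	case <-ctx.Done():
		return nil, ctx.Err()
	}
	defer func() { <-slots }() // release the slot

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	// Usage example: in practice the URLs would come from the backup manifest.
	_, err := fetchSST(context.Background(), "https://example.com/sst/0001")
	fmt.Println("fetch result:", err)
}
```

Bounding concurrency only limits memory indirectly (per-request buffers still vary in size), which is consistent with the observation above that the band-aids still leave consumption uncomfortably high.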
While working on a roachtest (#103228), I saw a `RESTORE` fail because a couple of nodes went OOM. The backup was taken using the following command:

Worth noting about this backup (may or may not be relevant):
At a certain point in the test, we attempted to restore this backup on a 4-node cluster running v22.2.9. The restore failed because two of the nodes went OOM a few minutes after the `RESTORE` statement:

This backup does not contain a lot of data. The biggest table has ~2 GiB of data in it:
More importantly, very similar backups in other tests can be successfully restored in 22.2, so I think something went wrong with this particular backup.
Reproduction
The issue can be reproduced very easily by attempting to restore this backup on a 22.2 cluster (I have since moved the backup to a bucket with a longer TTL [2]). This happens even on a completely empty cluster, with no workloads running.
The commands below will create a node with 14 GiB of memory, just like the nodes in the failed test.
Finally, note that this does not happen on master or 23.1.1.
[1] roachtest artifacts
[2] 9_22.2.9-to-current_cluster_all-planned-and-executed-on-random-node_X4iV
Jira issue: CRDB-28023