backupccl: OOM while restoring backup in 22.2 #103481


Closed

renatolabs opened this issue May 16, 2023 · 9 comments
Labels
A-disaster-recovery C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-disaster-recovery

Comments


renatolabs commented May 16, 2023

While working on a roachtest (#103228), I saw a RESTORE fail because a couple of nodes went OOM. The backup was taken using the following command:

BACKUP INTO 'gs://cockroachdb-backup-testing/9_22.2.9-to-current_cluster_all-planned-and-executed-on-random-node_X4iV?AUTH=implicit' AS OF SYSTEM TIME '1684255935229011892.0000000000' WITH detached, encryption_passphrase = 'kvxN1Tmlwg0OesQw86rg8xjhsQdKBdHFZ7'

Worth noting about this backup (may or may not be relevant):

  • it was taken while the cluster was in a mixed-version state.
  • an incremental backup was taken shortly (30s) after the full backup finished (for full logs, see [1]).
  • both jobs were paused and resumed a couple of times while they ran.
  • the backup does not include revision_history.

At a certain point in the test, we attempted to restore this backup on a 4-node cluster running v22.2.9. This failed because two of the nodes went OOM a few minutes after the RESTORE statement was issued:

[Screenshot 2023-05-16 at 6:07:37 PM]

This backup does not contain a lot of data. The biggest table has ~3 GiB of data in it:

$ ./cockroach sql --insecure -e "SELECT database_name, parent_schema_name, object_name, size_bytes FROM [SHOW BACKUP LATEST IN 'gs://cockroach-tmp/backup_issue_22_2_oom/9_22.2.9-to-current_cluster_all-planned-and-executed-on-random-node_X4iV?AUTH=implicit' WITH check_files, encryption_passphrase = 'kvxN1Tmlwg0OesQw86rg8xjhsQdKBdHFZ7'] ORDER BY size_bytes DESC LIMIT 5"
database_name   parent_schema_name      object_name     size_bytes
tpcc    public  stock   3217512461
tpcc    public  order_line      1858665525
tpcc    public  customer        1848283823
bank    public  bank    1310901452
restore_1_22_2_9_to_current_database_bank_before_upgrade_in_22_2_9_1    public  bank    1310899634

More importantly, very similar backups in other tests can be successfully restored in 22.2, so I think something went wrong with this particular backup.

Reproduction

The issue can be easily reproduced by attempting to restore this backup on a 22.2 cluster (I have since moved the backup to a bucket with a longer TTL [2]). This happens even on a completely empty cluster, with no workloads running.

The commands below will create a node with 14 GiB of memory, just like the nodes in the failed test.

$ roachprod create -n 1 $CLUSTER
$ roachprod stage $CLUSTER release v22.2.9
$ roachprod start $CLUSTER
$ roachprod ssh $CLUSTER
...
ubuntu@CLUSTER $ time ./cockroach sql --insecure -e "RESTORE FROM LATEST IN 'gs://cockroach-tmp/backup_issue_22_2_oom/9_22.2.9-to-current_cluster_all-planned-and-executed-on-random-node_X4iV?AUTH=implicit' WITH encryption_passphrase = 'kvxN1Tmlwg0OesQw86rg8xjhsQdKBdHFZ7';"
ERROR: connection lost.

ERROR: -e: unexpected EOF
Failed running "sql"

real    2m46.609s
user    0m0.623s
sys     0m0.280s

Finally, note that this does not happen on master or 23.1.1.

[1] roachtest artifacts
[2] 9_22.2.9-to-current_cluster_all-planned-and-executed-on-random-node_X4iV

Jira issue: CRDB-28023

renatolabs added the C-bug, A-disaster-recovery, and T-disaster-recovery labels on May 16, 2023

blathers-crl bot commented May 16, 2023

cc @cockroachdb/disaster-recovery

@msbutler

Nice repro!

Here's what the heap looks like before the OOM. It looks like the HTTP client is buffering too much data for some reason. Digging further.
[image: heap profile captured before the OOM]
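
(For anyone reproducing this: a similar profile can be captured from a running node before it OOMs. The snippet below is a minimal sketch, assuming an insecure node serving the standard Go pprof endpoints on the default HTTP port, 8080; adjust the URL for other setups and inspect the result with go tool pprof.)

// fetch_heap.go: save a heap profile from a node before it OOMs.
// Assumes an insecure node exposing /debug/pprof/heap on port 8080.
package main

import (
    "io"
    "net/http"
    "os"
)

func main() {
    resp, err := http.Get("http://localhost:8080/debug/pprof/heap")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    out, err := os.Create("heap.pb.gz")
    if err != nil {
        panic(err)
    }
    defer out.Close()

    // Write the raw profile to disk; inspect it later with:
    //   go tool pprof heap.pb.gz
    if _, err := io.Copy(out, resp.Body); err != nil {
        panic(err)
    }
}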

@msbutler

Looks like Go's http2 package is OOMing. I suspect this OOM is not occurring on later releases because they use newer versions of go.

[image: heap profile showing allocations in the http2 package]


dt commented May 18, 2023

"because they use newer versions of go."

I'd be a little surprised if we were seeing a stdlib leak; it seems more likely that this is us just not sending as much over an http2-backed RPC (e.g., gRPC).

@msbutler

Summarizing chatter from the Slack thread:

The HTTP client is OOMing, and either of the following patches prevents an OOM, at least in the repro above:

  1. Cherry-pick backupccl: move openSSTs call into the restore workers #98906, which reduces the memory overhead of keeping a bunch of large files open: with this patch, we close files more eagerly (see the sketch after this list). It could be worth backporting this.
  2. Update golang.org/x/net to 0.8.0, which contains a fix for x/net/http2: Memory leak regression introduced in window updates change (golang/go#56315).
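
To illustrate option 1 only (a generic sketch, not the actual backupccl code): the idea is that workers open each file only when they are about to process it and close it before picking up the next one, instead of opening everything up front and keeping the handles and their buffered data alive.

// Generic sketch of "open the files in the workers" (hypothetical; not the
// change in #98906 itself): only lightweight descriptors are queued, and each
// worker opens and closes its file while processing it, so the number of
// simultaneously open files is bounded by the number of workers.
package main

import (
    "bufio"
    "fmt"
    "os"
    "sync"
)

func worker(id int, paths <-chan string, wg *sync.WaitGroup) {
    defer wg.Done()
    for p := range paths {
        func() {
            f, err := os.Open(p) // opened only once this worker picks it up
            if err != nil {
                fmt.Fprintln(os.Stderr, err)
                return
            }
            defer f.Close() // closed before the next file is opened

            lines := 0
            sc := bufio.NewScanner(f)
            for sc.Scan() {
                lines++
            }
            fmt.Printf("worker %d: %s has %d lines\n", id, p, lines)
        }()
    }
}

func main() {
    paths := make(chan string)
    var wg sync.WaitGroup
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go worker(i, paths, &wg)
    }
    for _, p := range os.Args[1:] {
        paths <- p // only the path crosses the channel, not an open handle
    }
    close(paths)
    wg.Wait()
}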

One unsatisfying note is that even after applying either patch, memory consumption remains dangerously high, around 12 GiB in this repro. So a backport of either patch would only be a bandaid: if we restore a longer chain, we may still OOM even with the patches.

The ultimate solution is to include memory monitoring in the restore data processor (#93324), but that's not backportable.
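
As a rough illustration of what that monitoring buys us (a simplified, hypothetical budget, not CockroachDB's actual memory-monitor infrastructure): workers reserve an estimate of a file's in-memory footprint before opening it and block until budget is available, rather than pushing the process past its limit.

// Simplified, hypothetical sketch of a blocking memory budget; the real
// change tracked in #93324 wires this into the restore data processor using
// CockroachDB's memory monitors.
package main

import (
    "fmt"
    "sync"
)

// memBudget is a byte budget that blocks acquirers when exhausted.
type memBudget struct {
    mu    sync.Mutex
    cond  *sync.Cond
    used  int64
    limit int64
}

func newMemBudget(limit int64) *memBudget {
    b := &memBudget{limit: limit}
    b.cond = sync.NewCond(&b.mu)
    return b
}

// acquire blocks until n more bytes fit under the limit.
func (b *memBudget) acquire(n int64) {
    b.mu.Lock()
    defer b.mu.Unlock()
    for b.used+n > b.limit {
        b.cond.Wait()
    }
    b.used += n
}

// release returns n bytes to the budget and wakes waiters.
func (b *memBudget) release(n int64) {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.used -= n
    b.cond.Broadcast()
}

func main() {
    budget := newMemBudget(1 << 30) // allow roughly 1 GiB of buffered SST data

    sizes := []int64{300 << 20, 700 << 20, 500 << 20, 400 << 20} // estimated footprints
    var wg sync.WaitGroup
    for i, size := range sizes {
        wg.Add(1)
        go func(i int, size int64) {
            defer wg.Done()
            budget.acquire(size) // blocks instead of overcommitting memory
            defer budget.release(size)
            fmt.Printf("processing file %d (%d MiB)\n", i, size>>20)
            // ... open, iterate, and ingest the file here ...
        }(i, size)
    }
    wg.Wait()
}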

I believe the only remaining action items are to consider the above backports to 22.2 (i.e., more testing).

@shermanCRL

"Cherry pick #98906, which reduces the memory overhead of keeping open a bunch of large files. With this patch, we close files more eagerly. It could be worth backporting this."

Should it just be a typical series of backports instead of a cherry-pick, i.e., do v23.1 first, then v22.2?

@msbutler

#98906 is already in 23.1.

@shermanCRL

(Previous question deleted; never mind, I was reading it wrong.)

msbutler assigned rhu713 and unassigned msbutler on May 22, 2023
@msbutler

Reassigning to @rhu713 to look into the potential backport bandaids and ways to adjust what we pass to the API.
