backupccl: OOM while restoring backup in 22.2 #103481


Closed

renatolabs opened this issue May 16, 2023 · 9 comments
Labels
A-disaster-recovery C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-disaster-recovery

Comments


renatolabs commented May 16, 2023

While working on a roachtest (#103228), I saw a RESTORE fail because a couple of nodes went OOM. The backup was taken using the following command:

BACKUP INTO 'gs://cockroachdb-backup-testing/9_22.2.9-to-current_cluster_all-planned-and-executed-on-random-node_X4iV?AUTH=implicit' AS OF SYSTEM TIME '1684255935229011892.0000000000' WITH detached, encryption_passphrase = 'kvxN1Tmlwg0OesQw86rg8xjhsQdKBdHFZ7'

Worth noting about this backup (may or may not be relevant):

  • it was taken while the cluster was in a mixed-version state.
  • an incremental backup was taken shortly (30s) after the full backup finished (for full logs, see [1]).
  • both jobs were paused and resumed a couple of times while they ran.
  • the backup does not include revision_history.

At a certain point in the test, we attempted to restore this backup on a 4-node cluster running v22.2.9. This failed because two of the nodes went OOM a few minutes after the RESTORE statement was issued:

[Screenshot 2023-05-16 at 6:07:37 PM]

This backup does not contain a lot of data. The biggest table has ~3 GiB of data in it:

$ ./cockroach sql --insecure -e "SELECT database_name, parent_schema_name, object_name, size_bytes FROM [SHOW BACKUP LATEST IN 'gs://cockroach-tmp/backup_issue_22_2_oom/9_22.2.9-to-current_cluster_all-planned-and-executed-on-random-node_X4iV?AUTH=implicit' WITH check_files, encryption_passphrase = 'kvxN1Tmlwg0OesQw86rg8xjhsQdKBdHFZ7'] ORDER BY size_bytes DESC LIMIT 5"
database_name   parent_schema_name      object_name     size_bytes
tpcc    public  stock   3217512461
tpcc    public  order_line      1858665525
tpcc    public  customer        1848283823
bank    public  bank    1310901452
restore_1_22_2_9_to_current_database_bank_before_upgrade_in_22_2_9_1    public  bank    1310899634

More importantly, very similar backups in other tests can be successfully restored in 22.2, so I think something went wrong with this particular backup.

Reproduction

The issue can be easily reproduced by attempting to restore this backup on a 22.2 cluster (I have since moved the backup to a bucket with a longer TTL [2]). This happens even on a completely empty cluster, with no workloads running.

The commands below will create a node with 14 GiB of memory, just like the nodes in the failed test.

$ roachprod create -n 1 $CLUSTER
$ roachprod stage $CLUSTER release v22.2.9
$ roachprod start $CLUSTER
$ roachprod ssh $CLUSTER
...
ubuntu@CLUSTER $ time ./cockroach sql --insecure -e "RESTORE FROM LATEST IN 'gs://cockroach-tmp/backup_issue_22_2_oom/9_22.2.9-to-current_cluster_all-planned-and-executed-on-random-node_X4iV?AUTH=implicit' WITH encryption_passphrase = 'kvxN1Tmlwg0OesQw86rg8xjhsQdKBdHFZ7';"
ERROR: connection lost.

ERROR: -e: unexpected EOF
Failed running "sql"

real    2m46.609s
user    0m0.623s
sys     0m0.280s

Finally, note that this does not happen on master or 23.1.1.

[1] roachtest artifacts
[2] 9_22.2.9-to-current_cluster_all-planned-and-executed-on-random-node_X4iV

Jira issue: CRDB-28023

renatolabs added the C-bug, A-disaster-recovery, and T-disaster-recovery labels on May 16, 2023

blathers-crl bot commented May 16, 2023

cc @cockroachdb/disaster-recovery

@msbutler

Nice repro!

Here's what the heap looks like before the OOM. It looks like the HTTP client is buffering too much data for some reason. Digging further.
[image: heap profile captured before the OOM]
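
(For anyone reproducing this: a similar profile can be captured from a running node before it OOMs. The snippet below is a minimal sketch, assuming an insecure node serving the standard Go pprof endpoints on the default HTTP port, 8080; adjust the URL for other setups and inspect the result with go tool pprof.)

// fetch_heap.go: save a heap profile from a node before it OOMs.
// Assumes an insecure node exposing /debug/pprof/heap on port 8080.
package main

import (
    "io"
    "net/http"
    "os"
)

func main() {
    resp, err := http.Get("http://localhost:8080/debug/pprof/heap")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    out, err := os.Create("heap.pb.gz")
    if err != nil {
        panic(err)
    }
    defer out.Close()

    // Write the raw profile to disk; inspect it later with:
    //   go tool pprof heap.pb.gz
    if _, err := io.Copy(out, resp.Body); err != nil {
        panic(err)
    }
}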

@msbutler

Looks like Go's http2 package is OOMing. I suspect this OOM is not occurring on later releases because they use newer versions of go.

[image: heap profile showing allocations in the http2 package]


dt commented May 18, 2023

"because they use newer versions of go."

I'd be a little surprised if we were seeing a stdlib leak; it seems more likely that this is us just not sending as much over an http2-backed RPC (e.g., gRPC).

@msbutler

Summarizing chatter from the Slack thread:

The HTTP client is OOMing, and either of the following patches prevents an OOM, at least in the repro above:

  1. Cherry-pick backupccl: move openSSTs call into the restore workers #98906, which reduces the memory overhead of keeping a bunch of large files open: with this patch, we close files more eagerly (see the sketch after this list). It could be worth backporting this.
  2. Update golang.org/x/net to 0.8.0, which contains a fix for x/net/http2: Memory leak regression introduced in window updates change (golang/go#56315).
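
To illustrate option 1 only (a generic sketch, not the actual backupccl code): the idea is that workers open each file only when they are about to process it and close it before picking up the next one, instead of opening everything up front and keeping the handles and their buffered data alive.

// Generic sketch of "open the files in the workers" (hypothetical; not the
// change in #98906 itself): only lightweight descriptors are queued, and each
// worker opens and closes its file while processing it, so the number of
// simultaneously open files is bounded by the number of workers.
package main

import (
    "bufio"
    "fmt"
    "os"
    "sync"
)

func worker(id int, paths <-chan string, wg *sync.WaitGroup) {
    defer wg.Done()
    for p := range paths {
        func() {
            f, err := os.Open(p) // opened only once this worker picks it up
            if err != nil {
                fmt.Fprintln(os.Stderr, err)
                return
            }
            defer f.Close() // closed before the next file is opened

            lines := 0
            sc := bufio.NewScanner(f)
            for sc.Scan() {
                lines++
            }
            fmt.Printf("worker %d: %s has %d lines\n", id, p, lines)
        }()
    }
}

func main() {
    paths := make(chan string)
    var wg sync.WaitGroup
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go worker(i, paths, &wg)
    }
    for _, p := range os.Args[1:] {
        paths <- p // only the path crosses the channel, not an open handle
    }
    close(paths)
    wg.Wait()
}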

One unsatisfying note is that even after applying either patch, memory consumption remains dangerously high, around 12 GiB in this repro. So a backport of either patch would only be a bandaid: if we restore a longer chain, we may still OOM even with the patches.

The ultimate solution is to include memory monitoring in the restore data processor (#93324), but that's not backportable.
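
As a rough illustration of what that monitoring buys us (a simplified, hypothetical budget, not CockroachDB's actual memory-monitor infrastructure): workers reserve an estimate of a file's in-memory footprint before opening it and block until budget is available, rather than pushing the process past its limit.

// Simplified, hypothetical sketch of a blocking memory budget; the real
// change tracked in #93324 wires this into the restore data processor using
// CockroachDB's memory monitors.
package main

import (
    "fmt"
    "sync"
)

// memBudget is a byte budget that blocks acquirers when exhausted.
type memBudget struct {
    mu    sync.Mutex
    cond  *sync.Cond
    used  int64
    limit int64
}

func newMemBudget(limit int64) *memBudget {
    b := &memBudget{limit: limit}
    b.cond = sync.NewCond(&b.mu)
    return b
}

// acquire blocks until n more bytes fit under the limit.
func (b *memBudget) acquire(n int64) {
    b.mu.Lock()
    defer b.mu.Unlock()
    for b.used+n > b.limit {
        b.cond.Wait()
    }
    b.used += n
}

// release returns n bytes to the budget and wakes waiters.
func (b *memBudget) release(n int64) {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.used -= n
    b.cond.Broadcast()
}

func main() {
    budget := newMemBudget(1 << 30) // allow roughly 1 GiB of buffered SST data

    sizes := []int64{300 << 20, 700 << 20, 500 << 20, 400 << 20} // estimated footprints
    var wg sync.WaitGroup
    for i, size := range sizes {
        wg.Add(1)
        go func(i int, size int64) {
            defer wg.Done()
            budget.acquire(size) // blocks instead of overcommitting memory
            defer budget.release(size)
            fmt.Printf("processing file %d (%d MiB)\n", i, size>>20)
            // ... open, iterate, and ingest the file here ...
        }(i, size)
    }
    wg.Wait()
}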

I believe the only remaining action items are to consider the above backports to 22.2 (i.e., more testing).

@shermanCRL

"Cherry pick #98906, which reduces the memory overhead of keeping open a bunch of large files. With this patch, we close files more eagerly. It could be worth backporting this."

Should it just be a typical series of backports instead of a cherry-pick, i.e., do v23.1 first, then v22.2?

@msbutler

#98906 is already in 23.1.

@shermanCRL

(Previous question deleted; never mind, I was reading it wrong.)

msbutler assigned rhu713 and unassigned msbutler on May 22, 2023
@msbutler

Reassigning to @rhu713 to look into the potential backport bandaids and ways to adjust what we pass to the API.
