Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: backup/KMS/GCS/n3cpu4 failed-- SHOW EXPERIMENTAL FINGERPRINTS causes an OOM crash #113816

Closed
cockroach-teamcity opened this issue Nov 4, 2023 · 4 comments · Fixed by #114289
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-sql-queries SQL Queries Team X-noreuse Prevent automatic commenting from CI test failures
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Nov 4, 2023

roachtest.backup/KMS/GCS/n3cpu4 failed with artifacts on master @ 4d045594e8c65b56c82fcf2a1f14ee30cecfef3d:

(monitor.go:153).Wait: monitor failure: pq: query execution canceled
test artifacts and logs in: /artifacts/backup/KMS/GCS/n3cpu4/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=true , ROACHTEST_metamorphicBuild=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-33175

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Nov 4, 2023
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Nov 4, 2023
@msbutler msbutler self-assigned this Nov 6, 2023
@msbutler msbutler closed this as completed Nov 6, 2023
@msbutler msbutler reopened this Nov 7, 2023
@msbutler msbutler added the X-noreuse Prevent automatic commenting from CI test failures label Nov 7, 2023
@msbutler msbutler changed the title roachtest: backup/KMS/GCS/n3cpu4 failed roachtest: backup/KMS/GCS/n3cpu4 failed-- SHOW EXPERIMENTAL FINGERPRINTS causes an OOM Nov 7, 2023
@msbutler msbutler changed the title roachtest: backup/KMS/GCS/n3cpu4 failed-- SHOW EXPERIMENTAL FINGERPRINTS causes an OOM roachtest: backup/KMS/GCS/n3cpu4 failed-- SHOW EXPERIMENTAL FINGERPRINTS causes an OOM crash Nov 7, 2023
@msbutler msbutler added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Nov 7, 2023
@msbutler
Copy link
Collaborator

msbutler commented Nov 7, 2023

According to 2.dmesg.txt, node 2 crashed with an OOM error. test.log indicates this OOM occurred after the backup/restore roundtrip and during a SHOW EXPERIMENTAL FINGERPRINTS query. It's worth noting:

  • this test failed while running with metamorphic constants set to true. the test.log prints all the metamorphic constants that were set.
  • A similar test failure occurred while run on AWS. I'm still trying to determine if an OOM caused this failure as well.

According to the latest heap profile in the artifacts, most memory allocation on node 2 looks to be in sql execution land:
image

I'm handing this over to sql queries to determine if this OOM points to a real bug in how they account for memory usage.

(cc @yuzefovich who has commented on a related slack thread)

@msbutler msbutler added T-sql-queries SQL Queries Team and removed T-disaster-recovery labels Nov 7, 2023
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Nov 7, 2023
@msbutler msbutler removed their assignment Nov 7, 2023
@yuzefovich yuzefovich removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Nov 7, 2023
@msbutler
Copy link
Collaborator

msbutler commented Nov 7, 2023

DR just had a another test failure with runtime assertions enabled where a SHOW EXPERIMENTAL FINGERPRINTS query ran out of the sql-memory budget (i.e. no OOM).

@yuzefovich
Copy link
Member

I think there is something to look at. These OOMs seem to require a combination of very large coldata-batch-size metamorphic variable as well as another direct-scans-enabled being enabled (which is disabled by default). Upon a quick inspection, I'm not certain that we always do proper memory accounting in the local fast path of "direct columnar scans" (meaning that we might be not accounting for the right half of the heap profile). This issue is not super urgent, so for now I'll just disable metamorphic randomization of direct-scans-enabled.

@yuzefovich yuzefovich self-assigned this Nov 8, 2023
@yuzefovich yuzefovich moved this from Triage to Active in SQL Queries Nov 8, 2023
craig bot pushed a commit that referenced this issue Nov 8, 2023
113966: kvcoord: Reintroduce catchup scan semaphore for regular rangefeed r=miretskiy a=miretskiy

Re-introduce catchup scan semaphore limit, removed by #110919, for regular rangefeed.  This hard limit on the number of catchup scans is necessary to avoid OOMs when handling large scan rangefeeds (large fan-in factor) when executing many non-local ranges.

Fixes #113489

Release note: None

114000: colfetcher: disable metamorphic randomization for direct scans r=yuzefovich a=yuzefovich

This commit makes it so that we no longer - for now - use metamorphic randomization for the default value of
`sql.distsql.direct_columnar_scans.enabled` cluster setting that controls whether the direct columnar scans (aka "KV projection pushdown") is enabled. It appears that we might be missing some memory accounting in the local fast path of this feature, and some backup-related roachtests run into OOMs with binaries with "enabled assertions". Disabling this metamorphization for now seems good to silence failures in case of this now-known issue.

Informs: #113816

Epic: None

Release note: None

114026: kvnemesis: bump default steps to 100 r=erikgrinaker a=erikgrinaker

50 steps is usually too small to trigger interesting behaviors. Bump it to 100, which is still small enough to be easily debuggable.

The nightlies already run with 1000 steps.

Epic: none
Release note: None

Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
@yuzefovich
Copy link
Member

I think I see what happened here. In short, it's unlucky set of circumstances.

The OOM requires metamorphic direct-scans-enabled to be enabled. This cluster setting controls whether the "direct columnar scans" (aka KV projection pushdown) is enabled. With the bank workload with payload of 10KiB we hit the worst behavior of the direct columnar scans possible. Namely, we currently do not provide any hint to the CFetcherWrapper about the number of rows we're expected to scan (#94850); thus, it falls back to exponential growth of row count for the coldata.Batch starting at 1, and we allocate bathches of sizes 1, 2, ..., 512. However, because we set TargetBytes limit to 10MiB and we have roughly 10KiB per row, we expect to scan about 1k rows total, in a single BatchResponse. So we happen to allocate a fresh coldata.Batch every time and we never happen to reuse it. This is what the right side in the image above is about.

I did take a look at the memory accounting story, and I think we are covered in all cases (as commit message of b19a4a8 describes). In short,

  • when we hit local fast path, that the footprint of not-serialized batches is included into BatchResponse.Size() and tracked by the txnKVFetcher
  • when we serialize the batch, then on the KV server side we're tracking the memory footprint of a single batch (that we are reusing, modulo the batch growth as described above) against kv-mem monitor. On the SQL client side, it'll be again included into BatchResponse.Size().

Now, the left side of the heap profile is due to us casting payload column from STRING to BYTES. Because we're allocating a fresh coldata.Batch as described in the first paragraph, we also need to allocate a fresh vector for the BYTES type too. This vector is accounted for.

I think the reason for hitting the OOM here was that the roachtest uses --max-sql-memory=50% which is way too aggressive, so in some cases GOMEMLIMIT might not save us. I'll reduce that.

Another thing is that I don't think we actually need string::bytes cast in EXPERIMENTAL_FINGERPRINTS, so we should be able to reduce the footprint in the worst case by half. I'll still disable the direct columnar scans explicitly for this roachtest.

craig bot pushed a commit that referenced this issue Nov 13, 2023
114279: backupccl: add TestOnlineRestoreS3 and TestOnlineRestoreBasic tests r=dt a=msbutler

This patch adds TestOnlineRestoreS3 to our cloud unit test suite. To run locally, run:
```
./dev test pkg/ccl/backupccl -f TestOnlineRestoreS3 -- --test_env=AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID --test_env=AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY --test_env=AWS_S3_BUCKET=cockroachdb-backup-cloud-nightly --test_env=COCKROACH_UNSAFE_RESTORE=true
```

This patch also adds TestOnlineRestoreBasic which runs online restore through
nodelocal. To run locally, run:
```
./dev test pkg/ccl/backupccl -f TestOnlineRestoreBasic -- --test_env=COCKROACH_UNSAFE_RESTORE=true
```

Both tests will be skipped in the nightlies until we pass
COCKROACH_UNSAFE_RESTORE to their setup.

Epic: none

Release note: none

114287: roachtest: remove duplicated code in backup for fingerprint r=yuzefovich a=yuzefovich

Epic: None

Release note: None

114289: colfetcher,roachtest: re-enable direct scans metamorphization r=yuzefovich a=yuzefovich

**sql: optimize EXPERIMENTAL_FINGERPRINTS**

This commit makes the internal query powering EXPERIMENTAL_FINGERPRINTS
more efficient. In particular, previously we always performed a cast of
non-BYTES types to BYTES; however, `fnv64` supports variadic input types
of both BYTES and STRING, so this commit counts the number of BYTES and
non-BYTES columns and casts the less frequent type to the other one (in
most cases this should eliminate redundant casts from STRING to BYTES).

Epic: None

Release note: None

**colfetcher,roachtest: re-enable direct scans metamorphization**

This commit makes a couple of tweaks to the backup roachtests which
allows us to make the default value of `sql.distsql.direct_columnar_scans.enabled`
a metamorphic variable again. In particular, we need to disable this
setting for backups on `bank` data set due to pathological case as
described in #113816. This also allows us to reduce `--max-sql-memory`
since we don't expect memory usage to be as high without direct scans.

Fixes: #113816.

Release note: None

Co-authored-by: Michael Butler <butler@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
@craig craig bot closed this as completed in 40b8b25 Nov 13, 2023
@github-project-automation github-project-automation bot moved this from Active to Done in SQL Queries Nov 13, 2023
yuzefovich added a commit that referenced this issue Nov 15, 2023
This commit makes a couple of tweaks to the backup roachtests which
allows us to make the default value of `sql.distsql.direct_columnar_scans.enabled`
a metamorphic variable again. In particular, we need to disable this
setting for backups on `bank` data set due to pathological case as
described in #113816. This also allows us to reduce `--max-sql-memory`
since we don't expect memory usage to be as high without direct scans.

Fixes: #113816.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-sql-queries SQL Queries Team X-noreuse Prevent automatic commenting from CI test failures
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants