roachtest: backup/KMS/GCS/n3cpu4 failed-- SHOW EXPERIMENTAL FINGERPRINTS causes an OOM crash #113816
According to the latest heap profile in the artifacts, most memory allocation on node 2 looks to be in sql execution land. I'm handing this over to sql queries to determine whether this OOM points to a real bug in how they account for memory usage. (cc @yuzefovich, who has commented on a related slack thread)
DR just had another test failure with runtime assertions enabled, where a SHOW EXPERIMENTAL FINGERPRINTS query ran out of the sql-memory budget (i.e. no OOM).
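To make the distinction above concrete (running out of the sql-memory budget versus an OOM crash), here is a minimal sketch of budget-based memory accounting. The `memAccount` type, its methods, and `errBudgetExceeded` are hypothetical names for illustration only and are not CockroachDB's actual memory monitor API. The point is that an accounted allocation fails with an error the query can return, while an unaccounted allocation can only be stopped by the OS killing the process.

```go
package main

import (
	"errors"
	"fmt"
)

// errBudgetExceeded is a hypothetical error returned when a reservation
// would push usage past the configured limit.
var errBudgetExceeded = errors.New("memory budget exceeded")

// memAccount is a hypothetical, single-threaded memory account with a hard limit.
type memAccount struct {
	used  int64
	limit int64
}

// grow reserves n bytes, or returns an error without reserving anything
// if the reservation would not fit in the budget.
func (a *memAccount) grow(n int64) error {
	if a.used+n > a.limit {
		return errBudgetExceeded
	}
	a.used += n
	return nil
}

// shrink releases up to n previously reserved bytes.
func (a *memAccount) shrink(n int64) {
	if n > a.used {
		n = a.used
	}
	a.used -= n
}

func main() {
	acct := &memAccount{limit: 1024}
	fmt.Println(acct.grow(512))  // fits in the budget
	fmt.Println(acct.grow(1024)) // rejected: the query errors out, the process survives
}
```

Memory that is never registered through something like `grow` is invisible to the budget, which is how a process can OOM while staying under its configured limit.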
I think there is something to look at. These OOMs seem to require a combination of very large
113966: kvcoord: Reintroduce catchup scan semaphore for regular rangefeed r=miretskiy a=miretskiy

Re-introduce catchup scan semaphore limit, removed by #110919, for regular rangefeed. This hard limit on the number of catchup scans is necessary to avoid OOMs when handling large scan rangefeeds (large fan-in factor) when executing many non-local ranges.

Fixes #113489
Release note: None

114000: colfetcher: disable metamorphic randomization for direct scans r=yuzefovich a=yuzefovich

This commit makes it so that we no longer - for now - use metamorphic randomization for the default value of `sql.distsql.direct_columnar_scans.enabled` cluster setting that controls whether the direct columnar scans (aka "KV projection pushdown") is enabled. It appears that we might be missing some memory accounting in the local fast path of this feature, and some backup-related roachtests run into OOMs with binaries with "enabled assertions". Disabling this metamorphization for now seems good to silence failures in case of this now-known issue.

Informs: #113816
Epic: None
Release note: None

114026: kvnemesis: bump default steps to 100 r=erikgrinaker a=erikgrinaker

50 steps is usually too small to trigger interesting behaviors. Bump it to 100, which is still small enough to be easily debuggable. The nightlies already run with 1000 steps.

Epic: none
Release note: None

Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
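As an illustration of the "hard limit on the number of catchup scans" idea in 113966, the following is a minimal sketch of a counting semaphore built on a buffered channel. The names `scanLimiter` and `runScans` are hypothetical and this is not the kvcoord implementation; it only shows how a fixed-capacity semaphore bounds the number of concurrent scans, and therefore the memory they can hold at once.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// scanLimiter is a hypothetical counting semaphore: the channel capacity
// is the maximum number of scans allowed to run at the same time.
type scanLimiter chan struct{}

func newScanLimiter(limit int) scanLimiter { return make(scanLimiter, limit) }

// acquire blocks until a slot is free.
func (l scanLimiter) acquire() { l <- struct{}{} }

// release frees a slot taken by acquire.
func (l scanLimiter) release() { <-l }

// runScans starts n simulated scans under the limiter and returns the
// highest number of scans that were ever in flight together.
func runScans(n, limit int) int64 {
	lim := newScanLimiter(limit)
	var wg sync.WaitGroup
	var inFlight, peak int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			lim.acquire()
			defer lim.release()
			cur := atomic.AddInt64(&inFlight, 1)
			// Record the peak concurrency with a compare-and-swap loop.
			for {
				p := atomic.LoadInt64(&peak)
				if cur <= p || atomic.CompareAndSwapInt64(&peak, p, cur) {
					break
				}
			}
			// A real scan would buffer data here; the sketch does no work.
			atomic.AddInt64(&inFlight, -1)
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	// With a limit of 8, no more than 8 scans can ever overlap.
	fmt.Println(runScans(100, 8) <= 8)
}
```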
I think I see what happened here. In short, it's an unlucky set of circumstances. The OOM requires metamorphic I did take a look at the memory accounting story, and I think we are covered in all cases (as the commit message of b19a4a8 describes). In short,
Now, the left side of the heap profile is due to us casting I think the reason for hitting the OOM here was that the roachtest uses Another thing is that I don't think we actually need
114279: backupccl: add TestOnlineRestoreS3 and TestOnlineRestoreBasic tests r=dt a=msbutler

This patch adds TestOnlineRestoreS3 to our cloud unit test suite. To run locally, run:

```
./dev test pkg/ccl/backupccl -f TestOnlineRestoreS3 -- --test_env=AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID --test_env=AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY --test_env=AWS_S3_BUCKET=cockroachdb-backup-cloud-nightly --test_env=COCKROACH_UNSAFE_RESTORE=true
```

This patch also adds TestOnlineRestoreBasic which runs online restore through nodelocal. To run locally, run:

```
./dev test pkg/ccl/backupccl -f TestOnlineRestoreBasic -- --test_env=COCKROACH_UNSAFE_RESTORE=true
```

Both tests will be skipped in the nightlies until we pass COCKROACH_UNSAFE_RESTORE to their setup.

Epic: none
Release note: none

114287: roachtest: remove duplicated code in backup for fingerprint r=yuzefovich a=yuzefovich

Epic: None
Release note: None

114289: colfetcher,roachtest: re-enable direct scans metamorphization r=yuzefovich a=yuzefovich

**sql: optimize EXPERIMENTAL_FINGERPRINTS**

This commit makes the internal query powering EXPERIMENTAL_FINGERPRINTS more efficient. In particular, previously we always performed a cast of non-BYTES types to BYTES; however, `fnv64` supports variadic input types of both BYTES and STRING, so this commit counts the number of BYTES and non-BYTES columns and casts the less frequent type to the other one (in most cases this should eliminate redundant casts from STRING to BYTES).

Epic: None
Release note: None

**colfetcher,roachtest: re-enable direct scans metamorphization**

This commit makes a couple of tweaks to the backup roachtests which allows us to make the default value of `sql.distsql.direct_columnar_scans.enabled` a metamorphic variable again. In particular, we need to disable this setting for backups on `bank` data set due to pathological case as described in #113816. This also allows us to reduce `--max-sql-memory` since we don't expect memory usage to be as high without direct scans.

Fixes: #113816.
Release note: None

Co-authored-by: Michael Butler <butler@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
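The EXPERIMENTAL_FINGERPRINTS optimization in 114289 (count BYTES and non-BYTES columns, then cast the less frequent kind) can be sketched roughly as follows. This is an illustration under assumptions, not the actual query builder: the `column` type and `fingerprintArgs` function are hypothetical, non-BYTES columns are assumed to already be STRING-typed, the `::BYTES` and `::STRING` cast text is illustrative, and breaking ties in favor of STRING is an arbitrary choice of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// column is a hypothetical description of one column being fingerprinted.
type column struct {
	name    string
	isBytes bool // true for BYTES; false is assumed to mean STRING in this sketch
}

// fingerprintArgs returns the argument list for a variadic hash call,
// casting only the columns of the less frequent type so that all
// arguments end up with the same type.
func fingerprintArgs(cols []column) string {
	numBytes := 0
	for _, c := range cols {
		if c.isBytes {
			numBytes++
		}
	}
	// Cast toward BYTES only when BYTES columns are the strict majority.
	castToBytes := numBytes > len(cols)-numBytes

	args := make([]string, 0, len(cols))
	for _, c := range cols {
		switch {
		case castToBytes && !c.isBytes:
			args = append(args, c.name+"::BYTES")
		case !castToBytes && c.isBytes:
			args = append(args, c.name+"::STRING")
		default:
			// Already the majority type: no cast, so no extra copy of the value.
			args = append(args, c.name)
		}
	}
	return strings.Join(args, ", ")
}

func main() {
	cols := []column{{"id", false}, {"payload", false}, {"blob", true}}
	fmt.Println("fnv64(" + fingerprintArgs(cols) + ")")
}
```

For a table with mostly STRING-like columns, only the rare BYTES column is cast, which avoids materializing a converted copy of every large string value.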
roachtest.backup/KMS/GCS/n3cpu4 failed with artifacts on master @ 4d045594e8c65b56c82fcf2a1f14ee30cecfef3d:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_cpu=4
ROACHTEST_encrypted=true
ROACHTEST_metamorphicBuild=true
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
Jira issue: CRDB-33175