
export: integrate memory monitoring #104329

Closed
jayshrivastava opened this issue Jun 5, 2023 · 2 comments
Assignees
Labels
A-cdc Change Data Capture C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-cdc

Comments

@jayshrivastava
Contributor

jayshrivastava commented Jun 5, 2023

Right now, we read bytes from DistSQL and buffer them in the parquet library before writing them out, but we never report this buffering anywhere. `EXPORT INTO PARQUET` should register these held allocations with memory monitors.
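The pattern being asked for resembles a memory-monitor "bound account": reserve bytes against a budget before buffering them, and release the reservation when the buffer is flushed. A minimal self-contained sketch of that idea (the `BoundAccount` type and `bufferRow` helper here are illustrative stand-ins, not the actual CockroachDB `mon` package API):

```go
package main

import (
	"errors"
	"fmt"
)

// BoundAccount is a simplified stand-in for a memory-monitor account:
// it tracks bytes held against a fixed budget.
type BoundAccount struct {
	used, limit int64
}

// Grow reserves n bytes, failing if the budget would be exceeded.
func (a *BoundAccount) Grow(n int64) error {
	if a.used+n > a.limit {
		return errors.New("memory budget exceeded")
	}
	a.used += n
	return nil
}

// Shrink releases n bytes previously reserved.
func (a *BoundAccount) Shrink(n int64) {
	a.used -= n
	if a.used < 0 {
		a.used = 0
	}
}

// bufferRow models buffering one encoded row in the parquet writer:
// the allocation is registered with the account before the bytes are held.
func bufferRow(acc *BoundAccount, row []byte) error {
	if err := acc.Grow(int64(len(row))); err != nil {
		return err
	}
	// ... append row to the writer's in-memory buffer ...
	return nil
}

func main() {
	acc := &BoundAccount{limit: 1 << 10} // 1 KiB budget for illustration
	if err := bufferRow(acc, make([]byte, 512)); err != nil {
		fmt.Println("unexpected:", err)
		return
	}
	fmt.Println("used after buffering:", acc.used) // 512

	// Flushing the buffer to storage releases the reservation.
	acc.Shrink(512)
	fmt.Println("used after flush:", acc.used) // 0
}
```

The key property is that exceeding the budget surfaces as an error at buffering time, rather than as unaccounted memory growth.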

Jira issue: CRDB-28470

@jayshrivastava jayshrivastava added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-cdc Change Data Capture T-cdc labels Jun 5, 2023
@jayshrivastava jayshrivastava self-assigned this Jun 5, 2023
@blathers-crl

blathers-crl bot commented Jun 5, 2023

cc @cockroachdb/cdc

jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jun 5, 2023
Previously, `EXPORT INTO PARQUET` used a library with
very poor memory usage. This change updates the command to
use the new parquet writer in `util/parquet`, which does
not use excessive memory.

To ensure backwards compatibility, as much testing code as
possible is left untouched to ensure the old `EXPORT INTO PARQUET`
behavior is the same when using the new library. Some of the
old production code and dependencies are also left untouched
because they are used in the test code.

This commit leaves some TODOs to clean up the tests and production
code once we have confidence in the new library.

Informs: cockroachdb#104329
Informs: cockroachdb#104278
Informs: cockroachdb#104279
Closes: cockroachdb#103317

Release note (general change): `EXPORT INTO PARQUET` will now
use a new internal implementation for writing parquet files
using parquet spec version 2.6. There should be no significant
impact to the structure of files being written. There is one
minor change:
- all columns written to parquet files will be nullable
  (i.e., the parquet repetition type is `OPTIONAL`)
craig bot pushed a commit that referenced this issue Jun 5, 2023
104234: importer: use new parquet writer for EXPORT INTO PARQUET r=miretskiy a=jayshrivastava

### roachtest: add test for EXPORT INTO PARQUET
This change adds a roachtest which sets up a 3-node CRDB cluster
on 8vCPU machines initialized with the TPC-C database containing
250 warehouses. Then, it executes 30 `EXPORT INTO PARQUET` statements
concurrently, repeatedly for 10 minutes. The main purpose of this
test is to use it for benchmarking. It sets up grafana as well
so metrics can be observed during the test.

This change also adds a daily roachtest which runs exports
on a 100 warehouse TPCC database, without setting up grafana.

Informs: #103317
Release note: None

---

### importer: use new parquet writer for EXPORT INTO PARQUET

Previously, `EXPORT INTO PARQUET` used a library with
very poor memory usage. This change updates the command to
use the new parquet writer in `util/parquet`, which does
not use excessive memory.

To ensure backwards compatibility, as much testing code as
possible is left untouched to ensure the old `EXPORT INTO PARQUET`
behavior is the same when using the new library. Some of the
old production code and dependencies are also left untouched
because they are used in the test code.

This commit leaves some TODOs to clean up the tests and production
code once we have confidence in the new library.

Informs: #104329
Informs: #104278
Informs: #104279
Closes: #103317

Release note (general change): `EXPORT INTO PARQUET` will now
use a new internal implementation for writing parquet files
using parquet spec version 2.6. There should be no significant
impact to the structure of files being written. There is one
minor change:
- all columns written to parquet files will be nullable
  (i.e., the parquet repetition type is `OPTIONAL`)

Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>
@jayshrivastava jayshrivastava changed the title util/parquet: integrate memory monitoring export: integrate memory monitoring Jun 7, 2023
@jayshrivastava
Contributor Author

Closing this as fixed by #104234. Memory used by datums should be accounted for by SQL row processing, but allocations made internally by an external library cannot really be tracked. The above PR uses a best-effort approach. Also, `EXPORT` will be deprecated or limited to small sizes (#105297), so we don't need to make this perfect.
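The best-effort approach described above can be sketched as periodically reconciling a monitor account with whatever buffered-size estimate the external library exposes, rather than tracking each allocation. A minimal self-contained illustration (the `memAccount` type and the idea of an estimate-driven `resize` are assumptions for this sketch, not the actual PR's code):

```go
package main

import (
	"errors"
	"fmt"
)

// memAccount is a stand-in for a memory-monitor account: it tracks
// bytes reserved against a fixed budget.
type memAccount struct {
	used, limit int64
}

// resize re-syncs the reservation with a new estimate of buffered bytes,
// failing only when the estimate exceeds the budget. This is the
// best-effort pattern: we cannot see the library's individual
// allocations, so we periodically reconcile against the total it reports.
func (a *memAccount) resize(estimate int64) error {
	if estimate > a.limit {
		return errors.New("memory budget exceeded")
	}
	a.used = estimate
	return nil
}

func main() {
	acc := &memAccount{limit: 1 << 20} // 1 MiB budget for illustration

	// After each batch of rows, ask the (hypothetical) parquet writer
	// for its current buffered size and reconcile the account with it.
	for _, estimate := range []int64{4096, 8192, 0} {
		if err := acc.resize(estimate); err != nil {
			fmt.Println("export aborted:", err)
			return
		}
		fmt.Println("reserved bytes:", acc.used)
	}
}
```

The trade-off is accuracy for simplicity: short-lived allocations inside the library go unseen between reconciliation points, but sustained buffering is still bounded by the budget.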
