
export: integrate memory monitoring #104329

Closed
jayshrivastava opened this issue Jun 5, 2023 · 2 comments
Assignees
Labels
A-cdc Change Data Capture C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-cdc

Comments

@jayshrivastava
Contributor

jayshrivastava commented Jun 5, 2023

Right now, we read bytes from DistSQL and buffer them in the parquet library before writing them out, but we never report this buffering anywhere. `EXPORT INTO PARQUET` should register these held allocations with memory monitors.
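The pattern being asked for resembles a memory-monitor "bound account": reserve bytes against a budget before buffering them, and release the reservation when the buffer is flushed. A minimal self-contained sketch of that idea (the `BoundAccount` type and `bufferRow` helper here are illustrative stand-ins, not the actual CockroachDB `mon` package API):

```go
package main

import (
	"errors"
	"fmt"
)

// BoundAccount is a simplified stand-in for a memory-monitor account:
// it tracks bytes held against a fixed budget.
type BoundAccount struct {
	used, limit int64
}

// Grow reserves n bytes, failing if the budget would be exceeded.
func (a *BoundAccount) Grow(n int64) error {
	if a.used+n > a.limit {
		return errors.New("memory budget exceeded")
	}
	a.used += n
	return nil
}

// Shrink releases n bytes previously reserved.
func (a *BoundAccount) Shrink(n int64) {
	a.used -= n
	if a.used < 0 {
		a.used = 0
	}
}

// bufferRow models buffering one encoded row in the parquet writer:
// the allocation is registered with the account before the bytes are held.
func bufferRow(acc *BoundAccount, row []byte) error {
	if err := acc.Grow(int64(len(row))); err != nil {
		return err
	}
	// ... append row to the writer's in-memory buffer ...
	return nil
}

func main() {
	acc := &BoundAccount{limit: 1 << 10} // 1 KiB budget for illustration
	if err := bufferRow(acc, make([]byte, 512)); err != nil {
		fmt.Println("unexpected:", err)
		return
	}
	fmt.Println("used after buffering:", acc.used) // 512

	// Flushing the buffer to storage releases the reservation.
	acc.Shrink(512)
	fmt.Println("used after flush:", acc.used) // 0
}
```

The key property is that exceeding the budget surfaces as an error at buffering time, rather than as unaccounted memory growth.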

Jira issue: CRDB-28470

@jayshrivastava jayshrivastava added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-cdc Change Data Capture T-cdc labels Jun 5, 2023
@jayshrivastava jayshrivastava self-assigned this Jun 5, 2023
@blathers-crl

blathers-crl bot commented Jun 5, 2023

cc @cockroachdb/cdc

jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jun 5, 2023
Previously, `EXPORT INTO PARQUET` used a library with
very poor memory usage. This change updates the command to
use the new parquet writer in `util/parquet`, which does
not use excessive memory.

To ensure backwards compatibility, as much testing code as
possible is left untouched to ensure the old `EXPORT INTO PARQUET`
behavior is the same when using the new library. Some of the
old production code and dependencies are also left untouched
because they are used in the test code.

This commit leaves some TODOs to clean up the tests and production
code once we have confidence in the new library.

Informs: cockroachdb#104329
Informs: cockroachdb#104278
Informs: cockroachdb#104279
Closes: cockroachdb#103317

Release note (general change): `EXPORT INTO PARQUET` will now
use a new internal implementation for writing parquet files
using parquet spec version 2.6. There should be no significant
impact to the structure of files being written. There is one
minor change:
- all columns written to parquet files will be nullable
  (i.e., the parquet repetition type is `OPTIONAL`)
craig bot pushed a commit that referenced this issue Jun 5, 2023
104234: importer: use new parquet writer for EXPORT INTO PARQUET r=miretskiy a=jayshrivastava

### roachtest: add test for EXPORT INTO PARQUET
This change adds a roachtest which sets up a 3-node CRDB cluster
on 8vCPU machines initialized with the TPC-C database containing
250 warehouses. Then, it executes 30 `EXPORT INTO PARQUET` statements
concurrently, repeatedly for 10 minutes. The main purpose of this
test is to use it for benchmarking. It sets up grafana as well
so metrics can be observed during the test.

This change also adds a daily roachtest which runs exports
on a 100 warehouse TPCC database, without setting up grafana.

Informs: #103317
Release note: None

---

### importer: use new parquet writer for EXPORT INTO PARQUET

Previously, `EXPORT INTO PARQUET` used a library with
very poor memory usage. This change updates the command to
use the new parquet writer in `util/parquet`, which does
not use excessive memory.

To ensure backwards compatibility, as much testing code as
possible is left untouched to ensure the old `EXPORT INTO PARQUET`
behavior is the same when using the new library. Some of the
old production code and dependencies are also left untouched
because they are used in the test code.

This commit leaves some TODOs to clean up the tests and production
code once we have confidence in the new library.

Informs: #104329
Informs: #104278
Informs: #104279
Closes: #103317

Release note (general change): `EXPORT INTO PARQUET` will now
use a new internal implementation for writing parquet files
using parquet spec version 2.6. There should be no significant
impact to the structure of files being written. There is one
minor change:
- all columns written to parquet files will be nullable
  (i.e., the parquet repetition type is `OPTIONAL`)

Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>
@jayshrivastava jayshrivastava changed the title util/parquet: integrate memory monitoring export: integrate memory monitoring Jun 7, 2023
@jayshrivastava
Contributor Author

Closing this as fixed by #104234. Memory used by datums should be accounted for by SQL row processing, but allocations made internally by an external library cannot really be tracked. The above PR uses a best-effort approach. Also, `EXPORT` will be deprecated or limited to small sizes (#105297), so we don't need to make this perfect.
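The best-effort approach described above can be sketched as periodically reconciling a monitor account with whatever buffered-size estimate the external library exposes, rather than tracking each allocation. A minimal self-contained illustration (the `memAccount` type and the idea of an estimate-driven `resize` are assumptions for this sketch, not the actual PR's code):

```go
package main

import (
	"errors"
	"fmt"
)

// memAccount is a stand-in for a memory-monitor account: it tracks
// bytes reserved against a fixed budget.
type memAccount struct {
	used, limit int64
}

// resize re-syncs the reservation with a new estimate of buffered bytes,
// failing only when the estimate exceeds the budget. This is the
// best-effort pattern: we cannot see the library's individual
// allocations, so we periodically reconcile against the total it reports.
func (a *memAccount) resize(estimate int64) error {
	if estimate > a.limit {
		return errors.New("memory budget exceeded")
	}
	a.used = estimate
	return nil
}

func main() {
	acc := &memAccount{limit: 1 << 20} // 1 MiB budget for illustration

	// After each batch of rows, ask the (hypothetical) parquet writer
	// for its current buffered size and reconcile the account with it.
	for _, estimate := range []int64{4096, 8192, 0} {
		if err := acc.resize(estimate); err != nil {
			fmt.Println("export aborted:", err)
			return
		}
		fmt.Println("reserved bytes:", acc.used)
	}
}
```

The trade-off is accuracy for simplicity: short-lived allocations inside the library go unseen between reconciliation points, but sustained buffering is still bounded by the budget.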
