-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
export: integrate memory monitoring #104329
Labels
A-cdc
Change Data Capture
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-cdc
Comments
jayshrivastava
added
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
A-cdc
Change Data Capture
T-cdc
labels
Jun 5, 2023
cc @cockroachdb/cdc |
jayshrivastava
added a commit
to jayshrivastava/cockroach
that referenced
this issue
Jun 5, 2023
Previously, `EXPORT INTO PARQUET` used a library which had very poor memory usage. This change updates this command to use the new parquet writer in `util/parquet`, which is newer and does not use excessive memory. To ensure backwards compatability, as much testing code as possible is left untouched to ensure the old `EXPORT INTO PARQUET` behavior is the same when using the new library. Some of the the old production code and dependencies are also left untouched because they are used in the test code. This commit leaves some TODOs to clean up the tests and production code once we have confidence in the new library. Informs: cockroachdb#104329 Informs: cockroachdb#104278 Informs: cockroachdb#104279 Closes: cockroachdb#103317 Release note (general change): `EXPORT INTO PARQUET` will now use a new internal implementation for writing parquet files using parquet spec version 2.6. There should be no significant impact to the structure of files being written. There is one minor change: - all columns written to parquet files will be nullable (ie. the parquet repetition type is `OPTIONAL`)
craig bot
pushed a commit
that referenced
this issue
Jun 5, 2023
104234: importer: use new parquet writer for EXPORT INTO PARQUET r=miretskiy a=jayshrivastava ### roachtest: add test for EXPORT INTO PARQUET This change adds a rochtest which sets up a 3-node CRDB cluster on 8vCPU machines initialized with the TPC-C database containing 250 warehouses. Then, it executes 30 `EXPORT INTO PARQUET` statements concurrently, repeatedly for 10 minutes. The main purpose of this test is to use it for benchmarking. It sets up grafana as well so metrics can be observed during the test. This change also adds a daily roachtest which runs exports on a 100 warehouse TPCC database, without setting up grafana. Informs: #103317 Release note: None --- ### importer: use new parquet writer for EXPORT INTO PARQUET Previously, `EXPORT INTO PARQUET` used a library which had very poor memory usage. This change updates this command to use the new parquet writer in `util/parquet`, which is newer and does not use excessive memory. To ensure backwards compatability, as much testing code as possible is left untouched to ensure the old `EXPORT INTO PARQUET` behavior is the same when using the new library. Some of the the old production code and dependencies are also left untouched because they are used in the test code. This commit leaves some TODOs to clean up the tests and production code once we have confidence in the new library. Informs: #104329 Informs: #104278 Informs: #104279 Closes: #103317 Release note (general change): `EXPORT INTO PARQUET` will now use a new internal implementation for writing parquet files using parquet spec version 2.6. There should be no significant impact to the structure of files being written. There is one minor change: - all columns written to parquet files will be nullable (ie. the parquet repetition type is `OPTIONAL`) Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>
jayshrivastava
changed the title
util/parquet: integrate memory monitoring
export: integrate memory monitoring
Jun 7, 2023
Closing this as fixed by #104234. Memory used by datums should be accounted for by sql row processing, but allocs used by an external lib cannot really be tracked. The above PR uses a best effort approach. Also, EXPORT will be deprecated or limited to small sizes, so we don't need to make this perfect #105297. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-cdc
Change Data Capture
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-cdc
Right now, we read bytes from DistSQL and buffer bytes in the parquet library before writing them. We don't report the fact that we are buffering these bytes. EXPORT INTO PARQUET should report that these allocs are being held to memory monitors.
Jira issue: CRDB-28470
The text was updated successfully, but these errors were encountered: