util/parquet: support required repetitions #104279
cc @cockroachdb/cdc
Previously, `EXPORT INTO PARQUET` used a library with very poor memory usage. This change updates the command to use the new parquet writer in `util/parquet`, which does not use excessive memory. To ensure backwards compatibility, as much testing code as possible is left untouched so that the old `EXPORT INTO PARQUET` behavior can be verified to be the same when using the new library. Some of the old production code and dependencies are also left untouched because they are used in the test code. This commit leaves some TODOs to clean up the tests and production code once we have confidence in the new library.

Informs: cockroachdb#104278
Informs: cockroachdb#104279
Closes: cockroachdb#103317

Release note (general change): `EXPORT INTO PARQUET` now uses a new internal implementation for writing parquet files using parquet spec version 2.6. There should be no significant impact to the structure of files being written. There is one minor change:
- all columns written to parquet files will be nullable (i.e. the parquet repetition type is `OPTIONAL`)
I'm not sure how important it is to support this feature. On the one hand: yes, if the database says "not null", it'd be nice
104234: importer: use new parquet writer for EXPORT INTO PARQUET r=miretskiy a=jayshrivastava

### roachtest: add test for EXPORT INTO PARQUET

This change adds a roachtest which sets up a 3-node CRDB cluster on 8vCPU machines initialized with the TPC-C database containing 250 warehouses. It then executes 30 `EXPORT INTO PARQUET` statements concurrently, repeatedly for 10 minutes. The main purpose of this test is benchmarking; it also sets up Grafana so metrics can be observed during the test. This change also adds a daily roachtest which runs exports on a 100-warehouse TPC-C database, without setting up Grafana.

Informs: #103317
Release note: None

---

### importer: use new parquet writer for EXPORT INTO PARQUET

Previously, `EXPORT INTO PARQUET` used a library with very poor memory usage. This change updates the command to use the new parquet writer in `util/parquet`, which does not use excessive memory. To ensure backwards compatibility, as much testing code as possible is left untouched so that the old `EXPORT INTO PARQUET` behavior can be verified to be the same when using the new library. Some of the old production code and dependencies are also left untouched because they are used in the test code. This commit leaves some TODOs to clean up the tests and production code once we have confidence in the new library.

Informs: #104329
Informs: #104278
Informs: #104279
Closes: #103317

Release note (general change): `EXPORT INTO PARQUET` now uses a new internal implementation for writing parquet files using parquet spec version 2.6. There should be no significant impact to the structure of files being written. There is one minor change:
- all columns written to parquet files will be nullable (i.e. the parquet repetition type is `OPTIONAL`)

Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>
Closing this. What we do with avro is to treat every field as optional; this allows flexibility over time.
By default the library treats all columns as nullable, even when they are not. It would be ideal to treat columns as not nullable (parquet repetition type `REQUIRED`) when the table schema specifies `NOT NULL`.
Jira issue: CRDB-28443
Epic CRDB-27372