PARQUET-99: Add page size check properties #297
I like this. ParquetProperties didn't really come into use until later, which I think is why v1 is so convoluted. This is a good refactor and will help expose more configurability, since we've been hardcoding many of these properties.

Thanks, Dan! Sorry to take so long getting back to this. I'll rebase and clean it up next week and we can get both PARQUET-99 (#250) and this one in for the 1.9.0 release.
00478f9 to e072bb9
Okay, I've cleaned up my changes on top of #250 and tests are now passing (at least locally). I think the next step is to either cherry-pick e072bb9 into #250 or just update and merge this one. Doesn't matter to me which one. @danielcweeks, could you review my commit on top of yours? @julienledem, you may want to have a look at the API changes as well. Thanks!
package protected?
Fixed
@rdblue This is a much-needed cleanup and refactoring. Thanks for doing this! I made some comments above. Otherwise this looks good to me.
e072bb9 to c93b73e
Updated for Julien's comments. Thanks for reviewing!

+1
This adds properties to set the min and max number of records that are passed between page checks, as well as a property that controls whether the next check will be based on records already seen or set to the minimum number of records between checks.

* `parquet.page.size.row.check.min` - minimum number of records between page size checks
* `parquet.page.size.row.check.max` - maximum number of records between page size checks
* `parquet.page.size.check.estimate` - whether to estimate the number of records before the next check, or to always use the minimum number of records.

This also updates the internal API to use ParquetProperties to carry encoding settings (used in parquet-column) to reduce the number of parameters passed through internal APIs. It also adds a builder for ParquetProperties to avoid needing to reference defaults in other modules.

This closes apache#250

Author: Daniel Weeks <dweeks@netflix.com>
Author: Ryan Blue <blue@apache.org>

Closes apache#297 from rdblue/parquet-properties-update and squashes the following commits:

c93b73e [Ryan Blue] PARQUET-99: Use ParquetProperties to carry encoding config.
18f8d3a [Daniel Weeks] Spacing
2090719 [Daniel Weeks] Update sizeCheck to write page properly if estimating is turned off
71336ee [Daniel Weeks] Fixed param name
5d99072 [Daniel Weeks] Update page size checking for v2 writer
3f7870c [Daniel Weeks] Rebase to resolve byte buffer conflicts
68794f0 [Daniel Weeks] Merge branch 'master' into page_size_check
b49f03c [Daniel Weeks] Fixed reset of nextSizeCheck
a057f46 [Daniel Weeks] Fixed inverted property logic
e7cd54b [Daniel Weeks] Added property to toggle page size check estimation and initial row size checking
Conflicts:
  parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java
  parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreV1.java
  parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreV2.java
  parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV1.java
  parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV2.java
  parquet-column/src/test/java/org/apache/parquet/column/impl/TestCorruptDeltaByteArrays.java
  parquet-column/src/test/java/org/apache/parquet/column/mem/TestMemColumn.java
  parquet-column/src/test/java/org/apache/parquet/io/PerfTest.java
  parquet-column/src/test/java/org/apache/parquet/io/TestColumnIO.java
  parquet-column/src/test/java/org/apache/parquet/io/TestFiltered.java
  parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java
  parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java
  parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetRecordWriter.java
  parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java
  parquet-pig/src/test/java/org/apache/parquet/pig/TupleConsumerPerfTest.java
  parquet-thrift/src/test/java/org/apache/parquet/thrift/TestParquetReadProtocol.java

Resolution:
  Fixed changes that depended on the addition of an allocator argument
  Ignored adjacent changes that were flagged
  Passed page size at compressor instantiation instead of to the factory
This adds properties to set the min and max number of records that are passed between page checks, as well as a property that controls whether the next check will be based on records already seen or set to the minimum number of records between checks.

* `parquet.page.size.row.check.min` - minimum number of records between page size checks
* `parquet.page.size.row.check.max` - maximum number of records between page size checks
* `parquet.page.size.check.estimate` - whether to estimate the number of records before the next check, or to always use the minimum number of records.

This also updates the internal API to use ParquetProperties to carry encoding settings (used in parquet-column) to reduce the number of parameters passed through internal APIs. It also adds a builder for ParquetProperties to avoid needing to reference defaults in other modules.

This closes #250
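
For readers who want a feel for how the three properties described above interact, here is a rough, self-contained sketch. It is not the code from this PR: the class `PageSizeCheckSketch`, the method `recordsUntilNextCheck`, the "check halfway to the projected page boundary" heuristic, and the fallback values `100`, `10000`, and `true` are all assumptions made for illustration. Only the three property keys come from the description.

```java
import java.util.Properties;

/** Illustrative sketch only; names and heuristic are assumptions, not the parquet-mr implementation. */
public class PageSizeCheckSketch {
  // Property keys as listed in the PR description.
  static final String MIN_KEY = "parquet.page.size.row.check.min";
  static final String MAX_KEY = "parquet.page.size.row.check.max";
  static final String ESTIMATE_KEY = "parquet.page.size.check.estimate";

  final long minRecords;
  final long maxRecords;
  final boolean estimate;

  public PageSizeCheckSketch(Properties conf) {
    // Fallback values are placeholders for this sketch, not documented defaults.
    minRecords = Long.parseLong(conf.getProperty(MIN_KEY, "100"));
    maxRecords = Long.parseLong(conf.getProperty(MAX_KEY, "10000"));
    estimate = Boolean.parseBoolean(conf.getProperty(ESTIMATE_KEY, "true"));
  }

  /** Returns how many more records to write before the page size is checked again. */
  public long recordsUntilNextCheck(long recordsInPage, long bytesBuffered, long pageSizeBytes) {
    if (!estimate || recordsInPage <= 0 || bytesBuffered <= 0) {
      // Estimation disabled, or nothing to extrapolate from: always use the minimum.
      return minRecords;
    }
    // Extrapolate from the records already seen in the current page.
    double bytesPerRecord = (double) bytesBuffered / recordsInPage;
    long remainingBytes = Math.max(0, pageSizeBytes - bytesBuffered);
    long recordsToFillPage = (long) (remainingBytes / bytesPerRecord);
    // Assumed heuristic: check halfway to the projected boundary, clamped to [min, max].
    long halfway = recordsToFillPage / 2;
    return Math.min(Math.max(halfway, minRecords), maxRecords);
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(MIN_KEY, "50");
    conf.setProperty(MAX_KEY, "5000");
    PageSizeCheckSketch check = new PageSizeCheckSketch(conf);
    // 1000 records in 100 KB so far, with a 1 MB page size target.
    System.out.println(check.recordsUntilNextCheck(1000, 100_000, 1_000_000));
  }
}
```

In this sketch, the min bound keeps the writer from checking on every record when a page is nearly full, and the max bound limits how far a bad estimate (for example, from highly compressible early records) can delay the next check. Setting `parquet.page.size.check.estimate` to `false` makes the interval constant at the minimum.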