Skip to content

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented Dec 2, 2015

This adds properties to set the min and max number of records that are passed between page checks, as well as a property that controls whether the next check will be based on records already seen or set to the minimum number of records between checks.

  • parquet.page.size.row.check.min - minimum number of records between page size checks
  • parquet.page.size.row.check.max - maximum number of records between page size checks
  • parquet.page.size.check.estimate - whether to estimate the number of records before the next check, or to always use the minimum number of records.

This also updates the internal API to use ParquetProperties to carry encoding settings (used in parquet-column) to reduce the number of parameters passed through internal APIs. It also adds a builder for ParquetProperties to avoid needing to reference defaults in other modules.

This closes #250

@danielcweeks
Copy link

I like this. ParquetProperties didn't really come into use until later which I think is why v1 is so convoluted.

This is a good refactor and will help to expose more configurability as we've been hardcoding many of these properties.

@rdblue
Copy link
Contributor Author

rdblue commented Dec 5, 2015

Thanks, Dan! Sorry to take so long getting back to this. I'll rebase and clean it up next week and we can get both PARQUET-99 (#250) and this one in for the 1.9.0 release.

@rdblue rdblue force-pushed the parquet-properties-update branch from 00478f9 to e072bb9 Compare December 7, 2015 22:58
@rdblue
Copy link
Contributor Author

rdblue commented Dec 7, 2015

Okay, I've cleaned up my changes on top of #250 and tests are now passing (at least locally).

I think the next step is to either cherry-pick e072bb9 into #250 or just update and merge this one. Doesn't matter to me which one. @danielcweeks, could you review my commit on top of yours?

@julienledem, you may want to have a look at the API changes as well.

Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

package protected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@julienledem
Copy link
Member

@rdblue This is a much needed cleanup and refactoring. Thanks for doing this! I made some comment above. Otherwise this looks good to me.

@rdblue rdblue force-pushed the parquet-properties-update branch from e072bb9 to c93b73e Compare December 8, 2015 17:59
@rdblue
Copy link
Contributor Author

rdblue commented Dec 8, 2015

Updated for Julien's comments. Thanks for reviewing!

@julienledem
Copy link
Member

+1

@rdblue rdblue changed the title WIP: Clean up internal APIs using ParquetProperties PARQUET-99: Add page size check properties Dec 8, 2015
@asfgit asfgit closed this in 5632640 Dec 8, 2015
piyushnarang pushed a commit to piyushnarang/parquet-mr that referenced this pull request Jun 15, 2016
This adds properties to set the min and max number of records that are passed between page checks, as well as a property that controls whether the next check will be based on records already seen or set to the minimum number of records between checks.

* `parquet.page.size.row.check.min` - minimum number of records between page size checks
* `parquet.page.size.row.check.max` - maximum number of records between page size checks
* `parquet.page.size.check.estimate` - whether to estimate the number of records before the next check, or to always use the minimum number of records.

This also updates the internal API to use ParquetProperties to carry encoding settings (used in parquet-column) to reduce the number of parameters passed through internal APIs. It also adds a builder for ParquetProperties to avoid needing to reference defaults in other modules.

This closes apache#250

Author: Daniel Weeks <dweeks@netflix.com>
Author: Ryan Blue <blue@apache.org>

Closes apache#297 from rdblue/parquet-properties-update and squashes the following commits:

c93b73e [Ryan Blue] PARQUET-99: Use ParquetProperties to carry encoding config.
18f8d3a [Daniel Weeks] Spacing
2090719 [Daniel Weeks] Update sizeCheck to write page properly if estimating is turned off
71336ee [Daniel Weeks] Fixed param name
5d99072 [Daniel Weeks] Update page size checking for v2 writer
3f7870c [Daniel Weeks] Rebase to resolve byte buffer conflicts
68794f0 [Daniel Weeks] Merge branch 'master' into page_size_check
b49f03c [Daniel Weeks] Fixed reset of nextSizeCheck
a057f46 [Daniel Weeks] Fixed inverted property logic
e7cd54b [Daniel Weeks] Added property to toggle page size check estimation and initial row size checking
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Jul 13, 2016
This adds properties to set the min and max number of records that are passed between page checks, as well as a property that controls whether the next check will be based on records already seen or set to the minimum number of records between checks.

* `parquet.page.size.row.check.min` - minimum number of records between page size checks
* `parquet.page.size.row.check.max` - maximum number of records between page size checks
* `parquet.page.size.check.estimate` - whether to estimate the number of records before the next check, or to always use the minimum number of records.

This also updates the internal API to use ParquetProperties to carry encoding settings (used in parquet-column) to reduce the number of parameters passed through internal APIs. It also adds a builder for ParquetProperties to avoid needing to reference defaults in other modules.

This closes apache#250

Author: Daniel Weeks <dweeks@netflix.com>
Author: Ryan Blue <blue@apache.org>

Closes apache#297 from rdblue/parquet-properties-update and squashes the following commits:

c93b73e [Ryan Blue] PARQUET-99: Use ParquetProperties to carry encoding config.
18f8d3a [Daniel Weeks] Spacing
2090719 [Daniel Weeks] Update sizeCheck to write page properly if estimating is turned off
71336ee [Daniel Weeks] Fixed param name
5d99072 [Daniel Weeks] Update page size checking for v2 writer
3f7870c [Daniel Weeks] Rebase to resolve byte buffer conflicts
68794f0 [Daniel Weeks] Merge branch 'master' into page_size_check
b49f03c [Daniel Weeks] Fixed reset of nextSizeCheck
a057f46 [Daniel Weeks] Fixed inverted property logic
e7cd54b [Daniel Weeks] Added property to toggle page size check estimation and initial row size checking

Conflicts:
	parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java
	parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreV1.java
	parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreV2.java
	parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV1.java
	parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV2.java
	parquet-column/src/test/java/org/apache/parquet/column/impl/TestCorruptDeltaByteArrays.java
	parquet-column/src/test/java/org/apache/parquet/column/mem/TestMemColumn.java
	parquet-column/src/test/java/org/apache/parquet/io/PerfTest.java
	parquet-column/src/test/java/org/apache/parquet/io/TestColumnIO.java
	parquet-column/src/test/java/org/apache/parquet/io/TestFiltered.java
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetRecordWriter.java
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java
	parquet-pig/src/test/java/org/apache/parquet/pig/TupleConsumerPerfTest.java
	parquet-thrift/src/test/java/org/apache/parquet/thrift/TestParquetReadProtocol.java
Resolution:
    Fixed changes that depended on the addition of an allocator argument
    Ignored adjacent changes that were flagged
    Passed page size at compressor instantiation instead of to the factory
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Jan 6, 2017
This adds properties to set the min and max number of records that are passed between page checks, as well as a property that controls whether the next check will be based on records already seen or set to the minimum number of records between checks.

* `parquet.page.size.row.check.min` - minimum number of records between page size checks
* `parquet.page.size.row.check.max` - maximum number of records between page size checks
* `parquet.page.size.check.estimate` - whether to estimate the number of records before the next check, or to always use the minimum number of records.

This also updates the internal API to use ParquetProperties to carry encoding settings (used in parquet-column) to reduce the number of parameters passed through internal APIs. It also adds a builder for ParquetProperties to avoid needing to reference defaults in other modules.

This closes apache#250

Author: Daniel Weeks <dweeks@netflix.com>
Author: Ryan Blue <blue@apache.org>

Closes apache#297 from rdblue/parquet-properties-update and squashes the following commits:

c93b73e [Ryan Blue] PARQUET-99: Use ParquetProperties to carry encoding config.
18f8d3a [Daniel Weeks] Spacing
2090719 [Daniel Weeks] Update sizeCheck to write page properly if estimating is turned off
71336ee [Daniel Weeks] Fixed param name
5d99072 [Daniel Weeks] Update page size checking for v2 writer
3f7870c [Daniel Weeks] Rebase to resolve byte buffer conflicts
68794f0 [Daniel Weeks] Merge branch 'master' into page_size_check
b49f03c [Daniel Weeks] Fixed reset of nextSizeCheck
a057f46 [Daniel Weeks] Fixed inverted property logic
e7cd54b [Daniel Weeks] Added property to toggle page size check estimation and initial row size checking

Conflicts:
	parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java
	parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreV1.java
	parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreV2.java
	parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV1.java
	parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV2.java
	parquet-column/src/test/java/org/apache/parquet/column/impl/TestCorruptDeltaByteArrays.java
	parquet-column/src/test/java/org/apache/parquet/column/mem/TestMemColumn.java
	parquet-column/src/test/java/org/apache/parquet/io/PerfTest.java
	parquet-column/src/test/java/org/apache/parquet/io/TestColumnIO.java
	parquet-column/src/test/java/org/apache/parquet/io/TestFiltered.java
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetRecordWriter.java
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java
	parquet-pig/src/test/java/org/apache/parquet/pig/TupleConsumerPerfTest.java
	parquet-thrift/src/test/java/org/apache/parquet/thrift/TestParquetReadProtocol.java
Resolution:
    Fixed changes that depended on the addition of an allocator argument
    Ignored adjacent changes that were flagged
    Passed page size at compressor instantiation instead of to the factory
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants