Skip to content

PARQUET-384: Add dictionary filtering.#330

Closed
rdblue wants to merge 3 commits intoapache:masterfrom
rdblue:PARQUET-384-add-dictionary-filtering
Closed

PARQUET-384: Add dictionary filtering.#330
rdblue wants to merge 3 commits intoapache:masterfrom
rdblue:PARQUET-384-add-dictionary-filtering

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented Feb 25, 2016

This builds on #286 from @danielcweeks and cleans up some of the interfaces. It introduces DictionaryPageReadStore to expose dictionary pages to the filters and cleans up some internal calls by passing ParquetFileReader.

When committed, this closes #286.

}

for(T entry : dictSet) {
if(value.compareTo(entry) > 0) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be >=?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think this is correct. The logic changes to > because the order is reversed, not because the logic is negated.

If V is the bound and we find any value, x, in the dictionary such that x < V, then there may be a matching row. But we call V.compareTo(x) and the order is reversed. We could call x.compareTo(V), but it seems like calling the method on V is more likely to result in something the JVM can optimize.

The tests also validate that this behavior is correct by testing the boundary conditions with the smallest and largest values.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're right.

@rdblue rdblue force-pushed the PARQUET-384-add-dictionary-filtering branch from 4bf33cf to befe03e Compare February 25, 2016 20:28
@rdblue
Copy link
Contributor Author

rdblue commented Feb 25, 2016

I just added caching and a metadata check to the dictionary reader so that dictionary pages are only read once even if used in multiple predicates (which each call readDictionary).

// avoid re-reading bytes the dictionary reader is used after this call
if (nextDictionaryReader != null) {
nextDictionaryReader.setRowGroup(currentRowGroup);
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

advanceToNextBlock resets nextDictionaryReader. Does that mean the dictionary reader is not available after the row group read?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

advanceToNextBlock resets the next reader because the next block becomes the current block. This happens inside skip or read. The expected use is to get the dictionary reader for the next block, determine whether to skip or read it, do that operation, and then get the reader for the next block. Skipping or reading should cause the file reader to prepare to return the dictionary reader for the next, which is what is happening here.

@rdblue
Copy link
Contributor Author

rdblue commented Feb 26, 2016

#332 is a follow-up to this and adds EncodingStats so to avoid the brittle logic in hasNonDictionaryEncodedPages.

@danielcweeks
Copy link

+1 The ParquetFileReader has some general issues with how it should be used, but I think this works really well for the dictionary filter and cleaning up some of the block filtering.

Daniel Weeks and others added 3 commits March 9, 2016 13:20
This updates the read path to rely more on the ParquetFileReader class,
rather than externally filtering file blocks and passing other metadata
into the internal reader to instantiate a ParquetFileReader.

This also includes:
* Tests for binary, int32, int64, float, double
* Tests for eq, notEq, lt, ltEq, gt, gtEq (with boundary tests)
* Tests for non-dictionary columns and fallback columns
DictionaryPageReader will read the first page in a column for each call
to readDictionary, which means that multiple predicates for a single
column to read the dictionary for each predicate. This commit adds a
cache for pages and adds a check to see if the column has any
dictionary-encoded pages to avoid unnecessary reads.
@rdblue rdblue force-pushed the PARQUET-384-add-dictionary-filtering branch from befe03e to ff89424 Compare March 9, 2016 21:20
@asfgit asfgit closed this in 4b1ff8f Mar 9, 2016
coughman pushed a commit to coughman/incubator-parquet-mr that referenced this pull request Apr 18, 2016
This builds on apache#286 from @danielcweeks and cleans up some of the interfaces. It introduces `DictionaryPageReadStore` to expose dictionary pages to the filters and cleans up some internal calls by passing `ParquetFileReader`.

When committed, this closes apache#286.

Author: Ryan Blue <blue@apache.org>
Author: Daniel Weeks <dweeks@netflix.com>

Closes apache#330 from rdblue/PARQUET-384-add-dictionary-filtering and squashes the following commits:

ff89424 [Ryan Blue] PARQUET-384: Add a cache to DictionaryPageReader.
1f6861c [Ryan Blue] PARQUET-384: Use ParquetFileReader to initialize readers.
21ef4b6 [Daniel Weeks] PARQUET-384: Add dictionary row group filter.
asfgit pushed a commit that referenced this pull request Apr 23, 2016
This adds `EncodingStats`, which tracks the number of pages for each encoding, separated into dictionary and data pages. It also adds convenience functions that are useful for dictionary filtering, like `hasDictionaryEncodedPages` and `hasNonDictionaryEncodedPages`.

`EncodingStats` have a unit test in parquet-column and an integration test in parquet-hadoop that writes a file and verifies the stats are present and correct when it is read.

This includes commits from #330 because it updates the dictionary filter. I'll rebase and remove them once it is merged.

Author: Ryan Blue <blue@apache.org>

Closes #332 from rdblue/PARQUET-548-add-encoding-stats and squashes the following commits:

5f148e6 [Ryan Blue] PARQUET-548: Fixes for review comments.
dc332d3 [Ryan Blue] PARQUET-548: Add EncodingStats.
piyushnarang pushed a commit to piyushnarang/parquet-mr that referenced this pull request Jun 15, 2016
This builds on apache#286 from @danielcweeks and cleans up some of the interfaces. It introduces `DictionaryPageReadStore` to expose dictionary pages to the filters and cleans up some internal calls by passing `ParquetFileReader`.

When committed, this closes apache#286.

Author: Ryan Blue <blue@apache.org>
Author: Daniel Weeks <dweeks@netflix.com>

Closes apache#330 from rdblue/PARQUET-384-add-dictionary-filtering and squashes the following commits:

ff89424 [Ryan Blue] PARQUET-384: Add a cache to DictionaryPageReader.
1f6861c [Ryan Blue] PARQUET-384: Use ParquetFileReader to initialize readers.
21ef4b6 [Daniel Weeks] PARQUET-384: Add dictionary row group filter.
piyushnarang pushed a commit to piyushnarang/parquet-mr that referenced this pull request Jun 15, 2016
This adds `EncodingStats`, which tracks the number of pages for each encoding, separated into dictionary and data pages. It also adds convenience functions that are useful for dictionary filtering, like `hasDictionaryEncodedPages` and `hasNonDictionaryEncodedPages`.

`EncodingStats` have a unit test in parquet-column and an integration test in parquet-hadoop that writes a file and verifies the stats are present and correct when it is read.

This includes commits from apache#330 because it updates the dictionary filter. I'll rebase and remove them once it is merged.

Author: Ryan Blue <blue@apache.org>

Closes apache#332 from rdblue/PARQUET-548-add-encoding-stats and squashes the following commits:

5f148e6 [Ryan Blue] PARQUET-548: Fixes for review comments.
dc332d3 [Ryan Blue] PARQUET-548: Add EncodingStats.
rdblue added a commit to rdblue/parquet-mr that referenced this pull request Jul 13, 2016
This builds on apache#286 from @danielcweeks and cleans up some of the interfaces. It introduces `DictionaryPageReadStore` to expose dictionary pages to the filters and cleans up some internal calls by passing `ParquetFileReader`.

When committed, this closes apache#286.

Author: Ryan Blue <blue@apache.org>
Author: Daniel Weeks <dweeks@netflix.com>

Closes apache#330 from rdblue/PARQUET-384-add-dictionary-filtering and squashes the following commits:

ff89424 [Ryan Blue] PARQUET-384: Add a cache to DictionaryPageReader.
1f6861c [Ryan Blue] PARQUET-384: Use ParquetFileReader to initialize readers.
21ef4b6 [Daniel Weeks] PARQUET-384: Add dictionary row group filter.

Conflicts:
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
Resolution:
    Removed unnecessary allocator args and fields
    Minor changes for using the codec API
rdblue added a commit to rdblue/parquet-mr that referenced this pull request Jul 13, 2016
This adds `EncodingStats`, which tracks the number of pages for each encoding, separated into dictionary and data pages. It also adds convenience functions that are useful for dictionary filtering, like `hasDictionaryEncodedPages` and `hasNonDictionaryEncodedPages`.

`EncodingStats` have a unit test in parquet-column and an integration test in parquet-hadoop that writes a file and verifies the stats are present and correct when it is read.

This includes commits from apache#330 because it updates the dictionary filter. I'll rebase and remove them once it is merged.

Author: Ryan Blue <blue@apache.org>

Closes apache#332 from rdblue/PARQUET-548-add-encoding-stats and squashes the following commits:

5f148e6 [Ryan Blue] PARQUET-548: Fixes for review comments.
dc332d3 [Ryan Blue] PARQUET-548: Add EncodingStats.

Conflicts:
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java
Resolution:
    Minor formatting changes conflicted with wrapping encodings in a HashSet.
rdblue added a commit to rdblue/parquet-mr that referenced this pull request Jan 6, 2017
This builds on apache#286 from @danielcweeks and cleans up some of the interfaces. It introduces `DictionaryPageReadStore` to expose dictionary pages to the filters and cleans up some internal calls by passing `ParquetFileReader`.

When committed, this closes apache#286.

Author: Ryan Blue <blue@apache.org>
Author: Daniel Weeks <dweeks@netflix.com>

Closes apache#330 from rdblue/PARQUET-384-add-dictionary-filtering and squashes the following commits:

ff89424 [Ryan Blue] PARQUET-384: Add a cache to DictionaryPageReader.
1f6861c [Ryan Blue] PARQUET-384: Use ParquetFileReader to initialize readers.
21ef4b6 [Daniel Weeks] PARQUET-384: Add dictionary row group filter.

Conflicts:
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
Resolution:
    Removed unnecessary allocator args and fields
    Minor changes for using the codec API
rdblue added a commit to rdblue/parquet-mr that referenced this pull request Jan 6, 2017
This adds `EncodingStats`, which tracks the number of pages for each encoding, separated into dictionary and data pages. It also adds convenience functions that are useful for dictionary filtering, like `hasDictionaryEncodedPages` and `hasNonDictionaryEncodedPages`.

`EncodingStats` have a unit test in parquet-column and an integration test in parquet-hadoop that writes a file and verifies the stats are present and correct when it is read.

This includes commits from apache#330 because it updates the dictionary filter. I'll rebase and remove them once it is merged.

Author: Ryan Blue <blue@apache.org>

Closes apache#332 from rdblue/PARQUET-548-add-encoding-stats and squashes the following commits:

5f148e6 [Ryan Blue] PARQUET-548: Fixes for review comments.
dc332d3 [Ryan Blue] PARQUET-548: Add EncodingStats.

Conflicts:
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java
Resolution:
    Minor formatting changes conflicted with wrapping encodings in a HashSet.
rdblue added a commit to rdblue/parquet-mr that referenced this pull request Jan 10, 2017
This builds on apache#286 from @danielcweeks and cleans up some of the interfaces. It introduces `DictionaryPageReadStore` to expose dictionary pages to the filters and cleans up some internal calls by passing `ParquetFileReader`.

When committed, this closes apache#286.

Author: Ryan Blue <blue@apache.org>
Author: Daniel Weeks <dweeks@netflix.com>

Closes apache#330 from rdblue/PARQUET-384-add-dictionary-filtering and squashes the following commits:

ff89424 [Ryan Blue] PARQUET-384: Add a cache to DictionaryPageReader.
1f6861c [Ryan Blue] PARQUET-384: Use ParquetFileReader to initialize readers.
21ef4b6 [Daniel Weeks] PARQUET-384: Add dictionary row group filter.

Conflicts:
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
Resolution:
    Removed unnecessary allocator args and fields
    Minor changes for using the codec API
rdblue added a commit to rdblue/parquet-mr that referenced this pull request Jan 10, 2017
This adds `EncodingStats`, which tracks the number of pages for each encoding, separated into dictionary and data pages. It also adds convenience functions that are useful for dictionary filtering, like `hasDictionaryEncodedPages` and `hasNonDictionaryEncodedPages`.

`EncodingStats` have a unit test in parquet-column and an integration test in parquet-hadoop that writes a file and verifies the stats are present and correct when it is read.

This includes commits from apache#330 because it updates the dictionary filter. I'll rebase and remove them once it is merged.

Author: Ryan Blue <blue@apache.org>

Closes apache#332 from rdblue/PARQUET-548-add-encoding-stats and squashes the following commits:

5f148e6 [Ryan Blue] PARQUET-548: Fixes for review comments.
dc332d3 [Ryan Blue] PARQUET-548: Add EncodingStats.

Conflicts:
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java
Resolution:
    Minor formatting changes conflicted with wrapping encodings in a HashSet.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants