PARQUET-384: Add dictionary filtering (#330)
Conversation
```java
for (T entry : dictSet) {
  if (value.compareTo(entry) > 0) {
```
No, I think this is correct. The logic changes to > because the order is reversed, not because the logic is negated.
If V is the bound and we find any value, x, in the dictionary such that x < V, then there may be a matching row. But we call V.compareTo(x) and the order is reversed. We could call x.compareTo(V), but it seems like calling the method on V is more likely to result in something the JVM can optimize.
The tests also validate that this behavior is correct by testing the boundary conditions with the smallest and largest values.
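The comparison-order argument above can be made concrete with a small sketch. This is not the actual Parquet `DictionaryFilter` code; the class and method names (`canDrop`, `dictSet`) are illustrative. For a `col < V` predicate, the row group can be dropped only when no dictionary entry is smaller than the bound, and `value.compareTo(entry) > 0` is exactly the reversed-order test for `entry < value`:

```java
import java.util.Arrays;
import java.util.List;

public class LtDictionaryCheck {
    // Returns true if the row group can be dropped for the predicate col < value:
    // droppable only when NO dictionary entry is smaller than the bound.
    static <T extends Comparable<T>> boolean canDrop(T value, List<T> dictSet) {
        for (T entry : dictSet) {
            // value.compareTo(entry) > 0  <=>  entry < value
            if (value.compareTo(entry) > 0) {
                return false; // some row may match; cannot drop
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Integer> dict = Arrays.asList(5, 7, 9);
        System.out.println(canDrop(5, dict)); // true: no entry < 5 (boundary case)
        System.out.println(canDrop(6, dict)); // false: 5 < 6
    }
}
```

The boundary case (`value` equal to the smallest dictionary entry) is the one the tests mentioned above exercise: `compareTo` returns 0 there, so the group is still droppable.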
Force-pushed 4bf33cf to befe03e
I just added caching and a metadata check to the dictionary reader so that dictionary pages are only read once even if they are used in multiple predicates (which each call `readDictionary`).
```java
// avoid re-reading bytes if the dictionary reader is used after this call
if (nextDictionaryReader != null) {
  nextDictionaryReader.setRowGroup(currentRowGroup);
}
```
`advanceToNextBlock` resets `nextDictionaryReader`. Does that mean the dictionary reader is not available after the row group read?
`advanceToNextBlock` resets the next reader because the next block becomes the current block. This happens inside `skip` or `read`. The expected use is: get the dictionary reader for the next block, determine whether to skip or read it, do that operation, and then get the reader for the block after that. Skipping or reading causes the file reader to prepare the dictionary reader for the next block, which is what is happening here.
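The usage pattern described above can be modeled with a tiny stand-in reader. The names here (`ModelReader`, `getNextDictionaryReader`, `skipNextRowGroup`, `readNextRowGroup`) are illustrative, not the actual `ParquetFileReader` API; the point is only the protocol: the dictionary reader always describes the *next* row group, and skipping or reading advances it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class RowGroupLoop {
    // Minimal model of the reader state: a current row-group index that
    // advances on skip or read, resetting the "next" dictionary reader.
    static class ModelReader {
        private final int rowGroups;
        private int current = 0;

        ModelReader(int rowGroups) { this.rowGroups = rowGroups; }

        // dictionary info for the NEXT block, or null when exhausted
        String getNextDictionaryReader() {
            return current < rowGroups ? "dict-for-group-" + current : null;
        }

        // both operations make the next block current, so a subsequent
        // getNextDictionaryReader() call describes the block after it
        void skipNextRowGroup() { current++; }
        String readNextRowGroup() { return "rows-of-group-" + current++; }
    }

    // The expected loop: inspect the next block's dictionary, then skip or read.
    static List<String> readMatching(ModelReader reader, Predicate<String> keep) {
        List<String> out = new ArrayList<>();
        String dict;
        while ((dict = reader.getNextDictionaryReader()) != null) {
            if (keep.test(dict)) {
                out.add(reader.readNextRowGroup());
            } else {
                reader.skipNextRowGroup();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ModelReader reader = new ModelReader(3);
        // keep only groups whose (modeled) dictionary does not end in "1"
        System.out.println(readMatching(reader, d -> !d.endsWith("1")));
        // prints [rows-of-group-0, rows-of-group-2]
    }
}
```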
#332 is a follow-up to this and adds `EncodingStats` to avoid the brittle logic.
+1 The `ParquetFileReader` has some general issues with how it should be used, but I think this works really well for the dictionary filter and cleaning up some of the block filtering.
This updates the read path to rely more on the `ParquetFileReader` class, rather than externally filtering file blocks and passing other metadata into the internal reader to instantiate a `ParquetFileReader`. This also includes:
* Tests for binary, int32, int64, float, double
* Tests for eq, notEq, lt, ltEq, gt, gtEq (with boundary tests)
* Tests for non-dictionary columns and fallback columns
`DictionaryPageReader` will read the first page in a column for each call to `readDictionary`, which means that multiple predicates on a single column each read the dictionary again. This commit adds a cache for pages and a check for whether the column has any dictionary-encoded pages, to avoid unnecessary reads.
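A sketch of the caching described in this commit message, assuming a simplified shape: the real `DictionaryPageReader` caches per column chunk and checks the chunk's encodings, but the class, fields, and `hasDictionaryPages` logic below are illustrative stand-ins, not the actual Parquet code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachingDictionarySketch {
    static class DictionaryPage {
        final String column;
        DictionaryPage(String column) { this.column = column; }
    }

    private final Map<String, DictionaryPage> cache = new ConcurrentHashMap<>();
    private int reads = 0; // counts underlying page reads, for the example only

    // stands in for actually reading the first page of the column chunk
    private DictionaryPage readFirstPage(String column) {
        reads++;
        return new DictionaryPage(column);
    }

    // the real check inspects the column chunk's encodings; modeled as true here
    boolean hasDictionaryPages(String column) {
        return true;
    }

    DictionaryPage readDictionary(String column) {
        if (!hasDictionaryPages(column)) {
            return null; // skip the read entirely for non-dictionary columns
        }
        // memoize: later predicates on the same column reuse the cached page
        return cache.computeIfAbsent(column, this::readFirstPage);
    }

    int readCount() { return reads; }
}
```

With this in place, two predicates on the same column trigger one underlying read instead of two, which is the behavior the commit message describes.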
Force-pushed befe03e to ff89424
This builds on apache#286 from @danielcweeks and cleans up some of the interfaces. It introduces `DictionaryPageReadStore` to expose dictionary pages to the filters and cleans up some internal calls by passing `ParquetFileReader`. When committed, this closes apache#286.

Author: Ryan Blue <blue@apache.org>
Author: Daniel Weeks <dweeks@netflix.com>

Closes apache#330 from rdblue/PARQUET-384-add-dictionary-filtering and squashes the following commits:

ff89424 [Ryan Blue] PARQUET-384: Add a cache to DictionaryPageReader.
1f6861c [Ryan Blue] PARQUET-384: Use ParquetFileReader to initialize readers.
21ef4b6 [Daniel Weeks] PARQUET-384: Add dictionary row group filter.
This adds `EncodingStats`, which tracks the number of pages for each encoding, separated into dictionary and data pages. It also adds convenience functions that are useful for dictionary filtering, like `hasDictionaryEncodedPages` and `hasNonDictionaryEncodedPages`. `EncodingStats` has a unit test in parquet-column and an integration test in parquet-hadoop that writes a file and verifies the stats are present and correct when it is read. This includes commits from #330 because it updates the dictionary filter. I'll rebase and remove them once it is merged.

Author: Ryan Blue <blue@apache.org>

Closes #332 from rdblue/PARQUET-548-add-encoding-stats and squashes the following commits:

5f148e6 [Ryan Blue] PARQUET-548: Fixes for review comments.
dc332d3 [Ryan Blue] PARQUET-548: Add EncodingStats.