PARQUET-384: Add dictionary filtering (#330)
Conversation
```java
for (T entry : dictSet) {
  if (value.compareTo(entry) > 0) {
```
No, I think this is correct. The logic changes to > because the order is reversed, not because the logic is negated.
If V is the bound and we find any value, x, in the dictionary such that x < V, then there may be a matching row. But we call V.compareTo(x) and the order is reversed. We could call x.compareTo(V), but it seems like calling the method on V is more likely to result in something the JVM can optimize.
The tests also validate that this behavior is correct by testing the boundary conditions with the smallest and largest values.
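The comparison-order argument above can be made concrete with a small sketch. This is not the actual Parquet `DictionaryFilter` code; the class and method names (`canDrop`, `dictSet`) are illustrative. For a `col < V` predicate, the row group can be dropped only when no dictionary entry is smaller than the bound, and `value.compareTo(entry) > 0` is exactly the reversed-order test for `entry < value`:

```java
import java.util.Arrays;
import java.util.List;

public class LtDictionaryCheck {
    // Returns true if the row group can be dropped for the predicate col < value:
    // droppable only when NO dictionary entry is smaller than the bound.
    static <T extends Comparable<T>> boolean canDrop(T value, List<T> dictSet) {
        for (T entry : dictSet) {
            // value.compareTo(entry) > 0  <=>  entry < value
            if (value.compareTo(entry) > 0) {
                return false; // some row may match; cannot drop
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Integer> dict = Arrays.asList(5, 7, 9);
        System.out.println(canDrop(5, dict)); // true: no entry < 5 (boundary case)
        System.out.println(canDrop(6, dict)); // false: 5 < 6
    }
}
```

The boundary case (`value` equal to the smallest dictionary entry) is the one the tests mentioned above exercise: `compareTo` returns 0 there, so the group is still droppable.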
Force-pushed 4bf33cf to befe03e
I just added caching and a metadata check to the dictionary reader so that dictionary pages are only read once even if they are used in multiple predicates (which each call `readDictionary`).
```java
// avoid re-reading bytes if the dictionary reader is used after this call
if (nextDictionaryReader != null) {
  nextDictionaryReader.setRowGroup(currentRowGroup);
}
```
`advanceToNextBlock` resets `nextDictionaryReader`. Does that mean the dictionary reader is not available after the row group read?
`advanceToNextBlock` resets the next reader because the next block becomes the current block. This happens inside `skip` or `read`. The expected use is: get the dictionary reader for the next block, determine whether to skip or read it, do that operation, and then get the reader for the block after that. Skipping or reading causes the file reader to prepare the dictionary reader for the next block, which is what is happening here.
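The usage pattern described above can be modeled with a tiny stand-in reader. The names here (`ModelReader`, `getNextDictionaryReader`, `skipNextRowGroup`, `readNextRowGroup`) are illustrative, not the actual `ParquetFileReader` API; the point is only the protocol: the dictionary reader always describes the *next* row group, and skipping or reading advances it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class RowGroupLoop {
    // Minimal model of the reader state: a current row-group index that
    // advances on skip or read, resetting the "next" dictionary reader.
    static class ModelReader {
        private final int rowGroups;
        private int current = 0;

        ModelReader(int rowGroups) { this.rowGroups = rowGroups; }

        // dictionary info for the NEXT block, or null when exhausted
        String getNextDictionaryReader() {
            return current < rowGroups ? "dict-for-group-" + current : null;
        }

        // both operations make the next block current, so a subsequent
        // getNextDictionaryReader() call describes the block after it
        void skipNextRowGroup() { current++; }
        String readNextRowGroup() { return "rows-of-group-" + current++; }
    }

    // The expected loop: inspect the next block's dictionary, then skip or read.
    static List<String> readMatching(ModelReader reader, Predicate<String> keep) {
        List<String> out = new ArrayList<>();
        String dict;
        while ((dict = reader.getNextDictionaryReader()) != null) {
            if (keep.test(dict)) {
                out.add(reader.readNextRowGroup());
            } else {
                reader.skipNextRowGroup();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ModelReader reader = new ModelReader(3);
        // keep only groups whose (modeled) dictionary does not end in "1"
        System.out.println(readMatching(reader, d -> !d.endsWith("1")));
        // prints [rows-of-group-0, rows-of-group-2]
    }
}
```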
#332 is a follow-up to this and adds `EncodingStats` to avoid the brittle logic.
+1 The `ParquetFileReader` has some general issues with how it should be used, but I think this works really well for the dictionary filter and cleaning up some of the block filtering.
This updates the read path to rely more on the `ParquetFileReader` class, rather than externally filtering file blocks and passing other metadata into the internal reader to instantiate a `ParquetFileReader`. This also includes:
* Tests for binary, int32, int64, float, double
* Tests for eq, notEq, lt, ltEq, gt, gtEq (with boundary tests)
* Tests for non-dictionary columns and fallback columns
`DictionaryPageReader` will read the first page in a column for each call to `readDictionary`, which means that multiple predicates on a single column each read the dictionary again. This commit adds a cache for pages and a check for whether the column has any dictionary-encoded pages, to avoid unnecessary reads.
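A sketch of the caching described in this commit message, assuming a simplified shape: the real `DictionaryPageReader` caches per column chunk and checks the chunk's encodings, but the class, fields, and `hasDictionaryPages` logic below are illustrative stand-ins, not the actual Parquet code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachingDictionarySketch {
    static class DictionaryPage {
        final String column;
        DictionaryPage(String column) { this.column = column; }
    }

    private final Map<String, DictionaryPage> cache = new ConcurrentHashMap<>();
    private int reads = 0; // counts underlying page reads, for the example only

    // stands in for actually reading the first page of the column chunk
    private DictionaryPage readFirstPage(String column) {
        reads++;
        return new DictionaryPage(column);
    }

    // the real check inspects the column chunk's encodings; modeled as true here
    boolean hasDictionaryPages(String column) {
        return true;
    }

    DictionaryPage readDictionary(String column) {
        if (!hasDictionaryPages(column)) {
            return null; // skip the read entirely for non-dictionary columns
        }
        // memoize: later predicates on the same column reuse the cached page
        return cache.computeIfAbsent(column, this::readFirstPage);
    }

    int readCount() { return reads; }
}
```

With this in place, two predicates on the same column trigger one underlying read instead of two, which is the behavior the commit message describes.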
Force-pushed befe03e to ff89424
This builds on apache#286 from @danielcweeks and cleans up some of the interfaces. It introduces `DictionaryPageReadStore` to expose dictionary pages to the filters and cleans up some internal calls by passing `ParquetFileReader`. When committed, this closes apache#286.

Author: Ryan Blue <blue@apache.org>
Author: Daniel Weeks <dweeks@netflix.com>

Closes apache#330 from rdblue/PARQUET-384-add-dictionary-filtering and squashes the following commits:

ff89424 [Ryan Blue] PARQUET-384: Add a cache to DictionaryPageReader.
1f6861c [Ryan Blue] PARQUET-384: Use ParquetFileReader to initialize readers.
21ef4b6 [Daniel Weeks] PARQUET-384: Add dictionary row group filter.
This adds `EncodingStats`, which tracks the number of pages for each encoding, separated into dictionary and data pages. It also adds convenience functions that are useful for dictionary filtering, like `hasDictionaryEncodedPages` and `hasNonDictionaryEncodedPages`. `EncodingStats` has a unit test in parquet-column and an integration test in parquet-hadoop that writes a file and verifies the stats are present and correct when it is read. This includes commits from #330 because it updates the dictionary filter. I'll rebase and remove them once it is merged.

Author: Ryan Blue <blue@apache.org>

Closes #332 from rdblue/PARQUET-548-add-encoding-stats and squashes the following commits:

5f148e6 [Ryan Blue] PARQUET-548: Fixes for review comments.
dc332d3 [Ryan Blue] PARQUET-548: Add EncodingStats.