Filtering records across multiple blocks #1
Conversation
NULL tuples cause NPE when writing
@onlynone, I agree with you; however, I think the fix is still functionally correct. That's what I meant about ensuring `current` is updated correctly. Having said that, here's another fix that correctly updates `current`.
I think that Tom's fix is correct and a reasonable work-around for right now. But I'd rather get rid of the recursive call, because that will increase the stack depth for each filtered record. Here's a version that just loops until the internal reader starts returning non-null records again. It also checks that the total isn't going past the currently loaded limit, so there aren't conditions where it would loop infinitely.

```java
try {
  checkRead();
  currentValue = recordReader.read();
  current ++;
  // only happens with FilteredRecordReader at end of block
  while (currentValue == null && current < total && current <= totalCountLoadedSoFar) {
    checkRead();
    currentValue = recordReader.read();
    current ++;
  }
  if (DEBUG) LOG.debug("read value: " + currentValue);
} catch (RuntimeException e) {
  throw new ParquetDecodingException(format("Can not read value at %d in block %d in file %s", current, currentBlock, file), e);
}
```

Like you said, a real fix needs to correctly keep track of the records that are filtered out. How about adding a count accessor to `parquet.io.RecordReader`? That would be a quick fix, but I'd rather see a better contract with the record reader that strictly defines its behavior when it runs out of records and maybe keeps track internally. `Iterator` is a good inspiration.
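For illustration, a hypothetical sketch of what such an `Iterator`-inspired contract could look like; `CountingRecordReader` and `recordsConsumed` are made-up names, not actual parquet-mr API:

```java
// Hypothetical contract sketch, not parquet-mr code: the reader itself
// tracks how many underlying records it has consumed, including those
// dropped by the filter, so callers never have to infer that count from
// the number of read() calls that returned non-null.
public abstract class CountingRecordReader<T> {

  /** Underlying records consumed so far, whether or not they passed the filter. */
  protected long recordsConsumed = 0;

  /**
   * Returns the next record that passes the filter, or null when the
   * currently loaded rows are exhausted. A single call may consume any
   * number of underlying records; implementations must count each one
   * in recordsConsumed.
   */
  public abstract T read();

  /** The count accessor: lets callers advance their position by the real amount. */
  public long recordsConsumed() {
    return recordsConsumed;
  }
}
```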
Thanks for the review @rdblue. I agree that the minimal fix is the way to go to get this fixed in the short term; for one thing, changing (Filtered)RecordReader causes the semantic versioning plugin to complain. I've updated the minimal fix to avoid the recursive call as you suggested. See https://github.com/apache/incubator-parquet-mr/pull/9. It's slightly different to your code, since we need to take account of the case where there are no further non-null records, i.e. the while loop needs to return false for that case. I've added a test for that case, and also for the case where only the last block has a record that matches the filter.
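For illustration, a minimal sketch of that shape, reusing the field names from rdblue's snippet above and assuming `checkRead()` advances to the next block when the current one is exhausted; this is a paraphrase, not the actual patch in pull request #9:

```java
// Sketch only: loop until a record passes the filter, and report false
// once no further non-null record exists (instead of looping forever).
public boolean nextKeyValue() throws IOException {
  currentValue = null;
  // a null from read() only happens with FilteredRecordReader at the end of a block
  while (currentValue == null && current < total) {
    checkRead();                        // assumed to load the next block as needed
    currentValue = recordReader.read();
    current++;
  }
  return currentValue != null;          // false: no further matching records
}
```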
LGTM. |
Thanks for taking a look, Julien. I've opened PARQUET-9 for this. |
Update of the minimal fix discussed in https://github.com/apache/incubator-parquet-mr/pull/1, with the recursive call changed to a loop.

Author: Tom White <tom@cloudera.com>
Author: Steven Willis <swillis@compete.com>

Closes #9 from tomwhite/filtering-records-across-multiple-blocks and squashes the following commits:

afb08a4 [Tom White] Minimal fix
9e723ee [Steven Willis] Test for filtering records across multiple blocks
Was this included in https://github.com/apache/incubator-parquet-mr/pull/9?
@julienledem: yes. I think Tom had to create a new pull request because he couldn't push review changes to this one. |
This is fixed.
Thanks, guys!
Thank you @onlynone!
…ng/binary column trunk are null

In the case of all nulls in a binary column, the statistics object read from the file metadata is empty, and should return true for the all-nulls check on the column. Even if the column has no values, it can be ignored. The other way is to fix this behaviour in the writer, but is that what we want?

Author: Yash Datta <Yash.Datta@guavus.com>
Author: Alex Levenson <alexlevenson@twitter.com>
Author: Yash Datta <saucam@gmail.com>

Closes #99 from saucam/npe and squashes the following commits:

5138e44 [Yash Datta] PARQUET-136: Remove unreachable block
b17cd38 [Yash Datta] Revert "PARQUET-161: Trigger tests"
82209e6 [Yash Datta] PARQUET-161: Trigger tests
aab2f81 [Yash Datta] PARQUET-161: Review comments for the test case
2217ee2 [Yash Datta] PARQUET-161: Add a test case for checking the correct statistics info is recorded in case of all nulls in a column
c2f8d6f [Yash Datta] PARQUET-161: Fix the write path to write statistics object in case of only nulls in the column
97bb517 [Yash Datta] Revert "revert TestStatisticsFilter.java"
a06f0d0 [Yash Datta] Merge pull request #1 from isnotinvain/alexlevenson/PARQUET-161-136
b1001eb [Alex Levenson] Fix statistics isEmpty, handle more edge cases in statistics filter
0c88be0 [Alex Levenson] revert TestStatisticsFilter.java
1ac9192 [Yash Datta] PARQUET-136: Its better to not filter chunks for which empty statistics object is returned. Empty statistics can be read in case of 1. pre-statistics files, 2. files written from current writer that has a bug, as it does not write the statistics if column has all nulls
e5e924e [Yash Datta] Revert "PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column"
8cc5106 [Yash Datta] Revert "PARQUET-136: fix hasNulls to cater to the case where all values are nulls"
c7c126f [Yash Datta] PARQUET-136: fix hasNulls to cater to the case where all values are nulls
974a22b [Yash Datta] PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column
...thod

Author: Alex Levenson <alexlevenson@twitter.com>
Author: Konstantin Shaposhnikov <Konstantin.Shaposhnikov@sc.com>
Author: kostya-sh <kostya-sh@users.noreply.github.com>

Closes #171 from kostya-sh/PARQUET-246 and squashes the following commits:

75950c5 [kostya-sh] Merge pull request #1 from isnotinvain/PR-171
a718309 [Konstantin Shaposhnikov] Merge remote-tracking branch 'refs/remotes/origin/master' into PARQUET-246
0367588 [Alex Levenson] Add regression test for PR-171
94e8fda [Alex Levenson] Merge branch 'master' into PR-171
0a9ac9f [Konstantin Shaposhnikov] [PARQUET-246] bugfix: reset all DeltaByteArrayWriter state in reset() method
In response to PARQUET-251, created an integration test that generates random values and compares the statistics against the values read from a Parquet file. There are two tools classes, `DataGenerationContext` and `RandomValueGenerators`, which are located in the same package as the unit test. I'm sure there is a better place to put these, but I leave that to your discretion. Thanks, Reuben

Author: Reuben Kuhnert <sircodesalot@gmail.com>
Author: Ryan Blue <blue@apache.org>

Closes #255 from sircodesalotOfTheRound/stats-validation and squashes the following commits:

680e96a [Reuben Kuhnert] Merge pull request #1 from rdblue/PARQUET-355-stats-validation-tests
9f0033f [Ryan Blue] PARQUET-355: Use ColumnReaderImpl.
7d0b4fe [Reuben Kuhnert] PARQUET-355: Add Statistics Validation Test
Copied from: https://github.com/Parquet/parquet-mr/pull/413
However, as tomwhite mentioned, there might be a better way to do this.
I had also written this: `current ++;` still doesn't seem correct even when `currentValue != null`. Imagine a block with 100 records, but only the record at position 50 matches our filter. In this case, the first time `nextKeyValue()` is called, it will call `recordReader.read()`, which will successfully find the record at pos 50, but `current` will just be incremented to 1.
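To make that mismatch concrete, here is a small, self-contained illustration; `MockFilteredReader`, `BLOCK_SIZE`, and `MATCH_POSITION` are hypothetical stand-ins, not parquet-mr code:

```java
// Hypothetical demo of the counting mismatch described above.
// MockFilteredReader stands in for FilteredRecordReader: one read() call
// skips every non-matching record, so a caller that increments its own
// counter once per call drifts away from the reader's real position.
final class CountingMismatchDemo {
  static final int BLOCK_SIZE = 100;
  static final int MATCH_POSITION = 50;

  static class MockFilteredReader {
    int underlyingPosition = 0; // records actually consumed from the block

    Integer read() {
      while (underlyingPosition < BLOCK_SIZE) {
        int pos = underlyingPosition++;
        if (pos == MATCH_POSITION) {
          return pos; // the only record that passes the filter
        }
      }
      return null; // end of block: everything else was filtered out
    }
  }

  public static void main(String[] args) {
    MockFilteredReader reader = new MockFilteredReader();
    int current = 0; // incremented once per read() call, as in the buggy code
    Integer value = reader.read();
    current++;
    // Prints: value=50 current=1 underlying=51
    // `current` claims one record was consumed; the reader consumed 51.
    System.out.println("value=" + value
        + " current=" + current
        + " underlying=" + reader.underlyingPosition);
  }
}
```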