-
Notifications
You must be signed in to change notification settings - Fork 1.5k
PARQUET-136: NPE thrown in StatisticsFilter when all values in a string/binary column trunk are null #99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ct read from file metadata is empty, and should return true for all nulls check for the column
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to have an empty row group without any records in it? In this case, a binary column also has null statistics values, and isAllNulls/hasNulls both return true, which is at least semantically wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we decide this is a bug in the write path, lets put a TODO here with a reference to a jira.
Hopefully the only way the stats can be empty is when the column is all nulls, but I wonder if that is true?
What about files that were written pre-statistics?
@julienledem do you know if there can be an empty rowgroup? Hopefully not...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should fix it in the write path and have column.getStatistics().getNumNulls() == column.getValueCount() in that case.
missing statistics should mean "the writer did not have that feature"
We should not have empty row groups. There was a bug in the past that produced empty row groups in some cases but this was fixed.
|
Would be great if a test case can be added :-) |
|
Thanks for fixing! So currently, in the case of writing a row group that is all nulls, we don't record the number of nulls? A test would be great as well. |
|
ping @egonina who might know about whether we should be writing stats for all null column chunks |
|
So actually, when the stats object is empty, I think we should just do no filtering -- we don't know if it's empty because of this corner case, or because the file simply has no stats. I think an aggressive early check for stats.isEmpty is all that we need here? |
|
I've filed: https://issues.apache.org/jira/browse/PARQUET-161 for the write path issue. I think the fix here on the read path is to check isEmpty and keep all chunks that have no statistics. |
|
For a column that is all nulls, we would count the number of nulls and put that in the stats object but the stats object is still marked as empty (markAsNotEmpty is not called down this path) (EDIT: and is not written). Looks like a bug, we should mark the stats as non-empty if the values are all nulls, since the column is non-empty. Then empty stats mean that we didn't collect any stats at all for the column. |
…es are nulls" This reverts commit c7c126f.
…ics object read from file metadata is empty, and should return true for all nulls check for the column" This reverts commit 974a22b.
…cs object is returned. Empty statistics can be read in case of 1. pre-statistics files, 2. files written from current writer that has a bug, as it does not write the statistics if column has all nulls
|
Hi all, Thanks for the feedback. Will add test case soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this comment is repeated, and we are going to fix the bug in the write path, lets just make this comment something like:
// we have no statistics available, we cannot drop any chunks
|
I've sent you a PR for your 'npe' branch: I think that gets us to a place where we don't need #103 and we'll have fixed both PARQUET-136 and PARQUET-161 in a way that is compatible with old files that were both written a) before stats were collected and b) while this bug was in effect. Let me know what you think. Thanks! |
|
@isnotinvain , thanks for all the help. Have merged your changes (reverted the test case revert because of the reasons mentioned in the in line comment). Its much cleaner now |
… is recorded in case of all nulls in a column
|
Added a test case. |
|
@isnotinvain , any updates on this one ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's this loop for if all we do is call getPath() and ignore the return value?
|
@saucam sorry for the slow reply thanks a lot! LGTM, minor comments but aside from that +1 Long term we probably need more comprehensive tests around statistics, but probably not in this PR :) |
|
Added changes for the comments. @isnotinvain , thanks ! |
|
+1, merge when the tests pass (to trigger the tests running again you might have to commit a whitespace change and then revert it, Travis can be flakey) cc @julienledem @tsdeng who can actually merge this |
This reverts commit 82209e6.
|
I should be able to merge this now, I'll try to do it today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't we drop it if it is all nulls in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we cannot just drop it, because the user defined predicate can be anything, we don't know beforehand. Also , if we don't keep this check, it will proceed to get min/max for the predicate, which can be nulls in case all values are nulls, so its best to just keep the chunk in case of all nulls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should have (and could) extend the UDP interface to receive not just the min/max, but also the null count. Currently the UDP interface only accepts min/max, and in the case of all nulls, there is no min/max.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also note that if we add num nulls there, then existing UDP's will have to check for null themselves or their behavior might change.
|
+1 besides one clarification question |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comparing null to anything else than null should always drop it. right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this filter is notEq filter , where we are looking for columnValue != (some non null value), so any row with null column should be returned and cannot be dropped.
Also I think the above check is redundant anyways, since we have a hasNulls() check above this one, so if there are any nulls, we will be returning 'false' and this check would never be reached.
maybe @isnotinvain can help ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets see...
case 1: v notEq(null) (line 121)
we are trying to keep all non-null records. If this is a chunk of only-nulls, it can be dropped, otherwise, it can't. This case returns either way (does not fall through).
case2 : v notEq(nonNull) (line 127)
we are trying to keep all records where v is not a particular (non-null) value. For example, v notEq(7). If this chunk contains nulls, there are some records to keep, because we decided that:
null notEq(7) is true, eg, we chose to make predicates work the way a programming language like java works (null != 7) instead of the way SQL or Pig works (nulls are viral, and dropped in all scenarios). It boils down to in programming languages we think of null as a value, and SQL think of it as "unknown."
This case falls through to the next unlike the case above.
@saucam is right, Line 134 to 137 (which I added to every operator) is unreachable in this case, we can delete it because to get here we've already confirmed that !stats.isEmpty() && !stats.hasNulls()
case3: v notEq(nonNullTarget) && there's a min + max
only case where we can drop the chunk is: min = max = nonNullTarget
eg, v notEq(7) and this column is a column of all 7s.
Thanks for catching the unreachable block!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, docs describing how nulls are handled: https://github.com/apache/incubator-parquet-mr/blob/master/parquet-column/src/main/java/parquet/filter2/predicate/FilterApi.java#L90
|
Same question as Katya. Otherwise this looks good to me. |
|
removed the unreachable block. @julienledem how is it now ? |
|
LGTM |
|
@saucam our merge process is a little strict, can you rename this PR: "PARQUET-136: NPE thrown in StatisticsFilter when all values in a string/binary column trunk are null" (note the all caps PARQUET) Our merge script requires this to deal with JIRA / how apache git vs github integration works. Thanks! |
|
done :) |
|
Done! Thanks! |
…ng/binary column trunk are null In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column. Even if column has no values, it can be ignored. The other way is to fix this behaviour in the writer, but is that what we want ? Author: Yash Datta <Yash.Datta@guavus.com> Author: Alex Levenson <alexlevenson@twitter.com> Author: Yash Datta <saucam@gmail.com> Closes apache#99 from saucam/npe and squashes the following commits: 5138e44 [Yash Datta] PARQUET-136: Remove unreachable block b17cd38 [Yash Datta] Revert "PARQUET-161: Trigger tests" 82209e6 [Yash Datta] PARQUET-161: Trigger tests aab2f81 [Yash Datta] PARQUET-161: Review comments for the test case 2217ee2 [Yash Datta] PARQUET-161: Add a test case for checking the correct statistics info is recorded in case of all nulls in a column c2f8d6f [Yash Datta] PARQUET-161: Fix the write path to write statistics object in case of only nulls in the column 97bb517 [Yash Datta] Revert "revert TestStatisticsFilter.java" a06f0d0 [Yash Datta] Merge pull request apache#1 from isnotinvain/alexlevenson/PARQUET-161-136 b1001eb [Alex Levenson] Fix statistics isEmpty, handle more edge cases in statistics filter 0c88be0 [Alex Levenson] revert TestStatisticsFilter.java 1ac9192 [Yash Datta] PARQUET-136: Its better to not filter chunks for which empty statistics object is returned. Empty statistics can be read in case of 1. pre-statistics files, 2. files written from current writer that has a bug, as it does not write the statistics if column has all nulls e5e924e [Yash Datta] Revert "PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column" 8cc5106 [Yash Datta] Revert "PARQUET-136: fix hasNulls to cater to the case where all values are nulls" c7c126f [Yash Datta] PARQUET-136: fix hasNulls to cater to the case where all values are nulls 974a22b [Yash Datta] PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column
…ng/binary column trunk are null In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column. Even if column has no values, it can be ignored. The other way is to fix this behaviour in the writer, but is that what we want ? Author: Yash Datta <Yash.Datta@guavus.com> Author: Alex Levenson <alexlevenson@twitter.com> Author: Yash Datta <saucam@gmail.com> Closes apache#99 from saucam/npe and squashes the following commits: 5138e44 [Yash Datta] PARQUET-136: Remove unreachable block b17cd38 [Yash Datta] Revert "PARQUET-161: Trigger tests" 82209e6 [Yash Datta] PARQUET-161: Trigger tests aab2f81 [Yash Datta] PARQUET-161: Review comments for the test case 2217ee2 [Yash Datta] PARQUET-161: Add a test case for checking the correct statistics info is recorded in case of all nulls in a column c2f8d6f [Yash Datta] PARQUET-161: Fix the write path to write statistics object in case of only nulls in the column 97bb517 [Yash Datta] Revert "revert TestStatisticsFilter.java" a06f0d0 [Yash Datta] Merge pull request #1 from isnotinvain/alexlevenson/PARQUET-161-136 b1001eb [Alex Levenson] Fix statistics isEmpty, handle more edge cases in statistics filter 0c88be0 [Alex Levenson] revert TestStatisticsFilter.java 1ac9192 [Yash Datta] PARQUET-136: Its better to not filter chunks for which empty statistics object is returned. Empty statistics can be read in case of 1. pre-statistics files, 2. files written from current writer that has a bug, as it does not write the statistics if column has all nulls e5e924e [Yash Datta] Revert "PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column" 8cc5106 [Yash Datta] Revert "PARQUET-136: fix hasNulls to cater to the case where all values are nulls" c7c126f [Yash Datta] PARQUET-136: fix hasNulls to cater to the case where all values are nulls 974a22b [Yash Datta] PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column
In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column. Even if column has no values, it can be ignored.
The other way is to fix this behaviour in the writer, but is that what we want ?