PARQUET-136: NPE thrown in StatisticsFilter when all values in a string/binary column trunk are null #99

saucam · 2015-01-07T13:35:48Z

In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column. Even if column has no values, it can be ignored.

The other way is to fix this behaviour in the writer, but is that what we want ?

…ct read from file metadata is empty, and should return true for all nulls check for the column

…ulls

liancheng · 2015-01-08T06:50:48Z

parquet-hadoop/src/main/java/parquet/filter2/statisticslevel/StatisticsFilter.java

Is it possible to have an empty row group without any records in it? In this case, a binary column also has null statistics values, and isAllNulls/hasNulls both return true, which is at least semantically wrong.

If we decide this is a bug in the write path, lets put a TODO here with a reference to a jira.
Hopefully the only way the stats can be empty is when the column is all nulls, but I wonder if that is true?
What about files that were written pre-statistics?

@julienledem do you know if there can be an empty rowgroup? Hopefully not...

I think we should fix it in the write path and have column.getStatistics().getNumNulls() == column.getValueCount() in that case.
missing statistics should mean "the writer did not have that feature"
We should not have empty row groups. There was a bug in the past that produced empty row groups in some cases but this was fixed.

liancheng · 2015-01-08T06:51:41Z

Would be great if a test case can be added :-)

isnotinvain · 2015-01-08T19:37:17Z

Thanks for fixing!

So currently, in the case of writing a row group that is all nulls, we don't record the number of nulls?
That does sound like a bug in the write path to me. @julienledem what do you think?

A test would be great as well.

isnotinvain · 2015-01-08T19:41:50Z

ping @egonina who might know about whether we should be writing stats for all null column chunks

isnotinvain · 2015-01-08T19:43:53Z

So actually, when the stats object is empty, I think we should just do no filtering -- we don't know if it's empty because of this corner case, or because the file simply has no stats.

I think an aggressive early check for stats.isEmpty is all that we need here?

isnotinvain · 2015-01-08T19:46:46Z

I've filed: https://issues.apache.org/jira/browse/PARQUET-161 for the write path issue.

I think the fix here on the read path is to check isEmpty and keep all chunks that have no statistics.

egonina · 2015-01-08T21:23:04Z

For a column that is all nulls, we would count the number of nulls and put that in the stats object but the stats object is still marked as empty (markAsNotEmpty is not called down this path) (EDIT: and is not written). Looks like a bug, we should mark the stats as non-empty if the values are all nulls, since the column is non-empty. Then empty stats mean that we didn't collect any stats at all for the column.

…es are nulls" This reverts commit c7c126f.

…ics object read from file metadata is empty, and should return true for all nulls check for the column" This reverts commit 974a22b.

…cs object is returned. Empty statistics can be read in case of 1. pre-statistics files, 2. files written from current writer that has a bug, as it does not write the statistics if column has all nulls

saucam · 2015-01-09T13:14:33Z

Hi all,

Thanks for the feedback.
@isnotinvain yes the statistics object is not written in case there are only nulls. I agree it is better to not filter in case of statistics object being empty.
@egonina thanks for pointing out the omits.
Made the relevant changes.

Will add test case soon.

isnotinvain · 2015-01-09T22:15:19Z

parquet-hadoop/src/main/java/parquet/filter2/statisticslevel/StatisticsFilter.java

since this comment is repeated, and we are going to fix the bug in the write path, lets just make this comment something like:
// we have no statistics available, we cannot drop any chunks

isnotinvain · 2015-01-09T23:08:22Z

I've sent you a PR for your 'npe' branch:
https://github.com/saucam/incubator-parquet-mr/pull/1/files

I think that gets us to a place where we don't need #103 and we'll have fixed both PARQUET-136 and PARQUET-161 in a way that is compatible with old files that were both written a) before stats were collected and b) while this bug was in effect.

Let me know what you think. Thanks!

Some updates for PARQUET 161 + 136

This reverts commit 0c88be0.

… only nulls in the column

saucam · 2015-01-11T07:39:41Z

@isnotinvain , thanks for all the help. Have merged your changes (reverted the test case revert because of the reasons mentioned in the in line comment). Its much cleaner now
how does it look ?

… is recorded in case of all nulls in a column

saucam · 2015-01-11T15:24:51Z

Added a test case.

saucam · 2015-01-16T01:56:55Z

@isnotinvain , any updates on this one ?

isnotinvain · 2015-01-21T00:46:18Z

parquet-hadoop/src/test/java/parquet/hadoop/TestParquetFileWriter.java

what's this loop for if all we do is call getPath() and ignore the return value?

isnotinvain · 2015-01-21T00:48:30Z

@saucam sorry for the slow reply thanks a lot!

LGTM, minor comments but aside from that +1

Long term we probably need more comprehensive tests around statistics, but probably not in this PR :)

saucam · 2015-01-21T02:35:43Z

Added changes for the comments.

@isnotinvain , thanks !

isnotinvain · 2015-01-21T07:33:56Z

+1, merge when the tests pass (to trigger the tests running again you might have to commit a whitespace change and then revert it, Travis can be flakey)

cc @julienledem @tsdeng who can actually merge this

This reverts commit 82209e6.

saucam · 2015-01-23T10:13:48Z

cc @julienledem @tsdeng

isnotinvain · 2015-01-23T20:58:43Z

I should be able to merge this now, I'll try to do it today.

egonina · 2015-01-23T22:15:02Z

parquet-hadoop/src/main/java/parquet/filter2/statisticslevel/StatisticsFilter.java

why can't we drop it if it is all nulls in this case?

same question.

we cannot just drop it, because the user defined predicate can be anything, we don't know beforehand. Also , if we don't keep this check, it will proceed to get min/max for the predicate, which can be nulls in case all values are nulls, so its best to just keep the chunk in case of all nulls.

we should have (and could) extend the UDP interface to receive not just the min/max, but also the null count. Currently the UDP interface only accepts min/max, and in the case of all nulls, there is no min/max.

See here: https://github.com/apache/incubator-parquet-mr/blob/master/parquet-column/src/main/java/parquet/filter2/predicate/Statistics.java

also note that if we add num nulls there, then existing UDP's will have to check for null themselves or their behavior might change.

egonina · 2015-01-23T22:19:20Z

+1 besides one clarification question

julienledem · 2015-01-24T01:16:03Z

parquet-hadoop/src/main/java/parquet/filter2/statisticslevel/StatisticsFilter.java

comparing null to anything else than null should always drop it. right?

this filter is notEq filter , where we are looking for columnValue != (some non null value), so any row with null column should be returned and cannot be dropped.

Also I think the above check is redundant anyways, since we have a hasNulls() check above this one, so if there are any nulls, we will be returning 'false' and this check would never be reached.

maybe @isnotinvain can help ?

Lets see...

case 1: v notEq(null) (line 121)
we are trying to keep all non-null records. If this is a chunk of only-nulls, it can be dropped, otherwise, it can't. This case returns either way (does not fall through).

case2 : v notEq(nonNull) (line 127)
we are trying to keep all records where v is not a particular (non-null) value. For example, v notEq(7). If this chunk contains nulls, there are some records to keep, because we decided that:
null notEq(7) is true, eg, we chose to make predicates work the way a programming language like java works (null != 7) instead of the way SQL or Pig works (nulls are viral, and dropped in all scenarios). It boils down to in programming languages we think of null as a value, and SQL think of it as "unknown."
This case falls through to the next unlike the case above.

@saucam is right, Line 134 to 137 (which I added to every operator) is unreachable in this case, we can delete it because to get here we've already confirmed that !stats.isEmpty() && !stats.hasNulls()

case3: v notEq(nonNullTarget) && there's a min + max
only case where we can drop the chunk is: min = max = nonNullTarget
eg, v notEq(7) and this column is a column of all 7s.

Thanks for catching the unreachable block!

FYI, docs describing how nulls are handled: https://github.com/apache/incubator-parquet-mr/blob/master/parquet-column/src/main/java/parquet/filter2/predicate/FilterApi.java#L90

julienledem · 2015-01-24T01:17:28Z

Same question as Katya. Otherwise this looks good to me.

saucam · 2015-01-25T05:44:57Z

removed the unreachable block. @julienledem how is it now ?

egonina · 2015-01-26T20:17:35Z

LGTM

isnotinvain · 2015-01-26T23:47:59Z

@saucam our merge process is a little strict, can you rename this PR:

"PARQUET-136: NPE thrown in StatisticsFilter when all values in a string/binary column trunk are null"

(note the all caps PARQUET)

Our merge script requires this to deal with JIRA / how apache git vs github integration works.

Thanks!

saucam · 2015-01-27T02:04:39Z

done :)

isnotinvain · 2015-01-27T02:28:02Z

Done! Thanks!

…ng/binary column trunk are null In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column. Even if column has no values, it can be ignored. The other way is to fix this behaviour in the writer, but is that what we want ? Author: Yash Datta <Yash.Datta@guavus.com> Author: Alex Levenson <alexlevenson@twitter.com> Author: Yash Datta <saucam@gmail.com> Closes apache#99 from saucam/npe and squashes the following commits: 5138e44 [Yash Datta] PARQUET-136: Remove unreachable block b17cd38 [Yash Datta] Revert "PARQUET-161: Trigger tests" 82209e6 [Yash Datta] PARQUET-161: Trigger tests aab2f81 [Yash Datta] PARQUET-161: Review comments for the test case 2217ee2 [Yash Datta] PARQUET-161: Add a test case for checking the correct statistics info is recorded in case of all nulls in a column c2f8d6f [Yash Datta] PARQUET-161: Fix the write path to write statistics object in case of only nulls in the column 97bb517 [Yash Datta] Revert "revert TestStatisticsFilter.java" a06f0d0 [Yash Datta] Merge pull request apache#1 from isnotinvain/alexlevenson/PARQUET-161-136 b1001eb [Alex Levenson] Fix statistics isEmpty, handle more edge cases in statistics filter 0c88be0 [Alex Levenson] revert TestStatisticsFilter.java 1ac9192 [Yash Datta] PARQUET-136: Its better to not filter chunks for which empty statistics object is returned. Empty statistics can be read in case of 1. pre-statistics files, 2. files written from current writer that has a bug, as it does not write the statistics if column has all nulls e5e924e [Yash Datta] Revert "PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column" 8cc5106 [Yash Datta] Revert "PARQUET-136: fix hasNulls to cater to the case where all values are nulls" c7c126f [Yash Datta] PARQUET-136: fix hasNulls to cater to the case where all values are nulls 974a22b [Yash Datta] PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column

…ng/binary column trunk are null In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column. Even if column has no values, it can be ignored. The other way is to fix this behaviour in the writer, but is that what we want ? Author: Yash Datta <Yash.Datta@guavus.com> Author: Alex Levenson <alexlevenson@twitter.com> Author: Yash Datta <saucam@gmail.com> Closes apache#99 from saucam/npe and squashes the following commits: 5138e44 [Yash Datta] PARQUET-136: Remove unreachable block b17cd38 [Yash Datta] Revert "PARQUET-161: Trigger tests" 82209e6 [Yash Datta] PARQUET-161: Trigger tests aab2f81 [Yash Datta] PARQUET-161: Review comments for the test case 2217ee2 [Yash Datta] PARQUET-161: Add a test case for checking the correct statistics info is recorded in case of all nulls in a column c2f8d6f [Yash Datta] PARQUET-161: Fix the write path to write statistics object in case of only nulls in the column 97bb517 [Yash Datta] Revert "revert TestStatisticsFilter.java" a06f0d0 [Yash Datta] Merge pull request #1 from isnotinvain/alexlevenson/PARQUET-161-136 b1001eb [Alex Levenson] Fix statistics isEmpty, handle more edge cases in statistics filter 0c88be0 [Alex Levenson] revert TestStatisticsFilter.java 1ac9192 [Yash Datta] PARQUET-136: Its better to not filter chunks for which empty statistics object is returned. Empty statistics can be read in case of 1. pre-statistics files, 2. files written from current writer that has a bug, as it does not write the statistics if column has all nulls e5e924e [Yash Datta] Revert "PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column" 8cc5106 [Yash Datta] Revert "PARQUET-136: fix hasNulls to cater to the case where all values are nulls" c7c126f [Yash Datta] PARQUET-136: fix hasNulls to cater to the case where all values are nulls 974a22b [Yash Datta] PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column

Yash Datta added 2 commits January 7, 2015 18:27

PARQUET-136: In case of all nulls in a binary column, statistics obje…

974a22b

…ct read from file metadata is empty, and should return true for all nulls check for the column

PARQUET-136: fix hasNulls to cater to the case where all values are n…

c7c126f

…ulls

liancheng reviewed Jan 8, 2015
View reviewed changes

Yash Datta added 3 commits January 9, 2015 17:34

Revert "PARQUET-136: fix hasNulls to cater to the case where all valu…

8cc5106

…es are nulls" This reverts commit c7c126f.

Revert "PARQUET-136: In case of all nulls in a binary column, statist…

e5e924e

…ics object read from file metadata is empty, and should return true for all nulls check for the column" This reverts commit 974a22b.

PARQUET-136: Its better to not filter chunks for which empty statisti…

1ac9192

…cs object is returned. Empty statistics can be read in case of 1. pre-statistics files, 2. files written from current writer that has a bug, as it does not write the statistics if column has all nulls

isnotinvain reviewed Jan 9, 2015
View reviewed changes

isnotinvain added 2 commits January 9, 2015 14:57

revert TestStatisticsFilter.java

0c88be0

Fix statistics isEmpty, handle more edge cases in statistics filter

b1001eb

saucam and others added 3 commits January 11, 2015 11:47

Merge pull request #1 from isnotinvain/alexlevenson/PARQUET-161-136

a06f0d0

Some updates for PARQUET 161 + 136

Revert "revert TestStatisticsFilter.java"

97bb517

This reverts commit 0c88be0.

PARQUET-161: Fix the write path to write statistics object in case of…

c2f8d6f

… only nulls in the column

PARQUET-161: Add a test case for checking the correct statistics info…

2217ee2

… is recorded in case of all nulls in a column

isnotinvain reviewed Jan 21, 2015
View reviewed changes

PARQUET-161: Review comments for the test case

aab2f81

PARQUET-161: Trigger tests

82209e6

Revert "PARQUET-161: Trigger tests"

b17cd38

This reverts commit 82209e6.

egonina reviewed Jan 23, 2015
View reviewed changes

julienledem reviewed Jan 24, 2015
View reviewed changes

PARQUET-136: Remove unreachable block

5138e44

saucam changed the title ~~Parquet-136: NPE thrown in StatisticsFilter when all values in a string/binary column trunk are null~~ PARQUET-136: NPE thrown in StatisticsFilter when all values in a string/binary column trunk are null Jan 27, 2015

asfgit closed this in 4bf9be3 Jan 27, 2015

julienledem mentioned this pull request Jan 30, 2015

PARQUET-161: Statistics should be written for column chunks that are all null #103

Closed

PARQUET-136: NPE thrown in StatisticsFilter when all values in a string/binary column trunk are null #99

PARQUET-136: NPE thrown in StatisticsFilter when all values in a string/binary column trunk are null #99

Uh oh!

Conversation

saucam commented Jan 7, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

liancheng commented Jan 8, 2015

Uh oh!

isnotinvain commented Jan 8, 2015

Uh oh!

isnotinvain commented Jan 8, 2015

Uh oh!

isnotinvain commented Jan 8, 2015

Uh oh!

isnotinvain commented Jan 8, 2015

Uh oh!

egonina commented Jan 8, 2015

Uh oh!

saucam commented Jan 9, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

isnotinvain commented Jan 9, 2015

Uh oh!

saucam commented Jan 11, 2015

Uh oh!

saucam commented Jan 11, 2015

Uh oh!

saucam commented Jan 16, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

isnotinvain commented Jan 21, 2015

Uh oh!

saucam commented Jan 21, 2015

Uh oh!

isnotinvain commented Jan 21, 2015

Uh oh!

saucam commented Jan 23, 2015

Uh oh!

isnotinvain commented Jan 23, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

egonina commented Jan 23, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

julienledem commented Jan 24, 2015

Uh oh!

saucam commented Jan 25, 2015

Uh oh!

egonina commented Jan 26, 2015

Uh oh!

isnotinvain commented Jan 26, 2015

Uh oh!

saucam commented Jan 27, 2015

Uh oh!

isnotinvain commented Jan 27, 2015

Uh oh!