Skip to content

Conversation

@saucam
Copy link

@saucam saucam commented Jan 7, 2015

In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column. Even if column has no values, it can be ignored.

The other way is to fix this behaviour in the writer, but is that what we want ?

Yash Datta added 2 commits January 7, 2015 18:27
…ct read from file metadata is empty, and should return true for all nulls check for the column
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have an empty row group without any records in it? In this case, a binary column also has null statistics values, and isAllNulls/hasNulls both return true, which is at least semantically wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide this is a bug in the write path, lets put a TODO here with a reference to a jira.
Hopefully the only way the stats can be empty is when the column is all nulls, but I wonder if that is true?
What about files that were written pre-statistics?

@julienledem do you know if there can be an empty rowgroup? Hopefully not...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should fix it in the write path and have column.getStatistics().getNumNulls() == column.getValueCount() in that case.
missing statistics should mean "the writer did not have that feature"
We should not have empty row groups. There was a bug in the past that produced empty row groups in some cases but this was fixed.

@liancheng
Copy link
Contributor

Would be great if a test case can be added :-)

@isnotinvain
Copy link
Contributor

Thanks for fixing!

So currently, in the case of writing a row group that is all nulls, we don't record the number of nulls?
That does sound like a bug in the write path to me. @julienledem what do you think?

A test would be great as well.

@isnotinvain
Copy link
Contributor

ping @egonina who might know about whether we should be writing stats for all null column chunks

@isnotinvain
Copy link
Contributor

So actually, when the stats object is empty, I think we should just do no filtering -- we don't know if it's empty because of this corner case, or because the file simply has no stats.

I think an aggressive early check for stats.isEmpty is all that we need here?

@isnotinvain
Copy link
Contributor

I've filed: https://issues.apache.org/jira/browse/PARQUET-161 for the write path issue.

I think the fix here on the read path is to check isEmpty and keep all chunks that have no statistics.

@egonina
Copy link

egonina commented Jan 8, 2015

For a column that is all nulls, we would count the number of nulls and put that in the stats object but the stats object is still marked as empty (markAsNotEmpty is not called down this path) (EDIT: and is not written). Looks like a bug, we should mark the stats as non-empty if the values are all nulls, since the column is non-empty. Then empty stats mean that we didn't collect any stats at all for the column.

Yash Datta added 3 commits January 9, 2015 17:34
…ics object read from file metadata is empty, and should return true for all nulls check for the column"

This reverts commit 974a22b.
…cs object is returned. Empty statistics can be read in case of

1. pre-statistics files,
2. files written from current writer that has a bug, as it does not write the statistics if column has all nulls
@saucam
Copy link
Author

saucam commented Jan 9, 2015

Hi all,

Thanks for the feedback.
@isnotinvain yes the statistics object is not written in case there are only nulls. I agree it is better to not filter in case of statistics object being empty.
@egonina thanks for pointing out the omits.
Made the relevant changes.

Will add test case soon.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this comment is repeated, and we are going to fix the bug in the write path, lets just make this comment something like:
// we have no statistics available, we cannot drop any chunks

@isnotinvain
Copy link
Contributor

I've sent you a PR for your 'npe' branch:
https://github.com/saucam/incubator-parquet-mr/pull/1/files

I think that gets us to a place where we don't need #103 and we'll have fixed both PARQUET-136 and PARQUET-161 in a way that is compatible with old files that were both written a) before stats were collected and b) while this bug was in effect.

Let me know what you think. Thanks!

@saucam
Copy link
Author

saucam commented Jan 11, 2015

@isnotinvain , thanks for all the help. Have merged your changes (reverted the test case revert because of the reasons mentioned in the in line comment). Its much cleaner now
how does it look ?

… is recorded in case of all nulls in a column
@saucam
Copy link
Author

saucam commented Jan 11, 2015

Added a test case.

@saucam
Copy link
Author

saucam commented Jan 16, 2015

@isnotinvain , any updates on this one ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's this loop for if all we do is call getPath() and ignore the return value?

@isnotinvain
Copy link
Contributor

@saucam sorry for the slow reply thanks a lot!

LGTM, minor comments but aside from that +1

Long term we probably need more comprehensive tests around statistics, but probably not in this PR :)

@saucam
Copy link
Author

saucam commented Jan 21, 2015

Added changes for the comments.

@isnotinvain , thanks !

@isnotinvain
Copy link
Contributor

+1, merge when the tests pass (to trigger the tests running again you might have to commit a whitespace change and then revert it, Travis can be flakey)

cc @julienledem @tsdeng who can actually merge this

@saucam
Copy link
Author

saucam commented Jan 23, 2015

cc @julienledem @tsdeng

@isnotinvain
Copy link
Contributor

I should be able to merge this now, I'll try to do it today.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't we drop it if it is all nulls in this case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we cannot just drop it, because the user defined predicate can be anything, we don't know beforehand. Also , if we don't keep this check, it will proceed to get min/max for the predicate, which can be nulls in case all values are nulls, so its best to just keep the chunk in case of all nulls.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should have (and could) extend the UDP interface to receive not just the min/max, but also the null count. Currently the UDP interface only accepts min/max, and in the case of all nulls, there is no min/max.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See here: https://github.com/apache/incubator-parquet-mr/blob/master/parquet-column/src/main/java/parquet/filter2/predicate/Statistics.java

also note that if we add num nulls there, then existing UDP's will have to check for null themselves or their behavior might change.

@egonina
Copy link

egonina commented Jan 23, 2015

+1 besides one clarification question

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comparing null to anything else than null should always drop it. right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this filter is notEq filter , where we are looking for columnValue != (some non null value), so any row with null column should be returned and cannot be dropped.

Also I think the above check is redundant anyways, since we have a hasNulls() check above this one, so if there are any nulls, we will be returning 'false' and this check would never be reached.

maybe @isnotinvain can help ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets see...

case 1: v notEq(null) (line 121)
we are trying to keep all non-null records. If this is a chunk of only-nulls, it can be dropped, otherwise, it can't. This case returns either way (does not fall through).

case2 : v notEq(nonNull) (line 127)
we are trying to keep all records where v is not a particular (non-null) value. For example, v notEq(7). If this chunk contains nulls, there are some records to keep, because we decided that:
null notEq(7) is true, eg, we chose to make predicates work the way a programming language like java works (null != 7) instead of the way SQL or Pig works (nulls are viral, and dropped in all scenarios). It boils down to in programming languages we think of null as a value, and SQL think of it as "unknown."
This case falls through to the next unlike the case above.

@saucam is right, Line 134 to 137 (which I added to every operator) is unreachable in this case, we can delete it because to get here we've already confirmed that !stats.isEmpty() && !stats.hasNulls()

case3: v notEq(nonNullTarget) && there's a min + max
only case where we can drop the chunk is: min = max = nonNullTarget
eg, v notEq(7) and this column is a column of all 7s.

Thanks for catching the unreachable block!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@julienledem
Copy link
Member

Same question as Katya. Otherwise this looks good to me.

@saucam
Copy link
Author

saucam commented Jan 25, 2015

removed the unreachable block. @julienledem how is it now ?

@egonina
Copy link

egonina commented Jan 26, 2015

LGTM

@isnotinvain
Copy link
Contributor

@saucam our merge process is a little strict, can you rename this PR:

"PARQUET-136: NPE thrown in StatisticsFilter when all values in a string/binary column trunk are null"

(note the all caps PARQUET)

Our merge script requires this to deal with JIRA / how apache git vs github integration works.

Thanks!

@saucam saucam changed the title Parquet-136: NPE thrown in StatisticsFilter when all values in a string/binary column trunk are null PARQUET-136: NPE thrown in StatisticsFilter when all values in a string/binary column trunk are null Jan 27, 2015
@saucam
Copy link
Author

saucam commented Jan 27, 2015

done :)

@asfgit asfgit closed this in 4bf9be3 Jan 27, 2015
@isnotinvain
Copy link
Contributor

Done! Thanks!

dongche pushed a commit to dongche/incubator-parquet-mr that referenced this pull request Feb 3, 2015
…ng/binary column trunk are null

In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column. Even if column has no values, it can be ignored.

The other way is to fix this behaviour in the writer, but is that what we want ?

Author: Yash Datta <Yash.Datta@guavus.com>
Author: Alex Levenson <alexlevenson@twitter.com>
Author: Yash Datta <saucam@gmail.com>

Closes apache#99 from saucam/npe and squashes the following commits:

5138e44 [Yash Datta] PARQUET-136: Remove unreachable block
b17cd38 [Yash Datta] Revert "PARQUET-161: Trigger tests"
82209e6 [Yash Datta] PARQUET-161: Trigger tests
aab2f81 [Yash Datta] PARQUET-161: Review comments for the test case
2217ee2 [Yash Datta] PARQUET-161: Add a test case for checking the correct statistics info is recorded in case of all nulls in a column
c2f8d6f [Yash Datta] PARQUET-161: Fix the write path to write statistics object in case of only nulls in the column
97bb517 [Yash Datta] Revert "revert TestStatisticsFilter.java"
a06f0d0 [Yash Datta] Merge pull request apache#1 from isnotinvain/alexlevenson/PARQUET-161-136
b1001eb [Alex Levenson] Fix statistics isEmpty, handle more edge cases in statistics filter
0c88be0 [Alex Levenson] revert TestStatisticsFilter.java
1ac9192 [Yash Datta] PARQUET-136: Its better to not filter chunks for which empty statistics object is returned. Empty statistics can be read in case of 1. pre-statistics files, 2. files written from current writer that has a bug, as it does not write the statistics if column has all nulls
e5e924e [Yash Datta] Revert "PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column"
8cc5106 [Yash Datta] Revert "PARQUET-136: fix hasNulls to cater to the case where all values are nulls"
c7c126f [Yash Datta] PARQUET-136: fix hasNulls to cater to the case where all values are nulls
974a22b [Yash Datta] PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Mar 9, 2015
…ng/binary column trunk are null

In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column. Even if column has no values, it can be ignored.

The other way is to fix this behaviour in the writer, but is that what we want ?

Author: Yash Datta <Yash.Datta@guavus.com>
Author: Alex Levenson <alexlevenson@twitter.com>
Author: Yash Datta <saucam@gmail.com>

Closes apache#99 from saucam/npe and squashes the following commits:

5138e44 [Yash Datta] PARQUET-136: Remove unreachable block
b17cd38 [Yash Datta] Revert "PARQUET-161: Trigger tests"
82209e6 [Yash Datta] PARQUET-161: Trigger tests
aab2f81 [Yash Datta] PARQUET-161: Review comments for the test case
2217ee2 [Yash Datta] PARQUET-161: Add a test case for checking the correct statistics info is recorded in case of all nulls in a column
c2f8d6f [Yash Datta] PARQUET-161: Fix the write path to write statistics object in case of only nulls in the column
97bb517 [Yash Datta] Revert "revert TestStatisticsFilter.java"
a06f0d0 [Yash Datta] Merge pull request #1 from isnotinvain/alexlevenson/PARQUET-161-136
b1001eb [Alex Levenson] Fix statistics isEmpty, handle more edge cases in statistics filter
0c88be0 [Alex Levenson] revert TestStatisticsFilter.java
1ac9192 [Yash Datta] PARQUET-136: Its better to not filter chunks for which empty statistics object is returned. Empty statistics can be read in case of 1. pre-statistics files, 2. files written from current writer that has a bug, as it does not write the statistics if column has all nulls
e5e924e [Yash Datta] Revert "PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column"
8cc5106 [Yash Datta] Revert "PARQUET-136: fix hasNulls to cater to the case where all values are nulls"
c7c126f [Yash Datta] PARQUET-136: fix hasNulls to cater to the case where all values are nulls
974a22b [Yash Datta] PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants