Skip to content

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented Sep 23, 2015

This updates the stats conversion to check whether the min and max
values for page stats are larger than 4k. If so, no statistics for a
page are written.

@rdblue
Copy link
Contributor Author

rdblue commented Sep 23, 2015

@spena, @julienledem, @isnotinvain what do you guys think of this change? We're hitting cases where the min/max stats are about the size of a page, so we end up with pages 3x larger than they should be.

I think the best solution is to not write stats over some limit, which I've set at 4k here. Scanning would be quick when this is the case because if values are that large there wouldn't be many of them in a page. Also, since the large values must be min and/or max, we don't expect many cases where just one huge value prevents stats from being written.

I had suggested on the ticket that we might want to modify the value, but that gets messy and wouldn't allow us to satisfy queries from stats alone (select max(col) from table).

@rdblue
Copy link
Contributor Author

rdblue commented Sep 23, 2015

Test failures are again due to flaky memory manager tests, fixed in #263.

@rdblue
Copy link
Contributor Author

rdblue commented Sep 23, 2015

I looked at stats merging and I think this is safe because statistics are aggregated before they are written to the file. A column chunk will accumulate the min and max values for all pages accurately, but when individual pages are written, the stats may be discarded instead of written to the file.

@rdblue rdblue force-pushed the PARQUET-372-fix-min-max-for-long-values branch 2 times, most recently from 8a7224c to 048cc62 Compare September 24, 2015 18:18
@spena
Copy link

spena commented Sep 29, 2015

@rdblue The patch looks good.

I just have a question about the limit for the statistics. Why not limiting the size to the current page size used on the row group (or pagesize / 4 or something)? If a user ends setting a big page size, then they won't have the advantage of using the statistics to check whether the row group should be read or not.

@rdblue
Copy link
Contributor Author

rdblue commented Oct 2, 2015

Thanks for taking a look, @spena. There are a couple of motivations for the 4k static limit. First, I'd like to keep a hard limit on the page size so they don't get unexpectedly large again. The other reason is just to keep the code simple and not need to pass the current page size through unless we know it's going to be beneficial. We can always make it more fancy later.

@rdblue rdblue force-pushed the PARQUET-372-fix-min-max-for-long-values branch from 048cc62 to 8c1323e Compare October 27, 2015 22:51
@rdblue
Copy link
Contributor Author

rdblue commented Dec 1, 2015

@julienledem, I think this (or something similar) should go into 1.9.0. Could you have a look at it?

@rdblue rdblue force-pushed the PARQUET-372-fix-min-max-for-long-values branch from 8c1323e to 3a5207e Compare December 18, 2015 17:22
@rdblue
Copy link
Contributor Author

rdblue commented Dec 18, 2015

I've rebased this and added a comment to explain why stats are ignored rather than truncated.

@julienledem could you take a look when you have time? Thanks!

@lw-lin
Copy link
Contributor

lw-lin commented Jan 30, 2016

hi @rdblue
Although I wonder if we could just stop updating the stats once we encounter large-sized max+min (to save the max/min compassion for upcoming writes), I'm +1 for this. I've looked through the code changes as well as the added unit tests -- LGTM.

// rationale is that some engines may use the minimum value in the page as
// the true minimum for aggregations and there is no way to mark that a
// value has been truncated and is a lower bound and not in the page.
if (!statistics.isEmpty() && statistics.isSmallerThan(MAX_STATS_SIZE)) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have null count without min/max. This is a nit, but engines could use null count even without min/max present.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just looked into this. There's nothing preventing us from doing this in the format, but the implementation assumes that there will be non-null min and max if numNulls != numRows. I think it makes sense to add this as a follow-up instead of making this patch larger.

@danielcweeks
Copy link

+1 I agree with all of the follow up to my questions.

rdblue added 2 commits May 5, 2016 12:39
This updates the stats conversion to check whether the min and max
values for page stats are larger than 4k. If so, no statistics for a
page are written.
@rdblue rdblue force-pushed the PARQUET-372-fix-min-max-for-long-values branch from 3a5207e to 61e05d9 Compare May 5, 2016 20:25
@asfgit asfgit closed this in c3f3830 May 5, 2016
piyushnarang pushed a commit to piyushnarang/parquet-mr that referenced this pull request Jun 15, 2016
This updates the stats conversion to check whether the min and max
values for page stats are larger than 4k. If so, no statistics for a
page are written.

Author: Ryan Blue <blue@apache.org>

Closes apache#275 from rdblue/PARQUET-372-fix-min-max-for-long-values and squashes the following commits:

61e05d9 [Ryan Blue] PARQUET-372: Add comment to explain not truncating values.
fbbc1c4 [Ryan Blue] PARQUET-372: Do not write stats larger than 4k.
rdblue added a commit to rdblue/parquet-mr that referenced this pull request Jan 6, 2017
This updates the stats conversion to check whether the min and max
values for page stats are larger than 4k. If so, no statistics for a
page are written.

Author: Ryan Blue <blue@apache.org>

Closes apache#275 from rdblue/PARQUET-372-fix-min-max-for-long-values and squashes the following commits:

61e05d9 [Ryan Blue] PARQUET-372: Add comment to explain not truncating values.
fbbc1c4 [Ryan Blue] PARQUET-372: Do not write stats larger than 4k.
rdblue added a commit to rdblue/parquet-mr that referenced this pull request Jan 10, 2017
This updates the stats conversion to check whether the min and max
values for page stats are larger than 4k. If so, no statistics for a
page are written.

Author: Ryan Blue <blue@apache.org>

Closes apache#275 from rdblue/PARQUET-372-fix-min-max-for-long-values and squashes the following commits:

61e05d9 [Ryan Blue] PARQUET-372: Add comment to explain not truncating values.
fbbc1c4 [Ryan Blue] PARQUET-372: Do not write stats larger than 4k.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants