-
Notifications
You must be signed in to change notification settings - Fork 1.5k
PARQUET-372: Do not write stats larger than 4k. #275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-372: Do not write stats larger than 4k. #275
Conversation
|
@spena, @julienledem, @isnotinvain what do you guys think of this change? We're hitting cases where the min/max stats are about the size of a page, so we end up with pages 3x larger than they should be. I think the best solution is to not write stats over some limit, which I've set at 4k here. Scanning would be quick when this is the case because if values are that large there wouldn't be many of them in a page. Also, since the large values must be min and/or max, we don't expect many cases where just one huge value prevents stats from being written. I had suggested on the ticket that we might want to modify the value, but that gets messy and wouldn't allow us to satisfy queries from stats alone (select max(col) from table). |
|
Test failures are again due to flaky memory manager tests, fixed in #263. |
|
I looked at stats merging and I think this is safe because statistics are aggregated before they are written to the file. A column chunk will accumulate the min and max values for all pages accurately, but when individual pages are written, the stats may be discarded instead of written to the file. |
8a7224c to
048cc62
Compare
|
@rdblue The patch looks good. I just have a question about the limit for the statistics. Why not limiting the size to the current page size used on the row group (or pagesize / 4 or something)? If a user ends setting a big page size, then they won't have the advantage of using the statistics to check whether the row group should be read or not. |
|
Thanks for taking a look, @spena. There are a couple of motivations for the 4k static limit. First, I'd like to keep a hard limit on the page size so they don't get unexpectedly large again. The other reason is just to keep the code simple and not need to pass the current page size through unless we know it's going to be beneficial. We can always make it more fancy later. |
048cc62 to
8c1323e
Compare
|
@julienledem, I think this (or something similar) should go into 1.9.0. Could you have a look at it? |
8c1323e to
3a5207e
Compare
|
I've rebased this and added a comment to explain why stats are ignored rather than truncated. @julienledem could you take a look when you have time? Thanks! |
|
hi @rdblue |
| // rationale is that some engines may use the minimum value in the page as | ||
| // the true minimum for aggregations and there is no way to mark that a | ||
| // value has been truncated and is a lower bound and not in the page. | ||
| if (!statistics.isEmpty() && statistics.isSmallerThan(MAX_STATS_SIZE)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to have null count without min/max. This is a nit, but engines could use null count even without min/max present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just looked into this. There's nothing preventing us from doing this in the format, but the implementation assumes that there will be non-null min and max if numNulls != numRows. I think it makes sense to add this as a follow-up instead of making this patch larger.
|
+1 I agree with all of the follow up to my questions. |
This updates the stats conversion to check whether the min and max values for page stats are larger than 4k. If so, no statistics for a page are written.
3a5207e to
61e05d9
Compare
This updates the stats conversion to check whether the min and max values for page stats are larger than 4k. If so, no statistics for a page are written. Author: Ryan Blue <blue@apache.org> Closes apache#275 from rdblue/PARQUET-372-fix-min-max-for-long-values and squashes the following commits: 61e05d9 [Ryan Blue] PARQUET-372: Add comment to explain not truncating values. fbbc1c4 [Ryan Blue] PARQUET-372: Do not write stats larger than 4k.
This updates the stats conversion to check whether the min and max values for page stats are larger than 4k. If so, no statistics for a page are written. Author: Ryan Blue <blue@apache.org> Closes apache#275 from rdblue/PARQUET-372-fix-min-max-for-long-values and squashes the following commits: 61e05d9 [Ryan Blue] PARQUET-372: Add comment to explain not truncating values. fbbc1c4 [Ryan Blue] PARQUET-372: Do not write stats larger than 4k.
This updates the stats conversion to check whether the min and max values for page stats are larger than 4k. If so, no statistics for a page are written. Author: Ryan Blue <blue@apache.org> Closes apache#275 from rdblue/PARQUET-372-fix-min-max-for-long-values and squashes the following commits: 61e05d9 [Ryan Blue] PARQUET-372: Add comment to explain not truncating values. fbbc1c4 [Ryan Blue] PARQUET-372: Do not write stats larger than 4k.
This updates the stats conversion to check whether the min and max
values for page stats are larger than 4k. If so, no statistics for a
page are written.