-
Notifications
You must be signed in to change notification settings - Fork 1.5k
PARQUET-372: Do not write stats larger than 4k. #275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -77,6 +77,7 @@ public class ParquetMetadataConverter { | |
|
|
||
| public static final MetadataFilter NO_FILTER = new NoFilter(); | ||
| public static final MetadataFilter SKIP_ROW_GROUPS = new SkipMetadataFilter(); | ||
| public static final long MAX_STATS_SIZE = 4096; // limit stats to 4k | ||
|
|
||
| private static final Log LOG = Log.getLog(ParquetMetadataConverter.class); | ||
|
|
||
|
|
@@ -284,7 +285,11 @@ dataPageType, getEncoding(encoding), | |
| public static Statistics toParquetStatistics( | ||
| org.apache.parquet.column.statistics.Statistics statistics) { | ||
| Statistics stats = new Statistics(); | ||
| if (!statistics.isEmpty()) { | ||
| // Don't write stats larger than the max size rather than truncating. The | ||
| // rationale is that some engines may use the minimum value in the page as | ||
| // the true minimum for aggregations and there is no way to mark that a | ||
| // value has been truncated and is a lower bound and not in the page. | ||
| if (!statistics.isEmpty() && statistics.isSmallerThan(MAX_STATS_SIZE)) { | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is it possible to have null count without min/max. This is a nit, but engines could use null count even without min/max present.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I just looked into this. There's nothing preventing us from doing this in the format, but the implementation assumes that there will be non-null min and max if numNulls != numRows. I think it makes sense to add this as a follow-up instead of making this patch larger. |
||
| stats.setNull_count(statistics.getNumNulls()); | ||
| if (statistics.hasNonNullValue()) { | ||
| stats.setMax(statistics.getMaxBytes()); | ||
|
|
@@ -293,6 +298,7 @@ public static Statistics toParquetStatistics( | |
| } | ||
| return stats; | ||
| } | ||
|
|
||
| /** | ||
| * @deprecated Replaced by {@link #fromParquetStatistics( | ||
| * String createdBy, Statistics statistics, PrimitiveTypeName type)} | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
size() might be a more intuitive approach and would remove the burden of evaluating this check in the subclasses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's been a while, but I think my logic was this:
getMinBytesandgetMaxBytesallocate byte arrays and generally too much work for this check. We could cache the bytes, but that requires more work.sizeisn't determined byStatisticsbecause the metadata converter is responsible for actually converting stats into bytes. This only assumes that the entire min and max bytes will be written.So we delegate to a stats object to avoid work, but can't delegate to stats because it doesn't do serialization. I could change it to
getSizeHint, but I don't think that's much better and I'm inclined to leave it as is.