-
Notifications
You must be signed in to change notification settings - Fork 1.5k
PARQUET-1025: Support new min-max statistics in parquet-mr #435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Started working on it. It is far not complete. @rdblue, could you please check it if the direction is fine? |
| return BOOLEAN_COMPARATOR; | ||
| } | ||
|
|
||
| public static LongComparator int64Comparator(OriginalType logicalType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should delegate to Type to supply comparators. That way, you have a stronger guarantee that the logical type and physical type match because Types won't allow you to construct an invalid one.
| /** | ||
| * Returns the string representation of the specified value for debugging/logging purposes | ||
| */ | ||
| public String toString(boolean b) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that it is convenient to add toString here, but I don't think it makes sense with the Comparator API. I think it would be better to pass Type to Statistics, then use that type to get a comparator and to do the coercion to string.
| break; | ||
| case BINARY: | ||
| if (logicalType == null) | ||
| // TODO: return lexicographical comparator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a Comparators class from another project that has the unsigned byte-wise comparator for ByteBuffer, one for CharSequence, and some Java 8 utils for handling nulls.
https://gist.github.com/rdblue/7e80e9700d508d395b004c4c146dbfe2
| this(PrimitiveType.PrimitiveTypeName.BINARY, null); | ||
| } | ||
|
|
||
| BinaryStatistics(PrimitiveType.PrimitiveTypeName type, OriginalType logicalType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I mentioned this earlier, but I think it would be a bit cleaner to pass the Type instance and use that to get a comparator and coerce to strings.
| } | ||
| } | ||
|
|
||
| public static Statistics<?> getStatsBasedOnType(PrimitiveTypeName type, OriginalType logicalType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are still using Statistics classes based on the primitive type, then why was support for stats other than BinaryStatistics removed from parquet-cli?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but I don't get it. Where was the support for stats other than BinaryStatistics removed?
| abstract public T genericGetMax(); | ||
|
|
||
| /** | ||
| * Returns the comparator to be used to compare two generic values in the proper way (e.g. unsigned comparison for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implies that unsigned comparison is correct for int. I think you should instead say "For example, unsigned comparison for UINT32"
| abstract public T genericGetMin(); | ||
|
|
||
| /** | ||
| * Returns the generic object representing the max value in the statistics. The self comparing logic of the returned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few clarifications:
- Instead of "generic object" just say "max value". The type is clear from the use of generics. If anything, add a
@param <T>to the class-level javadoc. - Instead of "self comparing logic", use "Java natural order" or reference
compareToto be explicit. - Instead of "strongly recommended", I think you should explain why: the sort order for logical types may not be the order produced by
compareTo. Users should use the logical type's defined order and can use#comparator()to get aComparatorinstance that implements that order.
| } | ||
|
|
||
| /** | ||
| * The self-comparison logic of {@code T} might not proper for the actual logical type (e.g. unsigned int). Use {@link |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a useful note, but not javadoc for this method. If you want to add Javadoc here, describe what the method does and then add a note below that. Same for getMax.
|
|
||
| abstract public class Binary implements Comparable<Binary>, Serializable { | ||
|
|
||
| public interface ComparatorHelper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These can be static methods that create correct ByteBuffers and then call the right Comparator<ByteBuffer>. No need to define an interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, we have 4 different Binary implementations with 2 inner representations: byte[] and ByteBuffer. So, we need 3 different comparisons: byte[] to byte[], ByteBuffer to ByteBuffer and byte[] to ByteBuffer.
The concept here with this interface is to make it able to have Comparator<Binary> implemented outside of this class.
I would even implement it similarly if we want to have different Binary instances with the proper Comparable<Binary> implementation only in this case the interface would be private.
Meanwhile, I would vote on having separate Comparator<Binary> implementations in the same place where the other primitive comparators are.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is that this doesn't need to be a class because these are straight-forward transformations to compare(ByteBuffer, ByteBuffer). The byte[] arguments can be easily wrapped, and the ByteBuffer args should duplicate the buffer and then set the duplicate's position and limit appropriately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand it correctly, it means that a byte[] backed Binary shall either create new ByteBuffer instances to wrap the byte[] references for each compare(Binary, Binary) invocation or have a ByteBuffer reference cached inside the Binary implementation.
The first one would impact the performance of filtering and creating the statistics the latter one would increase the memory footprint of the binary representation.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, what I'm suggesting would mean wrapping the byte array in a buffer to send it into compare, but I don't think avoiding object creation here is going to be worth it for performance. I think the main concern is to make sure we have as few code paths for this comparison as possible for correctness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, we have 5 different implementations for the abstract class Binary. I guess, it is for performance reasons.
If we really think wrapping a byte[] into a ByteBuffer for comparing Binary object worth for the clean code then I would suggest completely refactoring the class Binary to have only one implementation which contains a ByteBuffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's quite a bit of refactoring in #390 for Binary to use ByteBuffer more. Lets see where we're at when those two are in.
|
Thanks for working on this @gszadovszky! I have a few comments, but the overall direction to introduce comparators based on the full type (physical and logical) looks good to me. |
| ((DoubleStatistics) stats).getMin(), | ||
| ((DoubleStatistics) stats).getMax()); | ||
| } else if (stats instanceof BinaryStatistics) { | ||
| if (stats instanceof BinaryStatistics) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where the other stats types have been removed. Why are only binary statistics supported here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, I got it. There was another concept I've started with and I've forgot to revert this change. I'll do it in the next commit.
Thanks for the finding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a single Statistics class would be nice, if it works out when the comparator interfaces are done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've designed the new PrimitiveComparator super class to have primitive comparisons to avoid the unnecessary boxing of primitive types. So, I would keep the current specialised Statistics classes for performance reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
| return !hasNonNullValue() || ((min.length() + max.length()) < size); | ||
| String toString(Binary value) { | ||
| // TODO: have separate toString for different logical types? | ||
| return value.toStringUsingUTF8(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delegating to type would be better here, and would work when this Binary is used for Decimal.
|
|
||
| @Override | ||
| public LongComparator comparator() { | ||
| return LongComparators.NATURAL_COMPARATOR; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to override here, the type can provide this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I've missed to remove it after rewriting to use my PrimitiveComparator class.
| PrimitiveComparator<?> comparator(OriginalType logicalType) { | ||
| if (logicalType == OriginalType.UINT_64) | ||
| // TODO: return unsigned comparator | ||
| return PrimitiveComparator.SIGNED_INT64_COMPARATOR; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is only temporary, but it's probably a better practice to throw UnsupportedOperationException rather than return the wrong comparator. That way if it is overlooked later, it fails instead of silently introducing correctness problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concept here is to refactor the code to the new design with still passing unit tests. Then, at the final step add the new comparators and fix the existing tests and add more to pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense.
| @Override | ||
| PrimitiveComparator<?> comparator(OriginalType logicalType) { | ||
| // TODO: what to return here? | ||
| return PrimitiveComparator.BINARY_COMPARATOR; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
INT96 should continue to use signed byte-wise comparison.
|
|
||
| @Override | ||
| PrimitiveComparator<?> comparator(OriginalType logicalType) { | ||
| if (logicalType == OriginalType.UINT_64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: always use { and } for branches and loops.
zivanfi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some preliminary comments, I didn't get to end of it yet.
|
|
||
| public BinaryStatistics() { | ||
| // Creating a fake primitive type to have the proper comparator | ||
| this(Types.optional(PrimitiveType.PrimitiveTypeName.BINARY).named("")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of creating a new instance of this type every time, could you create a single one that you reuse every time? (Same for all other Statistics classes.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Also use a name that makes it clear that this came from stats, in case it shows up elsewhere.
| * @author Katya Gonina | ||
| */ | ||
| public abstract class Statistics<T extends Comparable<T>> { | ||
| public abstract class Statistics<T extends Comparable<T>> implements Cloneable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered alternatives to making this class Cloneable? Effective Java strongly discourages the use of Cloneable. Some of the arguments mentioned there are publicly available here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Better to create a copy() method.
| private final Comparator<T> comparator; | ||
|
|
||
| public Statistics(T min, T max) { | ||
| public Statistics(T min, T max, Comparator<T> comparator) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing this public constructor is technically a breaking change. If possible, please deprecate it instead. If it is not supposed to be used by clients, please use the InterfaceAudience.Private Yetus annotation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, we don't have dependency on Yetus but have InterfaceAudience from 3 different dependencies (hadoop, hive, pig). It is not clear to me which pattern should we follow (having our own annotation implementation, use Yetus, use Hadoop etc.) So, I would leave the annotation to a separate change. (It might worth a JIRA for that.)
I'll put back the original constructor as deprecated for now a comment that the constructors are not for public use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1 for introducing audience annotations. We don't use those in Parquet. We've been needing a push to clean up the public API for a while now, but introducing the annotations for single classes isn't going to help. If we are going to clean up the API, lets do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would like to introduce it for the whole API, but it's a big task to do in a single chunk, so I think a progressive approach would be a good start. It is much easier to decide the intended audience of a method when it gets added and both developer and reviewers have a context than later on when it requires a massive investigation. For this reason I think it would be good to start adding these annotations for new (and possibly existing but touched) methods/classes. This would also prevent users from misusing these functions in the interim time while we add the annotations to the whole API (progressively, module-by-module, I guess).
| } | ||
|
|
||
| private int unsignedCompare(byte b1, byte b2) { | ||
| return toUnsigned(b1) - toUnsigned(b2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't use - to implement comparison because Java integers wrap around instead of throwing an overflow exception. Instead, use something like this (from Comparators gist):
int cmp = Integer.compare(
((int) buf1.get(b1pos + i)) & 0xff,
((int) buf2.get(b2pos + i)) & 0xff);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are dealing with byte values casted to ints. There is no chance for overflows.
|
|
||
| abstract int compare(ByteBuffer b1, ByteBuffer b2); | ||
|
|
||
| final int toUnsigned(byte b) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this final? An attempt to inline?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. This is an abstract class and the implementation of converting unsigned byte values to ints shall not be overridden.
| } | ||
| }; | ||
|
|
||
| static final PrimitiveComparator<Binary> SIGNED_BINARY_COMPARATOR = new BinaryComparator() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we have a need for this comparator. The regular ByteBuffer comparator implements the signed byte-wise comparison that is required by the spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the name is a bit misleading. This comparator compares the values as signed integers (e.g. BigInteger). We need it if we would like to have similar behaviour for DECIMALs on INT32/INT64 and on FIX_LEN_BYTE_ARRAY/BINARY.
What a regular ByteBuffer comparator or the compareTo method of Binary do is quite useless. They compare byte values one-by-one as signed values which is incorrect either for string like values or numeric values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To second Gabor's opinion, I agree that this is needed. The byte-wise signed comparison is not appropriate, we need a whole-value signed comparison.
Regarding the comment that the ByteBuffer comparator implements signed byte-wise comparison as required by the spec: The spec actually requires unsigned byte-wise comparison. (These specific lines were last modified by me, but they was present earlier as well.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. If this is a comparator for encoded decimals, integers, or longs, it should have a more specific name.
| public abstract class PrimitiveComparator<T> implements Comparator<T> { | ||
|
|
||
| public int compare(boolean b1, boolean b2) { | ||
| throw new UnsupportedOperationException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should include a helpful message. It is never a good idea to throw an exception with no message. Some code will strip out the stack trace and leave just the message (which is also a bad practice) and this would cause that to produce just null. It's also helpful to explain a little more why this exception was thrown, "compare(boolean,boolean) was called on a non-boolean comparator"
| long valueCount, | ||
| long totalSize, | ||
| long totalUncompressedSize) { | ||
| return get(path, new PrimitiveType(Type.Repetition.OPTIONAL, type, ""), codec, encodingStats, encodings, statistics, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the Types API instead of constructors. The constructors will probably not be public in the 2.0 API.
| <cascading.version>2.7.1</cascading.version> | ||
| <cascading3.version>3.1.2</cascading3.version> | ||
| <parquet.format.version>2.3.1</parquet.format.version> | ||
| <parquet.format.version>2.4.0</parquet.format.version> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you rebase, you'll pick up this change.
5f99e56 to
8bdf16f
Compare
| @Test | ||
| public void testV2OnlyStats() { | ||
| testV2OnlyStats(createStats(Types.optional(PrimitiveTypeName.INT32).as(OriginalType.UINT_8).named(""), | ||
| Byte.MAX_VALUE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In think for UINT_8 you should use 0x00 as min and 0xFF as max (or some arbitrary values, but certainly not these constants). Although for the decimal you use arbitrary values (±123456), and Byte.MIN_VALUE and Byte.MAX_VALUE are arbitrary values, it is still misleading to use them, because:
-
They do not seem to be arbitrary values but special values instead, while in reality those values are not special at all in unsigned interpretation (specifically those are not the most extreme values of the allowed range for unsigned bytes).
-
Byte.MIN_VALUE = -128 = 0x80 and Byte.MAX_VALUE = 127 = 0x7F and in unsigned comparison 0x80 > 0x7F, so the min value is larger the max value.
Similarly for the bigger unsigned types below.
| // NOTE: See docs in CorruptStatistics for explanation of why this check is needed | ||
| // The sort order is checked to avoid returning min/max stats that are not | ||
| // valid with the type's sort order. Currently, all stats are aggregated | ||
| // using a signed ordering, which isn't valid for strings or unsigned ints. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neither are they valid for decimals or int96 timestamps, you could mention those as well or be more generic in this statements ("which isn't valid for some types, for example strings"). EDIT: Just realized that this comment is not new, it just got moved. Still, I think it would be worth improving.
| } | ||
| if (Float.TYPE != exclude) { | ||
| try { | ||
| comparator.compare(0F, 0F); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) 0.0f would be easier on the eye, seeing 0F I immediately thought of 0x0F. Same with 0D below.
| Binary.fromReusedByteArray("azzza".getBytes(), 1, 3), | ||
| Binary.fromReusedByteBuffer(ByteBuffer.wrap("zzzzzz".getBytes())), | ||
| Binary.fromReusedByteBuffer(ByteBuffer.wrap("aazzzzzzaa".getBytes(), 2, 7)), | ||
| Binary.fromConstantByteBuffer(ByteBuffer.wrap(new byte[]{-128, -128, -128})), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) Although it's usually the other way around, in this case using hexadecimal literals would make these values more human-readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hexadecimal literals in Java are integers which means while -128 does not required casting to byte 0x80 would. Do you think it would still be readable if I write new byte[]{(byte) 0x80, (byte) 0x80, (byte) 0xFF}?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. Optional: Please consider adding the hex values in comments, like // 0x8080FF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer the option with casting for small values like these.
|
|
||
| @Override | ||
| PrimitiveComparator<?> comparator(OriginalType logicalType) { | ||
| return PrimitiveComparator.SIGNED_BINARY_COMPARATOR; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be undefined according to PARQUET-1065.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementing undefined comparator logic is not a simple effort and a bit error prone. I can see two options:
- Returning
nullwherever we provide comparators and handlenullcomparator in statistics logic to not to update min-max neither load or save it. In this caseNullPointerExceptionmight occur in client code which is misleading. - Returning a special implementation of the
PrimitiveComparatorwhich throws proper exceptions in case ofcompareis invoked. (Also update the statistics logic.) In this case it might not be obvious for the client code to determine whether the comparator is undefined or not.
I would prefer the latter one. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion the best approach would be to throw a checked exception. This way API consumers will be forced to handle the lack of comparators. This is important, because it can happen for unsupported logical types as detailed in my previous comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should return the comparator that is documented in logical types for consistency. We should just discard statistics for these types when reading and should not produce statistics when writing.
| if (logicalType == OriginalType.DECIMAL) { | ||
| return PrimitiveComparator.SIGNED_BINARY_COMPARATOR; | ||
| } | ||
| return PrimitiveComparator.LEXICOGRAPHICAL_BINARY_COMPARATOR; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to be explicit about the logical types where LEXICOGRAPHICAL_BINARY_COMPARATOR is returned, so that if new logical types get added in the future that require different handling they won't automatically fall into the default case.
| if (logicalType == OriginalType.DECIMAL) { | ||
| return PrimitiveComparator.SIGNED_BINARY_COMPARATOR; | ||
| } | ||
| return PrimitiveComparator.LEXICOGRAPHICAL_BINARY_COMPARATOR; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to be explicit about the logical types where LEXICOGRAPHICAL_BINARY_COMPARATOR is returned, so that if new logical types get added in the future that require different handling they won't automatically fall into the default case.
|
|
||
| /** | ||
| * Returns the generic object representing the min value in the statistics. | ||
| * The self-comparison logic of {@code T} might not proper for the actual logical type (e.g. unsigned int). Use {@link |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) "self-comparison" may not be the best wording. I would rephrase this sentence as "The ordering provided by the compareTo() method of T may not be appropriate for the actual logical type."
|
I just thought of the following scenario: In a future version of Parquet, we introduce a new logical type with a corresponding type-defined ordering. It can be any type, but let's use COLOR as an example. The new library will be able to write the COLOR logical type for the FIXED_LEN_BYTE_ARRAY primary type and it will write correct statistics according to its type-defined order. Older libraries that don't support the COLOR logical type will still be able to read the FIXED_LEN_BYTE_ARRAY data as raw bytes. This is an already existing feature of Parquet. However, since the statistics were calculated using an ordering defined for the COLOR logical type, they will be invalid for FIXED_LEN_BYTE_ARRAY primary type and can lead to dropping false negative values. The problem is that currently there is no way to distinguish between the lack of a saved logical type and an existing saved logical type that is not supported by the library. We should either introduce an explicit enum value for the case when there is no logical type (some possible names: UNSET, RAW or PRIMARY) or an explicit enum value for the case when there is a logical type but it is not supported (some possible names: UNSET or UNSUPPORTED). Based on this, we should ignore the stats if the logical type is unsupported. @rdblue, what do you think? |
| // Have to handle null type to support deprecated methods | ||
| return type != null | ||
| && type.getPrimitiveTypeName() != PrimitiveTypeName.INT96 | ||
| && type.getOriginalType() != OriginalType.INTERVAL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether we should explicitly list all supported logical types here instead of the unsupported ones so that when implementing a new logical type in the future it can not accidentally gain statistics support without an intentional code change here.
| Binary.fromReusedByteBuffer(ByteBuffer.wrap("zzzzzz".getBytes())), // |7A|7A|7A|7A|7A|7A| | ||
| Binary.fromReusedByteBuffer(ByteBuffer.wrap("aazzzzzzaa".getBytes(), 2, 7)), // |7A|7A|7A|7A|7A|7A|61| | ||
| Binary.fromConstantByteBuffer(ByteBuffer.wrap(new byte[]{-128, -128, -128})), // |80|80|80| | ||
| Binary.fromConstantByteBuffer(ByteBuffer.wrap(new byte[]{-128, -128, -1}, 1, 2)) // |80|FF| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for writing these comment, although it would have been enough for the byte[] values. From this particular line, however, one 80 is missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are creating a Binary object with offset 1 and length 2 so no 80 is missing.
zivanfi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the hard work! I approved the pull request, but will wait for Ryan before merging.
|
I'll take a look as soon as I can. Thanks for working on this @gszadovszky! |
|
Just two minor comments. Also, please rebase so we can merge. Thanks @gszadovszky! |
9b3414f to
1ee3eaf
Compare
|
To merge this, the title should match the JIRA issue key. Can you change it to "PARQUET-1025: summary"? When that's done, I should be able to merge. |
|
@rdblue, Pushed an empty changeset with the proper title so default title will be this one after squashing. Anything else I should do to help merging it? |
|
@rdblue, This change builds on the following one-liners; could you please approve them before merging this one?
Thanks! |
Also update parquet-format version
…it tests accordingly
Also rename SIGNED_BINARY_COMPARATOR to a more descriptive name Also added comments for haxa representation of values at unsigned comparison testing
6514a94 to
2a63fcf
Compare
|
@rdblue if you don't have any objections, I will merge this change tomorrow. Thanks! |
|
Sorry for the delay. Last I checked we were still waiting on the title fix. I merged this. |
|
May want to migrate parquet-mr and parquet-format to GitBox which will give you write permissions on GitHub (fixing PR titles, etc.) |
No description provided.