PARQUET-1025: Support new min-max statistics in parquet-mr #435

gszadovszky · 2017-11-09T15:38:19Z

No description provided.

gszadovszky · 2017-11-09T15:45:59Z

Started working on it. It is far not complete.

@rdblue, could you please check it if the direction is fine?
The next steps would be to extend Comparators with the unsigned/lexicographical logics and to use the comparators everywhere in parquet-mr instead of using primitive comparisons or self-comparisons on the values. The latter one seems to be tough. Any idea/suggestion would be welcomed.

rdblue · 2017-11-09T16:43:51Z

parquet-column/src/main/java/org/apache/parquet/column/Comparators.java

+    return BOOLEAN_COMPARATOR;
+  }
+
+  public static LongComparator int64Comparator(OriginalType logicalType) {


I think you should delegate to Type to supply comparators. That way, you have a stronger guarantee that the logical type and physical type match because Types won't allow you to construct an invalid one.

rdblue · 2017-11-09T16:45:40Z

parquet-column/src/main/java/org/apache/parquet/column/Comparators.java

+    /**
+     * Returns the string representation of the specified value for debugging/logging purposes
+     */
+    public String toString(boolean b) {


I know that it is convenient to add toString here, but I don't think it makes sense with the Comparator API. I think it would be better to pass Type to Statistics, then use that type to get a comparator and to do the coercion to string.

rdblue · 2017-11-09T16:50:30Z

parquet-column/src/main/java/org/apache/parquet/column/Comparators.java

+        break;
+      case BINARY:
+        if (logicalType == null)
+          // TODO: return lexicographical comparator


Here's a Comparators class from another project that has the unsigned byte-wise comparator for ByteBuffer, one for CharSequence, and some Java 8 utils for handling nulls.

https://gist.github.com/rdblue/7e80e9700d508d395b004c4c146dbfe2

rdblue · 2017-11-09T16:52:11Z

parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java

+    this(PrimitiveType.PrimitiveTypeName.BINARY, null);
+  }
+
+  BinaryStatistics(PrimitiveType.PrimitiveTypeName type, OriginalType logicalType) {


I think I mentioned this earlier, but I think it would be a bit cleaner to pass the Type instance and use that to get a comparator and coerce to strings.

rdblue · 2017-11-09T17:05:25Z

parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java

    }
  }

+  public static Statistics<?> getStatsBasedOnType(PrimitiveTypeName type, OriginalType logicalType) {


If we are still using Statistics classes based on the primitive type, then why was support for stats other than BinaryStatistics removed from parquet-cli?

Sorry, but I don't get it. Where was the support for stats other than BinaryStatistics removed?

rdblue · 2017-11-09T17:09:29Z

parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java

  abstract public T genericGetMax();

+  /**
+   * Returns the comparator to be used to compare two generic values in the proper way (e.g. unsigned comparison for


This implies that unsigned comparison is correct for int. I think you should instead say "For example, unsigned comparison for UINT32"

rdblue · 2017-11-09T17:12:53Z

parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java

  abstract public T genericGetMin();
+
+  /**
+   * Returns the generic object representing the max value in the statistics. The self comparing logic of the returned


A few clarifications:

Instead of "generic object" just say "max value". The type is clear from the use of generics. If anything, add a @param <T> to the class-level javadoc.

Instead of "self comparing logic", use "Java natural order" or reference compareTo to be explicit.

Instead of "strongly recommended", I think you should explain why: the sort order for logical types may not be the order produced by compareTo. Users should use the logical type's defined order and can use #comparator() to get a Comparator instance that implements that order.

rdblue · 2017-11-09T17:14:17Z

parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Statistics.java

  }

+  /**
+   * The self-comparison logic of {@code T} might not proper for the actual logical type (e.g. unsigned int). Use {@link


This is a useful note, but not javadoc for this method. If you want to add Javadoc here, describe what the method does and then add a note below that. Same for getMax.

rdblue · 2017-11-09T17:16:13Z

parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java


 abstract public class Binary implements Comparable<Binary>, Serializable {

+  public interface ComparatorHelper {


These can be static methods that create correct ByteBuffers and then call the right Comparator<ByteBuffer>. No need to define an interface.

Currently, we have 4 different Binary implementations with 2 inner representations: byte[] and ByteBuffer. So, we need 3 different comparisons: byte[] to byte[], ByteBuffer to ByteBuffer and byte[] to ByteBuffer.
The concept here with this interface is to make it able to have Comparator<Binary> implemented outside of this class.
I would even implement it similarly if we want to have different Binary instances with the proper Comparable<Binary> implementation only in this case the interface would be private.
Meanwhile, I would vote on having separate Comparator<Binary> implementations in the same place where the other primitive comparators are.
What do you think?

My point is that this doesn't need to be a class because these are straight-forward transformations to compare(ByteBuffer, ByteBuffer). The byte[] arguments can be easily wrapped, and the ByteBuffer args should duplicate the buffer and then set the duplicate's position and limit appropriately.

If I understand it correctly, it means that a byte[] backed Binary shall either create new ByteBuffer instances to wrap the byte[] references for each compare(Binary, Binary) invocation or have a ByteBuffer reference cached inside the Binary implementation.
The first one would impact the performance of filtering and creating the statistics the latter one would increase the memory footprint of the binary representation.
What do you think?

Yes, what I'm suggesting would mean wrapping the byte array in a buffer to send it into compare, but I don't think avoiding object creation here is going to be worth it for performance. I think the main concern is to make sure we have as few code paths for this comparison as possible for correctness.

Currently, we have 5 different implementations for the abstract class Binary. I guess, it is for performance reasons.
If we really think wrapping a byte[] into a ByteBuffer for comparing Binary object worth for the clean code then I would suggest completely refactoring the class Binary to have only one implementation which contains a ByteBuffer.

There's quite a bit of refactoring in #390 for Binary to use ByteBuffer more. Lets see where we're at when those two are in.

rdblue · 2017-11-09T17:17:44Z

Thanks for working on this @gszadovszky! I have a few comments, but the overall direction to introduce comparators based on the full type (physical and logical) looks good to me.

rdblue · 2017-11-10T16:55:28Z

parquet-cli/src/main/java/org/apache/parquet/cli/Util.java

-          ((DoubleStatistics) stats).getMin(),
-          ((DoubleStatistics) stats).getMax());
-    } else if (stats instanceof BinaryStatistics) {
+    if (stats instanceof BinaryStatistics) {


This is where the other stats types have been removed. Why are only binary statistics supported here?

Now, I got it. There was another concept I've started with and I've forgot to revert this change. I'll do it in the next commit.
Thanks for the finding.

Having a single Statistics class would be nice, if it works out when the comparator interfaces are done.

I've designed the new PrimitiveComparator super class to have primitive comparisons to avoid the unnecessary boxing of primitive types. So, I would keep the current specialised Statistics classes for performance reasons.

rdblue · 2017-11-13T16:40:00Z

parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java

-    return !hasNonNullValue() || ((min.length() + max.length()) < size);
+  String toString(Binary value) {
+    // TODO: have separate toString for different logical types?
+    return value.toStringUsingUTF8();


Delegating to type would be better here, and would work when this Binary is used for Decimal.

rdblue · 2017-11-13T16:42:11Z

parquet-column/src/main/java/org/apache/parquet/column/statistics/LongStatistics.java


+  @Override
+  public LongComparator comparator() {
+    return LongComparators.NATURAL_COMPARATOR;


No need to override here, the type can provide this.

Good catch, I've missed to remove it after rewriting to use my PrimitiveComparator class.

rdblue · 2017-11-13T16:45:32Z

parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveType.java

+      PrimitiveComparator<?> comparator(OriginalType logicalType) {
+        if (logicalType == OriginalType.UINT_64)
+          // TODO: return unsigned comparator
+          return PrimitiveComparator.SIGNED_INT64_COMPARATOR;


I know this is only temporary, but it's probably a better practice to throw UnsupportedOperationException rather than return the wrong comparator. That way if it is overlooked later, it fails instead of silently introducing correctness problems.

My concept here is to refactor the code to the new design with still passing unit tests. Then, at the final step add the new comparators and fix the existing tests and add more to pass.

Makes sense.

rdblue · 2017-11-13T16:46:58Z

parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveType.java

+      @Override
+      PrimitiveComparator<?> comparator(OriginalType logicalType) {
+        // TODO: what to return here?
+        return PrimitiveComparator.BINARY_COMPARATOR;


INT96 should continue to use signed byte-wise comparison.

rdblue · 2017-11-13T16:48:22Z

parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveType.java

+
+      @Override
+      PrimitiveComparator<?> comparator(OriginalType logicalType) {
+        if (logicalType == OriginalType.UINT_64)


Minor: always use { and } for branches and loops.

zivanfi

Some preliminary comments, I didn't get to end of it yet.

zivanfi · 2017-11-21T15:33:06Z

parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java


+  public BinaryStatistics() {
+    // Creating a fake primitive type to have the proper comparator
+    this(Types.optional(PrimitiveType.PrimitiveTypeName.BINARY).named(""));


Instead of creating a new instance of this type every time, could you create a single one that you reuse every time? (Same for all other Statistics classes.)

+1

Also use a name that makes it clear that this came from stats, in case it shows up elsewhere.

zivanfi · 2017-11-21T16:05:00Z

parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java

 * @author Katya Gonina
 */
-public abstract class Statistics<T extends Comparable<T>> {
+public abstract class Statistics<T extends Comparable<T>> implements Cloneable {


Have you considered alternatives to making this class Cloneable? Effective Java strongly discourages the use of Cloneable. Some of the arguments mentioned there are publicly available here.

Agreed. Better to create a copy() method.

zivanfi · 2017-11-21T16:11:53Z

parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Statistics.java

+  private final Comparator<T> comparator;

-  public Statistics(T min, T max) {
+  public Statistics(T min, T max, Comparator<T> comparator) {


Removing this public constructor is technically a breaking change. If possible, please deprecate it instead. If it is not supposed to be used by clients, please use the InterfaceAudience.Private Yetus annotation.

Currently, we don't have dependency on Yetus but have InterfaceAudience from 3 different dependencies (hadoop, hive, pig). It is not clear to me which pattern should we follow (having our own annotation implementation, use Yetus, use Hadoop etc.) So, I would leave the annotation to a separate change. (It might worth a JIRA for that.)
I'll put back the original constructor as deprecated for now a comment that the constructors are not for public use.

-1 for introducing audience annotations. We don't use those in Parquet. We've been needing a push to clean up the public API for a while now, but introducing the annotations for single classes isn't going to help. If we are going to clean up the API, lets do that.

We would like to introduce it for the whole API, but it's a big task to do in a single chunk, so I think a progressive approach would be a good start. It is much easier to decide the intended audience of a method when it gets added and both developer and reviewers have a context than later on when it requires a massive investigation. For this reason I think it would be good to start adding these annotations for new (and possibly existing but touched) methods/classes. This would also prevent users from misusing these functions in the interim time while we add the annotations to the whole API (progressively, module-by-module, I guess).

rdblue · 2017-11-22T16:51:02Z

parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveComparator.java

+    }
+
+    private int unsignedCompare(byte b1, byte b2) {
+      return toUnsigned(b1) - toUnsigned(b2);


This shouldn't use - to implement comparison because Java integers wrap around instead of throwing an overflow exception. Instead, use something like this (from Comparators gist):

int cmp = Integer.compare( ((int) buf1.get(b1pos + i)) & 0xff, ((int) buf2.get(b2pos + i)) & 0xff);

Here we are dealing with byte values casted to ints. There is no chance for overflows.

rdblue · 2017-11-22T16:52:51Z

parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveComparator.java

+
+    abstract int compare(ByteBuffer b1, ByteBuffer b2);
+
+    final int toUnsigned(byte b) {


Why is this final? An attempt to inline?

Yep. This is an abstract class and the implementation of converting unsigned byte values to ints shall not be overridden.

rdblue · 2017-11-22T16:54:41Z

parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveComparator.java

+    }
+  };
+
+  static final PrimitiveComparator<Binary> SIGNED_BINARY_COMPARATOR = new BinaryComparator() {


I don't think we have a need for this comparator. The regular ByteBuffer comparator implements the signed byte-wise comparison that is required by the spec.

Maybe the name is a bit misleading. This comparator compares the values as signed integers (e.g. BigInteger). We need it if we would like to have similar behaviour for DECIMALs on INT32/INT64 and on FIX_LEN_BYTE_ARRAY/BINARY.
What a regular ByteBuffer comparator or the compareTo method of Binary do is quite useless. They compare byte values one-by-one as signed values which is incorrect either for string like values or numeric values.

To second Gabor's opinion, I agree that this is needed. The byte-wise signed comparison is not appropriate, we need a whole-value signed comparison.

Regarding the comment that the ByteBuffer comparator implements signed byte-wise comparison as required by the spec: The spec actually requires unsigned byte-wise comparison. (These specific lines were last modified by me, but they was present earlier as well.)

Makes sense. If this is a comparator for encoded decimals, integers, or longs, it should have a more specific name.

rdblue · 2017-11-22T16:57:02Z

parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveComparator.java

+public abstract class PrimitiveComparator<T> implements Comparator<T> {
+
+  public int compare(boolean b1, boolean b2) {
+    throw new UnsupportedOperationException();


These should include a helpful message. It is never a good idea to throw an exception with no message. Some code will strip out the stack trace and leave just the message (which is also a bad practice) and this would cause that to produce just null. It's also helpful to explain a little more why this exception was thrown, "compare(boolean,boolean) was called on a non-boolean comparator"

rdblue · 2017-11-22T18:29:17Z

parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java

      long valueCount,
      long totalSize,
      long totalUncompressedSize) {
+    return get(path, new PrimitiveType(Type.Repetition.OPTIONAL, type, ""), codec, encodingStats, encodings, statistics,


Use the Types API instead of constructors. The constructors will probably not be public in the 2.0 API.

rdblue · 2017-11-22T18:29:58Z

pom.xml

    <cascading.version>2.7.1</cascading.version>
    <cascading3.version>3.1.2</cascading3.version>
-    <parquet.format.version>2.3.1</parquet.format.version>
+    <parquet.format.version>2.4.0</parquet.format.version>


If you rebase, you'll pick up this change.

zivanfi · 2017-11-23T18:25:13Z

...t-hadoop/src/test/java/org/apache/parquet/format/converter/TestParquetMetadataConverter.java

+  @Test
+  public void testV2OnlyStats() {
+    testV2OnlyStats(createStats(Types.optional(PrimitiveTypeName.INT32).as(OriginalType.UINT_8).named(""),
+        Byte.MAX_VALUE,


In think for UINT_8 you should use 0x00 as min and 0xFF as max (or some arbitrary values, but certainly not these constants). Although for the decimal you use arbitrary values (±123456), and Byte.MIN_VALUE and Byte.MAX_VALUE are arbitrary values, it is still misleading to use them, because:

They do not seem to be arbitrary values but special values instead, while in reality those values are not special at all in unsigned interpretation (specifically those are not the most extreme values of the allowed range for unsigned bytes).

Byte.MIN_VALUE = -128 = 0x80 and Byte.MAX_VALUE = 127 = 0x7F and in unsigned comparison 0x80 > 0x7F, so the min value is larger the max value.

Similarly for the bigger unsigned types below.

zivanfi · 2017-11-23T18:46:05Z

parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java

+      // NOTE: See docs in CorruptStatistics for explanation of why this check is needed
+      // The sort order is checked to avoid returning min/max stats that are not
+      // valid with the type's sort order. Currently, all stats are aggregated
+      // using a signed ordering, which isn't valid for strings or unsigned ints.


Neither are they valid for decimals or int96 timestamps, you could mention those as well or be more generic in this statements ("which isn't valid for some types, for example strings"). EDIT: Just realized that this comment is not new, it just got moved. Still, I think it would be worth improving.

zivanfi · 2017-11-23T18:53:23Z

parquet-column/src/test/java/org/apache/parquet/schema/TestPrimitiveComparator.java

+    }
+    if (Float.TYPE != exclude) {
+      try {
+        comparator.compare(0F, 0F);


(nit) 0.0f would be easier on the eye, seeing 0F I immediately thought of 0x0F. Same with 0D below.

zivanfi · 2017-11-23T18:55:51Z

parquet-column/src/test/java/org/apache/parquet/schema/TestPrimitiveComparator.java

+      Binary.fromReusedByteArray("azzza".getBytes(), 1, 3),
+      Binary.fromReusedByteBuffer(ByteBuffer.wrap("zzzzzz".getBytes())),
+      Binary.fromReusedByteBuffer(ByteBuffer.wrap("aazzzzzzaa".getBytes(), 2, 7)),
+      Binary.fromConstantByteBuffer(ByteBuffer.wrap(new byte[]{-128, -128, -128})),


(nit) Although it's usually the other way around, in this case using hexadecimal literals would make these values more human-readable.

Hexadecimal literals in Java are integers which means while -128 does not required casting to byte 0x80 would. Do you think it would still be readable if I write new byte[]{(byte) 0x80, (byte) 0x80, (byte) 0xFF}?

You're right. Optional: Please consider adding the hex values in comments, like // 0x8080FF

I prefer the option with casting for small values like these.

zivanfi · 2017-11-23T19:09:40Z

parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveType.java

+
+      @Override
+      PrimitiveComparator<?> comparator(OriginalType logicalType) {
+        return PrimitiveComparator.SIGNED_BINARY_COMPARATOR;


This should be undefined according to PARQUET-1065.

Implementing undefined comparator logic is not a simple effort and a bit error prone. I can see two options:

Returning null wherever we provide comparators and handle null comparator in statistics logic to not to update min-max neither load or save it. In this case NullPointerException might occur in client code which is misleading.

Returning a special implementation of the PrimitiveComparator which throws proper exceptions in case of compare is invoked. (Also update the statistics logic.) In this case it might not be obvious for the client code to determine whether the comparator is undefined or not.
I would prefer the latter one. What do you think?

In my opinion the best approach would be to throw a checked exception. This way API consumers will be forced to handle the lack of comparators. This is important, because it can happen for unsupported logical types as detailed in my previous comment.

I think this should return the comparator that is documented in logical types for consistency. We should just discard statistics for these types when reading and should not produce statistics when writing.

zivanfi · 2017-11-23T19:14:18Z

parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveType.java

+        if (logicalType == OriginalType.DECIMAL) {
+          return PrimitiveComparator.SIGNED_BINARY_COMPARATOR;
+        }
+        return PrimitiveComparator.LEXICOGRAPHICAL_BINARY_COMPARATOR;


I would prefer to be explicit about the logical types where LEXICOGRAPHICAL_BINARY_COMPARATOR is returned, so that if new logical types get added in the future that require different handling they won't automatically fall into the default case.

zivanfi · 2017-11-23T19:15:02Z

parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveType.java

+        if (logicalType == OriginalType.DECIMAL) {
+          return PrimitiveComparator.SIGNED_BINARY_COMPARATOR;
+        }
+        return PrimitiveComparator.LEXICOGRAPHICAL_BINARY_COMPARATOR;


I would prefer to be explicit about the logical types where LEXICOGRAPHICAL_BINARY_COMPARATOR is returned, so that if new logical types get added in the future that require different handling they won't automatically fall into the default case.

zivanfi · 2017-11-23T19:19:17Z

parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Statistics.java

+
+  /**
+   * Returns the generic object representing the min value in the statistics.
+   * The self-comparison logic of {@code T} might not proper for the actual logical type (e.g. unsigned int). Use {@link


(nit) "self-comparison" may not be the best wording. I would rephrase this sentence as "The ordering provided by the compareTo() method of T may not be appropriate for the actual logical type."

zivanfi · 2017-11-27T13:03:38Z

I just thought of the following scenario: In a future version of Parquet, we introduce a new logical type with a corresponding type-defined ordering. It can be any type, but let's use COLOR as an example. The new library will be able to write the COLOR logical type for the FIXED_LEN_BYTE_ARRAY primary type and it will write correct statistics according to its type-defined order. Older libraries that don't support the COLOR logical type will still be able to read the FIXED_LEN_BYTE_ARRAY data as raw bytes. This is an already existing feature of Parquet. However, since the statistics were calculated using an ordering defined for the COLOR logical type, they will be invalid for FIXED_LEN_BYTE_ARRAY primary type and can lead to dropping false negative values.

The problem is that currently there is no way to distinguish between the lack of a saved logical type and an existing saved logical type that is not supported by the library. We should either introduce an explicit enum value for the case when there is no logical type (some possible names: UNSET, RAW or PRIMARY) or an explicit enum value for the case when there is a logical type but it is not supported (some possible names: UNSET or UNSUPPORTED). Based on this, we should ignore the stats if the logical type is unsupported.

@rdblue, what do you think?

zivanfi · 2017-11-29T10:36:38Z

parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java

+    // Have to handle null type to support deprecated methods
+    return type != null
+        && type.getPrimitiveTypeName() != PrimitiveTypeName.INT96
+        && type.getOriginalType() != OriginalType.INTERVAL;


I wonder whether we should explicitly list all supported logical types here instead of the unsupported ones so that when implementing a new logical type in the future it can not accidentally gain statistics support without an intentional code change here.

zivanfi · 2017-11-29T11:16:00Z

parquet-column/src/test/java/org/apache/parquet/schema/TestPrimitiveComparator.java

+      Binary.fromReusedByteBuffer(ByteBuffer.wrap("zzzzzz".getBytes())),                // |7A|7A|7A|7A|7A|7A|
+      Binary.fromReusedByteBuffer(ByteBuffer.wrap("aazzzzzzaa".getBytes(), 2, 7)),      // |7A|7A|7A|7A|7A|7A|61|
+      Binary.fromConstantByteBuffer(ByteBuffer.wrap(new byte[]{-128, -128, -128})),     // |80|80|80|
+      Binary.fromConstantByteBuffer(ByteBuffer.wrap(new byte[]{-128, -128, -1}, 1, 2))  // |80|FF|


Thanks for writing these comment, although it would have been enough for the byte[] values. From this particular line, however, one 80 is missing.

Here we are creating a Binary object with offset 1 and length 2 so no 80 is missing.

zivanfi

LGTM, thanks for the hard work! I approved the pull request, but will wait for Ryan before merging.

rdblue · 2017-11-29T22:30:15Z

I'll take a look as soon as I can. Thanks for working on this @gszadovszky!

rdblue · 2017-12-20T20:08:12Z

Just two minor comments. Also, please rebase so we can merge. Thanks @gszadovszky!

rdblue · 2017-12-22T20:15:59Z

To merge this, the title should match the JIRA issue key. Can you change it to "PARQUET-1025: summary"? When that's done, I should be able to merge.

gszadovszky · 2018-01-04T13:44:24Z

@rdblue, Pushed an empty changeset with the proper title so default title will be this one after squashing. Anything else I should do to help merging it?

zivanfi · 2018-01-04T15:14:52Z

@rdblue, This change builds on the following one-liners; could you please approve them before merging this one?

PARQUET-1064: Deprecate type-defined sort ordering for INTERVAL type. parquet-format#76
PARQUET-1065: Deprecate type-defined sort ordering for INT96 type parquet-format#77
PARQUET-1065: Deprecate type-defined sort ordering for INT96 type. #436

Thanks!

Also update parquet-format version

…tors as needed

…elated changes

…it tests accordingly

Also rename SIGNED_BINARY_COMPARATOR to a more descriptive name Also added comments for haxa representation of values at unsigned comparison testing

…nstances

zivanfi · 2018-01-11T14:11:22Z

@rdblue if you don't have any objections, I will merge this change tomorrow. Thanks!

rdblue · 2018-01-13T00:31:27Z

Sorry for the delay. Last I checked we were still waiting on the title fix. I merged this.

wesm · 2018-01-16T03:32:48Z

May want to migrate parquet-mr and parquet-format to GitBox which will give you write permissions on GitHub (fixing PR titles, etc.)

rdblue reviewed Nov 9, 2017

View reviewed changes

rdblue reviewed Nov 10, 2017

View reviewed changes

rdblue reviewed Nov 13, 2017

View reviewed changes

zivanfi reviewed Nov 21, 2017

View reviewed changes

rdblue reviewed Nov 22, 2017

View reviewed changes

gszadovszky force-pushed the PARQUET-1025 branch from 5f99e56 to 8bdf16f Compare November 23, 2017 10:27

zivanfi reviewed Nov 23, 2017

View reviewed changes

zivanfi reviewed Nov 29, 2017

View reviewed changes

zivanfi approved these changes Nov 29, 2017

View reviewed changes

gszadovszky force-pushed the PARQUET-1025 branch from 9b3414f to 1ee3eaf Compare December 21, 2017 10:13

gszadovszky changed the title ~~Parquet 1025: Support new min-max statistics in parquet-mr~~ PARQUET-1025: Support new min-max statistics in parquet-mr Jan 9, 2018

Gabor Szadovszky added 17 commits January 10, 2018 15:17

PARQUET-1025: Refactor Binary to prepare from custom comparators

e1719bb

Also update parquet-format version

PARQUET-1025: Implement comparators and use them with statistics

3378b6d

PARQUET-1025: Move comparators to Type

52cd58f

PARQUET-1025: reading/writing new min-max statistics; use the compara…

20b937f

…tors as needed

PARQUET-1025: Add the proper comparators as required; revert Binary r…

51bc1f8

…elated changes

PARQUET-1025: Updates according to zi's and rdblue's comments

688ef2e

PARQUET-1025: Finalize reading/writing new stats; modify/implement un…

318e585

…it tests accordingly

PARQUET-1025: Some modifications according to zi's comments

c5536a0

PARQUET-1025: Finalize read/write stats updates

2f28c2c

PARQUET-1025: Use lexicographical comparison for Binary.compareTo

95199e5

Also rename SIGNED_BINARY_COMPARATOR to a more descriptive name Also added comments for haxa representation of values at unsigned comparison testing

PARQUET-1025: Add explicit list of types to not to read/write statistics

70e56a7

PARQUET-1025: Updates according to rdblue's comments

bc86e8a

PARQUET-1025: Unified formatting/comments/deprecation

a2ae97c

PARQUET-1025: Some updates for zi's findings

524750b

PARQUET-1025: Implement ColumnOrder; other updates for rdblue's findings

dc838f2

PARQUET-1025: Minor fixes at data generation for TestStatistics

820df6f

PARQUET-1025: Use constant instead of creating new TypeDefinedOrder i…

2a63fcf

…nstances

gszadovszky force-pushed the PARQUET-1025 branch from 6514a94 to 2a63fcf Compare January 10, 2018 14:32

asfgit closed this in c6764c4 Jan 13, 2018

asfimport mentioned this pull request Jun 23, 2024

Support new min-max statistics in parquet-mr #2065

Closed


		abstract public class Binary implements Comparable<Binary>, Serializable {

		public interface ComparatorHelper {


		abstract int compare(ByteBuffer b1, ByteBuffer b2);

		final int toUnsigned(byte b) {

PARQUET-1025: Support new min-max statistics in parquet-mr #435

PARQUET-1025: Support new min-max statistics in parquet-mr #435

Uh oh!

Conversation

gszadovszky commented Nov 9, 2017

Uh oh!

gszadovszky commented Nov 9, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rdblue Nov 9, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rdblue commented Nov 9, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zivanfi left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

rdblue Nov 9, 2017 •

edited

Loading