apache · zivanfi · Jun 16, 2017 · Jun 20, 2017 · Jun 20, 2017 · Jul 6, 2017
diff --git a/LogicalTypes.md b/LogicalTypes.md
@@ -37,7 +37,7 @@ may require additional metadata fields, as well as rules for those fields.
 `UTF8` may only be used to annotate the binary primitive type and indicates
 that the byte array should be interpreted as a UTF-8 encoded character string.
 
-The sort order used for `UTF8` strings is `UNSIGNED` byte-wise comparison.
+The sort order used for `UTF8` strings is unsigned byte-wise comparison.
 
 ## Numeric Types
 
@@ -57,7 +57,7 @@ allows.
 implied by the `int32` and `int64` primitive types if no other annotation is
 present and should be considered optional.
 
-The sort order used for signed integer types is `SIGNED`.
+The sort order used for signed integer types is signed.
 
 ### Unsigned Integers
 
@@ -74,7 +74,7 @@ allows.
 `UINT_8`, `UINT_16`, and `UINT_32` must annotate an `int32` primitive type and
 `UINT_64` must annotate an `int64` primitive type.
 
-The sort order used for unsigned integer types is `UNSIGNED`.
+The sort order used for unsigned integer types is unsigned.
 
 ### DECIMAL
 
@@ -104,8 +104,8 @@ integer. A precision too large for the underlying type (see below) is an error.
 A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both
 `scale` and `precision` fields set, even if scale is 0 by default.
 
-The sort order used for `DECIMAL` values is `SIGNED`. The order is equivalent
-to signed comparison of decimal values.
+The sort order used for `DECIMAL` values is signed comparison of the represented
+value.
 
 If the column uses `int32` or `int64` physical types, then signed comparison of
 the integer values produces the correct ordering. If the physical type is
@@ -121,39 +121,39 @@ comparison.
 annotate an `int32` that stores the number of days from the Unix epoch, 1
 January 1970.
 
-The sort order used for `DATE` is `SIGNED`.
+The sort order used for `DATE` is signed.
 
 ### TIME\_MILLIS
 
 `TIME_MILLIS` is used for a logical time type with millisecond precision,
 without a date. It must annotate an `int32` that stores the number of
 milliseconds after midnight.
 
-The sort order used for `TIME\_MILLIS` is `SIGNED`.
+The sort order used for `TIME\_MILLIS` is signed.
 
 ### TIME\_MICROS
 
 `TIME_MICROS` is used for a logical time type with microsecond precision,
 without a date. It must annotate an `int64` that stores the number of
 microseconds after midnight.
 
-The sort order used for `TIME\_MICROS` is `SIGNED`.
+The sort order used for `TIME\_MICROS` is signed.
 
 ### TIMESTAMP\_MILLIS
 
 `TIMESTAMP_MILLIS` is used for a combined logical date and time type, with
 millisecond precision. It must annotate an `int64` that stores the number of
 milliseconds from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC.
 
-The sort order used for `TIMESTAMP\_MILLIS` is `SIGNED`.
+The sort order used for `TIMESTAMP\_MILLIS` is signed.
 
 ### TIMESTAMP\_MICROS
 
 `TIMESTAMP_MICROS` is used for a combined logical date and time type with
 microsecond precision. It must annotate an `int64` that stores the number of
 microseconds from the Unix epoch, 00:00:00.000000 on 1 January 1970, UTC.
 
-The sort order used for `TIMESTAMP\_MICROS` is `SIGNED`.
+The sort order used for `TIMESTAMP\_MICROS` is signed.
 
 ### INTERVAL
 
@@ -169,7 +169,7 @@ example, there is no requirement that a large number of days should be
 expressed as a mix of months and days because there is not a constant
 conversion from days to months.
 
-The sort order used for `INTERVAL` is `UNSIGNED`, produced by sorting by
+The sort order used for `INTERVAL` is unsigned, produced by sorting by
 the value of months, then days, then milliseconds with unsigned comparison.
 
 ## Embedded Types
@@ -184,6 +184,8 @@ string of valid JSON as defined by the [JSON specification][json-spec]
 
 [json-spec]: http://json.org/
 
+The sort order used for `JSON` is unsigned byte-wise comparison.
+
 ### BSON
 
 `BSON` is used for an embedded BSON document. It must annotate a `binary`
@@ -192,6 +194,8 @@ defined by the [BSON specification][bson-spec].
 
 [bson-spec]: http://bsonspec.org/spec.html
 
+The sort order used for `BSON` is unsigned byte-wise comparison.
+
 ## Nested Types
 
 This section specifies how `LIST` and `MAP` can be used to encode nested types

diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
@@ -28,17 +28,6 @@ namespace java org.apache.parquet.format
  * with the encodings to control the on disk storage format.
  * For example INT16 is not included as a type since a good encoding of INT32
  * would handle this.
- *
- * When a logical type is not present, the type-defined sort order of these
- * physical types are:
- * * BOOLEAN - false, true
- * * INT32 - signed comparison
- * * INT64 - signed comparison
- * * INT96 - signed comparison
- * * FLOAT - signed comparison
- * * DOUBLE - signed comparison
- * * BYTE_ARRAY - unsigned byte-wise comparison
- * * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
  */
 enum Type {
   BOOLEAN = 0;
@@ -219,12 +208,12 @@ struct Statistics {
     * Values are encoded using PLAIN encoding, except that variable-length byte
     * arrays do not include a length prefix.
     *
-    * These fields encode min and max values determined by SIGNED comparison
+    * These fields encode min and max values determined by signed comparison
     * only. New files should use the correct order for a column's logical type
     * and store the values in the min_value and max_value fields.
     *
     * To support older readers, these may be set when the column order is
-    * SIGNED.
+    * signed.
     */
    1: optional binary max;
    2: optional binary min;
@@ -582,7 +571,9 @@ struct RowGroup {
 struct TypeDefinedOrder {}
 
 /**
- * Union to specify the order used for min, max, and sorting values in a column.
+ * Union to specify the order used for the min_value and max_value fields for a
+ * column. This union takes the role of an enhanced enum that allows rich
+ * elements (which will be needed for a collation-based ordering in the future).
  *
  * Possible values are:
  * * TypeDefinedOrder - the column uses the order defined by its logical or
@@ -592,6 +583,41 @@ struct TypeDefinedOrder {}
  * for this column should be ignored.
  */
 union ColumnOrder {
+
+  /**
+   * The sort orders for logical types are:
+   *   UTF8 - unsigned byte-wise comparison
+   *   INT8 - signed comparison
+   *   INT16 - signed comparison
+   *   INT32 - signed comparison
+   *   INT64 - signed comparison
+   *   UINT8 - unsigned comparison
+   *   UINT16 - unsigned comparison
+   *   UINT32 - unsigned comparison
+   *   UINT64 - unsigned comparison
+   *   DECIMAL - signed comparison of the represented value
+   *   DATE - signed comparison
+   *   TIME_MILLIS - signed comparison
+   *   TIME_MICROS - signed comparison
+   *   TIMESTAMP_MILLIS - signed comparison
+   *   TIMESTAMP_MICROS - signed comparison
+   *   INTERVAL - unsigned comparison
+   *   JSON - unsigned byte-wise comparison
+   *   BSON - unsigned byte-wise comparison
+   *   ENUM - unsigned byte-wise comparison
+   *   LIST - undefined
+   *   MAP - undefined
+   *
+   * In the absence of logical types, the sort order is determined by the physical type:
+   *   BOOLEAN - false, true
+   *   INT32 - signed comparison
+   *   INT64 - signed comparison
+   *   INT96 (only used for legacy timestamps) - unsigned comparison
+   *   FLOAT - signed comparison of the represented value
+   *   DOUBLE - signed comparison of the represented value
+   *   BYTE_ARRAY - unsigned byte-wise comparison
+   *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
+   */
   1: TypeDefinedOrder TYPE_ORDER;
 }
 
@@ -626,11 +652,16 @@ struct FileMetaData {
   6: optional string created_by
 
   /**
-   * Sort order used for each column in this file.
+   * Sort order used for the min_value and max_value fields of each column in
+   * this file. Each sort order corresponds to one column, determined by its
+   * position in the list, matching the position of the column in the schema.
+   *
+   * Without column_orders, the meaning of the min_value and max_value fields is
+   * undefined. To ensure well-defined behaviour, if min_value and max_value are
+   * written to a Parquet file, column_orders must be written as well.
    *
-   * If this list is not present, then the order for each column is assumed to
-   * be Signed. In addition, min and max values for INTERVAL or DECIMAL stored
-   * as fixed or bytes should be ignored.
+   * The obsolete min and max fields are always sorted by signed comparison
+   * regardless of column_orders.
    */
   7: optional list<ColumnOrder> column_orders;
 }