Description
Currently parquet-format specifies the sort order for floating point numbers as follows:
* FLOAT - signed comparison of the represented value
* DOUBLE - signed comparison of the represented value

The problem is that the comparison of floating point numbers is only a partial ordering with strange behaviour in specific corner cases. For example, according to IEEE 754, -0 is neither less nor more than +0 and comparing NaN to anything always returns false. This ordering is not suitable for statistics. Additionally, the Java implementation already uses a different (total) ordering that handles these cases correctly but differently than the C++ implementations, which leads to interoperability problems.
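The corner cases above can be reproduced in a few lines. This is only an illustrative sketch in Python, not code from any Parquet implementation; `naive_min_max` is a hypothetical helper that computes statistics with plain `<` and `>`, the way a straightforward writer might.

```python
import math

nan = float("nan")
neg_zero, pos_zero = -0.0, 0.0

# -0 and +0 compare as equal, so neither is "less" than the other...
zero_equal = neg_zero == pos_zero
zero_less = neg_zero < pos_zero
# ...yet they are distinct values with different sign bits.
signs = (math.copysign(1.0, neg_zero), math.copysign(1.0, pos_zero))

# Every ordered comparison involving NaN is false, and NaN != NaN.
nan_results = [nan < 1.0, nan > 1.0, nan == nan]


def naive_min_max(values):
    """Hypothetical min/max statistics using plain IEEE 754 comparisons."""
    lo = hi = values[0]
    for v in values[1:]:
        if v < lo:
            lo = v
        if v > hi:
            hi = v
    return lo, hi


# The same set of values yields different statistics depending on
# where the NaN appears in the input.
stats_a = naive_min_max([nan, 1.0, 2.0])  # NaN first: both stay NaN
stats_b = naive_min_max([1.0, nan, 2.0])  # NaN later: silently skipped

print(zero_equal, zero_less, signs)
print(nan_results)
print(stats_a, stats_b)
```

Because the result depends on input order, two writers can produce different min/max values for identical data, which is why such statistics cannot be trusted for filtering.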
TypeDefinedOrder for doubles and floats should be deprecated and a new TotalFloatingPointOrder should be introduced. The default for writing doubles and floats would be the new TotalFloatingPointOrder. This ordering should be effective and easy to implement in all programming languages.
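The issue does not spell out what the new ordering should be. As one possible candidate, the sketch below follows the bit-pattern trick commonly used to implement the IEEE 754 `totalOrder` predicate: reinterpret the double as a signed 64-bit integer and, for negative values, flip the lower 63 bits. The name `total_order_key` is hypothetical, and this is an assumption about what a total order could look like, not the definition of TotalFloatingPointOrder.

```python
import math
import struct


def total_order_key(x: float) -> int:
    """Map a double to an int whose natural order is a total order.

    Hypothetical helper. Resulting order: negative NaNs < -inf <
    negative numbers < -0 < +0 < positive numbers < +inf < positive NaNs.
    """
    # Reinterpret the 8 bytes of the double as a signed 64-bit integer.
    bits = struct.unpack(">q", struct.pack(">d", x))[0]
    if bits < 0:
        # Sign bit set: flip the magnitude bits so that larger
        # magnitudes sort earlier among negative values.
        bits ^= 0x7FFFFFFFFFFFFFFF
    return bits


# Force a positive-sign NaN so the example does not depend on the platform.
pos_nan = math.copysign(float("nan"), 1.0)
values = [1.0, pos_nan, -0.0, 0.0, -math.inf, math.inf, -1.0]

ordered = sorted(values, key=total_order_key)
print(ordered)  # [-inf, -1.0, -0.0, 0.0, 1.0, inf, nan]
```

Since the key is a plain integer, the same comparison can be written with integer operations alone in most languages, which fits the requirement that the ordering be easy to implement everywhere.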
Reporter: Zoltan Ivanfi / @zivanfi
Assignee: Micah Kornfield / @emkornfield
Related issues:
- Impala shouldn't write column indexes for float columns until PARQUET-1222 is resolved (Blocked)
- Implement specification-compliant floating point comparison (blocks)
- [parquet-mr] Implement specification-compliant floating point comparison (blocks)
- [C++] Implement specification-compliant floating point comparison (blocks)
- [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down (is related to)
- Ignore float/double statistics in case of NaN (is related to)
- Clarify ambiguous min/max stats for FLOAT/DOUBLE (is related to)
Note: This issue was originally created as PARQUET-1222. Please see the migration documentation for further details.