-
Notifications
You must be signed in to change notification settings - Fork 33
Add RFC for a new NaN definition #9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
tdcmeehan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this proposal to be a reasonable fix to a complicated problem. I think there could be some light copy editing for posterity, and I'm tagging @steveburnett to see if he has some suggestions.
RFC-0006-nan-definition.md
Outdated
|
|
||
| For sorting, again the basic comparisons follow IEEE-754. where <,>, <=, >= return false. Beyond that there is a considerable degree of variation across functions and operators. min, max, min_by, max_by, min(x, n) and max(x,n) have inconsistent behavior depending on when NaN is encountered. Greatest and Least throw an error on NaN input. Map_top_n returns wrong results if NaN shows up in the input. Order by sorts NaN as larger than infinity, and most array and map sorting and topn functions consider NaN as largest as well. However, array_min and array_max return NaN if any of the input is NaN. | ||
|
|
||
| When it comes to pushing filters into the scan. Tuple domains, I believe would consider NaN largest (though needs a bit more review to be certain). tupleDomainFilters testDouble() will return false if the value is NaN unless the filter is a NOT IN (following the IEEE definition and Presto's behavior for basic operators). Orc files, at least as written by Presto and Velox won't include min/max stats if any of the values are NaN. Parquet files written by Presto, I believe will consider NaN largest in their min/max stats, but other parquet writers behave differently, and Presto may read files written by other engines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it's more desirable for NaN to be excluded from min/max stats for both ORC and Parquet? One NaN seems to render the stat useless, at least in Parquet: apache/parquet-format#88
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to exclude NaN or to exclude min/max stats if there are NaNs? i think for the file format we should follow whatever the file format expects if its defined and then make sure to handle it appropriately in Presto.
steveburnett
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! The structure, content, flow of ideas, and internal logic seems solid and this draft reads well as a result.
The bulk of my comments are nits about punctuation and capitalization, and some small suggestions for rephrasing to improve readability.
One item I didn't address in my comments was I noticed what appears to be some inconsistency in initial capitalization of the word "infinity". I'd suggest reviewing all uses of the word and changing to "Infinity" as appropriate for the meaning of the word in the sentence.
rschlussel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review @steveburnett and @tdcmeehan!
RFC-0006-nan-definition.md
Outdated
|
|
||
| For sorting, again the basic comparisons follow IEEE-754. where <,>, <=, >= return false. Beyond that there is a considerable degree of variation across functions and operators. min, max, min_by, max_by, min(x, n) and max(x,n) have inconsistent behavior depending on when NaN is encountered. Greatest and Least throw an error on NaN input. Map_top_n returns wrong results if NaN shows up in the input. Order by sorts NaN as larger than infinity, and most array and map sorting and topn functions consider NaN as largest as well. However, array_min and array_max return NaN if any of the input is NaN. | ||
|
|
||
| When it comes to pushing filters into the scan. Tuple domains, I believe would consider NaN largest (though needs a bit more review to be certain). tupleDomainFilters testDouble() will return false if the value is NaN unless the filter is a NOT IN (following the IEEE definition and Presto's behavior for basic operators). Orc files, at least as written by Presto and Velox won't include min/max stats if any of the values are NaN. Parquet files written by Presto, I believe will consider NaN largest in their min/max stats, but other parquet writers behave differently, and Presto may read files written by other engines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to exclude NaN or to exclude min/max stats if there are NaNs? i think for the file format we should follow whatever the file format expects if its defined and then make sure to handle it appropriately in Presto.
RFC-0006-nan-definition.md
Outdated
| #### Postgresql | ||
|
|
||
| The NaN (not a number) value is used to represent undefined calculational results. In general, any operation with a NaN input yields another NaN. The only exception is when the operation's other inputs are such that the same output would be obtained if the NaN were to be replaced by any finite or infinite numeric value; then, that output value is used for NaN too. (An example of this principle is that NaN raised to the zero power yields one.) | ||
| Note: In most implementations of the “not-a-number” concept, NaN is not considered equal to any other numeric value (including NaN). In order to allow numeric values to be sorted and used in tree-based indexes, PostgreSQL treats NaN values as equal, and greater than all non-NaN values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wish I could take credit, but it's a direct quote from the PostgreSQL docs
RFC-0006-nan-definition.md
Outdated
|
|
||
| Here we give a short summary of Presto's current behavior. There is a complete review here: https://docs.google.com/spreadsheets/d/1KclsskH88CLHMiu_Q2P25S3QfPc9H3Tu7h_9MXP10Dk/edit#gid=128250498 | ||
|
|
||
| Currently basic comparison operators follow IEEE-754. = returns false. <> returns true. Joins on NaN keys do not match. For maps and arrays, NaNs are considered distinct from each other, and will appear multiple times in a set_agg, union, or even as map_keys. However, for group bys, distinct, and distinct aggregations, NaNs are deduplicated (there will only be one group for all NaNs, they are not considered distinct from each other). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did a variation on this suggestion.
steveburnett
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work on the revision, thanks for the quick turnaround! I did another complete review of the revised draft and I found only a few final nits for you to consider.
RFC-0006-nan-definition.md
Outdated
| >…. | ||
| >Operations on numbers are performed according to the normal rules of arithmetic, within implementation- defined limits, except as provided for in Subclause 6.29, “<numeric value expression>”. | ||
|
|
||
| Some read the above as disallowing NaNs because floating point types are defined only as mantissa and exponent, and there are no special values defined like NaN, infinity, and -infinity. I don't think it's clear one way or the other, and it looks like most dbs do support NaN values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest expanding "dbs" here, and throughout as well ("DB" elsewhere).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. thanks!
steveburnett
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry! I missed this earlier, one final formatting nit. I just did another complete review of the document and I don't see anything else to mention.
czentgr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great write-up! Thanks for that!
RFC-0006-nan-definition.md
Outdated
| Most array and map sorting and topn functions consider NaN as largest | ||
| `array_min()` and `array_max()` return NaN if any of the input is NaN | ||
|
|
||
| When it comes to pushing filters into the scan. Tuple domains, I believe would consider NaN largest (though needs a bit more review to be certain). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need more investigation/check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to do a bit more digging to see what the current behavior is/what changes are needed, but i don't consider it a blocker for the RFC, which is about what we're moving too.
RFC-0006-nan-definition.md
Outdated
| When it comes to pushing filters into the scan. Tuple domains, I believe would consider NaN largest (though needs a bit more review to be certain). | ||
| tupleDomainFilters testDouble() will return false if the value is NaN unless the filter is a NOT IN (following the IEEE definition and Presto's behavior for basic operators). | ||
| Orc files, at least as written by Presto and Velox won't include min/max stats if any of the values are NaN. | ||
| Parquet files written by Presto, I believe will consider NaN largest in their min/max stats, but other parquet writers behave differently, and Presto may read files written by other engines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Velox I checked and it doesn't include NaN in the min/max
parquet cat 20240315_172625_00005_juhzy.1.0.0.0_2_5_1e829c06-25b5-4451-8348-07abcb7ebf25.parquet
{"col1": 1.0}
{"col1": 2.0}
{"col1": "NaN"}
{"col1": 3.0}
{"col1": "NaN"}
{"col1": 4.0}
{"col1": "NaN"}
parquet meta 20240315_172625_00005_juhzy.1.0.0.0_2_5_1e829c06-25b5-4451-8348-07abcb7ebf25.parquet
File path: 20240315_172625_00005_juhzy.1.0.0.0_2_5_1e829c06-25b5-4451-8348-07abcb7ebf25.parquet
Created by: parquet-cpp-velox
Properties: (none)
Schema:
message schema {
optional double col1;
}
Row group 0: count: 7 20.71 B records start: 4 total(compressed): 145 B total(uncompressed):126 B
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
col1 DOUBLE G _ R 7 20.71 B 0 "1.0" / "4.0"
The result for Java is interesting. The min/max is not set at all. Because NaN are present?
parquet meta 20240315_184142_00003_tqpac_a024760a-431c-4013-bf14-1a8ab2f55639
File path: 20240315_184142_00003_tqpac_a024760a-431c-4013-bf14-1a8ab2f55639
Created by: parquet-mr version 1.13.1 (build db4183109d5b734ec5930d870cdae161e408ddba)
Properties: (none)
Schema:
message hive_schema {
optional double col1;
}
Row group 0: count: 7 16.00 B records start: 4 total(compressed): 112 B total(uncompressed):93 B
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
col1 DOUBLE G _ R 7 16.00 B 0
Data
parquet cat 20240315_184142_00003_tqpac_a024760a-431c-4013-bf14-1a8ab2f55639
{"col1": 10.0}
{"col1": 20.0}
{"col1": "NaN"}
{"col1": 30.0}
{"col1": "NaN"}
{"col1": 40.0}
{"col1": "NaN"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for this! I'll update with this information.
RFC-0006-nan-definition.md
Outdated
|
|
||
| https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Data-Types.html#GUID-33A52FDB-BA5C-474E-96D3-40390BA5F5F4 | ||
|
|
||
| #### DB2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section describes Db2 z/OS. But there are other variants of Db2 e.g. Db2 LUW and Db2 for i and they do have differences though for this case, it appears that they behave the same.
Perhaps we can point put that both Db2 LUW and Db2 z/OS behave in the same way with regards to DOUBLE/FLOAT and DECFLOAT. I did not find the equivalent docs for Db2 for i (and it too niche to be relevant here).
Relevant Db2 LUW documentation matching the Db2 z/OS documentation
https://www.ibm.com/docs/en/db2/11.5?topic=list-numbers#r0008469__title__7
https://www.ibm.com/docs/en/db2/11.5?topic=list-numbers#r0008469__title__8
https://www.ibm.com/docs/en/db2/11.5?topic=types-assignments-comparisons#r0008479__numcomp__title__1
As for the title I suggest "Db2 for z/OS and Db2 LUW"
RFC-0006-nan-definition.md
Outdated
|
|
||
| ## Goals | ||
|
|
||
| * Define a consistent behavior for ordering and comparisons of Nans in Presto |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: NaNs.
steveburnett
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! (docs)
Thanks again for the quick update! Reviewed the updated file and it looks great.
|
@rschlussel : Nice writeup. Was curious if you have researched Trino. Do they have the same issues as Presto ? I was intrigued since I saw in the description of Project Hummingbird that they intend to add a runtime optimization to eliminate nan handling checks for data without nan's |
elharo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assorted nits on phrasing, but +1 on the meat of the proposal. This sounds like a consistent and reliable solution.
RFC-0006-nan-definition.md
Outdated
|
|
||
| ### Current Presto behavior | ||
|
|
||
| Here we give a short summary of Presto's current behavior. There is a complete review here: https://docs.google.com/spreadsheets/d/1KclsskH88CLHMiu_Q2P25S3QfPc9H3Tu7h_9MXP10Dk/edit#gid=128250498 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to make this public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still need to edit the formatting a bit, but made a wiki for it https://github.com/prestodb/presto/wiki/Presto-NaN-behavior. Thanks for the suggestion.
RFC-0006-nan-definition.md
Outdated
|
|
||
| #### Comparison | ||
|
|
||
| = returns false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not totally clear that you're talking about NaN = NaN and NaN <> NaN. There are also the cases where NaN = 7.2 or NaN <> 56.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of the above. Any equality comparison will return false including NaN=NaN and any inequality comparison with another number will return true including NaN <> NaN. I'll specify NaN = NaN and NaN <> NaN here because i think the other cases would be understood from that.
RFC-0006-nan-definition.md
Outdated
|
|
||
| #### Sorting | ||
|
|
||
| <,>, <=, >= all return false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: periods at end of sentences
RFC-0006-nan-definition.md
Outdated
| `map_top_n()` returns wrong results if NaN shows up in the input | ||
| Order by sorts NaN as larger than infinity | ||
| Most array and map sorting and topn functions consider NaN as largest | ||
| `array_min()` and `array_max()` return NaN if any of the input is NaN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of the input --> input value
|
|
||
| When it comes to pushing filters into the scan. Tuple domains, I believe would consider NaN largest (I'm not confident about this, and need to dig a bit more to be certain). | ||
| tupleDomainFilters testDouble() will return false if the value is NaN unless the filter is a NOT IN (following the IEEE definition and Presto's behavior for basic operators). | ||
| Orc files, at least as written by Presto and Velox won't include min/max stats if any of the values are NaN. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comma after Velox
|
|
||
| #### Spark SQL | ||
|
|
||
| There is special handling for not-a-number (NaN) when dealing with float or double types that does not exactly match standard floating point semantics. Specifically: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we tldr this whole proposal as "do what Spark 3.x does"?
| * Define a consistent behavior for ordering and comparisons of NaNs in Presto | ||
| * Make behavior easy to implement and debug | ||
| * Make behavior easy for users to understand | ||
| * Align behavior with Velox for consistency across Java and C++ workers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add Align with Spark?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's a nice to have when choosing between options, but not a primary goal.
RFC-0006-nan-definition.md
Outdated
|
|
||
| We propose defining ordering and comparisons for NaNs such that `NaN = NaN` and sorts biggest. | ||
|
|
||
| With this proposal we stray from the IEEE definition and follow the example of PostgreSQL and others. In this case for all equality purposes such as =, distinctness, grouping, and join keys, we would define NaN=NaN. We would sort NaN as greater than infinity for all purposes, including for min/max and > and < comparisons. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PostgresSQL, Spark, and others.
RFC-0006-nan-definition.md
Outdated
|
|
||
| ## Test Plan | ||
|
|
||
| We will add a test class TestNanQueries with tests to ensure correct behavior for all functions and operations that handle NaNs. The tests will use queries that execute on workers, rather than constants that get interpreted on the coordinator, so that the test can also be used to ensure consistency between native and java workers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Java
RFC-0006-nan-definition.md
Outdated
|
|
||
| ## Test Plan | ||
|
|
||
| We will add a test class TestNanQueries with tests to ensure correct behavior for all functions and operations that handle NaNs. The tests will use queries that execute on workers, rather than constants that get interpreted on the coordinator, so that the test can also be used to ensure consistency between native and java workers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can likely also put NaN tests into existing unit test classes that test GROUP BY, <>, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that works too. I think it could be handy to have a test suite that has NaN tests for all functions in one place to make sure we have thorough coverage, but it's also nice for the group by tests to be thorough about all the cases it needs to cover including NaNs.
| * Make behavior easy to implement and debug | ||
| * Make behavior easy for users to understand | ||
| * Align behavior with Velox for consistency across Java and C++ workers | ||
| * [Nice to have] Align behavior with Spark |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark 3.0 in ANSI mode, or previous versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe they have the same semantics with regard to NaN.
I haven't looked deeply, but from a quick look at their docs and issues and knowing what they started with , I believe they also have inconsistencies, though more intentional. I believe the following is their behavior:
They also have a number of bugfixes related to nan handling and we should look into whether they are relevant for us:
|


No description provided.