Add RFC for a new NaN definition #9

rschlussel · 2024-03-13T17:27:41Z

No description provided.

tdcmeehan

I find this proposal to be a reasonable fix to a complicated problem. I think there could be some light copy editing for posterity, and I'm tagging @steveburnett to see if he has some suggestions.

tdcmeehan · 2024-03-14T19:34:54Z

RFC-0006-nan-definition.md

+
+For sorting, again the basic comparisons follow IEEE-754. where <,>, <=, >= return false.  Beyond that there is a considerable degree of variation across functions and operators. min, max, min_by, max_by, min(x, n) and max(x,n) have inconsistent behavior depending on when NaN is encountered.  Greatest and Least throw an error on NaN input. Map_top_n returns wrong results if NaN shows up in the input. Order by sorts NaN as larger than infinity, and most array and map sorting and topn functions consider NaN as largest as well. However, array_min and array_max return NaN if any of the input is NaN.
+
+When it comes to pushing filters into the scan. Tuple domains, I believe would consider NaN largest (though needs a bit more review to be certain). tupleDomainFilters testDouble() will return false if the value is NaN unless the filter is a NOT IN (following the IEEE definition and Presto's behavior for basic operators).  Orc files, at least as written by Presto and Velox won't include min/max stats if any of the values are NaN.  Parquet files written by Presto, I believe will consider NaN largest in their min/max stats, but other parquet writers behave differently, and Presto may read files written by other engines.  


I wonder if it's more desirable for NaN to be excluded from min/max stats for both ORC and Parquet? One NaN seems to render the stat useless, at least in Parquet: apache/parquet-format#88

To exclude NaN, or to exclude min/max stats if there are NaNs? I think for the file format we should follow whatever the file format expects, if it's defined, and then make sure to handle it appropriately in Presto.
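As a minimal sketch of why a min/max that relies on the IEEE-754 `<` operator can depend on where the NaN appears: the helper `naiveMin` below is hypothetical and written only for illustration. It is not Presto's or any Parquet writer's code.

```java
class NaiveMinDemo {
    // Hypothetical running minimum built on the IEEE-754 '<' operator.
    // Assumes a non-empty input array.
    public static double naiveMin(double[] values) {
        double min = values[0];
        for (int i = 1; i < values.length; i++) {
            // Every '<' comparison involving NaN is false, so a NaN seen
            // later is skipped, while a NaN in the first slot is never replaced.
            if (values[i] < min) {
                min = values[i];
            }
        }
        return min;
    }

    public static void main(String[] args) {
        System.out.println(naiveMin(new double[] {Double.NaN, 1.0, 2.0})); // prints NaN
        System.out.println(naiveMin(new double[] {1.0, Double.NaN, 2.0})); // prints 1.0
    }
}
```

The same input values give different results depending only on the position of the NaN, which is why a single NaN can make a min/max statistic unreliable.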

RFC-0006-nan-definition.md

steveburnett

This looks great! The structure, content, flow of ideas, and internal logic seem solid, and this draft reads well as a result.

The bulk of my comments are nits about punctuation and capitalization, and some small suggestions for rephrasing to improve readability.

One item I didn't address in my comments: I noticed what appears to be some inconsistency in the initial capitalization of the word "infinity". I'd suggest reviewing all uses of the word and changing to "Infinity" as appropriate for the meaning of the word in the sentence.

RFC-0006-nan-definition.md

rschlussel

Thanks for the review @steveburnett and @tdcmeehan!

rschlussel · 2024-03-15T14:07:50Z

RFC-0006-nan-definition.md

+
+For sorting, again the basic comparisons follow IEEE-754. where <,>, <=, >= return false.  Beyond that there is a considerable degree of variation across functions and operators. min, max, min_by, max_by, min(x, n) and max(x,n) have inconsistent behavior depending on when NaN is encountered.  Greatest and Least throw an error on NaN input. Map_top_n returns wrong results if NaN shows up in the input. Order by sorts NaN as larger than infinity, and most array and map sorting and topn functions consider NaN as largest as well. However, array_min and array_max return NaN if any of the input is NaN.
+
+When it comes to pushing filters into the scan. Tuple domains, I believe would consider NaN largest (though needs a bit more review to be certain). tupleDomainFilters testDouble() will return false if the value is NaN unless the filter is a NOT IN (following the IEEE definition and Presto's behavior for basic operators).  Orc files, at least as written by Presto and Velox won't include min/max stats if any of the values are NaN.  Parquet files written by Presto, I believe will consider NaN largest in their min/max stats, but other parquet writers behave differently, and Presto may read files written by other engines.  


To exclude NaN, or to exclude min/max stats if there are NaNs? I think for the file format we should follow whatever the file format expects, if it's defined, and then make sure to handle it appropriately in Presto.

rschlussel · 2024-03-15T14:08:47Z

RFC-0006-nan-definition.md

+#### Postgresql
+
+The NaN (not a number) value is used to represent undefined calculational results. In general, any operation with a NaN input yields another NaN. The only exception is when the operation's other inputs are such that the same output would be obtained if the NaN were to be replaced by any finite or infinite numeric value; then, that output value is used for NaN too. (An example of this principle is that NaN raised to the zero power yields one.)
+Note: In most implementations of the “not-a-number” concept, NaN is not considered equal to any other numeric value (including NaN). In order to allow numeric values to be sorted and used in tree-based indexes, PostgreSQL treats NaN values as equal, and greater than all non-NaN values.


Wish I could take credit, but it's a direct quote from the PostgreSQL docs

RFC-0006-nan-definition.md

rschlussel · 2024-03-15T14:49:56Z

RFC-0006-nan-definition.md

+
+Here we give a short summary of Presto's current behavior.  There is a complete review here: https://docs.google.com/spreadsheets/d/1KclsskH88CLHMiu_Q2P25S3QfPc9H3Tu7h_9MXP10Dk/edit#gid=128250498  
+
+Currently basic comparison operators follow IEEE-754. =  returns false. <> returns true.  Joins on NaN keys do not match.  For maps and arrays, NaNs are considered distinct from each other, and will appear multiple times in a set_agg, union, or even as map_keys.  However, for group bys, distinct, and distinct aggregations, NaNs are deduplicated (there will only be one group for all NaNs, they are not considered distinct from each other). 


did a variation on this suggestion.

steveburnett

Nice work on the revision, thanks for the quick turnaround! I did another complete review of the revised draft and I found only a few final nits for you to consider.

RFC-0006-nan-definition.md

steveburnett · 2024-03-15T15:54:18Z

RFC-0006-nan-definition.md

+>….							
+>Operations on numbers are performed according to the normal rules of arithmetic, within implementation- defined limits, except as provided for in Subclause 6.29, “<numeric value expression>”. 
+
+Some read the above as disallowing NaNs because floating point types are defined only as mantissa and exponent, and there are no special values defined like NaN, infinity, and -infinity.  I don't think it's clear one way or the other, and it looks like most dbs do support NaN values.


Suggest expanding "dbs" here, and throughout as well ("DB" elsewhere).

done. thanks!

RFC-0006-nan-definition.md

steveburnett

Sorry! I missed this earlier, one final formatting nit. I just did another complete review of the document and I don't see anything else to mention.

RFC-0006-nan-definition.md

czentgr

Great write-up! Thanks for that!

czentgr · 2024-03-15T15:35:09Z

RFC-0006-nan-definition.md

+Most array and map sorting and topn functions consider NaN as largest
+`array_min()` and `array_max()` return NaN if any of the input is NaN
+
+When it comes to pushing filters into the scan. Tuple domains, I believe would consider NaN largest (though needs a bit more review to be certain).


Does this need more investigation/check?

I need to do a bit more digging to see what the current behavior is and what changes are needed, but I don't consider it a blocker for the RFC, which is about what we're moving to.

czentgr · 2024-03-15T15:35:39Z

RFC-0006-nan-definition.md

+When it comes to pushing filters into the scan. Tuple domains, I believe would consider NaN largest (though needs a bit more review to be certain).
+ tupleDomainFilters testDouble() will return false if the value is NaN unless the filter is a NOT IN (following the IEEE definition and Presto's behavior for basic operators).  
+Orc files, at least as written by Presto and Velox won't include min/max stats if any of the values are NaN.  
+Parquet files written by Presto, I believe will consider NaN largest in their min/max stats, but other parquet writers behave differently, and Presto may read files written by other engines.  


For Velox I checked and it doesn't include NaN in the min/max

parquet cat 20240315_172625_00005_juhzy.1.0.0.0_2_5_1e829c06-25b5-4451-8348-07abcb7ebf25.parquet
{"col1": 1.0}
{"col1": 2.0}
{"col1": "NaN"}
{"col1": 3.0}
{"col1": "NaN"}
{"col1": 4.0}
{"col1": "NaN"}

parquet meta 20240315_172625_00005_juhzy.1.0.0.0_2_5_1e829c06-25b5-4451-8348-07abcb7ebf25.parquet
File path: 20240315_172625_00005_juhzy.1.0.0.0_2_5_1e829c06-25b5-4451-8348-07abcb7ebf25.parquet
Created by: parquet-cpp-velox
Properties: (none)
Schema:
message schema {
  optional double col1;
}

Row group 0: count: 7 20.71 B records start: 4 total(compressed): 145 B total(uncompressed):126 B
--------------------------------------------------------------------------------
      type    encodings  count  avg size  nulls  min / max
col1  DOUBLE  G _ R      7      20.71 B   0      "1.0" / "4.0"

The result for Java is interesting: the min/max is not set at all. Is that because NaNs are present?

parquet meta 20240315_184142_00003_tqpac_a024760a-431c-4013-bf14-1a8ab2f55639
File path: 20240315_184142_00003_tqpac_a024760a-431c-4013-bf14-1a8ab2f55639
Created by: parquet-mr version 1.13.1 (build db4183109d5b734ec5930d870cdae161e408ddba)
Properties: (none)
Schema:
message hive_schema {
  optional double col1;
}

Row group 0: count: 7 16.00 B records start: 4 total(compressed): 112 B total(uncompressed):93 B
--------------------------------------------------------------------------------
      type    encodings  count  avg size  nulls  min / max
col1  DOUBLE  G _ R      7      16.00 B   0

Data

parquet cat 20240315_184142_00003_tqpac_a024760a-431c-4013-bf14-1a8ab2f55639
{"col1": 10.0}
{"col1": 20.0}
{"col1": "NaN"}
{"col1": 30.0}
{"col1": "NaN"}
{"col1": 40.0}
{"col1": "NaN"}

thanks for this! I'll update with this information.
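The two outcomes observed above (min/max computed over the non-NaN values, versus no min/max at all) can be sketched as two hypothetical statistics policies. The class and method names are made up for illustration, and the code only mirrors the observed metadata; it is not taken from the Velox or parquet-mr writers.

```java
import java.util.Arrays;
import java.util.Optional;

class MinMaxStatsDemo {
    // Policy 1 (hypothetical): ignore NaN values and report {min, max}
    // over the remaining values; empty if nothing remains.
    public static Optional<double[]> statsSkippingNaN(double[] values) {
        double[] nonNaN = Arrays.stream(values).filter(v -> !Double.isNaN(v)).toArray();
        if (nonNaN.length == 0) {
            return Optional.empty();
        }
        double min = Arrays.stream(nonNaN).min().getAsDouble();
        double max = Arrays.stream(nonNaN).max().getAsDouble();
        return Optional.of(new double[] {min, max});
    }

    // Policy 2 (hypothetical): report no statistics at all if any value is NaN.
    public static Optional<double[]> statsOmittedOnNaN(double[] values) {
        boolean hasNaN = Arrays.stream(values).anyMatch(v -> Double.isNaN(v));
        if (hasNaN || values.length == 0) {
            return Optional.empty();
        }
        double min = Arrays.stream(values).min().getAsDouble();
        double max = Arrays.stream(values).max().getAsDouble();
        return Optional.of(new double[] {min, max});
    }

    public static void main(String[] args) {
        double[] data = {1.0, 2.0, Double.NaN, 3.0, Double.NaN, 4.0, Double.NaN};
        System.out.println(Arrays.toString(statsSkippingNaN(data).get())); // prints [1.0, 4.0]
        System.out.println(statsOmittedOnNaN(data).isPresent());           // prints false
    }
}
```

Under either policy a reader cannot tell from the statistics alone whether NaNs are present, so filter pushdown has to account for that separately.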

czentgr · 2024-03-15T15:57:15Z

RFC-0006-nan-definition.md

+
+https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Data-Types.html#GUID-33A52FDB-BA5C-474E-96D3-40390BA5F5F4
+
+#### DB2


This section describes Db2 z/OS. There are other variants of Db2, e.g. Db2 LUW and Db2 for i, and they do have differences, though for this case it appears that they behave the same.
Perhaps we can point out that both Db2 LUW and Db2 z/OS behave the same way with regard to DOUBLE/FLOAT and DECFLOAT. I did not find the equivalent docs for Db2 for i (and it is too niche to be relevant here).

Relevant Db2 LUW documentation matching the Db2 z/OS documentation
https://www.ibm.com/docs/en/db2/11.5?topic=list-numbers#r0008469__title__7
https://www.ibm.com/docs/en/db2/11.5?topic=list-numbers#r0008469__title__8
https://www.ibm.com/docs/en/db2/11.5?topic=types-assignments-comparisons#r0008479__numcomp__title__1

As for the title I suggest "Db2 for z/OS and Db2 LUW"

czentgr · 2024-03-15T18:44:34Z

RFC-0006-nan-definition.md

+
+## Goals
+
+* Define a consistent behavior for ordering and comparisons of Nans in Presto


nit: NaNs.

steveburnett

LGTM! (docs)

Thanks again for the quick update! Reviewed the updated file and it looks great.

aditi-pandit · 2024-03-15T22:03:44Z

@rschlussel : Nice writeup. Was curious if you have researched Trino. Do they have the same issues as Presto ?

I was intrigued since I saw in the description of Project Hummingbird that they intend to add a runtime optimization to eliminate nan handling checks for data without nan's

elharo

Assorted nits on phrasing, but +1 on the meat of the proposal. This sounds like a consistent and reliable solution.

elharo · 2024-03-18T13:39:59Z

RFC-0006-nan-definition.md

+
+### Current Presto behavior
+
+Here we give a short summary of Presto's current behavior.  There is a complete review here: https://docs.google.com/spreadsheets/d/1KclsskH88CLHMiu_Q2P25S3QfPc9H3Tu7h_9MXP10Dk/edit#gid=128250498  


Do you want to make this public?

Still need to edit the formatting a bit, but made a wiki for it https://github.com/prestodb/presto/wiki/Presto-NaN-behavior. Thanks for the suggestion.

elharo · 2024-03-18T13:41:25Z

RFC-0006-nan-definition.md

+
+#### Comparison 
+
+= returns false


It's not totally clear that you're talking about NaN = NaN and NaN <> NaN. There are also the cases where NaN = 7.2 or NaN <> 56.0

All of the above. Any equality comparison will return false, including NaN = NaN, and any inequality comparison with another number will return true, including NaN <> NaN. I'll specify NaN = NaN and NaN <> NaN here because I think the other cases would be understood from that.
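For reference, the IEEE-754 behavior described here is the same behavior Java's primitive `double` operators have. A minimal sketch follows; the class and method names are illustrative only, and this is plain Java rather than Presto code.

```java
class NanOperatorsDemo {
    // Thin wrappers around the primitive operators, which follow IEEE-754.
    public static boolean eq(double a, double b) { return a == b; }
    public static boolean ne(double a, double b) { return a != b; }
    public static boolean lt(double a, double b) { return a < b; }
    public static boolean ge(double a, double b) { return a >= b; }

    public static void main(String[] args) {
        double nan = Double.NaN;
        System.out.println(eq(nan, nan));  // prints false
        System.out.println(ne(nan, nan));  // prints true
        System.out.println(eq(nan, 7.2));  // prints false
        System.out.println(ne(nan, 56.0)); // prints true
        System.out.println(lt(nan, 1.0));  // prints false
        System.out.println(ge(nan, 1.0));  // prints false
    }
}
```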

elharo · 2024-03-18T13:42:04Z

RFC-0006-nan-definition.md

+
+#### Sorting 
+
+<,>, <=, >= all return false


nit: periods at end of sentences

elharo · 2024-03-18T13:42:24Z

RFC-0006-nan-definition.md

+`map_top_n()` returns wrong results if NaN shows up in the input
+Order by sorts NaN as larger than infinity
+Most array and map sorting and topn functions consider NaN as largest
+`array_min()` and `array_max()` return NaN if any of the input is NaN


of the input --> input value

elharo · 2024-03-18T13:42:54Z

RFC-0006-nan-definition.md

+
+When it comes to pushing filters into the scan. Tuple domains, I believe would consider NaN largest (I'm not confident about this, and  need to dig a bit more to be certain). 
+tupleDomainFilters testDouble() will return false if the value is NaN unless the filter is a NOT IN (following the IEEE definition and Presto's behavior for basic operators).  
+Orc files, at least as written by Presto and Velox won't include min/max stats if any of the values are NaN.  


comma after Velox

elharo · 2024-03-18T13:45:47Z

RFC-0006-nan-definition.md

+
+#### Spark SQL
+
+There is special handling for not-a-number (NaN) when dealing with float or double types that does not exactly match standard floating point semantics. Specifically:


Can we tldr this whole proposal as "do what Spark 3.x does"?

elharo · 2024-03-18T13:46:46Z

RFC-0006-nan-definition.md

+* Define a consistent behavior for ordering and comparisons of NaNs in Presto
+* Make behavior easy to implement and debug
+* Make behavior easy for users to understand
+* Align behavior with Velox for consistency across Java and C++ workers


Add "Align with Spark"?

I think that's a nice to have when choosing between options, but not a primary goal.

elharo · 2024-03-18T13:47:12Z

RFC-0006-nan-definition.md

+
+We propose defining ordering and comparisons for NaNs such that `NaN = NaN` and sorts biggest.
+
+With this proposal we stray from the IEEE definition and follow the example of PostgreSQL and others.  In this case for all equality purposes such as =, distinctness, grouping, and join keys, we would define NaN=NaN.  We would sort NaN as greater than infinity for all purposes, including for min/max and > and < comparisons.


PostgreSQL, Spark, and others.
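One way to picture the proposed semantics (NaN equal to itself and sorted above Infinity) is the total order Java already provides through `Double.compare` and `Arrays.sort(double[])`. This is only an analogy under that assumption, not the RFC's implementation, and the helper names are hypothetical. Note that `Double.compare` also treats -0.0 as less than 0.0, which the proposal text quoted here does not address.

```java
import java.util.Arrays;

class NanTotalOrderDemo {
    // Equality under the total order: all NaNs compare equal to each other.
    public static boolean totalOrderEquals(double a, double b) {
        return Double.compare(a, b) == 0;
    }

    // Arrays.sort(double[]) uses the Double.compareTo total order,
    // so NaN ends up after positive infinity.
    public static double[] sortAscending(double[] values) {
        double[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(totalOrderEquals(Double.NaN, Double.NaN));                  // prints true
        System.out.println(Double.compare(Double.NaN, Double.POSITIVE_INFINITY) > 0); // prints true
        double[] sorted = sortAscending(
                new double[] {Double.NaN, 2.0, Double.POSITIVE_INFINITY, 1.0});
        System.out.println(Arrays.toString(sorted)); // prints [1.0, 2.0, Infinity, NaN]
    }
}
```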

elharo · 2024-03-18T13:48:27Z

RFC-0006-nan-definition.md

+
+## Test Plan
+
+We will add a test class TestNanQueries with tests to ensure correct behavior for all functions and operations that handle NaNs. The tests will use queries that execute on workers, rather than constants that get interpreted on the coordinator, so that the test can also be used to ensure consistency between native and java workers.


elharo · 2024-03-18T13:50:28Z

RFC-0006-nan-definition.md

+
+## Test Plan
+
+We will add a test class TestNanQueries with tests to ensure correct behavior for all functions and operations that handle NaNs. The tests will use queries that execute on workers, rather than constants that get interpreted on the coordinator, so that the test can also be used to ensure consistency between native and java workers.


We can likely also put NaN tests into existing unit test classes that test GROUP BY, <>, etc.

Yeah, that works too. I think it could be handy to have a test suite that has NaN tests for all functions in one place to make sure we have thorough coverage, but it's also nice for the group by tests to be thorough about all the cases it needs to cover including NaNs.

tdcmeehan · 2024-03-18T15:10:26Z

RFC-0006-nan-definition.md

+* Make behavior easy to implement and debug
+* Make behavior easy for users to understand
+* Align behavior with Velox for consistency across Java and C++ workers
+* [Nice to have] Align behavior with Spark


Spark 3.0 in ANSI mode, or previous versions?

I believe they have the same semantics with regard to NaN.

rschlussel · 2024-03-18T19:08:02Z

@rschlussel : Nice writeup. Was curious if you have researched Trino. Do they have the same issues as Presto ?

I was intrigued since I saw in the description of Project Hummingbird that they intend to add a runtime optimization to eliminate nan handling checks for data without nan's

I haven't looked deeply, but from a quick look at their docs and issues, and knowing what they started with, I believe they also have inconsistencies, though more intentional ones. I believe the following is their behavior:

* For <, >, =, and <>, NaN = NaN is false and NaN <> NaN is true. Joins follow these semantics.
* ORDER BY sorts NaN largest.
* Grouping/distinct operations put NaNs into a single group (NaN = NaN). Map and array functions are intended to follow that, and have been partially, but not completely, fixed to use those semantics. See "Functions on arrays and maps should compare elements with IS DISTINCT semantics" (trinodb/trino#560).
* min/max and greatest/least functions ignore NaNs unless all of the input is NaN. See https://trino.io/docs/current/release/release-346.html?highlight=nan and "Use consistent NaN behavior for max functions" (trinodb/trino#5851).
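To make the difference concrete, here is a hedged sketch contrasting a max that ignores NaNs unless every input is NaN (the behavior attributed to Trino above) with a max that treats NaN as the largest value (the direction of this RFC). Both helpers are hypothetical illustrations, not code from either engine.

```java
class NanMaxDemo {
    // Ignores NaN unless all inputs are NaN (or the input is empty),
    // in which case the result is NaN.
    public static double maxIgnoringNaN(double[] values) {
        double max = Double.NaN;
        for (double v : values) {
            if (Double.isNaN(v)) {
                continue;
            }
            if (Double.isNaN(max) || v > max) {
                max = v;
            }
        }
        return max;
    }

    // Treats NaN as larger than every other value, including Infinity.
    // Assumes a non-empty input array.
    public static double maxNaNLargest(double[] values) {
        double max = values[0];
        for (double v : values) {
            if (Double.compare(v, max) > 0) {
                max = v;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        double[] mixed = {1.0, Double.NaN, 3.0};
        System.out.println(maxIgnoringNaN(mixed)); // prints 3.0
        System.out.println(maxNaNLargest(mixed));  // prints NaN
        System.out.println(maxIgnoringNaN(new double[] {Double.NaN, Double.NaN})); // prints NaN
    }
}
```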

They also have a number of bugfixes related to nan handling and we should look into whether they are relevant for us:

* https://trino.io/docs/current/release/release-318.html?highlight=nan#hive-connector (fix bucketing of NaN values)
* "Improve support for floating point types in Domain translator" (trinodb/trino#2582): improves domain translator support for NaNs. This one is described as fixing wrong results in the 329 release notes.
* "Do not persist inf/nan floating point stats in metastore" (trinodb/trino#2471): stops writing NaN values to the Hive metastore. Described in the 329 release notes as fixing a query failure.
* "Fix incorrect result for inequality join involving NaN" (trinodb/trino#4120): wrong results for inequality joins with NaN values (337 release notes). Probably won't be relevant for us when we switch joins to NaN = NaN semantics.
* "Dynamic Filter failure in presence of NaN" (trinodb/trino#4272): dynamic filter failures with NaNs.
* "Planning failure when effective predicate contains NaN" (trinodb/trino#4119) and "Handle NaN when extracting predicate" (trinodb/trino#4266): planning failure with NaNs.
* "Query failure when reading ORC/Parquet files containing NaN" (trinodb/trino#4267), fixed by "Handle NaN in Parquet statistics" (trinodb/trino#4467): Parquet issue with NaNs.
* "Iceberg connector stores 0.0 when inserting NaN on partitioned real or double columns" (trinodb/trino#15723): Iceberg connector writing/reading incorrectly with NaNs.
* "Handle NaN and infinite values in Aggregation Window Function" (trinodb/trino#20946): issue with NaNs and infinities in window functions.

tdcmeehan reviewed Mar 14, 2024

steveburnett suggested changes Mar 14, 2024

rschlussel force-pushed the nan-rfc branch from 4fdf860 to 2a66ba8 Compare March 15, 2024 14:50

rschlussel commented Mar 15, 2024

steveburnett suggested changes Mar 15, 2024

rschlussel force-pushed the nan-rfc branch from df74b4e to 6352c01 Compare March 15, 2024 17:18

steveburnett suggested changes Mar 15, 2024

rschlussel force-pushed the nan-rfc branch from 22d5c87 to 945303e Compare March 15, 2024 17:52

czentgr reviewed Mar 15, 2024

steveburnett approved these changes Mar 15, 2024

rschlussel force-pushed the nan-rfc branch from 945303e to c40a3db Compare March 15, 2024 19:17

elharo approved these changes Mar 18, 2024

rschlussel force-pushed the nan-rfc branch from c40a3db to 28dc3e2 Compare March 18, 2024 14:53

tdcmeehan approved these changes Mar 18, 2024

Add RFC for a new NaN definition

7eb17de

rschlussel force-pushed the nan-rfc branch from 28dc3e2 to 7eb17de Compare March 19, 2024 13:42

rschlussel merged commit 8919b4e into prestodb:main Mar 19, 2024

