-
Notifications
You must be signed in to change notification settings - Fork 5.4k
Presto NaN behavior
We are moving to a consistent semantics with regards to NaN across all functions and operators. This will bring internal consistency to Presto, as well as align us with Velox, as we move towards CPP workers in Presto. The new semantics are enabled by default and can be disabled by setting the configuration property use-new-nan-definition=false
.
According to the new definition:
- NaN = NaN is true for all equality and distinctness purposes
- NaN is the largest value for all comparison and sorting purposes.
In addition to the NaN changes, we have updated our semantics regarding +0 and -0 to consistently treat +0 and -0 as equal to each other for all equality and distinctness purposes.
You can find more information below about where this definition differs from the previous behavior, and how you can handle it for your queries.
NaN, short for "not a number", is a value in floating point types (DOUBLE and REAL) that indicates an invalid result of a mathematical operation. For example 0.0/0.0 returns NaN, as does sqrt(-1). NaN is different from NULL, which indicates an absent value.
The new NaN semantics may cause your queries to produce different results by returning different values than previously or returning a different number of rows. Below are places where the new Nan semantics differ from the old.
Comparison operators are by far the most common cause of results differences between the old and new semantics, especially comparisons such as double_column > value.
Filters such as x > 0.5
previously would not include NaNs. Now, since NaNs are considered larger than all other numbers, it will include NaN values. This is the most common cause of results differences we see with the new NaN changes. Sometimes this is from NaN values stored in tables. Other times it comes from comparison after division operations where you end up with 0.0 / 0.0. e.g. WHERE (col1 / col2) > 0.5, but col1 and col2 are both 0.0.
SELECT x > 0.5 FROM (VALUES nan()) t(x)
Previous results:
_col0
-------
false
(1 row)
New results:
_col0
-------
true
(1 row)
SELECT x/y > 0.5 FROM (VALUES (0.0, 0.0))t(x, y)
Previous results:
_col0
-------
false
(1 row)
New results:
_col0
-------
true
(1 row)
Filters such as x = y now return true if both values are NaN, whereas previously they returned false.
SELECT x = x FROM (values nan())t(x)
Previous results:
_col0
-------
false
(1 row)
New results:
_col0
-------
true
(1 row)
x <> y now returns false if both values are NaN. Previously it returned true.
SELECT x <> x FROM (values nan())t(x)
Previous results:
_col0
-------
true
(1 row)
New results:
_col0
-------
false
(1 row)
These functions now consider NaNs largest. Previously they returned NaN if it was the first value encountered and otherwise they returned the min or max non-NaN value.
Previously array_min returned NaN if there was any NaN in the input. Now it will consider NaN as the largest value.
SELECT array_min(ARRAY[1, nan()]);
Previous results:
_col0
-------
NaN
(1 row)
New results:
_col0
-------
1
(1 row)
Previously these functions threw an error if there was NaN in the input. Now they will consider NaN largest.
SELECT greatest (1, nan());
Previous results:
Query 20240627_211145_97754_m2atc failed: Invalid argument to greatest(): NaN
New results:
_col0
-------
NaN
(1 row)
set_agg, set_union, map_agg, multimap_agg, map_union and other map and array functions that are supposed to return a distinct set of elements in an array or map
Previously these did not deduplicate NaN values and maps could also contain multiple NaN keys. Now they will deduplicate NaN values and maps can only contain one NaN key.
SELECT set_agg(x) from (values nan(), nan()) t(x);
Previous results:
_col0
------------
[NaN, NaN]
(1 row)
New results:
_col0
------------
[NaN]
contains, element_at, arrays_overlap and other functions that find or match an element in a map or array
Previously these functions would not match NaN values in the array or map. Now these functions will match NaN to itself.
SELECT contains(ARRAY[nan()], nan())
Previous results:
_col0
-------
false
(1 row)
New results:
_col0
-------
true
(1 row)
map_top_n() previously returned wrong results in the presence of NaNs. Now it will consider NaN largest.
select map_top_n(map(array['a', 'b', 'c'], array[3, nan(), 2]),1)
Previous results (bug!):
_col0
---------
{c=2.0}
(1 row)
New results:
_col0
---------
{b=NaN}
(1 row)
A user defined SQL function may have behavior differences with the new NaN semantics if it calls a function that has differences with the new NaN semantics
NaN join keys now match. Previously they would not. Example:
with nan_input as (SELECT * FROM (values nan())t(x)) SELECT t1.x FROM nan_input t1 JOIN nan_input t2 on t1.x = t2.x;
Previous results:
x
---
(0 rows)
New results:
x
---
NaN
(1 row)
Previously we did not deduplicate NaNs when performing grouping operations on REAL types. Now we deduplicate NaNs and consider them all part of the same group for a "GROUP BY". There is no change for DOUBLE types which already had this behavior.
SELECT col, count(*) FROM (VALUES cast(nan() AS REAL), CAST(nan() AS REAL))t(col) GROUP BY col;
Previous results:
col | _col1
-----+-------
NaN | 1
NaN | 1
(2 rows)
New results:
col | _col1
-----+-------
NaN | 2
(1 row)
In addition to the comparison and ordering differences, a bug was fixed regarding casting NaN and infinity from DOUBLE or REAL to BIGINT, INT, TINYINT, or SMALLINT. These casts now throw an error, as NaN and infinity cannot be represented in integer types. Previously all these casts other than DOUBLE -> BIGINT would incorrectly return 0.
Previously joins, group bys, distincts, and many functions treated +0 and -0 as distinct from each other. Now they will be treated as equal to each other. They will match as join keys, be grouped to a single group, and be considered equals and not distinct for all purposes.
SELECT DISTINCT x FROM (VALUES (DOUBLE '0.0'), (DOUBLE '-0.0')) t(x)
Previous results:
x
------
0.0
-0.0
(2 rows)
New results:
x
------
0.0
(1 row)
SELECT * FROM (VALUES DOUBLE '0.0')t1(x) JOIN (VALUES DOUBLE '-0.0') t2(x) ON t1.x = t2.x
Previous results:
x | x
---+---
(0 rows)
New results:
x | x
-----+-----
0.0 | 0.0
(1 row)
In most cases where NaNs show up, people do not realize that they may be producing or processing NaNs.
If you would like to filter out NaNs from your data in a way that is compatible with both the old and new semantics, you can use the is_nan() function. For example: SELECT * FROM my_table WHERE NOT is_nan(my_double_column)
. You can also use is_finite() to filter out both NaNs and infinities. For example: SELECT * FROM my_table WHERE is_finite(my_double_column)
.
You can also prevent yourself from creating NaN by filtering inputs to certain functions that could produce NaNs. For example, you might check that the denominator is not 0 before performing a division operation.
SELECT col1/col2 as ratio FROM my_table WHERE col2 <> 0
.