Legacy Presto NaN behavior

Below we provide a reference of Presto's previous behavior with regard to ordering and comparison of NaN values. The new behavior is consistent definition of nan=nan and nan is greater than all other values. See https://github.com/prestodb/rfcs/blob/main/RFC-0001-nan-definition.md

for more details.

Function	Behavior	Examples
Top level operators
order by	NaN is biggest
SELECT distinct, group by one column, group by multiple columns	deduplicates NaNs for double types. for real types, it does not (but if you do IS DISTINCT FROM directly it will say they are not distinct). this is because of a bug in the block version of real's isDistinctFrom method	select distinct x from (values 1, nan(), 2, nan(), null, null)t(x); x ------ NaN NULL 1.0 2.0 select count(x), y from (values(1, 1), (1, 1), (1,nan()), (1,nan())) t(x,y) group by y; _col0 \| y -------+----- 2 \| NaN 2 \| 1.0 (2 rows) select x, y, count(*) as count from (values(1, 1), (1, 1), (1,nan()), (1,nan())) t(x,y) group by x, y; x \| y \| count ---+-----+------- 1 \| NaN \| 2 1 \| 1.0 \| 2 (2 rows)
join on nan() key	does not match. Also note that dynamic filtering has special handling for nan() based on this behavior https://github.com/prestodb/presto/pull/16314/files	SELECT * FROM (values 1, 2, nan()) t1(x) join (values 1, 3, nan()) t2(x) on t1.x = t2.x; x \| x -----+----- 1.0 \| 1.0 (1 row)

Simple Scalar Functions and operators
greatest/least	Throws an invalid argument error for NaN in any position	select greatest(1, nan());Query 20240228_144800_58951_3rsv7 failed: Invalid argument to greatest(): NaN select least(1, nan());Query 20240228_144824_58965_3rsv7 failed: Invalid argument to least(): NaN
`>, <, >=, <=`	false for NaN in any position	select nan() > 0; _col0 ------- false (1 row)select 0 > nan(); _col0 ------- false (1 row)select nan() >= 0; _col0 ------- false (1 row)select nan() < 0; _col0 ------- false (1 row)select nan() <= 0; _col0 ------- false (1 row) select nan() <= nan(); _col0 ------- false (1 row)
between	false for NaN in any position	select nan() between -Infinity() and infinity(); _col0 ------- false (1 row)select 3 between 0 and nan(); _col0 ------- false (1 row)
=	false	select nan() = nan(); _col0 ------- false (1 row)
<>, !=	TRUE	select nan() <> nan(); _col0 ------- true
nan() IN (...)	false	select nan() in (1, 2, nan()); _col0 ------- false (1 row)
nan() NOT IN (....)	TRUE	select nan() not in (nan(), 1, 2); _col0 ------- true (1 row)


Aggregation Functions
count(distinct)	nan counted only once	SELECT count(distinct x) from (values 1, 2, nan(), nan()) t(x); _col0 ------- 3 (1 row)select y, count(distinct x) as count_distinct, count(x) as count from (values(nan(), 1), (nan(), 1), (1, 1), (1, 2)) t(x,y) group by y; y \| count_distinct \| count ---+----------------+------- 2 \| 1 \| 1 1 \| 2 \| 3 (2 rows)
min, max, min_by, max_by	If NaN is first, then NaN. Else the min or max other than NaN (https://github.com/prestodb/presto/issues/21877)
set_agg()	Nan appears multiple times	select set_agg(x) from (Values 1, 2, nan(), nan(), 3) as t(x); _col0 --------------------------- [1.0, 2.0, NaN, NaN, 3.0] (1 row)
set_union()	Nan appears multiple times	SELECT set_union(x) from (values array[1, 2, nan()], array[1, 3, nan()]) t(x); _col0 --------------------------- [1.0, 2.0, NaN, 3.0, NaN] (1 row)
histogram	each nan appears seperately in the histogram. in display it looks like there's only one with an incorrent count, but if you get map_keys you can see it appears n times	SELECT histogram(x) from (values 1, 2, nan(), nan(), nan(), 1) t(x); _col0 ----------------------- {1.0=2, 2.0=1, NaN=1} (1 row)SELECT map_keys(histogram(x)) from (values 1, 2, nan(), nan(), nan(), 1) t(x); _col0 --------------------------- [1.0, 2.0, NaN, NaN, NaN] (1 row)
map_agg	nan appears multiple times (in display looks like it only appears 1x, but if you get map_keys you can see multiple nans)	select map_agg(x,y) from (values(1, 'a'),( 2, 'b'), (nan(), 'c'),(1, 'd'), (nan(), 'e'))t(x,y); _col0 ----------------------- {1.0=a, 2.0=b, NaN=e} (1 row)select map_keys(map_agg(x,y)) from (values(1, 'a'),( 2, 'b'), (nan(), 'c'),(1, 'd'), (nan(), 'e'))t(x,y); _col0 ---------------------- [1.0, 2.0, NaN, NaN]
multimap_agg	fails with an index out of bounds exception if there are any nan() keys https://github.com/prestodb/presto/issues/22679	SELECT multimap_agg(x, y) FROM (VALUES (nan(), 2), (2, 3)) t(x,y); -> Query 20240506_161354_39491_uuek4 failed: Index -1 out of bounds for length 1024.
map_union	nan keys appear multiple times. in display it looks deduplicated, but if you get map_keys you can see they are not	SELECT map_union(x) from (values map(array[1, 2, nan()], array[3,4,5]), map(array[100, 2, nan()], array[10,20, 30])) t(x); _col0 ---------------------------------- {1.0=3, 2.0=4, NaN=30, 100.0=10} (1 row) SELECT map_keys(map_union(x)) from (values map(array[1, 2, nan()], array[3,4,5]), map(array[100, 2, nan()], array[10,20, 30])) t(x); _col0 ----------------------------- [1.0, 2.0, NaN, 100.0, NaN] (1 row)
map_union_sum	nan keys are duplicated and values not summed. in display it looks deduplicated, but if you get map_keys you can see they are not	SELECT map_union_sum(x) from (values map(array[1, 2, nan()], array[3,4,5]), map(array[100, 2, nan()], array[10,20, 30])) t(x); _col0 ----------------------------------- {1.0=3, 2.0=24, NaN=30, 100.0=10} (1 row) SELECT map_keys(map_union_sum(x)) from (values map(array[1, 2, nan()], array[3,4,5]), map(array[100, 2, nan()], array[10,20, 30])) t(x); _col0 ----------------------------- [1.0, 2.0, NaN, 100.0, NaN] (1 row)

Array functions
array_min	NaN if NaN is in any position in the array	select array_min(array[nan(),1, 2, 3]); _col0 ------- NaN (1 row) select array_min(array[1, nan(), 2, 3]); _col0 ------- NaN (1 row)
array_max	NaN if NaN is in any position in the array	select array_max(array[nan(), 1, 2, 3]); _col0 ------- NaN (1 row)select array_max(array[1, nan(), 2, 3]); _col0 ------- NaN (1 row)
array_sort	NaN is biggest	select array_sort(array[nan(), 1, 3, 2]); _col0 ---------------------- [1.0, 2.0, 3.0, NaN] (1 row)select array_sort(array[1, 3, nan(), 2]); _col0 ---------------------- [1.0, 2.0, 3.0, NaN] (1 row)
array_distinct()	Doesn't deduplicate NaN elements	select array_distinct(array[1, nan(), 2, nan(), null, null]); _col0 ---------------------------- [1.0, NaN, 2.0, NaN, null] (1 row)
array_duplicates()	Doesn't list NaN as duplicate elements	select array_duplicates(array[1, nan(), 2, nan(), null, null, 1]); _col0 ------------- [null, 1.0] (1 row)
array_except()	Always includes NaN, even if it's in both arrays	select array_except(array[1, 2, nan()], array[3, 4, nan()]); _col0 ----------------- [1.0, 2.0, NaN]
array_has_duplicates	returns false if NaN is the only duplicate	select array_has_duplicates(array[1, nan(), 2, nan()]); _col0 ------- false (1 row)
array_intersect	nan() is excluded from the intersection even if it's in both	select array_intersect(array[1, 2, nan()], array[3, 4, nan()]); _col0 ------- [] (1 row)
array_union()	nan() is included multiple times	select array_union(array[1, 2, nan()], array[3, 4, nan()]); _col0 -------------------------------- [1.0, 2.0, NaN, 3.0, 4.0, NaN]
array_least_frequent	nan() are all considered different from each other, so even if they appear multiple times, each one is considered as appearing once.	select array_least_frequent(array[1,1, 2,2, nan(), nan(), nan()]); _col0 ------- [NaN] (1 row) select array_least_frequent(array[1, 2,2, nan(), nan(), nan()]); _col0 ------- [1.0] (1 row)
array_least_frequent(x, n)	nan() appears however many times it appears in the array. For ties in frequency it is considered the largest number.	select array_least_frequent(array[1, 1, 2,2, nan(), nan(), nan()], 5); _col0 --------------------------- [NaN, NaN, NaN, 1.0, 2.0] (1 row)select array_least_frequent(array[2,2, nan(), nan(), nan(), 1.0], 5); _col0 --------------------------- [1.0, NaN, NaN, NaN, 2.0] (1 row)
array_position	nan() is not found	presto:di> select array_position(array[2,2, nan(), nan(), nan(), 1.0], nan()); _col0 ------- 0 (1 row)
array_remove	does not remove nan	select array_remove(array[1, 2, nan()], nan()); _col0 ----------------- [1.0, 2.0, NaN] (1 row)
array_top_n	nan is biggest	select array_top_n(array[1, 2, nan(), infinity()],3); _col0 ---------------------- [NaN, Infinity, 2.0] (1 row)
arrays_overlap	returns false if NaN is the only overlap	select arrays_overlap(array[1, 2, nan()], array[3, 4, nan()]); _col0 ------- false (1 row)
contains(x, nan())	checking for contains nan() always returns false	select contains(array[nan(), nan(), nan()], nan()); _col0 ------- false (1 row)


Maps
creating a map with duplicate nan keys	does not throw an error (unlike other duplicate values). The display shows only one of them, but if you try to get map_keys you'll see nan 2xNote that using double keys will always return a warning that floating point types can be non-deterministic	select map(array[1, nan(), nan()], array['a', 'b','c']); _col0 ---------------- {1.0=a, NaN=c} (1 row)vs.select map(array[1, nan(), 1], array['a', 'b','c'])[nan()]; Query 20240227_210132_03209_xptub failed: Duplicate map keys (1.0) are not allowed
subscript, element_at, map_subset	returns null for nan keys (i.e. doesn't find them)	select map(array[1, nan(), nan()], array['a', 'b','c'])[nan()]; _col0 ------- NULL (1 row) select element_at(map(array[1, nan(), 2], array['a', 'b','c']), nan()); _col0 ------- NULL (1 row)select map_subset(map(array[1, nan(), 2], array['a', 'b','c']), array[nan(), 1]); _col0 --------- {1.0=a} (1 row)
map_subset	doesn't include nan keys in the returned map (i.e. doesn't find them)	select map_subset(map(array[1, nan(), 2], array['a', 'b','c']), array[nan(), 1]); _col0 --------- {1.0=a} (1 row)
map_keys	returns all keys including duplicate NaN keys	select map_keys(map(array[1, nan(), nan()], array['a', 'b','c'])); _col0 ----------------- [1.0, NaN, NaN] (1 row)
map_top_n_keys	nan keys are biggest regardless of position	select map_top_n_keys(map(array[1, nan(), nan()], array['a', 'b','c']),3); _col0 ----------------- [NaN, NaN, 1.0] (1 row)
map_top_n	[BUG] nan ordering is dependent on map order. elements are evaluated in order, and nan is larger than anything before it, and smaller than anything after it. That means something after nan may be considered biggest even if something before nan was bigger. https://github.com/prestodb/presto/issues/22040	select map_top_n(map(array[1, 2, 3], array[nan(), 2, -infinity()]),2); _col0 ---------------------- {2=2.0, 3=-Infinity} (1 row)select map_top_n(map(array[1, 2, 3, 4], array[1, nan(), infinity(), -infinity()]),1); _col0 -------------- {3=Infinity}select map_top_n(map(array[1, 2, 3], array[1, infinity(), nan()]),1); _col0 --------- {3=NaN} (1 row)-- bug!! -infinity is reported as the largest here because it's the only thing after nan select map_top_n(map(array[1, 2, 3, 4], array[1, infinity(), nan(), -infinity()]),1); _col0 --------------- {4=-Infinity} (1 row)
map_top_n_values	nan is the biggest regardless of position	select map_top_n_values(map(array[1, 2 ,3, 4], array[1,nan(), 2, 0]),2); _col0 ------------ [NaN, 2.0] select map_top_n_values(map(array[1, 2 ,3, 4], array[nan(), 1, 2, 0]),2); _col0 ------------ [NaN, 2.0] (1 row)select map_top_n_values(map(array[1, 2 ,3, 4], array[1, 2, 2, nan()]),2); _col0 ------------ [NaN, 2.0] (1 row)

File format min/max (and sorted columns)	We can/do use min and max values supplied by the file format for the hive readers (e.g. partial aggregation pushdown, filter pushdown). The handling of nan in regards to min/max may be different across different file formats and may or may not align to whatever decision we make. Additionally, we may read from files that are not written by our readers/writers and therefore have different behavior. Changes may be required to these readers to make our behaviors fully consistent. I don't think Presto Java currently uses sortedness of files, but I think velox does. In addition to ensuring correctness/consistency for filtering related usages fo sortedness, we would also need to make sure e.g. for sort merge joins that the inputs sort nans in the same way
parquet	Presto Parquet writers do not include min/max stats when there are nans. Velox Parquet writers will write min/max stats but exclude NaN values. Other parquet implementations do otherwise, and there are also other proposals for what to do, e.g. see: https://github.com/apache/parquet-cpp/pull/444, https://github.com/apache/arrow-rs/issues/264, https://github.com/apache/parquet-format/pull/221, https://github.com/apache/parquet-format/pull/196
orc/dwrf	The Presto orc/dwrf writer will exclude all statistics if any of the values are NaN. https://github.com/prestodb/presto/blob/master/presto-orc/src/main/java/com/facebook/presto/orc/metadata/statistics/DoubleStatisticsBuilder.java#L75-L78. The Velox dwrf writer does the same https://github.com/facebookincubator/velox/blob/main/velox/dwio/dwrf/writer/StatisticsBuilder.h#L270-L273. don't know about other implementations
other file formats	didn't check, needs more investigation
Tuple domains	TupleDomain.compareTo will use the compareTo method for the type, which for DoubleType is Double.compare(), so NaN is largest
TupleDomainFilter (filter pushdown)	tuple domain filters behave the same as <, >, =, and <> operators. They return false except for testing whether NaN is NOT IN the filter
Other stats
metastore provided table/partition statistics	We do not currently use metastore stats for correctness. only for estimation, and do not consider them fully reliable, so this is not a concern. If we were to change that assumption, we would need to deal with this case (and how we evaluate them in the code). It is worth considering what the metastore expects when we provide min/max stats