Skip to content

Legacy Presto NaN behavior

Rebecca Schlussel edited this page Jun 17, 2024 · 1 revision

Below we provide a reference of Presto's previous behavior with regard to ordering and comparison of NaN values. The new behavior is consistent definition of nan=nan and nan is greater than all other values. See https://github.com/prestodb/rfcs/blob/main/RFC-0001-nan-definition.md


for more details.

Function Behavior Examples
Top level operators    
order by NaN is biggest  
SELECT distinct, group by one column, group by multiple columns deduplicates NaNs for double types. for real types, it does not (but if you do IS DISTINCT FROM directly it will say they are not distinct). this is because of a bug in the block version of real's isDistinctFrom method select distinct x from (values 1, nan(), 2, nan(), null, null)t(x); x ------ NaN NULL 1.0 2.0 select count(x), y from (values(1, 1), (1, 1), (1,nan()), (1,nan())) t(x,y) group by y; _col0 | y -------+----- 2 | NaN 2 | 1.0 (2 rows) select x, y, count(*) as count from (values(1, 1), (1, 1), (1,nan()), (1,nan())) t(x,y) group by x, y; x | y | count ---+-----+------- 1 | NaN | 2 1 | 1.0 | 2 (2 rows)
join on nan() key does not match. Also note that dynamic filtering has special handling for nan() based on this behavior https://github.com/prestodb/presto/pull/16314/files SELECT * FROM (values 1, 2, nan()) t1(x) join (values 1, 3, nan()) t2(x) on t1.x = t2.x; x | x -----+----- 1.0 | 1.0 (1 row)
     
Simple Scalar Functions and operators    
greatest/least Throws an invalid argument error for NaN in any position select greatest(1, nan());Query 20240228_144800_58951_3rsv7 failed: Invalid argument to greatest(): NaN select least(1, nan());Query 20240228_144824_58965_3rsv7 failed: Invalid argument to least(): NaN
>, <, >=, <= false for NaN in any position select nan() > 0; _col0 ------- false (1 row)select 0 > nan(); _col0 ------- false (1 row)select nan() >= 0; _col0 ------- false (1 row)select nan() < 0; _col0 ------- false (1 row)select nan() <= 0; _col0 ------- false (1 row) select nan() <= nan(); _col0 ------- false (1 row)
between false for NaN in any position select nan() between -Infinity() and infinity(); _col0 ------- false (1 row)select 3 between 0 and nan(); _col0 ------- false (1 row)
= false select nan() = nan(); _col0 ------- false (1 row)
<>, != TRUE select nan() <> nan(); _col0 ------- true
nan() IN (...) false select nan() in (1, 2, nan()); _col0 ------- false (1 row)
nan() NOT IN (....) TRUE select nan() not in (nan(), 1, 2); _col0 ------- true (1 row)
     
     
Aggregation Functions    
count(distinct) nan counted only once SELECT count(distinct x) from (values 1, 2, nan(), nan()) t(x); _col0 ------- 3 (1 row)select y, count(distinct x) as count_distinct, count(x) as count from (values(nan(), 1), (nan(), 1), (1, 1), (1, 2)) t(x,y) group by y; y | count_distinct | count ---+----------------+------- 2 | 1 | 1 1 | 2 | 3 (2 rows)
min, max, min_by, max_by If NaN is first, then NaN. Else the min or max other than NaN (https://github.com/prestodb/presto/issues/21877)  
set_agg() Nan appears multiple times select set_agg(x) from (Values 1, 2, nan(), nan(), 3) as t(x); _col0 --------------------------- [1.0, 2.0, NaN, NaN, 3.0] (1 row)
set_union() Nan appears multiple times SELECT set_union(x) from (values array[1, 2, nan()], array[1, 3, nan()]) t(x); _col0 --------------------------- [1.0, 2.0, NaN, 3.0, NaN] (1 row)
histogram each nan appears seperately in the histogram. in display it looks like there's only one with an incorrent count, but if you get map_keys you can see it appears n times SELECT histogram(x) from (values 1, 2, nan(), nan(), nan(), 1) t(x); _col0 ----------------------- {1.0=2, 2.0=1, NaN=1} (1 row)SELECT map_keys(histogram(x)) from (values 1, 2, nan(), nan(), nan(), 1) t(x); _col0 --------------------------- [1.0, 2.0, NaN, NaN, NaN] (1 row)
map_agg nan appears multiple times (in display looks like it only appears 1x, but if you get map_keys you can see multiple nans) select map_agg(x,y) from (values(1, 'a'),( 2, 'b'), (nan(), 'c'),(1, 'd'), (nan(), 'e'))t(x,y); _col0 ----------------------- {1.0=a, 2.0=b, NaN=e} (1 row)select map_keys(map_agg(x,y)) from (values(1, 'a'),( 2, 'b'), (nan(), 'c'),(1, 'd'), (nan(), 'e'))t(x,y); _col0 ---------------------- [1.0, 2.0, NaN, NaN]
multimap_agg fails with an index out of bounds exception if there are any nan() keys https://github.com/prestodb/presto/issues/22679 SELECT multimap_agg(x, y) FROM (VALUES (nan(), 2), (2, 3)) t(x,y); -> Query 20240506_161354_39491_uuek4 failed: Index -1 out of bounds for length 1024.
map_union nan keys appear multiple times. in display it looks deduplicated, but if you get map_keys you can see they are not SELECT map_union(x) from (values map(array[1, 2, nan()], array[3,4,5]), map(array[100, 2, nan()], array[10,20, 30])) t(x); _col0 ---------------------------------- {1.0=3, 2.0=4, NaN=30, 100.0=10} (1 row) SELECT map_keys(map_union(x)) from (values map(array[1, 2, nan()], array[3,4,5]), map(array[100, 2, nan()], array[10,20, 30])) t(x); _col0 ----------------------------- [1.0, 2.0, NaN, 100.0, NaN] (1 row)
map_union_sum nan keys are duplicated and values not summed. in display it looks deduplicated, but if you get map_keys you can see they are not SELECT map_union_sum(x) from (values map(array[1, 2, nan()], array[3,4,5]), map(array[100, 2, nan()], array[10,20, 30])) t(x); _col0 ----------------------------------- {1.0=3, 2.0=24, NaN=30, 100.0=10} (1 row) SELECT map_keys(map_union_sum(x)) from (values map(array[1, 2, nan()], array[3,4,5]), map(array[100, 2, nan()], array[10,20, 30])) t(x); _col0 ----------------------------- [1.0, 2.0, NaN, 100.0, NaN] (1 row)
     
Array functions    
array_min NaN if NaN is in any position in the array select array_min(array[nan(),1, 2, 3]); _col0 ------- NaN (1 row) select array_min(array[1, nan(), 2, 3]); _col0 ------- NaN (1 row)
array_max NaN if NaN is in any position in the array select array_max(array[nan(), 1, 2, 3]); _col0 ------- NaN (1 row)select array_max(array[1, nan(), 2, 3]); _col0 ------- NaN (1 row)
array_sort NaN is biggest select array_sort(array[nan(), 1, 3, 2]); _col0 ---------------------- [1.0, 2.0, 3.0, NaN] (1 row)select array_sort(array[1, 3, nan(), 2]); _col0 ---------------------- [1.0, 2.0, 3.0, NaN] (1 row)
array_distinct() Doesn't deduplicate NaN elements select array_distinct(array[1, nan(), 2, nan(), null, null]); _col0 ---------------------------- [1.0, NaN, 2.0, NaN, null] (1 row)
array_duplicates() Doesn't list NaN as duplicate elements select array_duplicates(array[1, nan(), 2, nan(), null, null, 1]); _col0 ------------- [null, 1.0] (1 row)
array_except() Always includes NaN, even if it's in both arrays select array_except(array[1, 2, nan()], array[3, 4, nan()]); _col0 ----------------- [1.0, 2.0, NaN]
array_has_duplicates returns false if NaN is the only duplicate select array_has_duplicates(array[1, nan(), 2, nan()]); _col0 ------- false (1 row)
array_intersect nan() is excluded from the intersection even if it's in both select array_intersect(array[1, 2, nan()], array[3, 4, nan()]); _col0 ------- [] (1 row)
array_union() nan() is included multiple times select array_union(array[1, 2, nan()], array[3, 4, nan()]); _col0 -------------------------------- [1.0, 2.0, NaN, 3.0, 4.0, NaN]
array_least_frequent nan() are all considered different from each other, so even if they appear multiple times, each one is considered as appearing once. select array_least_frequent(array[1,1, 2,2, nan(), nan(), nan()]); _col0 ------- [NaN] (1 row) select array_least_frequent(array[1, 2,2, nan(), nan(), nan()]); _col0 ------- [1.0] (1 row)
array_least_frequent(x, n) nan() appears however many times it appears in the array. For ties in frequency it is considered the largest number. select array_least_frequent(array[1, 1, 2,2, nan(), nan(), nan()], 5); _col0 --------------------------- [NaN, NaN, NaN, 1.0, 2.0] (1 row)select array_least_frequent(array[2,2, nan(), nan(), nan(), 1.0], 5); _col0 --------------------------- [1.0, NaN, NaN, NaN, 2.0] (1 row)
array_position nan() is not found presto:di> select array_position(array[2,2, nan(), nan(), nan(), 1.0], nan()); _col0 ------- 0 (1 row)
array_remove does not remove nan select array_remove(array[1, 2, nan()], nan()); _col0 ----------------- [1.0, 2.0, NaN] (1 row)
array_top_n nan is biggest select array_top_n(array[1, 2, nan(), infinity()],3); _col0 ---------------------- [NaN, Infinity, 2.0] (1 row)
arrays_overlap returns false if NaN is the only overlap select arrays_overlap(array[1, 2, nan()], array[3, 4, nan()]); _col0 ------- false (1 row)
contains(x, nan()) checking for contains nan() always returns false select contains(array[nan(), nan(), nan()], nan()); _col0 ------- false (1 row)
     
     
Maps    
creating a map with duplicate nan keys does not throw an error (unlike other duplicate values). The display shows only one of them, but if you try to get map_keys you'll see nan 2xNote that using double keys will always return a warning that floating point types can be non-deterministic select map(array[1, nan(), nan()], array['a', 'b','c']); _col0 ---------------- {1.0=a, NaN=c} (1 row)vs.select map(array[1, nan(), 1], array['a', 'b','c'])[nan()]; Query 20240227_210132_03209_xptub failed: Duplicate map keys (1.0) are not allowed
subscript, element_at, map_subset returns null for nan keys (i.e. doesn't find them) select map(array[1, nan(), nan()], array['a', 'b','c'])[nan()]; _col0 ------- NULL (1 row) select element_at(map(array[1, nan(), 2], array['a', 'b','c']), nan()); _col0 ------- NULL (1 row)select map_subset(map(array[1, nan(), 2], array['a', 'b','c']), array[nan(), 1]); _col0 --------- {1.0=a} (1 row)
map_subset doesn't include nan keys in the returned map (i.e. doesn't find them) select map_subset(map(array[1, nan(), 2], array['a', 'b','c']), array[nan(), 1]); _col0 --------- {1.0=a} (1 row)
map_keys returns all keys including duplicate NaN keys select map_keys(map(array[1, nan(), nan()], array['a', 'b','c'])); _col0 ----------------- [1.0, NaN, NaN] (1 row)
map_top_n_keys nan keys are biggest regardless of position select map_top_n_keys(map(array[1, nan(), nan()], array['a', 'b','c']),3); _col0 ----------------- [NaN, NaN, 1.0] (1 row)
map_top_n [BUG] nan ordering is dependent on map order. elements are evaluated in order, and nan is larger than anything before it, and smaller than anything after it. That means something after nan may be considered biggest even if something before nan was bigger. https://github.com/prestodb/presto/issues/22040 select map_top_n(map(array[1, 2, 3], array[nan(), 2, -infinity()]),2); _col0 ---------------------- {2=2.0, 3=-Infinity} (1 row)select map_top_n(map(array[1, 2, 3, 4], array[1, nan(), infinity(), -infinity()]),1); _col0 -------------- {3=Infinity}select map_top_n(map(array[1, 2, 3], array[1, infinity(), nan()]),1); _col0 --------- {3=NaN} (1 row)-- bug!! -infinity is reported as the largest here because it's the only thing after nan select map_top_n(map(array[1, 2, 3, 4], array[1, infinity(), nan(), -infinity()]),1); _col0 --------------- {4=-Infinity} (1 row)
map_top_n_values nan is the biggest regardless of position select map_top_n_values(map(array[1, 2 ,3, 4], array[1,nan(), 2, 0]),2); _col0 ------------ [NaN, 2.0] select map_top_n_values(map(array[1, 2 ,3, 4], array[nan(), 1, 2, 0]),2); _col0 ------------ [NaN, 2.0] (1 row)select map_top_n_values(map(array[1, 2 ,3, 4], array[1, 2, 2, nan()]),2); _col0 ------------ [NaN, 2.0] (1 row)
     
File format min/max (and sorted columns) We can/do use min and max values supplied by the file format for the hive readers (e.g. partial aggregation pushdown, filter pushdown). The handling of nan in regards to min/max may be different across different file formats and may or may not align to whatever decision we make. Additionally, we may read from files that are not written by our readers/writers and therefore have different behavior. Changes may be required to these readers to make our behaviors fully consistent. I don't think Presto Java currently uses sortedness of files, but I think velox does. In addition to ensuring correctness/consistency for filtering related usages fo sortedness, we would also need to make sure e.g. for sort merge joins that the inputs sort nans in the same way  
parquet Presto Parquet writers do not include min/max stats when there are nans. Velox Parquet writers will write min/max stats but exclude NaN values. Other parquet implementations do otherwise, and there are also other proposals for what to do, e.g. see: https://github.com/apache/parquet-cpp/pull/444, https://github.com/apache/arrow-rs/issues/264, https://github.com/apache/parquet-format/pull/221, https://github.com/apache/parquet-format/pull/196  
orc/dwrf The Presto orc/dwrf writer will exclude all statistics if any of the values are NaN. https://github.com/prestodb/presto/blob/master/presto-orc/src/main/java/com/facebook/presto/orc/metadata/statistics/DoubleStatisticsBuilder.java#L75-L78. The Velox dwrf writer does the same https://github.com/facebookincubator/velox/blob/main/velox/dwio/dwrf/writer/StatisticsBuilder.h#L270-L273. don't know about other implementations  
other file formats didn't check, needs more investigation  
Tuple domains TupleDomain.compareTo will use the compareTo method for the type, which for DoubleType is Double.compare(), so NaN is largest  
TupleDomainFilter (filter pushdown) tuple domain filters behave the same as <, >, =, and <> operators. They return false except for testing whether NaN is NOT IN the filter  
Other stats    
metastore provided table/partition statistics We do not currently use metastore stats for correctness. only for estimation, and do not consider them fully reliable, so this is not a concern. If we were to change that assumption, we would need to deal with this case (and how we evaluate them in the code). It is worth considering what the metastore expects when we provide min/max stats  
Clone this wiki locally