Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use non-comparison coercion for Coalesce or even avoid implicit casting for Coalesce #10261

Closed
jayzhan211 opened this issue Apr 27, 2024 · 4 comments · Fixed by #10268
Closed
Assignees
Labels
enhancement New feature or request

Comments

@jayzhan211
Copy link
Contributor

jayzhan211 commented Apr 27, 2024

Is your feature request related to a problem or challenge?

What does Coalesce do

The coalesce function implicitly coerces types with Signature::VariadicEqual which has comparison_coercion internally, and the coercion is taken considered for all the columns, not only those we need, which gives us back unexpected casting results.

Coerce types after first non-null values are known

We can see the following example,

statement ok
create table test1 (a int, b int) as values (null, 3), (2, null);

# test coercion string
query TT
select 
	coalesce(a, b, 'none_set'),
	arrow_typeof(coalesce(a, b, 'none_set'))
from test1;
----
3 Utf8
2 Utf8

Since they are coerced to Utf8, so we get 3 and 2 with Utf8.

Ideally, If we take the first non-null value, we should expect to get Int, not Utf8.

another example, dict is cast to Int64

query IT
select 
	coalesce(arrow_cast(123, 'Dictionary(Int32, Int8)'), 34),
	arrow_typeof(coalesce(arrow_cast(123, 'Dictionary(Int32, Int8)'), 34))
----
123 Int64

The reason we need coercion is that it is possible that we have different types for different columns. Coerce them can help we get the final single type.

statement ok
create table test1 as values 
	(arrow_cast(1, 'Int8'), arrow_cast(2, 'Int32')),
	(null, arrow_cast(3, 'Int32'))

query IT
select 
	coalesce(column1, column2),
	arrow_typeof(coalesce(column1, column2))
from test1;
----
1 Int32
3 Int32

We get (1, Int8) and (3, Int32) for respective row, and finally cast them to I32.

I suggest that we apply coercion after we collect those first non-null values for each row.

Use non-comparison coercion

Comparision coercion (fn comparison_coercion) vs non-comparison coercion (fn coerced_from)

Those two logic are quite different, comparison coercion is for comparision. For example, compare dict with dict returns value type, since dict key is not important. Compare i64 with u64 we fallback to i64 because we don't have i128, even there is possible of lossy if you have large U64 value, but most of the cases like U64(1) and I64(1) is comparable, we will not block for those edge cases in comparison. And, there might be more.

Given the difference between these two coercion, I think non-comparison coercion is more suitable for Coalesce function. Btw, I think make_array should switch to non-comparison coercion too. make_array should do comparison coercion, following what Duckdb do

I suggest we switch VariadicEqual to non-comparison coercion or introduce another signature VariadicEqualNonCompare if comparison coercion is needed somewhere.

I think Int64 is a big surprise that we should avoid.

query IT
select 
	coalesce(arrow_cast(3, 'Dictionary(Int32, Int32)'), arrow_cast(3, 'Dictionary(Int32, Int64)')),
	arrow_typeof(coalesce(arrow_cast(3, 'Dictionary(Int32, Int32)'), arrow_cast(3, 'Dictionary(Int32, Int64)')));
----
3 Int64

Maybe disable implicit coercion for Coalesce function?

I'm not sure why is coalesce function introduce implicit coercion, but I found that Postgres and Duckdb does not do implicit casting for Coalesce, maybe we should follow them?

DuckDB errors

D create table t1 (a integer, b varchar);
D insert into t1 values (1, 'a');
D insert into t1 values (null, 'b');
D select coalesce(a, b) from t1;
Error: Binder Error: Cannot mix values of type INTEGER and VARCHAR in COALESCE operator - an explicit cast is required

Postgres Error

postgres=# create table t2(a int, b varchar);
CREATE TABLE
postgres=# insert into t2 values (1, 'a');
INSERT 0 1
postgres=# insert into t2 values (null, 'b');
INSERT 0 1
postgres=# select coalesce(a, b) from t2;
ERROR:  COALESCE types integer and character varying cannot be matched
LINE 1: select coalesce(a, b) from t2;

Describe the solution you'd like

  1. Avoid casting for coalesce at all
  2. Consider coercion after collecting known results if we need casting.
  3. Switch VariadicEqual to non-comparison coercion.

Describe alternatives you've considered

No response

Additional context

No response

Related issue that has coercion issue from Coalesce #10221
Part of the idea #10241

@jayzhan211 jayzhan211 added the enhancement New feature or request label Apr 27, 2024
@alamb
Copy link
Contributor

alamb commented Apr 27, 2024

If possible, I think we should follow the model of existing implementations here (rather than invent DataFusion specific semants)

Specifically if we are going to change the semantics of coalsce I think we should follow either postgres or spark's behavior -- I haven't done the research to know how close/far what DataFusion does compared to those systems

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Apr 27, 2024

Alright, so I think we should remove casting logic for coalesce, which follows what Postgres and DuckDB do. Returns error if there are mixed types for different columns.

It seems Postgres has casting for some of the types too.

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Apr 28, 2024

Plan:

  1. I think we should change coerced_from to can_cast_types here, pre-computed reasonable valid types with TypeSignature in get_valid_types.
    for valid_types in valid_types {
    if let Some(types) = maybe_data_types(&valid_types, current_types) {
    return Ok(types);
    }
    }

fn get_valid_types(
signature: &TypeSignature,
current_types: &[DataType],
) -> Result<Vec<Vec<DataType>>> {
fn array_element_and_optional_index(
current_types: &[DataType],
) -> Result<Vec<Vec<DataType>>> {
// make sure there's 2 or 3 arguments
if !(current_types.len() == 2 || current_types.len() == 3) {
return Ok(vec![vec![]]);
}
let first_two_types = &current_types[0..2];
let mut valid_types = array_append_or_prepend_valid_types(first_two_types, true)?;
// Early return if there are only 2 arguments
if current_types.len() == 2 {
return Ok(valid_types);
}
let valid_types_with_index = valid_types
.iter()
.map(|t| {
let mut t = t.clone();
t.push(DataType::Int64);
t
})
.collect::<Vec<_>>();
valid_types.extend(valid_types_with_index);
Ok(valid_types)
}
fn array_append_or_prepend_valid_types(
current_types: &[DataType],
is_append: bool,
) -> Result<Vec<Vec<DataType>>> {
if current_types.len() != 2 {
return Ok(vec![vec![]]);
}
let (array_type, elem_type) = if is_append {
(&current_types[0], &current_types[1])
} else {
(&current_types[1], &current_types[0])
};
// We follow Postgres on `array_append(Null, T)`, which is not valid.
if array_type.eq(&DataType::Null) {
return Ok(vec![vec![]]);
}
// We need to find the coerced base type, mainly for cases like:
// `array_append(List(null), i64)` -> `List(i64)`
let array_base_type = datafusion_common::utils::base_type(array_type);
let elem_base_type = datafusion_common::utils::base_type(elem_type);
let new_base_type = comparison_coercion(&array_base_type, &elem_base_type);
let new_base_type = new_base_type.ok_or_else(|| {
internal_datafusion_err!(
"Coercion from {array_base_type:?} to {elem_base_type:?} not supported."
)
})?;
let new_array_type = datafusion_common::utils::coerced_type_with_base_type_only(
array_type,
&new_base_type,
);
match new_array_type {
DataType::List(ref field)
| DataType::LargeList(ref field)
| DataType::FixedSizeList(ref field, _) => {
let new_elem_type = field.data_type();
if is_append {
Ok(vec![vec![new_array_type.clone(), new_elem_type.clone()]])
} else {
Ok(vec![vec![new_elem_type.to_owned(), new_array_type.clone()]])
}
}
_ => Ok(vec![vec![]]),
}
}
fn array(array_type: &DataType) -> Option<DataType> {
match array_type {
DataType::List(_)
| DataType::LargeList(_)
| DataType::FixedSizeList(_, _) => {
let array_type = coerced_fixed_size_list_to_list(array_type);
Some(array_type)
}
_ => None,
}
}
let valid_types = match signature {
TypeSignature::Variadic(valid_types) => valid_types
.iter()
.map(|valid_type| current_types.iter().map(|_| valid_type.clone()).collect())
.collect(),
TypeSignature::Uniform(number, valid_types) => valid_types
.iter()
.map(|valid_type| (0..*number).map(|_| valid_type.clone()).collect())
.collect(),
TypeSignature::VariadicEqual => {
let new_type = current_types.iter().skip(1).try_fold(
current_types.first().unwrap().clone(),
|acc, x| {
// The coerced types found by `comparison_coercion` are not guaranteed to be
// coercible for the arguments. `comparison_coercion` returns more loose
// types that can be coerced to both `acc` and `x` for comparison purpose.
// See `maybe_data_types` for the actual coercion.
let coerced_type = comparison_coercion(&acc, x);
if let Some(coerced_type) = coerced_type {
Ok(coerced_type)
} else {
internal_err!("Coercion from {acc:?} to {x:?} failed.")
}
},
);
match new_type {
Ok(new_type) => vec![vec![new_type; current_types.len()]],
Err(e) => return Err(e),
}
}
TypeSignature::VariadicAny => {
vec![current_types.to_vec()]
}
TypeSignature::Exact(valid_types) => vec![valid_types.clone()],
TypeSignature::ArraySignature(ref function_signature) => match function_signature
{
ArrayFunctionSignature::ArrayAndElement => {
array_append_or_prepend_valid_types(current_types, true)?
}
ArrayFunctionSignature::ElementAndArray => {
array_append_or_prepend_valid_types(current_types, false)?
}
ArrayFunctionSignature::ArrayAndIndex => {
if current_types.len() != 2 {
return Ok(vec![vec![]]);
}
array(&current_types[0]).map_or_else(
|| vec![vec![]],
|array_type| vec![vec![array_type, DataType::Int64]],
)
}
ArrayFunctionSignature::ArrayAndElementAndOptionalIndex => {
array_element_and_optional_index(current_types)?
}
ArrayFunctionSignature::Array => {
if current_types.len() != 1 {
return Ok(vec![vec![]]);
}
array(&current_types[0])
.map_or_else(|| vec![vec![]], |array_type| vec![vec![array_type]])
}
},
TypeSignature::Any(number) => {
if current_types.len() != *number {
return plan_err!(
"The function expected {} arguments but received {}",
number,
current_types.len()
);
}
vec![(0..*number).map(|i| current_types[i].clone()).collect()]
}
TypeSignature::OneOf(types) => types
.iter()
.filter_map(|t| get_valid_types(t, current_types).ok())
.flatten()
.collect::<Vec<_>>(),
};
Ok(valid_types)
}

  1. Non-comparison coercion: type resolution coercion for Coalesce, MakeArray, and so on.

https://www.postgresql.org/docs/current/typeconv-union-case.html

  1. Deprecate/Remove coerced_from at the end!

@alamb
Copy link
Contributor

alamb commented May 1, 2024

BTW this is a really nicely written ticket

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
2 participants