refactor: load parquet use own can_cast_to() #16072

Closed
wants to merge 3 commits into from
2 changes: 0 additions & 2 deletions Cargo.lock


101 changes: 94 additions & 7 deletions src/query/expression/src/type_check.rs
@@ -571,13 +571,12 @@ pub fn can_auto_cast_to(
}
(_, _) => unreachable!(),
},
(DataType::Tuple(src_tys), DataType::Tuple(dest_tys))
if src_tys.len() == dest_tys.len() =>
{
src_tys
.iter()
.zip(dest_tys)
.all(|(src_ty, dest_ty)| can_auto_cast_to(src_ty, dest_ty, auto_cast_rules))
(DataType::Tuple(src_tys), DataType::Tuple(dest_tys)) => {
src_tys.len() == dest_tys.len()
&& src_tys
.iter()
.zip(dest_tys)
.all(|(src_ty, dest_ty)| can_auto_cast_to(src_ty, dest_ty, auto_cast_rules))
}
(DataType::String, DataType::Decimal(_)) => true,
(DataType::Decimal(x), DataType::Decimal(y)) => {
@@ -745,3 +744,91 @@ pub const ALL_SIMPLE_CAST_FUNCTIONS: &[&str] = &[
pub fn is_simple_cast_function(name: &str) -> bool {
ALL_SIMPLE_CAST_FUNCTIONS.contains(&name)
}
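
The tuple arm of `can_auto_cast_to` above moves the length check from a match guard into the arm body. A minimal sketch of what that kind of change means in Rust in general, using a hypothetical toy `Ty` enum (not the databend `DataType`) and made-up return strings: with a guard, a length mismatch skips the arm and a later arm decides; with the check in the body, the tuple arm always decides.

```rust
// Hypothetical toy type, used only to show the control-flow difference.
#[derive(Debug, PartialEq)]
enum Ty {
    Int,
    Tuple(Vec<Ty>),
}

/// Length check as a match guard: a length mismatch skips this arm,
/// so a later arm decides the result.
fn with_guard(src: &Ty, dest: &Ty) -> &'static str {
    match (src, dest) {
        (Ty::Tuple(a), Ty::Tuple(b)) if a.len() == b.len() => "tuple arm",
        (Ty::Tuple(_), _) => "later arm",
        _ => "other",
    }
}

/// Length check inside the arm body: the tuple arm always decides.
fn with_body_check(src: &Ty, dest: &Ty) -> &'static str {
    match (src, dest) {
        (Ty::Tuple(a), Ty::Tuple(b)) => {
            if a.len() == b.len() {
                "tuple arm"
            } else {
                "rejected in tuple arm"
            }
        }
        (Ty::Tuple(_), _) => "later arm",
        _ => "other",
    }
}
```

Whether the two forms differ in practice depends on which arms follow the tuple arm in the real function, which this excerpt does not show in full.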

/// # Differences with `can_auto_cast_to`
///
/// ## For Users
/// Suppose we have `fn foo(dest_ty)`.
/// - `can_auto_cast` means the user can call `foo(src_ty)` directly.
/// - `can_cast` means the user can call `foo(cast(c1 as dest_ty))`. This should be consistent with `Evaluator::run_cast`, which is partially ensured by the test `test_can_cast_to`.
///
/// ## For Internal Usage
/// - `can_auto_cast` helps us choose the most appropriate destination type:
///   - From multiple overloaded instances of `foo`.
///   - As a common supertype of multiple source types.
/// - `can_cast` is currently only used when loading Parquet/ORC/Iceberg files as a pre-check.
///
/// ## Principle of Making the Rules
/// - `can_auto_cast` requires casting to a supertype.
/// - `can_cast` returns true as long as some values of the source type can, in some form, be interpreted as values of the destination type.
///
/// ## Examples
/// - `can_auto_cast` only allows casting from `int8` to `int16`, but not vice versa.
/// - `can_cast` allows casting between any two numeric types.
pub fn can_cast_to(src_type: &DataType, dest_type: &DataType) -> bool {
use DataType::*;
if src_type == dest_type {
return true;
}
// We mainly care about which types can or cannot be cast to dest_type,
// so the match is written to make that easy to read off.
// Prefer explicit patterns over `_` to avoid missing a case.
match (dest_type, src_type) {
Review comment (Member):

It seems more customary to put src_type on the left.

Reply (Member Author):

The comments explain the reason:

// we mainly care about which types can/cannot cast to dest_type.
// the match is written in a way to make it easier to read this info.

Reply (Member Author, @youngsofun, Jul 18, 2024):

To be more specific, our use case is as follows: we have one dest type and several src types, and we need to determine whether the cast between them is valid.

For each case (row) there is only one dest and multiple src types, so putting dest in front is better aligned and more readable.

(Null | EmptyArray | EmptyMap | Generic(_), _) => unreachable!(),

// ==== remove null first
(Nullable(_), Null) => true,
(Nullable(box dest_ty), Nullable(box src_ty))
| (dest_ty, Nullable(box src_ty))
| (Nullable(box dest_ty), src_ty) => can_cast_to(src_ty, dest_ty),

// ==== dive into nested types; both sides must be the same kind of nested type
(Map(box dest_inner), Map(box src_inner)) => match (dest_inner, src_inner) {
(Tuple(_), Tuple(_)) => can_cast_to(src_inner, dest_inner),
(_, _) => unreachable!(),
},
(Map(_), EmptyMap) => true,
(Map(_), _) => false,

(Tuple(dest_tys), Tuple(src_tys)) => {
src_tys.len() == dest_tys.len()
&& src_tys
.iter()
.zip(dest_tys)
.all(|(src_ty, dest_ty)| can_cast_to(src_ty, dest_ty))
}
(Tuple(_), _) => false,

(Array(box dest_ty), Array(box src_ty)) => can_cast_to(src_ty, dest_ty),
(Array(_), EmptyArray) => true,
(Array(_), _) => false,

// ==== handle atomic types last, so the `_` patterns below only need to consider them.
(String | Variant, _) => true,

(Number(_), Binary | Bitmap) => false,
(Number(_), _) => true,

// not allowed: Binary|Date|Timestamp|Variant
(Decimal(_), Number(_) | String | Decimal(_) | Boolean) => true,
(Decimal(_), _) => false,

// not allowed: Binary|Date|Timestamp
(Boolean, Number(_) | String | Variant | Decimal(_) | Boolean) => true,
(Boolean, _) => false,

(Timestamp | Date, String | Variant | Timestamp | Date) => true,
(Timestamp | Date, Number(nt)) => nt.is_integer(),
(Timestamp | Date, _) => false,

(Binary, String) => true,
(Binary, _) => false,

(Bitmap, String) => true,
(Bitmap, Number(nt)) => !nt.is_signed(),
(Bitmap, _) => false,

(Geometry, String | Binary | Variant) => true,
(Geometry, _) => false,
}
}
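
To make the distinction in the doc comment concrete, here is a minimal sketch on a hypothetical four-variant `Ty` enum. The names `Ty`, `can_auto_cast`, and `can_cast` are stand-ins chosen for illustration, not the databend types or functions, and the rules are a simplified subset. It also shows the dest-first layout discussed in the review thread: the tuple is `(dest, src)`, so the accepted sources for each destination sit in adjacent rows.

```rust
// Toy model only: `Ty`, `can_auto_cast`, and `can_cast` are hypothetical
// stand-ins, not the databend types or functions.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Ty {
    Int8,
    Int16,
    Str,
    Binary,
}

/// Widening only: the destination must be a supertype of the source.
fn can_auto_cast(src: Ty, dest: Ty) -> bool {
    src == dest || matches!((src, dest), (Ty::Int8, Ty::Int16))
}

/// Permissive: true if some source values can be read as the destination.
/// The tuple is (dest, src), so each destination's rules are grouped together.
fn can_cast(src: Ty, dest: Ty) -> bool {
    use Ty::*;
    if src == dest {
        return true;
    }
    match (dest, src) {
        (Str, _) => true,

        (Int8 | Int16, Binary) => false,
        (Int8 | Int16, _) => true,

        (Binary, Str) => true,
        (Binary, _) => false,
    }
}
```

In this toy model `can_auto_cast(Ty::Int16, Ty::Int8)` is false while `can_cast(Ty::Int16, Ty::Int8)` is true, which mirrors the `int8`/`int16` example in the doc comment.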
1 change: 0 additions & 1 deletion src/query/expression/src/types/variant.rs
@@ -154,7 +154,6 @@ impl ValueType for VariantType {
}

fn push_default(builder: &mut Self::ColumnBuilder) {
builder.put_slice(b"");
builder.commit_row();
}

1 change: 1 addition & 0 deletions src/query/functions/tests/it/main.rs
@@ -16,6 +16,7 @@
#![feature(try_blocks)]
#![feature(trait_alias)]
#![feature(iter_collect_into)]
#![feature(box_patterns)]
#![allow(clippy::arc_with_non_send_sync)]

// We can generate new test files via using `env REGENERATE_GOLDENFILES=1 cargo test` and `git diff` to show differs
146 changes: 146 additions & 0 deletions src/query/functions/tests/it/scalars/can_cast.rs
@@ -0,0 +1,146 @@
// Copyright 2022 Datafuse Labs.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

use databend_common_exception::Result;
use databend_common_expression::date_helper::TzLUT;
use databend_common_expression::type_check::can_cast_to;
use databend_common_expression::type_check::check_cast;
use databend_common_expression::types::variant::cast_scalar_to_variant;
use databend_common_expression::types::DataType;
use databend_common_expression::types::DecimalDataType;
use databend_common_expression::types::DecimalSize;
use databend_common_expression::types::NumberDataType;
use databend_common_expression::BlockEntry;
use databend_common_expression::ColumnBuilder;
use databend_common_expression::DataBlock;
use databend_common_expression::Evaluator;
use databend_common_expression::Expr;
use databend_common_expression::Expr::ColumnRef;
use databend_common_expression::FunctionContext;
use databend_common_expression::Scalar;
use databend_common_expression::Value;
use databend_common_functions::BUILTIN_FUNCTIONS;
use roaring::RoaringTreemap;

fn cast_simple_value(src_type: &DataType, dest_type: &DataType) -> Result<Expr> {
let n = 2;
let src_str = match dest_type {
DataType::Boolean => "false",
DataType::Timestamp | DataType::Date => "1970-01-01 00:00:00.000000",
_ => "0",
};
let value = match src_type {
DataType::String => Value::Column(
ColumnBuilder::repeat(&Scalar::String(src_str.to_string()).as_ref(), n, src_type)
.build(),
),
DataType::Binary => Value::Column(
ColumnBuilder::repeat(
&Scalar::Binary(src_str.as_bytes().to_vec()).as_ref(),
n,
src_type,
)
.build(),
),
DataType::Variant => {
let s = Scalar::String(src_str.to_string());
let mut buf = vec![];
cast_scalar_to_variant(s.as_ref(), TzLUT::default(), &mut buf);
let s = Scalar::Variant(buf);
Value::Column(ColumnBuilder::repeat(&s.as_ref(), n, src_type).build())
}
_ => Value::Column(ColumnBuilder::repeat_default(src_type, n).build()),
};

let block = DataBlock::new(vec![BlockEntry::new(src_type.clone(), value)], n);
let func_ctx = FunctionContext::default();
let evaluator = Evaluator::new(&block, &func_ctx, &BUILTIN_FUNCTIONS);
let expr = ColumnRef {
span: None,
id: 0,
data_type: src_type.clone(),
display_name: "c1".to_string(),
};
let expr = check_cast(None, false, expr, dest_type, &BUILTIN_FUNCTIONS)?;
let r = evaluator.run(&expr)?;
let r0 = r.index(0).unwrap();
let exp = match dest_type {
DataType::Boolean => Scalar::Boolean(false),
DataType::Binary => Scalar::Binary("0".as_bytes().to_vec()),
DataType::Timestamp => Scalar::Timestamp(0),
DataType::Date => Scalar::Date(0),
DataType::Bitmap => {
let mut buf = vec![];
let rb = RoaringTreemap::from_iter([0].iter());
rb.serialize_into(&mut buf).unwrap();
Scalar::Bitmap(buf)
}
_ => Scalar::default_value(dest_type),
};
assert_eq!(
r0,
exp.as_ref(),
"wrong cast result: {src_type} -> {dest_type}: expect {exp:?}, got {r0:?}",
);
Ok(expr)
}

#[test]
fn test_can_cast_to() {
let types = [
DataType::Number(NumberDataType::Float32),
DataType::Number(NumberDataType::Float64),
DataType::Number(NumberDataType::Int8),
DataType::Number(NumberDataType::UInt8),
DataType::Decimal(DecimalDataType::Decimal128(DecimalSize {
precision: 10,
scale: 2,
})),
DataType::Decimal(DecimalDataType::Decimal128(DecimalSize {
precision: 10,
scale: 0,
})),
DataType::Boolean,
DataType::Timestamp,
DataType::Date,
DataType::String,
DataType::Binary,
DataType::Variant,
DataType::Bitmap,
// todo: fix bug about default value of Geometry
// DataType::Geometry,
];

// evaluating
for dst in &types {
if !matches!(dst, DataType::String | DataType::Variant) {
for src in &types {
if src != dst {
let exp = can_cast_to(src, dst);
let res = cast_simple_value(src, dst);
if let Err(err) = &res {
assert!(err.message().contains("unable to cast type"), "{err}");
assert!(!err.message().contains("evaluating"), "{err}");
};
let res = res.is_ok();
assert_eq!(
res, exp,
"casting {src} to {dst} succeeded: {}, but can_cast_to returned {exp}",
res
)
}
}
}
}
}
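
The idea behind `test_can_cast_to` is to run the real cast on one simple value per type pair and check that the static predicate agrees with the outcome. A minimal sketch of that pattern, assuming a hypothetical three-type system (`Ty`, `Val`, `run_cast`, `default_value`, and `mismatches` are all invented for illustration and are not databend APIs):

```rust
// Hypothetical toy types; none of these names come from databend.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Ty {
    Int,
    Bool,
    Bytes,
}

#[derive(Clone, Debug, PartialEq)]
enum Val {
    Int(i64),
    Bool(bool),
    Bytes(Vec<u8>),
}

/// The static pre-check under test.
fn can_cast(src: Ty, dest: Ty) -> bool {
    src == dest || matches!((dest, src), (Ty::Int, Ty::Bool) | (Ty::Bool, Ty::Int))
}

/// The "real" cast: unsupported pairs fail with a type-level error.
fn run_cast(v: &Val, dest: Ty) -> Result<Val, String> {
    match (dest, v) {
        (Ty::Int, Val::Int(i)) => Ok(Val::Int(*i)),
        (Ty::Int, Val::Bool(b)) => Ok(Val::Int(*b as i64)),
        (Ty::Bool, Val::Bool(b)) => Ok(Val::Bool(*b)),
        (Ty::Bool, Val::Int(i)) => Ok(Val::Bool(*i != 0)),
        (Ty::Bytes, Val::Bytes(b)) => Ok(Val::Bytes(b.clone())),
        (dest, v) => Err(format!("unable to cast type of {v:?} to {dest:?}")),
    }
}

/// One simple sample value per type.
fn default_value(ty: Ty) -> Val {
    match ty {
        Ty::Int => Val::Int(0),
        Ty::Bool => Val::Bool(false),
        Ty::Bytes => Val::Bytes(vec![]),
    }
}

/// Returns every (src, dest) pair where the predicate and the cast disagree.
fn mismatches() -> Vec<(Ty, Ty)> {
    let types = [Ty::Int, Ty::Bool, Ty::Bytes];
    let mut bad = Vec::new();
    for &dest in &types {
        for &src in &types {
            let ok = run_cast(&default_value(src), dest).is_ok();
            if ok != can_cast(src, dest) {
                bad.push((src, dest));
            }
        }
    }
    bad
}
```

Collecting mismatches into a list, rather than asserting inside the loop, is a design choice of this sketch: it reports all disagreeing pairs in one run.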
1 change: 1 addition & 0 deletions src/query/functions/tests/it/scalars/mod.rs
@@ -41,6 +41,7 @@ mod datetime;
mod geo;
// NOTE:(everpcpc) result different on macos
// TODO: fix this in running on linux
mod can_cast;
#[cfg(not(target_os = "macos"))]
mod geo_h3;
mod geometry;
2 changes: 0 additions & 2 deletions src/query/storages/common/stage/Cargo.toml
@@ -7,8 +7,6 @@ publish = { workspace = true }
edition = { workspace = true }

[dependencies]
arrow-cast = { workspace = true }
arrow-schema = { workspace = true, features = ["serde"] }
databend-common-catalog = { workspace = true }
databend-common-exception = { workspace = true }
databend-common-expression = { workspace = true }
40 changes: 19 additions & 21 deletions src/query/storages/common/stage/src/read/columnar/projection.rs
@@ -12,9 +12,8 @@
// See the License for the specific language governing permissions and
// limitations under the License.

use arrow_cast::can_cast_types;
use arrow_schema::Field;
use databend_common_exception::ErrorCode;
use databend_common_expression::type_check::can_cast_to;
use databend_common_expression::type_check::check_cast;
use databend_common_expression::Expr;
use databend_common_expression::RemoteExpr;
@@ -53,28 +52,27 @@ pub fn project_columnar(
display_name: from_field.name().clone(),
};

// find a better way to do check cast
if from_field.data_type == to_field.data_type {
expr
} else if can_cast_types(
Field::from(from_field).data_type(),
Field::from(to_field).data_type(),
) {
check_cast(
None,
false,
expr,
&to_field.data_type().into(),
&BUILTIN_FUNCTIONS,
)?
} else {
return Err(ErrorCode::BadDataValueType(format!(
"fail to load file {}: Cannot cast column {} from {:?} to {:?}",
location,
field_name,
from_field.data_type(),
to_field.data_type()
)));
// note: tuple field names are dropped here; fields are matched by position
if can_cast_to(&from_field.data_type().into(), &to_field.data_type().into()) {
check_cast(
None,
false,
expr,
&to_field.data_type().into(),
&BUILTIN_FUNCTIONS,
)?
} else {
return Err(ErrorCode::BadDataValueType(format!(
"fail to load file {}: Cannot cast column {} from {:?} to {:?}",
location,
field_name,
from_field.data_type(),
to_field.data_type()
)));
}
}
}
None => {
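
Because the diff markers were lost in this hunk, the resulting control flow in `project_columnar` is hard to read. One plausible reading is: identical types pass the column through, castable types are wrapped in a cast, and everything else is rejected before any data is read. A sketch of that flow under those assumptions, with hypothetical toy types (`Ty`, `Expr`, `project_column`, and the file name used in the usage note are invented here, not the databend API):

```rust
// Hypothetical toy types; not the databend API.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Ty {
    Int,
    Str,
    Binary,
}

#[derive(Debug, PartialEq)]
enum Expr {
    Column(usize),
    Cast(Box<Expr>, Ty),
}

/// Simplified stand-in for the type-level pre-check.
fn can_cast(src: Ty, dest: Ty) -> bool {
    src == dest || matches!((dest, src), (Ty::Str, _) | (Ty::Int, Ty::Str))
}

/// Builds the projection expression for one column, or fails early.
fn project_column(
    location: &str,
    name: &str,
    pos: usize,
    from: Ty,
    to: Ty,
) -> Result<Expr, String> {
    let expr = Expr::Column(pos);
    if from == to {
        // Same type: no cast needed.
        Ok(expr)
    } else if can_cast(from, to) {
        // Castable: wrap the column reference in a cast.
        Ok(Expr::Cast(Box::new(expr), to))
    } else {
        Err(format!(
            "fail to load file {location}: Cannot cast column {name} from {from:?} to {to:?}"
        ))
    }
}
```

For example, `project_column("f.parquet", "a", 0, Ty::Binary, Ty::Int)` returns an error in this toy model, while `Ty::Int` to `Ty::Str` yields a cast expression.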
36 changes: 36 additions & 0 deletions tests/sqllogictests/suites/stage/formats/parquet/auto_cast.test
@@ -1,3 +1,39 @@
statement ok
drop stage if exists s1;

statement ok
create stage s1;

statement ok
create or replace table t(a int null);

statement ok
insert into table t values (1), (2);

statement ok
copy into @s1 from t;

statement ok
create or replace table t2(a int not null);

statement ok
copy into t2 from @s1;

statement ok
insert into table t values (null);

statement ok
copy into @s1 from t;

statement error 1006.*fail to auto cast column
copy into t2 from @s1;

statement ok
create or replace table t3(a string null);

statement ok
copy into t3 from @s1;

statement ok
drop table if exists ts;
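
The statements above exercise a case where the type-level check accepts the pair (nullable `int` into `int not null`) and the load only fails once an actual `NULL` is present. A minimal sketch of that behavior, assuming a hypothetical helper `load_into_not_null` with an invented error message (neither is taken from databend):

```rust
/// Hypothetical stand-in for loading a nullable column into a NOT NULL column.
/// The type pair is accepted up front; a failure, if any, only appears
/// while converting the actual values.
fn load_into_not_null(src: &[Option<i32>]) -> Result<Vec<i32>, String> {
    src.iter()
        .copied()
        .enumerate()
        .map(|(row, v)| {
            v.ok_or_else(|| format!("NULL at row {row} cannot be stored in a NOT NULL column"))
        })
        .collect()
}
```

With `[Some(1), Some(2)]` the load succeeds, matching the first `copy into t2`; adding a `None` makes it fail, matching the second.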
