feat: add `arrow_cast` function to support arbitrary arrow types
#5166
Conversation
# SELECT arrow_typeof('1')
# ----
# Utf8
query T
I just uncommented these tests -- I am not sure why they were commented out 🤷
datafusion/sql/src/expr/function.rs
Outdated
}

// Special case arrow_cast
if &name == ARROW_CAST_NAME {
This is the main difference -- if all other function resolution fails, try to resolve it as `arrow_cast` before erroring.
match self.schema_provider.get_function_meta(&name) {
    Some(fm) => {
        let args = self.function_args_to_expr(function.args, schema)?;
if let Some(fm) = self.schema_provider.get_function_meta(&name) {
This change was just to reduce the indent level (as adding support for `arrow_cast` in the existing structure would have added another level).
@@ -2497,7 +2506,7 @@ impl ContextProvider for MockContextProvider {
}

fn get_function_meta(&self, _name: &str) -> Option<Arc<ScalarUDF>> {
    unimplemented!()
    None
otherwise the test context panics before it gets to seeing `arrow_cast`
datafusion/sql/src/expr/cast.rs
Outdated
// do the actual lookup to the appropriate data type
let data_type = parse_data_type(&data_type_string)?;

return arg0.cast_to(&data_type, schema);
the actual implementation simply calls the existing `Expr::Cast`
statement ok
create table foo as select
arrow_cast(to_timestamp('2020-01-02 01:01:11.1234567890Z'), 'Timestamp(Second, None)') as col_ts_s,
This is kind of cool -- it shows how to convert to arbitrary Arrow timestamp types.
I wonder if we should deprecate / remove `to_timestamp_millis`, `to_timestamp_nanos`, ... 🤔 Any thoughts @waitingkuo ?
I wonder if `to_timestamp_micros` with two parameters (as mentioned in #5398) is needed
I think to_timestamp with two parameters (specifically with user defined formatting) is still a feature gap after this PR.
What I was thinking is that the `to_timestamp_millis`, `to_timestamp_secs`, etc. variants, whose only difference is the output arrow timestamp precision (`TimeUnit`), are not needed, as `arrow_cast` can be used to convert between those precisions.
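For illustration, a precision change that would otherwise need a dedicated `to_timestamp_millis`-style function can be written with `arrow_cast`. This is a hypothetical example following the pattern in this PR's own tests:

```sql
-- convert the default nanosecond output of to_timestamp to millisecond precision
SELECT arrow_cast(to_timestamp('2020-01-02 01:01:11.1234567890Z'), 'Timestamp(Millisecond, None)');
```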
## Supported Arrow Types

The following types are supported by the `arrow_typeof` function:
I am pretty stoked about the ability to cast to types that have no direct SQL mapping (for IOx it is dictionary types), but the different timestamp and duration types are also quite relevant, I think.
arg0.cast_to(&data_type, schema)
}

/// Parses `str` into a `DataType`.
Here is the main entrypoint. This uses a parser very much like the one in https://github.com/sqlparser-rs/sqlparser-rs because I was familiar with that structure and I figured that other contributors to DataFusion might be too.
I am not sure what reviewers think of proposing putting this in the arrow-rs crate -- it feels like a better fit there to me
filed apache/arrow-rs#3821
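To sketch the idea, here is a minimal, self-contained toy version of this kind of hand-written type-string parser (not the PR's actual code; `ToyType` and `parse_toy_type` are hypothetical names, and the real `parse_data_type` handles many more types and the timezone argument):

```rust
// Toy stand-in for arrow's DataType, for illustration only
#[derive(Debug, PartialEq)]
enum ToyType {
    Int32,
    Utf8,
    // Only the time unit is kept; the real parser also parses the timezone
    Timestamp(String),
}

// Parse a type string such as "Int32" or "Timestamp(Nanosecond, None)"
fn parse_toy_type(s: &str) -> Result<ToyType, String> {
    let s = s.trim();
    match s {
        "Int32" => Ok(ToyType::Int32),
        "Utf8" => Ok(ToyType::Utf8),
        _ if s.starts_with("Timestamp(") && s.ends_with(')') => {
            // take the text between the parentheses, first comma-separated field
            let inner = &s["Timestamp(".len()..s.len() - 1];
            let unit = inner
                .split(',')
                .next()
                .ok_or_else(|| "missing time unit".to_string())?
                .trim()
                .to_string();
            Ok(ToyType::Timestamp(unit))
        }
        other => Err(format!("Unsupported type '{other}'")),
    }
}

fn main() {
    assert_eq!(parse_toy_type("Int32"), Ok(ToyType::Int32));
    assert_eq!(
        parse_toy_type("Timestamp(Nanosecond, None)"),
        Ok(ToyType::Timestamp("Nanosecond".to_string()))
    );
    assert!(parse_toy_type("List(Int32)").is_err());
    println!("ok");
}
```

The real implementation uses a tokenizer plus a recursive parser instead of string slicing, which scales to nested types, but the error-on-unknown-name shape is the same.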
),
),
),
// TODO support more structured types (List, LargeList, Struct, Union, Map, RunEndEncoded, etc)
List, Struct, etc. are not important for my use cases (yet) so I didn't add support for them yet, but I think all the necessary patterns are present. Someone who cares just needs to implement them.
This PR is now ready for review
hi @alamb thank you, this feature is great.
I'll be happy to implement the timezone parsing after this PR is merged.
Co-authored-by: Wei-Ting Kuo <waitingkuo0527@gmail.com>
(I would like to use it to write tests)
Great job @alamb
arrow_typeof(col_u64),
arrow_typeof(col_f32),
arrow_typeof(col_f64)
FROM foo;
nit: you can remove `from foo`
When I did that I got the error:
Error: query failed: DataFusion error: Schema error: No field named 'col_i8'.
[SQL] SELECT
arrow_typeof(col_i8),
arrow_typeof(col_i16),
arrow_typeof(col_i32),
arrow_typeof(col_i64),
arrow_typeof(col_u8),
arrow_typeof(col_u16),
arrow_typeof(col_u32),
arrow_typeof(col_u64),
arrow_typeof(col_f32),
arrow_typeof(col_f64);
The point of this test was that the values inserted into the `foo` table do indeed have the proper types. I will add some comments to make this clearer.
You can cast a SQL expression to a specific Arrow type using the `arrow_cast` function.
For example, to cast the output of `now()` to a `Timestamp` with second precision:
What about `select arrow_cast(now(), 'Timestamp(Second, Some("+00:00"))')`?
| `Timestamp(Second, None)` |
| `Timestamp(Millisecond, None)` |
| `Timestamp(Microsecond, None)` |
| `Timestamp(Nanosecond, None)` |
No support for timestamptz?
Sadly, it wasn't important for my initial use case -- @waitingkuo has nicely offered to implement it (https://github.com/apache/arrow-datafusion/pull/5166/files#r1125621584) as a follow-on. I will file appropriate tickets.
query TTTT
SELECT
arrow_typeof(col_ts_s),
It would also be great to have tests like `select arrow_typeof(arrow_cast(.....))`
Added in a4f2753
Just some relatively minor comments, LGTM
/// parse the characters in val starting at pos, until the next
/// `,`, `(`, or `)` or end of line
fn parse_word(&mut self) -> Result<Token> {
Do we need to copy here? Could we instead use something like `str::split_once` perhaps?
I will try this
I spent some time messing around with using `split_once` -- it was better to avoid copying, but I think it made the code quite ugly. I eventually went with a compromise approach of reusing the `String` in 84901bf, which saves an allocation on each token.
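The "reuse a `String` buffer" compromise can be sketched like this (a hypothetical stand-alone version, not the PR's actual `Tokenizer`): read characters until the next `,`, `(`, or `)`, pushing into one buffer that is cleared between tokens.

```rust
use std::iter::Peekable;
use std::str::Chars;

struct Tokenizer<'a> {
    chars: Peekable<Chars<'a>>,
    word: String, // reused buffer: cleared, not reallocated, per token
}

impl<'a> Tokenizer<'a> {
    fn new(val: &'a str) -> Self {
        Self { chars: val.chars().peekable(), word: String::new() }
    }

    /// Return the next token, or None at end of input. Delimiters come
    /// back as single-character tokens; words stop at a delimiter or space.
    fn next_token(&mut self) -> Option<String> {
        // skip leading whitespace
        while matches!(self.chars.peek(), Some(c) if c.is_whitespace()) {
            self.chars.next();
        }
        let first = self.chars.next()?;
        if matches!(first, ',' | '(' | ')') {
            return Some(first.to_string());
        }
        self.word.clear();
        self.word.push(first);
        while let Some(&c) = self.chars.peek() {
            if matches!(c, ',' | '(' | ')') || c.is_whitespace() {
                break;
            }
            self.word.push(c);
            self.chars.next();
        }
        // clone here for simplicity; the real compromise hands the String
        // to the Token to avoid even this copy
        Some(self.word.clone())
    }
}

fn main() {
    let mut t = Tokenizer::new("Timestamp(Nanosecond, None)");
    let mut tokens = vec![];
    while let Some(tok) = t.next_token() {
        tokens.push(tok);
    }
    assert_eq!(tokens, ["Timestamp", "(", "Nanosecond", ",", "None", ")"]);
    println!("{tokens:?}");
}
```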
/// * Token::Rparen,
struct Tokenizer<'a> {
    val: &'a str,
    chars: Peekable<Chars<'a>>,
Unless I am mistaken, all tokens are ASCII, and so I think this could use bytes directly without needing to worry about UTF-8 shenanigans...
I may be able to use bytes directly, though I found it challenging to create a `str` from those bytes without revalidating each time that the data is valid UTF-8. Maybe using bytes directly would be a good performance optimization in the future 🤔 Given this function is called once per call to `arrow_cast`, I don't think it is performance critical (yet).
/// assert_eq!(data_type, DataType::Int32);
/// ```
///
/// TODO file a ticket about bringing this into arrow possibly
I think having FromStr and Display implementations for DataType would be very compelling 👍
/// convert data_type to a string, and then parse it as a type
/// verifying it is the same
fn round_trip(data_type: DataType) {
    let data_type_string = data_type.to_string();
I like that this is round-tripping the Display output 👍
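The round-trip property under discussion (format with `Display`, parse the string back, check equality) can be shown with a toy type; the names below are hypothetical and not the arrow `DataType` API:

```rust
use std::fmt;

#[derive(Debug, PartialEq, Clone)]
enum ToyType {
    Int32,
    Utf8,
}

// Display produces the canonical string form
impl fmt::Display for ToyType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ToyType::Int32 => write!(f, "Int32"),
            ToyType::Utf8 => write!(f, "Utf8"),
        }
    }
}

// The parser is the inverse of Display
fn parse(s: &str) -> Result<ToyType, String> {
    match s {
        "Int32" => Ok(ToyType::Int32),
        "Utf8" => Ok(ToyType::Utf8),
        other => Err(format!("unknown type '{other}'")),
    }
}

// The property: parse(to_string(t)) == t for every supported type
fn round_trip(t: ToyType) {
    let s = t.to_string();
    assert_eq!(parse(&s).unwrap(), t, "round trip failed for '{s}'");
}

fn main() {
    round_trip(ToyType::Int32);
    round_trip(ToyType::Utf8);
    println!("ok");
}
```

Testing against the `Display` output keeps the parser from drifting out of sync with the formatter as new types are added.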
Just did some quick testing and noticed a minor quirk:

❯ select arrow_cast(to_timestamp('2020-01-02 01:01:11.1234567890Z'), ' Timestamp (Nanosecond, None )');
+------------------------------------------------------+
| totimestamp(Utf8("2020-01-02 01:01:11.1234567890Z")) |
+------------------------------------------------------+
| 2020-01-02T01:01:11.123456789                        |
+------------------------------------------------------+
1 row in set. Query took 0.003 seconds.

❯ select arrow_cast(to_timestamp('2020-01-02 01:01:11.1234567890Z'), ' Timestamp (Nanosecond, None ) ');
Error during planning: Unsupported type ' Timestamp (Nanosecond, None ) '. Must be a supported arrow type name such as 'Int32' or 'Timestamp(Nanosecond, None)'. Error checking trailing content after parsing 'Timestamp(Nanosecond, None)'

The parser is very tolerant of whitespace in the constant string containing the type unless the whitespace is at the very end; unsure if this is intended.
Benchmark runs are scheduled for baseline = 84530a2 and contender = e46924d. e46924d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #5016
Rationale for this change
This function is important to be able to test things such as `DictionaryArray` via SQL (datafusion-cli as well as sqllogictests). It also will help control output into parquet files more precisely, for example. See #5016 for more details.
What changes are included in this PR?

- `arrow_cast` in sql planner
- a parser for `DataType` (opposite of `data_type.to_string()`)

Are these changes tested?
Yes
Are there any user-facing changes?

A new `arrow_cast` function.