
feat: add arrow_cast function to support arbitrary arrow types #5166

Merged (13 commits) on Mar 8, 2023

Conversation

alamb (Contributor) commented Feb 2, 2023:

Which issue does this PR close?

Closes #5016

Rationale for this change

This function is important for testing things such as DictionaryArray via SQL (datafusion-cli as well as sqllogictests). It will also help control output into parquet files more precisely, for example.

See #5016 for more details

What changes are included in this PR?

  • Add special handling for arrow_cast in sql planner
  • code to convert from a string to a DataType (the opposite of data_type.to_string())
  • sqllogictest-based tests
  • Documentation of the new function

Are these changes tested?

Yes

Are there any user-facing changes?

A new `arrow_cast` function.

@github-actions bot added the core (Core DataFusion crate), sql (SQL Planner), and sqllogictest (SQL Logic Tests (.slt)) labels on Feb 2, 2023
# SELECT arrow_typeof('1')
# ----
# Utf8
query T
alamb (author):

I just uncommented these tests -- I am not sure why they were commented out 🤷

}

// Special case arrow_cast
if &name == ARROW_CAST_NAME {
alamb (author):

This is the main difference -- if all other function resolution fails, try to resolve it as arrow_cast before erroring

match self.schema_provider.get_function_meta(&name) {
Some(fm) => {
let args = self.function_args_to_expr(function.args, schema)?;
if let Some(fm) = self.schema_provider.get_function_meta(&name) {
alamb (author):

This change was just to reduce the indent level (as adding support for arrow_cast in the existing structure would have added another level)

@@ -2497,7 +2506,7 @@ impl ContextProvider for MockContextProvider {
}

fn get_function_meta(&self, _name: &str) -> Option<Arc<ScalarUDF>> {
unimplemented!()
None
alamb (author):

Otherwise the test context panics before it gets to see arrow_cast.

// do the actual lookup to the appropriate data type
let data_type = parse_data_type(&data_type_string)?;

return arg0.cast_to(&data_type, schema);
alamb (author):

the actual implementation simply calls the existing Expr::Cast
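To make the flow concrete, here is a hypothetical, self-contained toy (none of these names or types are DataFusion's actual API) in which arrow_cast parses its type-string argument and wraps the first argument in a plain cast expression, mirroring the "parse the string, then call the existing cast" shape described above:

```rust
// Toy sketch only: stand-ins for DataFusion's Expr and parse_data_type.
#[derive(Debug, PartialEq, Clone)]
enum Expr {
    Column(String),
    Cast { expr: Box<Expr>, to: String },
}

// Stand-in for parse_data_type: accepts just a couple of type names.
fn parse_data_type(s: &str) -> Result<String, String> {
    match s {
        "Int32" | "Timestamp(Second, None)" => Ok(s.to_string()),
        other => Err(format!("Unsupported type '{other}'")),
    }
}

// arrow_cast takes exactly two arguments: the value and a type string.
// It lowers to an ordinary cast once the type string has been parsed.
fn arrow_cast(arg0: Expr, type_str: &str) -> Result<Expr, String> {
    let to = parse_data_type(type_str)?;
    Ok(Expr::Cast { expr: Box::new(arg0), to })
}

fn main() {
    let e = arrow_cast(Expr::Column("c".into()), "Int32").unwrap();
    assert!(matches!(e, Expr::Cast { .. }));
    assert!(arrow_cast(Expr::Column("c".into()), "NotAType").is_err());
    println!("ok");
}
```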


statement ok
create table foo as select
arrow_cast(to_timestamp('2020-01-02 01:01:11.1234567890Z'), 'Timestamp(Second, None)') as col_ts_s,
alamb (author):

This is kind of cool -- it shows how to convert to arbitrary Arrow timestamp types.

I wonder if we should deprecate / remove to_timestamp_millis, to_timestamp_nanos, ... 🤔 Any thoughts @waitingkuo ?

Contributor:

this is great! thank you @alamb

I wonder if to_timestamp_micros with two parameters (as mentioned in #5398) is needed. If not, I think it's great timing to deprecate them and align the current to_timestamp to PostgreSQL's (1. return type 2. accept a 2nd argument).

alamb (author):

i wonder if to_timestamp_micros with two parameters (as mentioned in #5398) is needed

I think to_timestamp with two parameters (specifically with user defined formatting) is still a feature gap after this PR.

What I was thinking is that the to_timestamp_millis, to_timestamp_seconds, etc. variants, whose only difference is the output arrow timestamp precision (TimeUnit), are not needed, as arrow_cast can be used to convert between those precisions.


## Supported Arrow Types

The following types are supported by the `arrow_typeof` function:
alamb (author):

I am pretty stoked about the ability to cast to types that have no direct SQL mapping (for IOx it is dictionary types) but the different timestamp and durations are also quite relevant I think

arg0.cast_to(&data_type, schema)
}

/// Parses `str` into a `DataType`.
alamb (author):

Here is the main entrypoint. This uses a Parser very much like the one in https://github.com/sqlparser-rs/sqlparser-rs because I was familiar with that structure and I figured that other contributors to DataFusion might be too.

I am not sure what reviewers think of putting this in the arrow-rs crate -- it feels like a better fit there to me.
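For illustration, a minimal self-contained sketch of the tokenizer style described above (the names here are illustrative only; the real code in datafusion/sql/src/expr/arrow_cast.rs is more complete and differs in detail):

```rust
// Toy tokenizer for type strings like "Timestamp(Second, None)".
#[derive(Debug, PartialEq)]
enum Token {
    Word(String),
    LParen,
    RParen,
    Comma,
}

fn tokenize(val: &str) -> Vec<Token> {
    let mut chars = val.chars().peekable();
    let mut tokens = Vec::new();
    while let Some(&c) = chars.peek() {
        match c {
            '(' => { chars.next(); tokens.push(Token::LParen); }
            ')' => { chars.next(); tokens.push(Token::RParen); }
            ',' => { chars.next(); tokens.push(Token::Comma); }
            c if c.is_whitespace() => { chars.next(); } // skip whitespace between tokens
            _ => {
                // parse_word: consume until the next delimiter or end of input
                let mut word = String::new();
                while let Some(&c) = chars.peek() {
                    if matches!(c, '(' | ')' | ',') || c.is_whitespace() { break; }
                    word.push(c);
                    chars.next();
                }
                tokens.push(Token::Word(word));
            }
        }
    }
    tokens
}

fn main() {
    let tokens = tokenize("Timestamp(Second, None)");
    assert_eq!(
        tokens,
        vec![
            Token::Word("Timestamp".into()),
            Token::LParen,
            Token::Word("Second".into()),
            Token::Comma,
            Token::Word("None".into()),
            Token::RParen,
        ]
    );
    println!("ok");
}
```

A recursive-descent parser then walks this token stream, recursing into the parenthesized arguments for parameterized types like Timestamp.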


),
),
),
// TODO support more structured types (List, LargeList, Struct, Union, Map, RunEndEncoded, etc)
alamb (author):

List, Struct, etc. are not important for my use cases (yet), so I didn't add support for them yet, but I think all the necessary patterns are present. Someone who cares just needs to implement them.

alamb commented Mar 4, 2023:

This PR is now ready for review

waitingkuo (Contributor) left a comment:

hi @alamb thank you, this feature is great.

I'll be happy to implement the timezone parsing after this PR is merged.

(four review comments on datafusion/sql/src/expr/arrow_cast.rs, now outdated and resolved)
alamb and others added 2 commits March 5, 2023 06:56
Co-authored-by: Wei-Ting Kuo <waitingkuo0527@gmail.com>
Co-authored-by: Wei-Ting Kuo <waitingkuo0527@gmail.com>
alamb commented Mar 7, 2023:

@comphead @Jefffrey or @ygf11 might one of you have time to help review this pull request?

alamb commented Mar 7, 2023:

(I would like to use it to write tests)

comphead (Contributor) left a comment:

Great job @alamb

arrow_typeof(col_u64),
arrow_typeof(col_f32),
arrow_typeof(col_f64)
FROM foo;
Reviewer:

nit: you can remove from foo

alamb (author):

When I did that I got the error:

  arrow_typeof(col_f64);"
Error: query failed: DataFusion error: Schema error: No field named 'col_i8'.
[SQL] SELECT
  arrow_typeof(col_i8),
  arrow_typeof(col_i16),
  arrow_typeof(col_i32),
  arrow_typeof(col_i64),
  arrow_typeof(col_u8),
  arrow_typeof(col_u16),
  arrow_typeof(col_u32),
  arrow_typeof(col_u64),
  arrow_typeof(col_f32),
  arrow_typeof(col_f64);

The point of this test was that the values inserted into the foo table do indeed have the proper types. I will add some comments to make this clearer.

You can cast a SQL expression to a specific Arrow type using the `arrow_cast` function.
For example, to cast the output of `now()` to a `Timestamp` with second precision rather than the default nanosecond precision:

Reviewer:
what if select arrow_cast(now(), 'Timestamp(Second, Some("+00:00"))')

| `Timestamp(Second, None)` |
| `Timestamp(Millisecond, None)` |
| `Timestamp(Microsecond, None)` |
| `Timestamp(Nanosecond, None)` |
Reviewer:

No support for timestamptz?

alamb (author):

Sadly, it wasn't important for my initial use case -- @waitingkuo has nicely offered to implement it (https://github.com/apache/arrow-datafusion/pull/5166/files#r1125621584) as a follow-on. I will file appropriate tickets.


query TTTT
SELECT
arrow_typeof(col_ts_s),
Reviewer:

It would also be great to have tests like select arrow_typeof(arrow_cast(...))

alamb (author):

Added in a4f2753

tustvold (Contributor) left a comment:

Just some relatively minor comments, LGTM


/// parse the characters in val starting at pos, until the next
/// `,`, `(`, or `)` or end of line
fn parse_word(&mut self) -> Result<Token> {
Reviewer:

Do we need to copy here, could we instead use something like str::split_once perhaps?

alamb (author):

I will try this

alamb (author):

I spent some time messing around with using split_once -- it was better at avoiding copying, but I think it made the code quite ugly.

I eventually went with a compromise approach of reusing the String in 84901bf which saves an allocation on each token.
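One possible shape of that compromise, sketched with hypothetical names (this is not the actual code in 84901bf): keep a single String on the tokenizer, fill it per word, and hand it off with std::mem::take so the buffer is moved into the token rather than copied:

```rust
use std::iter::Peekable;

// Illustrative scanner that owns one reusable word buffer.
struct Scanner {
    word: String,
}

impl Scanner {
    fn scan_word<I: Iterator<Item = char>>(&mut self, chars: &mut Peekable<I>) -> String {
        self.word.clear();
        while let Some(&c) = chars.peek() {
            if matches!(c, '(' | ')' | ',') || c.is_whitespace() {
                break;
            }
            self.word.push(c);
            chars.next();
        }
        // Move the buffer out; the scanner is left holding a fresh empty String.
        std::mem::take(&mut self.word)
    }
}

fn main() {
    let mut s = Scanner { word: String::new() };
    let mut it = "Second, None".chars().peekable();
    assert_eq!(s.scan_word(&mut it), "Second");
    println!("ok");
}
```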

/// * Token::Rparen,
struct Tokenizer<'a> {
val: &'a str,
chars: Peekable<Chars<'a>>,
Reviewer:

Unless I am mistaken, all tokens are ASCII, and so I think this could use bytes directly without needing to worry about UTF-8 shenanigans...

alamb (author):

I may be able to use bytes directly, though then I found it challenging to create a str from those bytes without re-validating each time that the data is UTF-8. Maybe using bytes directly would be a good performance optimization in the future 🤔

Given this function is called once per call to arrow_cast I don't think it is performance critical (yet)
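For reference, a hypothetical sketch of the byte-based alternative being discussed (not code from this PR): because every delimiter is ASCII, one can scan bytes and slice the original &str, avoiding both copying and UTF-8 re-validation, since slicing at an ASCII delimiter position is always a valid char boundary:

```rust
// Illustrative only: find the first word in a type string by scanning bytes.
fn first_word(val: &str) -> &str {
    let bytes = val.as_bytes();
    let mut end = 0;
    while end < bytes.len() {
        match bytes[end] {
            // All delimiters are single ASCII bytes.
            b'(' | b')' | b',' | b' ' => break,
            _ => end += 1,
        }
    }
    // `end` is either the input length or an ASCII delimiter position,
    // so this slice is always on a valid char boundary and cannot panic.
    &val[..end]
}

fn main() {
    assert_eq!(first_word("Timestamp(Second, None)"), "Timestamp");
    assert_eq!(first_word("Int32"), "Int32");
    println!("ok");
}
```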

/// assert_eq!(data_type, DataType::Int32);
/// ```
///
/// TODO file a ticket about bringing this into arrow possibly
Reviewer:

I think having FromStr and Display implementations for DataType would be very compelling 👍


/// convert data_type to a string, and then parse it as a type
/// verifying it is the same
fn round_trip(data_type: DataType) {
let data_type_string = data_type.to_string();
Reviewer:

I like that this is round-tripping the Display output 👍

Jefffrey (Contributor) commented Mar 8, 2023:

Just did some quick testing, noticed a minor quirk:

select arrow_cast(to_timestamp('2020-01-02 01:01:11.1234567890Z'), '          Timestamp        (Nanosecond,      None               )');
+------------------------------------------------------+
| totimestamp(Utf8("2020-01-02 01:01:11.1234567890Z")) |
+------------------------------------------------------+
| 2020-01-02T01:01:11.123456789                        |
+------------------------------------------------------+
1 row in set. Query took 0.003 seconds.
❯ select arrow_cast(to_timestamp('2020-01-02 01:01:11.1234567890Z'), '          Timestamp        (Nanosecond,      None               ) ');
Error during planning: Unsupported type '          Timestamp        (Nanosecond,      None               ) '. Must be a supported arrow type name such as 'Int32' or 'Timestamp(Nanosecond, None)'. Error checking trailing content after parsing 'Timestamp(Nanosecond, None)'

The parser is very tolerant of whitespace in the constant string containing the type unless the whitespace is at the very end, unsure if this is intended
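One hypothetical shape for such a trailing-content check (this is a sketch, not the actual fix in ff5d72b): after parsing a complete type, require that whatever remains in the input is only whitespace:

```rust
// Illustrative only: accept trailing whitespace but reject trailing garbage.
fn check_trailing(input: &str, parsed_len: usize) -> Result<(), String> {
    let rest = &input[parsed_len..];
    if rest.trim().is_empty() {
        Ok(())
    } else {
        Err(format!(
            "Error checking trailing content after parsing: '{rest}'"
        ))
    }
}

fn main() {
    let s = "Timestamp(Nanosecond, None)   ";
    // Pretend the parser consumed everything up to and including ')'.
    let parsed_len = s.find(')').unwrap() + 1;
    assert!(check_trailing(s, parsed_len).is_ok());
    assert!(check_trailing("Int32 x", 5).is_err());
    println!("ok");
}
```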

alamb commented Mar 8, 2023:

Just did some quick testing, noticed a minor quirk:

Great catch. Thank you @Jefffrey -- It was not intended. Fixed in ff5d72b

@alamb alamb merged commit e46924d into apache:main Mar 8, 2023
ursabot commented Mar 8, 2023:

Benchmark runs are scheduled for baseline = 84530a2 and contender = e46924d. e46924d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Labels
core (Core DataFusion crate), enhancement (New feature or request), sql (SQL Planner), sqllogictest (SQL Logic Tests (.slt))
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add casting from arbitrary arrow types
7 participants