Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can not create a TimestampNanosecondArray that has a specified timezone #597

Closed
Tracked by #3148
alamb opened this issue Jul 22, 2021 · 12 comments
Closed
Tracked by #3148

Can not create a TimestampNanosecondArray that has a specified timezone #597

alamb opened this issue Jul 22, 2021 · 12 comments
Labels
bug enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Jul 22, 2021

Describe the bug
I can not create a PrimitiveArray that has a DataType::Timestamp with a timezone other than None

To Reproduce
I am trying to pretty print an array that has timestamps using UTC time.

So instead of 1970-01-01 00:00:00.000000100 I want to print something more like the following (note the Z, in RFC3339): 1970-01-01T00:00:00.000000100Z

To do so I figured I would "simply" create a new array that had a Timestamp type with the timezone set to UTC

TimestampNanosecondArray is defined like this:

pub type TimestampNanosecondArray = PrimitiveArray<TimestampNanosecondType>;

And TimestampNanosecondType is (effectively) something like

struct TimestampNanosecondType {}
impl ArrowPrimitiveType for TimestampNanosecondType {
  type Native = i64;
  const DATA_TYPE = DataType::Timestamp(TimeUnit::Nanosecond, None)
}

So I figured I would make a PrimitiveArray<SOME_TYPE_THAT_HAD_MY_TIMEZONE_SPECIFIED>

struct TimestampUtcNanosecondType {}
impl ArrowPrimitiveType for TimestampUtcNanosecondType {
    type Native = i64;
    const DATA_TYPE: DataType = DataType::Timestamp(TimeUnit::Nanosecond, Some("+00:00".to_string()));
}

Uhoh! The compiler doesn't like that because it needs a const String which you can't have....

error[E0015]: calls in constants are limited to constant functions, tuple structs and tuple variants
  --> src/main.rs:38:80
   |
38 |     const DATA_TYPE: DataType = DataType::Timestamp(TimeUnit::Nanosecond, Some("+00:00".to_string()));
   |                                                                                ^^^^^^^^^^^^^^^^^^^^

So I conclude it is basically impossible to create an array with a timezone other than None

Expected behavior
I expect to be able to create an array using a timezone

Proposal

Proposal:

Change DataType::Timestamp from

Timestamp(TimeUnit, Option<String>),

to

Timestamp(TimeUnit, Option<&static str>),

Additional context
FWIW I also tried using lazy_static but it doesn't provide a const String (only a static String)

I view this as a potential first step towards handling timestamps properly in arrow and datafusion: apache/datafusion#686

@alamb alamb added bug enhancement Any new improvement worthy of a entry in the changelog labels Jul 22, 2021
@alamb
Copy link
Contributor Author

alamb commented Jul 22, 2021

@jorgecarleitao / @Dandandan / @GregBowyer / @velvia can you think of something I have missed here? Is it a reasonable proposal to switch to using &static string for the timezone specifier?

@rok
Copy link
Member

rok commented Jul 22, 2021

C++ uses a string but I'm not familiar with rust implementation at all.

@velvia
Copy link
Contributor

velvia commented Jul 22, 2021

@alamb there is actually a way to create primitive timezone arrays in other timezones. I have some code which I'll be PR'ing against both Arrow and DataFusion which does this, but includes other things we'll need. However, something like this:

let data = vec![Some(.....), Some(....)];
let array_utc = TimestampNanosecondArray::from_opt_vec(data, Some("UTC".to_owned()));

You can use from_vec as well and it works too. I found other ways of creating timezone-based arrays as well.

You don't need to create a type which has non-None timezone (yes I'm aware of that limitation and was going to point it out too). The key is that the underlying ArrayData has the correct, actual timezone, and this is what is returned by the data_type() calls and checked dynamically. There is an inconsistency there though, which I agree with, and solving that needs more discussion.

Regarding switching to &str, some thoughts:

  • Adding a ref/pointer will require adding lifetimes to type annotations, which would be really annoying and a huge global change
  • I actually think if we were to change the underlying timezone data type, it should be a numeric offset, not a string. Strings take up a large amount of space, at least 24 bytes on x86 + the actual storage of the string, and are slow; furthermore every time we need to actually translate it to the offset, which is more likely what we care about for computations. I'd prefer some kind of TimeOffset which could simply be Option<i32> or something like that - that's only 5 bytes. :)

@velvia
Copy link
Contributor

velvia commented Jul 22, 2021

@alamb PRs/issues which are related to this and other timezone issues and I plan to submit soon-ish:

  • Adding proper timezone support for ScalarValue, without which many things in SQL with timezones just don't work
  • Having arrays with timezones be properly propagated through various kernels. Simple fixes but really important
  • Having Debug actually print the actual timezone, not the None implied by the type, for timestamp arrays

@rok
Copy link
Member

rok commented Jul 23, 2021

@velvia

  • I actually think if we were to change the underlying timezone data type, it should be a numeric offset, not a string. Strings take up a large amount of space, at least 24 bytes on x86 + the actual storage of the string, and are slow; furthermore every time we need to actually translate it to the offset, which is more likely what we care about for computations. I'd prefer some kind of TimeOffset which could simply be Option<i32> or something like that - that's only 5 bytes. :)

Offsets are much more practical indeed but don't cover things like daylight savings changes and human readable timezones (e.g. EST, PST, CET). But I don't know what your scope is and offsets might be enough.

Chrono-tz seems to implement a timezone database in case you decide to look into that direction.

@velvia
Copy link
Contributor

velvia commented Jul 23, 2021

@rok

Offsets are much more practical indeed but don't cover things like daylight savings changes and human readable timezones (e.g. EST, PST, CET). But I don't know what your scope is and offsets might be enough.

The scope is an internal, compact representation to 1) facilitate easy and fast computation, 2) minimize memory and storage costs, 3) minimize network serialization overhead in Ballista.

It should be easily convertible to string representation. It is true that numeric offsets would lose some information, perhaps there are better representations out there, like an enum from chronos or something.

At the same time, this is also a deep rabbit hole and trying to be completely accurate might also not be practical....

@alamb
Copy link
Contributor Author

alamb commented Jul 23, 2021

@velvia I suggest we ensure the Rust implementation can interoperate with the timezone / time definition (e.g. use a string as defined here: https://github.com/apache/arrow-rs/blob/master/format/Schema.fbs#L226-L246) That string representation also allows for numeric offsets like "+01:00" or named timezones).

If str manipulation is too slow, perhaps we can use an enum like

enum Timezone {
  /// UTC offset, in minutes (?tbd units)
  Offset(i32)
  /// character with the meaning from arrow definition
  name(&'static str)
}

@alamb there is actually a way to create primitive timezone arrays in other timezones.

That is quite cool -- I did not realize that 👍 It is still somewhat strange because the TimestampNanosecondArray is effectively "typed" to a DataType::Timestamp(TimeUnit::Nanoseconds, None) but the actual instance may have a non-None timestamp, which seems confusing

Adding a ref/pointer will require adding lifetimes to type annotations, which would be really annoying and a huge global change

That is why I proposed using 'static (aka so that we didn't have to plumb through lifetimes everywhere) - but it might be annoying to arrange for static strings

@velvia
Copy link
Contributor

velvia commented Jul 24, 2021

@alamb I agree that enum would cover all existing use cases. Still I'm curious, isn't there any restriction on what that string can be? In reality in order to perform any timezone adjustments, that string should really conform to one of the standard timezone abbreviations or fully qualified timezone names, or offset - otherwise it cannot be useful.

Ideally it would be like an enum for each of the possible timezones in the IANA database, for which chronos-tz has Timezone implementations.

If we leave it as a string, there may be timezones which cannot be acted on.... I guess this is likely to be the case. Having the offset though (or a timezone enum) offers the possibility of faster translation.

Agreed &static would work without lifetimes.

@jorgecarleitao
Copy link
Member

I do not see what is the concern with String. It represent exactly what the spec expects; an owned utf8.

The root cause here seems to be the trait of the generic of the PrimitiveArray, that expects a type with a compiled-defined data type, which only works for small, finite-sized datatypes. Even the decimal array could have been represented as a PrimitiveArray.

Arrow2 solves this by decoupling the generic on the PrimitiveArray from the (logical) DataType, exactly to natively support timestamps with timezones.

I would recommend bridging this gap by refactoring the code to not have the DATA_TYPE as the constant of the trait.

AFAI understand &'static forbids people from declaring an offset based on some transformation. E.g. a field composed by something like "some_garbage_{offset}" can no longer be parsed into an offset without matching for all cases, since the "_offset" will not be 'static.

Having a generic for every possible timezone significantly bloats the binary, as it requires all generics to be re-compiled for every declared timezone variation, and there are a lot of them. The typical problem will be in matching the DataType and having to have one branch per Timezone, which requires a significant number of cases.

Generics are used to support different physical in-memory representations of a struct, not to declare semantics. Semantics based on generics is a recipe for large binaries and/or a large number of matches in downcasting trait objects.

@alamb
Copy link
Contributor Author

alamb commented Jul 24, 2021

Good points @jorgecarleitao and @velvia -- it sounds like my challenge / problem with TimestampNanosecondArray will be solved when we bring in the ideas of arrow2. If it becomes a problem I can look into removing DATA_TYPE as Jorge suggests.

One thing I have noticed that might warrant more thought about something other than String for storing timezones is that the DataType struct is copied around a lot in arrow code. Maybe something more like Arc<str> would be appropriate if we ever want to change the type.

@alamb alamb changed the title Can not create a Timestamp array that has a specified timezone Can not create a TimestampNanosecondArray that has a specified timezone Jul 24, 2021
@velvia
Copy link
Contributor

velvia commented Jul 24, 2021 via email

@tustvold
Copy link
Contributor

I believe https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.with_timezone provides the necessary API.

This is also consistent with how we now handle decimals, where the DATA_TYPE is only the default value, and can be overriden with a later call to https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.with_precision_and_scale

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

5 participants