-
Notifications
You must be signed in to change notification settings - Fork 437
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-675: Specify Interval LogicalType #165
Conversation
I apologise if this is spammy, but I'd like to tag the following to bring it to your attention: @julienledem from working on the initial PR, and well ... This is my first time contributing to parquet-format, so I don't know if I need to generate the Java thrift files, but I can try to do that based on the guidance that I receive. Thanks |
@nevi-me, based on the previous PR it will require some time to agree on it. Starting a discussion in the dev list might also help as a heads up. (I do not have the required experience on the SQL standard and the different engines to have an opinion.) About this PR. We will certainly need an update in LogicalTypes.md as well. |
@nevi-me I don't think we should be imposing arrow's modeling of interval type on parquet. The existing interval type seems reasonable in parquet. I think there are three cases to consider:
|
I do agree that if this is something specific to Arrow, then we shouldn't impose it onto parquet. However, given that #43 was created to non-arrow use-case, it looks like the application of a more flexible interval type would have a broader impact? |
If I read the conclusion from #43 it was to potentially store these as separate columns for day_time. So if we really want to introduce these types I think not merging values together would be the way to go. Also, at least from the Arrow C++ perspective, we don't have great support for the interval types (duration is the preferred type in C++ see discussion on apache/arrow#3644) |
Thanks for the responses, I'll look at the interval vs duration support in more detail in the coming weeks, and rather put together a Google doc for discussion in the Parquet mailing list.
It's reasonable for converted types, but has no equivalent logical type. While this is still fine as we have to write converted types in the metadata for backwards-compatibility, it still means that one has to write a 4-byte value into a 12-byte FLA in order for the converted type to make sense when being read by another application. The Arrow schema is binary, so in the long-run I wouldn't expect a non-Arrow reader to care about it. So, I was trying to approach adding INTERVAL to logical types, so that one could drop the INTERVAL converted type, as it wouldn't be correct to map Logical::INTERVAL to Converted::INTERVAL. I'll research this subject, read and write files with v2 of the writer, and come back with observations. |
Maybe we should resurrect this to accomodate the new COMPLEX arrow type we are working on? I think we also need backwards compatibility with the existing parquet interval type? |
I am working on the Parquet Rust implementation, specifically conversion with Arrow.
One of the outstanding items in the Parquet types is how to deal with interval types.
This PR proposes adding
LogicalType::Interval(IntervalType)
, which is compatible with Arrow. I have only made changes to the thrift file, as I'd like to get feedback on viability, before documenting the behaviour in the LogicalTypes.md.Much of the detail is however below in this message.
The legacy
ConvertedType
hasINTERVAL
, but this interval is ambiguous for Arrow, because Arrow defines:@julienledem initially suggested deprecating
INTERVAL
by replacing it with 2 converted types in #43, but given that Parquet got logical types since then, we could offer a better alternative by using theLogicalType::Interval(IntervalUnit)
alternative.This would either be 32-bit or 64-bit based on the interval unit.
On the 64-bit representation, I'm not opinionated on whether we should use an INT64 or FiXED_LEN_BYTE_ARRAY(8). I suspect though that we initially used FIXED_LEN_BYTE_ARRAY(12) because there's no 96-bit primitive.
Backward Compatibility with ConvertedType
We do not deprecate the
INTERVAL
converted type, as one could always convert the LogicalType value to either the first 4 bytes, or the last 8, depending on the IntervalUnit; and so write only those bytes.We would mark converting from ConvertedType to LogicalType as undefined behaviour, because without any additional information on which of the 12 bytes are populated, readers could lose information (what Rust is currently doing).
implementations that rely on the old behaviour still have the option of populating both converted type and logical type, so they should not populate the logical type in this instance.