PARQUET-675: Specify Interval LogicalType #165

nevi-me · 2021-01-23T10:56:23Z

I am working on the Parquet Rust implementation, specifically conversion with Arrow.
One of the outstanding items in the Parquet types is how to deal with interval types.

This PR proposes adding LogicalType::Interval(IntervalType), which is compatible with Arrow. I have only made changes to the thrift file, as I'd like to get feedback on viability, before documenting the behaviour in the LogicalTypes.md.
Much of the detail is however below in this message.

The legacy ConvertedType has INTERVAL, but this interval is ambiguous for Arrow, because Arrow defines:

Interval::YearMonth: i32 representing the number of elapsed whole months
Interval::DayTime: i64 stored as 2 contiguous 32-bit integers, representing the number of elapsed days and milliseconds, respectively

@julienledem initially suggested deprecating INTERVAL by replacing it with 2 converted types in #43, but given that Parquet got logical types since then, we could offer a better alternative by using the LogicalType::Interval(IntervalUnit)alternative.

This would either be 32-bit or 64-bit based on the interval unit.
On the 64-bit representation, I'm not opinionated on whether we should use an INT64 or FiXED_LEN_BYTE_ARRAY(8). I suspect though that we initially used FIXED_LEN_BYTE_ARRAY(12) because there's no 96-bit primitive.

Backward Compatibility with ConvertedType

We do not deprecate the INTERVAL converted type, as one could always convert the LogicalType value to either the first 4 bytes, or the last 8, depending on the IntervalUnit; and so write only those bytes.

We would mark converting from ConvertedType to LogicalType as undefined behaviour, because without any additional information on which of the 12 bytes are populated, readers could lose information (what Rust is currently doing).

implementations that rely on the old behaviour still have the option of populating both converted type and logical type, so they should not populate the logical type in this instance.

nevi-me · 2021-01-23T11:01:02Z

I apologise if this is spammy, but I'd like to tag the following to bring it to your attention:

@julienledem from working on the initial PR, and well ...
@pitrou @pitrou @emkornfield from the C++ Arrow & Parquet side
@sunchao from the Rust Parquet side

This is my first time contributing to parquet-format, so I don't know if I need to generate the Java thrift files, but I can try to do that based on the guidance that I receive.

Thanks

gszadovszky · 2021-01-25T10:53:15Z

@nevi-me, based on the previous PR it will require some time to agree on it. Starting a discussion in the dev list might also help as a heads up. (I do not have the required experience on the SQL standard and the different engines to have an opinion.)

About this PR. We will certainly need an update in LogicalTypes.md as well. parquet-format has the java code generation included with some additional unit tests so it should be covered automatically. The java code for production is generation inside parquet-mr after the parquet-format version reference is upgraded. Because of that we will need a parquet-format release first.
Currently, I am not sure why the checks are failing. It seems to be a travis issue but I am not sure. You may try a close/re-open on this PR to re-trigger the travis build.

emkornfield · 2021-01-25T20:37:37Z

@nevi-me I don't think we should be imposing arrow's modeling of interval type on parquet. The existing interval type seems reasonable in parquet. I think there are three cases to consider:

Arrow writing to parquet (this is well defined without data loss).
Arrow Reading from a parquet file that was written with arrow. In this case, the additional written arrow schema should be sufficient for decoding.
Reading interval written from another implementation. In this case, I think we probably want a user configurable options in arrow, as there can be a few options:
1. Read as Duration
2. Read as struct of both interval types.
3. Read assuming as of the two interval types and truncate/error out data that doesn't fit in the type.

houqp · 2021-01-25T20:52:19Z

I do agree that if this is something specific to Arrow, then we shouldn't impose it onto parquet. However, given that #43 was created to non-arrow use-case, it looks like the application of a more flexible interval type would have a broader impact?

emkornfield · 2021-01-25T21:09:59Z

If I read the conclusion from #43 it was to potentially store these as separate columns for day_time. So if we really want to introduce these types I think not merging values together would be the way to go. Also, at least from the Arrow C++ perspective, we don't have great support for the interval types (duration is the preferred type in C++ see discussion on apache/arrow#3644)

nevi-me · 2021-01-25T21:44:25Z

Thanks for the responses, I'll look at the interval vs duration support in more detail in the coming weeks, and rather put together a Google doc for discussion in the Parquet mailing list.

The existing interval type seems reasonable in parquet.

It's reasonable for converted types, but has no equivalent logical type. While this is still fine as we have to write converted types in the metadata for backwards-compatibility, it still means that one has to write a 4-byte value into a 12-byte FLA in order for the converted type to make sense when being read by another application.

The Arrow schema is binary, so in the long-run I wouldn't expect a non-Arrow reader to care about it. So, I was trying to approach adding INTERVAL to logical types, so that one could drop the INTERVAL converted type, as it wouldn't be correct to map Logical::INTERVAL to Converted::INTERVAL.

I'll research this subject, read and write files with v2 of the writer, and come back with observations.

emkornfield · 2021-06-16T04:10:50Z

Maybe we should resurrect this to accomodate the new COMPLEX arrow type we are working on? I think we also need backwards compatibility with the existing parquet interval type?

PARQUET-675: Specify Interval LogicalType

8656e31

nevi-me mentioned this pull request Jan 23, 2021

PARQUET-675: Add INTERVAL_YEAR_MONTH and INTERVAL_DAY_TIME types #43

Open

nevi-me closed this Jan 25, 2021

nevi-me reopened this Jan 25, 2021

tustvold mentioned this pull request May 7, 2022

Handling Unsupported Arrow Types in Parquet apache/arrow-rs#1666

Closed

nwt mentioned this pull request Apr 26, 2023

Use pqarrow in zio/parquetio brimdata/super#4547

Merged

nevi-me closed this Jun 8, 2023

jorisvandenbossche mentioned this pull request Jun 12, 2024

PARQUET-2310: implementation status apache/parquet-site#34

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PARQUET-675: Specify Interval LogicalType #165

PARQUET-675: Specify Interval LogicalType #165

nevi-me commented Jan 23, 2021

nevi-me commented Jan 23, 2021

gszadovszky commented Jan 25, 2021

emkornfield commented Jan 25, 2021

houqp commented Jan 25, 2021

emkornfield commented Jan 25, 2021

nevi-me commented Jan 25, 2021 •

edited

Loading

emkornfield commented Jun 16, 2021

PARQUET-675: Specify Interval LogicalType #165

PARQUET-675: Specify Interval LogicalType #165

Conversation

nevi-me commented Jan 23, 2021

Backward Compatibility with ConvertedType

nevi-me commented Jan 23, 2021

gszadovszky commented Jan 25, 2021

emkornfield commented Jan 25, 2021

houqp commented Jan 25, 2021

emkornfield commented Jan 25, 2021

nevi-me commented Jan 25, 2021 • edited Loading

emkornfield commented Jun 16, 2021

nevi-me commented Jan 25, 2021 •

edited

Loading