Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-675: Specify Interval LogicalType #165

Closed
wants to merge 1 commit into from

Conversation

nevi-me
Copy link

@nevi-me nevi-me commented Jan 23, 2021

I am working on the Parquet Rust implementation, specifically conversion with Arrow.
One of the outstanding items in the Parquet types is how to deal with interval types.

This PR proposes adding LogicalType::Interval(IntervalType), which is compatible with Arrow. I have only made changes to the thrift file, as I'd like to get feedback on viability, before documenting the behaviour in the LogicalTypes.md.
Much of the detail is however below in this message.

The legacy ConvertedType has INTERVAL, but this interval is ambiguous for Arrow, because Arrow defines:

  1. Interval::YearMonth: i32 representing the number of elapsed whole months
  2. Interval::DayTime: i64 stored as 2 contiguous 32-bit integers, representing the number of elapsed days and milliseconds, respectively

@julienledem initially suggested deprecating INTERVAL by replacing it with 2 converted types in #43, but given that Parquet got logical types since then, we could offer a better alternative by using the LogicalType::Interval(IntervalUnit)alternative.

This would either be 32-bit or 64-bit based on the interval unit.
On the 64-bit representation, I'm not opinionated on whether we should use an INT64 or FiXED_LEN_BYTE_ARRAY(8). I suspect though that we initially used FIXED_LEN_BYTE_ARRAY(12) because there's no 96-bit primitive.

Backward Compatibility with ConvertedType

We do not deprecate the INTERVAL converted type, as one could always convert the LogicalType value to either the first 4 bytes, or the last 8, depending on the IntervalUnit; and so write only those bytes.

We would mark converting from ConvertedType to LogicalType as undefined behaviour, because without any additional information on which of the 12 bytes are populated, readers could lose information (what Rust is currently doing).

implementations that rely on the old behaviour still have the option of populating both converted type and logical type, so they should not populate the logical type in this instance.

@nevi-me
Copy link
Author

nevi-me commented Jan 23, 2021

I apologise if this is spammy, but I'd like to tag the following to bring it to your attention:

@julienledem from working on the initial PR, and well ...
@pitrou @pitrou @emkornfield from the C++ Arrow & Parquet side
@sunchao from the Rust Parquet side

This is my first time contributing to parquet-format, so I don't know if I need to generate the Java thrift files, but I can try to do that based on the guidance that I receive.

Thanks

@gszadovszky
Copy link
Contributor

@nevi-me, based on the previous PR it will require some time to agree on it. Starting a discussion in the dev list might also help as a heads up. (I do not have the required experience on the SQL standard and the different engines to have an opinion.)

About this PR. We will certainly need an update in LogicalTypes.md as well. parquet-format has the java code generation included with some additional unit tests so it should be covered automatically. The java code for production is generation inside parquet-mr after the parquet-format version reference is upgraded. Because of that we will need a parquet-format release first.
Currently, I am not sure why the checks are failing. It seems to be a travis issue but I am not sure. You may try a close/re-open on this PR to re-trigger the travis build.

@nevi-me nevi-me closed this Jan 25, 2021
@nevi-me nevi-me reopened this Jan 25, 2021
@emkornfield
Copy link
Contributor

@nevi-me I don't think we should be imposing arrow's modeling of interval type on parquet. The existing interval type seems reasonable in parquet. I think there are three cases to consider:

  1. Arrow writing to parquet (this is well defined without data loss).
  2. Arrow Reading from a parquet file that was written with arrow. In this case, the additional written arrow schema should be sufficient for decoding.
  3. Reading interval written from another implementation. In this case, I think we probably want a user configurable options in arrow, as there can be a few options:
    1. Read as Duration
    2. Read as struct of both interval types.
    3. Read assuming as of the two interval types and truncate/error out data that doesn't fit in the type.

@houqp
Copy link
Member

houqp commented Jan 25, 2021

I do agree that if this is something specific to Arrow, then we shouldn't impose it onto parquet. However, given that #43 was created to non-arrow use-case, it looks like the application of a more flexible interval type would have a broader impact?

@emkornfield
Copy link
Contributor

If I read the conclusion from #43 it was to potentially store these as separate columns for day_time. So if we really want to introduce these types I think not merging values together would be the way to go. Also, at least from the Arrow C++ perspective, we don't have great support for the interval types (duration is the preferred type in C++ see discussion on apache/arrow#3644)

@nevi-me
Copy link
Author

nevi-me commented Jan 25, 2021

Thanks for the responses, I'll look at the interval vs duration support in more detail in the coming weeks, and rather put together a Google doc for discussion in the Parquet mailing list.

The existing interval type seems reasonable in parquet.

It's reasonable for converted types, but has no equivalent logical type. While this is still fine as we have to write converted types in the metadata for backwards-compatibility, it still means that one has to write a 4-byte value into a 12-byte FLA in order for the converted type to make sense when being read by another application.

The Arrow schema is binary, so in the long-run I wouldn't expect a non-Arrow reader to care about it. So, I was trying to approach adding INTERVAL to logical types, so that one could drop the INTERVAL converted type, as it wouldn't be correct to map Logical::INTERVAL to Converted::INTERVAL.

I'll research this subject, read and write files with v2 of the writer, and come back with observations.

@emkornfield
Copy link
Contributor

Maybe we should resurrect this to accomodate the new COMPLEX arrow type we are working on? I think we also need backwards compatibility with the existing parquet interval type?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants