Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Parquet][C++][Python] Bump the default format version from 2.4 -> 2.6 #35746

Closed
danepitkin opened this issue May 24, 2023 · 7 comments · Fixed by #36137
Closed

[Parquet][C++][Python] Bump the default format version from 2.4 -> 2.6 #35746

danepitkin opened this issue May 24, 2023 · 7 comments · Fixed by #36137

Comments

@danepitkin
Copy link
Member

Describe the enhancement requested

Parquet format version 2.6 introduces the NanoSecond time unit for Time and Timestamp logical types.

Component(s)

Parquet

@mapleFU
Copy link
Member

mapleFU commented May 25, 2023

I found that our implemention already supports Timestamp:

/// \brief Allowed for physical type INT64.
class PARQUET_EXPORT TimestampLogicalType : public LogicalType {
 public:
  static std::shared_ptr<const LogicalType> Make(bool is_adjusted_to_utc,
                                                 LogicalType::TimeUnit::unit time_unit,
                                                 bool is_from_converted_type = false,
                                                 bool force_set_converted_type = false);
  bool is_adjusted_to_utc() const;
  LogicalType::TimeUnit::unit time_unit() const;

  /// \brief If true, will not set LogicalType in Thrift metadata
  bool is_from_converted_type() const;

  /// \brief If true, will set ConvertedType for micros and millis
  /// resolution in legacy ConvertedType Thrift metadata
  bool force_set_converted_type() const;

 private:
  TimestampLogicalType() = default;
};

And timeunit has NANO

  struct TimeUnit {
    enum unit { UNKNOWN = 0, MILLIS = 1, MICROS, NANOS };
  };

Do we already supports them?

/cc @wgtmac

@mapleFU
Copy link
Member

mapleFU commented May 25, 2023

I've find out that, C++ implements parquet 2.6, and need the flag ParquetVersion::PARQUET_2_6. I think just enabling this flag in Python is ok

@jorisvandenbossche jorisvandenbossche changed the title [Parquet] Bump the format version from 2.4 -> 2.6 [Parquet][C++][Python] Bump the default format version from 2.4 -> 2.6 May 25, 2023
@jorisvandenbossche
Copy link
Member

Yes, we indeed already support that version, but we default to 2.4 at the moment. I edited the title that the issue is about changing the default version.

The default in C++:

class PARQUET_EXPORT WriterProperties {
public:
class Builder {
public:
Builder()
: pool_(::arrow::default_memory_pool()),
dictionary_pagesize_limit_(DEFAULT_DICTIONARY_PAGE_SIZE_LIMIT),
write_batch_size_(DEFAULT_WRITE_BATCH_SIZE),
max_row_group_length_(DEFAULT_MAX_ROW_GROUP_LENGTH),
pagesize_(kDefaultDataPageSize),
version_(ParquetVersion::PARQUET_2_4),
data_page_version_(ParquetDataPageVersion::V1),
created_by_(DEFAULT_CREATED_BY),
store_decimal_as_integer_(false),
page_checksum_enabled_(false) {}

And similarly in the Python bindings the default is also "2.4".

@wgtmac
Copy link
Member

wgtmac commented May 25, 2023

It seems that parquet-cpp has implemented features (e.g. modular encryption and BYTE_STREAM_SPLIT encoding) beyond version 2.6. We probably need to update supported versions and manage features based on the version.

https://github.com/apache/parquet-format/blob/master/CHANGES.md

@mapleFU
Copy link
Member

mapleFU commented May 25, 2023

It seems that parquet-cpp has implemented features (e.g. modular encryption and BYTE_STREAM_SPLIT encoding) beyond version 2.6. We probably need to update supported versions and manage features based on the version.

https://github.com/apache/parquet-format/blob/master/CHANGES.md

We can open issue about this, I can take time to fix it.

@pitrou
Copy link
Member

pitrou commented Jun 15, 2023

Ok, so parquet-mr implemented nanosecond precision timestamps in 2018: apache/parquet-java#519
while we did so in 2018 (basic support: #4185) and 2019 (Arrow roundtrip: #4421).

I think this makes it ok to bump the default to 2.6.

@jorisvandenbossche
Copy link
Member

For reference (and to see where things need to be changed), the commit of the previous time bumping the version: 797c88a

anjakefala added a commit to anjakefala/arrow that referenced this issue Jun 16, 2023
@jorisvandenbossche jorisvandenbossche added this to the 13.0.0 milestone Jul 3, 2023
jorisvandenbossche added a commit that referenced this issue Jul 6, 2023
#36137)

Change the default parquet version to 2.6.

Discussed in: 
* [ML](https://lists.apache.org/thread/027g366yr3m03hwtpst6sr58b3trwhsm)
* [Issue](#35746)

* Closes: #35746

Lead-authored-by: anjakefala <anja@voltrondata.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants