[SPARK-31639] Revert SPARK-27528 Use Parquet logical type TIMESTAMP_MICROS by default #28450

MaxGekk · 2020-05-04T18:25:39Z

What changes were proposed in this pull request?

This reverts commit 43a73e3, restoring INT96 as the default type used when saving timestamps to Parquet files.

Why are the changes needed?

To stay compatible with Hive and Presto, which don't support the TIMESTAMP_MICROS type in their current stable releases.

Does this PR introduce any user-facing change?

No

How was this patch tested?

By existing test suites.

MaxGekk · 2020-05-04T18:26:37Z

@cloud-fan @zsxwing @HyukjinKwon @gatorsmile Please review this PR.

dongjoon-hyun · 2020-05-04T18:59:52Z

Hi, @MaxGekk .
The PR is a clean revert, but it would be great to have a separate JIRA issue, since the commit being reverted was merged over a year ago.

dongjoon-hyun · 2020-05-04T19:02:13Z

I filed a subtask, SPARK-31639, for your request under SPARK-31085 (Amend Spark's Semantic Versioning Policy). Thanks.

hvanhovell

LGTM

SparkQA · 2020-05-05T00:09:48Z

Test build #122283 has finished for PR 28450 at commit 1641f74.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

+1, LGTM since this complies with SPARK-31085.
Merged to master/3.0 .

…ICROS by default

### What changes were proposed in this pull request?
This reverts commit 43a73e3. It sets `INT96` as the timestamp type while saving timestamps to parquet files.

### Why are the changes needed?
To be compatible with Hive and Presto that don't support the `TIMESTAMP_MICROS` type in current stable releases.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing test suites.

Closes #28450 from MaxGekk/parquet-int96.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 372ccba)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

The upstream default is INT96, but Parquet considers INT96 deprecated [1], and we rely internally on the default being INT64 (TIMESTAMP_MICROS). INT64 reduces the size of Parquet files and avoids unnecessary conversions of microseconds to nanoseconds; see [2]. Apache Spark made the same change upstream in [2] but then reverted it in [3] to remain compatible with Hive and Presto.

[1] https://issues.apache.org/jira/browse/PARQUET-323
[2] apache#24425
[3] apache#28450
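To make the size and conversion argument concrete, here is a minimal Python sketch of the two physical encodings for a single timestamp value. It assumes the commonly used INT96 layout (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian) and a plain 8-byte count of microseconds since the Unix epoch for INT64 TIMESTAMP_MICROS. The function names are hypothetical and this is not Spark's writer code; it only compares raw per-value sizes before any Parquet encoding or compression.

```python
import struct
from datetime import datetime, timezone

# Julian day number of 1970-01-01, the Unix epoch.
JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588
MICROS_PER_DAY = 86_400 * 1_000_000


def to_epoch_micros(ts: datetime) -> int:
    """Microseconds since the Unix epoch for a timezone-aware datetime."""
    delta = ts - datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def encode_int96(epoch_micros: int) -> bytes:
    """Assumed INT96 layout: nanoseconds of day (8 bytes), Julian day (4 bytes)."""
    days, micros_of_day = divmod(epoch_micros, MICROS_PER_DAY)
    nanos_of_day = micros_of_day * 1_000  # the micros -> nanos conversion
    return struct.pack("<qi", nanos_of_day, days + JULIAN_DAY_OF_UNIX_EPOCH)


def decode_int96(raw: bytes) -> int:
    """Inverse of encode_int96; returns microseconds since the Unix epoch."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return days * MICROS_PER_DAY + nanos_of_day // 1_000  # nanos -> micros


def encode_int64_micros(epoch_micros: int) -> bytes:
    """INT64 TIMESTAMP_MICROS: the microsecond count is stored directly."""
    return struct.pack("<q", epoch_micros)


if __name__ == "__main__":
    micros = to_epoch_micros(datetime(2020, 5, 4, 18, 25, 39, tzinfo=timezone.utc))
    print(len(encode_int96(micros)))         # 12
    print(len(encode_int64_micros(micros)))  # 8
```

Under these assumptions, each INT96 value takes 12 bytes and needs a day/time split plus a microsecond-to-nanosecond multiplication on write (and the reverse on read), while the INT64 value is written as-is in 8 bytes.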

clee704 · 2022-11-02T19:12:22Z

Can someone explain why we reverted to INT96? I read https://issues.apache.org/jira/browse/SPARK-31085 but want to know how the discussion happened. To me the cost of breaking the API (INT96 by default) seems smaller than the benefit (better performance for reading, for users not using Hive/Presto).

cloud-fan · 2022-11-03T02:07:34Z

At that time, the ecosystem did not yet fully support the standard Parquet timestamp types. We can recheck now. If the latest versions of popular data systems (Hive, Presto, Flink, etc.) all support the standard Parquet timestamp, Spark can change the default behavior.

Revert "[SPARK-27528][SQL] Use Parquet logical type TIMESTAMP_MICROS by default"

1641f74

This reverts commit 43a73e3.

probot-autolabeler bot added DOCS SQL labels May 4, 2020

dongjoon-hyun changed the title ~~Revert "[SPARK-27528][SQL] Use Parquet logical type TIMESTAMP_MICROS by default"~~ [SPARK-31639] Revert SPARK-27528 Use Parquet logical type TIMESTAMP_MICROS by default May 4, 2020

hvanhovell approved these changes May 4, 2020

dongjoon-hyun approved these changes May 5, 2020

dongjoon-hyun closed this in 372ccba May 5, 2020

jlowe mentioned this pull request May 5, 2020

[FEA] Support INT96 timestamp type when writing Parquet rapidsai/cudf#5096

Closed

MaxGekk deleted the parquet-int96 branch June 5, 2020 19:48

jnturton mentioned this pull request Feb 1, 2023

[DISCUSSION] Use INT96 as default timestamp format in Parquet tables apache/drill#2746

Open
