Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-31639] Revert SPARK-27528 Use Parquet logical type TIMESTAMP_MICROS by default #28450

Closed
wants to merge 1 commit into from

Conversation

MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented May 4, 2020

What changes were proposed in this pull request?

This reverts commit 43a73e3. It sets INT96 as the timestamp type while saving timestamps to parquet files.

Why are the changes needed?

To be compatible with Hive and Presto that don't support the TIMESTAMP_MICROS type in current stable releases.

Does this PR introduce any user-facing change?

No

How was this patch tested?

By existing test suites.

@MaxGekk
Copy link
Member Author

MaxGekk commented May 4, 2020

@cloud-fan @zsxwing @HyukjinKwon @gatorsmile Please, review this PR.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented May 4, 2020

Hi, @MaxGekk .
The PR is a clean revert, but it would be great if we have a separate JIRA issue since the reverting target commit was merged over one year ago.

@dongjoon-hyun dongjoon-hyun changed the title Revert "[SPARK-27528][SQL] Use Parquet logical type TIMESTAMP_MICROS by default" [SPARK-31639] Revert SPARK-27528 Use Parquet logical type TIMESTAMP_MICROS by default May 4, 2020
@dongjoon-hyun
Copy link
Member

I filed a subtask, SPARK-31639, for your request under SPARK-31085 (Amend Spark's Semantic Versioning Policy). Thanks.

Copy link
Contributor

@hvanhovell hvanhovell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SparkQA
Copy link

SparkQA commented May 5, 2020

Test build #122283 has finished for PR 28450 at commit 1641f74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM since this is complying SPARK-31085.
Merged to master/3.0 .

dongjoon-hyun pushed a commit that referenced this pull request May 5, 2020
…ICROS by default

### What changes were proposed in this pull request?
This reverts commit 43a73e3. It sets `INT96` as the timestamp type while saving timestamps to parquet files.

### Why are the changes needed?
To be compatible with Hive and Presto that don't support the `TIMESTAMP_MICROS` type in current stable releases.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing test suites.

Closes #28450 from MaxGekk/parquet-int96.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 372ccba)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@MaxGekk MaxGekk deleted the parquet-int96 branch June 5, 2020 19:48
rshkv added a commit to palantir/spark that referenced this pull request Mar 9, 2021
The upstream default is INT96 but INT96 is considered deprecated by
Parquet [1] and we rely internally on the default being INT64
(TIMESTAMP_MICROS).

INT64 reduces the size of Parquet files and avoids unnecessary conversions
of microseconds to nanoseconds, see [2].

Apache went down the same route in [2] but then reverted to remain
compatible with Hive and Presto in [3].

[1] https://issues.apache.org/jira/browse/PARQUET-323
[2] apache#24425
[3] apache#28450
@clee704
Copy link

clee704 commented Nov 2, 2022

Can someone explain why we reverted to INT96? I read https://issues.apache.org/jira/browse/SPARK-31085 but want to know how the discussion happened. To me the cost of breaking the API (INT96 by default) seems smaller than the benefit (better performance for reading, for users not using Hive/Presto).

@cloud-fan
Copy link
Contributor

At that time, the ecosystem does not fully support standard parquet timestamp yet. We can recheck now. If the latest version of popular data systems (Hive, Presto, Flink, etc.) all support parquet standard timestamp, Spark can change the default behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants