[SPARK-31639] Revert SPARK-27528 Use Parquet logical type TIMESTAMP_MICROS by default #28450
…by default" This reverts commit 43a73e3.
@cloud-fan @zsxwing @HyukjinKwon @gatorsmile Please review this PR.
Hi, @MaxGekk.
I filed a subtask, SPARK-31639, for your request under SPARK-31085 (Amend Spark's Semantic Versioning Policy). Thanks.
LGTM
Test build #122283 has finished for PR 28450 at commit
+1, LGTM since this complies with SPARK-31085.
Merged to master/3.0.
…ICROS by default

### What changes were proposed in this pull request?
This reverts commit 43a73e3. It sets `INT96` as the timestamp type while saving timestamps to parquet files.

### Why are the changes needed?
To be compatible with Hive and Presto that don't support the `TIMESTAMP_MICROS` type in current stable releases.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing test suites.

Closes #28450 from MaxGekk/parquet-int96.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 372ccba)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
The upstream default is INT96, but INT96 is considered deprecated by Parquet [1], and we rely internally on the default being INT64 (TIMESTAMP_MICROS). INT64 reduces the size of Parquet files and avoids unnecessary conversions of microseconds to nanoseconds; see [2]. Apache went down the same route in [2] but then reverted in [3] to remain compatible with Hive and Presto.

[1] https://issues.apache.org/jira/browse/PARQUET-323
[2] apache#24425
[3] apache#28450
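To make the size and conversion argument concrete, here is a minimal sketch that encodes one timestamp both ways using only the Python standard library. It assumes the commonly used INT96 layout (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, little-endian). The helper names are hypothetical, and this is an illustration rather than Spark's or Parquet's actual writer code.

```python
import struct
from datetime import datetime, timezone

# Assumption: Julian day number of the Unix epoch (1970-01-01).
JULIAN_DAY_OF_EPOCH = 2440588
MICROS_PER_DAY = 86_400 * 1_000_000


def to_micros(ts: datetime) -> int:
    """Microseconds since the Unix epoch for a timezone-aware datetime."""
    delta = ts - datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def encode_int64_micros(micros: int) -> bytes:
    """INT64 (TIMESTAMP_MICROS) style: the value is stored as-is in 8 bytes."""
    return struct.pack("<q", micros)


def encode_int96(micros: int) -> bytes:
    """INT96 style: split into days and time of day, then widen micros to nanos."""
    days, micros_of_day = divmod(micros, MICROS_PER_DAY)
    nanos_of_day = micros_of_day * 1_000  # the extra micros -> nanos conversion
    return struct.pack("<qi", nanos_of_day, JULIAN_DAY_OF_EPOCH + days)


def decode_int96(raw: bytes) -> int:
    """Inverse of encode_int96: back to microseconds since the epoch."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanos_of_day // 1_000


if __name__ == "__main__":
    micros = to_micros(datetime(2020, 5, 5, 12, 30, 15, 123456, tzinfo=timezone.utc))
    print(len(encode_int64_micros(micros)))  # 8 bytes per value
    print(len(encode_int96(micros)))         # 12 bytes per value
```

Before any column encoding or compression, each INT96 value takes 12 bytes versus 8 for INT64, and every write and read pays for the split into day and nanoseconds-of-day.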
Can someone explain why we reverted to INT96? I read https://issues.apache.org/jira/browse/SPARK-31085 but want to know how the discussion happened. To me the cost of breaking the API (INT96 by default) seems smaller than the benefit (better performance for reading, for users not using Hive/Presto).
At that time, the ecosystem did not yet fully support the standard Parquet timestamp type. We can recheck now. If the latest versions of popular data systems (Hive, Presto, Flink, etc.) all support the standard Parquet timestamp type, Spark can change the default behavior.
### What changes were proposed in this pull request?
This reverts commit 43a73e3. It sets `INT96` as the timestamp type while saving timestamps to parquet files.

### Why are the changes needed?
To be compatible with Hive and Presto that don't support the `TIMESTAMP_MICROS` type in current stable releases.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing test suites.