[SPARK-27528][SQL] Use Parquet logical type TIMESTAMP_MICROS by default
## What changes were proposed in this pull request?

In the PR, I propose to use the `TIMESTAMP_MICROS` logical type for timestamps written to parquet files. The type matches Catalyst's `TimestampType` semantically: it stores microseconds since the epoch in the UTC time zone. This avoids converting microseconds to nanoseconds and to the Julian calendar, and it also reduces the size of written parquet files.
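As a rough illustration of the representation the PR adopts (plain Python, not Spark internals): `TIMESTAMP_MICROS` is an 8-byte int64 counting microseconds since the Unix epoch in UTC, whereas `INT96` is a 12-byte Julian-day/nanosecond pair. The function name below is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def to_timestamp_micros(ts: datetime) -> int:
    """Microseconds since the Unix epoch in UTC, as stored by Parquet's
    TIMESTAMP_MICROS logical type (a single 8-byte int64 per value)."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Integer division of a timedelta by one microsecond gives an exact count.
    return (ts - epoch) // timedelta(microseconds=1)

micros = to_timestamp_micros(
    datetime(2019, 4, 23, 12, 30, 45, 123456, tzinfo=timezone.utc))
```

No Julian-calendar arithmetic or nanosecond rescaling is involved, which is the conversion work the PR says the new default avoids.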

## How was this patch tested?

By existing test suites.

Closes apache#24425 from MaxGekk/parquet-timestamp_micros.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
MaxGekk authored and HyukjinKwon committed Apr 23, 2019
1 parent 3240e52 commit 43a73e3
Showing 4 changed files with 10 additions and 4 deletions.
2 changes: 2 additions & 0 deletions docs/sql-migration-guide-upgrade.md
```diff
@@ -124,6 +124,8 @@ license: |
 
 - In Spark version 2.4, when a spark session is created via `cloneSession()`, the newly created spark session inherits its configuration from its parent `SparkContext` even though the same configuration may exist with a different value in its parent spark session. Since Spark 3.0, the configurations of a parent `SparkSession` have a higher precedence over the parent `SparkContext`.
 
+- Since Spark 3.0, the parquet logical type `TIMESTAMP_MICROS` is used by default while saving `TIMESTAMP` columns. In Spark version 2.4 and earlier, `TIMESTAMP` columns are saved as `INT96` in parquet files. Set `spark.sql.parquet.outputTimestampType` to `INT96` to restore the previous behavior.
+
 ## Upgrading from Spark SQL 2.4 to 2.4.1
 
 - The value of `spark.executor.heartbeatInterval`, when specified without units like "30" rather than "30s", was
```
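Per the migration note, the pre-3.0 behavior can be restored by setting the output type back to `INT96`. A configuration sketch in PySpark (not runnable standalone: the session `spark` and DataFrame `df` are assumed to exist, and the output path is illustrative):

```python
# Restore the Spark 2.4 default: write TIMESTAMP columns as INT96.
# Assumes an active SparkSession `spark` and a DataFrame `df`.
spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")
df.write.parquet("/tmp/ts_int96")  # hypothetical output path
```

The same key can also be set per-session via `SET spark.sql.parquet.outputTimestampType=INT96` in Spark SQL.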
```diff
@@ -405,7 +405,7 @@ object SQLConf {
     .stringConf
     .transform(_.toUpperCase(Locale.ROOT))
     .checkValues(ParquetOutputTimestampType.values.map(_.toString))
-    .createWithDefault(ParquetOutputTimestampType.INT96.toString)
+    .createWithDefault(ParquetOutputTimestampType.TIMESTAMP_MICROS.toString)
 
   val PARQUET_INT64_AS_TIMESTAMP_MILLIS = buildConf("spark.sql.parquet.int64AsTimestampMillis")
     .doc(s"(Deprecated since Spark 2.3, please set ${PARQUET_OUTPUT_TIMESTAMP_TYPE.key}.) " +
```
```diff
@@ -120,8 +120,12 @@ class ParquetInteroperabilitySuite extends ParquetCompatibilityTest with SharedS
       ).map { s => java.sql.Timestamp.valueOf(s) }
       import testImplicits._
       // match the column names of the file from impala
-      val df = spark.createDataset(ts).toDF().repartition(1).withColumnRenamed("value", "ts")
-      df.write.parquet(tableDir.getAbsolutePath)
+      withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key ->
+          SQLConf.ParquetOutputTimestampType.INT96.toString) {
+        val df = spark.createDataset(ts).toDF().repartition(1)
+          .withColumnRenamed("value", "ts")
+        df.write.parquet(tableDir.getAbsolutePath)
+      }
       FileUtils.copyFile(new File(impalaPath), new File(tableDir, "part-00001.parq"))
 
       Seq(false, true).foreach { int96TimestampConversion =>
```
```diff
@@ -257,7 +257,7 @@ class SQLConfSuite extends QueryTest with SharedSQLContext {
 
     // check default value
     assert(spark.sessionState.conf.parquetOutputTimestampType ==
-      SQLConf.ParquetOutputTimestampType.INT96)
+      SQLConf.ParquetOutputTimestampType.TIMESTAMP_MICROS)
 
     // PARQUET_INT64_AS_TIMESTAMP_MILLIS should be respected.
     spark.sessionState.conf.setConf(SQLConf.PARQUET_INT64_AS_TIMESTAMP_MILLIS, true)
```
