[SPARK-22829] Add new built-in function date_trunc() #20015
Conversation
ok to test
we need to create a JIRA ticket
@cloud-fan and @youngbink, how about reviving #14788 with a configuration to control this? AWS Redshift seems to have it, PostgreSQL does not, and Presto also does not look like it has duplicated functionality. I think we can simply introduce an alias. Did I maybe miss something?
@HyukjinKwon Just took a look at PR #14788. My point in mentioning those databases was just to give examples of a function that Spark doesn't support but other databases commonly do. (They all have this.)
This is because
Where did the discussion happen? Was it an offline discussion? I also want to actively join in. Many implementations of trunc work differently, and I think we should decide the "right" behaviour after sufficient discussion. If we don't fix #14788 within the 2.3.0 timeline, it could be even more difficult later, because we would need to keep the previous behaviour.
Test build #85083 has finished for PR 20015 at commit
OK. I am fine if you guys all strongly feel about this.
Just took a quick pass.
python/pyspark/sql/functions.py
Outdated
:param format: 'year', 'YYYY', 'yy', 'month', 'mon', 'mm',
    'DAY', 'DD', 'HOUR', 'MINUTE', 'SECOND', 'WEEK', 'QUARTER'

>>> df = spark.createDataFrame([('1997-02-28',)], ['d'])
Can we use a timestamp string like 1997-02-28 05:02:11 to show the difference from trunc a bit more clearly?
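The reviewer's point can be illustrated with a plain-Python sketch (not Spark's implementation; the unit names mirror the formats discussed in this PR): a date-only input already has no time of day, so only a timestamp input like 1997-02-28 05:02:11 shows what the sub-day units add over the existing trunc.

```python
from datetime import datetime

def trunc_to(ts: datetime, unit: str) -> datetime:
    # Illustrative truncation semantics; not Spark's actual implementation.
    if unit == "year":
        return ts.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    if unit == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if unit == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown unit: {unit}")

ts = datetime(1997, 2, 28, 5, 2, 11)
print(trunc_to(ts, "year"))   # 1997-01-01 00:00:00
print(trunc_to(ts, "hour"))   # 1997-02-28 05:00:00
```

With a date-only input, 'year' and 'hour' would produce the same midnight value, which is why the timestamp example is clearer.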
override def eval(input: InternalRow): Any = {
/**
 *
 * @param input
Seems the input and truncFunc descriptions are missing.
override def inputTypes: Seq[AbstractDataType] = Seq(DateType, StringType)
override def dataType: DataType = DateType
trait TruncTime extends BinaryExpression with ImplicitCastInputTypes {
Maybe TruncInstant? I received this advice before and I liked it too. Not a big deal tho.
* Returns timestamp truncated to the unit specified by the format.
*
* @param format: 'year', 'yyyy', 'yy' for truncate by year,
*   'month', 'mon', 'mm' for truncate by month,
nit: one more space each.
python/pyspark/sql/functions.py
Outdated
Returns timestamp truncated to the unit specified by the format.

:param format: 'year', 'YYYY', 'yy', 'month', 'mon', 'mm',
    'DAY', 'DD', 'HOUR', 'MINUTE', 'SECOND', 'WEEK', 'QUARTER'
Could we make those lowercased too?
* @param format: 'year', 'yyyy', 'yy' for truncate by year,
*   'month', 'mon', 'mm' for truncate by month,
*   'day', 'dd' for truncate by day,
* Other options are: second, minute, hour, week, month, quarter
Maybe, 'second', 'minute', 'hour', 'week', 'month' and 'quarter'
// unknown format
null
} else {
val d = date.eval(input)
val d = time.eval(input)
nit: Since this is a time, it can be val t = ...
val level = if (format.foldable) {
  truncLevel
} else {
  DateTimeUtils.parseTruncLevel(format.eval().asInstanceOf[UTF8String])
}
if (level == -1) {
if (level == DateTimeUtils.TRUNC_INVALID || level > maxLevel) {
// unknown format or too small level?
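The guard above (unknown format, or a level finer than the input type supports) can be sketched in plain Python; the numeric level values here are illustrative and not necessarily Spark's actual constants:

```python
TRUNC_INVALID = -1

# Illustrative format-to-level table; finer granularity gets a larger number,
# with week and quarter appended at the end as in the constants shown in this PR.
LEVELS = {
    "year": 1, "yyyy": 1, "yy": 1,
    "month": 2, "mon": 2, "mm": 2,
    "day": 3, "dd": 3,
    "hour": 4, "minute": 5, "second": 6,
    "week": 7, "quarter": 8,
}

def parse_trunc_level(fmt):
    # Case-insensitive lookup; anything unrecognized maps to TRUNC_INVALID.
    if fmt is None:
        return TRUNC_INVALID
    return LEVELS.get(fmt.strip().lower(), TRUNC_INVALID)

def is_supported(level, max_level):
    # Mirrors `level == TRUNC_INVALID || level > maxLevel`: reject unknown
    # formats and levels finer than the expression's input type allows.
    return level != TRUNC_INVALID and level <= max_level
```

For a DateType input the expression would pass a small max_level, so 'hour' and finer formats fall through to the null branch.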
}
}
}

override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
protected def codeGenHelper[T](
Why do we need a type parameter T?
val TRUNC_TO_SECOND = 6
val TRUNC_TO_WEEK = 7
val TRUNC_TO_QUARTER = 8
val TRUNC_INVALID = -1
Can we bring quarter and week forward, maybe to 3 and 4? Then it conforms more to the order of time granularity, and the max-level design is not affected.
* @return
*/
protected def evalHelper[T](input: InternalRow, maxLevel: Int)(
    truncFunc: (Any, Int) => T): Any = {
Maybe truncFunc: (Any, Int) => Any is enough? So we don't need to use the T, but I'm not sure if this is better...
Yeah, keep any substantive discussion on the public lists. Sometimes a side conversation happens; summarize the points here. We've rejected a lot of other functions that other DBs, but not Hive, support. Spark mostly follows Hive, and for everything else there are UDFs. I'm not against this so much as unclear why it's exceptional.
We had an offline discussion and wanted to send this out to get more feedback. So generally just adding
If we didn't have a similar function, I would have gone +1, but what I am less sure about is that we seem to be adding this better version alone as a workaround, because updating the other related functions consistently would take a relatively larger change.
I get
hmm... even if we decide to change this later, I honestly think merging
SPARK-17174 originally described a few functions related to hour, min, etc., but I received advice to fix up the other related functions too, even though they could also each be done alone. I agreed with doing the other functions too at that time, and I tried to propose it that way. I am saying I think this PR actually targets adding another (better) version. Ah, so, I think I am less sure about why this should be done alone, leaving out the other related changes and the other functions we (I) usually reject. And I think you and @cloud-fan say the reasons are: it's common, and this PR targets a separate functionality consistent with other DBMSs.
@ExpressionDescription(
usage = """
_FUNC_(date, fmt) - Returns `date` with the time portion of the day truncated to the unit specified by the format model `fmt`.
`fmt` should be one of ["YEAR", "YYYY", "YY", "MON", "MONTH", "MM"]
Let us use lowercase and also update the other functions in this file. For example, ToUnixTimestamp.
The implementation is clean and the PR quality is pretty good.
3547b7c to b12ba92
Test build #85131 has finished for PR 20015 at commit
Test build #85132 has finished for PR 20015 at commit
python/pyspark/sql/functions.py
Outdated
"""
Returns timestamp truncated to the unit specified by the format.

:param format: 'year', 'YYYY', 'yy', 'month', 'mon', 'mm',
Nit: YYYY -> yyyy
Also update the original trunc
b12ba92 to 80a1959
Change itself seems fine.
* Returns the truncated date time from the original date time and trunc level.
* Trunc level should be generated using `parseTruncLevel()` and should be between 1 and 8.
*/
def truncTimestamp(d: SQLTimestamp, level: Int, timeZone: TimeZone): SQLTimestamp = {
nit: d -> ts or t
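For the week and quarter cases handled by truncTimestamp, here is a hedged plain-Python sketch; Spark's version operates on microsecond epoch values with a TimeZone, which this ignores:

```python
from datetime import datetime, timedelta

def trunc_timestamp(ts: datetime, level: str) -> datetime:
    # Illustrative sketch of week/quarter truncation; not Spark's implementation.
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if level == "week":
        # Roll back to the Monday of the current week (weekday(): Monday == 0).
        return day - timedelta(days=day.weekday())
    if level == "quarter":
        # Roll back to the first day of the current quarter (Jan/Apr/Jul/Oct 1).
        first_month = 3 * ((day.month - 1) // 3) + 1
        return day.replace(month=first_month, day=1)
    raise ValueError(f"unsupported level: {level}")
```

For example, 1997-02-28 05:02:11 (a Friday) truncates to 1997-02-24 at the 'week' level and to 1997-01-01 at the 'quarter' level.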
override def inputTypes: Seq[AbstractDataType] = Seq(DateType, StringType)
override def dataType: DataType = DateType
trait TruncInstant extends BinaryExpression with ImplicitCastInputTypes {
val time: Expression
Maybe, time -> instant.
* @param input internalRow (time)
* @param maxLevel Maximum level that can be used for truncation (e.g. MONTH for Date input)
* @param truncFunc function: (time, level) => time
* @return
Remove @return
private lazy val truncLevel: Int =
  DateTimeUtils.parseTruncLevel(format.eval().asInstanceOf[UTF8String])
override def eval(input: InternalRow): Any = {
/**
 *
Remove this line.
80a1959 to 0d1a8cb
Test build #85140 has finished for PR 20015 at commit
Only minor nits
// scalastyle:off line.size.limit
@ExpressionDescription(
usage = """
_FUNC_(fmt, date) - Returns timestamp `ts` truncated to the unit specified by the format model `fmt`.
date -> ts.
python/pyspark/sql/functions.py
Outdated
:param format: 'year', 'yyyy', 'yy', 'month', 'mon', 'mm',
    'day', 'dd', 'hour', 'minute', 'second', 'week', 'quarter'

>>> df = spark.createDataFrame([('1997-02-28 05:02:11',)], ['d'])
d -> t or ts.
@@ -563,6 +563,76 @@ class DateTimeUtilsSuite extends SparkFunSuite {
}
}

test("truncTimestamp") {
def test(
test -> testTrunc?
0d1a8cb to 7f08daf
7f08daf to 238d7d4
Test build #85141 has finished for PR 20015 at commit
LGTM
Thanks! Merged to master
Test build #85146 has finished for PR 20015 at commit
What changes were proposed in this pull request?
Adding date_trunc() as a built-in function. date_trunc is common in other databases, but neither Spark nor Hive has support for it. date_trunc is commonly used by data scientists and by business intelligence applications such as Superset (https://github.com/apache/incubator-superset). We do have trunc, but it only works at the 'MONTH' and 'YEAR' level on DateType input.
date_trunc() in other databases:
AWS Redshift: http://docs.aws.amazon.com/redshift/latest/dg/r_DATE_TRUNC.html
PostgreSQL: https://www.postgresql.org/docs/9.1/static/functions-datetime.html
Presto: https://prestodb.io/docs/current/functions/datetime.html
How was this patch tested?
Unit tests