[SPARK-22829] Add new built-in function date_trunc() #20015

Closed
wants to merge 3 commits into from

youngbink
Contributor

@youngbink youngbink commented Dec 19, 2017

What changes were proposed in this pull request?

Adding date_trunc() as a built-in function.
date_trunc is common in other databases, but neither Spark nor Hive supports it. date_trunc is commonly used by data scientists and business intelligence applications such as Superset (https://github.com/apache/incubator-superset).
We do have trunc, but it only works at the 'MONTH' and 'YEAR' levels on DateType input.
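
To make the intended semantics concrete, here is a minimal pure-Python sketch of timestamp truncation. It is not the Spark implementation from this patch: the name `date_trunc_sketch`, the Monday week start, and returning `None` for an unknown unit are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def date_trunc_sketch(fmt: str, ts: datetime) -> Optional[datetime]:
    """Truncate ts to the unit named by fmt (illustration only, not Spark code)."""
    unit = fmt.lower()
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit in ("year", "yyyy", "yy"):
        return midnight.replace(month=1, day=1)
    if unit == "quarter":
        # First month of the quarter: 1, 4, 7 or 10.
        first_month = 3 * ((ts.month - 1) // 3) + 1
        return midnight.replace(month=first_month, day=1)
    if unit in ("month", "mon", "mm"):
        return midnight.replace(day=1)
    if unit == "week":
        # Assumption: weeks start on Monday (weekday() == 0).
        return midnight - timedelta(days=ts.weekday())
    if unit in ("day", "dd"):
        return midnight
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if unit == "minute":
        return ts.replace(second=0, microsecond=0)
    if unit == "second":
        return ts.replace(microsecond=0)
    return None  # Assumption: unknown units yield None.
```

For example, `date_trunc_sketch('hour', datetime(1997, 2, 28, 5, 2, 11))` keeps the date and the hour and zeroes everything finer, a level that a date-only trunc limited to 'MONTH' and 'YEAR' cannot express.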

date_trunc() in other databases:
AWS Redshift: http://docs.aws.amazon.com/redshift/latest/dg/r_DATE_TRUNC.html
PostgreSQL: https://www.postgresql.org/docs/9.1/static/functions-datetime.html
Presto: https://prestodb.io/docs/current/functions/datetime.html

How was this patch tested?

Unit tests


@youngbink
Contributor Author

@gatorsmile @cloud-fan

@cloud-fan
Contributor

ok to test

@cloud-fan
Contributor

we need to create a JIRA ticket

@youngbink changed the title from "Add new built-in function date_trunc()" to "[SPARK-22829] Add new built-in function date_trunc()" on Dec 19, 2017
@HyukjinKwon
Member

HyukjinKwon commented Dec 19, 2017

@cloud-fan and @youngbink how about reviving #14788 with a configuration to control this?

AWS Redshift seems to have TRUNC, which just converts a timestamp to a date, whereas Spark's trunc supports date formats. The two are not quite equivalent. I think Spark's trunc is more like Redshift's DATE_TRUNC.

PostgreSQL does not have trunc but has date_trunc, which takes a format and always returns a timestamp.

Presto also does not seem to have duplicated functionality.

I think we can simply introduce an alias for trunc after resolving #14788 if the naming matters.

Did I maybe miss something?

@youngbink
Contributor Author

youngbink commented Dec 19, 2017

@HyukjinKwon Just took a look at this PR #14788.

My point in mentioning those databases was just to give examples of a function that Spark doesn't support but other databases commonly do. (They all have this date_trunc, which takes a timestamp and outputs a timestamp.)
As you said, we could extend trunc and simply create an alias date_trunc, but it's actually not that simple. For example, PR #14788 won't be able to handle the following commands correctly in PySpark:

df = spark.createDataFrame([('1997-02-28 05:02:11',)], ['d'])
df.select(functions.trunc(df.d, 'year').alias('year')).collect()  
df.select(functions.trunc(df.d, 'SS').alias('SS')).collect() 

This is because trunc(string, string) isn't correctly handled. We could find a way around this and get it working, but after having a discussion with @cloud-fan, @gatorsmile, @rednaxelafx and Reynold, we thought adding date_trunc is the simplest way for now.

@HyukjinKwon
Member

HyukjinKwon commented Dec 19, 2017

after having a discussion with @cloud-fan, @gatorsmile, @rednaxelafx and Reynold

Where did the discussion happen? Was this an offline discussion? I also want to actively join the discussion. Many implementations of trunc work differently, and I think we should decide the "right" behaviour after sufficient discussion.

If we don't fix the issues around #14788 in the 2.3.0 timeline, it could become even more difficult because we would need to keep the previous behaviour.

@SparkQA

SparkQA commented Dec 19, 2017

Test build #85083 has finished for PR 20015 at commit f94f401.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

OK. I am fine with this if you all feel strongly about it.

Member

@HyukjinKwon HyukjinKwon left a comment

Just took a quick pass.

:param format: 'year', 'YYYY', 'yy', 'month', 'mon', 'mm',
'DAY', 'DD', 'HOUR', 'MINUTE', 'SECOND', 'WEEK', 'QUARTER'

>>> df = spark.createDataFrame([('1997-02-28',)], ['d'])
Member

Can we use a timestamp string like 1997-02-28 05:02:11 to show the difference from trunc a bit more clearly?

override def eval(input: InternalRow): Any = {
/**
*
* @param input
Member

The input and truncFunc descriptions seem to be missing.


override def inputTypes: Seq[AbstractDataType] = Seq(DateType, StringType)
override def dataType: DataType = DateType
trait TruncTime extends BinaryExpression with ImplicitCastInputTypes {
Member

Maybe TruncInstant? I received this advice before and I liked it too. Not a big deal tho.

* Returns timestamp truncated to the unit specified by the format.
*
* @param format: 'year', 'yyyy', 'yy' for truncate by year,
* 'month', 'mon', 'mm' for truncate by month,
Member

nit: one more space each.

Returns timestamp truncated to the unit specified by the format.

:param format: 'year', 'YYYY', 'yy', 'month', 'mon', 'mm',
'DAY', 'DD', 'HOUR', 'MINUTE', 'SECOND', 'WEEK', 'QUARTER'
Member

Could we make those lowercased too?

* @param format: 'year', 'yyyy', 'yy' for truncate by year,
* 'month', 'mon', 'mm' for truncate by month,
* 'day', 'dd' for truncate by day,
* Other options are: second, minute, hour, week, month, quarter
Member

Maybe, 'second', 'minute', 'hour', 'week', 'month' and 'quarter'

// unknown format
null
} else {
- val d = date.eval(input)
+ val d = time.eval(input)

nit: Since this is a time, it can be val t = ...

val level = if (format.foldable) {
truncLevel
} else {
DateTimeUtils.parseTruncLevel(format.eval().asInstanceOf[UTF8String])
}
- if (level == -1) {
+ if (level == DateTimeUtils.TRUNC_INVALID || level > maxLevel) {

// unknown format or too small level?
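
The check above can be sketched in plain Python as follows. This is only an illustration of the idea that one shared helper returns null both for an unknown format and for a level finer than the caller allows. The level numbers, the `LEVELS` table, and the names `eval_helper`, `trunc_date` and `trunc_timestamp` are hypothetical and not taken from the patch.

```python
from datetime import date, datetime

# Hypothetical level numbers; only their relative order matters here.
TRUNC_INVALID = -1
LEVELS = {"year": 1, "month": 2, "day": 3, "hour": 4}


def eval_helper(value, fmt, max_level, trunc_func):
    """Return trunc_func(value, level), or None for an unknown or too-fine level."""
    if value is None or fmt is None:
        return None
    level = LEVELS.get(fmt.lower(), TRUNC_INVALID)
    if level == TRUNC_INVALID or level > max_level:
        # Unknown format, or a level finer than this input type supports.
        return None
    return trunc_func(value, level)


def trunc_date(d: date, level: int) -> date:
    # Only called with level 1 (year) or 2 (month) because max_level is 2.
    return d.replace(month=1, day=1) if level == 1 else d.replace(day=1)


def trunc_timestamp(ts: datetime, level: int) -> datetime:
    # Zero out every field finer than the requested level.
    ts = ts.replace(minute=0, second=0, microsecond=0)
    if level <= 3:
        ts = ts.replace(hour=0)
    if level <= 2:
        ts = ts.replace(day=1)
    if level <= 1:
        ts = ts.replace(month=1)
    return ts
```

Under these assumptions, `eval_helper(date(1997, 2, 28), "hour", 2, trunc_date)` returns `None` because 'hour' is finer than the month-level maximum for dates, while the same format with `max_level=4` and `trunc_timestamp` truncates a timestamp normally.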

}
}
}

override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
protected def codeGenHelper[T](

Why do we need a type parameter T?

val TRUNC_TO_SECOND = 6
val TRUNC_TO_WEEK = 7
val TRUNC_TO_QUARTER = 8
val TRUNC_INVALID = -1

Can we bring quarter and week forward, maybe to 3 and 4? Then it conforms better to the order of time granularity, and the max-level design is not affected.
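
A rough Python sketch of what this suggestion could look like is given below. Only the positions 3 and 4 for quarter and week come from the comment; the values for year, month, day, hour, minute and second, as well as the names `TruncLevel`, `parse_trunc_level` and `is_supported`, are assumptions for illustration and not the constants in `DateTimeUtils`.

```python
from enum import IntEnum


class TruncLevel(IntEnum):
    # Hypothetical numbering: quarter and week moved to 3 and 4 as suggested.
    YEAR = 1
    MONTH = 2
    QUARTER = 3
    WEEK = 4
    DAY = 5
    HOUR = 6
    MINUTE = 7
    SECOND = 8


TRUNC_INVALID = -1

# Case-insensitive format strings accepted in the docstrings under review.
_FORMATS = {
    "year": TruncLevel.YEAR, "yyyy": TruncLevel.YEAR, "yy": TruncLevel.YEAR,
    "month": TruncLevel.MONTH, "mon": TruncLevel.MONTH, "mm": TruncLevel.MONTH,
    "quarter": TruncLevel.QUARTER,
    "week": TruncLevel.WEEK,
    "day": TruncLevel.DAY, "dd": TruncLevel.DAY,
    "hour": TruncLevel.HOUR,
    "minute": TruncLevel.MINUTE,
    "second": TruncLevel.SECOND,
}


def parse_trunc_level(fmt):
    """Map a format string to a level number, or TRUNC_INVALID if unknown."""
    if fmt is None:
        return TRUNC_INVALID
    return int(_FORMATS.get(fmt.lower(), TRUNC_INVALID))


def is_supported(level, max_level):
    """A level is usable when it is valid and not beyond the caller's maximum."""
    return level != TRUNC_INVALID and level <= max_level
```

With this numbering, a date-only caller can keep `TruncLevel.MONTH` as its maximum and still reject quarter, week and anything finer with a single comparison.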

* @return
*/
protected def evalHelper[T](input: InternalRow, maxLevel: Int)(
truncFunc: (Any, Int) => T): Any = {

Maybe truncFunc: (Any, Int) => Any is enough? So we don't need to use the T, but I'm not sure if this is better...

@srowen
Member

srowen commented Dec 19, 2017

Yeah keep any substantive discussion on the public lists. Sometimes a side conversation happens; summarize the points here.

We've rejected a lot of other functions that other DBs, but not Hive, support. Spark mostly follows Hive, and for everything else, there are UDFs. I'm not against this so much as unclear on why it's exceptional.

@cloud-fan
Contributor

We had an offline discussion and want to send this out to get more feedback. In general, just adding date_trunc is pretty straightforward and makes Spark consistent with other databases for this function, while extending trunc to support the timestamp type is a better API design.

@HyukjinKwon
Member

If we didn't already have a similar function, I would have given a +1, but what I am less sure about is that date_trunc honestly sounds like a better version of trunc. Both also seem to extend the same parent here, TruncTime.

I feel like we are trying to add this better version alone as a workaround, because it takes a relatively larger change to update the other related functions consistently.

@HyukjinKwon
Member

I get that date_trunc is common in other DBMSs. I can see that this can be done now and that we can still proceed with trunc, etc. later. So, I am fine, but still less sure tho.

@youngbink
Contributor Author

youngbink commented Dec 19, 2017

hmm...even if we decide to change this later, I honestly think merging trunc and date_trunc would be simple, only touching a couple of files (mostly datetimeExpressions.scala).
This PR isn't small, as you said, but most of the code here can be reused without modification if we later merge trunc and date_trunc.

@HyukjinKwon
Member

HyukjinKwon commented Dec 19, 2017

SPARK-17174 originally described a few functions related to hour, min, etc., but I received advice to fix up other related functions too, even though they could have been done separately. I agreed at that time and tried to propose it that way.

I am saying that I think this PR is really about adding another (better) version of trunc that supports day, hour, min, etc. in the format. In this case, I think we should deduplicate the logic and support it in the related functions too.

Ah, so, I think I am less sure about why this should be done alone, leaving out other related changes and other functions we (I) usually reject.

And I think you and @cloud-fan say the reasons are that it's common and that this PR targets a separate functionality consistent with other DBMSs.

@ExpressionDescription(
usage = """
_FUNC_(date, fmt) - Returns `date` with the time portion of the day truncated to the unit specified by the format model `fmt`.
`fmt` should be one of ["YEAR", "YYYY", "YY", "MON", "MONTH", "MM"]
Member

Let us use the lower case and also update the other functions in this file. For example, ToUnixTimestamp

@gatorsmile
Member

  • The API proposed by this PR is consistent with the other DBs.
  • The implementation does not introduce behavior changes.

The implementation is clean and the PR quality is pretty good.

@SparkQA

SparkQA commented Dec 19, 2017

Test build #85131 has finished for PR 20015 at commit 3547b7c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait TruncInstant extends BinaryExpression with ImplicitCastInputTypes

@SparkQA

SparkQA commented Dec 19, 2017

Test build #85132 has finished for PR 20015 at commit b12ba92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait TruncInstant extends BinaryExpression with ImplicitCastInputTypes

"""
Returns timestamp truncated to the unit specified by the format.

:param format: 'year', 'YYYY', 'yy', 'month', 'mon', 'mm',
Member

Nit: YYYY -> yyyy

Member

Also update the original trunc

Member

@HyukjinKwon HyukjinKwon left a comment

Change itself seems fine.

* Returns the trunc date time from original date time and trunc level.
* Trunc level should be generated using `parseTruncLevel()`, should be between 1 and 8
*/
def truncTimestamp(d: SQLTimestamp, level: Int, timeZone: TimeZone): SQLTimestamp = {
Member

nit: d -> ts or t

override def inputTypes: Seq[AbstractDataType] = Seq(DateType, StringType)
override def dataType: DataType = DateType
trait TruncInstant extends BinaryExpression with ImplicitCastInputTypes {
val time: Expression
Member

Maybe, time -> instant.

* @param input internalRow (time)
* @param maxLevel Maximum level that can be used for truncation (e.g MONTH for Date input)
* @param truncFunc function: (time, level) => time
* @return
Member

Remove @return


private lazy val truncLevel: Int =
DateTimeUtils.parseTruncLevel(format.eval().asInstanceOf[UTF8String])

override def eval(input: InternalRow): Any = {
/**
*
Member

Remove this line.

@SparkQA

SparkQA commented Dec 20, 2017

Test build #85140 has finished for PR 20015 at commit 80a1959.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait TruncInstant extends BinaryExpression with ImplicitCastInputTypes

Member

@HyukjinKwon HyukjinKwon left a comment

Only minor nits

// scalastyle:off line.size.limit
@ExpressionDescription(
usage = """
_FUNC_(fmt, date) - Returns timestamp `ts` truncated to the unit specified by the format model `fmt`.
Member

date -> ts.

:param format: 'year', 'yyyy', 'yy', 'month', 'mon', 'mm',
'day', 'dd', 'hour', 'minute', 'second', 'week', 'quarter'

>>> df = spark.createDataFrame([('1997-02-28 05:02:11',)], ['d'])
Member

d -> t or ts.

@@ -563,6 +563,76 @@ class DateTimeUtilsSuite extends SparkFunSuite {
}
}

test("truncTimestamp") {
def test(
Member

test -> testTrunc ?

@SparkQA

SparkQA commented Dec 20, 2017

Test build #85141 has finished for PR 20015 at commit 0d1a8cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait TruncInstant extends BinaryExpression with ImplicitCastInputTypes

@gatorsmile
Member

LGTM

@gatorsmile
Member

Thanks! Merged to master

@asfgit asfgit closed this in 6e36d8d Dec 20, 2017
@SparkQA

SparkQA commented Dec 20, 2017

Test build #85146 has finished for PR 20015 at commit 238d7d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait TruncInstant extends BinaryExpression with ImplicitCastInputTypes

chenzhx pushed a commit to chenzhx/spark that referenced this pull request Oct 19, 2020
Author: Youngbin Kim <ykim828@hotmail.com>

Closes apache#20015 from youngbink/date_trunc.