[SPARK-22829] Add new built-in function date_trunc() #20015
Conversation
ok to test
we need to create a JIRA ticket
@cloud-fan and @youngbink, how about reviving #14788 with a configuration to control this? AWS Redshift seems to have it, PostgreSQL does not, and Presto also does not look like it has duplicated functionality. I think we can simply introduce an alias. Did I maybe miss something?
@HyukjinKwon Just took a look at PR #14788. My point in mentioning those databases was just to give examples of a function that Spark doesn't support but other databases commonly do. (They all have this.)
This is because
Where did the discussion happen? Was it an offline discussion? I also want to actively join in. Many implementations of trunc work differently, and I think we should decide the "right" behaviour after sufficient discussion. If we don't fix #14788 within the 2.3.0 timeline, it could be even more difficult later, because we would need to keep the previous behaviour.
Test build #85083 has finished for PR 20015 at commit
OK. I am fine if you guys all strongly feel about this.
Just took a quick pass.
python/pyspark/sql/functions.py
Outdated
:param format: 'year', 'YYYY', 'yy', 'month', 'mon', 'mm',
    'DAY', 'DD', 'HOUR', 'MINUTE', 'SECOND', 'WEEK', 'QUARTER'

>>> df = spark.createDataFrame([('1997-02-28',)], ['d'])
Can we use a timestamp string like 1997-02-28 05:02:11 to show the difference from trunc a bit more clearly?
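The reviewer's point can be illustrated with a plain-Python sketch (not Spark's implementation; the unit names mirror the formats discussed in this PR): a date-only input already has no time of day, so only a timestamp input like 1997-02-28 05:02:11 shows what the sub-day units add over the existing trunc.

```python
from datetime import datetime

def trunc_to(ts: datetime, unit: str) -> datetime:
    # Illustrative truncation semantics; not Spark's actual implementation.
    if unit == "year":
        return ts.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    if unit == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if unit == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown unit: {unit}")

ts = datetime(1997, 2, 28, 5, 2, 11)
print(trunc_to(ts, "year"))   # 1997-01-01 00:00:00
print(trunc_to(ts, "hour"))   # 1997-02-28 05:00:00
```

With a date-only input, 'year' and 'hour' would produce the same midnight value, which is why the timestamp example is clearer.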
override def eval(input: InternalRow): Any = {
/**
 *
 * @param input
Seems the input and truncFunc descriptions are missing.
override def inputTypes: Seq[AbstractDataType] = Seq(DateType, StringType)
override def dataType: DataType = DateType
trait TruncTime extends BinaryExpression with ImplicitCastInputTypes {
Maybe TruncInstant? I received this advice before and I liked it too. Not a big deal tho.
* Returns timestamp truncated to the unit specified by the format.
*
* @param format: 'year', 'yyyy', 'yy' for truncate by year,
*   'month', 'mon', 'mm' for truncate by month,
nit: one more space each.
python/pyspark/sql/functions.py
Outdated
Returns timestamp truncated to the unit specified by the format.

:param format: 'year', 'YYYY', 'yy', 'month', 'mon', 'mm',
    'DAY', 'DD', 'HOUR', 'MINUTE', 'SECOND', 'WEEK', 'QUARTER'
Could we make those lowercased too?
* @param format: 'year', 'yyyy', 'yy' for truncate by year,
*   'month', 'mon', 'mm' for truncate by month,
*   'day', 'dd' for truncate by day,
* Other options are: second, minute, hour, week, month, quarter
Maybe, 'second', 'minute', 'hour', 'week', 'month' and 'quarter'
// unknown format
null
} else {
val d = date.eval(input)
val d = time.eval(input)
nit: Since this is a time, it can be val t = ...
val level = if (format.foldable) {
  truncLevel
} else {
  DateTimeUtils.parseTruncLevel(format.eval().asInstanceOf[UTF8String])
}
if (level == -1) {
if (level == DateTimeUtils.TRUNC_INVALID || level > maxLevel) {
// unknown format or too small level?
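The guard above (unknown format, or a level finer than the input type supports) can be sketched in plain Python; the numeric level values here are illustrative and not necessarily Spark's actual constants:

```python
TRUNC_INVALID = -1

# Illustrative format-to-level table; finer granularity gets a larger number,
# with week and quarter appended at the end as in the constants shown in this PR.
LEVELS = {
    "year": 1, "yyyy": 1, "yy": 1,
    "month": 2, "mon": 2, "mm": 2,
    "day": 3, "dd": 3,
    "hour": 4, "minute": 5, "second": 6,
    "week": 7, "quarter": 8,
}

def parse_trunc_level(fmt):
    # Case-insensitive lookup; anything unrecognized maps to TRUNC_INVALID.
    if fmt is None:
        return TRUNC_INVALID
    return LEVELS.get(fmt.strip().lower(), TRUNC_INVALID)

def is_supported(level, max_level):
    # Mirrors `level == TRUNC_INVALID || level > maxLevel`: reject unknown
    # formats and levels finer than the expression's input type allows.
    return level != TRUNC_INVALID and level <= max_level
```

For a DateType input the expression would pass a small max_level, so 'hour' and finer formats fall through to the null branch.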
}
}
}

override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
protected def codeGenHelper[T](
Why do we need a type parameter T?
val TRUNC_TO_SECOND = 6
val TRUNC_TO_WEEK = 7
val TRUNC_TO_QUARTER = 8
val TRUNC_INVALID = -1
Can we bring quarter and week forward, maybe to 3 and 4? Then it conforms more to the order of time granularity, and the max-level design is not affected.
* @return
*/
protected def evalHelper[T](input: InternalRow, maxLevel: Int)(
    truncFunc: (Any, Int) => T): Any = {
Maybe truncFunc: (Any, Int) => Any is enough? So we don't need to use the T, but I'm not sure if this is better...
Yeah, keep any substantive discussion on the public lists. Sometimes a side conversation happens; summarize the points here. We've rejected a lot of other functions that other DBs, but not Hive, support. Spark mostly follows Hive, and for everything else there are UDFs. I'm not against this so much as unclear why it's exceptional.
We had an offline discussion and wanted to send this out to get more feedback. So generally just adding
If we didn't have a similar function, I would have gone +1, but what I am less sure about is that we seem to be adding this better version alone as a workaround, because updating the other related functions consistently would take a relatively larger change.
I get
hmm... even if we decide to change this later, I honestly think merging
SPARK-17174 originally described a few functions related to hour, min, etc., but I received advice to fix up the other related functions too, even though they could also each be done alone. I agreed with doing the other functions too at that time, and I tried to propose it that way. I am saying I think this PR actually targets adding another (better) version. Ah, so, I think I am less sure about why this should be done alone, leaving out the other related changes and the other functions we (I) usually reject. And I think you and @cloud-fan say the reasons are: it's common, and this PR targets a separate functionality consistent with other DBMSs.
@ExpressionDescription(
usage = """
_FUNC_(date, fmt) - Returns `date` with the time portion of the day truncated to the unit specified by the format model `fmt`.
`fmt` should be one of ["YEAR", "YYYY", "YY", "MON", "MONTH", "MM"]
Let us use lowercase and also update the other functions in this file. For example, ToUnixTimestamp.
The implementation is clean and the PR quality is pretty good.
3547b7c to b12ba92
Test build #85131 has finished for PR 20015 at commit
Test build #85132 has finished for PR 20015 at commit
python/pyspark/sql/functions.py
Outdated
"""
Returns timestamp truncated to the unit specified by the format.

:param format: 'year', 'YYYY', 'yy', 'month', 'mon', 'mm',
Nit: YYYY -> yyyy
Also update the original trunc
b12ba92 to 80a1959
Change itself seems fine.
* Returns the truncated date time from the original date time and trunc level.
* Trunc level should be generated using `parseTruncLevel()` and should be between 1 and 8.
*/
def truncTimestamp(d: SQLTimestamp, level: Int, timeZone: TimeZone): SQLTimestamp = {
nit: d -> ts or t
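For the week and quarter cases handled by truncTimestamp, here is a hedged plain-Python sketch; Spark's version operates on microsecond epoch values with a TimeZone, which this ignores:

```python
from datetime import datetime, timedelta

def trunc_timestamp(ts: datetime, level: str) -> datetime:
    # Illustrative sketch of week/quarter truncation; not Spark's implementation.
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if level == "week":
        # Roll back to the Monday of the current week (weekday(): Monday == 0).
        return day - timedelta(days=day.weekday())
    if level == "quarter":
        # Roll back to the first day of the current quarter (Jan/Apr/Jul/Oct 1).
        first_month = 3 * ((day.month - 1) // 3) + 1
        return day.replace(month=first_month, day=1)
    raise ValueError(f"unsupported level: {level}")
```

For example, 1997-02-28 05:02:11 (a Friday) truncates to 1997-02-24 at the 'week' level and to 1997-01-01 at the 'quarter' level.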
override def inputTypes: Seq[AbstractDataType] = Seq(DateType, StringType)
override def dataType: DataType = DateType
trait TruncInstant extends BinaryExpression with ImplicitCastInputTypes {
val time: Expression
Maybe, time -> instant.
* @param input internalRow (time)
* @param maxLevel Maximum level that can be used for truncation (e.g. MONTH for Date input)
* @param truncFunc function: (time, level) => time
* @return
Remove @return
private lazy val truncLevel: Int =
  DateTimeUtils.parseTruncLevel(format.eval().asInstanceOf[UTF8String])
override def eval(input: InternalRow): Any = {
/**
 *
Remove this line.
80a1959 to 0d1a8cb
Test build #85140 has finished for PR 20015 at commit
Only minor nits
// scalastyle:off line.size.limit
@ExpressionDescription(
usage = """
_FUNC_(fmt, date) - Returns timestamp `ts` truncated to the unit specified by the format model `fmt`.
date -> ts.
python/pyspark/sql/functions.py
Outdated
:param format: 'year', 'yyyy', 'yy', 'month', 'mon', 'mm',
    'day', 'dd', 'hour', 'minute', 'second', 'week', 'quarter'

>>> df = spark.createDataFrame([('1997-02-28 05:02:11',)], ['d'])
d -> t or ts.
@@ -563,6 +563,76 @@ class DateTimeUtilsSuite extends SparkFunSuite {
}
}

test("truncTimestamp") {
def test(
test -> testTrunc?
0d1a8cb to 7f08daf
7f08daf to 238d7d4
Test build #85141 has finished for PR 20015 at commit
LGTM
Thanks! Merged to master
Test build #85146 has finished for PR 20015 at commit
What changes were proposed in this pull request?
Adding date_trunc() as a built-in function. date_trunc is common in other databases, but neither Spark nor Hive has support for it. date_trunc is commonly used by data scientists and by business intelligence applications such as Superset (https://github.com/apache/incubator-superset). We do have trunc, but it only works at the 'MONTH' and 'YEAR' level on DateType input.
date_trunc() in other databases:
AWS Redshift: http://docs.aws.amazon.com/redshift/latest/dg/r_DATE_TRUNC.html
PostgreSQL: https://www.postgresql.org/docs/9.1/static/functions-datetime.html
Presto: https://prestodb.io/docs/current/functions/datetime.html
How was this patch tested?
Unit tests