-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-14160] Time Windowing functions for Datasets #12008
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #54343 has finished for PR 12008 at commit
|
|
Test build #54344 has finished for PR 12008 at commit
|
| override def dataType: DataType = outputType | ||
|
|
||
| private def outputType: StructType = StructType(Seq( | ||
| StructField("start", TimestampType), StructField("end", TimestampType))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: indents are off in this file and the next.
| Add(division, Literal(i - windowExpr.maxNumOverlapping)), | ||
| Literal(windowExpr.slideDuration)), | ||
| Literal(windowExpr.startTime)), | ||
| Literal(1000000)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be easier to read if it was written with the dsl.
| windowDuration: String, | ||
| slideDuration: String, | ||
| startTime: String): Column = withExpr { | ||
| TimeWindow(timeColumn.expr, windowDuration, slideDuration, startTime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should just parse here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe in the companion object? Or with another constructor? The _param / lazy val is a little odd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on parsing it here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the companion object apply method
| (cal.months * 4 * CalendarInterval.MICROS_PER_WEEK + cal.microseconds) / 1000000 | ||
| } | ||
|
|
||
| // The window duration in seconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that the smallest window is 1 second?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unfortunately yes. The conversion from LongType to TimestampType has second precision.
|
lower case Dataset |
|
Test build #54345 has finished for PR 12008 at commit
|
| * @group datetime_funcs | ||
| * @since 2.0.0 | ||
| */ | ||
| def window(timeColumn: Column, windowDuration: String): Column = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
experimental tag
|
Test build #54370 has finished for PR 12008 at commit
|
|
Test build #54371 has finished for PR 12008 at commit
|
|
Test build #54372 has finished for PR 12008 at commit
|
|
retest this please |
|
Test build #54387 has finished for PR 12008 at commit
|
|
Test build #54484 has finished for PR 12008 at commit
|
| throw new IllegalArgumentException( | ||
| s"The provided interval ($interval) did not correspond to a valid interval string.") | ||
| } | ||
| (cal.months * 4 * CalendarInterval.MICROS_PER_WEEK + cal.microseconds) / 1000000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4 weeks == 1 month looks weird. Maybe define window("timestamp", "1 month") as groupBy(getMonthInYear("timestamp")) is more intuitive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4 weeks == 1 month looks weird. Maybe define window("timestamp", "1 month") as groupBy(getMonthInYear("timestamp")) is more intuitive?
This looks hard to implement. Maybe we just don't need to support month or year.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, does this line mean that the user cannot use window("timestamp", "500 milliseconds")?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't operate on TimestampType as if they are Longs. They get cast to
LongType AFAIK, which has second precision. I'm not sure if supporting ms
is possible.
On Mar 29, 2016 11:14 PM, "Shixiong Zhu" notifications@github.com wrote:
In
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala
#12008 (comment):
- private def getIntervalInSeconds(interval: String): Long = {
- if (StringUtils.isBlank(interval)) {
throw new IllegalArgumentException("The window duration, slide duration and start time cannot be null or blank.")- }
- val intervalString = if (interval.startsWith("interval")) {
interval- } else {
"interval " + interval- }
- val cal = CalendarInterval.fromString(intervalString)
- if (cal == null) {
throw new IllegalArgumentException(s"The provided interval ($interval) did not correspond to a valid interval string.")- }
- (cal.months * 4 * CalendarInterval.MICROS_PER_WEEK + cal.microseconds) / 1000000
By the way, does it mean that the user cannot use window("timestamp", "500
milliseconds")?—
You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
https://github.com/apache/spark/pull/12008/files/b4e2fc23585413b1bf50e2487437dd38b9cd748f#r57840903
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timestamp precision i think is 100ns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we probably don't want to support month intervals since they are variable length. If people want to group on calendar boundaries they can use existing data functions.
|
Test build #54574 has finished for PR 12008 at commit
|
|
Test build #54575 has finished for PR 12008 at commit
|
|
Test build #54640 has finished for PR 12008 at commit
|
|
Thanks! Merging to master. |
## What changes were proposed in this pull request? The `window` function was added to Dataset with [this PR](#12008). This PR adds the Python, and SQL, API for this function. With this PR, SQL, Java, and Scala will share the same APIs as in users can use: - `window(timeColumn, windowDuration)` - `window(timeColumn, windowDuration, slideDuration)` - `window(timeColumn, windowDuration, slideDuration, startTime)` In Python, users can access all APIs above, but in addition they can do - In Python: `window(timeColumn, windowDuration, startTime=...)` that is, they can provide the startTime without providing the `slideDuration`. In this case, we will generate tumbling windows. ## How was this patch tested? Unit tests + manual tests Author: Burak Yavuz <brkyvz@gmail.com> Closes #12136 from brkyvz/python-windows.
## What changes were proposed in this pull request? The `window` function was added to Dataset with [this PR](#12008). This PR adds the R API for this function. With this PR, SQL, Java, and Scala will share the same APIs as in users can use: - `window(timeColumn, windowDuration)` - `window(timeColumn, windowDuration, slideDuration)` - `window(timeColumn, windowDuration, slideDuration, startTime)` In Python and R, users can access all APIs above, but in addition they can do - In R: `window(timeColumn, windowDuration, startTime=...)` that is, they can provide the startTime without providing the `slideDuration`. In this case, we will generate tumbling windows. ## How was this patch tested? Unit tests + manual tests Author: Burak Yavuz <brkyvz@gmail.com> Closes #12141 from brkyvz/R-windows.
What changes were proposed in this pull request?
This PR adds the function
windowas a column expression.windowcan be used to bucket rows into time windows given a time column. With this expression, performing time series analysis on batch data, as well as streaming data should become much more simpler.Usage
Assume the following schema:
sensor_id, measurement, timestampTo average 5 minute data every 1 minute (window length of 5 minutes, slide duration of 1 minute), we will use:
This will generate windows such as:
Intervals will start at every
slideDurationstarting at the unix epoch (1970-01-01 00:00:00 UTC).To start intervals at a different point of time, e.g. 30 seconds after a minute, the
startTimeparameter can be used.This will generate windows such as:
Support for Python will be made in a follow up PR after this.
How was this patch tested?
This patch has some basic unit tests for the
TimeWindowexpression testing that the parameters pass validation, and it also has some unit/integration tests testing the correctness of the windowing and usability in complex operations (multi-column grouping, multi-column projections, joins).