[SPARK-38069][SQL][SS] Improve the calculation of time window #35362
Conversation
cc @hvanhovell FYI

Can one of the admins verify this patch?
case _ => Metadata.empty
}
nit. Please remove this redundant new line addition.
@dongjoon-hyun
OK, thanks!
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
improve structured streaming window calculation
Thanks for the contribution! Given the code change is critical to the fixed time window calculation, could you please fill out the details of the math in the PR description? It would be great if you could provide some calculation examples as well: tumbling window / sliding window with start time. Since the existing logic has worked for years, we need to be very confident before changing it, even if the new logic is considered better. (We often struggle with regressions, especially correctness issues.) Thanks for understanding.
cc. @brkyvz since he's an author of the code, although the code was committed 5+ years ago.

@nyingping Benchmarks in SQL are located in sql/core/test/scala. Below is the simple benchmark code from #18364 - it didn't leverage the benchmark framework, but it should give you some sense of how to create benchmark code. In the benchmark framework you'd want to remove spark.time and leverage the functionality of the framework instead. If learning the benchmark framework feels like too much bootstrapping, please start with the above code (with tumbling/sliding windows); if the code can show the difference, that would be sufficient.
@HeartSaVioR I have modified the content according to your suggestions, and I look forward to your review.
I just played with my own simple benchmark (in the commit above), and the gain is much larger than stated in the PR description: up to 30% for tumbling windows and 60% for sliding windows. (I expect the gain gets bigger if maxNumOverlapping is higher.) I'll update the PR description to contain the benchmark result. I also did some calculations by hand based on the new math to create sliding windows with offset, and it seemed OK. I can't think of cases the new math may miss.
cc. @tdas @zsxwing @viirya @xuanyuanking Would like another pair of eyes from reviewers. Thanks in advance!
cc. @alex-balikov @jerrypeng as well.

Thank you for providing a more professional benchmark.
 * for (i <- 0 until maxNumOverlapping)
 *   windowId <- ceil((timestamp - startTime) / slideDuration)
 *   windowStart <- windowId * slideDuration + (i - maxNumOverlapping) * slideDuration + startTime
 * lastStart <- timestamp - (timestamp - startTime + windowDuration) % windowDuration
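The arithmetic in the quoted comment can be sanity-checked outside Spark. Below is a minimal Python sketch (illustrative names, integer seconds — not Spark code, which operates on microsecond longs inside generated expressions) comparing a direct lastStart computation against a brute-force enumeration of candidate window starts. It uses slideDuration in the remainder, which is what the code actually does.

```python
import math

def windows_brute_force(ts, start_time, window_dur, slide_dur):
    """Reference: try every candidate start start_time + k * slide_dur near ts."""
    out = []
    k_min = math.floor((ts - window_dur - start_time) / slide_dur)
    k_max = math.floor((ts - start_time) / slide_dur) + 1
    for k in range(k_min, k_max + 1):
        s = start_time + k * slide_dur
        if s <= ts < s + window_dur:   # windows are half-open [start, end)
            out.append((s, s + window_dur))
    return out

def windows_via_last_start(ts, start_time, window_dur, slide_dur):
    # Python's % already returns a nonnegative remainder for a positive
    # modulus, so no extra adjustment is needed here even when ts < 0
    # (unlike Scala/Java, whose % truncates toward zero).
    last_start = ts - (ts - start_time) % slide_dur
    out = []
    s = last_start
    while s > ts - window_dur:
        out.append((s, s + window_dur))
        s -= slide_dur
    return sorted(out)

# Sliding windows: 10s window, 5s slide, no offset.
assert windows_via_last_start(12, 0, 10, 5) == [(5, 15), (10, 20)]
assert windows_via_last_start(12, 0, 10, 5) == windows_brute_force(12, 0, 10, 5)
# Also agrees for timestamps before the epoch (ts < 0), thanks to Python's %:
assert windows_via_last_start(-3, 0, 10, 5) == windows_brute_force(-3, 0, 10, 5)
```

Both functions agree on every timestamp tried, including pre-epoch ones; the difference in `%` semantics between Python and Scala is exactly what the later follow-up fix is about.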
Hmm, I suppose this is the window start time of last window. Don't we need to consider slide duration when calculating last start? I think this is only correct if the slide duration equals to window duration.
That's inconsistency between code and comment :) Nice finding.
@nyingping Could you please fix the comment here? Thanks!
@HeartSaVioR Oh, yea, I verified the calculation by looking at the comment. The code looks correct.
The code comment is not fixed; please don't resolve the conversation manually. You can fix it, push the commit and the comment thread will become outdated.
viirya
left a comment
Seems okay, if I don't miss anything when verifying the calculation.
remove parameter
@HeartSaVioR @viirya Yes, of course. Thanks!
(I do not know why my reply is always pending and cannot be displayed, so I reply here. Sorry.)
Yea, just to update the comment based on the code you changed, i.e. to make them consistent.
@HeartSaVioR
HeartSaVioR
left a comment
+1
Thanks. Merging to master.

Thank you all!
… < 0

### What changes were proposed in this pull request?

I tried to understand what was introduced in #36737, made the code more readable, and added some tests. Many thanks to nyingping!

The change in #35362 introduced a bug when the `timestamp` is less than 0, i.e. before `1970-01-01 00:00:00 UTC`: for some windows, Spark returns a wrong `windowStart` time. The root cause is how the modulo operator (%) works with negative numbers. For example:

```
scala> 1 % 3
res0: Int = 1

scala> -1 % 3
res1: Int = -1 // Mathematically it should be 2 here
```

This leads to a wrong calculation of `windowStart`. For a concrete example:

```
 * Example calculation:
 * For simplicity assume windowDuration = slideDuration.
 * | x x x x x x x x x x x x | x x x x x x x x x x x x | x x x x x x x x x x x x |
 *                           |---- l1 ----|---- l2 ----|
 *                       lastStart    timestamp    lastStartWrong
 * Normally when timestamp > startTime (or equally remainder > 0), we get
 * l1 = remainder = (timestamp - startTime) % slideDuration, lastStart = timestamp - remainder
 * However, when timestamp < startTime (or equally remainder < 0), the value of remainder is
 * -l2 (note the negative sign), and lastStart is then at the position of lastStartWrong.
 * So we need to subtract a slideDuration.
```

### Why are the changes needed?

This is a bug fix. Example from the original PR #36737: here df3 and df4 have times before 1970, so timestamp < 0.

```
val df3 = Seq(
  ("1969-12-31 00:00:02", 1),
  ("1969-12-31 00:00:12", 2)).toDF("time", "value")
val df4 = Seq(
  (LocalDateTime.parse("1969-12-31T00:00:02"), 1),
  (LocalDateTime.parse("1969-12-31T00:00:12"), 2)).toDF("time", "value")

Seq(df3, df4).foreach { df =>
  checkAnswer(
    df.select(window($"time", "10 seconds", "10 seconds", "5 seconds"), $"value")
      .orderBy($"window.start".asc)
      .select($"window.start".cast(StringType), $"window.end".cast(StringType), $"value"),
    Seq(
      Row("1969-12-30 23:59:55", "1969-12-31 00:00:05", 1),
      Row("1969-12-31 00:00:05", "1969-12-31 00:00:15", 2))
  )
}
```

Without the change this would fail with:

```
== Results ==
!== Correct Answer - 2 ==                      == Spark Answer - 2 ==
!struct<>                                      struct<CAST(window.start AS STRING):string,CAST(window.end AS STRING):string,value:int>
![1969-12-30 23:59:55,1969-12-31 00:00:05,1]   [1969-12-31 00:00:05,1969-12-31 00:00:15,1]
![1969-12-31 00:00:05,1969-12-31 00:00:15,2]   [1969-12-31 00:00:15,1969-12-31 00:00:25,2]
```

Notice how the answer is shifted by one `slideDuration`: it should start with `[1969-12-30 23:59:55,1969-12-31 00:00:05,1]`, but Spark returns `[1969-12-31 00:00:05,1969-12-31 00:00:15,1]`, right-shifted by one `slideDuration` (10 seconds).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

### Benchmark results

1. Burak's original implementation

```
[info] Apple M1 Max
[info] tumbling windows:             Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] burak version                            10             17          14       962.7           1.0       1.0X
[info] Running benchmark: sliding windows
[info] Running case: burak version
[info] Stopped after 16 iterations, 10604 ms
[info] OpenJDK 64-Bit Server VM 11.0.12+7-LTS on Mac OS X 12.5.1
[info] Apple M1 Max
[info] sliding windows:              Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] burak version                           646            663          19        15.5          64.6       1.0X
```

2. Current implementation (buggy)

```
[info] Running benchmark: tumbling windows
[info] Running case: current - buggy
[info] Stopped after 637 iterations, 10008 ms
[info] OpenJDK 64-Bit Server VM 11.0.12+7-LTS on Mac OS X 12.5.1
[info] Apple M1 Max
[info] tumbling windows:             Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] current - buggy                          10             16          12      1042.7           1.0       1.0X
[info] Running benchmark: sliding windows
[info] Running case: current - buggy
[info] Stopped after 16 iterations, 10143 ms
[info] OpenJDK 64-Bit Server VM 11.0.12+7-LTS on Mac OS X 12.5.1
[info] Apple M1 Max
[info] sliding windows:              Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] current - buggy                         617            634          10        16.2          61.7       1.0X
```

3. Proposed change in this PR

```
[info] Apple M1 Max
[info] tumbling windows:             Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] purposed change                          10             16          11       981.2           1.0       1.0X
[info] Running benchmark: sliding windows
[info] Running case: purposed change
[info] Stopped after 18 iterations, 10122 ms
[info] OpenJDK 64-Bit Server VM 11.0.12+7-LTS on Mac OS X 12.5.1
[info] Apple M1 Max
[info] sliding windows:              Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] purposed change                         548            562          19        18.3          54.8       1.0X
```

Note that I ran them separately, because I found that if the benchmarks run sequentially, the later one always gets a performance gain; I suspect some machine-level warm-up or optimization.

Closes #39843 from WweiL/SPARK-38069-time-window-fix.

Lead-authored-by: Wei Liu <wei.liu@databricks.com>
Co-authored-by: nieyingping <nieyingping@alphadata.com.cn>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
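The negative-remainder behavior described in that commit message can be reproduced outside Scala. Here is a hypothetical Python sketch (function names are mine, not Spark's) that emulates Java/Scala's truncating `%` and applies the described fix, checked against the PR's 1969-12-31 example in epoch seconds:

```python
def java_rem(a, b):
    # Java/Scala's % truncates toward zero (Python's // floors instead),
    # so the remainder can be negative when a < 0.
    q = a // b
    if q < 0 and q * b != a:
        q += 1
    return a - b * q

# The behavior shown in the scala> snippet above:
assert java_rem(1, 3) == 1
assert java_rem(-1, 3) == -1   # Python's own (-1) % 3 would give 2

def last_start_fixed(ts, start_time, slide_dur):
    # The fix: a negative remainder means we landed on lastStartWrong,
    # so subtract one slideDuration.
    remainder = java_rem(ts - start_time, slide_dur)
    last = ts - remainder
    return last - slide_dur if remainder < 0 else last

# PR example: 1969-12-31 00:00:02 is -86398 epoch seconds; with a 10s slide
# and a 5s start offset the window should start at 1969-12-30 23:59:55, i.e.
# -86405 epoch seconds.
assert last_start_fixed(-86398, 5, 10) == -86405
assert last_start_fixed(12, 0, 5) == 10   # positive timestamps are unaffected
```

The exact-integer `java_rem` avoids float division so the same check would also hold at microsecond magnitudes.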
What changes were proposed in this pull request?
Remove the CaseWhen and modify the calculation of the window.

New logic:

`lastStart` needs to be the start of the last window containing the timestamp: it is less than `timestamp` and is the maximum integer multiple of the window size (plus the start offset). `lastStart` equals `timestamp` minus the time left over beyond that maximum integer multiple:

`val lastStart = timestamp - (timestamp - window.startTime + window.slideDuration) % window.slideDuration`

After getting `lastStart`, `lastEnd` is obvious, and the other possible windows can be computed from `i` and the window size.

Why are the changes needed?

Structured Streaming computes windows through an intermediate windowId, and then computes each window with a CaseWhen. We can instead use Flink's method of calculating the window, which is easier to understand, simpler, and more efficient.
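As a quick worked example of the formula, here is a hedged Python sketch with made-up numbers (illustrative integer seconds; Spark itself works on microsecond longs). Note that adding `slideDuration` inside the remainder does not change the result mathematically, but under Java's truncating `%` it keeps the remainder nonnegative when `timestamp - startTime` is slightly negative:

```python
# Tumbling window (slideDuration == windowDuration) with a startTime offset.
window_dur = slide_dur = 10
start_time = 5              # windows are [5, 15), [15, 25), ...
timestamp = 22

# lastStart = timestamp - (timestamp - startTime + slideDuration) % slideDuration
last_start = timestamp - (timestamp - start_time + slide_dur) % slide_dur
last_end = last_start + window_dur

assert (last_start, last_end) == (15, 25)   # timestamp 22 falls in [15, 25)
```

For a tumbling window this single `(lastStart, lastEnd)` pair is the answer; for sliding windows the earlier windows follow by repeatedly subtracting `slideDuration` from `lastStart`.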
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests, as this is just a refactoring.
Also composed and ran a simple benchmark in this commit: HeartSaVioR@d532b6f
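For readers who want a rough feel for the comparison without the Spark benchmark framework, here is a hypothetical, self-contained Python micro-benchmark (names and numbers are mine). It models only the window-assignment arithmetic, not Spark's code generation, and the candidate-enumeration version omits the exact-boundary CaseWhen handling of the original code:

```python
import math
import timeit

WINDOW, SLIDE, START = 10, 5, 0          # 10s window, 5s slide, no offset
MAX_OVERLAP = math.ceil(WINDOW / SLIDE)  # candidate windows per timestamp

def via_candidates(ts):
    # Old style: enumerate maxNumOverlapping candidates, keep those containing ts.
    # (Exact multiples of SLIDE would need the CaseWhen this PR removes.)
    window_id = math.ceil((ts - START) / SLIDE)
    out = []
    for i in range(MAX_OVERLAP):
        s = (window_id + i - MAX_OVERLAP) * SLIDE + START
        if s <= ts < s + WINDOW:
            out.append((s, s + WINDOW))
    return out

def via_last_start(ts):
    # New style: compute the last containing window directly, walk backwards.
    last = ts - (ts - START) % SLIDE
    return [(s, s + WINDOW) for s in range(last, ts - WINDOW, -SLIDE)]

timestamps = range(1, 50_000, 7)  # step 7 mostly avoids slide boundaries
t_old = timeit.timeit(lambda: [via_candidates(t) for t in timestamps], number=5)
t_new = timeit.timeit(lambda: [via_last_start(t) for t in timestamps], number=5)
print(f"candidates: {t_old:.3f}s  last_start: {t_new:.3f}s")
```

Absolute numbers from such a toy loop say nothing about Spark's generated code; the real measurements are the ones in the benchmark commit above.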
Quoting queries used to benchmark the change:
Results are as follows: