[SPARK-21414] Refine SlidingWindowFunctionFrame to avoid OOM. #18634
Conversation
Test build #79605 has finished for PR 18634 at commit
retest please
Jenkins, retest this please.
Test build #79608 has finished for PR 18634 at commit
jiangxb1987 left a comment
I think this improvement is valid, although it only takes effect in a relatively rare corner case (when CurrentRow is not in the window frame). One concern is that the test case doesn't reflect the improvement of this change.
How about:

```scala
while (nextRow != null && ubound.compare(nextRow, inputHighIndex, current, index) <= 0) {
  if (lbound.compare(nextRow, inputLowIndex, current, index) < 0) {
    // The incoming row is already below the lower bound: skip it instead of buffering it.
    inputLowIndex += 1
  } else {
    // Only rows that actually fall inside the current frame are copied into the buffer.
    buffer.add(nextRow.copy())
    bufferUpdated = true
  }
  nextRow = WindowFunctionFrame.getNextOrNull(inputIterator)
  inputHighIndex += 1
}
```

?
@jiangxb1987 Thanks a lot for the quick reply!
Yes, there is no unit test for
Test build #79671 has finished for PR 18634 at commit
| test("window function: mutiple window expressions specified by range in a single expression") { | ||
| val nums = sparkContext.parallelize(1 to 10).map(x => (x, x % 2)).toDF("x", "y") | ||
| nums.createOrReplaceTempView("nums") |
Wrap your test with `withTempView`, which drops the view automatically.
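For illustration, a minimal sketch of the suggested pattern (helper names taken from Spark's `SQLTestUtils` test trait; the query body is a placeholder):

```scala
// withTempView drops the named temp view automatically when the body
// finishes, even if an assertion inside the block fails.
withTempView("nums") {
  val nums = sparkContext.parallelize(1 to 10).map(x => (x, x % 2)).toDF("x", "y")
  nums.createOrReplaceTempView("nums")
  // ... run the window query and checkAnswer(...) against `expected` here ...
}
```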
BTW this test is not very related to this PR; it just improves test coverage for the range window frame.
And this test case doesn't cover the case when CurrentRow is not in the window frame. We'd better add that scenario.
Sure, I will add it later today.
@jinxing64 I think this patch is straightforward. Can you do a manual test that OOMs before this PR and works after it? We can put the test in the PR description so that other people can try it out.
@cloud-fan @jiangxb1987
```scala
spark.catalog.dropTempView("nums")
withTempView("nums") {
  val expected =
    Row(1, 1, 1, 4, null, 8, 25) ::
```
@cloud-fan This is null. Do you think 0 is better?
null is better, which matches the behavior in Aggregate.
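As a side illustration of the semantics being matched (plain Spark SQL, not part of this diff): when a sliding frame contains no rows, `sum` behaves like an aggregate over empty input and yields null rather than 0.

```scala
// Every frame [x + 100, x + 200] is empty for this data, so the window
// sum is null for each row, matching empty-Aggregate semantics.
spark.sql(
  """SELECT x,
    |       sum(x) OVER (ORDER BY x RANGE BETWEEN 100 FOLLOWING AND 200 FOLLOWING) AS s
    |FROM VALUES (1), (2) AS t(x)""".stripMargin).show()
```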
LGTM
```scala
Row(0, 4, 6, 12, 2, 14, 28) ::
Row(0, 6, 12, 18, 6, 18, 24) ::
Row(0, 8, 20, 24, 10, 10, 18) ::
Row(0, 10, 30, 18, 14, null, 10) ::
```
BTW, please make sure there is no behavior change, i.e. the result should be the same with or without this PR.
`expected` is calculated manually. This test verifies there is no behavior change.
Test build #79753 has finished for PR 18634 at commit
thanks, merging to master/2.2!
## What changes were proposed in this pull request?

In `SlidingWindowFunctionFrame`, the current logic first adds to the buffer all rows whose input row value is equal to or less than the output row's upper bound, and only then drops from the buffer the rows whose input row value is smaller than the output row's lower bound. The buffer can therefore grow very large even though the window itself is small. For example:

```
select a, b, sum(a) over (partition by b order by a range between 1000000 following and 1000001 following) from table
```

We can refine the logic to add only the qualified rows into the buffer.

## How was this patch tested?

Manual test: run the SQL `select shop, shopInfo, district, sum(revenue) over(partition by district order by revenue range between 100 following and 200 following) from revenueList limit 10` against a table with 4 columns (shop: String, shopInfo: String, district: String, revenue: Int). The biggest partition is around 2G bytes, containing 200k lines. Configure the executor with 2G bytes of memory.

With the change in this PR, it works fine. Without this change, the exception below is thrown:

```
java.lang.OutOfMemoryError: Java heap space
	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:504)
	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:62)
	at org.apache.spark.sql.execution.window.SlidingWindowFunctionFrame.write(WindowFunctionFrame.scala:201)
	at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:365)
	at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:289)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
```

Author: jinxing <jinxing6042@126.com>

Closes #18634 from jinxing64/SPARK-21414.

(cherry picked from commit 4eb081c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
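For context, here is a paraphrased sketch of the pre-PR buffering logic, reconstructed from the description above rather than copied from the source, reusing the variable names from the review suggestion earlier in this thread:

```scala
// Pre-PR (paraphrased): first copy every row at or below the upper bound
// into the buffer, then evict rows below the lower bound afterwards.
while (nextRow != null && ubound.compare(nextRow, inputHighIndex, current, index) <= 0) {
  buffer.add(nextRow.copy())  // rows still below the lower bound get buffered too
  nextRow = WindowFunctionFrame.getNextOrNull(inputIterator)
  inputHighIndex += 1
}
while (!buffer.isEmpty && lbound.compare(buffer.peek(), inputLowIndex, current, index) < 0) {
  buffer.remove()             // evicted only after having been copied
  inputLowIndex += 1
}
```

With a frame like `RANGE BETWEEN 1000000 FOLLOWING AND 1000001 FOLLOWING`, the first loop can copy an enormous span of rows before the second loop ever drops them, which is exactly the heap blow-up shown in the stack trace; the refined loop suggested above never buffers rows that are already below the lower bound.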