
Conversation

@davies
Contributor

@davies davies commented Nov 17, 2015

Currently the size of a cached batch is controlled only by batchSize (default value is 10000 rows), which does not account for the size of the serialized columns (for example, complex types). The memory used to build the batch is not tracked, so it's easy to OOM (especially after unified memory management).

This PR introduces a hard limit of 4MB for the total size of the columns in a batch (roughly 50 uncompressed primitive columns).

It also changes how the column buffers grow: double the buffer each time it fills up, then trim it once the batch is finished. A minimal sketch of this grow-by-doubling-then-trim idea is shown below.
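
The sketch below illustrates the strategy only; the object and method names (`BufferGrowth.ensureFreeSpace`, `trim`) are assumptions for illustration, not the actual ColumnBuilder code in this PR:

```scala
import java.nio.ByteBuffer

// Hypothetical helper sketching the grow-by-doubling-then-trim strategy.
// Names and signatures are illustrative only, not Spark's ColumnBuilder API.
object BufferGrowth {

  // Make sure `buffer` has room for `extra` more bytes, doubling capacity as needed.
  def ensureFreeSpace(buffer: ByteBuffer, extra: Int): ByteBuffer = {
    if (buffer.remaining() >= extra) {
      buffer
    } else {
      var newSize = math.max(buffer.capacity(), 1)
      while (newSize - buffer.position() < extra) {
        newSize *= 2  // double instead of growing by a fixed chunk each time
      }
      val grown = ByteBuffer.allocate(newSize)
      buffer.flip()     // prepare the existing bytes for copying
      grown.put(buffer)
      grown             // positioned right after the copied data, ready for more writes
    }
  }

  // Once the batch is finished, copy into a right-sized buffer so the cached
  // batch only keeps the bytes that were actually written.
  def trim(buffer: ByteBuffer): ByteBuffer = {
    buffer.flip()
    val trimmed = ByteBuffer.allocate(buffer.limit())
    trimmed.put(buffer)
    trimmed.flip()
    trimmed
  }
}
```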

cc @liancheng

@SparkQA

SparkQA commented Nov 17, 2015

Test build #46062 has finished for PR 9760 at commit d57c180.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2015

Test build #2069 has finished for PR 9760 at commit d57c180.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2015

Test build #2070 has finished for PR 9760 at commit 55a905b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class BitSet(numBits: Int) extends Serializable
      • class StreamingListener(object):
      • case class JSONOptions(
      • abstract class Aggregator[-A, B, C] extends Serializable

@liancheng
Contributor

This change LGTM, but I don't quite understand this line in the PR description:

... This PR introduces a hard limit of 4MB for the total size of the columns in a batch (roughly 50 uncompressed primitive columns).

Where does the "50" come from? Is it a hard limit defined somewhere else, or an estimate for the average use case?

@davies
Contributor Author

davies commented Nov 17, 2015

It's an estimate based on LongType/DoubleType; it means most tables with only primitive types will not be affected.
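
For reference, the back-of-the-envelope arithmetic behind that estimate (assuming the default batchSize of 10000 rows and 8-byte values such as LongType/DoubleType):

```scala
// Assumes batchSize = 10000 rows and 8 bytes per primitive value (Long/Double).
val bytesPerColumn = 10000 * 8                          // 80,000 bytes per uncompressed column
val columnsIn4MB   = (4 * 1024 * 1024) / bytesPerColumn // ≈ 52 columns, hence "up to 50"
```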

asfgit pushed a commit that referenced this pull request Nov 17, 2015
Currently the size of a cached batch is controlled only by `batchSize` (default value is 10000 rows), which does not account for the size of the serialized columns (for example, complex types). The memory used to build the batch is not tracked, so it's easy to OOM (especially after unified memory management).

This PR introduces a hard limit of 4MB for the total size of the columns in a batch (roughly 50 uncompressed primitive columns).

It also changes how the column buffers grow: double the buffer each time it fills up, then trim it once the batch is finished.

cc liancheng

Author: Davies Liu <davies@databricks.com>

Closes #9760 from davies/cache_limit.

(cherry picked from commit 5aca6ad)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
@asfgit asfgit closed this in 5aca6ad Nov 17, 2015