
Conversation

@davies
Contributor

@davies davies commented Nov 17, 2015

Currently the size of a cached batch is controlled only by batchSize (default value is 10000 rows), which does not account for the size of the serialized columns (for example, complex types). The memory used to build the batch is not tracked, so it's easy to OOM (especially after unified memory management).

This PR introduces a hard limit of 4MB for the total size of the columns in a batch (roughly 50 uncompressed primitive columns).

It also changes how the column buffers grow: double the buffer each time it fills up, then trim it once the batch is finished. A minimal sketch of this grow-by-doubling-then-trim idea is shown below.
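
The sketch below illustrates the strategy only; the object and method names (`BufferGrowth.ensureFreeSpace`, `trim`) are assumptions for illustration, not the actual ColumnBuilder code in this PR:

```scala
import java.nio.ByteBuffer

// Hypothetical helper sketching the grow-by-doubling-then-trim strategy.
// Names and signatures are illustrative only, not Spark's ColumnBuilder API.
object BufferGrowth {

  // Make sure `buffer` has room for `extra` more bytes, doubling capacity as needed.
  def ensureFreeSpace(buffer: ByteBuffer, extra: Int): ByteBuffer = {
    if (buffer.remaining() >= extra) {
      buffer
    } else {
      var newSize = math.max(buffer.capacity(), 1)
      while (newSize - buffer.position() < extra) {
        newSize *= 2  // double instead of growing by a fixed chunk each time
      }
      val grown = ByteBuffer.allocate(newSize)
      buffer.flip()     // prepare the existing bytes for copying
      grown.put(buffer)
      grown             // positioned right after the copied data, ready for more writes
    }
  }

  // Once the batch is finished, copy into a right-sized buffer so the cached
  // batch only keeps the bytes that were actually written.
  def trim(buffer: ByteBuffer): ByteBuffer = {
    buffer.flip()
    val trimmed = ByteBuffer.allocate(buffer.limit())
    trimmed.put(buffer)
    trimmed.flip()
    trimmed
  }
}
```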

cc @liancheng

@SparkQA

SparkQA commented Nov 17, 2015

Test build #46062 has finished for PR 9760 at commit d57c180.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2015

Test build #2069 has finished for PR 9760 at commit d57c180.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2015

Test build #2070 has finished for PR 9760 at commit 55a905b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class BitSet(numBits: Int) extends Serializable
      • class StreamingListener(object):
      • case class JSONOptions(
      • abstract class Aggregator[-A, B, C] extends Serializable

@liancheng
Contributor

This change LGTM, but I don't quite understand this line in the PR description:

... This PR introduces a hard limit of 4MB for the total size of the columns in a batch (roughly 50 uncompressed primitive columns).

Where does the "50" come from? Is it a hard limit defined somewhere else, or an estimate for the average use case?

@davies
Contributor Author

davies commented Nov 17, 2015

It's an estimate based on LongType/DoubleType; it means most tables with only primitive types will not be affected.
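
For reference, the back-of-the-envelope arithmetic behind that estimate (assuming the default batchSize of 10000 rows and 8-byte values such as LongType/DoubleType):

```scala
// Assumes batchSize = 10000 rows and 8 bytes per primitive value (Long/Double).
val bytesPerColumn = 10000 * 8                          // 80,000 bytes per uncompressed column
val columnsIn4MB   = (4 * 1024 * 1024) / bytesPerColumn // ≈ 52 columns, hence "up to 50"
```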

asfgit pushed a commit that referenced this pull request Nov 17, 2015
Currently the size of a cached batch is controlled only by `batchSize` (default value is 10000 rows), which does not account for the size of the serialized columns (for example, complex types). The memory used to build the batch is not tracked, so it's easy to OOM (especially after unified memory management).

This PR introduces a hard limit of 4MB for the total size of the columns in a batch (roughly 50 uncompressed primitive columns).

It also changes how the column buffers grow: double the buffer each time it fills up, then trim it once the batch is finished.

cc liancheng

Author: Davies Liu <davies@databricks.com>

Closes #9760 from davies/cache_limit.

(cherry picked from commit 5aca6ad)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
@asfgit asfgit closed this in 5aca6ad Nov 17, 2015