[SPARK-17549][SQL] Only collect table size stat in driver for cached relation. #15304

vanzin · 2016-09-29T21:39:12Z

This reverts commit 9ac68db. Turns out
the original fix was correct.

Original change description:
The existing code caches all stats for all columns for each partition
in the driver; for a large relation, this causes extreme memory usage,
which leads to gc hell and application failures.

It seems that only the size in bytes of the data is actually used in the
driver, so instead just collect that. In executors, the full stats are
still kept, but that's not a big problem; we expect the data to be distributed
and thus not really incur too much memory pressure in each individual
executor.
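
To illustrate the difference in driver-side state described above, here is a minimal sketch in plain Python. It does not use any Spark API and is not the code from this patch; `ColumnStats`, `make_batch`, and the two `driver_state_*` functions are hypothetical names, and the per-column figures are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical per-column stats for one cached batch."""
    lower: int
    upper: int
    null_count: int
    row_count: int
    size_in_bytes: int


# One cached batch: column name -> stats. Executors keep this in both models.
BatchStats = Dict[str, ColumnStats]


def make_batch(num_columns: int, rows: int, bytes_per_value: int = 8) -> BatchStats:
    """Build made-up stats for a batch of `num_columns` integer columns."""
    return {
        f"c{i}": ColumnStats(0, rows - 1, 0, rows, rows * bytes_per_value)
        for i in range(num_columns)
    }


def batch_size_in_bytes(batch: BatchStats) -> int:
    """Total size of one batch, summed over its columns."""
    return sum(stats.size_in_bytes for stats in batch.values())


def driver_state_before(batches: List[BatchStats]) -> List[BatchStats]:
    """Model of the old behaviour: the driver holds every stat for every
    column of every batch."""
    return list(batches)


def driver_state_after(batches: List[BatchStats]) -> int:
    """Model of the new behaviour: the driver holds one running total."""
    return sum(batch_size_in_bytes(batch) for batch in batches)


# Assumed workload: 200 cached batches, 50 columns each, 10,000 rows per batch.
batches = [make_batch(num_columns=50, rows=10_000) for _ in range(200)]

full = driver_state_before(batches)
total = driver_state_after(batches)

# Number of stats objects the driver would retain in the old model.
stat_objects_held = sum(len(batch) for batch in full)

print(f"before: {stat_objects_held} ColumnStats objects held in the driver")
print(f"after:  one integer, size_in_bytes = {total}")
```

In this model the first approach grows with (batches × columns), while the second stays at a single number no matter how many batches are cached.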

There are also potential improvements on the executor side, since the data
being stored currently is very wasteful (e.g. storing boxed types vs.
primitive types for stats). But that's a separate issue.

rxin · 2016-09-29T22:21:00Z

This is a revert of the revert?

vanzin · 2016-09-29T22:26:13Z

Correct.

vanzin · 2016-09-29T22:42:12Z

I updated the summary to include the original change description, so that people don't have to click through multiple PRs to read what it's about.

SparkQA · 2016-09-29T23:47:07Z

Test build #66129 has finished for PR 15304 at commit b81ad6b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2016-10-03T18:36:12Z

@yhuai I'll push this tomorrow if I don't hear from you.

yhuai · 2016-10-04T02:11:03Z

Changes look good. How about we change the title back to [SPARK-17549] [SQL] Only collect table size stat in driver for cached relation? Thanks!

vanzin · 2016-10-04T16:38:25Z

Done. Merging to master / 2.0.

yhuai · 2016-10-04T16:40:07Z

Thanks!

…relation.

This reverts commit 9ac68db. Turns out the original fix was correct.

Original change description:
The existing code caches all stats for all columns for each partition in the driver; for a large relation, this causes extreme memory usage, which leads to gc hell and application failures.

It seems that only the size in bytes of the data is actually used in the driver, so instead just collect that. In executors, the full stats are still kept, but that's not a big problem; we expect the data to be distributed and thus not really incur too much memory pressure in each individual executor.

There are also potential improvements on the executor side, since the data being stored currently is very wasteful (e.g. storing boxed types vs. primitive types for stats). But that's a separate issue.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #15304 from vanzin/SPARK-17549.2.

(cherry picked from commit 8d969a2)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

Revert "[SPARK-17549][SQL] Revert Only collect table size stat in dri…

b81ad6b

…ver for cached relation." This reverts commit 9ac68db. Turns out the original fix was correct.

vanzin mentioned this pull request Sep 29, 2016

[SPARK-17549][sql] Coalesce cached relation stats in driver. #15189

Closed

vanzin changed the title ~~Revert "[SPARK-17549][SQL] Revert Only collect table size stat in driver for cached relation."~~ [SPARK-17549][SQL] Only collect table size stat in driver for cached relation. Oct 4, 2016

asfgit closed this in 8d969a2 Oct 4, 2016

vanzin deleted the SPARK-17549.2 branch November 30, 2016 22:58
