Skip to content

Conversation

@vanzin
Copy link
Contributor

@vanzin vanzin commented Sep 29, 2016

This reverts commit 9ac68db. Turns out
the original fix was correct.

Original change description:
The existing code caches all stats for all columns for each partition
in the driver; for a large relation, this causes extreme memory usage,
which leads to gc hell and application failures.

It seems that only the size in bytes of the data is actually used in the
driver, so instead just colllect that. In executors, the full stats are
still kept, but that's not a big problem; we expect the data to be distributed
and thus not really incur in too much memory pressure in each individual
executor.

There are also potential improvements on the executor side, since the data
being stored currently is very wasteful (e.g. storing boxed types vs.
primitive types for stats). But that's a separate issue.

…ver for cached relation."

This reverts commit 9ac68db. Turns out
the original fix was correct.
@rxin
Copy link
Contributor

rxin commented Sep 29, 2016

This is a revert of the revert?

@vanzin
Copy link
Contributor Author

vanzin commented Sep 29, 2016

Correct.

@vanzin
Copy link
Contributor Author

vanzin commented Sep 29, 2016

I updated the summary to have the original summary, so that people don't have to click through multiple PRs to read what it's about.

@SparkQA
Copy link

SparkQA commented Sep 29, 2016

Test build #66129 has finished for PR 15304 at commit b81ad6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor Author

vanzin commented Oct 3, 2016

@yhuai I'll push this tomorrow if I don't hear from you.

@yhuai
Copy link
Contributor

yhuai commented Oct 4, 2016

Changes look good. How about we change the title back to [SPARK-17549] [SQL] Only collect table size stat in driver for cached relation? Thanks!

@vanzin vanzin changed the title Revert "[SPARK-17549][SQL] Revert Only collect table size stat in driver for cached relation." [SPARK-17549][SQL] Only collect table size stat in driver for cached relation. Oct 4, 2016
@vanzin
Copy link
Contributor Author

vanzin commented Oct 4, 2016

Done. Merging to master / 2.0.

@yhuai
Copy link
Contributor

yhuai commented Oct 4, 2016

Thanks!

asfgit pushed a commit that referenced this pull request Oct 4, 2016
…relation.

This reverts commit 9ac68db. Turns out
the original fix was correct.

Original change description:
The existing code caches all stats for all columns for each partition
in the driver; for a large relation, this causes extreme memory usage,
which leads to gc hell and application failures.

It seems that only the size in bytes of the data is actually used in the
driver, so instead just colllect that. In executors, the full stats are
still kept, but that's not a big problem; we expect the data to be distributed
and thus not really incur in too much memory pressure in each individual
executor.

There are also potential improvements on the executor side, since the data
being stored currently is very wasteful (e.g. storing boxed types vs.
primitive types for stats). But that's a separate issue.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #15304 from vanzin/SPARK-17549.2.

(cherry picked from commit 8d969a2)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
@asfgit asfgit closed this in 8d969a2 Oct 4, 2016
abridgett pushed a commit to opensignal/spark that referenced this pull request Oct 10, 2016
…relation.

This reverts commit 9ac68db. Turns out
the original fix was correct.

Original change description:
The existing code caches all stats for all columns for each partition
in the driver; for a large relation, this causes extreme memory usage,
which leads to gc hell and application failures.

It seems that only the size in bytes of the data is actually used in the
driver, so instead just colllect that. In executors, the full stats are
still kept, but that's not a big problem; we expect the data to be distributed
and thus not really incur in too much memory pressure in each individual
executor.

There are also potential improvements on the executor side, since the data
being stored currently is very wasteful (e.g. storing boxed types vs.
primitive types for stats). But that's a separate issue.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#15304 from vanzin/SPARK-17549.2.
@vanzin vanzin deleted the SPARK-17549.2 branch November 30, 2016 22:58
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…relation.

This reverts commit 9ac68db. Turns out
the original fix was correct.

Original change description:
The existing code caches all stats for all columns for each partition
in the driver; for a large relation, this causes extreme memory usage,
which leads to gc hell and application failures.

It seems that only the size in bytes of the data is actually used in the
driver, so instead just colllect that. In executors, the full stats are
still kept, but that's not a big problem; we expect the data to be distributed
and thus not really incur in too much memory pressure in each individual
executor.

There are also potential improvements on the executor side, since the data
being stored currently is very wasteful (e.g. storing boxed types vs.
primitive types for stats). But that's a separate issue.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#15304 from vanzin/SPARK-17549.2.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants