[SPARK-17356][SQL] Fix out of memory issue when generating JSON for TreeNode #14915

clockfly · 2016-09-01T09:07:45Z

What changes were proposed in this pull request?

The class org.apache.spark.sql.types.Metadata is widely used in MLlib to store ML attributes. Metadata is commonly stored in Alias expressions:

case class Alias(child: Expression, name: String)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifier: Option[String] = None,
    val explicitMetadata: Option[Metadata] = None,
    override val isGenerated: java.lang.Boolean = false)

The Metadata can have a large memory footprint when the number of attributes is large (on the order of millions). When toJSON is called on an Alias expression, the Metadata is also converted to a large JSON string.
If a plan contains many such Alias expressions, calling toJSON may trigger an out of memory error, because converting every Metadata reference to JSON requires a huge amount of memory.

With this PR, we skip Metadata when doing the JSON conversion. For a reproducer of the OOM and an analysis, please see the JIRA issue https://issues.apache.org/jira/browse/SPARK-17356.
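The idea of skipping Metadata during JSON conversion can be sketched without any Spark dependency. Everything below is a simplified stand-in written for illustration: Meta, AliasLike, ToJsonSketch and the skipMetadata flag are hypothetical names, not Spark's classes, and the string building does no JSON escaping.

```scala
// Simplified stand-ins for illustration only; these are NOT Spark's real classes.
final case class Meta(entries: Map[String, String]) {
  // Renders every entry, so the output grows with the number of attributes.
  def toJson: String =
    entries.map { case (k, v) => "\"" + k + "\":\"" + v + "\"" }.mkString("{", ",", "}")
}

object Meta {
  val empty: Meta = Meta(Map.empty)
}

// Hypothetical expression node that carries optional metadata, loosely modeled on Alias.
final case class AliasLike(name: String, explicitMetadata: Option[Meta])

object ToJsonSketch {
  // Converts one field value to a JSON fragment (no string escaping in this sketch).
  def fieldToJson(value: Any, skipMetadata: Boolean): String = value match {
    // The point of the sketch: emit an empty object instead of the full metadata.
    case m: Meta     => if (skipMetadata) Meta.empty.toJson else m.toJson
    case Some(inner) => fieldToJson(inner, skipMetadata)
    case None        => "null"
    case s: String   => "\"" + s + "\""
    case other       => "\"" + other.toString + "\""
  }

  def aliasToJson(alias: AliasLike, skipMetadata: Boolean): String =
    "{\"name\":" + fieldToJson(alias.name, skipMetadata) +
      ",\"explicitMetadata\":" + fieldToJson(alias.explicitMetadata, skipMetadata) + "}"
}
```

With skipMetadata set to true, the size of the output no longer depends on how many attributes the metadata holds, which is the property the fix relies on.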

How was this patch tested?

Existing tests.

clockfly · 2016-09-01T09:14:29Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala

The current implementation of toJSON recursively searches every Map and Seq and tries to convert every field to JSON.

This is quite risky: we don't know what data is stored in an arbitrary Seq or Map, and a huge Seq or Map can easily trigger an OOM.

Maybe we should disable converting Seq and Map?
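To make the concern concrete, here is a minimal sketch of a recursive converter whose collection handling can be switched off. It is not the code in TreeNode.scala; CollectionJsonSketch and the convertCollections flag are assumptions made for this example, and strings are not escaped.

```scala
// Hypothetical sketch; not the actual TreeNode.toJSON implementation.
object CollectionJsonSketch {
  private def quote(s: String): String = "\"" + s + "\""

  // When convertCollections is true, the output size grows with the total number
  // of elements reachable through nested Seq and Map values.
  def toJson(value: Any, convertCollections: Boolean): String = value match {
    case s: String  => quote(s)
    case n: Int     => n.toString
    case b: Boolean => b.toString
    case seq: Seq[_] =>
      if (convertCollections) seq.map(toJson(_, convertCollections)).mkString("[", ",", "]")
      else "null" // collection skipped entirely
    case map: Map[_, _] =>
      if (convertCollections) {
        map.map { case (k, v) => quote(k.toString) + ":" + toJson(v, convertCollections) }
          .mkString("{", ",", "}")
      } else "null" // collection skipped entirely
    case _ => "null" // unknown types are not converted in this sketch
  }
}
```

Disabling collection conversion bounds the output regardless of collection size, at the cost of dropping information that some consumers of the JSON may want.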

clockfly · 2016-09-01T09:16:12Z

@mengxr @yhuai, comments?

SparkQA · 2016-09-01T11:13:51Z

Test build #64772 has finished for PR 14915 at commit 368e097.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2016-09-01T17:23:40Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala

yhuai (Sep 1, 2016): I think this comment deserves an example. Also, it would be good to create a JIRA with your example.

clockfly (Sep 2, 2016): I will create a follow-up JIRA to refactor toJSON.

yhuai (Sep 3, 2016): I think it is better to make the comment self-contained, so readers of this part do not need to guess or search the JIRA to understand what this line means.

clockfly (Sep 5, 2016): I removed this comment. I tried, but it seems to require a big block to explain what this TODO means, and I feel it may create more confusion.

yhuai · 2016-09-01T18:08:08Z

btw, we also need to merge it to branch 1.6, which also has toJSON (https://github.com/apache/spark/blob/branch-1.6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L533).

SparkQA · 2016-09-02T05:27:25Z

Test build #64828 has finished for PR 14915 at commit 39f3c63.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-06T00:49:18Z

Test build #64956 has finished for PR 14915 at commit 20fa7e3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2016-09-06T04:15:55Z

test this please

yhuai · 2016-09-06T04:23:08Z

LGTM. Pending jenkins.

cloud-fan · 2016-09-06T04:44:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala

-    case m: Metadata => m.jsonValue
+    // SPARK-17356: In usage of mllib, Metadata may store a huge vector of data, transforming
+    // it to JSON may trigger OutOfMemoryError.
+    case m: Metadata => Metadata.empty.jsonValue


cloud-fan (Sep 6, 2016): Shall we use JNothing instead of Metadata.empty.jsonValue?

clockfly (Sep 6, 2016): No, we should not. JNothing is used to map scala.Option.

cloud-fan (Sep 6, 2016): Oh sorry, I meant JNull.
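The three candidates in this thread produce different output for the metadata field. The sketch below uses a small hand-written JSON model rather than json4s itself; JsNothing, JsNull, JsString, JsObject and RenderSketch are hypothetical names, and the rule that a "nothing" field is dropped at render time is modeled on how json4s is generally understood to treat JNothing.

```scala
// Hand-written model for illustration; these are not the json4s types.
sealed trait Json
case object JsNothing extends Json
case object JsNull extends Json
final case class JsString(value: String) extends Json
final case class JsObject(fields: List[(String, Json)]) extends Json

object RenderSketch {
  def render(json: Json): String = json match {
    case JsNothing   => ""
    case JsNull      => "null"
    case JsString(v) => "\"" + v + "\""
    case JsObject(fields) =>
      // Fields whose value is JsNothing are omitted from the output.
      fields
        .collect { case (k, v) if v != JsNothing => "\"" + k + "\":" + render(v) }
        .mkString("{", ",", "}")
  }
}
```

In this model, a "nothing" value makes the explicitMetadata key disappear, a null value keeps the key with null, and an empty object (the analogue of Metadata.empty.jsonValue) keeps the key with {} so the field still looks like metadata to a reader of the JSON.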

SparkQA · 2016-09-06T05:45:42Z

Test build #64967 has finished for PR 14915 at commit 20fa7e3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2016-09-06T05:55:56Z

test this please

SparkQA · 2016-09-06T07:55:41Z

Test build #64971 has finished for PR 14915 at commit 20fa7e3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…reeNode

Author: Sean Zhong <seanzhong@databricks.com>
Closes #14915 from clockfly/json_oom.
(cherry picked from commit 6f13aa7)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

cloud-fan · 2016-09-06T08:07:53Z

thanks, merging to master and 2.0!
can you send a new PR for 1.6?

…for TreeNode

This is a backport of PR #14915 to branch 1.6.

Author: Sean Zhong <seanzhong@databricks.com>
Closes #14973 from clockfly/json_oom_1.6.

…for TreeNode

This is a backport of PR apache#14915 to branch 1.6.

Author: Sean Zhong <seanzhong@databricks.com>
Closes apache#14973 from clockfly/json_oom_1.6.
(cherry picked from commit e6480a6)

clockfly changed the title ~~[SPARK-17356][SQL] Fix out of memory issue when calling TreeNode.toJSON~~ [SPARK-17356][SQL][WIP] Fix out of memory issue when calling TreeNode.toJSON Sep 1, 2016

clockfly changed the title ~~[SPARK-17356][SQL][WIP] Fix out of memory issue when calling TreeNode.toJSON~~ [SPARK-17356][SQL][WIP] Fix out of memory issue when generating JSON for TreeNode Sep 1, 2016

clockfly reviewed Sep 1, 2016

fix OOM issue when generating JSON

909f3bb

clockfly changed the title ~~[SPARK-17356][SQL][WIP] Fix out of memory issue when generating JSON for TreeNode~~ [SPARK-17356][SQL] Fix out of memory issue when generating JSON for TreeNode Sep 1, 2016

yhuai reviewed Sep 1, 2016

log warning if schema doesn't contain column for corrupted record

39f3c63

clockfly force-pushed the json_oom branch from 368e097 to 39f3c63 Compare September 2, 2016 03:25

address comment

20fa7e3

cloud-fan reviewed Sep 6, 2016

asfgit closed this in 6f13aa7 Sep 6, 2016

clockfly mentioned this pull request Sep 6, 2016

[SPARK-17356][SQL][1.6] Fix out of memory issue when generating JSON for TreeNode #14973

Closed
