Skip to content

Conversation

@clockfly
Copy link
Contributor

@clockfly clockfly commented Sep 1, 2016

What changes were proposed in this pull request?

class org.apache.spark.sql.types.Metadata is widely used in mllib to store some ml attributes. Metadata is commonly stored in Alias expression.

case class Alias(child: Expression, name: String)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifier: Option[String] = None,
    val explicitMetadata: Option[Metadata] = None,
    override val isGenerated: java.lang.Boolean = false)

The Metadata can take a big memory footprint since the number of attributes is big ( in scale of million). When toJSON is called on Alias expression, the Metadata will also be converted to a big JSON string.
If a plan contains many such kind of Alias expressions, it may trigger out of memory error when toJSON is called, since converting all Metadata references to JSON will take huge memory.

With this PR, we will skip scanning Metadata when doing JSON conversion. For a reproducer of the OOM, and analysis, please look at jira https://issues.apache.org/jira/browse/SPARK-17356.

How was this patch tested?

Existing tests.

@clockfly clockfly changed the title [SPARK-17356][SQL] Fix out of memory issue when calling TreeNode.toJSON [SPARK-17356][SQL][WIP] Fix out of memory issue when calling TreeNode.toJSON Sep 1, 2016
@clockfly clockfly changed the title [SPARK-17356][SQL][WIP] Fix out of memory issue when calling TreeNode.toJSON [SPARK-17356][SQL][WIP] Fix out of memory issue when generating JSON for TreeNode Sep 1, 2016
Copy link
Contributor Author

@clockfly clockfly Sep 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current implementation of toJSON recursively searches the Map and Seq, and try to convert every field to JSON.

It is quite risky, since we don't know what data is stored in unknown Seq and Map, and it may easily trigger OOM if the Seq or Map is a huge object.

Maybe we should disable converting Seq and Map?

@clockfly
Copy link
Contributor Author

clockfly commented Sep 1, 2016

@mengxr @yhuai, comments?

@SparkQA
Copy link

SparkQA commented Sep 1, 2016

Test build #64772 has finished for PR 14915 at commit 368e097.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@clockfly clockfly changed the title [SPARK-17356][SQL][WIP] Fix out of memory issue when generating JSON for TreeNode [SPARK-17356][SQL] Fix out of memory issue when generating JSON for TreeNode Sep 1, 2016
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment deserves to have an example. Also, it will be good to just create a jira with your example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will create a follow up jira to refactor the toJSON

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to make the comment self-contained. So, readers of this part do not need to guess or search the jira to understand what this line means.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this comments. I tried, but it seems it requires a big block to explain what this TODO mean. I feel it may creates bigger confusion.

@yhuai
Copy link
Contributor

yhuai commented Sep 1, 2016

@SparkQA
Copy link

SparkQA commented Sep 2, 2016

Test build #64828 has finished for PR 14915 at commit 39f3c63.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 6, 2016

Test build #64956 has finished for PR 14915 at commit 20fa7e3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Copy link
Contributor

yhuai commented Sep 6, 2016

test this please

@yhuai
Copy link
Contributor

yhuai commented Sep 6, 2016

LGTM. Pending jenkins.

case m: Metadata => m.jsonValue
// SPARK-17356: In usage of mllib, Metadata may store a huge vector of data, transforming
// it to JSON may trigger OutOfMemoryError.
case m: Metadata => Metadata.empty.jsonValue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we use JNothing instead of Metadata.empty.jsonValue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we should not. JNothing is to map scala.Option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh sorry, I mean JNull

@SparkQA
Copy link

SparkQA commented Sep 6, 2016

Test build #64967 has finished for PR 14915 at commit 20fa7e3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Copy link
Contributor

yhuai commented Sep 6, 2016

test this please

@SparkQA
Copy link

SparkQA commented Sep 6, 2016

Test build #64971 has finished for PR 14915 at commit 20fa7e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Sep 6, 2016
…reeNode

## What changes were proposed in this pull request?

class `org.apache.spark.sql.types.Metadata` is widely used in mllib to store some ml attributes. `Metadata` is commonly stored in `Alias` expression.

```
case class Alias(child: Expression, name: String)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifier: Option[String] = None,
    val explicitMetadata: Option[Metadata] = None,
    override val isGenerated: java.lang.Boolean = false)
```

The `Metadata` can take a big memory footprint since the number of attributes is big ( in scale of million). When `toJSON` is called on `Alias` expression, the `Metadata` will also be converted to a big JSON string.
If a plan contains many such kind of `Alias` expressions, it may trigger out of memory error when `toJSON` is called, since converting all `Metadata` references to JSON will take huge memory.

With this PR, we will skip scanning Metadata when doing JSON conversion. For a reproducer of the OOM, and analysis, please look at jira https://issues.apache.org/jira/browse/SPARK-17356.

## How was this patch tested?

Existing tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #14915 from clockfly/json_oom.

(cherry picked from commit 6f13aa7)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Copy link
Contributor

thanks, merging to master and 2.0!
can you send a new PR for 1.6?

@asfgit asfgit closed this in 6f13aa7 Sep 6, 2016
asfgit pushed a commit that referenced this pull request Sep 6, 2016
…for TreeNode

This is a backport of PR #14915 to branch 1.6.

## What changes were proposed in this pull request?

class `org.apache.spark.sql.types.Metadata` is widely used in mllib to store some ml attributes. `Metadata` is commonly stored in `Alias` expression.

```
case class Alias(child: Expression, name: String)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifier: Option[String] = None,
    val explicitMetadata: Option[Metadata] = None,
    override val isGenerated: java.lang.Boolean = false)
```

The `Metadata` can take a big memory footprint since the number of attributes is big ( in scale of million). When `toJSON` is called on `Alias` expression, the `Metadata` will also be converted to a big JSON string.
If a plan contains many such kind of `Alias` expressions, it may trigger out of memory error when `toJSON` is called, since converting all `Metadata` references to JSON will take huge memory.

With this PR, we will skip scanning Metadata when doing JSON conversion. For a reproducer of the OOM, and analysis, please look at jira https://issues.apache.org/jira/browse/SPARK-17356.

## How was this patch tested?

Existing tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #14973 from clockfly/json_oom_1.6.
zzcclp pushed a commit to zzcclp/spark that referenced this pull request Sep 7, 2016
…for TreeNode

This is a backport of PR apache#14915 to branch 1.6.

## What changes were proposed in this pull request?

class `org.apache.spark.sql.types.Metadata` is widely used in mllib to store some ml attributes. `Metadata` is commonly stored in `Alias` expression.

```
case class Alias(child: Expression, name: String)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifier: Option[String] = None,
    val explicitMetadata: Option[Metadata] = None,
    override val isGenerated: java.lang.Boolean = false)
```

The `Metadata` can take a big memory footprint since the number of attributes is big ( in scale of million). When `toJSON` is called on `Alias` expression, the `Metadata` will also be converted to a big JSON string.
If a plan contains many such kind of `Alias` expressions, it may trigger out of memory error when `toJSON` is called, since converting all `Metadata` references to JSON will take huge memory.

With this PR, we will skip scanning Metadata when doing JSON conversion. For a reproducer of the OOM, and analysis, please look at jira https://issues.apache.org/jira/browse/SPARK-17356.

## How was this patch tested?

Existing tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes apache#14973 from clockfly/json_oom_1.6.

(cherry picked from commit e6480a6)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants