[SPARK-17356][SQL] Fix out of memory issue when generating JSON for TreeNode #14915
The current implementation of toJSON recursively searches every Map and Seq and tries to convert every field to JSON.
This is quite risky, since we don't know what data is stored in an unknown Seq or Map, and it may easily trigger an OOM if the Seq or Map is a huge object.
Maybe we should disable converting Seq and Map?
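To make the concern concrete, here is a minimal sketch of a converter that descends into every Seq and Map it meets. It is a toy stand-in written for illustration, not Spark's `TreeNode.toJSON`; the object name `RecursiveJsonSketch` and its `toJson` method are hypothetical.

```scala
// Toy stand-in for a recursive JSON converter; NOT Spark's TreeNode.toJSON.
object RecursiveJsonSketch {
  // Renders a value as a JSON string, descending into every Seq and Map.
  def toJson(value: Any): String = value match {
    case null       => "null"
    case s: String  => "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""
    case n: Int     => n.toString
    case d: Double  => d.toString
    case b: Boolean => b.toString
    case m: Map[_, _] =>
      // Every entry is converted, whatever the map happens to hold.
      m.iterator
        .map { case (k, v) => toJson(k.toString) + ":" + toJson(v) }
        .mkString("{", ",", "}")
    case xs: Seq[_] =>
      // Every element is converted, so the output grows with the collection.
      xs.iterator.map(toJson).mkString("[", ",", "]")
    case other => toJson(other.toString)
  }
}
```

With this sketch, `RecursiveJsonSketch.toJson(Seq.fill(1000000)(1))` builds a string of roughly two million characters in one go, and a plan holding many such fields multiplies that cost.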
Test build #64772 has finished for PR 14915 at commit
I think this comment deserves to have an example. Also, it will be good to just create a jira with your example.
I will create a follow up jira to refactor the toJSON
I think it is better to make the comment self-contained. So, readers of this part do not need to guess or search the jira to understand what this line means.
I removed this comment. I tried, but it seems to require a big block to explain what this TODO means. I feel it may create more confusion.
btw, we also need to merge it to branch 1.6, which also has toJSON (https://github.com/apache/spark/blob/branch-1.6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L533).
Test build #64828 has finished for PR 14915 at commit
Test build #64956 has finished for PR 14915 at commit
test this please
LGTM. Pending jenkins.
-      case m: Metadata => m.jsonValue
+      // SPARK-17356: In usage of mllib, Metadata may store a huge vector of data, transforming
+      // it to JSON may trigger OutOfMemoryError.
+      case m: Metadata => Metadata.empty.jsonValue
shall we use JNothing instead of Metadata.empty.jsonValue?
No, we should not. JNothing is used to map scala.Option.
oh sorry, I mean JNull
Test build #64967 has finished for PR 14915 at commit
test this please
Test build #64971 has finished for PR 14915 at commit
…reeNode
## What changes were proposed in this pull request?
The class `org.apache.spark.sql.types.Metadata` is widely used in mllib to store ML attributes. `Metadata` is commonly stored in an `Alias` expression.
```
case class Alias(child: Expression, name: String)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifier: Option[String] = None,
    val explicitMetadata: Option[Metadata] = None,
    override val isGenerated: java.lang.Boolean = false)
```
The `Metadata` can have a large memory footprint when the number of attributes is large (on the order of millions). When `toJSON` is called on an `Alias` expression, its `Metadata` is also converted to a large JSON string.
If a plan contains many such `Alias` expressions, calling `toJSON` may trigger an out-of-memory error, since converting all the `Metadata` references to JSON takes a huge amount of memory.
With this PR, we skip scanning `Metadata` when doing JSON conversion. For a reproducer of the OOM and an analysis, please see the JIRA https://issues.apache.org/jira/browse/SPARK-17356.
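The idea of skipping `Metadata` can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patched Spark code: the `Metadata` and `Alias` classes here are small hypothetical stand-ins for the Spark classes of the same name, and `SkipMetadataSketch`, `fieldToJson`, and `aliasToJson` are made-up helpers.

```scala
// Hypothetical stand-ins; these are NOT the Spark classes of the same name.
final case class Metadata(attrs: Map[String, Seq[Double]]) {
  // Full rendering: cost is proportional to the number of stored values.
  def json: String =
    attrs.iterator
      .map { case (k, v) => "\"" + k + "\":" + v.mkString("[", ",", "]") }
      .mkString("{", ",", "}")
}

final case class Alias(name: String, explicitMetadata: Option[Metadata])

object SkipMetadataSketch {
  private val emptyMetadata = Metadata(Map.empty)

  // Converts one field; a Metadata value is replaced by empty metadata
  // so its (possibly huge) contents are never rendered.
  def fieldToJson(value: Any): String = value match {
    case _: Metadata => emptyMetadata.json
    case Some(v)     => fieldToJson(v)
    case None        => "null"
    case other       => "\"" + other.toString + "\""
  }

  def aliasToJson(a: Alias): String =
    "{\"name\":" + fieldToJson(a.name) +
      ",\"explicitMetadata\":" + fieldToJson(a.explicitMetadata) + "}"
}
```

In this sketch the JSON for an `Alias` stays a few dozen characters long no matter how many values its `Metadata` holds, at the price of dropping the metadata contents from the output.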
## How was this patch tested?
Existing tests.
Author: Sean Zhong <seanzhong@databricks.com>
Closes #14915 from clockfly/json_oom.
(cherry picked from commit 6f13aa7)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
thanks, merging to master and 2.0!
…for TreeNode
This is a backport of PR #14915 to branch 1.6.
Author: Sean Zhong <seanzhong@databricks.com>
Closes #14973 from clockfly/json_oom_1.6.
(cherry picked from commit e6480a6)