[SPARK-17972][SQL] Build Datasets upon withCachedData instead of analyzed to avoid slow query planning
#15517
Conversation
The original toString method may OOM for super large query plans. This is especially true for plan trees that are built in an iterative manner and grow exponentially.
Test build #67081 has finished for PR 15517 at commit
LGTM.
Before this PR, we also cached the analyzed plan, right?
I think the major change is that we now cache the cached plan instead of the analyzed plan.
Force-pushed from 292ef36 to e1283a8: [SPARK-17972][SQL] Build Datasets upon withCachedData instead of analyzed to avoid slow query planning
The previous test failure was because we replaced the analyzed plan with withCachedData. Force-pushed a new and much simpler approach by building new Datasets upon withCachedData.
Test build #67120 has finished for PR 15517 at commit
cc @mengxr and @jkbradley
Test build #67208 has finished for PR 15517 at commit
The most recent version still breaks some test cases related to caching. Investigating it.
lazy val withCachedData: LogicalPlan = {
  assertAnalyzed()
  assertSupported()
why?
This line was actually moved to optimizedPlan. It's for fixing the streaming test failures.
Although streaming queries don't use QueryExecution for actual execution, something triggers this line after the changes made in the previous commit and throws an exception. I'm not quite familiar with Structured Streaming though, so I may have missed something here.
I'm closing this since caching is not the ultimate solution for this problem anyway. Caching consumes too much memory when, say, computing connected components iteratively over a graph with 50 billion nodes. Going to add a checkpoint API for Dataset so that we can truncate both the plan tree and the RDD lineage without caching.
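The checkpoint idea above can be sketched with a toy lineage model. This is plain Scala with hypothetical names (ToyDataset, transform, checkpoint are stand-ins, not the actual Dataset API): repeated transformations keep growing the lineage, while checkpointing pretends to materialize the data and truncates the chain.

```scala
// Toy model: lineageDepth tracks how many ancestor steps a dataset still
// references. Checkpointing (simulated) writes data to stable storage and
// resets the lineage, so both the plan tree and the RDD lineage stay short.
final case class ToyDataset(lineageDepth: Int)

// Each transformation adds one step to the lineage.
def transform(d: ToyDataset): ToyDataset = ToyDataset(d.lineageDepth + 1)

// Hypothetical checkpoint: pretend the data was persisted, drop the lineage.
def checkpoint(d: ToyDataset): ToyDataset = ToyDataset(0)

// 1,000 iterations with no checkpointing: lineage keeps growing.
val noCheckpoint = (1 to 1000).foldLeft(ToyDataset(0))((d, _) => transform(d))

// Checkpoint every 100 iterations: lineage depth never exceeds 100.
val withCheckpoint = (1 to 1000).foldLeft(ToyDataset(0)) { (d, i) =>
  val next = transform(d)
  if (i % 100 == 0) checkpoint(next) else next
}
```

The real API that grew out of this discussion additionally writes the data out, which caching alone cannot do without holding everything in memory.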
dataset.checkpoint is what I need.
What changes were proposed in this pull request?
(This PR is based on a PoC branch authored by @clockfly.)
Iterative ML code may easily create query plans that grow exponentially. We found that query planning time also increases exponentially even when all the sub-plan trees are cached.
The following snippet illustrates the problem:
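The original snippet is not preserved in this page capture. As a hedged stand-in (plain Scala, not Spark code; Plan, Leaf, Union, and grow are illustrative names), the toy model below shows how a plan tree that is rebuilt on itself every iteration grows exponentially, so any whole-tree work such as analysis or planning blows up with the iteration count:

```scala
// Toy model (not Spark internals): a binary "plan tree" that doubles in size
// each iteration, as happens when each step combines a Dataset with itself.
sealed trait Plan { def size: Long }
case object Leaf extends Plan { def size = 1L }
case class Union(left: Plan, right: Plan) extends Plan {
  def size = 1L + left.size + right.size
}

// After n iterations of `plan = Union(plan, plan)` the tree has 2^(n+1) - 1
// nodes, so traversal cost grows exponentially with the iteration count.
def grow(n: Int): Plan =
  (1 to n).foldLeft(Leaf: Plan)((p, _) => Union(p, p))
```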
This is because when building a new Dataset, the new plan is always built upon QueryExecution.analyzed, which doesn't leverage existing cached plans. This PR tries to fix this issue by building new Datasets upon QueryExecution.withCachedData to leverage cached plans and avoid super slow query planning.

Here is the result of running 1,000 iterations using the same snippet posted above after applying this PR:
Query planning time now grows much more slowly. The remaining growth is mostly because cache manager lookup slows down as the number of entries stored in the cache manager grows.
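The effect of the change can be sketched with a toy comparison. This is plain Scala, not Spark internals; CachedLeaf is an illustrative stand-in for a cached in-memory relation, and the function names are hypothetical. Building each iteration on the full analyzed tree keeps the whole history in the plan, while building on the cached data collapses the previously cached subtree into a single leaf:

```scala
// Toy plan nodes: Source is a base table, CachedLeaf stands in for a cached
// in-memory relation, Join combines two subtrees.
sealed trait Node { def size: Long }
case object Source extends Node { def size = 1L }
case object CachedLeaf extends Node { def size = 1L }
case class Join(l: Node, r: Node) extends Node { def size = 1L + l.size + r.size }

// Building on the analyzed plan: each iteration embeds the whole previous tree,
// so the plan the planner must walk doubles every step.
def planBuiltOnAnalyzed(n: Int): Node =
  (1 to n).foldLeft(Source: Node)((p, _) => Join(p, p))

// Building on the cached data: the previous (cached) tree is replaced by a
// single leaf, so the tree handed to the planner stays constant-size.
def planBuiltOnCachedData(n: Int): Node =
  (1 to n).foldLeft(Source: Node)((_, _) => Join(CachedLeaf, CachedLeaf))
```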
Many thanks to @clockfly, who investigated this issue and made an initial PoC.
How was this patch tested?
Existing tests.