[SPARK-19433][ML] Periodic checkout datasets for long ml pipeline #16775
Conversation
also cc @MLnick
Test build #72274 has finished for PR 16775 at commit
Test build #72275 has started for PR 16775 at commit
Test build #72276 has started for PR 16775 at commit
retest this please.
Test build #72277 has finished for PR 16775 at commit
Wouldn't it be better to Vectorize
For the issue reported on the mailing list, I found the root cause of the significant difference between 1.6 and the current branch. The fix is at #16785. However, I think this patch is still useful, so I will keep it open for a while for reviewers.
ping @mengxr @jkbradley @liancheng @MLnick Could you take a look at this? Thanks.
@viirya I believe this PR meshes with the refactoring and application to Pregel GraphX algorithms in #15125. Basically, it moves the periodic checkpointing code from mllib into core and uses it in GraphX to checkpoint long lineages. This is essential to scale GraphX to huge graphs, as described in my comment in the PR, and solves a very real problem for us. Can you take a look at that PR?
I think we can solve this issue by addressing the code in SQL, so I am closing this for now.
What changes were proposed in this pull request?
For a `Pipeline` with a long sequence of stages, the iterative fit and transform produce extremely large query plans and RDD lineages, so the fit and transform take longer to finish.
This patch introduces `PeriodicDatasetCheckpointer` to periodically checkpoint the dataset used in `fit` and `transform`.
It adds a new param, `checkpointInterval`, to `Pipeline` and `PipelineModel`. Once it is set, `PeriodicDatasetCheckpointer` performs the periodic checkpointing.
The existing trait `HasCheckpointInterval` already defines the `checkpointInterval` param, so this patch lets `Pipeline` and `PipelineModel` extend `HasCheckpointInterval`.
Benchmark
Run the following code locally.
Before this patch: 1786001 ms
After this patch: 69013 ms
This issue was originally reported at http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tc20803.html
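To illustrate why periodic checkpointing helps, here is a toy model in plain Python. It is not Spark code and not the `PeriodicDatasetCheckpointer` from this patch. `ToyDataset`, `run_stages`, and `checkpoint_interval` are hypothetical names chosen for the sketch, and the "lineage" is just a counter standing in for the growing query plan.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a Dataset: rows plus the depth of its lineage."""
    rows: List[int]
    lineage_depth: int = 0

    def transform(self, fn: Callable[[int], int]) -> "ToyDataset":
        # Every transformation adds one step to the lineage (query plan).
        return ToyDataset([fn(r) for r in self.rows], self.lineage_depth + 1)

    def checkpoint(self) -> "ToyDataset":
        # Materialize the rows and drop the accumulated lineage.
        return ToyDataset(list(self.rows), 0)


def run_stages(
    data: ToyDataset,
    stages: List[Callable[[int], int]],
    checkpoint_interval: Optional[int] = None,
) -> Tuple[ToyDataset, int]:
    """Apply stages in order, checkpointing every `checkpoint_interval` stages.

    Returns the final dataset and the largest lineage depth observed.
    """
    max_depth = 0
    for i, stage in enumerate(stages, start=1):
        data = data.transform(stage)
        max_depth = max(max_depth, data.lineage_depth)
        if checkpoint_interval and i % checkpoint_interval == 0:
            data = data.checkpoint()
    return data, max_depth


# A "long pipeline" of 100 identical stages (assumed workload for the sketch).
stages = [lambda x: x + 1] * 100
_, depth_without = run_stages(ToyDataset([1, 2, 3]), stages)
result, depth_with = run_stages(ToyDataset([1, 2, 3]), stages, checkpoint_interval=10)
print(depth_without, depth_with, result.rows)
```

In this toy run the lineage depth grows to 100 without checkpointing and never exceeds 10 with an interval of 10, while the final rows are the same in both cases. The sketch says nothing about the actual timings reported above.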
How was this patch tested?
Jenkins tests.