[SPARK-22150][CORE] PeriodicCheckpointer fails in case of dependent RDDs #19373
Conversation
…se of dependant RDDs
…eckpointed RDD as their parent to prevent early removal
I would be happy if anyone could take a look at this PR.
 *
 * TODO: Move this out of MLlib?
 */
private[spark] class PeriodicRDDCheckpointer[T](
The scaladoc for this class needs to be updated to include this new behaviour. Particularly, the 'WARNINGS' section.
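For illustration only, a hedged sketch of the kind of note such a WARNINGS section could carry (this is not the wording added by the PR):

```scala
/**
 * WARNINGS:
 *  - Older checkpoint files are removed once newer RDDs have been checkpointed
 *    and materialized.
 *  - If a later RDD still depends on an earlier checkpointed RDD, the earlier
 *    checkpoint files must be preserved until that later RDD is materialized;
 *    otherwise evaluating it fails with a FileNotFoundException.
 */
```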
@sujithjay, thanks a lot for noticing!
Just updated the docs a little bit to clarify the new behaviour.
Hi @szhem, you could consider identifying contributors who have worked on the code being changed, and reaching out to them for review.

cc: @felixcheung @jkbradley @mengxr
It is deleting the earlier checkpoint after the current checkpoint is called, though?
Is this just an issue with DataSet.checkpoint(eager = true)?
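For reference, a minimal sketch of the eager vs. lazy distinction being asked about (assuming an active SparkSession named `spark` and a checkpoint directory of the reader's choosing); Dataset.checkpoint(eager = true) runs a job immediately, while RDD.checkpoint() stays lazy until the next action:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("checkpoint-demo").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical directory

// Dataset API: eager = true materializes the checkpoint right away
val eagerDs = spark.range(10).checkpoint(eager = true)

// RDD API: checkpoint() only marks the RDD; files appear on the next action
val rdd = spark.sparkContext.makeRDD(0 until 10)
rdd.checkpoint()
rdd.count() // materializes the checkpoint files
```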
Currently PeriodicCheckpointer can fail when checkpointing RDDs which depend on each other, as in the sample below. The problem is that the files of an already checkpointed and materialized RDD are deleted while another RDD still depends on it. If RDDs are cached before checkpointing (as is often recommended), the issue is usually not visible, because the checkpointed RDD is read from the cache rather than from the materialized files. A good example of such behaviour is described in #19410, where GraphX fails with a FileNotFoundException.
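For illustration, a minimal sketch of that recommended pattern (hypothetical variable names, assuming `sc` is a SparkContext with a checkpoint directory configured); because the parent stays in the block manager, its dependents never touch the checkpoint files, so removing them goes unnoticed:

```scala
val parent = sc.makeRDD(0 until 10)
parent.cache()        // keep the blocks in memory
parent.checkpoint()   // lazy: files are written on the next action
parent.count()        // materializes both the cache and the checkpoint

val child = parent.filter(_ % 2 == 0)
child.count()         // reads parent from the cache, not from the checkpoint files
```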
This PR does not include modifications to the DataSet API and affects mainly PeriodicRDDCheckpointer.
…t files may be preserved in case of checkpointing dependent RDDs
… to clarify the new behaviour
To clarify, what I mean is that the issue is caused by checkpointing being lazy: if you remove the previous checkpoint before the new checkpoint is started or completed, this fails. So the fix might be to change to call the checkpoint eagerly.
@felixcheung, I've experimented with a similar method for RDDs

```scala
def checkpoint(eager: Boolean): RDD[T] = {
  checkpoint()
  if (eager) {
    count()
  }
  this
}
```

and it does not work for the following scenario:

```scala
val checkpointInterval = 2
val checkpointer = new PeriodicRDDCheckpointer[(Int, Int)](checkpointInterval, sc)

val rdd1 = sc.makeRDD((0 until 10).map(i => i -> i))
// rdd1 is not materialized yet, checkpointer(update=1, checkpointInterval=2)
checkpointer.update(rdd1)

// rdd2 depends on rdd1
val rdd2 = rdd1.filter(_ => true)

// rdd1 is materialized, checkpointer(update=2, checkpointInterval=2)
checkpointer.update(rdd1)

// rdd3 depends on rdd1
val rdd3 = rdd1.filter(_ => true)

// rdd3 is not materialized yet, checkpointer(update=3, checkpointInterval=2)
checkpointer.update(rdd3)

// rdd3 is materialized, rdd1's files are removed, checkpointer(update=4, checkpointInterval=2)
checkpointer.update(rdd3)

// fails with FileNotFoundException because
// rdd1's files were removed on the previous step and
// rdd2 depends on rdd1
rdd2.count()
```

It fails with FileNotFoundException.
…eckpointed RDD are removed too
Even if the checkpoint is completed, removing the files of the previously checkpointed RDD still breaks the RDDs that depend on it.
Just a kind reminder...

Hello @sujithjay, @felixcheung, @jkbradley, @mengxr, it has been more than a year since this pull request was opened.

Can one of the admins verify this patch?

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Fix for the SPARK-22150 JIRA issue.
When checkpointing RDDs which depend on previously checkpointed RDDs (for example, in iterative algorithms), PeriodicCheckpointer removes already checkpointed and materialized RDDs too early, leading to FileNotFoundExceptions.
Consider the following snippet
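(The snippet below mirrors the reproduction posted earlier in the conversation; it assumes `sc` is a SparkContext with a checkpoint directory configured and that the package-private PeriodicRDDCheckpointer is accessible, for example from Spark's own test sources.)

```scala
val checkpointInterval = 2
val checkpointer = new PeriodicRDDCheckpointer[(Int, Int)](checkpointInterval, sc)

val rdd1 = sc.makeRDD((0 until 10).map(i => i -> i))
checkpointer.update(rdd1)          // rdd1 not yet materialized

val rdd2 = rdd1.filter(_ => true)  // rdd2 depends on rdd1
checkpointer.update(rdd1)          // rdd1 is materialized and checkpointed

val rdd3 = rdd1.filter(_ => true)  // rdd3 depends on rdd1
checkpointer.update(rdd3)
checkpointer.update(rdd3)          // rdd3 materialized, rdd1's checkpoint files removed

rdd2.count()                       // FileNotFoundException: rdd2 still needs rdd1's files
```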
This PR proposes to preserve all the checkpoints the last one depends on, so that the final RDD can be evaluated even if the last checkpoint (which the final RDD depends on) is not yet materialized.
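A rough sketch of that idea, not the actual patch (helper names here are hypothetical): walk the dependency graph of the most recently checkpointed RDD, collect the checkpoint files it still relies on, and skip those when cleaning up older checkpoints.

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD

// Hypothetical helper: checkpoint files reachable from `rdd` through its lineage.
def reachableCheckpointFiles(rdd: RDD[_]): Set[String] = {
  val visited = mutable.Set.empty[RDD[_]]
  val files = mutable.Set.empty[String]
  def visit(r: RDD[_]): Unit = {
    if (visited.add(r)) {
      r.getCheckpointFile.foreach(files += _)
      r.dependencies.foreach(dep => visit(dep.rdd))
    }
  }
  visit(rdd)
  files.toSet
}

// Hypothetical guard: only delete an old checkpoint the latest RDD no longer depends on.
def safeToRemove(oldCheckpointFile: String, latest: RDD[_]): Boolean =
  !reachableCheckpointFiles(latest).contains(oldCheckpointFile)
```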
How was this patch tested?
Unit tests, as well as manual testing in production jobs which previously failed until this PR was applied.