Conversation

szhem (Contributor) commented Sep 27, 2017

What changes were proposed in this pull request?

Fix for the SPARK-22150 JIRA issue.

When checkpointing RDDs which depend on previously checkpointed RDDs (for example, in iterative algorithms), PeriodicCheckpointer removes the already checkpointed and materialized RDDs too early, leading to FileNotFoundExceptions.

Consider the following snippet

// create a periodic checkpointer with interval of 2
val checkpointer = new PeriodicRDDCheckpointer[Double](2, sc)

val rdd1 = createRDD(sc)
checkpointer.update(rdd1)
// on the second update rdd1 is checkpointed
checkpointer.update(rdd1)
// on action checkpointed rdd is materialized and its lineage is truncated
rdd1.count() 

// rdd2 depends on rdd1
val rdd2 = rdd1.filter(_ => true)
checkpointer.update(rdd2)
// on the second update rdd2 is checkpointed and checkpoint files of rdd1 are deleted
checkpointer.update(rdd2)
// on action it's necessary to read already removed checkpoint files of rdd1
rdd2.count()

This PR proposes to preserve all the checkpoints that the last one depends on, so that the final RDD can be evaluated even if the last checkpoint (the one the final RDD depends on) is not yet materialized.
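
To illustrate the idea, here is a minimal sketch (not the actual patch; the helper name and traversal are assumptions) of how the checkpoint files an RDD still depends on could be collected by walking its lineage, so the checkpointer can avoid deleting them:

import org.apache.spark.rdd.RDD

import scala.collection.mutable

// Hypothetical helper, not part of this PR: collects the checkpoint
// directories of all checkpointed ancestors of the given RDD by traversing
// its dependency graph, so they can be excluded from removal.
def checkpointFilesInLineage(rdd: RDD[_]): Set[String] = {
  val visited = mutable.Set.empty[RDD[_]]
  val files = mutable.Set.empty[String]

  def visit(r: RDD[_]): Unit = {
    if (visited.add(r)) {
      // getCheckpointFile returns the checkpoint directory, if this RDD has one
      r.getCheckpointFile.foreach(files += _)
      r.dependencies.foreach(dep => visit(dep.rdd))
    }
  }

  visit(rdd)
  files.toSet
}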

How was this patch tested?

Unit tests, as well as manually in production jobs which previously were failing until this PR was applied.

szhem changed the title from "[SPARK-22150][CORE] PeriodicCheckpointer fails in case of dependant RDDs" to "[SPARK-22150][CORE] PeriodicCheckpointer fails in case of dependent RDDs" Oct 3, 2017
szhem (Contributor, Author) commented Oct 16, 2017

I would be happy if anyone could take a look at this PR.

Review comment from @sujithjay on the PeriodicRDDCheckpointer class declaration:

*
* TODO: Move this out of MLlib?
*/
private[spark] class PeriodicRDDCheckpointer[T](

The scaladoc for this class needs to be updated to include this new behaviour. Particularly, the 'WARNINGS' section.

szhem (Contributor, Author) replied:

@sujithjay, thanks a lot for noticing!
Just updated the docs a little bit to clarify the new behaviour.

sujithjay commented Mar 26, 2018

Hi @szhem, you could consider identifying contributors who have worked on the code being changed, and reaching out to them for review.

sujithjay commented:

cc: @felixcheung @jkbradley @mengxr
Could you please review this PR?

felixcheung (Member) left a comment:

It is deleting earlier checkpoint after the current checkpoint is called though?

Is this just an issue with Dataset.checkpoint(eager = true)?

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@checkpoint(eager:Boolean):org.apache.spark.sql.Dataset[T]
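
For reference, eager Dataset checkpointing looks roughly like this (a minimal sketch; the session setup, application name and checkpoint directory below are assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("eager-checkpoint-example").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoints") // assumed directory

val ds = spark.range(0, 10)
// with eager = true the checkpoint is materialized before the call returns
val checkpointed = ds.checkpoint(eager = true)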

szhem (Contributor, Author) commented Mar 27, 2018

@felixcheung

It is deleting earlier checkpoint after the current checkpoint is called though?

Currently PeriodicCheckpointer can fail when checkpointing RDDs which depend on each other, as in the sample below.

// create a periodic checkpointer with interval of 2
val checkpointer = new PeriodicRDDCheckpointer[Double](2, sc)
val rdd1 = createRDD(sc)
checkpointer.update(rdd1)
// on the second update rdd1 is checkpointed
checkpointer.update(rdd1)
// on action the checkpointed rdd1 is materialized and its lineage is truncated
rdd1.count()

// rdd2 depends on rdd1
val rdd2 = rdd1.filter(_ => true)
checkpointer.update(rdd2)
// on the second update rdd2 is checkpointed and the checkpoint files of rdd1 are deleted
checkpointer.update(rdd2)
// on action it's necessary to read the already removed checkpoint files of rdd1
rdd2.count()

It's about deleting the files of an already checkpointed and materialized RDD when another RDD depends on it.

If RDDs are cached before checkpointing (as is often recommended), then this issue is likely not visible, because the checkpointed RDD will be read from the cache and not from the materialized files.

A good example of such behaviour is described in PR #19410, where GraphX fails with FileNotFoundException when memory resources are insufficient and cached blocks of checkpointed and materialized RDDs are evicted from memory, causing them to be read from the already deleted files.
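
For context, the cache-before-checkpoint pattern mentioned above looks roughly like this (a sketch, not code from this PR; the storage level is an assumption, and rdd1 refers to the RDD from the snippet above):

import org.apache.spark.storage.StorageLevel

// cache first, then checkpoint: the action that materializes the checkpoint,
// as well as later reads, can be served from the cached blocks instead of
// recomputing the lineage or hitting the checkpoint files
rdd1.persist(StorageLevel.MEMORY_AND_DISK)
rdd1.checkpoint()
rdd1.count()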

Is this just an issue with Dataset.checkpoint(eager = true)?

This PR does not include modifications to the Dataset API and mainly affects PeriodicCheckpointer and PeriodicRDDCheckpointer.
It was created as a preliminary PR to #19410 (where GraphX fails when reading cached RDDs that have already been evicted from memory).

szhem (Contributor, Author) commented Mar 27, 2018

By the way, what do you think, guys: maybe it would be better to merge the changes from #19410 into this one?
#19410 is almost about the same issue and fixes the described behaviour for GraphX.

felixcheung (Member) commented:

To clarify, what I mean is that the issue is caused by checkpointing being lazy: if you remove the previous checkpoint before the new checkpoint is started or completed, this fails.

So the fix might be to change the checkpoint() call to checkpoint(eager = true): this ensures that by the time the checkpoint call returns, the checkpointing is completed.

szhem (Contributor, Author) commented Mar 29, 2018

@felixcheung,
Unfortunately, RDDs, which PeriodicRDDCheckpointer is based on, do not have checkpoint(eager = true) yet.
That functionality belongs to Datasets.

I've experimented with a similar method for RDDs ...

// hypothetical eager variant added to RDD[T]
def checkpoint(eager: Boolean): RDD[T] = {
  checkpoint()
  if (eager) {
    // force materialization of the checkpoint
    count()
  }
  this
}

... and it does not work for PeriodicRDDCheckpointer in some scenarios.
Please consider the following example:

val checkpointInterval = 2

val checkpointer = new PeriodicRDDCheckpointer[(Int, Int)](checkpointInterval, sc)
val rdd1 = sc.makeRDD((0 until 10).map(i => i -> i))

// rdd1 is not materialized yet, checkpointer(update=1, checkpointInterval=2)
checkpointer.update(rdd1)
// rdd2 depends on rdd1
val rdd2 = rdd1.filter(_ => true)

// rdd1 is materialized, checkpointer(update=2, checkpointInterval=2)
checkpointer.update(rdd1)
// rdd3 depends on rdd1
val rdd3 = rdd1.filter(_ => true)

// rdd3 is not materialized yet, checkpointer(update=3, checkpointInterval=2)
checkpointer.update(rdd3)
// rdd3 is materialized, rdd1's files are removed, checkpointer(update=4, checkpointInterval=2)
checkpointer.update(rdd3)

// fails with FileNotFoundException because
// rdd1's files were removed on the previous step and
// rdd2 depends on rdd1
rdd2.count()

It fails with FileNotFoundException even with eager checkpointing, and passes when the parent checkpointed RDDs are preserved, as it's done in this PR.

szhem (Contributor, Author) commented Apr 2, 2018

So the fix might be to change the checkpoint() call to checkpoint(eager = true): this ensures that by the time the checkpoint call returns, the checkpointing is completed.

Even if the checkpoint is completed, PeriodicRDDCheckpointer removes the files of checkpointed and materialized RDDs later on, so it may happen that another RDD depends on the already removed files.

szhem (Contributor, Author) commented Jun 25, 2018

Just a kind reminder...

szhem (Contributor, Author) commented Sep 30, 2018

Hello @sujithjay, @felixcheung, @jkbradley, @mengxr, more than a year has already passed since this pull request was opened.
I'm just wondering whether there is any chance for this PR to be reviewed by someone and either rejected or merged (understanding that all of you have little or probably no time, having your own more important activities).

AmplabJenkins commented:

Can one of the admins verify this patch?

github-actions bot commented:

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the Stale label Jan 15, 2020
github-actions bot closed this Jan 16, 2020