[SPARK-22184][CORE][GRAPHX] GraphX fails in case of insufficient memory and checkpoints enabled #19410
Conversation
…se of dependant RDDs
…eckpointed RDD as their parent to prevent early removal
I would be happy if anyone could take a look at this PR.
…t files may be preserved in case of checkpointing dependent RDDs
… to clarify the new behaviour
…eckpointed RDD are removed too
Hello @dding3, @viirya, @mallman, @felixcheung. You were reviewing graph checkpointing, introduced in #15125, and this PR changes that behaviour a little bit. Could you please review this PR too if possible?
Just a kind reminder...
Hi @szhem. Thanks for the kind reminder and thanks for your contribution. I'm sorry I did not respond sooner. I no longer work where I regularly used the checkpointing code with large graphs, and I don't have access to any similar graph to test with now. I'm somewhat hamstrung by that limitation. That being said, I'll do my best to help. With respect to the failure you're seeing, can you tell me what happens if you set your graph's storage level to ...?
Hi @mallman! In case of ... the tests pass. They still fail in case of ... Although it works, I'm not sure that changing the caching level of the graph is really a good option to go with, as Spark starts complaining here and here.
P.S. To emulate the lack of memory I just set the following options, like here.
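For readers following along, a minimal sketch of what the storage-level experiment discussed above might look like. This is not code from the PR: the app name, master, paths, and the PageRank call are placeholders; the relevant knobs are the edge/vertex storage levels that GraphLoader accepts (MEMORY_ONLY by default).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(
  new SparkConf().setAppName("storage-level-experiment").setMaster("local[2]"))
sc.setCheckpointDir("/tmp/spark-checkpoints") // placeholder path

// Vertex and edge storage levels are set at load time; the MEMORY_ONLY default is what
// forces evicted partitions to be recomputed from (possibly deleted) checkpoint files.
val graph = GraphLoader.edgeListFile(
  sc,
  "/data/edges.txt", // placeholder input
  edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
  vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)

val ranks = graph.pageRank(tol = 0.0001).vertices
ranks.count()
```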
Hi @szhem. I dug deeper and think I understand the problem better. To state the obvious, the periodic checkpointer deletes checkpoint files of RDDs that are potentially still accessible. In fact, that's the problem here. It deletes the checkpoint files of an RDD that's later used. The algorithm being used to find checkpoint files that can be "safely" deleted is flawed, and this PR aims to fix that. I have a few thoughts from this.
What do you think? @felixcheung or @viirya, can you weigh in on this, please?
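To make the failure mode described above concrete, here is a minimal sketch (editorial illustration, not code from the PR; paths and sizes are made up) of how a cached-but-not-checkpointed RDD can still depend on an earlier checkpoint, so deleting that checkpoint's files breaks recomputation after eviction.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(
  new SparkConf().setAppName("checkpoint-eviction-sketch").setMaster("local[2]"))
sc.setCheckpointDir("/tmp/spark-checkpoints") // placeholder path

val a = sc.parallelize(1 to 1000000, 8)
a.checkpoint()
a.count() // materialises a's checkpoint files; a's lineage is truncated to them

// b is derived from a but only cached, never checkpointed.
val b = a.map(_ * 2).persist(StorageLevel.MEMORY_ONLY)
b.count()

// A periodic checkpointer that decides a's checkpoint is "no longer needed" deletes its
// files at this point. If b's cached partitions are later evicted under memory pressure,
// recomputing them walks b's lineage back to a, which now tries to read its deleted
// checkpoint files from disk -> FileNotFoundException.
```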
Just my two cents regarding built-in solutions: the periodic checkpointer deletes checkpoint files so as not to pollute the hard drive. Although disk storage is cheap, it's not free. For example, in my case (a graph with >1B vertices and about the same number of edges) the checkpoint directory with a single checkpoint took about 150-200GB.
Hi, I met the same problem today.
Hi @szhem. Thanks for the information regarding disk use for your scenario. What do you think about my second point, using the ...?
Hi @mallman, I believe that ...
Hi @szhem. I understand you've put a lot of work into this implementation; however, I think you should try a simpler approach before we consider something more complicated. I believe an approach based on weak references and a reference queue would be a much simpler alternative. Can you give that a try?
I have tried setting the graph's storage level to StorageLevel.MEMORY_AND_DISK in my case, and the error still happens.
Hi @mallman, I believe that the solution with weak references will work, probably with ...
In case of a reference queue, could you please recommend a convenient place in the source code to do it? As for me, ...
What do you think? Will the community accept setting ...?
I've tested the mentioned checkpointers with ... It seems that hard references remain somewhere: old checkpoint files are not deleted at all, and it seems that ContextCleaner.doCleanCheckpoint is never called.
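For context on the ContextCleaner path discussed here: checkpoint files are only removed by the cleaner when spark.cleaner.referenceTracking.cleanCheckpoints is enabled and the checkpointed RDD becomes unreachable on the driver, so any lingering strong reference keeps doCleanCheckpoint from ever firing. A rough sketch of the intended mechanism follows; the config name is real, while the paths and data are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cleaner-sketch")
  .setMaster("local[2]")
  // Off by default; when enabled, the ContextCleaner deletes checkpoint files of RDDs
  // that have been garbage collected on the driver.
  .set("spark.cleaner.referenceTracking.cleanCheckpoints", "true")

val sc = new SparkContext(conf)
sc.setCheckpointDir("/tmp/spark-checkpoints") // placeholder path

var rdd = sc.parallelize(1 to 1000)
rdd.checkpoint()
rdd.count() // writes the checkpoint files

rdd = null   // drop the last strong reference on the driver
System.gc()  // the cleaner is driven by weak references and a reference queue,
             // so cleanup only happens after the RDD object is actually collected
```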
Hello @mallman, @sujithjay, @felixcheung, @jkbradley, @mengxr, it has already been about a year since this pull request was opened.
Hi @szhem. I'm sorry I haven't been more responsive here. I can relate to your frustration, and I do want to help you make progress on this PR and merge it in. I have indeed been busy with other responsibilities, but I can rededicate time to reviewing this PR. Of all the approaches you've proposed so far, I like the ... Would you be willing to open another PR with your ...? Thank you.
Can one of the admins verify this patch?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
I'll close this PR and the corresponding JIRA because this PR is stale. If necessary, please reopen it. Thanks!
I ran into the same issue today. Is there a workaround? @szhem @EthanRock
@ral51 the workaround is this PR, which has unfortunately been closed without being merged.
@szhem Your PR looks like it keeps checkpoints around, so that's good. Thank you.
@ral51 setting ...
What changes were proposed in this pull request?
Fix for SPARK-22184 JIRA issue (and also includes the related #19373).
When checkpoints are enabled, GraphX jobs can fail with a FileNotFoundException. The failure can happen during Pregel iterations or when Pregel completes, but only in cases of insufficient memory, when checkpointed RDDs are evicted from memory and have to be read back from disk (although their files have already been removed from there).
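For reference, a rough sketch of the kind of job setup under which this failure shows up: a checkpoint directory plus the Pregel checkpoint interval introduced by #15125 (spark.graphx.pregel.checkpointInterval), running a Pregel-based algorithm under memory pressure. The config name comes from that PR; the paths, interval value, and choice of connected components are placeholders, not taken from this PR.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

val conf = new SparkConf()
  .setAppName("graphx-checkpoint-repro")
  .setMaster("local[2]")
  // Enables periodic checkpointing inside Pregel (here: every 2 iterations).
  .set("spark.graphx.pregel.checkpointInterval", "2")

val sc = new SparkContext(conf)
sc.setCheckpointDir("/tmp/spark-checkpoints") // placeholder path; required for checkpointing

// Connected components runs on top of Pregel; with the default MEMORY_ONLY caching,
// evicted partitions may need to be recomputed from checkpoint files that the
// periodic checkpointer has already deleted, triggering the FileNotFoundException.
val graph = GraphLoader.edgeListFile(sc, "/data/edges.txt") // placeholder input
val components = graph.connectedComponents().vertices
components.count()
```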
This PR proposes to preserve, during the iterations, all the checkpoints that the last checkpoint of the messages and graph RDDs depends on, and, at the end of the Pregel iterations, all the checkpoints of messages and graph that the resulting graph depends on.
How was this patch tested?
Unit tests, as well as manual runs of production jobs which were previously failing until this PR was applied.