Skip to content

Conversation

@JoshRosen
Copy link
Contributor

This patch fixes multiple memory leaks in Spillable collections, as well as a leak in UnsafeShuffleWriter. There were a small handful of places where tasks would acquire memory from the ShuffleMemoryManager but would not release it by the time the task had ended. The UnsafeShuffleWriter case was harmless, since the leak could only occur at the very end of a task, but the other two cases are somewhat serious:

  • ExternalSorter.stop() did not release the sorter's memory. In addition, BlockStoreShuffleReader never called stop() once the sorter's iterator was fully-consumed. Put together, these bugs meant that a shuffle which performed a reduce-side could starve downstream piplelined transformations of shuffle memory.
  • ExternalAppendOnlyMap exposes no equivalent of stop() and its iterators do not automatically free its in-memory data upon completion. This could cause aggregation operations to starve other operations of shuffle memory.

This patch adds a regression test and fixes all three leaks.

This patch was originally opened as #9260; this version is the 1.5.x backport.

  - Fix leak in ExternalSorter.stop().
  - Use CompletionIterator in BlockStoreShuffleReader.
  - Fix leak in UnsafeShuffleWriter (this one wouldn't affect users, since the leak could only occur precisely before the task finished, but it broke the new tests).
@SparkQA
Copy link

SparkQA commented Nov 3, 2015

Test build #44858 timed out for PR 9427 at commit 46ef394 after a configured wait of 175m.

@SparkQA
Copy link

SparkQA commented Nov 3, 2015

Test build #44859 timed out for PR 9427 at commit c897b75 after a configured wait of 175m.

@JoshRosen
Copy link
Contributor Author

Hmm, looks like legitimate test failures. I'll investigate tomorrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of failed task, it's fine to have some memory leak (will be freed finally), will this exception override the real one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping

@davies
Copy link
Contributor

davies commented Nov 4, 2015

LGTM, except one minor comment.

@JoshRosen
Copy link
Contributor Author

@davies, I've updated this to address your comment; PTAL.

@SparkQA
Copy link

SparkQA commented Nov 24, 2015

Test build #46559 timed out for PR 9427 at commit 1d35e6f after a configured wait of 175m.

@JoshRosen
Copy link
Contributor Author

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Nov 25, 2015

Test build #46648 timed out for PR 9427 at commit 1d35e6f after a configured wait of 175m.

@SparkQA
Copy link

SparkQA commented Nov 25, 2015

Test build #2115 timed out for PR 9427 at commit 1d35e6f after a configured wait of 175m.

@SparkQA
Copy link

SparkQA commented Dec 2, 2015

Test build #2150 timed out for PR 9427 at commit 1d35e6f after a configured wait of 175m.

@davies
Copy link
Contributor

davies commented Dec 2, 2015

@JoshRosen Is there something wrong with this PR?

@JoshRosen
Copy link
Contributor Author

Looks like it timed out while compiling? Let me try again.

Jenkins, retest this please.

@yhuai
Copy link
Contributor

yhuai commented Dec 8, 2015

tes this please

@yhuai
Copy link
Contributor

yhuai commented Dec 8, 2015

test this please

@SparkQA
Copy link

SparkQA commented Dec 8, 2015

Test build #47344 timed out for PR 9427 at commit 1d35e6f after a configured wait of 175m.

@andrewor14
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Dec 15, 2015

Test build #47686 timed out for PR 9427 at commit 1d35e6f after a configured wait of 175m.

@davies
Copy link
Contributor

davies commented Dec 15, 2015

@JoshRosen It should be something wrong in this PR, if we can't fix it easily, would you mind close this one?

@JoshRosen
Copy link
Contributor Author

Yeah, I'm going to close this for now; I don't think that this is a high-priority issue to fix for 1.5.x since it's been around forever and nobody reported problems due to it.

@JoshRosen JoshRosen closed this Dec 15, 2015
@JoshRosen JoshRosen deleted the SPARK-11293-branch-1.5-backport branch August 29, 2016 19:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants