
Conversation

@JoshRosen
Contributor

This patch fixes multiple memory leaks in Spillable collections, as well as a leak in UnsafeShuffleWriter. There were a small handful of places where tasks would acquire memory from the ShuffleMemoryManager but would not release it by the time the task had ended. The UnsafeShuffleWriter case was harmless, since the leak could only occur at the very end of a task, but the other two cases are somewhat serious:

  • ExternalSorter.stop() did not release the sorter's memory. In addition, BlockStoreShuffleReader never called stop() once the sorter's iterator was fully consumed. Put together, these bugs meant that a shuffle which performed a reduce-side sort could starve downstream pipelined transformations of shuffle memory.
  • ExternalAppendOnlyMap exposes no equivalent of stop() and its iterators do not automatically free the map's in-memory data upon completion. This could cause aggregation operations to starve other operations of shuffle memory.
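The contract being violated above can be sketched in a few lines. This is a toy model with hypothetical names (ToyShuffleMemoryManager, ToyExternalSorter), not Spark's real internals: every byte a task acquires from the shuffle memory manager must be released by the time the consumer's stop() runs, or later operators in the same task are starved.

```scala
// Toy sketch of the acquire/release contract (hypothetical classes, not Spark code).
class ToyShuffleMemoryManager(val maxBytes: Long) {
  private var used = 0L
  // Grant as much of the request as fits under the cap.
  def tryToAcquire(numBytes: Long): Long = synchronized {
    val granted = math.min(numBytes, maxBytes - used)
    used += granted
    granted
  }
  def release(numBytes: Long): Unit = synchronized { used -= numBytes }
  def bytesInUse: Long = synchronized { used }
}

class ToyExternalSorter(mm: ToyShuffleMemoryManager) {
  private var myMemory = 0L
  def insert(recordSize: Long): Unit = { myMemory += mm.tryToAcquire(recordSize) }
  // The bug fixed by this patch: a stop() that skips this release means the
  // memory stays "in use" for the rest of the task, starving later operators.
  def stop(): Unit = { mm.release(myMemory); myMemory = 0 }
}

val mm = new ToyShuffleMemoryManager(1024)
val sorter = new ToyExternalSorter(mm)
sorter.insert(512)
sorter.stop()
// With the release in stop(), no memory outlives the sorter.
assert(mm.bytesInUse == 0)
```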

This patch adds a regression test and fixes all three leaks. I'd like to backport this patch to Spark 1.5.x and possibly to other maintenance releases.

  - Fix leak in ExternalSorter.stop().
  - Use CompletionIterator in BlockStoreShuffleReader.
  - Fix leak in UnsafeShuffleWriter (this one wouldn't affect users, since the leak could only occur immediately before the task finished, but it broke the new tests).
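The second fix relies on the CompletionIterator pattern: wrap the sorter's iterator so that a cleanup callback fires exactly once when the iterator is drained. Spark has its own org.apache.spark.util.CompletionIterator; the minimal reimplementation below (ToyCompletionIterator) is only an illustration of the idea, letting BlockStoreShuffleReader call the equivalent of sorter.stop() automatically.

```scala
// Minimal sketch of the CompletionIterator idea (illustrative reimplementation,
// not Spark's actual org.apache.spark.util.CompletionIterator).
class ToyCompletionIterator[A](sub: Iterator[A], completion: () => Unit)
    extends Iterator[A] {
  private var completed = false
  def hasNext: Boolean = {
    val r = sub.hasNext
    // Fire the cleanup callback exactly once, when the iterator is exhausted.
    if (!r && !completed) { completed = true; completion() }
    r
  }
  def next(): A = sub.next()
}

var stopped = false
val it = new ToyCompletionIterator(Iterator(1, 2, 3), () => stopped = true)
val sum = it.sum  // fully consuming the iterator triggers the callback
assert(sum == 6 && stopped)
```

This keeps cleanup out of the caller's hands for the common fully-consumed case; an early-terminating consumer would still need to invoke stop() itself.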
@JoshRosen
Contributor Author

/cc @andrewor14 @rxin @davies @mateiz

Contributor Author

I debated whether to push this burden to callers rather than using a CompletionIterator here until I noticed that ExternalIterator() ends up calling currentMap.destructiveSortedIterator(), so (implicitly) this method already had the limitation that it could only be called once.
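The one-shot limitation mentioned above can be seen in a toy model. ToyAppendOnlyMap below is a hypothetical stand-in for ExternalAppendOnlyMap's backing map: a destructive sorted iterator gives up the map's backing storage to save memory, so the map supports exactly one iteration regardless of whether a CompletionIterator is layered on top.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy stand-in for a map with a destructive iterator (hypothetical class).
class ToyAppendOnlyMap {
  private var data: ArrayBuffer[(String, Int)] = ArrayBuffer("b" -> 2, "a" -> 1)
  // Sorts and hands the data off; the map is unusable afterwards.
  def destructiveSortedIterator(): Iterator[(String, Int)] = {
    val sorted = data.sortBy(_._1)
    data = null  // the backing storage is released to save memory
    sorted.iterator
  }
}

val m = new ToyAppendOnlyMap
assert(m.destructiveSortedIterator().toList == List("a" -> 1, "b" -> 2))
// A second call would throw: the map supports exactly one iteration, so
// wrapping the iterator with a completion callback forfeits nothing.
```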

@SparkQA

SparkQA commented Oct 24, 2015

Test build #44281 has finished for PR 9260 at commit f775445.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental): `// to its own URLs over the parent class loader (see Executor's createClassLoader method).`

@SparkQA

SparkQA commented Oct 24, 2015

Test build #44288 has finished for PR 9260 at commit 4bfa7e4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Oct 24, 2015

Test build #44291 has finished for PR 9260 at commit 4bfa7e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Without the fix, we will see this exception during tests, right?

Contributor Author

Yes.

@yhuai
Contributor

yhuai commented Oct 25, 2015

Looks good to me. Probably ask someone who is more familiar with this part of the code for the final sign-off.

@JoshRosen
Contributor Author

I've gone ahead and merged #9127, which contains something similar to these changes, but updated to reflect the memory manager unification. I still think that we should consider merging this patch as it stands now into 1.5.x and 1.4.x.

@JoshRosen
Contributor Author

Ping @yhuai, any objection to merging this to 1.5? A similar fix has already been incorporated into master as part of my memory manager consolidation patch. We could also choose not to merge / backport this patch, since AFAIK nobody has complained about this issue.

@yhuai
Contributor

yhuai commented Nov 3, 2015

Is it risky? It looks like it isn't.

If it is a safe patch, how about we add the check in Task.scala first to make sure that 1.5 does throw the exception? Then we add the fix and get it merged?

@JoshRosen
Contributor Author

@yhuai I think it's pretty low-risk. I'm going to close this PR and will continue work over at #9427, which is opened against branch-1.5.

@JoshRosen JoshRosen closed this Nov 3, 2015
@JoshRosen JoshRosen deleted the SPARK-11293 branch August 29, 2016 19:23