
Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

When we clean a closure, if its outermost parent is not a closure, we won't clone and clean it, as cloning a user's objects is dangerous. However, if it's a REPL line object, which may carry a lot of unnecessary references (like the Hadoop conf, Spark conf, etc.), we should clean it, since it's not a user object.

This PR improves the check for user objects to exclude REPL line objects.
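The distinction can be sketched as follows. This is a minimal, hypothetical illustration, not Spark's actual implementation (the object and method names, and the `isClosure` heuristic, are assumptions); the one concrete detail it leans on is that Scala REPL wrapper classes get synthetic names starting with "$line".

```scala
// Hypothetical sketch of a "safe to clone and clean" check.
// Scala REPL line wrapper classes have synthetic names such as
// "$line3.$read$$iw$$iw", so a name-prefix test can distinguish them
// from genuine user objects.
object ReplLineCheck {
  // Assumed stand-in for a closure check on anonymous function classes.
  def isClosure(cls: Class[_]): Boolean =
    cls.getName.contains("$anonfun$") || cls.getName.startsWith("scala.Function")

  // A REPL-generated wrapper is not a user object.
  def isReplLineObject(cls: Class[_]): Boolean =
    cls.getName.startsWith("$line")

  // Before this PR only closures qualified; with the change, REPL line
  // objects qualify too, while arbitrary user objects stay untouched.
  def safeToCleanTransitively(cls: Class[_]): Boolean =
    isClosure(cls) || isReplLineObject(cls)
}
```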

How was this patch tested?

Existing tests.

@cloud-fan
Contributor Author

cc @yhuai @JoshRosen @rxin

  private def getInnerClosureClasses(obj: AnyRef): List[Class[_]] = {
    val seen = Set[Class[_]](obj.getClass)
-   var stack = List[Class[_]](obj.getClass)
+   val stack = Stack[Class[_]](obj.getClass)
Contributor Author

Kind of unrelated, but using a Stack is clearly more efficient here.
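For illustration, here is the traversal pattern in isolation (a generic DFS sketch under assumed names, not the actual `getInnerClosureClasses` body): a mutable `Stack` supports in-place push/pop, whereas the original code rebuilt an immutable `List` on every iteration.

```scala
import scala.collection.mutable.{Set, Stack}

// Generic depth-first traversal using a mutable Stack, mirroring the
// seen/stack pattern used in getInnerClosureClasses.
def reachable(start: Int, edges: Map[Int, List[Int]]): Set[Int] = {
  val seen = Set(start)
  val stack = Stack(start)
  while (stack.nonEmpty) {
    val node = stack.pop()
    for (next <- edges.getOrElse(node, Nil) if !seen(next)) {
      seen += next
      stack.push(next)
    }
  }
  seen
}
```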

@SparkQA

SparkQA commented Apr 12, 2016

Test build #55604 has finished for PR 12327 at commit b78b2ce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

LGTM

@liancheng
Contributor

retest this please

@SparkQA

SparkQA commented Apr 12, 2016

Test build #55614 has finished for PR 12327 at commit b78b2ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

cc @andrewor14

@yhuai
Contributor

yhuai commented Apr 13, 2016

@JoshRosen Can you take a look?

-   } else if (outerPairs.size > 0) {
-     logDebug(s" + outermost object is a closure, so we just keep it: ${outerPairs.head}")
+   if (outerPairs.size > 0) {
+     if (isClosure(outerPairs.head._1)) {
Contributor
not really your code, but can you do:

val (outermostClass, outermostObject) = outerPairs.head
if (isClosure(outermostClass)) {
  ...
} else if (outermostClass.getName.startsWith("$line")) {
  ...
} else {
  ...
  parent = outermostObject
  outerPairs = outerPairs.tail
}

so it's more readable.

@andrewor14
Contributor

@cloud-fan Can you add a test to ClosureCleanerSuite2? Otherwise this LGTM.

@SparkQA

SparkQA commented Apr 14, 2016

Test build #55792 has finished for PR 12327 at commit 3db685c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import java.io.{ByteArrayInputStream, ByteArrayOutputStream}

import scala.collection.mutable.{Map, Set, Stack}
import scala.language.existentials
Contributor

not used

@andrewor14
Contributor

LGTM, merging into master. Thanks!

@asfgit asfgit closed this in 1d04c86 Apr 14, 2016
@a-roberts
Contributor

Hi, with this new test I'm seeing large deviations with IBM JDKs on different platforms, for example:

java.lang.AssertionError: assertion failed: deviation too large: 0.25359712230215825, first size: 26688, second size: 19920

Is the 20% deviation threshold particularly important? Why this number?

FYI here's the comparison between different Java vendors and architectures:

  • OpenJDK on Intel (passes): cacheSize1: 180392 and cacheSize2: 187896 (4.07%)
  • IBM JDK with SUSE on zSystems (fails): cacheSize1: 26688 and cacheSize2: 19920 (29%)
  • IBM JDK with Ubuntu 14.04 on Power 8 LE (fails): cacheSize1: 26688 and cacheSize2: 19920 (29%)
  • IBM JDK with Ubuntu 14.04 on Intel (fails): cacheSize1: 354692 and cacheSize2: 263800 (29.3%)

I'll look into this; it's interesting that with IBM Java it's always a very similar and much larger percentage.

@cloud-fan
Contributor Author

@a-roberts wow, that's interesting. I think the IBM JDK may bring more information into the Scala REPL line object, which increases the cache size. It would be great if you could look into it, thanks!

BTW, if this problem matters to you, feel free to send a PR to increase the threshold (20%).

@a-roberts
Contributor

@cloud-fan I've had a closer look at this and think a more robust method would be to use weak references to identify when an object is out of scope. With IBM Java we see the 29% reduction between cache size 1 and cache size 2, but with OpenJDK we see a 4% increase, suggesting that we can't rely on the sizes being similar across JDK vendors. I'm now thinking this is a test case issue rather than a problem in the ClosureCleaner or IBM Java code.
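The weak-reference idea could look something like the sketch below (the helper name is made up, and a real test would also need to defeat any remaining strong references): hold only a `WeakReference` to the wrapper object, clean the closure, then check whether the wrapper has become collectable. Note that `System.gc()` is only a hint, so such a test has to tolerate non-deterministic collection.

```scala
import java.lang.ref.WeakReference

// Sketch: after cleaning, a line object with no remaining strong
// references should eventually be reclaimable via its weak reference.
def isCollectable(ref: WeakReference[_ <: AnyRef], maxGcAttempts: Int = 5): Boolean = {
  var i = 0
  while (i < maxGcAttempts && ref.get() != null) {
    System.gc() // a hint only; collection is never guaranteed
    i += 1
  }
  ref.get() == null
}
```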

With IBM Java our second cache size (after repartitioning) is much smaller, whereas with OpenJDK it grows; repartitioning uses ContextCleaner. Either we have a bigger memory footprint or the cached size is being calculated incorrectly (it looks fine to me, and we actually have smaller object sizes). The problem on Z was due to using repl/pom.xml instead of the pom.xml in the Spark home directory (we get the same result with the right pom.xml file), so it can be discarded for this discussion.

I'm going to figure out what's in the Scala REPL line objects between vendors. I think the intention of this commit is to test that the REPL line object is being cleaned, but the assertion currently in place doesn't look correct (the size is bigger after the cleaning, and cacheSize2 is the result of cleaning, if I'm understanding the code correctly). Have I missed a trick?

@cloud-fan
Contributor Author

cacheSize1 and cacheSize2 are both sizes after cleaning. The difference is that cacheSize1 is the size after cleaning the data with the line object reference, while cacheSize2 is the size after cleaning the data without the line object reference.

And yes, the test missed the difference in the size of Scala REPL line objects between vendors. Feel free to send a PR to fix it, and thanks for investigating this!

@a-roberts
Contributor

cacheSize1 and cacheSize2 are both sizes after cleaning. The difference is that cacheSize1 is the size after cleaning the data with the line object reference, while cacheSize2 is the size after cleaning the data without the line object reference.

Looking for clarity here: is it true that cleaning "with the reference" should be bigger (cacheSize1) and cleaning "without the reference" should be smaller (cacheSize2)?

OpenJDK, cacheSize1: 180392, cacheSize2: 187896 (bigger without the line object reference)

IBM JDK, cacheSize1: 354692, cacheSize2: 263800 (smaller without the line object reference)

What exactly does "without line object reference" mean and should cacheSize1 be smaller or bigger than cacheSize2?

I know the SizeEstimator overestimates for IBM Java, so our cached footprint is much larger (we're handling this). Because of that larger difference we get this test failing; OpenJDK fails with Kryo and IBM passes with Kryo for this test.

A better check would be to run with and without the closure cleaner change, and to verify that the second result is smaller by the size of the line object. Based on our cacheSize2 being smaller (without the line object reference), I'm thinking that IBM Java functions as expected and OpenJDK doesn't, but this depends on my questions above. Interested to hear what you think, @cloud-fan.

@cloud-fan
Contributor Author

cloud-fan commented May 27, 2016

cacheSize1 and cacheSize2 should be almost the same, because they cache the same data, just with different RDD partition numbers. However, before this PR, cacheSize1 was much larger because it referenced the line object and didn't clean it.
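Judging from the assertion message quoted earlier ("deviation too large: 0.25359712230215825, first size: 26688, second size: 19920"), the check appears to be a relative-deviation comparison of the two cached sizes: |19920 − 26688| / 26688 ≈ 0.2536, which exceeds the 20% threshold under discussion. A sketch of such a check (the function name and signature are made up):

```scala
// Sketch of a relative-deviation assertion between two cache sizes,
// consistent with the quoted failure message.
def checkDeviation(size1: Long, size2: Long, tolerance: Double = 0.2): Unit = {
  val deviation = math.abs(size2 - size1).toDouble / size1
  assert(deviation < tolerance,
    s"deviation too large: $deviation, first size: $size1, second size: $size2")
}
```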

A better check would be to run with and without the closure cleaner change

Yeah, this is what I did locally, but how do we write a test for it?
