
Conversation

@davies
Contributor

davies commented Oct 30, 2015

After aggregation, the dataset could be smaller than the inputs, so it's better to do hash-based aggregation on all the inputs first, then use sort-based aggregation to merge the partial results.
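
For illustration, a minimal sketch of this hash-then-sort strategy in plain Python (not the actual Spark implementation; merge, max_map_size, and the spill handling here are illustrative assumptions):

import heapq

def aggregate(records, merge, max_map_size):
    # Phase 1: hash-based aggregation; spill a sorted run whenever the map is full.
    spills = []  # each spill is a list of (key, value) pairs sorted by key
    table = {}
    for key, value in records:
        if key in table:
            table[key] = merge(table[key], value)
        elif len(table) < max_map_size:
            table[key] = value
        else:
            spills.append(sorted(table.items()))
            table = {key: value}
    spills.append(sorted(table.items()))
    # Phase 2: sort-based aggregation; merging the sorted runs makes equal
    # keys adjacent, so each group is combined with O(1) state.
    result, cur_key, cur_val = [], object(), None
    for key, value in heapq.merge(*spills, key=lambda kv: kv[0]):
        if key == cur_key:
            cur_val = merge(cur_val, value)
        else:
            if cur_val is not None:
                result.append((cur_key, cur_val))
            cur_key, cur_val = key, value
    if cur_val is not None:
        result.append((cur_key, cur_val))
    return result

For example, aggregate([("a", 1), ("b", 2), ("a", 3)], lambda x, y: x + y, max_map_size=2) returns [("a", 4), ("b", 2)] without ever holding more than max_map_size groups in memory.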

@SparkQA

SparkQA commented Oct 30, 2015

Test build #44698 has finished for PR 9383 at commit 5707f5b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 30, 2015

Test build #44700 has finished for PR 9383 at commit 764f540.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 30, 2015

Test build #44708 has finished for PR 9383 at commit 55c47ed.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 30, 2015

Test build #44710 has finished for PR 9383 at commit 752c8e7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 31, 2015

Test build #44714 has finished for PR 9383 at commit 0512f1e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

After calling getCompactArray, the content of longArray is modified. Can this BytesToBytesMap still be used normally afterwards? The position in longArray for a key should be determined by (keyBase, keyOffset, keyLength); if the positions are modified, can methods such as safeLookup still work?

Contributor Author

No, after this the map is broken and should be freed later.
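
To see why, here is a toy open-addressed table standing in for longArray (illustrative Python, not the actual BytesToBytesMap code):

table = [None] * 8

def insert(key):
    i = hash(key) % len(table)
    while table[i] is not None and table[i] != key:
        i = (i + 1) % len(table)  # linear probing
    table[i] = key

def lookup(key):  # plays the role of safeLookup
    i = hash(key) % len(table)
    while table[i] is not None:
        if table[i] == key:
            return i
        i = (i + 1) % len(table)
    return None

insert("a")
insert("b")
# A destructive "compact" moves the live entries to the front of the array:
live = sorted(k for k in table if k is not None)
table[:] = live + [None] * (len(table) - len(live))
# lookup("a") may now return None: entries no longer sit at the slots their
# hashes point to, so the map is broken and can only be freed.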

Member

Should we add a comment for that?

@viirya
Member

viirya commented Oct 31, 2015

Besides, as we discussed in #9067, should we add a configuration for turning this feature on and off? It may not always be a performance win.

Member

Don't we need to catch OutOfMemoryError anymore?

@davies
Contributor Author

davies commented Oct 31, 2015

Currently the old one is broken, so I'd like to remove it. The new one should be as fast as the old one in the worst case, so I don't think we need a configuration for this.

@SparkQA

SparkQA commented Oct 31, 2015

Test build #44728 has finished for PR 9383 at commit 28f84e1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 31, 2015

Test build #44730 has finished for PR 9383 at commit 6fde4d5.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor Author

davies commented Nov 2, 2015

After some benchmarking, I realized that using the hash code as the sort prefix in TimSort causes a regression in both TimSort and Snappy compression (especially for aggregation after a join, where the order of records becomes random). I will revert that part.
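
A quick way to see the compression effect, using the standard-library zlib as a stand-in for Snappy/LZ4 (illustrative, not the Spark spill path):

import random, zlib

rows = [b"%016d" % (i % 1000) for i in range(100000)]  # ~100 rows per key
ordered = b"".join(sorted(rows))   # key order: equal rows sit next to each other
random.shuffle(rows)
shuffled = b"".join(rows)          # hash/random order
print(len(zlib.compress(ordered)), len(zlib.compress(shuffled)))
# the ordered layout typically compresses to a small fraction of the shuffled one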

benchmark code (run in the pyspark shell, where sqlContext is predefined):

sqlContext.setConf("spark.sql.shuffle.partitions", "1")
N = 1 << 25  # number of input rows
M = 1 << 20
df = sqlContext.range(N).selectExpr("id", "repeat(id, 2) as s")
df.show()
df2 = df.select(df.id.alias('id2'), df.s.alias('s2'))
j = df.join(df2, df.id == df2.id2).groupBy(df.s).max("id", "id2")
n = j.count()

Another interesting finding: Snappy slows down spilling by 50% of end-to-end time, and LZ4 is faster than Snappy but still 10% slower than no compression. Should we use false as the default value for spark.shuffle.spill.compress? (PS: tested on a Mac with an SSD; this may not hold on a spinning disk.)
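
For anyone who wants to reproduce the comparison: spill compression is a core Spark conf, so it has to be set before the SparkContext is created. A sketch:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.shuffle.spill.compress", "false")  # default is "true"
        .set("spark.io.compression.codec", "lz4"))     # or "snappy" (the default)
sc = SparkContext(conf=conf)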

@SparkQA

SparkQA commented Nov 2, 2015

Test build #44820 has finished for PR 9383 at commit 53dbdf2.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 2, 2015

Test build #44823 has finished for PR 9383 at commit 2e341f5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 2, 2015

Test build #44830 has finished for PR 9383 at commit 3864095.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 3, 2015

Test build #1970 has finished for PR 9383 at commit df44fc6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor Author

davies commented Nov 3, 2015

ping @yhuai @JoshRosen

@SparkQA

SparkQA commented Nov 3, 2015

Test build #44834 has finished for PR 9383 at commit df44fc6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Currently the old one is broken, so I'd like to remove it.

@davies, are you referring to the old Aggregate1 interface or the old implementation of sort fallback here?

@JoshRosen
Contributor

@davies, the block comment at the top of TungstenAggregationIterator is now out-of-date; do you mind updating it to reflect the new behavior?

Contributor

Ordinarily, this would end up deleting the spill files, but it doesn't because of the spillWriters.clear() call above. If you end up updating this patch, mind adding a one-line comment to explain this (since it's a subtle point)?

@SparkQA

SparkQA commented Nov 4, 2015

Test build #44972 has finished for PR 9383 at commit 6f3bb15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

@JoshRosen Do you remember why we need to clear this? Once it's cleared, how do we delete the spilled files?

Contributor

ping @JoshRosen

Contributor Author

Chatted with @JoshRosen offline; we should not clear spillWriters here.

Contributor

Just a note: we had a quick discussion, and it seems we should not call spillWriters.clear(); otherwise those spilled files will never be deleted.
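
The gist, as a toy sketch (plain Python, not the actual Java code; SpillWriter and the method names here are illustrative):

import os, tempfile

class SpillWriter:
    def __init__(self):
        fd, self.path = tempfile.mkstemp(prefix="spill-")
        os.close(fd)
    def delete(self):
        if os.path.exists(self.path):
            os.remove(self.path)

spill_writers = []

def spill():
    writer = SpillWriter()
    spill_writers.append(writer)  # must stay registered until cleanup
    return writer

def cleanup():
    for writer in spill_writers:  # if the list were cleared earlier,
        writer.delete()           # these files would be orphaned on disk
    spill_writers.clear()         # clearing after deletion is fine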

@davies
Contributor Author

davies commented Nov 4, 2015

@JoshRosen @yhuai I pushed a refactoring of this (reducing the chance of a full GC by reusing the array and map); please take another look.

@SparkQA

SparkQA commented Nov 4, 2015

Test build #44997 has finished for PR 9383 at commit fc5e052.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 4, 2015

Test build #45001 has finished for PR 9383 at commit 1c0c6c3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 4, 2015

Test build #1978 has finished for PR 9383 at commit 1c0c6c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

When we have numElements <= growthThreshold && !canGrowArray, is it guaranteed that our page still has space to put this key?

Contributor Author

No, we check the space in the page later.
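
A toy model of the insert path being discussed (illustrative names, not the BytesToBytesMap code), showing that the array-growth check and the page-space check are independent:

class ToyMap:
    def __init__(self, array_capacity, page_capacity):
        self.entries = {}                    # stands in for longArray + pages
        self.growth_threshold = array_capacity // 2
        self.can_grow_array = True           # set False when memory can't be acquired
        self.page_room = page_capacity       # bytes left in the current data page

    def try_insert(self, key, value, size):
        # Check 1: can the hash array hold another entry? Says nothing about pages.
        if len(self.entries) > self.growth_threshold and not self.can_grow_array:
            return False                     # caller falls back to sort-based merge
        # Check 2 (done later, independently): does the data page have room?
        if size > self.page_room:
            return False
        self.page_room -= size
        self.entries[key] = value
        return True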

@yhuai
Contributor

yhuai commented Nov 4, 2015

test this please

@SparkQA

SparkQA commented Nov 4, 2015

Test build #45062 has finished for PR 9383 at commit 10d7169.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Nov 4, 2015

LGTM pending Jenkins.

@SparkQA

SparkQA commented Nov 5, 2015

Test build #45078 has finished for PR 9383 at commit b1f8a99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor Author

davies commented Nov 5, 2015

Merging into master, thanks!

@asfgit closed this in 81498dd on Nov 5, 2015
