[SPARK-11425] [SPARK-11486] Improve hybrid aggregation #9383
Conversation

Test build #44698 has finished for PR 9383 at commit

Test build #44700 has finished for PR 9383 at commit

Test build #44708 has finished for PR 9383 at commit

Test build #44710 has finished for PR 9383 at commit

Test build #44714 has finished for PR 9383 at commit
After calling getCompactArray, the content of longArray is modified. Can this BytesToBytesMap still be used normally afterwards? The position in longArray for a key should be determined by (keyBase, keyOffset, keyLength), so if the positions are modified, can methods such as safeLookup still work?
No, after this the map is broken; it should be freed afterwards.

Should we add a comment explaining this?
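For readers of this thread, a minimal, self-contained sketch (hypothetical names; not Spark's actual source, which also encodes page numbers into the record address) of why the compaction breaks later lookups: once live slots are packed to the front of the array, a slot index can no longer be derived from the key's hash, so a safeLookup-style probe misses.

```java
final class PointerArraySketch {
  // Two longs per slot: [record address, key hash]; address 0 means empty.
  private final long[] longArray = new long[16 * 2];
  private final int mask = 16 - 1;

  // Insert assumes the table is never completely full.
  void put(long address, long hash) {
    int pos = (int) (hash & mask);
    while (longArray[2 * pos] != 0) {   // linear probing on collision
      pos = (pos + 1) & mask;
    }
    longArray[2 * pos] = address;
    longArray[2 * pos + 1] = hash;
  }

  // Analogous to safeLookup: correct only while slots stay hash-addressed.
  long lookup(long hash) {
    int pos = (int) (hash & mask);
    while (longArray[2 * pos] != 0) {
      if (longArray[2 * pos + 1] == hash) {
        return longArray[2 * pos];
      }
      pos = (pos + 1) & mask;
    }
    return 0;  // not found
  }

  // Analogous to getCompactArray: packs live entries to the front so a
  // sorter can consume the prefix. This destroys the probing invariant,
  // so the map must be freed afterwards, not queried again.
  void compact() {
    int dst = 0;
    for (int pos = 0; pos <= mask; pos++) {
      if (longArray[2 * pos] != 0) {
        longArray[2 * dst] = longArray[2 * pos];
        longArray[2 * dst + 1] = longArray[2 * pos + 1];
        if (dst != pos) {
          longArray[2 * pos] = 0;  // clear the vacated slot
        }
        dst++;
      }
    }
  }
}
```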
Besides, as we discussed in #9067, should we add a configuration for turning this feature on and off? It may not always yield a performance gain.
Don't we need to catch OutOfMemoryError anymore?
Currently, the old one is broken, so I'd like to remove it. The new one should be as fast as the old one in the worst case, so I don't think we need a configuration for this.
Test build #44728 has finished for PR 9383 at commit

Test build #44730 has finished for PR 9383 at commit
After some benchmark, realized that using hashcode as prefix in timsort will cause regression in timsort and snappy compression (especially for aggregation after join, the order of records will become random). I will revert that part. benchmark code: Another interesting finding is that Snappy will slowdown the spilling by 50% of end-to-end time, LZ4 will be faster than Snappy, but still 10% slower than not-compressed. Should we use |
Test build #44820 has finished for PR 9383 at commit

Test build #44823 has finished for PR 9383 at commit

Test build #44830 has finished for PR 9383 at commit

Test build #1970 has finished for PR 9383 at commit
ping @yhuai @JoshRosen

Test build #44834 has finished for PR 9383 at commit
@davies, are you referring to the old Aggregate1 interface or the old implementation of sort fallback here?
@davies, the block comment at the top of
Ordinarily, this would end up deleting the spill files, but it doesn't because of the spillWriters.clear() call above. If you end up updating this patch, mind adding a one-line comment to explain this (since it's a subtle point)?
Test build #44972 has finished for PR 9383 at commit
@JoshRosen Do you remember why we need to clear this? Once it's cleared, how do we delete the spilled files?
ping @JoshRosen
Chatted with @JoshRosen offline; we should not clear spillWriters here.
Just a note: we had a quick discussion, and it seems we should not call spillWriters.clear(); otherwise those spilled files will never be deleted.
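To make the point concrete, a simplified sketch (hypothetical class, modeled on the cleanup pattern used by Spark's external sorters) of why clearing the list leaks the files: the list is the only record of which spill files were written.

```java
import java.io.File;
import java.util.LinkedList;

final class SpillTrackerSketch {
  private final LinkedList<File> spillFiles = new LinkedList<>();

  void recordSpill(File file) {
    spillFiles.add(file);
  }

  // Cleanup walks the list to delete the on-disk files. If the list was
  // cleared earlier (as spillWriters.clear() did in this patch), this loop
  // sees nothing and the spill files are never deleted.
  void cleanupResources() {
    for (File f : spillFiles) {
      if (f.exists() && !f.delete()) {
        System.err.println("Unable to delete spill file: " + f);
      }
    }
    spillFiles.clear();  // safe to clear only after the files are gone
  }
}
```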
@JoshRosen @yhuai pushed a refactoring of this (reducing the possibility of a full GC by reusing the array and map); please take another look.
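A hedged sketch (hypothetical helper, not the PR's code) of the reuse idea: keep one long-lived buffer and reset it between passes instead of allocating a fresh one, so fewer large arrays are allocated and full GCs become less likely.

```java
import java.util.Arrays;

final class ReusableBuffer {
  private long[] array;

  long[] acquire(int minSize) {
    if (array == null || array.length < minSize) {
      array = new long[minSize];           // allocate only when we must grow
    } else {
      Arrays.fill(array, 0, minSize, 0L);  // otherwise reset and reuse
    }
    return array;
  }
}
```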
Test build #44997 has finished for PR 9383 at commit

Test build #45001 has finished for PR 9383 at commit

Test build #1978 has finished for PR 9383 at commit
When we have numElements <= growthThreshold && !canGrowArray, it is guaranteed that our page still has space to put this key, right?
No, we check the space in the page later.
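A hedged, self-contained sketch (hypothetical names, paraphrasing the control flow under discussion rather than quoting BytesToBytesMap) of that ordering: passing the growth-threshold check says nothing about the data page, whose space is checked later as a separate step.

```java
final class AppendPathSketch {
  private int numElements = 0;
  private int growthThreshold = 1 << 20;
  private boolean canGrowArray = true;   // flipped off once memory is tight
  private long pageRemaining = 1 << 16;  // bytes left in the current page

  boolean tryAppend(int keyLength, int valueLength) {
    // Step 1: pointer-array growth check.
    if (numElements >= growthThreshold) {
      if (!canGrowArray) {
        return false;                    // caller falls back to sorting/spilling
      }
      growthThreshold *= 2;              // stands in for growAndRehash()
    }
    // Step 2 (later, independent): does the current page have room?
    long required = 8L + keyLength + valueLength;  // assumed record framing
    if (pageRemaining < required) {
      return false;                      // would need a new page or a spill
    }
    pageRemaining -= required;           // stands in for the actual write
    numElements++;
    return true;
  }
}
```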
test this please

Test build #45062 has finished for PR 9383 at commit

LGTM pending jenkins.

Test build #45078 has finished for PR 9383 at commit

Merging into master, thanks!
After aggregation, the dataset could be smaller than the input, so it's better to do hash-based aggregation over all of the input first, then use sort-based aggregation to merge the partial results.
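As a rough illustration of that strategy, here is a hedged, self-contained sketch (hypothetical types and sizes; the real implementation operates on UnsafeRows with an off-heap BytesToBytesMap and streams a k-way merge of spill files rather than re-merging maps in memory):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class HybridAggSketch {
  static final int MAX_MAP_SIZE = 1 << 16;  // assumed memory budget

  // Counts occurrences per key: hash aggregation first, sorted merge after.
  static Map<String, Long> aggregate(Iterable<String> keys) {
    List<TreeMap<String, Long>> spills = new ArrayList<>();
    HashMap<String, Long> map = new HashMap<>();
    for (String key : keys) {
      map.merge(key, 1L, Long::sum);      // hash-based partial aggregation
      if (map.size() >= MAX_MAP_SIZE) {
        spills.add(new TreeMap<>(map));   // spill a sorted run to "disk"
        map.clear();
      }
    }
    if (spills.isEmpty()) {
      return map;                          // everything fit: pure hash agg
    }
    spills.add(new TreeMap<>(map));
    // Sort-based merge: partial aggregates for the same key are combined.
    TreeMap<String, Long> merged = new TreeMap<>();
    for (TreeMap<String, Long> run : spills) {
      run.forEach((k, v) -> merged.merge(k, v, Long::sum));
    }
    return merged;
  }
}
```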