-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-4084] Reuse sort key in Sorter #2937
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #22175 has started for PR 2937 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's somewhat surprising that a universal trait can have a default implementation, but maybe we can convert this to an abstract class to ensure it's still compiled into simple Java bytecode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also funny that Java ignores the protectedness, maybe we can upgrade all actually-public methods to public (that's everything but getKey(data: Buffer, pos: Int), which is only used internally)
|
Test build #22175 has finished for PR 2937 at commit
|
|
Test FAILed. |
|
@aarondav I updated the PR based on your comment. See the description for renaming |
|
Test build #22230 has started for PR 2937 at commit
|
|
Test build #22230 has finished for PR 2937 at commit
|
|
Test FAILed. |
|
test this please |
|
Test build #22231 has started for PR 2937 at commit
|
|
Test build #22231 has finished for PR 2937 at commit
|
|
Test PASSed. |
|
Test build #22307 has started for PR 2937 at commit
|
|
Test build #22307 has finished for PR 2937 at commit
|
|
Test PASSed. |
|
Test build #22312 has started for PR 2937 at commit
|
|
Test build #22312 has finished for PR 2937 at commit
|
|
Test build #22322 has started for PR 2937 at commit
|
|
Test build #22322 has finished for PR 2937 at commit
|
|
Test FAILed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add that it returns the total time across all iterations (which is not the behavior I expected).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
|
Test build #22334 has started for PR 2937 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe slight rephrasing of this to something like
Note: Declaring and instantiating multiple subclasses of this class would prevent JIT inlining overridden methods and hence may decrease the shuffle performance.
I just added "Note" since it's sort of a tangent and "may" since it may not be noticeable for non computationally-intensive workloads.
|
LGTM once it passes travis! |
|
Well, it won't pass |
|
Test build #22336 has started for PR 2937 at commit
|
|
Test build #22336 has finished for PR 2937 at commit
|
|
Test FAILed. |
|
Test build #22334 has finished for PR 2937 at commit
|
|
Test PASSed. |
|
@aarondav Does |
|
Test build #22365 has started for PR 2937 at commit
|
|
Test build #22365 has finished for PR 2937 at commit
|
|
Test PASSed. |
|
LGTM! Doing the merging. |
Sorter uses generic-typed key for sorting. When data is large, it creates lots of key objects, which is not efficient. We should reuse the key in Sorter for memory efficiency. This change is part of the petabyte sort implementation from @rxin .
The
Sorterclass was written in Java and marked package private. So it is only available toorg.apache.spark.util.collection. I renamed it toTimSortand add a simple wrapper of it, still calledSorter, in Scala, which isprivate[spark].The benchmark code is updated, which now resets the array before each run. Here is the result on sorting primitive Int arrays of size 25 million using Sorter:
So with key reuse, it is faster and less likely to trigger GC.