-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-15391] [SQL] manage the temporary memory of timsort #13318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
6d074f6
f651956
c72cdea
2ad469d
8b4e033
891c007
2ad8eb8
a929a06
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -221,7 +221,8 @@ public BytesToBytesMap( | |
| SparkEnv.get() != null ? SparkEnv.get().blockManager() : null, | ||
| SparkEnv.get() != null ? SparkEnv.get().serializerManager() : null, | ||
| initialCapacity, | ||
| 0.70, | ||
| // In order to re-use the longArray for sorting, the load factor cannot be larger than 0.5. | ||
| 0.5, | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we make this a constant and document why it was chosen (to enable radix sort)?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. given that this may cause
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As analized in the comment, this only have about 10% (or less) difference in practice, I'd like not to have this complexity.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This change also us to use radix sort when switching happens, so it's not always regression of performance. It's hard to measure the impact for different workloads, I'd like to prefer for robustness instead of small performance improvements for some workload.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sure, that sounds fair |
||
| pageSizeBytes, | ||
| enablePerfMetrics); | ||
| } | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also update the comments above while instantiating
arrayto better indicate the contents of the array (and talk about the additional buffer that's needed for sorting)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by the way, out of curiosity, what's the worst case scenario in TimSort that requires 0.5x more buffer space?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sameeragarwal I think you already had a test case for the worst case (there are two ordered part in the array, the shortest part will be copied into a temporary buffer, os it need 0.5 buffer space), it's doced in TimSort.