[SPARK-11583][Scheduler][Core] Make MapStatus use less memory #9559
Conversation
Test build #2020 has finished for PR 9559 at commit
retest plz

retest this please

Can you use the OpenHashSet in Spark?

retest this please
Test build #45450 has finished for PR 9559 at commit
indent these by 2 more spaces
@rxin OpenHashSet replaces HashSet
@andrewor14 Thanks for your advice
isSparse is the wrong name, I think -- both cases are sparse; it's a question of whether or not you are storing the empty blocks.
Sean has raised some important higher-level questions on the JIRA -- I'd like us to resolve the discussion there before moving forward on this.
@yaooqinn Can you close this now that you have the new patch?
OK, closing this PR; see #9661
Discussion of this PR is at https://issues.apache.org/jira/browse/SPARK-11583
In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I said, using a BitSet can save ≈20% memory compared to RoaringBitmap.
But for a Spark job containing quite a lot of tasks, that 20% is a drop in the ocean.
Essentially, a BitSet is backed by a long[]; for example, a BitSet of 200,000 bits occupies a long[3125].
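The arithmetic behind that figure can be sketched as follows (the helper name and object are mine for illustration, not part of the patch):

```scala
// Sketch: how many 64-bit words a BitSet needs for a given number of bits.
// BitSetSizing / wordsFor are hypothetical names, not from the patch.
object BitSetSizing {
  // Round the bit count up to whole 64-bit words, as a long[]-backed BitSet does.
  def wordsFor(numBits: Int): Int = (numBits + 63) / 64

  def main(args: Array[String]): Unit = {
    println(wordsFor(200000))     // 3125 longs
    println(wordsFor(200000) * 8) // 25000 bytes
  }
}
```

So tracking 200k blocks costs a flat 25 KB per MapStatus, regardless of how many blocks are actually empty.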
So instead we can use a HashSet[Int] to store reduceIds (when the non-empty blocks are dense, store the reduceIds of the empty blocks; when they are sparse, store the non-empty ones).
For the dense case: if the HashSet[Int] is smaller than a BitSet over totalBlockNum bits, I use MapStatusTrackingNoEmptyBlocks.
For the sparse case: if the HashSet[Int] is smaller than a BitSet over totalBlockNum bits, I use MapStatusTrackingEmptyBlocks.
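A minimal sketch of that selection logic, assuming a rough 8 bytes per Int entry in an open hash set (the cost constant and object name are my assumptions, not the patch's actual code):

```scala
// Sketch of choosing between a hash set of reduceIds and a full bitmap.
// The case-object names mirror the description above; the per-entry cost
// of a hash set entry (assumed 8 bytes) is a rough illustration only.
object MapStatusFormatChooser {
  sealed trait Format
  case object TrackingNoEmptyBlocks extends Format // dense: store empty reduceIds
  case object TrackingEmptyBlocks extends Format   // sparse: store non-empty reduceIds
  case object PlainBitSet extends Format           // fall back: one bit per block

  def choose(totalBlocks: Int, emptyBlocks: Int): Format = {
    val bitSetBytes   = ((totalBlocks + 63) / 64) * 8L
    val bytesPerEntry = 8L // assumed cost of one Int in an open hash set
    val denseBytes    = emptyBlocks * bytesPerEntry                 // store empty ids
    val sparseBytes   = (totalBlocks - emptyBlocks) * bytesPerEntry // store non-empty ids
    if (denseBytes < bitSetBytes && denseBytes <= sparseBytes) TrackingNoEmptyBlocks
    else if (sparseBytes < bitSetBytes) TrackingEmptyBlocks
    else PlainBitSet
  }
}
```

For example, 200k blocks with only 10 empty picks the dense variant (80 bytes of ids vs a 25 KB bitmap), while 300 blocks with 299 empty picks the sparse one.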
sparse case, 299/300 blocks are empty:

```scala
sc.makeRDD(1 to 30000, 3000).groupBy(x => x).top(5)
```

dense case, no block is empty:

```scala
sc.makeRDD(1 to 9000000, 3000).groupBy(x => x).top(5)
```
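Why the first example is sparse: each of the 3000 map tasks holds only 30000 / 3000 = 10 records, so at most 10 of its 3000 shuffle blocks can be non-empty. A rough upper bound on the non-empty fraction (my own estimate helper, assuming the number of reducers equals the number of map partitions as in both examples above):

```scala
// Hypothetical helper: upper bound on the fraction of non-empty shuffle
// blocks per map task, given total records and the partition count.
object EmptyBlockEstimate {
  def maxNonEmptyFraction(records: Long, numTasks: Int): Double = {
    val recordsPerTask = records.toDouble / numTasks
    // A task cannot fill more blocks than it has records or reducers.
    math.min(recordsPerTask, numTasks.toDouble) / numTasks
  }

  def main(args: Array[String]): Unit = {
    println(maxNonEmptyFraction(30000L, 3000))   // ~1/300 non-empty -> sparse
    println(maxNonEmptyFraction(9000000L, 3000)) // 1.0 -> potentially dense
  }
}
```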