[SPARK-11792] [SQL] SizeEstimator cannot provide a good size estimation of UnsafeHashedRelations #9788
Closed
Why not return Long? If a class extends this, it should return a Long.
At the driver side, `UnsafeHashedRelation` is using a Java HashMap.
Should the BytesToBytesMap implement this interface?
I think we do not need to do that.
`SizeEstimator.estimate` (the public method of `SizeEstimator`) is used in two places. One is the memory store and the other is the trait `SizeTracker` (a utility trait used to implement collections that need to track their estimated size). We do not put `BytesToBytesMap` into the memory store, right?
BytesToBytesMap is used by UnsafeHashedRelation, so it is put into the memory store. That's the root cause.
Another approach could be to remove the reference to BlockManager in BytesToBytesMap and use `SparkEnv.get` when needed. The difficulty could be how to fix the tests (which use a mocked BlockManager).
I'd like to get it merged first if there is no fundamental issue, so we can unblock the preview package. I can make the change if we prefer to change `BytesToBytesMap` instead of `UnsafeHashedRelation`. I agree that returning an `Option` is weird. But I feel that, if possible, we should prefer changing `UnsafeHashedRelation` because it is the one used as the broadcast variable.
Are we going to publish a preview tonight or tomorrow morning? I will try to send out a patch to fix BytesToBytesMap. If I can't make it before the preview is published, feel free to merge this one.
The high level approach of not relying on reflection and object walking is a good one -- actually with dataset and dataframes, we don't really need size estimation. I also think relying on thread locals and SparkEnv is much less ideal than explicit dependencies.
Either way, this pull request is ok to merge in its current shape, given it's fairly critical. We can do more changes later.
Created #9799.
@rxin We aren't passing BlockManager down to BytesToBytesMap; it already relies on a thread local.
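
For readers following the thread, the general idea under discussion (letting an object report its own size so the estimator does not have to walk it by reflection) can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in this pull request: the names `KnownSize`, `PackedRelation`, `SimpleSizeEstimator`, and `FALLBACK_GUESS` are hypothetical, and the fallback is a fixed placeholder rather than a real object walk. The interface returns a plain `long`, in line with the "Why not return Long?" comment above.

```java
/** Hypothetical interface: an object that can report its own size in bytes. */
interface KnownSize {
    long estimatedSize();
}

/** Hypothetical stand-in for a relation whose data lives in packed memory pages. */
final class PackedRelation implements KnownSize {
    private final long bytesUsed;

    PackedRelation(long bytesUsed) {
        this.bytesUsed = bytesUsed;
    }

    @Override
    public long estimatedSize() {
        // The object already tracks how many bytes it holds, so it just reports them.
        return bytesUsed;
    }
}

/** Hypothetical estimator: asks the object first, falls back to a generic guess. */
final class SimpleSizeEstimator {
    // Placeholder for a reflection-based walk over the object graph (assumption).
    static final long FALLBACK_GUESS = 64L;

    static long estimate(Object obj) {
        if (obj instanceof KnownSize) {
            // Self-reported size: no reflection or object walking needed.
            return ((KnownSize) obj).estimatedSize();
        }
        return FALLBACK_GUESS;
    }
}
```

With this shape, `SimpleSizeEstimator.estimate(new PackedRelation(1024L))` returns the self-reported 1024, while any object that does not implement the interface takes the fallback path.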