@alamb alamb commented Jul 27, 2021

Which issue does this PR close?

Re #786

Changes:

  1. Reduce the size of ScalarValue from 64 bytes to 32 bytes by boxing the internal parts of ScalarValue::List

Rationale for this change

  1. A smaller ScalarValue makes switching to it in hash aggregations and hash joins less expensive memory-wise, since one ScalarValue is instantiated for each distinct grouping value
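As a rough illustration of the rationale above, the sketch below estimates the inline memory held by one value per distinct group at 64 and at 32 bytes. The helper `inline_bytes` and the group count are hypothetical, and the estimate ignores heap data owned by the values as well as hash table overhead.

```rust
/// Inline bytes needed to hold one value per distinct group.
/// Ignores heap allocations owned by the values and any map overhead.
fn inline_bytes(distinct_groups: usize, value_size: usize) -> usize {
    distinct_groups * value_size
}

fn main() {
    // Hypothetical workload: ten million distinct grouping values.
    let groups = 10_000_000;
    let before = inline_bytes(groups, 64);
    let after = inline_bytes(groups, 32);
    println!(
        "before: {} MB, after: {} MB, saved: {} MB",
        before / 1_000_000,
        after / 1_000_000,
        (before - after) / 1_000_000
    );
}
```

The saving applies to every value of the type, including non-list values, because an enum is sized for its largest variant.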

What changes are included in this PR?

  1. Reduce the size of ScalarValue from 64 bytes to 32 bytes by boxing the internal parts of ScalarValue::List

Are there any user-facing changes?

No

@alamb alamb added the "api change" label (Changes the API exposed to users of the crate) Jul 27, 2021
// Since ScalarValues are used in a non-trivial number of places,
// making it larger means significantly more memory consumption
// per distinct value.
assert_eq!(std::mem::size_of::<ScalarValue>(), 32);
Contributor Author

Here is the test showing the size decrease.

-    List(Option<Vec<ScalarValue>>, DataType),
+    /// list of nested ScalarValue (boxed to reduce size_of(ScalarValue))
+    #[allow(clippy::box_vec)]
+    List(Option<Box<Vec<ScalarValue>>>, Box<DataType>),
Contributor Author

Here is the change; the rest of the PR is just follow-on work from this.
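
To illustrate why boxing the list payload shrinks the whole enum, here is a minimal sketch with stand-in types. `InlineValue`, `BoxedValue`, and the `[u64; 4]` payload standing in for DataType are hypothetical, not DataFusion's actual definitions. The point is only that an enum must be large enough for its largest variant, so moving the big fields behind a pointer shrinks every value of the type.

```rust
use std::mem::size_of;

// Hypothetical stand-in for the enum before the change: the list variant
// stores a Vec header and a 32-byte payload inline.
#[allow(dead_code)]
enum InlineValue {
    Int64(Option<i64>),
    List(Option<Vec<InlineValue>>, [u64; 4]),
}

// Same shape, but both list fields sit behind a Box, so the list variant
// holds only two pointers.
#[allow(dead_code)]
enum BoxedValue {
    Int64(Option<i64>),
    List(Option<Box<Vec<BoxedValue>>>, Box<[u64; 4]>),
}

fn main() {
    // Exact sizes depend on the target and compiler version, so they are
    // printed rather than asserted.
    println!("inline: {} bytes", size_of::<InlineValue>());
    println!("boxed:  {} bytes", size_of::<BoxedValue>());

    // The boxed layout is strictly smaller than the inline one.
    assert!(size_of::<BoxedValue>() < size_of::<InlineValue>());

    // Option<Box<T>> is pointer-sized: None is represented by the null pointer.
    assert_eq!(size_of::<Option<Box<Vec<i64>>>>(), size_of::<usize>());
}
```

The cost is an extra heap allocation and pointer hop when a list value is created or read, which is the trade-off the review comments discuss.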

Member

This is an interesting optimization 👍

Contributor

Box<[ScalarValue]> via Vec::into_boxed_slice would also be an option and would remove one pointer indirection, with the downside that the data would need to be copied if the Vec has excess capacity. @alamb do you think this would be worth exploring? I could prepare a PR since I have already started looking into the usages.

Contributor Author

@jhorstmann I think using Vec::into_boxed_slice would be just fine. I don't think ScalarValues are often (ever?) updated after creation, so using a boxed slice seems like a good idea.
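
The boxed-slice alternative discussed in this thread can be sketched as follows. The helper name `freeze` and the i64 element type are assumptions for illustration; this is not code from the PR. It shows `Vec::into_boxed_slice` dropping excess capacity and compares the pointer sizes of the two representations.

```rust
use std::mem::size_of;

// Hypothetical helper: convert a growable Vec into a fixed-length boxed slice.
// If the Vec has excess capacity, the conversion shrinks the allocation first,
// which may reallocate and copy the elements.
fn freeze(values: Vec<i64>) -> Box<[i64]> {
    values.into_boxed_slice()
}

fn main() {
    // A Vec that reserved more space than it uses.
    let mut values: Vec<i64> = Vec::with_capacity(16);
    values.extend([1, 2, 3]);
    assert!(values.capacity() >= 16);

    let boxed = freeze(values);
    assert_eq!(boxed.len(), 3);
    assert_eq!(boxed.to_vec(), vec![1i64, 2, 3]);

    // Box<Vec<T>> is a thin pointer to a heap-allocated Vec header, which
    // points to the elements in turn: two hops to reach the data.
    assert_eq!(size_of::<Box<Vec<i64>>>(), size_of::<usize>());

    // Box<[T]> is a pointer plus a length and points straight at the
    // elements: one hop, but the handle itself is two words wide.
    assert_eq!(size_of::<Box<[i64]>>(), 2 * size_of::<usize>());
}
```

Note the trade-off in the last two assertions: the boxed slice saves an indirection but is a two-word handle, so whether it would keep size_of::<ScalarValue>() at 32 bytes depends on the other variants and is not checked here.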

Contributor

@Dandandan Dandandan left a comment

Looks good to me.

Contributor Author

alamb commented Jul 28, 2021

I think keeping ScalarValue smaller will help in various places, even if we choose to go with something other than the implementation in #786.

@alamb alamb merged commit 4929590 into apache:master Jul 28, 2021
@alamb alamb deleted the alamb/reduce_size_of_scalar branch July 28, 2021 18:44