Speed up create_batch_from_map
#339
I think this makes a lot of sense. The reason we can't use Arrow arrays for this is that for now they are not mutable -- making some version of an ArrowVec would be helpful (I think I remember @ritchie46 mentioning he made something like this for polars-rs)
Thanks @Dandandan -- this is quite cool.
There appears to be a test failure on this PR. I can't say I followed all the details, but the overall approach looks really nice
FYI @jimexist -- the signature of ScalarValue::iter_to_array is changed in this PR
-    pub fn iter_to_array<'a>(
-        scalars: impl IntoIterator<Item = &'a ScalarValue>,
+    pub fn iter_to_array(
+        scalars: impl IntoIterator<Item = ScalarValue>,
Yeah, this is unfortunate -- I was trying too hard to avoid the need for owned ScalarValues -- but I think since ScalarValues effectively own the underlying storage, if the source data is in some other form, you end up having to create one anyway.
But I think this change is for the better; 👍
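A minimal sketch of why the owned signature is convenient, using a simplified stand-in for ScalarValue and a plain Vec in place of an Arrow array (both are assumptions for illustration, not DataFusion's actual definitions, and `iter_to_int64_column`, `from_raw` and `from_borrowed` are hypothetical names). When the source values are not already ScalarValues, the caller can map them into owned scalars lazily instead of first collecting a Vec of scalars just to borrow from it:

```rust
// Simplified stand-in for DataFusion's ScalarValue (illustration only).
#[derive(Debug, Clone, PartialEq)]
pub enum ScalarValue {
    Int64(Option<i64>),
    Utf8(Option<String>),
}

// Hypothetical analogue of `iter_to_array`: takes owned scalars and returns
// a Vec<Option<i64>> instead of an Arrow array.
pub fn iter_to_int64_column(
    scalars: impl IntoIterator<Item = ScalarValue>,
) -> Result<Vec<Option<i64>>, String> {
    scalars
        .into_iter()
        .map(|s| match s {
            ScalarValue::Int64(v) => Ok(v),
            other => Err(format!("expected Int64, got {:?}", other)),
        })
        .collect()
}

// Source data in "some other form": owned scalars are created on the fly,
// with no intermediate Vec<ScalarValue>.
pub fn from_raw(raw: &[i64]) -> Result<Vec<Option<i64>>, String> {
    iter_to_int64_column(raw.iter().map(|v| ScalarValue::Int64(Some(*v))))
}

// Callers that only hold scalars by reference clone each item as it is consumed.
pub fn from_borrowed(scalars: &[ScalarValue]) -> Result<Vec<Option<i64>>, String> {
    iter_to_int64_column(scalars.iter().cloned())
}
```

The cost moves to callers that only have references, which now pay a clone per scalar; that is the trade-off discussed in this thread.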
@@ -381,19 +380,74 @@ impl ScalarValue {
    )))
    }
    })
    .collect::<Result<Vec<_>>>()?;

    // it is annoying that one can not create
FYI @alamb also saw some opportunity simplifying / optimizing build_array_primitive / build_array_string
Looks 💯
Codecov Report
@@ Coverage Diff @@
## master #339 +/- ##
==========================================
- Coverage 74.86% 74.79% -0.07%
==========================================
Files 146 146
Lines 24495 24607 +112
==========================================
+ Hits 18338 18406 +68
- Misses 6157 6201 +44
LGTM. Great stuff!
Which issue does this PR close?
Closes #338
Closes #431
To be reviewed/merged after #320
Benchmark results db-benchmark:
#320
This PR (~20% faster for queries with smaller groups, NO OOM)
Rationale for this change
Previously, arrays were created per row in an inefficient way, using ScalarValue::to_array.
What changes are included in this PR?
This PR uses ScalarValue::iter_to_array to create arrays instead, which removes most intermediate Vecs / Arrays and the concatenation step. This is not as efficient as it could be when the data is already held in typed, contiguous memory, but it should be OK for most queries and is much better than before this PR.
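The difference can be sketched with plain Vecs standing in for Arrow arrays (a simplification; `per_row` and `single_pass` are hypothetical names, not functions in this PR). The per-row path builds a one-element array for each value and concatenates them, while the single-pass path writes every value into one preallocated buffer:

```rust
// Per-row approach: one tiny "array" per value, then concatenation.
pub fn per_row(values: &[Option<i64>]) -> Vec<Option<i64>> {
    let tiny_arrays: Vec<Vec<Option<i64>>> =
        values.iter().map(|v| vec![*v]).collect(); // one allocation per row
    tiny_arrays.concat() // extra copy into the final array
}

// Single-pass approach: one allocation, no intermediate arrays.
pub fn single_pass(
    values: impl ExactSizeIterator<Item = Option<i64>>,
) -> Vec<Option<i64>> {
    let mut out = Vec::with_capacity(values.len());
    out.extend(values);
    out
}
```

Both functions return the same result; only the number of allocations and copies differs.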
My view is that at some point data in aggregations should be stored in contiguous arrays and only referenced (with offsets) from other places.
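One way to picture that layout, as a hypothetical sketch rather than anything implemented in this PR: values live in a single contiguous buffer, and other structures (for example a group-by hash map) hold only a small index that resolves through an offsets array. The `StringColumn` type and its methods are invented for illustration:

```rust
// Contiguous storage for variable-length values plus an offsets array,
// similar in spirit to how Arrow lays out string arrays.
pub struct StringColumn {
    data: String,        // all values stored back to back
    offsets: Vec<usize>, // offsets[i]..offsets[i + 1] is the byte range of value i
}

impl StringColumn {
    pub fn new() -> Self {
        // The leading 0 marks the start of the first value.
        StringColumn { data: String::new(), offsets: vec![0] }
    }

    // Appends a value and returns the index that other structures can store.
    pub fn push(&mut self, value: &str) -> usize {
        self.data.push_str(value);
        self.offsets.push(self.data.len());
        self.offsets.len() - 2
    }

    // Resolves an index back to the value without copying.
    pub fn get(&self, idx: usize) -> &str {
        &self.data[self.offsets[idx]..self.offsets[idx + 1]]
    }
}
```

A map keyed on group values would then store the `usize` returned by `push` rather than an owned copy of each value.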
Are there any user-facing changes?
No