Migrated from `Arc<dyn Array>` to `Box<dyn Array>` #1042
Codecov Report
@@ Coverage Diff @@
## main #1042 +/- ##
==========================================
- Coverage 81.31% 81.09% -0.22%
==========================================
Files 365 363 -2
Lines 34902 34723 -179
==========================================
- Hits 28380 28160 -220
- Misses 6522 6563 +41
6818e10 to 1b950b9
@houqp, @Dandandan, @dbr, @sundy-li would you be available to review / comment / help on this one? We are trying to significantly improve the clone-on-write story, which is a bottleneck in Arrow, but would like to double-check with you that this does no harm.
may be of interest to @wesm also - I recall that Voltron was looking into avoiding allocations in compute kernels - the main point is (in C++ notation): since data is already inside a
In cases where data is not shared (which is often, e.g. coming from parquet, IPC, etc.), this has dramatic effects since it can lead to allocation-free compute for math. E.g. expressions like
#1061 motivates this PR (~2x speedup on horizontal ops, ~5-10x on horizontal logic ops)
This change sounds perfectly fine from our point of view - we mostly interact with arrow2 data either in read-only fashion or via FFI (and the changes to FFI seem pretty trivial to handle) - so this doesn't really impact us directly, and if it's faster and generally better, then, 🥳 One thing I don't quite understand is,
I thought
@dbr, great question. Let's consider the example added on this PR with an
This unfortunately does not compile because
but here we need to be careful - if we would have named the variable
Looks great to me. cc @andylokandy @leiysky @b41sh may be interested in it if databend can introduce this into new expression framework.
Silly question about this change. I understood that the use of
Forget this question, as soon as I finished writing I checked
Not sure if changelog is generated at version release time, but the only change I needed to make for my code that relied on chunks was to change Arc -> Box and it worked. 🚀
@@ -134,7 +133,14 @@ fn create_batch(size: usize) -> Result<Chunk> {
})
.collect();
<<<<<<< HEAD
It looks like merge conflicts were still in the commit. @jorgecarleitao
Fixed in #1069
This PR is a backward-incompatible change that extends to arrays the clone-on-write support initially added by @ritchie46 in #794.

The migration is simple: replace `Arc<dyn Array>` with `Box<dyn Array>`.

Background

The main goal of `Arc` is to enable sharing of data, typically because cloning is expensive. The tradeoff is that `Arc`s usually represent immutable data.

All our arrays are quite small structs. Their underlying data is usually a `DataType` and one or two `Arc`s that store the data itself.

Thus, `Arc`ing an `Array` does not add much value. OTOH, it makes the API clumsy, since we need to remember to arc everything. It also suggests to users that the `Array` is immutable. Furthermore, it makes it challenging to offer good support for clone-on-write semantics on arrays, since we need to check whether the array is being shared and, if so, clone its internals.

Currently,
Arc<dyn Array>
Arc<dyn Array>
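The sharing check described in the Background can be sketched with the standard library alone. This is a minimal illustration, not arrow2 code: `Int32Array` and `values_mut` are hypothetical names, and the struct only mimics the "small struct holding an `Arc` buffer" layout. It shows that with an outer `Arc`, mutable access requires both levels to be unshared.

```rust
use std::sync::Arc;

/// Hypothetical array (not arrow2's type): a small struct whose data
/// lives behind its own `Arc`.
struct Int32Array {
    values: Arc<Vec<i32>>,
}

/// Tries to get mutable access to the values through an outer `Arc`.
/// Returns `Some` only if neither the outer `Arc` nor the inner buffer is shared.
fn values_mut(array: &mut Arc<Int32Array>) -> Option<&mut Vec<i32>> {
    // Level 1: is the array itself shared?
    let inner = Arc::get_mut(array)?;
    // Level 2: is its buffer shared?
    Arc::get_mut(&mut inner.values)
}
```

Dropping the outer `Arc` removes the first of the two checks, which is roughly the simplification the PR argues for.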
My understanding is that there are three main reasons for us to be in this situation:
- `Box<dyn T>` where `T: Clone` is not great - it is essentially not possible in `std`.
- `arrow-rs` used to store everything under `Arc` (e.g. `ArrayData`, `Array`, data itself).
- the implementation that `arrow-rs` is inspired by seems to use `Arc` (`shared_ptr`)

This PR
This PR makes
- `Box<dyn Array>` instead of `Arc<dyn Array>`
- `Box<dyn Array>` instead of `Arc<dyn Array>`
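On the first reason listed in the Background (`Clone` is not directly available for boxed trait objects in `std`), one common workaround is an object-safe "clone into a box" method on the trait. The sketch below uses a hypothetical `Array` trait and `Int32Array` struct defined here for illustration; it is not a claim about how arrow2 implements this.

```rust
use std::fmt::Debug;
use std::sync::Arc;

/// Hypothetical stand-in for an array trait (an assumption for this sketch).
trait Array: Debug {
    fn len(&self) -> usize;
    /// Object-safe replacement for `Clone::clone`.
    fn to_boxed(&self) -> Box<dyn Array>;
}

/// Small struct whose data sits behind an `Arc`, so cloning it is cheap.
#[derive(Debug, Clone)]
struct Int32Array {
    values: Arc<Vec<i32>>,
}

impl Array for Int32Array {
    fn len(&self) -> usize {
        self.values.len()
    }

    fn to_boxed(&self) -> Box<dyn Array> {
        // Clones the struct (bumping the `Arc` count), not the buffer.
        Box::new(self.clone())
    }
}

/// With `to_boxed` available, `Box<dyn Array>` can implement `Clone`.
impl Clone for Box<dyn Array> {
    fn clone(&self) -> Self {
        self.to_boxed()
    }
}
```

With this in place, a `Vec<Box<dyn Array>>` can be cloned like any other vector while the underlying buffers stay shared.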
The major benefit of this behavior is that it makes it quite easy to add first-class support for clone-on-write semantics to the different Array APIs. As an example, with this PR we can now more easily:
- `Box<dyn Array>`)
- `((a * 10) + 2) * exp(-10 * a)`) without any extra allocation

This PR also adds an example illustrating how the API is used.
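To make the "no extra allocation" point concrete, here is a minimal standard-library sketch of clone-on-write applied to the expression mentioned above. It assumes a plain `Arc<Vec<f64>>` as the buffer and a hypothetical function `eval`; it is not the example added by the PR and does not use arrow2's API.

```rust
use std::sync::Arc;

/// Evaluates `((a * 10) + 2) * exp(-10 * a)` element-wise.
/// `eval` is a hypothetical helper for this sketch.
fn eval(mut a: Arc<Vec<f64>>) -> Arc<Vec<f64>> {
    // `Arc::make_mut` hands out the existing buffer when it is uniquely owned
    // and clones it first when it is shared (clone-on-write).
    let values = Arc::make_mut(&mut a);
    for v in values.iter_mut() {
        let x = *v;
        *v = ((x * 10.0) + 2.0) * (-10.0 * x).exp();
    }
    a
}
```

When the input is not shared, the whole expression is computed in the original buffer; when it is shared, a single copy is made and the other owners keep seeing the old values.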