We need a better data slicing mechanism than Box<dyn Array> #4884

Closed
teh-cmc opened this issue Jan 23, 2024 · 3 comments
Labels
📉 performance Optimization, memory use, etc ⛃ re_datastore affects the datastore itself

Comments

@teh-cmc
Member

teh-cmc commented Jan 23, 2024

Every DataCell is a slice into a larger chunk of arrow data living somewhere on the heap.

That slice is represented by the erased array type: Box<dyn Array>.

We've had plenty of performance issues caused by that erased type in the past. Perhaps most infamously, its erased refcount clone() implementation is very CPU-unfriendly and orders of magnitude slower than simply bumping an Arc refcount, which is why we introduced DataCell in the first place: so we could add our own refcounting layer on top.
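To illustrate the clone() issue, here's a minimal toy sketch (not the real arrow2 trait or layout; all names are illustrative): an erased clone through Box<dyn Trait> goes through a vtable call plus a fresh allocation on every call, while cloning an Arc is a single atomic refcount increment.

```rust
use std::sync::Arc;

// Toy stand-in for an erased array trait; purely illustrative.
trait Array {
    fn len(&self) -> usize;
    // Erased clone: virtual dispatch + a new box allocation every time
    // (the real arrow2 version also clones per-type metadata).
    fn boxed_clone(&self) -> Box<dyn Array>;
}

struct PrimitiveArray {
    values: Vec<u32>,
}

impl Array for PrimitiveArray {
    fn len(&self) -> usize {
        self.values.len()
    }
    fn boxed_clone(&self) -> Box<dyn Array> {
        Box::new(PrimitiveArray { values: self.values.clone() })
    }
}

fn main() {
    let arr: Box<dyn Array> = Box::new(PrimitiveArray { values: vec![1, 2, 3] });

    // Erased clone: vtable call + allocation.
    let _expensive = arr.boxed_clone();

    // Arc'd alternative: cloning is just an atomic increment, which is
    // what adding our own refcounting layer on top buys us.
    let shared: Arc<dyn Array> = Arc::from(arr);
    let cheap = Arc::clone(&shared);
    assert_eq!(cheap.len(), 3);
}
```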

The problems don't end there, unfortunately. Box<dyn Array> has huge space overhead.
Every Box<dyn Array> carries with it a DataType plus some type-specific metadata: this can easily add up to 100 bytes or more.

We've recently removed the heap overhead of DataType, but that's not nearly enough: it still takes 48 bytes of stack space (std::mem::size_of::<DataType>() = 48)!

Take e.g. a slice of uint32: std::mem::size_of::<arrow2::array::PrimitiveArray<u32>>() = 104.
That means if you're slicing a single uint32, you're introducing 2 orders of magnitude of overhead, and that's before we even take into account the cost of bucketing, timepoint metadata, etc.

Either we need to change our approach for small slices (e.g. TimeSeriesScalar), or we need a more efficient slicing mechanism.
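The space overhead can be eyeballed with std::mem::size_of. The mock below is illustrative only (it is not the real arrow2 layout; only the 48-byte DataType stack size comes from the measurements above), contrasted with a hypothetical compact slice that is just a shared buffer plus an (offset, len) range:

```rust
use std::mem::size_of;
use std::sync::Arc;

// Stand-in for arrow2's DataType: 48 bytes of stack space, per the
// measurement in this issue.
struct MockDataType([u8; 48]);

// Illustrative mock of the fields an erased primitive array drags along;
// not the actual arrow2 struct layout.
struct MockPrimitiveArray {
    data_type: MockDataType,
    validity: Option<Arc<Vec<u8>>>,
    values: Arc<Vec<u32>>,
    offset: usize,
    len: usize,
}

// Hypothetical compact slice: one shared buffer plus a range.
struct CompactSlice {
    values: Arc<Vec<u32>>,
    offset: u32,
    len: u32,
}

fn main() {
    // The DataType alone dwarfs the 4 bytes of actual payload when
    // slicing a single u32.
    println!("mock erased array: {} bytes", size_of::<MockPrimitiveArray>()); // 80
    println!("compact slice:     {} bytes", size_of::<CompactSlice>()); // 16
}
```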

@teh-cmc teh-cmc added ⛃ re_datastore affects the datastore itself 📉 performance Optimization, memory use, etc labels Jan 23, 2024
@emilk
Member

emilk commented Jan 23, 2024

I think we should revisit this after migrating from arrow2 to arrow:

Hopefully these are mostly problems that either don't exist in arrow, or can be fixed there.

@teh-cmc
Member Author

teh-cmc commented Jan 24, 2024

See #4883 (comment):

Trying to remove the stack overhead from arrow2 is extremely painful, it breaks basically every line of code; and will be nullified when we switch to arrow-rs anyhow. The only long-term viable solution I can think of is to implement our own slicing mechanism that isn't Box<dyn Array>, which has been the source of most of our compute and space problems since the beginning.

Isn't the long-term solution to switch to arrow-rs?

I'd say that switching to arrow-rs is the medium-term solution even, and should carry us a long-ish way.

Specifically, arrow-rs..:

  • ..works with Arc<dyn Array> instead of Box<dyn Array>, so no costly virtual clones.

  • ..makes sure to arc-ify DataType's internal fields to amortize heap costs (so same as what we did just now).

  • ..reduced the stack size of DataType from 48 to 24 bytes.

But, there is still a lot of overhead when it comes to slicing data, e.g. a u32 slice is still std::mem::size_of::<PrimitiveArray<UInt32Type>>() = 96 bytes per cell...
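One way to sidestep that per-cell slicing overhead entirely is a "dense" layout in the direction the later dense-chunks work points: many cells share one backing buffer, and each cell is just an (offset, len) pair. A minimal sketch, with hypothetical names:

```rust
use std::sync::Arc;

// Hypothetical dense layout: instead of one erased array (~96 bytes of
// stack overhead in arrow-rs) per cell, all cells share one buffer and
// each cell costs only an 8-byte (offset, len) pair.
struct DenseColumn {
    values: Arc<Vec<u32>>,  // one shared buffer for all cells
    cells: Vec<(u32, u32)>, // per-cell (offset, len)
}

impl DenseColumn {
    fn cell(&self, i: usize) -> &[u32] {
        let (off, len) = self.cells[i];
        &self.values[off as usize..(off + len) as usize]
    }
}

fn main() {
    let col = DenseColumn {
        values: Arc::new(vec![10, 20, 30, 40, 50]),
        cells: vec![(0, 2), (2, 3)],
    };
    assert_eq!(col.cell(0), &[10, 20]);
    assert_eq!(col.cell(1), &[30, 40, 50]);
    println!("per-cell overhead: {} bytes", std::mem::size_of::<(u32, u32)>()); // 8
}
```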

@teh-cmc
Member Author

teh-cmc commented May 16, 2024

Superseded by the work on dense chunks.

@teh-cmc teh-cmc closed this as not planned May 16, 2024