Add RunEndBuffer (#1799) #3817

tustvold · 2023-03-07T19:45:05Z

Which issue does this PR close?

Part of #1799

Rationale for this change

As part of #1799 we need an abstraction similar to BooleanBuffer but for RunArray. Much like BooleanBuffer this needs to store a logical offset and length, as the Buffer cannot simply be sliced directly.
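
To illustrate why a logical offset and length are stored instead of slicing the run-ends, here is a minimal sketch. It assumes `i32` run-ends held in a plain `Vec`; the `RunEndsSketch` type and its methods are hypothetical stand-ins, not the arrow-rs API. Because run-ends are cumulative positions, a slice that starts mid-run has no matching sub-slice of the run-ends, so the sketch only adjusts `offset` and `len`.

```rust
/// Hypothetical stand-in for a run-end buffer; not the arrow-rs type.
#[derive(Debug, Clone, PartialEq)]
struct RunEndsSketch {
    /// Cumulative end position of each run, left untouched by slicing.
    run_ends: Vec<i32>,
    /// Logical position at which this view starts.
    offset: usize,
    /// Number of logical positions this view covers.
    len: usize,
}

impl RunEndsSketch {
    /// Wraps the run-ends, covering every logical position they describe.
    fn new(run_ends: Vec<i32>) -> Self {
        let len = run_ends.last().map_or(0, |&end| end as usize);
        Self { run_ends, offset: 0, len }
    }

    /// Narrows the view by adjusting `offset` and `len` only.
    fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset.saturating_add(len) <= self.len, "slice out of bounds");
        Self {
            run_ends: self.run_ends.clone(),
            offset: self.offset + offset,
            len,
        }
    }

    /// Logical length, which can differ from the number of stored run-ends.
    fn len(&self) -> usize {
        self.len
    }
}
```

In this sketch, slicing run-ends `[6, 8, 9]` with offset `2` and length `5` keeps all three run-ends and reports a logical length of `5`.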

What changes are included in this PR?

Are there any user-facing changes?

This changes the API of RunArray to move away from returning PrimitiveArray, in line with the broader plan under #1176

tustvold · 2023-03-07T19:53:51Z

arrow-select/src/take.rs

@@ -2157,8 +2157,7 @@ mod tests {
        let take_out = take_run(&run_array, &take_indices).unwrap();

        assert_eq!(take_out.len(), 7);
-
-        assert_eq!(take_out.run_ends().len(), 5);
+        assert_eq!(take_out.run_ends().len(), 7);


run_ends().len() now returns the logical length

askoa · 2023-03-08T14:43:22Z

arrow-buffer/src/buffer/run.rs

+
+    /// Performs a binary search to find the physical index for the given logical index
+    pub fn get_physical_index(&self, logical_index: usize) -> Option<usize> {
+        if logical_index > self.len {


Suggested change

-        if logical_index > self.len {
+        if logical_index >= self.len {

tustvold

I opted to instead remove this check, as it doesn't appear to be necessary. PTAL

askoa

Sure. I think the newly added comment should be fixed.

askoa · 2023-03-08T14:51:03Z

arrow-buffer/src/buffer/run.rs

+        let logical_index = E::usize_as(self.offset + logical_index);
+        let cmp = |p: &E| p.partial_cmp(&logical_index).unwrap();
+
+        match self.run_ends.binary_search_by(cmp) {
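
The excerpt above ends at the `match`, so the following is only a sketch of how such a lookup could be completed. It assumes `i32` run-ends that are strictly increasing; the free function `physical_index` and its signature are hypothetical, not the method in this PR. An exact match on a run end means the position belongs to the next run, hence the `+ 1`.

```rust
/// Hypothetical helper: maps a logical index to the index of the run containing it.
/// Assumes `run_ends` is strictly increasing and the logical index is in range.
fn physical_index(run_ends: &[i32], offset: usize, logical_index: usize) -> usize {
    // Translate the logical index into an absolute position in the run-ends.
    let target = (offset + logical_index) as i32;
    match run_ends.binary_search(&target) {
        // `target` equals a run end, so it is the first position of the next run.
        Ok(idx) => idx + 1,
        // `idx` is the first run whose end is greater than `target`.
        Err(idx) => idx,
    }
}
```

With run-ends `[2, 3, 5]` and offset `0`, logical indices `0` and `1` map to run `0`, index `2` maps to run `1`, and indices `3` and `4` map to run `2`. For an out-of-range logical index the sketch can return a value that is not a valid run.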


askoa · 2023-03-08T15:04:46Z

arrow-buffer/src/buffer/run.rs

+
+    /// Performs a binary search to find the physical index for the given logical index
+    ///
+    /// The result is arbitrary if `logical_index > self.len()`


Suggested change

-    /// The result is arbitrary if `logical_index > self.len()`
+    /// The result is arbitrary if `logical_index >= self.len()`

~~Also, I think the function should panic if the logical index is not valid for given array.~~

askoa

LGTM. Just one minor comment. Not a big deal.

tustvold · 2023-03-08T15:29:56Z

Thank you for taking the time to review this 👍

alamb

Makes sense to me, I think this looks great @tustvold -- thank you

alamb · 2023-03-08T17:52:49Z

arrow-array/src/array/run_array.rs

@@ -347,7 +316,7 @@ impl<R: RunEndIndexType> std::fmt::Debug for RunArray<R> {
 ///     .map(|&x| if x == "b" { None } else { Some(x) })
 ///     .collect();
 /// assert_eq!(
-///     "RunArray {run_ends: PrimitiveArray<Int16>\n[\n  2,\n  3,\n  5,\n], values: StringArray\n[\n  \"a\",\n  null,\n  \"c\",\n]}\n",
+///     "RunArray {run_ends: [2, 3, 5], values: StringArray\n[\n  \"a\",\n  null,\n  \"c\",\n]}\n",


this is certainly a nicer API

alamb · 2023-03-08T17:54:42Z

arrow-buffer/src/buffer/run.rs

+/// describe the value indices `1, 1, 2, 2` for a RunArray
+///
+/// For example, a [RunEndBuffer] containing values `[6, 8, 9]` with offset `2` and length `5`
+/// would describe the value indices `0, 0, 0, 0, 1` for a RunArray


there are 4 zeros because 6 - 2 = 4, right?
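
The arithmetic in the question can be checked with a small sketch. It assumes `i32` run-ends and uses a linear scan for readability; `value_indices` is a hypothetical helper, not part of the PR.

```rust
/// Hypothetical helper: lists the value index for every logical position in a slice.
fn value_indices(run_ends: &[i32], offset: usize, len: usize) -> Vec<usize> {
    (offset..offset + len)
        .map(|pos| {
            // The containing run is the first one whose end lies past `pos`.
            run_ends
                .iter()
                .position(|&end| (pos as i32) < end)
                .expect("logical position past the last run end")
        })
        .collect()
}
```

For run-ends `[6, 8, 9]`, offset `2` and length `5`, the absolute positions are `2, 3, 4, 5, 6`. The first four are below `6` and fall in run `0` (so `6 - 2 = 4` zeros), and position `6` falls in run `1`.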

alamb · 2023-03-08T17:55:37Z

arrow-buffer/src/buffer/run.rs

+where
+    E: ArrowNativeType,
+{
+    /// Create a new [`RunEndBuffer`] from a [`ScalarBuffer`], an `offset` and `len`


I think we should add a note that this panics if the invariants are not satisfied (strictly monotonically increasing)
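
As a rough illustration of the invariants such a note would describe, the sketch below checks them explicitly. It assumes `i32` run-ends and returns a `Result` instead of panicking so the outcome is easy to inspect; `validate_run_ends` and its error messages are hypothetical, and the exact checks in the PR may differ.

```rust
/// Hypothetical validation of run-ends for a slice described by `offset` and `len`.
fn validate_run_ends(run_ends: &[i32], offset: usize, len: usize) -> Result<(), String> {
    // An empty slice places no requirements on the run-ends in this sketch.
    if len == 0 {
        return Ok(());
    }
    let last = match run_ends.last() {
        Some(&last) => last,
        None => return Err("non-empty slice but empty run-ends".to_string()),
    };
    // Every run end must be strictly greater than the one before it.
    if !run_ends.windows(2).all(|pair| pair[0] < pair[1]) {
        return Err("run-ends are not strictly increasing".to_string());
    }
    // The final run end must reach the end of the requested slice.
    let end = offset.saturating_add(len);
    if i64::from(last) < end as i64 {
        return Err("slice extends past the last run end".to_string());
    }
    Ok(())
}
```

Run-ends `[6, 8, 9]` with offset `2` and length `5` pass these checks, while `[6, 6, 9]` fails because two consecutive run-ends are equal.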

alamb · 2023-03-08T17:56:50Z

arrow-buffer/src/buffer/run.rs

+        }
+    }
+
+    /// Returns the logical offset into the run-ends stored by this buffer


Suggested change

-    /// Returns the logical offset into the run-ends stored by this buffer
+    /// Returns the logical offset into the run-ends stored by this buffer.
+    ///
+    /// See [`RunEndBuffer`] for interpretation of logical offset

alamb · 2023-03-08T17:57:12Z

arrow-buffer/src/buffer/run.rs

+        self.offset
+    }
+
+    /// Returns the logical length of the run-ends stored by this buffer


Suggested change

     /// Returns the logical length of the run-ends stored by this buffer
+    ///
+    /// See [`RunEndBuffer`] for interpretation of logical offset

alamb · 2023-03-08T17:59:20Z

arrow-buffer/src/buffer/run.rs

+use crate::buffer::ScalarBuffer;
+use crate::ArrowNativeType;
+
+/// A slice-able buffer of monotonically increasing, positive integers used to store run-ends


I think it would be good to define (or link to the definition of) the difference between physical and logical offsets / positions.

Could also be done as a follow on PR as well

ursabot · 2023-03-08T19:24:06Z

Benchmark runs are scheduled for baseline = 81ed334 and contender = 36f2db3. 36f2db3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

askoa · 2023-03-08T21:49:52Z

arrow-buffer/src/buffer/run.rs

+/// as there are `3` values, and the maximum logical index is `6`, as the maximum run end
+/// is `6`. The physical indices are therefore `[0, 0, 0, 1, 1, 2, 2]`
+///
+/// ```text


I think this is added after I reviewed. This is not correct. The array length is 6 per run_ends and 7 for physical array. Per run_ends, the grouping should be (0,1,2) , (3) and (4,5). The physical array is defined as (0,1,2), (3,4) and (5,6).

tustvold

Aah yes, I edited it to make the diagram and messed it up. Will fix tomorrow
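
The grouping described in the review comment can be reproduced with a short sketch. The run-ends `[3, 4, 6]` are an assumption inferred from the grouping `(0,1,2)`, `(3)`, `(4,5)`; the excerpt does not show them. `run_ranges` is a hypothetical helper.

```rust
/// Hypothetical helper: the half-open range of logical positions covered by each run.
fn run_ranges(run_ends: &[i32]) -> Vec<std::ops::Range<usize>> {
    let mut start = 0usize;
    run_ends
        .iter()
        .map(|&end| {
            let end = end as usize;
            // Each run begins where the previous one ended.
            let range = start..end;
            start = end;
            range
        })
        .collect()
}
```

Under that assumption the runs cover `0..3`, `3..4` and `4..6`, which is six logical positions in total rather than seven.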

askoa · 2023-03-08T23:49:39Z

arrow-buffer/src/buffer/run.rs

+            assert!(!run_ends.is_empty(), "non-empty slice but empty run-ends");
+            let end = E::from_usize(offset.saturating_add(len)).unwrap();
+            assert!(
+                *run_ends.first().unwrap() >= E::usize_as(0),


Missed this during original review. Should be >0 and not >=0

Add RunEndBuffer (apache#1799)

5347a44

tustvold added the api-change Changes to the arrow API label Mar 7, 2023

github-actions bot added the arrow Changes to the arrow crate label Mar 7, 2023

tustvold marked this pull request as draft March 7, 2023 19:48

Fix test

f44c312

tustvold marked this pull request as ready for review March 7, 2023 19:50

Revert rename

06bdea2

tustvold commented Mar 7, 2023

tustvold added 2 commits March 7, 2023 19:55

Format

5a98deb

Clippy

d7561c8

askoa reviewed Mar 8, 2023

tustvold mentioned this pull request Mar 8, 2023

Add ArrayDataLayout, port validation (#1799) #3818

Merged

askoa reviewed Mar 8, 2023

Remove unnecessary check

39d993b

askoa reviewed Mar 8, 2023

askoa approved these changes Mar 8, 2023

Fix

2a7d310

Tweak docs

8e7727b

alamb approved these changes Mar 8, 2023

Add docs

37815ef

tustvold merged commit 36f2db3 into apache:master Mar 8, 2023

askoa reviewed Mar 8, 2023

tustvold mentioned this pull request Mar 9, 2023

RunEndBuffer review feedback #3825

Merged
