findepi
Member

@findepi findepi commented Jun 18, 2025

Which issue does this PR close?

none

Rationale for this change

This is necessary to support DISTINCT and GROUP BY over fixed-sized arrays in DataFusion.

What changes are included in this PR?

Add DataType::FixedSizeList support to RowConverter.

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 18, 2025
@findepi findepi force-pushed the findepi/support-grouping-by-fixedsizelist-a01caf branch from 5e19043 to 4551310 Compare June 18, 2025 13:10
Contributor

@tustvold tustvold left a comment

It should be relatively straightforward to avoid the cast and to encode/decode the array directly. This would avoid adding two very heavy dependencies in the form of arrow-cast and, by extension, arrow-select.

Further, it is actually wasteful to encode fixed-size lists this way. We don't need to var-encode them; we can simply encode the values directly one after another, much like we do for StructArray.
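
As a rough illustration of the direct encoding suggested here, the sketch below writes each element of a fixed-size list back to back. It is not the arrow-row implementation: `encode_i32` and `encode_fixed_size_list` are hypothetical names, and the per-element layout (a validity byte followed by big-endian bytes with the sign bit flipped) is an assumption modeled on how byte-comparable encodings of fixed-width integers usually work.

```rust
/// Hypothetical helper: encode one nullable i32 so that byte-wise comparison
/// matches numeric order. Assumed layout: validity byte, then big-endian
/// bytes with the sign bit flipped.
fn encode_i32(value: Option<i32>, out: &mut Vec<u8>) {
    match value {
        Some(v) => {
            out.push(1); // valid marker
            let mut bytes = v.to_be_bytes();
            bytes[0] ^= 0x80; // flip the sign bit so negatives sort first
            out.extend_from_slice(&bytes);
        }
        None => {
            out.push(0); // null marker (nulls sort first in this sketch)
            out.extend_from_slice(&[0u8; 4]); // padding keeps the width fixed
        }
    }
}

/// Hypothetical helper: encode a non-null fixed-size list of N elements by
/// writing the elements one after another, with no length prefix.
fn encode_fixed_size_list<const N: usize>(list: &[Option<i32>; N]) -> Vec<u8> {
    let mut out = Vec::with_capacity(N * 5);
    for element in list {
        encode_i32(*element, &mut out);
    }
    out
}
```

With this layout, `[Some(1), Some(-5), None]` encodes to 15 bytes and compares below `[Some(1), Some(2), None]` under plain byte comparison, because every row of the same list type has the same width.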

@alamb alamb changed the title Support grouping by FixedSizeList Support FixedSizeList RowConverter Jun 18, 2025
@findepi
Member Author

findepi commented Jun 19, 2025

@tustvold thanks for the feedback!
My thought process was that this might be a niche use case, so I optimized for minimal code size with the intent of best maintainability. However, I take your judgement on the dependencies (which also impact maintainability). Will update.

findepi added 3 commits June 23, 2025 14:50
The test data doesn't contain the value 42.
Validate that null values are correctly masked out.
Add `DataType::FixedSizeList` support to `RowConverter`. This is
necessary to support DISTINCT and GROUP BY over fixed-sized arrays in
DataFusion.
@findepi findepi force-pushed the findepi/support-grouping-by-fixedsizelist-a01caf branch from 4551310 to 667cbdf Compare June 23, 2025 12:53
@findepi findepi marked this pull request as ready for review June 23, 2025 12:53
@findepi
Member Author

findepi commented Jun 23, 2025

Updated to remove casts.
There is still variable-length encoding: a null FSL[n] is now encoded as a null sentinel byte.
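
To make the variable-length point concrete, here is a minimal sketch of why a list-level null sentinel makes row lengths differ. It assumes a null row is written as one `0x00` byte and a valid row as `0x01` followed by its elements; the function name and the choice of plain `u8` elements are hypothetical simplifications, not the code in this PR.

```rust
/// Hypothetical sketch: row bytes for an optional fixed-size list of u8
/// values. Assumes nulls sort first, so the null sentinel is 0x00.
fn encode_nullable_list(list: Option<&[u8]>) -> Vec<u8> {
    match list {
        // A null list is just the sentinel: 1 byte.
        None => vec![0x00],
        // A valid list is a marker byte followed by the element bytes.
        Some(values) => {
            let mut out = Vec::with_capacity(1 + values.len());
            out.push(0x01);
            out.extend_from_slice(values);
            out
        }
    }
}
```

Under these assumptions a null row takes 1 byte and a valid row of n elements takes 1 + n bytes, so the encoded length depends on the row even though the list size is fixed.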

Contributor

@alamb alamb left a comment

Thank you @findepi -- I won't say I fully follow the logic, but I did verify the tests and the validation performed, and this seems to follow the pattern of the other converters.

It might also be worth adding a benchmark in arrow/benches/row_format.rs in case someone wants to try to optimize this code more in the future

}

pub fn compute_lengths_fixed_size_list(
tracker: &mut LengthTracker,
Contributor

Just pattern matching, I wonder why this doesn't use the same pattern as List?

Why not

Suggested change
- tracker: &mut LengthTracker,
+ lengths: &mut [usize],

(I don't see anything wrong with this; I am just curious.)

Member Author

The caller used to operate on lengths: &mut [usize], which is why the list helper functions have it in their API.
The caller has since been migrated to tracker: &mut LengthTracker, but the helper functions for list haven't been updated.
If this code were inline in the caller (as it is for structs, for example), it would operate on LengthTracker directly.
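
For readers unfamiliar with the two styles, the sketch below shows roughly what a tracker abstraction buys over a raw `&mut [usize]`: per-row lengths are only materialized once some column actually varies per row. `SketchTracker` and its methods are hypothetical stand-ins written for this illustration and should not be read as the actual `LengthTracker` API.

```rust
/// Hypothetical stand-in for a row length tracker (not the arrow-row type).
enum SketchTracker {
    /// Every row so far has the same encoded length.
    Fixed { length: usize, num_rows: usize },
    /// Rows differ: a shared fixed part plus a per-row variable part.
    Variable { fixed_length: usize, lengths: Vec<usize> },
}

impl SketchTracker {
    fn new(num_rows: usize) -> Self {
        Self::Fixed { length: 0, num_rows }
    }

    /// Add the same number of bytes to every row.
    fn push_fixed(&mut self, extra: usize) {
        match self {
            Self::Fixed { length, .. } => *length += extra,
            Self::Variable { fixed_length, .. } => *fixed_length += extra,
        }
    }

    /// Add a different number of bytes to each row.
    fn push_variable(&mut self, per_row: impl ExactSizeIterator<Item = usize>) {
        match self {
            Self::Fixed { length, num_rows } => {
                assert_eq!(per_row.len(), *num_rows);
                let fixed_length = *length;
                let lengths: Vec<usize> = per_row.collect();
                // First variable-width column: switch representation.
                *self = Self::Variable { fixed_length, lengths };
            }
            Self::Variable { lengths, .. } => {
                assert_eq!(per_row.len(), lengths.len());
                for (total, extra) in lengths.iter_mut().zip(per_row) {
                    *total += extra;
                }
            }
        }
    }

    /// Materialize the total encoded length of each row.
    fn row_lengths(&self) -> Vec<usize> {
        match self {
            Self::Fixed { length, num_rows } => vec![*length; *num_rows],
            Self::Variable { fixed_length, lengths } => {
                lengths.iter().map(|l| l + fixed_length).collect()
            }
        }
    }
}
```

In this sketch, a helper that computes lengths for a nullable fixed-size list would call `push_variable` with one length per row (for example 1 for a null row and 1 plus the element bytes for a valid row), while fixed-width columns only call `push_fixed`.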

@findepi
Member Author

findepi commented Jun 25, 2025

It might also be worth adding a benchmark in arrow/benches/row_format.rs in case someone wants to try to optimize this code more in the future

I didn't find an example with a normal list to follow. I'd suggest doing this as a follow-up.

BTW with the row format so precisely documented, is the format itself set in stone, or subject to change?

@alamb
Contributor

alamb commented Jun 25, 2025

It might also be worth adding a benchmark in arrow/benches/row_format.rs in case someone wants to try to optimize this code more in the future

I didn't find an example with a normal list to follow. I'd suggest doing this as a follow-up.

Makes sense

BTW with the row format so precisely documented, is the format itself set in stone, or subject to change?

I don't know of any policy / discussion on this topic (so the answer is "I don't know").

Part of the rationale to document the Row Format was that it was a pretty tricky thing to make correct -- I don't remember any rationale about not changing it. I also don't know of any use of it as a long-term interchange format.

@alamb alamb merged commit d7fc416 into apache:main Jun 25, 2025
26 checks passed
@alamb
Contributor

alamb commented Jun 25, 2025

Thanks again @findepi

@findepi findepi deleted the findepi/support-grouping-by-fixedsizelist-a01caf branch June 25, 2025 20:41
findepi added a commit to sdf-labs/arrow-rs that referenced this pull request Jun 26, 2025
# Which issue does this PR close?

none

# Rationale for this change

This is necessary to support DISTINCT and GROUP BY over fixed-sized
arrays in DataFusion.

# What changes are included in this PR?

Add `DataType::FixedSizeList` support to `RowConverter`.

# Are there any user-facing changes?

No

(cherry picked from commit d7fc416)
findepi added a commit to sdf-labs/arrow-rs that referenced this pull request Jun 27, 2025
(cherry picked from commit d7fc416)
findepi added a commit to sdf-labs/arrow-rs that referenced this pull request Sep 1, 2025
(cherry picked from commit d7fc416)
findepi added a commit to sdf-labs/arrow-rs that referenced this pull request Sep 5, 2025
(cherry picked from commit d7fc416)