perf: optimize array_distinct with batched row conversion #20364

Open
lyne7-sc wants to merge 2 commits into apache:main from lyne7-sc:perf/array_distinct

@lyne7-sc
Contributor

Which issue does this PR close?

  • Closes #.

Rationale for this change

This PR optimizes the array_distinct function by batching value conversions and utilizing a HashSet for deduplication.
It is a follow-up to #20243.

What changes are included in this PR?

This PR optimizes array_distinct by:

  1. Converting all values to rows in a single batch rather than individually.
  2. Using a HashSet to deduplicate values for each list.
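The two steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: `encode_all` is a hypothetical stand-in for the batched row conversion, `distinct_per_list` is an illustrative name, and the offsets are assumed to follow the Arrow list layout, where list `i` spans `offsets[i]..offsets[i + 1]`.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the batched conversion: every element of every
/// list is encoded once, up front, into a byte row.
fn encode_all(values: &[i32]) -> Vec<Vec<u8>> {
    values.iter().map(|v| v.to_be_bytes().to_vec()).collect()
}

/// Deduplicate each list of a flattened list column, keeping first occurrences.
///
/// `rows` holds one encoded row per element across all lists; list `i` covers
/// `rows[offsets[i]..offsets[i + 1]]`. Returns the kept rows and new offsets.
fn distinct_per_list(rows: &[Vec<u8>], offsets: &[usize]) -> (Vec<Vec<u8>>, Vec<usize>) {
    let mut out_rows = Vec::with_capacity(rows.len());
    let mut out_offsets = Vec::with_capacity(offsets.len());
    out_offsets.push(0);

    // One set for the whole call, cleared between lists.
    let mut seen: HashSet<&[u8]> = HashSet::new();

    for window in offsets.windows(2) {
        seen.clear();
        for row in &rows[window[0]..window[1]] {
            // `insert` returns true only the first time a row is seen.
            if seen.insert(row.as_slice()) {
                out_rows.push(row.clone());
            }
        }
        out_offsets.push(out_rows.len());
    }
    (out_rows, out_offsets)
}
```

For example, with the lists `[1, 2, 1]` and `[3, 3, 3, 4]` flattened to seven rows with offsets `[0, 3, 7]`, the sketch keeps `[1, 2]` and `[3, 4]` and returns offsets `[0, 2, 4]`.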

Benchmark

group                                main                                   optimized
-----                                ----                                   ---------
array_distinct/high_duplicate/10     2.66   855.1±28.18µs        ? ?/sec    1.00    321.9±8.70µs        ? ?/sec
array_distinct/high_duplicate/100    2.21      6.4±0.13ms        ? ?/sec    1.00      2.9±0.09ms        ? ?/sec
array_distinct/high_duplicate/50     2.14      3.2±0.05ms        ? ?/sec    1.00  1478.3±41.90µs        ? ?/sec
array_distinct/low_duplicate/10      2.73  1017.3±44.67µs        ? ?/sec    1.00   372.5±17.33µs        ? ?/sec
array_distinct/low_duplicate/100     1.32      4.4±0.13ms        ? ?/sec    1.00      3.3±0.15ms        ? ?/sec
array_distinct/low_duplicate/50      1.55      2.6±0.06ms        ? ?/sec    1.00  1689.0±94.15µs        ? ?/sec

Are these changes tested?

Yes, unit tests exist and pass.

Are there any user-facing changes?

Yes, the output order changes slightly. array_distinct now preserves the original order of elements in the array, which is more intuitive and consistent with array_union and array_intersect.
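To make the ordering point concrete, here is a small sketch of first-occurrence deduplication next to a sort-then-dedup variant. The sorted variant is shown only for contrast. This description does not say what order the previous implementation produced, so treat that function as an assumption rather than a statement about the old behavior. Both function names are illustrative.

```rust
use std::collections::HashSet;

/// Keep the first occurrence of each value, in input order.
fn distinct_first_occurrence(values: &[i64]) -> Vec<i64> {
    let mut seen = HashSet::with_capacity(values.len());
    // `insert` returns false for values already in the set, so repeats are dropped.
    values.iter().copied().filter(|v| seen.insert(*v)).collect()
}

/// Sort, then drop adjacent duplicates. Shown only for contrast.
fn distinct_sorted(values: &[i64]) -> Vec<i64> {
    let mut sorted = values.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    sorted
}
```

For the input `[3, 1, 3, 2, 1]`, the first function returns `[3, 1, 2]` and the second returns `[1, 2, 3]`.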

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 15, 2026

// Convert all values to row format in a single batch for performance
let converter = RowConverter::new(vec![SortField::new(dt.clone())])?;
let rows = converter.convert_columns(&[Arc::clone(array.values())])?;
Contributor

As a follow-up, I think this could reuse the Rows and HashSet allocations between batches.

Contributor Author

Thanks for the suggestion! I'll look into reusing the allocations. I'm not sure whether thread_local would help; are there any tips or better approaches you'd recommend?
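As one possible shape for the thread_local idea, here is a minimal std-only sketch of a per-thread scratch set that is cleared rather than reallocated on each call. It is an assumption-laden illustration, not a proposal for this code path: `SEEN`, `distinct_with_scratch`, and `scratch_capacity` are hypothetical names, and the set stores owned `i64` keys. A set keyed by borrowed row bytes would be tied to the lifetime of one batch's rows, so it could not be kept in a `thread_local!` in this form.

```rust
use std::cell::RefCell;
use std::collections::HashSet;

thread_local! {
    /// Per-thread scratch set, reused across calls instead of reallocated.
    static SEEN: RefCell<HashSet<i64>> = RefCell::new(HashSet::new());
}

/// Deduplicate `values` in first-occurrence order using the thread-local set.
fn distinct_with_scratch(values: &[i64]) -> Vec<i64> {
    SEEN.with(|cell| {
        let mut seen = cell.borrow_mut();
        // Removes old entries but keeps the allocated memory for reuse.
        seen.clear();
        values.iter().copied().filter(|v| seen.insert(*v)).collect()
    })
}

/// Capacity currently held by this thread's scratch set (for inspection only).
fn scratch_capacity() -> usize {
    SEEN.with(|cell| cell.borrow().capacity())
}
```

Whether this helps in practice would need a benchmark. A simpler alternative is to hold the scratch set in a local variable or struct field that lives across the per-list loop, which avoids the thread-local lookup entirely.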
