perf: Faster `string_agg()` aggregate function (1000x speed for no DISTINCT and ORDER case) #17837

2010YOUY01 · 2025-09-30T10:03:20Z

Which issue does this PR close?

Rationale for this change

string_agg is slow; see the tracking issue for details.

This PR adds a new Accumulator implementation for the simple case: if the call has no DISTINCT and no ORDER BY (that is, unlike string_agg(distinct str, ',' ORDER BY str)), SimpleStringAggAccumulator is used instead. The original StringAggAccumulator is still used for the general case with DISTINCT and/or ORDER BY.

While @vegarsti is working on a GroupsAccumulator solution that could potentially speed it up further, I think this PR is still necessary because the single-group case, such as select string_agg(str, ',') from t1, still uses the Accumulator interface instead of GroupsAccumulator.

I haven't checked the original implementation yet, and I don't know why it is so slow. It uses array_agg internally, and memory bloat can be observed. My guess is that the cause is redundant transcoding and incorrect operations on StringView buffers.
There is ongoing work to improve array_agg: #17829

Benchmark

It's around 1000x faster. See the original issue for the data setup; scale factor 0.1 is used for the table.

DataFusion CLI v50.0.0
> CREATE EXTERNAL TABLE partsupp
STORED AS PARQUET
LOCATION '/Users/yongting/Code/datafusion-sqlstorm/data/partsupp.parquet';

> select ps_partkey, string_agg(ps_comment, ';')
from partsupp
group by ps_partkey;

Before: ~50s
PR: 0.05s

What changes are included in this PR?

Implemented a SimpleStringAggAccumulator for string_agg
In the aggregate function implementation, opt for the new accumulator if there is no DISTINCT and ORDER BY in the string_agg() aggregate function.
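The dispatch described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `StringAggCall`, `AccumulatorKind`, and `choose_accumulator` are hypothetical names, and the two booleans stand in for whatever the real accumulator arguments expose about DISTINCT and ORDER BY.

```rust
/// Hypothetical, simplified view of a `string_agg` call.
/// (Assumption: not a DataFusion type.)
struct StringAggCall {
    is_distinct: bool,
    has_order_by: bool,
}

/// Which accumulator implementation a call is routed to.
#[derive(Debug, PartialEq)]
enum AccumulatorKind {
    /// Fast path: appends directly into a single `String`.
    Simple,
    /// General path: handles DISTINCT and ORDER BY.
    General,
}

/// Pick the fast path only when neither DISTINCT nor ORDER BY is present.
fn choose_accumulator(call: &StringAggCall) -> AccumulatorKind {
    if !call.is_distinct && !call.has_order_by {
        AccumulatorKind::Simple
    } else {
        AccumulatorKind::General
    }
}
```

Under this sketch, a plain `string_agg(str, ',')` maps to `AccumulatorKind::Simple`, while a call with either modifier falls back to `AccumulatorKind::General`.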

Are these changes tested?

Existing tests

Are there any user-facing changes?

vegarsti · 2025-09-30T10:55:31Z

Amazing!

vegarsti

Really nice!!

datafusion/functions-aggregate/src/string_agg.rs

vegarsti · 2025-09-30T11:14:10Z

datafusion/functions-aggregate/src/string_agg.rs

+pub(crate) struct SimpleStringAggAccumulator {
+    delimiter: String,
+    // Updating during `update_batch()`. e.g. "foo,bar"
+    in_progress_string: String,


Just thinking out loud about the name here: I think acc or accumulated would also be conventional. But this name is fine!

Good point, updated.

vegarsti · 2025-09-30T11:15:59Z

datafusion/functions-aggregate/src/string_agg.rs

+        size_of_val(self) + self.delimiter.capacity() + self.in_progress_string.capacity()
+    }
+
+    fn state(&mut self) -> Result<Vec<ScalarValue>> {


Just asking to understand the Accumulator trait: I see that this and evaluate are the same except for what they return. What is the difference between the two, and when is each one used, do you know?

state() returns the per-partition intermediate result, and evaluate() returns the final result.
For example, suppose group key1 is processed in 2 partitions:
partition 1:
-- [input] (foo, bar) --state()--> "foo, bar"
partition 2:
-- [input] (baz) --state()--> "baz"

evaluate() is called after merge_batch() has combined the above intermediates from all partitions, and it returns the final result "foo, bar, baz".

I think there is a detailed doc in the Accumulator interface
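The two-partition flow above can be sketched with plain Rust types. This is only an illustration under stated assumptions: `SketchStringAgg` is a hypothetical stand-in that mirrors the method names of the `Accumulator` trait but uses `Option<&str>` slices instead of Arrow arrays and `ScalarValue`, and it assumes NULL inputs are skipped.

```rust
/// Hypothetical stand-in for the accumulator discussed in this thread.
/// (Assumption: plain Rust types replace Arrow arrays and ScalarValue.)
struct SketchStringAgg {
    delimiter: String,
    accumulated: Option<String>,
}

impl SketchStringAgg {
    fn new(delimiter: &str) -> Self {
        Self { delimiter: delimiter.to_string(), accumulated: None }
    }

    /// Append one piece, inserting the delimiter between pieces.
    fn append(&mut self, value: &str) {
        match self.accumulated.as_mut() {
            Some(acc) => {
                acc.push_str(&self.delimiter);
                acc.push_str(value);
            }
            None => self.accumulated = Some(value.to_string()),
        }
    }

    /// Consume raw input rows within one partition; NULLs (None) are skipped.
    fn update_batch(&mut self, values: &[Option<&str>]) {
        for v in values.iter().flatten() {
            self.append(v);
        }
    }

    /// Per-partition intermediate result.
    fn state(&self) -> Option<String> {
        self.accumulated.clone()
    }

    /// Combine intermediate states produced by other partitions.
    fn merge_batch(&mut self, states: &[Option<String>]) {
        for s in states.iter().flatten() {
            self.append(s);
        }
    }

    /// Final result for the group; None if no non-NULL input was seen.
    fn evaluate(&self) -> Option<String> {
        self.accumulated.clone()
    }
}
```

In this sketch, each partition calls `update_batch` and then `state`; a final instance calls `merge_batch` on the collected states and then `evaluate`.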

Thanks for the explanation! And oh yeah, I should read the doc comments on the trait!

vegarsti · 2025-09-30T11:16:57Z

datafusion/functions-aggregate/src/string_agg.rs

+    fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
+        let string_arr = values.first().ok_or_else(|| {


Is there only one element in values? That was surprising

Yes, it's the array for arg1. The argument is validated at planning time, so we can also assume it has the right type here (values[0] is a string array).

datafusion/functions-aggregate/src/string_agg.rs

Co-authored-by: Vegard Stikbakke <vegard.stikbakke@gmail.com>

2010YOUY01 · 2025-10-01T11:25:54Z

CI should pass after #17855 is merged; we can re-run it afterwards.

alamb

Thanks @2010YOUY01 -- I agree this is great. I left a few suggestions.

I tested it like this:

# created data
tpchgen-cli -v --tables=partsupp --format=parquet --parts=2 -s 0.1
# run query: 
time datafusion-cli -c "select ps_partkey, string_agg(ps_comment, ';') from 'partsupp' group by ps_partkey;"

main:

real 0m50.296s

This branch 😮

real 0m0.706s

alamb · 2025-10-01T15:56:12Z

datafusion/functions-aggregate/src/string_agg.rs

+            // Case `SimpleStringAggAccumulator`
+            Ok(vec![Field::new(
+                format_state_name(args.name, "string_agg"),
+                DataType::LargeUtf8,
+                true,
+            )
+            .into()])


It would be nice to put this as part of SimpleStringAggAccumulator, something like

SimpleStringAggAccumulator::state_fields(args)

alamb · 2025-10-01T15:56:32Z

datafusion/functions-aggregate/src/string_agg.rs

+            Ok(Box::new(SimpleStringAggAccumulator::new(delimiter)))
+        } else {
+            // general case
+            let array_agg_acc = self.array_agg.accumulator(AccumulatorArgs {


ditto here for encapsulating this

alamb · 2025-10-01T15:57:32Z

datafusion/functions-aggregate/src/string_agg.rs

+pub(crate) struct SimpleStringAggAccumulator {
+    delimiter: String,
+    /// Updated during `update_batch()`. e.g. "foo,bar"
+    accumulated_string: String,


Rather than has_value, perhaps using an Option would be better: more idiomatic Rust and harder to misuse

accumulated_string: Option<String>,

alamb · 2025-10-01T15:57:57Z

datafusion/functions-aggregate/src/string_agg.rs

+/// because it accumulates the string directly,
+/// whereas `StringAggAccumulator` uses `ArrayAggAccumulator`.
+#[derive(Debug)]
+pub(crate) struct SimpleStringAggAccumulator {


Yes, this is likely much better than what we have. We can probably do better still with a GroupsAccumulator as well

alamb · 2025-10-01T16:00:25Z

datafusion/functions-aggregate/src/string_agg.rs

+            if self.has_value {
+                self.accumulated_string.push_str(&self.delimiter);
+            }
+
+            self.accumulated_string.push_str(value);
+            self.has_value = true;


If you used an Option, this could look like

Suggested change

-            if self.has_value {
-                self.accumulated_string.push_str(&self.delimiter);
-            }
-
-            self.accumulated_string.push_str(value);
-            self.has_value = true;
+            if let Some(accumulated_string) = self.accumulated_string.as_mut() {
+                accumulated_string.push_str(&self.delimiter);
+                accumulated_string.push_str(value);
+            } else {
+                self.accumulated_string = Some(String::from(value));
+            }
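The Option-based idea in this comment can be tried in isolation with a small helper. This is a hedged sketch, not the PR's code: `push_joined` is a hypothetical free function, and the accumulator state is passed in explicitly rather than living on a struct.

```rust
/// Hypothetical helper: append `value` to an optional accumulated string,
/// inserting `delimiter` only when something was accumulated before.
fn push_joined(acc: &mut Option<String>, delimiter: &str, value: &str) {
    if let Some(s) = acc.as_mut() {
        // A previous value exists: separate it from the new one.
        s.push_str(delimiter);
        s.push_str(value);
    } else {
        // First value: no leading delimiter.
        *acc = Some(String::from(value));
    }
}
```

One property of this shape is that "no value yet" (`None`) and "an empty string was accumulated" (`Some("")`) stay distinct without a separate `has_value` flag that has to be kept in sync by hand.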

comphead · 2025-10-01T19:34:45Z

datafusion/functions-aggregate/src/string_agg.rs

+                self.accumulated_string.push_str(&self.delimiter);
+            }
+
+            self.accumulated_string.push_str(value);


push_str is 💪

2010YOUY01 added 2 commits September 30, 2025 16:50

impl SimpleStringAggAccumulator for performance

d7c7c84

lint

2ee945b

github-actions bot added the functions (Changes to functions implementation) label Sep 30, 2025

2010YOUY01 mentioned this pull request Sep 30, 2025

string_agg aggregate function is 1000x slower than duckdb (SQLStorm) #17789

Open

vegarsti reviewed Sep 30, 2025


2010YOUY01 and others added 3 commits October 1, 2025 19:14

review: rename in_progress_string --> accumualted_string

574ef18

Update datafusion/functions-aggregate/src/string_agg.rs

a041219

Co-authored-by: Vegard Stikbakke <vegard.stikbakke@gmail.com>

Update datafusion/functions-aggregate/src/string_agg.rs

2bc95c1

Co-authored-by: Vegard Stikbakke <vegard.stikbakke@gmail.com>

2010YOUY01 closed this Oct 1, 2025

2010YOUY01 reopened this Oct 1, 2025

alamb added the performance (Make DataFusion faster) label Oct 1, 2025

alamb approved these changes Oct 1, 2025


comphead reviewed Oct 1, 2025

