Implement special Groups for StringViews #12771

alamb · 2024-10-05T13:05:26Z

Is your feature request related to a problem or challenge?

In #12269 @jayzhan211 made significant improvements to how group values are stored in multi-column aggregations. This requires specialized implementations for different column types.

His initial PR has implementations for PrimitiveArray and String/Binary. However, it does not have a specialization for StringView.

So that means that queries that group on multiple columns are now faster, but only for the column types that have a specialization. This shows up as some ClickBench queries being effectively slower when they are run with StringView.

For example, this query is 10% slower with StringView:

SELECT "SearchEngineID", "SearchPhrase", COUNT(*) AS c FROM 'hits.parquet' WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "SearchPhrase" ORDER BY c DESC LIMIT 10;

Describe the solution you'd like

I would like to make this query (and similar ones) faster when StringView is enabled:

SELECT "SearchEngineID", "SearchPhrase", COUNT(*) AS c FROM 'hits.parquet' WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "SearchPhrase" ORDER BY c DESC LIMIT 10;

Note that this query groups by 2 columns.

Here is how to reproduce the issue:

Step 1: Get `hits.parquet` using `bench.sh`:

cd benchmarks
./bench.sh data clickbench_1

Step 2: Prepare a script (`q.sql`) with the reproducer query:

set datafusion.execution.parquet.schema_force_view_types = true;

SELECT "SearchEngineID", "SearchPhrase", COUNT(*) AS c FROM 'hits.parquet' WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "SearchPhrase" ORDER BY c DESC LIMIT 10;

Step 3: Run the query:

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ datafusion-cli -f q.sql

set datafusion.execution.parquet.schema_force_view_types = true; --> Elapsed 0.688 seconds.
set datafusion.execution.parquet.schema_force_view_types = false; --> Elapsed 0.565 seconds.

Describe alternatives you've considered

I suggest implementing something like ByteViewGroupValueBuilder, following the model of ByteGroupValueBuilder:

datafusion/datafusion/physical-plan/src/aggregates/group_values/group_column.rs

Line 177 in 6f8c74c

pub struct ByteGroupValueBuilder<O>

The in-progress values would be stored as u128 views plus some data buffers (maybe 2MB?).

Implementing equal_to can take advantage of the inlined prefix optimization (that is, compare the prefix inlined in the u128 first, and only check the value in the buffer if the prefixes are equal).
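To make the idea concrete, here is a minimal, std-only Rust sketch. `ViewGroupValues`, its fields, and the `read_u32` helper are hypothetical names for illustration, not DataFusion or arrow-rs APIs. The 16-byte layout follows the Arrow view format as I understand it (4-byte length, then either up to 12 inlined bytes, or a 4-byte prefix plus a buffer index and offset), and the buffer size is a plain parameter rather than the 2MB guess above.

```rust
/// Values of at most this many bytes live entirely inside the 16-byte view.
const MAX_INLINE_LEN: usize = 12;

/// Hypothetical, simplified stand-in for a `ByteViewGroupValueBuilder`.
struct ViewGroupValues {
    /// One 16-byte view per stored group value.
    views: Vec<u128>,
    /// Out-of-line bytes for values longer than `MAX_INLINE_LEN`.
    buffers: Vec<Vec<u8>>,
    /// Target size of each buffer (an assumption; tune as needed).
    max_buffer_size: usize,
}

/// Read a little-endian u32 starting at `start` in a view's bytes.
fn read_u32(bytes: &[u8; 16], start: usize) -> u32 {
    u32::from_le_bytes([bytes[start], bytes[start + 1], bytes[start + 2], bytes[start + 3]])
}

impl ViewGroupValues {
    fn new(max_buffer_size: usize) -> Self {
        Self { views: Vec::new(), buffers: Vec::new(), max_buffer_size }
    }

    /// Store a new group value as a view, spilling long values to a buffer.
    fn append(&mut self, value: &[u8]) {
        let mut bytes = [0u8; 16];
        bytes[0..4].copy_from_slice(&(value.len() as u32).to_le_bytes());
        if value.len() <= MAX_INLINE_LEN {
            // Short value: inline all bytes, the rest stays zero padded.
            bytes[4..4 + value.len()].copy_from_slice(value);
        } else {
            // Long value: start a new buffer if the current one would overflow.
            let need_new = match self.buffers.last() {
                Some(b) => b.len() + value.len() > self.max_buffer_size,
                None => true,
            };
            if need_new {
                let cap = self.max_buffer_size.max(value.len());
                self.buffers.push(Vec::with_capacity(cap));
            }
            let buffer_index = (self.buffers.len() - 1) as u32;
            let buffer = self.buffers.last_mut().unwrap();
            let offset = buffer.len() as u32;
            buffer.extend_from_slice(value);
            // View = length | 4-byte prefix | buffer index | offset.
            bytes[4..8].copy_from_slice(&value[0..4]);
            bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
            bytes[12..16].copy_from_slice(&offset.to_le_bytes());
        }
        self.views.push(u128::from_le_bytes(bytes));
    }

    /// Compare the stored value at `row` with `value`, cheapest checks first.
    fn equal_to(&self, row: usize, value: &[u8]) -> bool {
        let bytes = self.views[row].to_le_bytes();
        let len = read_u32(&bytes, 0) as usize;
        if len != value.len() {
            return false; // different lengths can never be equal
        }
        if len <= MAX_INLINE_LEN {
            return &bytes[4..4 + len] == value; // fully inlined, no buffer access
        }
        if bytes[4..8] != value[0..4] {
            return false; // inlined prefix differs, skip the buffer lookup
        }
        // Prefix matches: fall back to comparing the full bytes in the buffer.
        let buffer_index = read_u32(&bytes, 8) as usize;
        let offset = read_u32(&bytes, 12) as usize;
        &self.buffers[buffer_index][offset..offset + len] == value
    }
}
```

The point of the ordering in `equal_to` is that most non-matching probes are rejected using only the u128 (length, then prefix), so the data buffer is touched only for likely matches. A real implementation would compare against a row of an input `StringViewArray` rather than a byte slice, and could compare the two views directly as integers for the inlined case.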

Additional context

No response


alamb · 2024-10-05T13:27:50Z

BTW here are the flamegraphs:

alamb · 2024-10-05T13:28:21Z

FYI @Rachelint / @jayzhan211 -- this might be an interesting project

Rachelint · 2024-10-05T13:38:07Z

This is really interesting, and I am willing to help push it forward.

Rachelint · 2024-10-05T13:38:16Z

take

alamb · 2024-10-06T11:11:05Z

I think it will be quite a cool optimization -- specifically checking for equal values can likely be optimized using the inlined prefix

alamb added the enhancement (New feature or request) label Oct 5, 2024

alamb mentioned this issue Oct 5, 2024

[EPIC] Improvements to GroupColumn multi-column aggregation performance #12680

Open

14 tasks

alamb changed the title ~~Implement special Groups or StringViews~~ Implement special Groups for StringViews Oct 5, 2024

This was referenced Oct 5, 2024

Enable reading StringViewArray by default from Parquet #12092

Closed

Casting from Binary --> Utf8 to evaluate LIKE slows down some ClickBench queries #12509

Closed

github-actions bot assigned Rachelint Oct 5, 2024

alamb mentioned this issue Oct 7, 2024

Enable reading StringView by default from Parquet (schema_force_string_view) by default #11682

Closed

Rachelint mentioned this issue Oct 8, 2024

Implement GroupColumn support for StringView / ByteView (faster grouping performance) #12809

Merged

alamb mentioned this issue Oct 15, 2024

Release DataFusion 43.0.0 #12470

Closed

4 tasks

alamb closed this as completed in #12809 Oct 16, 2024
