Implement special Groups for StringViews #12771

Closed
Tracked by #12680
alamb opened this issue Oct 5, 2024 · 5 comments · Fixed by #12809
@alamb
Contributor

alamb commented Oct 5, 2024

Is your feature request related to a problem or challenge?

Part of #12680

In #12269, @jayzhan211 made significant improvements to how group values are stored in multi-column aggregations. This approach requires a specialized implementation for each column type.

His initial PR has implementations for PrimitiveArray and String/Binary. However, it does not have a specialization for StringView.

This means that queries that group on multiple columns are now faster for those types, but not for StringView. The effect is that some ClickBench queries are slower when they are run with StringView:

For example, this query is 10% slower with StringView:

SELECT "SearchEngineID", "SearchPhrase", COUNT(*) AS c FROM 'hits.parquet' WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "SearchPhrase" ORDER BY c DESC LIMIT 10;

Describe the solution you'd like

I would like to make this query (and similar ones) faster when string view is enabled:

SELECT "SearchEngineID", "SearchPhrase", COUNT(*) AS c FROM 'hits.parquet' WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "SearchPhrase" ORDER BY c DESC LIMIT 10;

Note this is grouping by 2 columns

Here is how to reproduce the issue

Step 1. Get hits.parquet using bench.sh:

cd benchmarks
./bench.sh data clickbench_1

Step 2: Prepare a script with reproducer query:

set datafusion.execution.parquet.schema_force_view_types = true;

SELECT "SearchEngineID", "SearchPhrase", COUNT(*) AS c FROM 'hits.parquet' WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "SearchPhrase" ORDER BY c DESC LIMIT 10;

Step 3: Run query

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ datafusion-cli -f q.sql
  • set datafusion.execution.parquet.schema_force_view_types = true; --> Elapsed 0.688 seconds.
  • set datafusion.execution.parquet.schema_force_view_types = false; --> Elapsed 0.565 seconds.

Describe alternatives you've considered

I suggest implementing something like ByteViewGroupValueBuilder following the model of ByteGroupValueBuilder

The in-progress values would be u128 views plus some buffers (maybe 2MB?).

Implementing equal_to can take advantage of the inlined prefix optimization (that is, compare the prefix inlined in the u128 first, and only check the value in the buffer if the prefixes are equal).
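To make the idea concrete, here is a minimal sketch of how such a builder could store values and short-circuit comparisons. It is not the DataFusion implementation: `ViewGroupValues`, `append`, and this `equal_to` signature are hypothetical names for illustration, and it uses a single growable buffer instead of several fixed-size ones. The u128 layout follows the Arrow view format (length in the low 32 bits, then either up to 12 inlined bytes, or a 4-byte prefix, buffer index, and offset).

```rust
/// Values of at most this many bytes are stored entirely inside the view.
const MAX_INLINE: usize = 12;

/// Hypothetical, simplified stand-in for a `ByteViewGroupValueBuilder`.
/// Assumption: one growable buffer rather than multiple fixed-size buffers.
struct ViewGroupValues {
    views: Vec<u128>,
    buffer: Vec<u8>,
}

/// Build a view for a short value: 4-byte length followed by the bytes,
/// zero padded to 16 bytes.
fn make_inline_view(value: &[u8]) -> u128 {
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(value.len() as u32).to_le_bytes());
    bytes[4..4 + value.len()].copy_from_slice(value);
    u128::from_le_bytes(bytes)
}

/// First four bytes of a long value, read as a little-endian u32.
fn prefix_of(value: &[u8]) -> u32 {
    u32::from_le_bytes([value[0], value[1], value[2], value[3]])
}

impl ViewGroupValues {
    fn new() -> Self {
        Self { views: Vec::new(), buffer: Vec::new() }
    }

    /// Append a group value, inlining it when it fits in the view.
    fn append(&mut self, value: &[u8]) {
        let view = if value.len() <= MAX_INLINE {
            make_inline_view(value)
        } else {
            let offset = self.buffer.len() as u128;
            self.buffer.extend_from_slice(value);
            // length | prefix << 32 | buffer index (always 0 here) << 64 | offset << 96
            (value.len() as u128) | ((prefix_of(value) as u128) << 32) | (offset << 96)
        };
        self.views.push(view);
    }

    /// Compare the stored value at `row` with `value`, touching the
    /// buffer only when the length and the inlined prefix already match.
    fn equal_to(&self, row: usize, value: &[u8]) -> bool {
        let stored = self.views[row];
        let stored_len = stored as u32 as usize; // low 32 bits hold the length
        if stored_len != value.len() {
            return false;
        }
        if stored_len <= MAX_INLINE {
            // The whole value is inlined, so comparing the views is enough.
            return stored == make_inline_view(value);
        }
        if (stored >> 32) as u32 != prefix_of(value) {
            return false; // prefixes differ: no buffer access needed
        }
        let offset = (stored >> 96) as u32 as usize;
        &self.buffer[offset..offset + stored_len] == value
    }
}
```

In a real specialization the right-hand side would presumably be a row of the incoming StringView array rather than a byte slice, so the first 64 bits of the two views (length and prefix) could be compared directly without rebuilding anything.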

Additional context

No response

@alamb alamb added the enhancement New feature or request label Oct 5, 2024
@alamb
Contributor Author

alamb commented Oct 5, 2024

BTW here are the flamegraphs:

flamegraph-string
flamegraph-stringview

Screenshot 2024-10-05 at 9 26 48 AM

@alamb alamb changed the title Implement special Groups or StringViews Implement special Groups for StringViews Oct 5, 2024
@alamb
Contributor Author

alamb commented Oct 5, 2024

FYI @Rachelint / @jayzhan211 -- this might be an interesting project

@Rachelint
Contributor

Actually interesting, I am willing to help push it forward

@Rachelint
Contributor

take

@alamb
Contributor Author

alamb commented Oct 6, 2024

I think it will be quite a cool optimization -- specifically checking for equal values can likely be optimized using the inlined prefix
