Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better multi-column aggregation support with StringView #11684

Closed
wants to merge 2 commits into from

Conversation

XiangpengHao
Copy link
Contributor

Which issue does this PR close?

Related to #7000.

Rationale for this change

I get some time to implement the multi-column aggregation with StringView, the implementation is inspired by @jayzhan211 's #10976.

However, the performance is mixed, I see almost no performance diff with ClickBench query 18:

SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS m, "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;

This is becuase the average size of SearchPhrase is only 4, meaning the string is always inlined to view and thus not being reused.

However, I made this query (q14, changed SearchPhrase to URL)

SELECT "SearchEngineID", "URL", COUNT(*) AS c FROM hits WHERE "URL" <> '' GROUP BY "SearchEngineID", "URL" ORDER BY c DESC LIMIT 10;

Without this patch, it takes 7.4s; with this patch, it takes 5.9s, a 20% improvement.

With this patch, the output string will reuse the string values in the buffer, so it is good.
However, when interning the values, we need to hash twice: (1) the string itself to build BytesViewMap (2) the group index - u32, which adds overhead.

What changes are included in this PR?

Things not implemented yet:

  • EmtiTo::First(n): I need to think a bit more about how to efficiently reuse the BytesViewBuilder

Are these changes tested?

Are there any user-facing changes?

No

@XiangpengHao
Copy link
Contributor Author

Update here that the new implementation does not need to hash twice, so it should not create noticeable performance regression on shorter strings. And the perf on longer strings are still much faster!

@alamb alamb deleted the branch apache:string-view2 July 29, 2024 20:52
@alamb alamb closed this Jul 29, 2024
@jayzhan211
Copy link
Contributor

Update here that the new implementation does not need to hash twice, so it should not create noticeable performance regression on shorter strings. And the perf on longer strings are still much faster!

What is the new impl you referred to

@XiangpengHao
Copy link
Contributor Author

What is the new impl you referred to

Sorry for the late reply! I meant the new commit I made to this PR. Since string-view2 was merged, I created a new PR that targets the main.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants