Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slowdown in ClickBench Q36-Q37 between DataFusion 43.0.0 and 44.0.0 #14481

Open
Tracked by #14482 ...
alamb opened this issue Feb 4, 2025 · 3 comments
Open
Tracked by #14482 ...

Slowdown in ClickBench Q36-Q37 between DataFusion 43.0.0 and 44.0.0 #14481

alamb opened this issue Feb 4, 2025 · 3 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Feb 4, 2025

Is your feature request related to a problem or challenge?

@pmcgleenon ran ClickBench on DataFusion 44 ❤

Here are the results of ClickBench across several DataFusion versions:
clickbench-latest.html.zip

Q36 and Q37 look like they got slower

Image

Describe the solution you'd like

Investigate (and hopefully restore) the performance in Q36 and Q37

Here are the queries (note the queries are numbered starting at 0 but the line numbers start at 1):

SELECT "URL", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "DontCountHits" = 0 AND "IsRefresh" = 0 AND "URL" <> '' GROUP BY "URL" ORDER BY PageViews DESC LIMIT 10;
SELECT "Title", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "DontCountHits" = 0 AND "IsRefresh" = 0 AND "Title" <> '' GROUP BY "Title" ORDER BY PageViews DESC LIMIT 10;

Describe alternatives you've considered

No response

Additional context

No response

@alamb
Copy link
Contributor Author

alamb commented Feb 4, 2025

I got flamegraphs from them using https://github.com/flamegraph-rs/flamegraph

Q36

./datafusion-cli-44 -c "SELECT \"URL\", COUNT(*) AS PageViews FROM 'hits.parquet' WHERE \"CounterID\" = 62 AND \"EventDate\"::INT::DATE >= '2013-07-01' AND \"EventDate\"::INT::DATE <= '2013-07-31' AND \"DontCountHits\" = 0 AND \"IsRefresh\" = 0 AND \"URL\" <> '' GROUP BY \"URL\" ORDER BY PageViews DESC LIMIT 10; "

And made flamegraphs with

sudo flamegraph -- ./datafusion-cli-43 -c "SELECT \"URL\", COUNT(*) AS PageViews FROM 'hits.parquet' WHERE \"CounterID\" = 62 AND \"EventDate\"::INT::DATE >= '2013-07-01' AND \"EventDate\"::INT::DATE <= '2013-07-31' AND \"DontCountHits\" = 0 AND \"IsRefresh\" = 0 AND \"URL\" <> '' GROUP BY \"URL\" ORDER BY PageViews DESC LIMIT 10;

Here is DataFusion 43:

Image

Here is DataFusion 44:

Image

A largre amount of the time is spent decoding ParquetMetadata
Image

@alamb
Copy link
Contributor Author

alamb commented Feb 4, 2025

Also when I tried in DataFusion 45 (pre-release) the speed seems to have gotten better again...

@alamb
Copy link
Contributor Author

alamb commented Feb 4, 2025

Given how much time is spent decoding ParquetMetadata, maybe it would be good to add some sort of small built in cache for parquet metadata 🤔 I think @Ted-Jiang made hooks to do this a long time ago but we don't have anything in by default

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant