datafusion-cli scanning a single large parquet file uses only a single core #5995
Comments
Suggestion from @tustvold on #5942 (comment). Here is the result of running
Do you have a profile of the CPU usage? I would have expected it to parallelize the parquet scanning part; perhaps the bottleneck is elsewhere?
There are 49 row groups, 48 of them have size
I do not have a profile.
I added the following line to
And got
So whilst it is creating lots of partitions, all the row groups appear to lie in a single partition. This explains why we are not seeing any parallelism. Why this is the case needs more investigation.
Edit: using https://github.com/apache/arrow-rs/pull/4086/files I confirmed the byte ranges should distribute the row groups.
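For readers following along, here is a minimal sketch of how splitting one file into byte ranges can spread row groups across partitions, and how they can all collapse into one. It is written in Python for brevity and is not DataFusion's or arrow-rs's actual code: `RowGroup`, `split_byte_ranges`, and `assign_row_groups` are hypothetical names, and the rule "a row group belongs to the range containing its midpoint" is an assumption made for illustration. The skewed case at the end only shows one way a single-partition result could arise; it is not a diagnosis of this issue.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    # Hypothetical stand-in for row group metadata.
    start: int   # byte offset where the row group begins
    length: int  # compressed size in bytes


def split_byte_ranges(file_size, n):
    """Split [0, file_size) into at most n contiguous half-open ranges."""
    if file_size <= 0 or n <= 0:
        return []
    step = -(-file_size // n)  # ceiling division
    return [(lo, min(lo + step, file_size)) for lo in range(0, file_size, step)]


def assign_row_groups(row_groups, ranges):
    """Assign each row group to the range containing its midpoint (assumed rule)."""
    assignment = [[] for _ in ranges]
    for idx, rg in enumerate(row_groups):
        mid = rg.start + rg.length // 2
        for p, (lo, hi) in enumerate(ranges):
            if lo <= mid < hi:
                assignment[p].append(idx)
                break
    return assignment


# 49 equally sized row groups laid out back to back after a 4-byte header.
# The sizes are made up; only the count of 49 comes from this thread.
RG_SIZE = 1000
row_groups = [RowGroup(start=4 + i * RG_SIZE, length=RG_SIZE) for i in range(49)]
file_size = 4 + 49 * RG_SIZE
ranges = split_byte_ranges(file_size, 8)

balanced = [len(p) for p in assign_row_groups(row_groups, ranges)]

# Hypothetical failure mode: every row group reports the same start offset,
# so every midpoint falls into the first byte range.
bad_row_groups = [RowGroup(start=4, length=RG_SIZE) for _ in range(49)]
skewed = [len(p) for p in assign_row_groups(bad_row_groups, ranges)]

print("balanced:", balanced)
print("skewed:  ", skewed)
```

With correct offsets the row groups spread almost evenly across the eight ranges, which is the behaviour expected here. When the offsets used for assignment do not reflect the real layout, every partition but one ends up empty and the scan is effectively single threaded, however many partitions are created.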
With the fix in #5997 we have much more parallelism and the query runs much faster.
Describe the bug
datafusion-cli scanning a single large parquet file uses only a single core
This is bad as it makes datafusion look bad compared to other systems such as duckdb
To Reproduce
Download this file:
slow_tpch_query_repro.zip
and follow the instructions:
Then run the query like:
Only one core is used and the query takes several seconds to complete.
Expected behavior
I expect to see all the cores on the machine used to execute the query.
Additional context
Found while looking at #5942