Efficiency Problem: Parallelization and vectorization #9547
Comments
@alamb I am kind of stuck here. Could you please provide some pointers on this one? Thanks
probably related: #5942
My current plan for this is to generate a vectorization instruction coverage report in CI/CD to track the usage of SIMD instructions. I also think tokio may have some bugs here. Maybe we could start adding parallelism to individual operators, probably starting with SCAN.
Hi @Lordworms -- thank you for this analysis.
I do not agree with this statement in general (though it may be that TPCH parallelism could be improved): DataFusion uses a significant amount of CPU / parallelism, and while tokio certainly results in more complicated stack traces, I think overall the benefits are worth it. We did a comparison of DataFusion and DuckDB in our upcoming SIGMOD paper (#6782) DataFusion_Query_Engine___SIGMOD_2024.pdf, where we compared single core efficiency and scaling (see the results section). We found areas that each engine did better in. If your goal is to improve the performance of DataFusion in the TPCH queries I have some thoughts:
Is your feature request related to a problem or challenge?
I was doing a course project on efficiency comparison, and I ran the TPC-H benchmark to compare the efficiency of DataFusion and DuckDB. The results indicate that there might be some efficiency issues. I also noticed that the effective CPU time used by DataFusion is much higher than DuckDB's, but its runtime on TPC-H is slower (it seems like we are not really running in parallel, and I suspect part of the problem comes from Tokio).
This is DuckDB's result
This is DataFusion's result
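One way to separate "does more total work" from "is not running in parallel" is to divide CPU time by wall-clock time, which gives the average number of busy cores. Below is a minimal sketch of that arithmetic; the names `parallel_efficiency` and `measure` are hypothetical helpers for illustration and are not part of DataFusion or DuckDB. `measure` only sees CPU time spent in the current process, so an engine launched as a child process would need its times collected separately.

```python
import os
import time


def parallel_efficiency(wall_seconds, cpu_seconds, cores):
    """Return (average busy cores, fraction of available cores kept busy).

    Hypothetical helper: busy cores = total CPU time / wall-clock time.
    """
    if wall_seconds <= 0 or cores <= 0:
        raise ValueError("wall_seconds and cores must be positive")
    busy_cores = cpu_seconds / wall_seconds
    return busy_cores, busy_cores / cores


def measure(fn, cores=None):
    """Run fn() in this process and report its parallel efficiency.

    time.process_time() sums user and system CPU time over all threads
    of the current process, so it can exceed the wall-clock time.
    """
    cores = cores or os.cpu_count() or 1
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return parallel_efficiency(wall_seconds, cpu_seconds, cores)


if __name__ == "__main__":
    # Example numbers only: 8 s of CPU time in 2 s of wall time on 8 cores.
    busy, fraction = parallel_efficiency(2.0, 8.0, 8)
    print(f"busy cores: {busy:.1f}, utilization: {fraction:.0%}")
```

If the busy-core figure is close to the core count for both engines, the gap is more likely extra work per query than missing parallelism.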
Also, the flame graphs show that DataFusion has a much deeper stack.
DuckDB
DataFusion
This made me somewhat distrustful of Tokio.
I suspected the slower performance might be due to incomplete use of SIMD instructions, so I collected some statistics on SIMD instructions using PIN (the result may not be that precise, but I expected the number of SIMD instructions generated to be comparable). The results are shown below.
It turns out that DataFusion may use fewer SIMD instructions than DuckDB (that might be a rustc problem).
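As a rough cross-check that does not need PIN, vector register usage can also be counted statically from a disassembly. The sketch below is not the PIN-based measurement described above: it assumes x86-64 text in the default tab-separated layout of `objdump -d` (address, raw bytes, instruction), and `count_vector_instructions` is a hypothetical helper name. It counts instructions in the binary rather than instructions executed, and `xmm` registers are also used by scalar floating-point code, so the `xmm` bucket overestimates true SIMD use.

```python
import re
from collections import Counter

# Matches xmm/ymm/zmm register names in either AT&T (%ymm0) or Intel (ymm0) syntax.
VECTOR_REG = re.compile(r"\b([xyz])mm\d+\b")


def count_vector_instructions(disassembly):
    """Count instructions by the widest vector register they mention.

    Assumes each instruction line looks like "addr:<TAB>bytes<TAB>instruction".
    Returns a Counter with keys "total", "xmm", "ymm" and "zmm".
    """
    counts = Counter()
    for line in disassembly.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # labels, blank lines, or raw-byte continuation lines
        instruction = parts[2].strip()
        if not instruction:
            continue
        counts["total"] += 1
        widths = {match.group(1) for match in VECTOR_REG.finditer(instruction)}
        # Attribute the instruction to the widest register class it touches.
        for prefix in ("z", "y", "x"):
            if prefix in widths:
                counts[prefix + "mm"] += 1
                break
    return counts


if __name__ == "__main__":
    import sys

    # Reads disassembly text from standard input.
    result = count_vector_instructions(sys.stdin.read())
    for key in ("total", "xmm", "ymm", "zmm"):
        print(key, result[key])
```

Running the same script over both binaries gives comparable static numbers, which can then be weighed against the dynamic counts from PIN.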
Describe the solution you'd like
I plan to work on this the week after next, but I have no leads yet.
Describe alternatives you've considered
No response
Additional context
No response