Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficiency Problem: Parallelization and vectorization #9547

Open
Lordworms opened this issue Mar 11, 2024 · 5 comments
Open

Efficiency Problem: Parallelization and vectorization #9547

Lordworms opened this issue Mar 11, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@Lordworms
Copy link
Contributor

Lordworms commented Mar 11, 2024

Is your feature request related to a problem or challenge?

I was doing a course project on efficiency comparison. And I try on TPC-H benchmark to compare the efficiency between datafusion and duckDB. The results indicated that There might be some efficiency issues. I also noticed that the effective CPU use time of datafusion is much higher than DuckDB, but the runtime on TPC-H is slower(seems like we did not really do parallism and I really think that's some problem comes from Tokio)
This is DuckDB's result
9a087441e7ecb79d01fc382d14f47ffe
This is Datafusion's result
1980a8f73e043fff172c3763114110e3

Also the flame graph shows that datafusion has a much deeper stack.
duckDB
1def3e3446638dbd5fd305db09421227

datafusion
26f898b9cbf0352ea76383ae1faf7d88

I kind of generated some distrust towards Tokio.

I doubt whether the slower performance is due to incomplete use of SIMD instruction so I did some statistics on SIMD instructions using PIN(may be the result is not that precise, but I expected the number of SIMD instruction generated should be comparable), the results shows below

SIMD instruction datafusion number duckDB number
ADDSD 34 25
CMPSD_XMM 1 6
COMISD - 44
DIVSD 14 32
MAXSD 1 1
MULSD 21 52
PACKUSWB 5 7
PADDB 30 12
PADDD 100 33
PADDQ 291 200
PADDW 8 5
PCMPEQB 548 544
PCMPEQD 58 38
PCMPGTB - 1
PCMPGTD 44 14
PCMPGTW - 6
PMINUB 8 20
PMOVMSKB 1169 278
PMULHUW 1 2
PMULLW 1 2
PMULUDQ - 4
PSHUFD 646 88
PSLLD 6 2
PSLLDQ 72 217
PSLLQ 213 16
PSLLW 30 2
PSRAD 8 -
PSRLD 3 40
PSRLDQ 39 179
PSRLQ 11 7
PSUBB 84 243
PSUBD 4 3
PSUBQ 12 4
PSUBUSB - 6
PSUBW - 6
PUNPCKHBW 41 7
PUNPCKHDQ 45 66
PUNPCKHQDQ 102 14
PUNPCKHWD 42 50
PUNPCKLBW 211 19
PUNPCKLDQ 94 338
PUNPCKLQDQ 353 2713
PUNPCKLWD 73 80
ROUNDSD 1 -
SHUFPD 4 20
SHUFPS - 28
SQRTSD - 2
SUBSD 10 19
UCOMISD 16 39
VPCMPB 56 86
VPCMPUB 206 19
VPMINUB 2 15
Total 4851 5293

Turns out that datafusion may use less SIMD instructions than DuckDB (that might be the rustc problem)

Describe the solution you'd like

I plan to do this week after next after. But got no clues yet

Describe alternatives you've considered

No response

Additional context

No response

@Lordworms Lordworms added the enhancement New feature or request label Mar 11, 2024
@Lordworms
Copy link
Contributor Author

@alamb I am kinda stuck here, could you please provide some clues about this one? Thanks

@yyy1000
Copy link
Contributor

yyy1000 commented Mar 11, 2024

probably related: #5942

@Lordworms
Copy link
Contributor Author

Lordworms commented Mar 11, 2024

My current plan for this is to generate a vectorization instruction coverage in CI/CD to track the usage of SIMD instructions. Also I think tokio may got some bugs for this. Maybe start to add parallism for different operator. Probably starting with SCAN

@alamb
Copy link
Contributor

alamb commented Mar 11, 2024

Hi @Lordworms -- thank you for this analysis.

(seems like we did not really do parallism and I really think that's some problem comes from Tokio)

I do not agree with this statement in general (though it may be that TPCH parallelism could be improved), -- DataFusion uses a signfiicant amount of CPU / parallelism and while tokio results in more complicated stack traces for sure, I think overall the benfits are worth it.

We did a comparison of DataFusion and DuckDB in our upcoming SIGMOD paper (#6782) DataFusion_Query_Engine___SIGMOD_2024.pdf where we compared single core efficiency and scaling (see the results section). We found areas that each engine did better in.

If your goal is to improve the performance of DataFusion in the TPCH queries I have some thoughts:

  1. The TPCH benchmark has many large joins. Thus the efficiency of the both the join plans and the join operators (e.g. HashJoinExec) is important for good TPCH
  2. The level of optimization that has been invested into DataFusion joins is relatively low compared to aggregationing and filtering (see [Epic] A collection of Join Improvements #8398 for a list of potential ideas)

@Omega359
Copy link
Contributor

I run DF on a c7i.48xlarge instance type in aws (192 cores, 384GB RAM) and during my processing I'm seeing almost 100% cpu usage across the board. So parallelism in my usecase is essentially perfect - though I can't speak for the efficiency.

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants