Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement right semi join and support it in HashBuildProbeOrder #3945

Closed
Dandandan opened this issue Oct 24, 2022 · 0 comments · Fixed by #3958
Closed

Implement right semi join and support it in HashBuildProbeOrder #3945

Dandandan opened this issue Oct 24, 2022 · 0 comments · Fixed by #3958
Labels
enhancement New feature or request performance Make DataFusion faster

Comments

@Dandandan
Copy link
Contributor

Dandandan commented Oct 24, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently, we have a SemiJoin option, this is basically a left semi join.
This works nicely, but because we don't have a RightSemiJoin we can't change the build/probe order.

If we have a RightSemiJoin we could support this - changing to use a right semi join whenever the build side is larger than the probe side. So basically - if there is a match, include the row in the results.

A right semi join returns the right row (at most once) for every match on the left side.

Example of this can be found in q18 in ballista (the top CoalesceBatchesExec is much bigger, optimal is the other way around):

      HashJoinExec: mode=Partitioned, join_type=Semi, on=[(Column { name: "o_orderkey", index: 2 }, Column { name: "l_orderkey", index: 0 })], metrics=[output_rows=4368, input_batches=32, output_batches=32, input_rows=4992, join_time=150.113398ms]
        CoalesceBatchesExec: target_batch_size=4096, metrics=[output_rows=59985993, elapsed_compute=11.114282245s, spill_count=0, spilled_bytes=0, mem_used=0]
          ShuffleReaderExec: partitions=16, metrics=[]
        CoalesceBatchesExec: target_batch_size=4096, metrics=[output_rows=624, elapsed_compute=1.087954ms, spill_count=0, spilled_bytes=0, mem_used=0]
          ShuffleReaderExec: partitions=16, metrics=[]

Here is q20

    HashJoinExec: mode=Partitioned, join_type=Semi, on=[(Column { name: "ps_partkey", index: 0 }, Column { name: "p_partkey", index: 0 })], metrics=[output_rows=86204, input_batches=32, output_batches=32, input_rows=107755, join_time=44.847746ms]
      CoalesceBatchesExec: target_batch_size=4096, metrics=[output_rows=7999999, elapsed_compute=273.447259ms, spill_count=0, spilled_bytes=0, mem_used=0]
        ShuffleReaderExec: partitions=16, metrics=[]
      CoalesceBatchesExec: target_batch_size=4096, metrics=[output_rows=21551, elapsed_compute=2.720426ms, spill_count=0, spilled_bytes=0, mem_used=0]
        ShuffleReaderExec: partitions=16, metrics=[]

Describe the solution you'd like
Implement RightSemiJoin and swap join order if the build side is bigger.

Describe alternatives you've considered

Additional context

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance Make DataFusion faster
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant