Skip to content

Commit bf3dc94

Browse files
jonathanc-ntobixdev
authored andcommitted
doc: Add Join Physical Plan documentation, and configuration flag to benchmarks (apache#18209)
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes #. ## Rationale for this change Allow users to understand some decisions for when to change certain joins configurations. <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? Add readme to joins physical plan <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
1 parent 70e2ab3 commit bf3dc94

File tree

3 files changed

+133
-0
lines changed

3 files changed

+133
-0
lines changed

benchmarks/src/tpch/run.rs

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -92,6 +92,15 @@ pub struct RunOpt {
9292
#[structopt(short = "j", long = "prefer_hash_join", default_value = "true")]
9393
prefer_hash_join: BoolDefaultTrue,
9494

95+
/// If true then Piecewise Merge Join can be used, if false then it will opt for Nested Loop Join
96+
/// True by default.
97+
#[structopt(
98+
short = "j",
99+
long = "enable_piecewise_merge_join",
100+
default_value = "false"
101+
)]
102+
enable_piecewise_merge_join: BoolDefaultTrue,
103+
95104
/// Mark the first column of each table as sorted in ascending order.
96105
/// The tables should have been created with the `--sort` option for this to have any effect.
97106
#[structopt(short = "t", long = "sorted")]
@@ -112,6 +121,8 @@ impl RunOpt {
112121
.config()?
113122
.with_collect_statistics(!self.disable_statistics);
114123
config.options_mut().optimizer.prefer_hash_join = self.prefer_hash_join;
124+
config.options_mut().optimizer.enable_piecewise_merge_join =
125+
self.enable_piecewise_merge_join;
115126
let rt_builder = self.common.runtime_env_builder()?;
116127
let ctx = SessionContext::new_with_config_rt(config, rt_builder.build_arc()?);
117128
// register tables
@@ -379,6 +390,7 @@ mod tests {
379390
output_path: None,
380391
disable_statistics: false,
381392
prefer_hash_join: true,
393+
enable_piecewise_merge_join: false,
382394
sorted: false,
383395
};
384396
opt.register_tables(&ctx).await?;
@@ -416,6 +428,7 @@ mod tests {
416428
output_path: None,
417429
disable_statistics: false,
418430
prefer_hash_join: true,
431+
enable_piecewise_merge_join: false,
419432
sorted: false,
420433
};
421434
opt.register_tables(&ctx).await?;

dev/update_config_docs.sh

Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -175,6 +175,66 @@ SET datafusion.execution.batch_size = 1024;
175175
176176
[`FairSpillPool`]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/struct.FairSpillPool.html
177177
178+
## Join Queries
179+
180+
Currently Apache Datafusion supports the following join algorithms:
181+
182+
- Nested Loop Join
183+
- Sort Merge Join
184+
- Hash Join
185+
- Symmetric Hash Join
186+
- Piecewise Merge Join (experimental)
187+
188+
The physical planner will choose the appropriate algorithm based on the statistics + join
189+
condition of the two tables.
190+
191+
# Join Algorithm Optimizer Configurations
192+
193+
You can modify join optimization behavior in your queries by setting specific configuration values.
194+
Use the following command to update a configuration:
195+
196+
``` sql
197+
SET datafusion.optimizer.<configuration_name>;
198+
```
199+
200+
Example
201+
202+
``` sql
203+
SET datafusion.optimizer.prefer_hash_join = false;
204+
```
205+
206+
Adjusting the following configuration values influences how the optimizer selects the join algorithm
207+
used to execute your SQL query:
208+
209+
## Join Optimizer Configurations
210+
211+
Adjusting the following configuration values influences how the optimizer selects the join algorithm
212+
used to execute your SQL query.
213+
214+
### allow_symmetric_joins_without_pruning (bool, default = true)
215+
216+
Controls whether symmetric hash joins are allowed for unbounded data sources even when their inputs
217+
lack ordering or filtering.
218+
219+
- If disabled, the `SymmetricHashJoin` operator cannot prune its internal buffers to be produced only at the end of execution.
220+
221+
### prefer_hash_join (bool, default = true)
222+
223+
Determines whether the optimizer prefers Hash Join over Sort Merge Join during physical plan selection.
224+
225+
- true: favors HashJoin for faster execution when sufficient memory is available.
226+
- false: allows SortMergeJoin to be chosen when more memory-efficient execution is needed.
227+
228+
### enable_piecewise_merge_join (bool, default = false)
229+
230+
Enables the experimental Piecewise Merge Join algorithm.
231+
232+
- When enabled, the physical planner may select PiecewiseMergeJoin if there is exactly one range
233+
filter in the join condition.
234+
- Piecewise Merge Join is faster than Nested Loop Join performance wise for single range filter
235+
except for cases where it is joining two large tables (num_rows > 100,000) that are approximately
236+
equal in size.
237+
178238
EOF
179239

180240

docs/source/user-guide/configs.md

Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -253,3 +253,63 @@ SET datafusion.execution.batch_size = 1024;
253253
```
254254

255255
[`fairspillpool`]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/struct.FairSpillPool.html
256+
257+
## Join Queries
258+
259+
Currently Apache Datafusion supports the following join algorithms:
260+
261+
- Nested Loop Join
262+
- Sort Merge Join
263+
- Hash Join
264+
- Symmetric Hash Join
265+
- Piecewise Merge Join (experimental)
266+
267+
The physical planner will choose the appropriate algorithm based on the statistics + join
268+
condition of the two tables.
269+
270+
# Join Algorithm Optimizer Configurations
271+
272+
You can modify join optimization behavior in your queries by setting specific configuration values.
273+
Use the following command to update a configuration:
274+
275+
```sql
276+
SET datafusion.optimizer.<configuration_name>;
277+
```
278+
279+
Example
280+
281+
```sql
282+
SET datafusion.optimizer.prefer_hash_join = false;
283+
```
284+
285+
Adjusting the following configuration values influences how the optimizer selects the join algorithm
286+
used to execute your SQL query:
287+
288+
## Join Optimizer Configurations
289+
290+
Adjusting the following configuration values influences how the optimizer selects the join algorithm
291+
used to execute your SQL query.
292+
293+
### allow_symmetric_joins_without_pruning (bool, default = true)
294+
295+
Controls whether symmetric hash joins are allowed for unbounded data sources even when their inputs
296+
lack ordering or filtering.
297+
298+
- If disabled, the `SymmetricHashJoin` operator cannot prune its internal buffers to be produced only at the end of execution.
299+
300+
### prefer_hash_join (bool, default = true)
301+
302+
Determines whether the optimizer prefers Hash Join over Sort Merge Join during physical plan selection.
303+
304+
- true: favors HashJoin for faster execution when sufficient memory is available.
305+
- false: allows SortMergeJoin to be chosen when more memory-efficient execution is needed.
306+
307+
### enable_piecewise_merge_join (bool, default = false)
308+
309+
Enables the experimental Piecewise Merge Join algorithm.
310+
311+
- When enabled, the physical planner may select PiecewiseMergeJoin if there is exactly one range
312+
filter in the join condition.
313+
- Piecewise Merge Join is faster than Nested Loop Join performance wise for single range filter
314+
except for cases where it is joining two large tables (num_rows > 100,000) that are approximately
315+
equal in size.

0 commit comments

Comments
 (0)