13 changes: 13 additions & 0 deletions benchmarks/src/tpch/run.rs
@@ -92,6 +92,15 @@ pub struct RunOpt {
    #[structopt(short = "j", long = "prefer_hash_join", default_value = "true")]
    prefer_hash_join: BoolDefaultTrue,

    /// If true, Piecewise Merge Join can be used; if false, the planner opts for Nested Loop Join.
    /// False by default.
    #[structopt(
        short = "j",
        long = "enable_piecewise_merge_join",
        default_value = "false"
    )]
    enable_piecewise_merge_join: BoolDefaultTrue,

    /// Mark the first column of each table as sorted in ascending order.
    /// The tables should have been created with the `--sort` option for this to have any effect.
    #[structopt(short = "t", long = "sorted")]
@@ -112,6 +121,8 @@ impl RunOpt {
            .config()?
            .with_collect_statistics(!self.disable_statistics);
        config.options_mut().optimizer.prefer_hash_join = self.prefer_hash_join;
        config.options_mut().optimizer.enable_piecewise_merge_join =
            self.enable_piecewise_merge_join;
        let rt_builder = self.common.runtime_env_builder()?;
        let ctx = SessionContext::new_with_config_rt(config, rt_builder.build_arc()?);
        // register tables
@@ -379,6 +390,7 @@ mod tests {
            output_path: None,
            disable_statistics: false,
            prefer_hash_join: true,
            enable_piecewise_merge_join: false,
            sorted: false,
        };
        opt.register_tables(&ctx).await?;
@@ -416,6 +428,7 @@ mod tests {
            output_path: None,
            disable_statistics: false,
            prefer_hash_join: true,
            enable_piecewise_merge_join: false,
            sorted: false,
        };
        opt.register_tables(&ctx).await?;
60 changes: 60 additions & 0 deletions dev/update_config_docs.sh
@@ -175,6 +175,66 @@ SET datafusion.execution.batch_size = 1024;

[`FairSpillPool`]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/struct.FairSpillPool.html

## Join Queries

Apache DataFusion currently supports the following join algorithms:

- Nested Loop Join
- Sort Merge Join
- Hash Join
- Symmetric Hash Join
- Piecewise Merge Join (experimental)

The physical planner chooses the appropriate algorithm based on the statistics and join
condition of the two tables.

## Join Algorithm Optimizer Configurations

You can modify join optimization behavior in your queries by setting specific configuration values.
Use the following command to update a configuration:

``` sql
SET datafusion.optimizer.<configuration_name> = <value>;
```

Example

``` sql
SET datafusion.optimizer.prefer_hash_join = false;
```

## Join Optimizer Configurations

Adjusting the following configuration values influences how the optimizer selects the join algorithm
used to execute your SQL query.

### allow_symmetric_joins_without_pruning (bool, default = true)

Controls whether symmetric hash joins are allowed for unbounded data sources even when their inputs
lack ordering or filtering.

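- If disabled, the `SymmetricHashJoin` operator cannot prune its internal buffers, so the results of certain join types (such as Full, Left, LeftAnti, LeftSemi, Right, RightAnti, and RightSemi) are produced only at the end of execution.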

### prefer_hash_join (bool, default = true)

Determines whether the optimizer prefers Hash Join over Sort Merge Join during physical plan selection.

- true: favors HashJoin for faster execution when sufficient memory is available.
- false: allows SortMergeJoin to be chosen when more memory-efficient execution is needed.

### enable_piecewise_merge_join (bool, default = false)

Enables the experimental Piecewise Merge Join algorithm.

- When enabled, the physical planner may select PiecewiseMergeJoin if the join condition contains
  exactly one range filter.
- For joins with a single range filter, Piecewise Merge Join is faster than Nested Loop Join,
  except when joining two large tables (num_rows > 100,000) of approximately equal size.

EOF


60 changes: 60 additions & 0 deletions docs/source/user-guide/configs.md
@@ -253,3 +253,63 @@ SET datafusion.execution.batch_size = 1024;
```

[`fairspillpool`]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/struct.FairSpillPool.html

## Join Queries

Apache DataFusion currently supports the following join algorithms:

- Nested Loop Join
- Sort Merge Join
- Hash Join
- Symmetric Hash Join
- Piecewise Merge Join (experimental)

The physical planner chooses the appropriate algorithm based on the statistics and join
condition of the two tables.

## Join Algorithm Optimizer Configurations

You can modify join optimization behavior in your queries by setting specific configuration values.
Use the following command to update a configuration:

```sql
SET datafusion.optimizer.<configuration_name> = <value>;
```

Example

```sql
SET datafusion.optimizer.prefer_hash_join = false;
```
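
To confirm that a setting took effect, the current value can be read back. This is a minimal sketch that assumes a client such as `datafusion-cli`, where `information_schema` is enabled so that `SHOW` is available:

```sql
-- Change the option, then read it back to confirm the new value.
SET datafusion.optimizer.prefer_hash_join = false;
SHOW datafusion.optimizer.prefer_hash_join;
```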

## Join Optimizer Configurations

Adjusting the following configuration values influences how the optimizer selects the join algorithm
used to execute your SQL query.

### allow_symmetric_joins_without_pruning (bool, default = true)

Controls whether symmetric hash joins are allowed for unbounded data sources even when their inputs
lack ordering or filtering.

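- If disabled, the `SymmetricHashJoin` operator cannot prune its internal buffers, so the results of certain join types (such as Full, Left, LeftAnti, LeftSemi, Right, RightAnti, and RightSemi) are produced only at the end of execution.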

### prefer_hash_join (bool, default = true)

Determines whether the optimizer prefers Hash Join over Sort Merge Join during physical plan selection.

- true: favors HashJoin for faster execution when sufficient memory is available.
- false: allows SortMergeJoin to be chosen when more memory-efficient execution is needed.
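
One way to see the effect of this option is to inspect the plan of an equijoin. The sketch below uses hypothetical tables `customers` and `orders`; which operator actually appears also depends on other settings, such as the number of target partitions:

```sql
-- Ask the optimizer not to prefer hash joins.
SET datafusion.optimizer.prefer_hash_join = false;

-- Hypothetical tables joined on an equality key.
EXPLAIN
SELECT c.name, o.total
FROM customers AS c
JOIN orders AS o
  ON c.id = o.customer_id;
```

With the option set to `false`, the physical plan may show `SortMergeJoinExec` where it would otherwise show `HashJoinExec`.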

### enable_piecewise_merge_join (bool, default = false)

Review comment (Contributor): btw, you may also want to include it in the tpch CLI utility, so people can test TPC queries with this kind of join.

Enables the experimental Piecewise Merge Join algorithm.

- When enabled, the physical planner may select PiecewiseMergeJoin if the join condition contains
  exactly one range filter.
- For joins with a single range filter, Piecewise Merge Join is faster than Nested Loop Join,
  except when joining two large tables (num_rows > 100,000) of approximately equal size.
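
As a minimal sketch of how the option might be tried out, the query below joins two hypothetical tables (`orders` and `shipments`, with assumed timestamp columns) on a single range predicate and no equality keys. Whether the planner actually selects the operator depends on the DataFusion version and the rest of the plan:

```sql
-- Opt in to the experimental operator (off by default).
SET datafusion.optimizer.enable_piecewise_merge_join = true;

-- Hypothetical tables; the join condition is exactly one range filter.
EXPLAIN
SELECT o.order_id, s.shipment_id
FROM orders AS o
JOIN shipments AS s
  ON o.order_ts < s.shipped_ts;
```

If the option takes effect, the physical plan should show a piecewise merge join operator in place of `NestedLoopJoinExec`; the exact operator name printed by `EXPLAIN` is not confirmed here.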