Contributor

rkrishn7 commented Sep 6, 2025

Which issue does this PR close?

What changes are included in this PR?

Adds a simple synchronization mechanism (BoundsWaiter) that lets tasks wait until the bounds for the left side are complete and the dynamic filter expr has been built.

Are these changes tested?

Added regression check for number of output rows from probe side scan (see #17452 (comment))

Are there any user-facing changes?

N/A

github-actions bot added the labels core (Core DataFusion crate) and physical-plan (Changes to the physical-plan crate) on Sep 6, 2025
    _file_meta: FileMeta,
    _file: PartitionedFile,
) -> Result<FileOpenFuture> {
    if let Some(predicate) = &self.predicate {
Contributor Author

Will take this out, just leaving here for now for visualization.

I wonder what are good ways to regression test this 🤔

Contributor

The best I can think of would be to make a custom ExecutionPlan that delays partition 0 for 1s or something and check that we still read the expected number of rows on the probe side

Contributor Author

rkrishn7 Sep 8, 2025

Hey @adriangb, I modified the existing partitioned test to inspect metrics from the probe side scan and verify rows are correctly being filtered out there. I think we can get away without the delay since I applied the patch to main and the test became flaky, as expected. Does that sound good to you or did you have something else in mind?
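As a rough illustration of the kind of check described here, the sketch below counts how many rows a probe-side scan would emit with and without a built filter. It is a simplified model, not the actual test: `Bounds` and `probe_output_rows` are hypothetical names, and treating an unbuilt filter as "pass everything" is an assumption made for the example.

```rust
/// Inclusive (min, max) bounds from one build-side partition (hypothetical type).
type Bounds = (i64, i64);

/// Rows the probe-side scan would emit. `None` models a dynamic filter that
/// has not been built yet, so nothing is filtered out (an assumption).
fn probe_output_rows(filter: Option<&[Bounds]>, probe: &[i64]) -> usize {
    match filter {
        None => probe.len(),
        Some(bounds) => probe
            .iter()
            // Keep a row only if it falls inside at least one reported range.
            .filter(|&&v| bounds.iter().any(|&(lo, hi)| v >= lo && v <= hi))
            .count(),
    }
}
```

A regression check along these lines would assert on the filtered row count, so a scan that starts before the filter is ready shows up as a larger number of output rows.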

Contributor

adriangb commented Sep 7, 2025

@rkrishn7 I've looked over the implementation - super clean, very nice stuff! I think we just need some regression tests. Can you take a look at my suggestion in #17452 (comment) and see if that works?

- DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, x], file_type=test, pushdown_supported=true
- HashJoinExec: mode=Partitioned, join_type=Inner, on=[(c@1, d@0)]
- DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[b, c, y], file_type=test, pushdown_supported=true, predicate=DynamicFilterPhysicalExpr [ b@0 >= aa AND b@0 <= ab ]
- DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[d, z], file_type=test, pushdown_supported=true, predicate=DynamicFilterPhysicalExpr [ d@0 >= ca AND d@0 <= ce ]
Contributor Author

@adriangb This update seems correct to me

Contributor

Hmm the filter is more selective (generally a good thing) but I'm curious why it changed, it's not immediately coming to me why waiting would change the value of the filter. Could you help me understand why that is?

Contributor Author

Ah I believe this is simply due to the fact that the filter is now applied in TestOpener. The bounds on the left side have changed due to rows being filtered out so we're seeing the correct filter now.

Contributor

adriangb commented Sep 8, 2025

I'll give this a final review tomorrow (Monday)!

Contributor

adriangb left a comment

@rkrishn7 this PR looks excellent to me

My only concern is about performance in edge cases. I do think for a lot of queries this will be an improvement, but I fear that in some cases (e.g. if the filter was not selective and the distribution amongst partitions is very uneven) the synchronization might hurt performance.

I propose that @alamb kick off a set of benchmarks and if we see no regressions we move forward with this. It's an internal revertible change so if down the road we find that the cons outweigh the pros (I doubt it) we can always roll it back.

let mut new_batches = Vec::new();
for batch in batches {
    let batch = if let Some(predicate) = &self.predicate {
        batch_filter(&batch, predicate)?
Contributor

Nice I didn't know this little function existed!

Contributor

alamb commented Sep 9, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing fix/wait-until-partition-bounds-reported (c24ef79) to b4a8b5a diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

Contributor

alamb commented Sep 9, 2025

> I propose that @alamb kick off a set of benchmarks and if we see no regressions we move forward with this. It's an internal revertible change so if down the road we find that the cons outweigh the pros (I doubt it) we can always roll it back.

done!

Contributor

alamb left a comment

I also spent some time reviewing this code and it makes sense to me. It would be great to have some tests, but I don't have a great suggestion for that.

Thank you @rkrishn7 and @adriangb

}
}

/// Utility future to wait until all partitions have reported completion
Contributor

This looks very similar to a Barrier -- https://docs.rs/tokio/latest/tokio/sync/struct.Barrier.html

Maybe as a follow-on PR we could simplify the code a bit by using that predefined structure
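For readers unfamiliar with the primitive, here is a minimal sketch of the barrier pattern being suggested. It uses `std::sync::Barrier` with threads as a stand-in for the async `tokio::sync::Barrier` linked above, so it is not the PR's code; `Bounds` and `run_partitions` are hypothetical names chosen for illustration.

```rust
use std::sync::{Arc, Barrier, Mutex};
use std::thread;

/// Hypothetical stand-in for the (min, max) bounds of one build-side partition.
type Bounds = (i64, i64);

/// Each "partition" thread reports its bounds, then waits at the barrier.
/// Returns how many bounds each partition could see once the barrier released.
fn run_partitions(partitions: Vec<Vec<i64>>) -> Vec<usize> {
    let n = partitions.len();
    let barrier = Arc::new(Barrier::new(n));
    let reported: Arc<Mutex<Vec<Bounds>>> = Arc::new(Mutex::new(Vec::new()));

    let handles: Vec<_> = partitions
        .into_iter()
        .map(|values| {
            let barrier = Arc::clone(&barrier);
            let reported = Arc::clone(&reported);
            thread::spawn(move || {
                let min = *values.iter().min().expect("non-empty partition");
                let max = *values.iter().max().expect("non-empty partition");
                // Report this partition's bounds before waiting.
                reported.lock().unwrap().push((min, max));
                // Blocks until all `n` partitions have reported.
                barrier.wait();
                // Past the barrier, every partition's bounds are visible.
                let seen = reported.lock().unwrap().len();
                seen
            })
        })
        .collect();

    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Because every thread pushes its bounds before calling `wait`, no thread can get past the barrier while any partition's bounds are still missing.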

Contributor

Agreed - @rkrishn7 could you refactor this to use Barrier?

Contributor Author

Definitely, I was reaching for Barrier at first but I didn't see many tokio sync primitives being used in the crate so I decided against it. Agreed that using Barrier is preferable

Contributor

Would you rather push to this PR or do it as a followup? Either is good to me.

Contributor Author

I'll handle in this PR! Should get it done shortly

Contributor Author

@alamb @adriangb Done ✅

Contributor

I think it is looking pretty sweet now that we have the barrier in there :bowtie:

Contributor

Yeah this is some very nice code @rkrishn7 !

Contributor

alamb commented Sep 9, 2025

🤖: Benchmark completed

Comparing HEAD and fix_wait-until-partition-bounds-reported
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ fix_wait-until-partition-bounds-reported ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2729.13 ms │                               2769.08 ms │ no change │
│ QQuery 1     │  1211.24 ms │                               1254.20 ms │ no change │
│ QQuery 2     │  2387.44 ms │                               2377.36 ms │ no change │
│ QQuery 3     │  1203.78 ms │                               1177.84 ms │ no change │
│ QQuery 4     │  2262.85 ms │                               2265.06 ms │ no change │
│ QQuery 5     │ 27547.53 ms │                              27678.39 ms │ no change │
│ QQuery 6     │  4295.91 ms │                               4278.10 ms │ no change │
│ QQuery 7     │  3719.50 ms │                               3592.64 ms │ no change │
└──────────────┴─────────────┴──────────────────────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                       │ 45357.36ms │
│ Total Time (fix_wait-until-partition-bounds-reported)   │ 45392.68ms │
│ Average Time (HEAD)                                     │  5669.67ms │
│ Average Time (fix_wait-until-partition-bounds-reported) │  5674.08ms │
│ Queries Faster                                          │          0 │
│ Queries Slower                                          │          0 │
│ Queries with No Change                                  │          8 │
│ Queries with Failure                                    │          0 │
└─────────────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ fix_wait-until-partition-bounds-reported ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.04 ms │                                  2.18 ms │  1.07x slower │
│ QQuery 1     │    48.30 ms │                                 49.23 ms │     no change │
│ QQuery 2     │   135.20 ms │                                136.85 ms │     no change │
│ QQuery 3     │   159.94 ms │                                171.46 ms │  1.07x slower │
│ QQuery 4     │  1003.67 ms │                               1182.09 ms │  1.18x slower │
│ QQuery 5     │  1456.25 ms │                               1594.36 ms │  1.09x slower │
│ QQuery 6     │     2.15 ms │                                  2.27 ms │  1.05x slower │
│ QQuery 7     │    54.72 ms │                                 54.40 ms │     no change │
│ QQuery 8     │  1394.69 ms │                               1544.75 ms │  1.11x slower │
│ QQuery 9     │  1797.64 ms │                               1984.12 ms │  1.10x slower │
│ QQuery 10    │   387.33 ms │                                389.05 ms │     no change │
│ QQuery 11    │   429.42 ms │                                443.07 ms │     no change │
│ QQuery 12    │  1326.44 ms │                               1485.86 ms │  1.12x slower │
│ QQuery 13    │  2099.97 ms │                               2276.44 ms │  1.08x slower │
│ QQuery 14    │  1234.81 ms │                               1331.83 ms │  1.08x slower │
│ QQuery 15    │  1142.33 ms │                               1294.07 ms │  1.13x slower │
│ QQuery 16    │  2543.10 ms │                               2738.82 ms │  1.08x slower │
│ QQuery 17    │  2515.98 ms │                               2739.37 ms │  1.09x slower │
│ QQuery 18    │  5068.98 ms │                               5090.39 ms │     no change │
│ QQuery 19    │   127.18 ms │                                129.36 ms │     no change │
│ QQuery 20    │  2060.73 ms │                               2078.11 ms │     no change │
│ QQuery 21    │  2354.13 ms │                               2408.36 ms │     no change │
│ QQuery 22    │  4005.57 ms │                               4050.08 ms │     no change │
│ QQuery 23    │ 12749.51 ms │                              12881.11 ms │     no change │
│ QQuery 24    │   228.03 ms │                                213.41 ms │ +1.07x faster │
│ QQuery 25    │   518.13 ms │                                510.26 ms │     no change │
│ QQuery 26    │   226.24 ms │                                223.45 ms │     no change │
│ QQuery 27    │  2882.18 ms │                               2884.58 ms │     no change │
│ QQuery 28    │ 23381.94 ms │                              24776.31 ms │  1.06x slower │
│ QQuery 29    │   993.29 ms │                                986.32 ms │     no change │
│ QQuery 30    │  1324.40 ms │                               1381.89 ms │     no change │
│ QQuery 31    │  1311.69 ms │                               1350.99 ms │     no change │
│ QQuery 32    │  4480.63 ms │                               5012.25 ms │  1.12x slower │
│ QQuery 33    │  5793.94 ms │                               5914.53 ms │     no change │
│ QQuery 34    │  5953.76 ms │                               5925.12 ms │     no change │
│ QQuery 35    │  2017.82 ms │                               2075.76 ms │     no change │
│ QQuery 36    │   125.13 ms │                                125.07 ms │     no change │
│ QQuery 37    │    52.98 ms │                                 52.48 ms │     no change │
│ QQuery 38    │   126.88 ms │                                124.09 ms │     no change │
│ QQuery 39    │   197.21 ms │                                201.56 ms │     no change │
│ QQuery 40    │    44.50 ms │                                 43.06 ms │     no change │
│ QQuery 41    │    39.86 ms │                                 38.58 ms │     no change │
│ QQuery 42    │    33.74 ms │                                 33.80 ms │     no change │
└──────────────┴─────────────┴──────────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                       │ 93832.46ms │
│ Total Time (fix_wait-until-partition-bounds-reported)   │ 97931.10ms │
│ Average Time (HEAD)                                     │  2182.15ms │
│ Average Time (fix_wait-until-partition-bounds-reported) │  2277.47ms │
│ Queries Faster                                          │          1 │
│ Queries Slower                                          │         15 │
│ Queries with No Change                                  │         27 │
│ Queries with Failure                                    │          0 │
└─────────────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ fix_wait-until-partition-bounds-reported ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 168.24 ms │                                167.63 ms │     no change │
│ QQuery 2     │  25.72 ms │                                 27.34 ms │  1.06x slower │
│ QQuery 3     │  45.78 ms │                                 44.61 ms │     no change │
│ QQuery 4     │  26.57 ms │                                 26.71 ms │     no change │
│ QQuery 5     │  74.68 ms │                                 75.70 ms │     no change │
│ QQuery 6     │  18.99 ms │                                 19.21 ms │     no change │
│ QQuery 7     │ 146.68 ms │                                150.57 ms │     no change │
│ QQuery 8     │  33.53 ms │                                 31.32 ms │ +1.07x faster │
│ QQuery 9     │  85.72 ms │                                 90.16 ms │  1.05x slower │
│ QQuery 10    │  58.94 ms │                                 60.08 ms │     no change │
│ QQuery 11    │  41.26 ms │                                 42.60 ms │     no change │
│ QQuery 12    │  51.06 ms │                                 51.21 ms │     no change │
│ QQuery 13    │  47.20 ms │                                 47.82 ms │     no change │
│ QQuery 14    │  13.88 ms │                                 14.61 ms │  1.05x slower │
│ QQuery 15    │  24.33 ms │                                 24.23 ms │     no change │
│ QQuery 16    │  23.96 ms │                                 23.81 ms │     no change │
│ QQuery 17    │ 148.22 ms │                                149.04 ms │     no change │
│ QQuery 18    │ 324.96 ms │                                319.65 ms │     no change │
│ QQuery 19    │  36.91 ms │                                 38.25 ms │     no change │
│ QQuery 20    │  49.09 ms │                                 50.11 ms │     no change │
│ QQuery 21    │ 220.06 ms │                                223.42 ms │     no change │
│ QQuery 22    │  19.35 ms │                                 19.19 ms │     no change │
└──────────────┴───────────┴──────────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                                       ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                                       │ 1685.12ms │
│ Total Time (fix_wait-until-partition-bounds-reported)   │ 1697.27ms │
│ Average Time (HEAD)                                     │   76.60ms │
│ Average Time (fix_wait-until-partition-bounds-reported) │   77.15ms │
│ Queries Faster                                          │         1 │
│ Queries Slower                                          │         3 │
│ Queries with No Change                                  │        18 │
│ Queries with Failure                                    │         0 │
└─────────────────────────────────────────────────────────┴───────────┘

Contributor Author

rkrishn7 commented Sep 9, 2025

@adriangb What are your thoughts on the benchmark run? I see some regressions 🤔

Contributor

adriangb commented Sep 9, 2025

> @adriangb What are your thoughts on the benchmark run? I see some regressions 🤔

I think it's noise. For example, let's look at:

Benchmark clickbench_partitioned.json
QQuery 12    │  1326.44 ms │                               1485.86 ms │  1.12x slower

SELECT "SearchPhrase", COUNT(*) AS c FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

This query obviously does not have a join, and I can confirm with the query plan that there are no joins:

> explain SELECT "SearchPhrase", COUNT(*) AS c FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │  SortPreservingMergeExec  │ |
|               | │    --------------------   │ |
|               | │      c DESClimit: 10      │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       SortExec(TopK)      │ |
|               | │    --------------------   │ |
|               | │          c@1 DESC         │ |
|               | │                           │ |
|               | │         limit: 10         │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       ProjectionExec      │ |
|               | │    --------------------   │ |
|               | │       SearchPhrase:       │ |
|               | │        SearchPhrase       │ |
|               | │                           │ |
|               | │     c: count(Int64(1))    │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       AggregateExec       │ |
|               | │    --------------------   │ |
|               | │       aggr: count(1)      │ |
|               | │                           │ |
|               | │         group_by:         │ |
|               | │        SearchPhrase       │ |
|               | │                           │ |
|               | │           mode:           │ |
|               | │      FinalPartitioned     │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │    CoalesceBatchesExec    │ |
|               | │    --------------------   │ |
|               | │     target_batch_size:    │ |
|               | │            8192           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │      RepartitionExec      │ |
|               | │    --------------------   │ |
|               | │ partition_count(in->out): │ |
|               | │          12 -> 12         │ |
|               | │                           │ |
|               | │    partitioning_scheme:   │ |
|               | │ Hash([SearchPhrase@0], 12)│ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       AggregateExec       │ |
|               | │    --------------------   │ |
|               | │       aggr: count(1)      │ |
|               | │                           │ |
|               | │         group_by:         │ |
|               | │        SearchPhrase       │ |
|               | │                           │ |
|               | │       mode: Partial       │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │    CoalesceBatchesExec    │ |
|               | │    --------------------   │ |
|               | │     target_batch_size:    │ |
|               | │            8192           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │         FilterExec        │ |
|               | │    --------------------   │ |
|               | │         predicate:        │ |
|               | │      SearchPhrase !=      │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │         files: 111        │ |
|               | │      format: parquet      │ |
|               | │                           │ |
|               | │         predicate:        │ |
|               | │      SearchPhrase !=      │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+

So a regression in this query could not possibly be caused by the changes in this PR which are narrowly scoped to HashJoinExec.

Contributor

adriangb commented Sep 9, 2025

@rkrishn7 I merged #17444 which may cause some conflicts with this branch just FYI

Contributor Author

rkrishn7 commented Sep 9, 2025

@adriangb I think there is an opportunity to simplify the bounds collection for each partition. That is, we can probably just track the min/max across all partitions and build a single AND binary expr once we have the final min/max (i.e. all partition bounds have been reported).

Aside from one less mutex, I think it'll help reduce output in EXPLAIN as well. Happy to tackle in a follow-up PR

Contributor

adriangb commented Sep 9, 2025

> @adriangb I think there is opportunity to simplify the bounds collection for each partition. That is, we can probably just track the min/max across all partitions and build a single AND binary expr once we have the final min/max (i.e. all partition bounds have been reported).
>
> Aside from one less mutex, I think it'll help reduce output in EXPLAIN as well. Happy to tackle in a follow-up PR

I think that will regress performance: imagine partition 1 has bounds (0, 1) and partition 2 has bounds (999998, 999999). With bounds per partition the value 1234 is filtered out. The merged bounds of (0, 999999) would include that value.
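The example above can be worked through in a few lines. This is only a sketch of the idea: it assumes the per-partition filter keeps a probe value when it falls inside any one partition's range, and `Bounds`, `passes_per_partition`, `merge`, and `passes_merged` are hypothetical names, not DataFusion APIs.

```rust
/// Inclusive (min, max) bounds reported by one build-side partition.
type Bounds = (i64, i64);

/// Per-partition filter (assumed form): keep a probe value if it falls
/// inside the range of at least one partition.
fn passes_per_partition(bounds: &[Bounds], v: i64) -> bool {
    bounds.iter().any(|&(lo, hi)| v >= lo && v <= hi)
}

/// Collapse all partition ranges into one global (min, max) range.
/// Returns `None` when no bounds were reported.
fn merge(bounds: &[Bounds]) -> Option<Bounds> {
    let lo = bounds.iter().map(|b| b.0).min()?;
    let hi = bounds.iter().map(|b| b.1).max()?;
    Some((lo, hi))
}

/// Merged filter: keep a probe value if it falls inside the global range.
fn passes_merged(bounds: &[Bounds], v: i64) -> bool {
    merge(bounds).map_or(false, |(lo, hi)| v >= lo && v <= hi)
}
```

With bounds `[(0, 1), (999998, 999999)]`, the value 1234 fails `passes_per_partition` but passes `passes_merged`, since the merged range `(0, 999999)` covers the whole gap between the two partitions.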

Contributor Author

rkrishn7 commented Sep 9, 2025

> > @adriangb I think there is opportunity to simplify the bounds collection for each partition. That is, we can probably just track the min/max across all partitions and build a single AND binary expr once we have the final min/max (i.e. all partition bounds have been reported).
> > Aside from one less mutex, I think it'll help reduce output in EXPLAIN as well. Happy to tackle in a follow-up PR
>
> I think that will regress performance: imagine partition 1 has bounds (0, 1) and partition 2 has bounds (999998, 999999). With bounds per partition the value 1234 is filtered out. The merged bounds of (0, 999999) would include that value.

Ah yes 🤦🏾 , definitely. Good catch!

Contributor

adriangb commented Sep 9, 2025

> > > @adriangb I think there is opportunity to simplify the bounds collection for each partition. That is, we can probably just track the min/max across all partitions and build a single AND binary expr once we have the final min/max (i.e. all partition bounds have been reported).
> > > Aside from one less mutex, I think it'll help reduce output in EXPLAIN as well. Happy to tackle in a follow-up PR
> >
> > I think that will regress performance: imagine partition 1 has bounds (0, 1) and partition 2 has bounds (999998, 999999). With bounds per partition the value 1234 is filtered out. The merged bounds of (0, 999999) would include that value.
>
> Ah yes 🤦🏾 , definitely. Good catch!

This is the fundamental limitation of a min/max bounds approach. For some queries / datasets it's going to be very effective, for others not at all. Hence why we are discussing pushing down bloom filters, etc. But keeping the min/max per partition is at least a good compromise for now / seems to be working well.

Contributor

adriangb commented Sep 9, 2025

I plan to merge this PR once CI is green 🚀

@adriangb adriangb merged commit da3d90a into apache:main Sep 9, 2025
28 checks passed
Successfully merging this pull request may close these issues.

Ensure dynamic filter expr is built before fetching probe batch in HashJoin