[GLUTEN-9849][VL] Avoid VeloxBloomFilterMightContain being applied to FileSourceScan partition filters #9850

wForget · 2025-06-03T08:49:06Z

What changes were proposed in this pull request?

Exclude VeloxBloomFilterMightContain from FileSourceScan partition filters.

Fixes: #9849

How was this patch tested?

Added a unit test.

github-actions · 2025-06-03T08:49:21Z

#9849

github-actions · 2025-06-03T08:49:35Z

Run Gluten ClickHouse CI on ARM

rui-mo · 2025-06-05T08:23:44Z

gluten-substrait/src/main/scala/org/apache/gluten/execution/ScanTransformerFactory.scala

      scanExec.output,
      scanExec.requiredSchema,
-      scanExec.partitionFilters,
+      partitionFilters,


Do you also need to consider this issue for the BatchScanTransformer? Thanks.

> Do you also need to consider this issue for the BatchScanTransformer? Thanks.

~~I will try to add a unit test to cover this case~~

> Do you also need to consider this issue for the BatchScanTransformer? Thanks.

There are no partition filters in BatchScanExecTransformer, and runtime filters are also converted to source V2 predicates, so there is no similar issue for `BatchScanExecTransformer`.
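
For readers following the diff in this review thread, here is a minimal sketch of excluding bloom filter predicates from a scan's partition filters. It uses a toy expression tree rather than Spark's Catalyst classes. `Expr`, `MightContain`, `containsMightContain`, and `splitPartitionFilters` are hypothetical names chosen for illustration and are not the code in this PR.

```scala
// Toy model only: none of these types are Spark or Gluten classes.
object PartitionFilterSketch {
  sealed trait Expr { def children: Seq[Expr] }

  final case class Attr(name: String) extends Expr {
    val children: Seq[Expr] = Nil
  }
  final case class Lit(value: Any) extends Expr {
    val children: Seq[Expr] = Nil
  }
  final case class EqualTo(left: Expr, right: Expr) extends Expr {
    val children: Seq[Expr] = Seq(left, right)
  }
  final case class And(left: Expr, right: Expr) extends Expr {
    val children: Seq[Expr] = Seq(left, right)
  }
  // Stands in for a predicate that can only be evaluated natively,
  // such as VeloxBloomFilterMightContain.
  final case class MightContain(bloomFilter: Expr, value: Expr) extends Expr {
    val children: Seq[Expr] = Seq(bloomFilter, value)
  }

  // True if the expression or any of its descendants is a MightContain node.
  def containsMightContain(e: Expr): Boolean = e match {
    case _: MightContain => true
    case other => other.children.exists(containsMightContain)
  }

  // Returns (kept, excluded): filters that are safe to evaluate on the driver,
  // and filters that reference the native-only predicate.
  def splitPartitionFilters(filters: Seq[Expr]): (Seq[Expr], Seq[Expr]) =
    filters.partition(f => !containsMightContain(f))
}
```

Only the `kept` half would be passed on as partition filters. What happens to the excluded predicates is outside the scope of this sketch.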

wForget · 2025-06-05T12:55:20Z

Duplicate of #6650. cc @WangGuangxin @zhztheplayer Could you please take a look?

gluten-ut/spark35/src/test/scala/org/apache/spark/sql/GlutenInjectRuntimeFilterSuite.scala

zhouyuan · 2025-06-23T14:34:01Z

cc @WangGuangxin, as he previously fixed a similar issue in #6652.

zhztheplayer

Give me a couple of minutes to check this. Thanks!

zhztheplayer

I haven't finalized a simple solution, so feel free to proceed.

@wForget If we exclude the might_contain from partition filters, we may lose the advantage brought by Spark bloom filters. No?

wForget · 2025-06-26T07:15:46Z

> @wForget If we exclude the might_contain from partition filters, we may lose the advantage brought by Spark bloom filters. No?

Yes, this will indeed cause a performance regression, but the bloomFilterData constructed by VeloxBloomFilter cannot be used in the driver unless we call Velox from the driver through JNI.

jinchengchenghh · 2025-06-26T07:19:34Z

I would suggest implementing Spark's native bloom filter in Velox; then we can remove this rule.

zhztheplayer · 2025-06-26T07:53:11Z

> I would suggest implementing Spark's native bloom filter in Velox; then we can remove this rule.

It's a good idea, though the challenge is that we need to maintain consistency between Spark versions and Velox. That is, once Spark updates its bloom filter algorithm, we have to add a copy of that algorithm to Velox. Maybe we can first leave an abstraction layer in the C++ implementation for different Spark versions.
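
One way to picture such an abstraction layer is a small interface with one implementation per serialized format, selected by a version tag. The sketch below is written in Scala to match the rest of this PR, although the comment refers to the C++ side. `BloomFilterCodec`, `FormatV1`, `FormatV2`, the single-hash bit mapping, and the byte layout are all invented for illustration; they do not correspond to Spark's or Velox's actual formats.

```scala
// Hypothetical sketch: a one-hash "bloom filter" whose bit mapping differs per format.
object BloomFilterFormatSketch {
  trait BloomFilterCodec {
    // Tag stored in byte 0 of the serialized filter.
    def version: Byte

    // Maps an item to a bit position; this is where the formats are assumed to differ.
    protected def bitIndex(item: Long, numBits: Int): Int

    // Layout: [version byte][numBytes bytes of bit set].
    def encode(items: Seq[Long], numBytes: Int): Array[Byte] = {
      val out = new Array[Byte](numBytes + 1)
      out(0) = version
      items.foreach { item =>
        val bit = bitIndex(item, numBytes * 8)
        out(1 + bit / 8) = (out(1 + bit / 8) | (1 << (bit % 8))).toByte
      }
      out
    }

    def mightContain(bytes: Array[Byte], item: Long): Boolean = {
      val bit = bitIndex(item, (bytes.length - 1) * 8)
      (bytes(1 + bit / 8) & (1 << (bit % 8))) != 0
    }
  }

  object FormatV1 extends BloomFilterCodec {
    val version: Byte = 1
    protected def bitIndex(item: Long, numBits: Int): Int =
      java.lang.Math.floorMod(item, numBits.toLong).toInt
  }

  object FormatV2 extends BloomFilterCodec {
    val version: Byte = 2
    protected def bitIndex(item: Long, numBits: Int): Int =
      java.lang.Math.floorMod(item * 31L + 7L, numBits.toLong).toInt
  }

  private val codecs: Map[Byte, BloomFilterCodec] =
    Seq(FormatV1, FormatV2).map(c => c.version -> c).toMap

  // Dispatches on the version tag; unknown formats are rejected rather than misread.
  // Assumes a non-empty byte array.
  def mightContain(bytes: Array[Byte], item: Long): Either[String, Boolean] =
    codecs.get(bytes(0)) match {
      case Some(codec) => Right(codec.mightContain(bytes, item))
      case None => Left(s"unsupported bloom filter format version ${bytes(0)}")
    }
}
```

Reading bytes written by one format with another format's bit mapping can report an inserted item as absent, which is the kind of inconsistency the version tag is meant to prevent.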

wForget · 2025-06-26T08:56:08Z

backends-velox/src/main/scala/org/apache/gluten/expression/VeloxBloomFilterMightContain.scala

 * Spark so produces different intermediate aggregate data. Thus we use different filter function /
 * agg function types for Velox's version to distinguish from vanilla Spark's implementation.
+ *
+ * FIXME: Remove GlutenTaskOnlyExpression after the VeloxBloomFilter expr is made compatible with


@zhztheplayer @jinchengchenghh I added a comment to record this improvement. How about we merge this PR first?

No problem from my end. cc @jinchengchenghh

Also, can we log an issue highlighting the performance gap caused by scan + might_contain being unavailable?

Also, I guess vanilla Spark's scan + Velox might_contain will cause the issue as well?

> Also, I guess vanilla Spark's scan + Velox might_contain will cause the issue as well?

In what scenarios does a vanilla Spark scan with native expressions occur? Do we have an existing test case?

I meant the case where velox_might_contain is included in a vanilla Spark scan node that, for whatever reason, Gluten fell back. I was just guessing whether the same issue would happen in that case.

> I meant the case where velox_might_contain is included in a vanilla Spark scan node that, for whatever reason, Gluten fell back. I was just guessing whether the same issue would happen in that case.

Oh, I see.

> Also, can we log an issue highlighting the performance gap caused by scan + might_contain being unavailable?

Filed an issue: #10071

wForget · 2025-06-26T10:17:38Z

By the way, the current scenario may be uncommon. The PartitionPruning rule is applied before InjectRuntimeFilter, and InjectRuntimeFilter checks whether a DPP filter already exists.
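
The ordering argument can be sketched with a toy two-step optimizer. `Scan`, `partitionPruning`, `injectRuntimeFilter`, and the `dppBeneficial` flag are hypothetical stand-ins, not Spark's actual rules; in particular, the flag is an assumed reason why a DPP filter might not be inserted on a partition column.

```scala
// Toy model of rule ordering; not Spark's optimizer API.
object RuleOrderSketch {
  // Filters attached to the scan side of a join, for one join key column.
  final case class Scan(
      joinKey: String,
      isPartitionColumn: Boolean,
      dppBeneficial: Boolean,
      filters: Set[String] = Set.empty)

  // Runs first: adds a DPP filter on a partition column when pruning looks worthwhile.
  def partitionPruning(scan: Scan): Scan =
    if (scan.isPartitionColumn && scan.dppBeneficial) {
      scan.copy(filters = scan.filters + "dpp")
    } else {
      scan
    }

  // Runs second: skips the bloom filter when a DPP filter is already present.
  def injectRuntimeFilter(scan: Scan): Scan =
    if (scan.filters.contains("dpp")) scan
    else scan.copy(filters = scan.filters + "might_contain")

  def optimize(scan: Scan): Scan = injectRuntimeFilter(partitionPruning(scan))
}
```

In this model, a `might_contain` filter lands on a partition column only when no DPP filter was inserted for it, which is consistent with the scenario being described as uncommon.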

wForget · 2025-06-27T08:20:45Z

@zhztheplayer @jinchengchenghh @WangGuangxin @zhouyuan Thank you for your review. I will merge this PR later if there is no further discussion.

wForget · 2025-06-30T02:05:55Z

Thanks, merged to main

zhztheplayer · 2025-07-24T02:20:15Z

> I would suggest implementing Spark's native bloom filter in Velox; then we can remove this rule.

@jinchengchenghh The Spark community just accepted a V2 bloom filter implementation: apache/spark#50933.

github-actions bot added the CORE and VELOX labels Jun 3, 2025

[GLUTEN-9849][VL] Avoid VeloxBloomFilterMightContain being applied to FileSourceScan partition filters

8816d1a

wForget force-pushed the GLUTEN-9849 branch from 0760d7f to 8816d1a Compare June 3, 2025 08:57

wForget marked this pull request as ready for review June 3, 2025 10:07

FelixYBW requested a review from rui-mo June 3, 2025 23:03

wForget marked this pull request as draft June 5, 2025 04:57

wForget force-pushed the GLUTEN-9849 branch from ca02e64 to 9812b94 Compare June 5, 2025 07:04

rui-mo reviewed Jun 5, 2025

test

a381223

wForget force-pushed the GLUTEN-9849 branch from 9812b94 to a381223 Compare June 5, 2025 10:01

test

3dd7f48

wForget commented Jun 5, 2025

gluten-ut/spark35/src/test/scala/org/apache/spark/sql/GlutenInjectRuntimeFilterSuite.scala (outdated, resolved)

fix

a1c7296

wForget marked this pull request as ready for review June 6, 2025 05:33

rui-mo requested a review from jinchengchenghh June 25, 2025 14:19

rui-mo approved these changes Jun 25, 2025

WangGuangxin approved these changes Jun 26, 2025

jinchengchenghh requested a review from zhztheplayer June 26, 2025 02:06

zhztheplayer requested changes Jun 26, 2025

zhztheplayer approved these changes Jun 26, 2025

add comment

a2ace5e

wForget commented Jun 26, 2025

wForget requested a review from zhztheplayer June 26, 2025 08:58

wForget mentioned this pull request Jun 27, 2025

[VL] Scan with might_contain on bloom_filter_agg partition filter may have performance regression #10071

Closed

wForget merged commit 903858d into apache:main Jun 30, 2025
50 checks passed

wForget mentioned this pull request Jun 30, 2025

[CORE] Remove bloom filter from partition filter since the BloomFilter result format is different with spark #6650

Closed

zhztheplayer mentioned this pull request Jul 22, 2025

[GLUTEN-9849][VL] Reenable native might_contain evaluation that was disabled in #9850 #10240

Merged

zhztheplayer added a commit that referenced this pull request Jul 24, 2025

[GLUTEN-9849][VL] Reenable native might_contain evaluation that was disabled in #9850 (#10240)

3b68715
