Skip to content

Conversation

@wForget
Copy link
Member

@wForget wForget commented Jun 3, 2025

What changes were proposed in this pull request?

Exclude VeloxBloomFilterMightContain from FileSourceScan parition filters

Fixes: #9849

How was this patch tested?

added unit test

@github-actions github-actions bot added CORE works for Gluten Core VELOX labels Jun 3, 2025
@github-actions
Copy link

github-actions bot commented Jun 3, 2025

#9849

@github-actions
Copy link

github-actions bot commented Jun 3, 2025

Run Gluten ClickHouse CI on ARM

@github-actions
Copy link

github-actions bot commented Jun 3, 2025

Run Gluten ClickHouse CI on ARM

@wForget wForget marked this pull request as ready for review June 3, 2025 10:07
@FelixYBW FelixYBW requested a review from rui-mo June 3, 2025 23:03
@wForget wForget marked this pull request as draft June 5, 2025 04:57
@github-actions
Copy link

github-actions bot commented Jun 5, 2025

Run Gluten ClickHouse CI on ARM

@github-actions
Copy link

github-actions bot commented Jun 5, 2025

Run Gluten ClickHouse CI on ARM

scanExec.output,
scanExec.requiredSchema,
scanExec.partitionFilters,
partitionFilters,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to consider this issue also for the BatchScanTransformer? Thanks.

Copy link
Member Author

@wForget wForget Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to consider this issue also for the BatchScanTransformer? Thanks.

I will try to add a unit test to cover this case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to consider this issue also for the BatchScanTransformer? Thanks.

There are no partition filters in BatchScanExecTransformer, and runtime filters will also be converted to source v2 predicates, so there is no similar issue for ``BatchScanExecTransformer`.

@github-actions
Copy link

github-actions bot commented Jun 5, 2025

Run Gluten ClickHouse CI on ARM

@github-actions
Copy link

github-actions bot commented Jun 5, 2025

Run Gluten ClickHouse CI on ARM

@wForget
Copy link
Member Author

wForget commented Jun 5, 2025

Duplicate of #6650 , cc @WangGuangxin @zhztheplayer Could you please take a look?

@github-actions
Copy link

github-actions bot commented Jun 6, 2025

Run Gluten ClickHouse CI on ARM

@wForget wForget marked this pull request as ready for review June 6, 2025 05:33
@zhouyuan
Copy link
Contributor

Cc @WangGuangxin as he made a fix on similar issue before #6652

@rui-mo rui-mo requested a review from jinchengchenghh June 25, 2025 14:19
Copy link
Member

@zhztheplayer zhztheplayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Give me a couple of minutes to check this. Thanks!

Copy link
Member

@zhztheplayer zhztheplayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't had a simple solution finalized so feel free to proceed.

@wForget If we exclude the might_contain from partition filters, we may lose the advantage brought by Spark bloom filters. No?

@wForget
Copy link
Member Author

wForget commented Jun 26, 2025

@wForget If we exclude the might_contain from partition filters, we may lose the advantage brought by Spark bloom filters. No?

Yes, this will indeed cause performance regression, but the bloomFilterData that has been constructed by VeloxBloomFilter cannot be used in the driver unless we use jni to call velox in the driver.

@jinchengchenghh
Copy link
Contributor

I would suggest to implement spark native bloom filter in velox, then we can remove this rule.

@zhztheplayer
Copy link
Member

zhztheplayer commented Jun 26, 2025

I would suggest to implement spark native bloom filter in velox, then we can remove this rule.

It's a good idea while the challenge is we need to maintain the consistency among Spark versions and Velox. I.e., once Spark updates its bloom filter algorithm, we have to add a copy of that algorithm to Velox. Maybe we can first leave an abstraction layer in the C++ implementation for different Spark versions.

@github-actions
Copy link

Run Gluten ClickHouse CI on ARM

* Spark so produces different intermediate aggregate data. Thus we use different filter function /
* agg function types for Velox's version to distinguish from vanilla Spark's implementation.
*
* FIXME: Remove GlutenTaskOnlyExpression after the VeloxBloomFilter expr is made compatible with
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhztheplayer @jinchengchenghh I added a comment to record this improvement, how about we merge this PR first?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem from my end. cc @jinchengchenghh

And can we log an issue highlighting the performance gap because of the unavailability of scan + might_contain?

Also I guess vanilla Spark's scan + Velox might_contain will cause the issue as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I guess vanilla Spark's scan + Velox might_contain will cause the issue as well?

In what scenarios does a vanilla spark scan with native expressions occur? Do we have an existing test case?

Copy link
Member

@zhztheplayer zhztheplayer Jun 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant when velox_might_contain is included in a vanilla Spark scan node, which for whatever reason was fallen back by Gluten. I was just guessing whether the same issue will happen in the case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant when velox_might_contain is included in a vanilla Spark scan node, which for whatever reason was fallen back by Gluten. I was just guessing whether the same issue will happen in the case.

Oh, I see.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And can we log an issue highlighting the performance gap because of the unavailability of scan + might_contain?

Filed an issue #10071

@wForget wForget requested a review from zhztheplayer June 26, 2025 08:58
@wForget
Copy link
Member Author

wForget commented Jun 26, 2025

By the way, the current scenario may be uncommon. PartitionPruning rule is applied before InjectRuntimeFilter, and InjectRuntimeFilter will check whether a DPP filter exists.

@wForget
Copy link
Member Author

wForget commented Jun 27, 2025

@zhztheplayer @jinchengchenghh @WangGuangxin @zhouyuan Thank you for your review, I will merge this pr later if there is no further discussion.

@wForget wForget merged commit 903858d into apache:main Jun 30, 2025
50 checks passed
@wForget
Copy link
Member Author

wForget commented Jun 30, 2025

Thanks, merged to main

@zhztheplayer
Copy link
Member

I would suggest to implement spark native bloom filter in velox, then we can remove this rule.

@jinchengchenghh The Spark community just accepted a V2 bloom filter implementation apache/spark#50933.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CORE works for Gluten Core VELOX

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[VL] VeloxBloomFilterMightContain should not be applied to FileSourceScan partition filters.

6 participants