Incorrect results: Bloom filters on UInt8, Int8, UInt16 and Int16 columns always return false negatives #9779

Open
progval opened this issue Mar 24, 2024 · 6 comments
Labels
bug Something isn't working waiting-on-upstream PR is waiting on an upstream dependency to be updated

Comments

@progval
Contributor

progval commented Mar 24, 2024

Describe the bug

Bloom filters on the column types named in the title filter out every value, including values that are actually present.

To Reproduce

#9778 demonstrates this by passing correct_bloom_filters: false as a macro "parameter".

Expected behavior

No response

Additional context

No response

@progval
Contributor Author

progval commented Mar 24, 2024

I just reproduced the bug in the parquet crate, so this isn't an issue in DataFusion itself: apache/arrow-rs#5550
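
The thread does not spell out the root cause. One plausible mechanism, stated here as an assumption rather than something confirmed in this issue, is a width mismatch: the filter is populated from the 32-bit physical representation of each value, while the lookup hashes the 1- or 2-byte logical value. A minimal, self-contained sketch of that failure mode with a toy filter (ToyBloom and every helper name below are hypothetical, not the parquet crate's API):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

const NUM_BITS: u64 = 1 << 16;
const NUM_PROBES: u64 = 3;

/// Toy Bloom filter over raw bytes. Illustrative only; this is NOT the
/// split-block filter used by the parquet crate.
pub struct ToyBloom {
    bits: Vec<u64>,
}

impl ToyBloom {
    pub fn new() -> Self {
        ToyBloom { bits: vec![0; (NUM_BITS / 64) as usize] }
    }

    /// Bit positions probed for a byte string, derived from one 64-bit hash.
    fn positions(bytes: &[u8]) -> impl Iterator<Item = u64> {
        let mut hasher = DefaultHasher::new();
        hasher.write(bytes);
        let h = hasher.finish();
        let (h1, h2) = (h & 0xffff_ffff, (h >> 32) | 1);
        (0..NUM_PROBES).map(move |i| h1.wrapping_add(i.wrapping_mul(h2)) % NUM_BITS)
    }

    pub fn insert(&mut self, bytes: &[u8]) {
        for p in Self::positions(bytes) {
            self.bits[(p / 64) as usize] |= 1u64 << (p % 64);
        }
    }

    pub fn might_contain(&self, bytes: &[u8]) -> bool {
        Self::positions(bytes)
            .all(|p| self.bits[(p / 64) as usize] & (1u64 << (p % 64)) != 0)
    }
}

/// Writer side (assumption): the value is widened to 32 bits before hashing.
pub fn insert_as_physical_i32(filter: &mut ToyBloom, v: i8) {
    filter.insert(&i32::from(v).to_le_bytes());
}

/// Mismatched reader (assumption): hashes the 1-byte logical value instead.
pub fn check_narrow(filter: &ToyBloom, v: i8) -> bool {
    filter.might_contain(&v.to_le_bytes())
}

/// Consistent reader: widens first, so the same 4 bytes are hashed.
pub fn check_widened(filter: &ToyBloom, v: i8) -> bool {
    filter.might_contain(&i32::from(v).to_le_bytes())
}
```

For values inserted through insert_as_physical_i32, check_widened always reports them as possibly present, while check_narrow hashes different bytes and so almost always reports them as absent. That is the false-negative behaviour described in this issue, under the stated assumption.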

@alamb alamb changed the title Bloom filters on Int8 and Int16 columns always return false negatives Bloom filters on UInt8, Int8, UInt16 and Int16 columns always return false negatives Apr 1, 2024
@alamb
Contributor

alamb commented Apr 1, 2024

It turns out that #9770 demonstrates that the unsigned variants are incorrect as well, so I updated the title of this ticket.

@alamb alamb added the waiting-on-upstream PR is waiting on an upstream dependency to be updated label Apr 1, 2024
@progval
Contributor Author

progval commented Apr 1, 2024

Not exactly: as long as #9770 is not merged, bloom filters are not used on UInt8 and UInt16.

Now that I say it, I realize that I should probably amend that PR (and the existing code) to disable bloom filters entirely on these types, so that DataFusion is slow instead of incorrect.
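
A rough sketch of what "slow instead of incorrect" could look like: a guard that refuses to prune on a Bloom filter answer for the affected types, so the data is always scanned. The enum and function names below are hypothetical and are not DataFusion's actual code.

```rust
/// Hypothetical column types; stands in for the real Arrow data types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnType {
    Int8,
    Int16,
    UInt8,
    UInt16,
    Int32,
    Int64,
    Utf8,
}

/// Whether a Bloom filter answer for this type is trusted.
/// The four small integer types from this issue are excluded.
pub fn bloom_filter_trusted(ty: ColumnType) -> bool {
    !matches!(
        ty,
        ColumnType::Int8 | ColumnType::Int16 | ColumnType::UInt8 | ColumnType::UInt16
    )
}

/// Decide whether a chunk of data may be skipped.
/// `filter_says_absent` is the (possibly wrong) answer from the Bloom filter.
/// For untrusted types this always returns false: the chunk is scanned,
/// which costs time but cannot drop matching rows.
pub fn can_prune(ty: ColumnType, filter_says_absent: bool) -> bool {
    bloom_filter_trusted(ty) && filter_says_absent
}
```

With this guard, can_prune(ColumnType::Int8, true) is false even though the filter claims the value is absent, whereas can_prune(ColumnType::Int32, true) still allows pruning.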

@alamb alamb changed the title Bloom filters on UInt8, Int8, UInt16 and Int16 columns always return false negatives Incorrect results: Bloom filters on UInt8, Int8, UInt16 and Int16 columns always return false negatives Apr 2, 2024
@alamb
Contributor

alamb commented Apr 2, 2024

It seems like #9770 was just merged.

I filed #9914 to fix the problem by disabling this feature

@alamb
Contributor

alamb commented Apr 7, 2024

@edmondop disabled this code in #9969 🙏

@alamb
Contributor

alamb commented Apr 18, 2024

This issue came up in the context of the 37.1.0 release (#9904), and I wanted to cross-post here.

Specifically, versions 34.0.0 through 37.0.0 have a bug where int8/int16 bloom filters can incorrectly filter out correct answers.

The int8/int16 bloom filter support was added in #7821 and shipped as part of https://github.com/apache/arrow-datafusion/blob/main/dev/changelog/33.0.0.md

We have disabled using bloom filters for int8/int16 columns as of DataFusion 38.0.0 (until we fix the underlying issue).
