[C++][Parquet] Would Arrow::FileReader support filter evaluating and optimize by BloomFilter #33683
Comments
Currently, Arrow Parquet for C++ does not support reading or writing bloom filters (BF). Impala and parquet-mr support them, so you may want to take a look there.
I'm a little bit lost here as well. I see that the Parquet 12.0.0 release notes contain your new BF contributions, but that doesn't mean if checking for value equality, the I'm not a fan of creating many issues ahead of time, but this is a feature with a specification (even if it's optional), so it's not likely that the requirements or the ecosystem will change much in the future.
Support for reading bloom filters from Parquet files into memory was added in 12.0.0. There is an open issue for using this feature to do pushdown filtering here: #27277

The datasets feature was already doing some pushdown using the Parquet file statistics. That issue asks to also use the bloom filter for pushdown filtering for datasets.

The Parquet reader itself hasn't done pushdown in the past, but I'd be generally in favor of moving the pushdown filtering out of the datasets layer and into the file reader layer itself if someone was motivated to do the work. That would be more complex than just adding bloom filter support to the datasets layer, though, because you'd have to figure out how to formulate filter expressions (you could add a dependency on Arrow expressions, but I'm not sure that makes sense in the Parquet layer).
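To make the pushdown idea above concrete, here is a minimal, self-contained sketch of row-group pruning for a predicate of the form `column == value`: first by min/max statistics, then by a bloom filter. Everything in it is an assumption for illustration. `ToyBloomFilter`, `RowGroupInfo`, and `RowGroupsToRead` are hypothetical names, the hash is deliberately simplistic, and none of this is the Parquet split-block bloom filter format or an Arrow API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy bloom filter for illustration only (hypothetical; NOT the Parquet
// split-block bloom filter and not an Arrow/Parquet class).
class ToyBloomFilter {
 public:
  explicit ToyBloomFilter(std::size_t num_bits) : bits_(num_bits, false) {}

  void Insert(int64_t value) {
    for (std::size_t seed = 0; seed < kNumHashes; ++seed) {
      bits_[Index(value, seed)] = true;
    }
  }

  // false => the value is definitely absent; true => it may be present.
  bool MightContain(int64_t value) const {
    for (std::size_t seed = 0; seed < kNumHashes; ++seed) {
      if (!bits_[Index(value, seed)]) return false;
    }
    return true;
  }

 private:
  static constexpr std::size_t kNumHashes = 3;

  // Deliberately simple hash family; a real filter would use a strong hash.
  std::size_t Index(int64_t value, std::size_t seed) const {
    const uint64_t v = static_cast<uint64_t>(value);
    return static_cast<std::size_t>((v * (seed + 1) + seed) % bits_.size());
  }

  std::vector<bool> bits_;
};

// Hypothetical per-row-group metadata for a single int64 column.
struct RowGroupInfo {
  int64_t min_value;
  int64_t max_value;
  ToyBloomFilter bloom;
};

// Returns the indices of row groups that still have to be read to answer
// `column == value`. Statistics are checked first because they are cheap;
// the bloom filter is consulted only if the statistics cannot rule the
// row group out.
std::vector<std::size_t> RowGroupsToRead(const std::vector<RowGroupInfo>& groups,
                                         int64_t value) {
  std::vector<std::size_t> keep;
  for (std::size_t i = 0; i < groups.size(); ++i) {
    const RowGroupInfo& g = groups[i];
    if (value < g.min_value || value > g.max_value) continue;  // pruned by stats
    if (!g.bloom.MightContain(value)) continue;                // pruned by bloom
    keep.push_back(i);
  }
  return keep;
}
```

A bloom filter can only answer "definitely absent" or "maybe present", so a row group that survives both checks may still contain no matching rows; the sketch only reduces what has to be read.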
I'll finish the dataset scanner part this month if no one else is interested in it, @westonpace.
Regarding the expressions: maybe it's overkill, but would using the filter subset of Substrait work?
It all depends on the data distribution and the user's query. It may make the query faster; in the worst case it may make the query slower.
That would be great, thanks.
That probably is overkill, though it would work if someone had the desire. I believe bloom filters are only useful for equality / inequality. The statistics support comparison. So you probably just need =, !=, <, >, <=, >=. The simplest thing to do might be to do what we used to do for the old Python datasets and accept disjunctive normal form.
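As an illustration of the disjunctive normal form idea, a filter can be modeled as an OR of AND-groups of simple comparisons, and each comparison can be checked against a row group's min/max statistics. The sketch below is a standalone toy under stated assumptions: `Predicate`, `DnfFilter`, `ColumnStats`, and `RowGroupMightMatch` are hypothetical names, only `int64_t` columns are handled, and nulls are ignored. It is not the API of the old Python datasets or of Arrow C++.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// The six comparison operators mentioned above.
enum class Op { kEq, kNe, kLt, kGt, kLe, kGe };

// One comparison of a column against a constant (hypothetical type).
struct Predicate {
  std::string column;
  Op op;
  int64_t value;
};

using Conjunction = std::vector<Predicate>;  // predicates joined by AND
using DnfFilter = std::vector<Conjunction>;  // conjunctions joined by OR

// Hypothetical min/max statistics for one column of one row group.
struct ColumnStats {
  int64_t min_value;
  int64_t max_value;
};
using RowGroupStats = std::unordered_map<std::string, ColumnStats>;

// Returns true if some value in [min_value, max_value] could satisfy `p`.
bool MightMatch(const Predicate& p, const ColumnStats& s) {
  switch (p.op) {
    case Op::kEq: return p.value >= s.min_value && p.value <= s.max_value;
    case Op::kNe: return !(s.min_value == s.max_value && s.min_value == p.value);
    case Op::kLt: return s.min_value < p.value;
    case Op::kGt: return s.max_value > p.value;
    case Op::kLe: return s.min_value <= p.value;
    case Op::kGe: return s.max_value >= p.value;
  }
  return true;  // unknown operator: be conservative and do not prune
}

// Returns false only when every conjunction is provably unsatisfiable for
// this row group, i.e. the row group can be skipped.
bool RowGroupMightMatch(const DnfFilter& filter, const RowGroupStats& stats) {
  if (filter.empty()) return true;  // no filter: read everything
  for (const Conjunction& conjunction : filter) {
    bool all_might_match = true;
    for (const Predicate& p : conjunction) {
      auto it = stats.find(p.column);
      if (it == stats.end()) continue;  // no statistics: cannot prune on this
      if (!MightMatch(p, it->second)) {
        all_might_match = false;
        break;
      }
    }
    if (all_might_match) return true;
  }
  return false;
}
```

In such a design, a bloom filter check would only tighten the `Op::kEq` case, since a bloom filter cannot answer range questions.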
Sorry for the late reply; I've been a bit busy these days. I found a problem: using the bloom filter is not trivial. It might improve performance, or it might not. Should I add an option in:

/// \brief Per-scan options for Parquet fragments
class ARROW_DS_EXPORT ParquetFragmentScanOptions : public FragmentScanOptions {
 public:
  ParquetFragmentScanOptions();
  std::string type_name() const override { return kParquetTypeName; }

  /// Reader properties. Not all properties are respected: memory_pool comes from
  /// ScanOptions.
  std::shared_ptr<parquet::ReaderProperties> reader_properties;
  /// Arrow reader properties. Not all properties are respected: batch_size comes from
  /// ScanOptions. Additionally, dictionary columns come from
  /// ParquetFileFormat::ReaderOptions::dict_columns.
  std::shared_ptr<parquet::ArrowReaderProperties> arrow_reader_properties;
};
Yes, that would be a good place for it. We would want a comment that provides users with enough information to help make the correct choice. For example: "This feature allows parquet bloom filters to be used to reduce the amount of data that needs to be read from the disk. However, applying these filters can be expensive and, if the filter is not very selective, may cost more CPU time than they save." (I don't know if that is the actual reason; feel free to modify as appropriate based on your testing.)
I have two problems with this. First, I found it a bit hard to implement using the current framework. Our input is an expression tree, and the bloom filter can transform:
To a "false" in the row-group expressions. So I need to:
Can you help me figure out which expression-handling method I can use? Second,
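The general shape of the first problem, rewriting an equality leaf to "false" when a row group's bloom filter rules the value out and then folding that constant through AND/OR, can be sketched as follows. This is a toy under stated assumptions: `Expr`, `SimplifyWithBloomFilter`, and the `MightContainFn` callback are hypothetical names invented for this illustration, only `int64_t` equality is modeled, and the code does not use or mirror `arrow::compute::Expression`.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <string>
#include <vector>

struct Expr;
using ExprPtr = std::shared_ptr<const Expr>;

// Minimal expression tree (hypothetical; not arrow::compute::Expression).
struct Expr {
  enum class Kind { kLiteral, kEquals, kAnd, kOr };
  Kind kind = Kind::kLiteral;
  bool literal = false;           // used by kLiteral
  std::string column;             // used by kEquals
  int64_t value = 0;              // used by kEquals
  std::vector<ExprPtr> children;  // used by kAnd / kOr
};

ExprPtr Literal(bool b) {
  auto e = std::make_shared<Expr>();
  e->kind = Expr::Kind::kLiteral;
  e->literal = b;
  return e;
}

ExprPtr Equals(const std::string& column, int64_t value) {
  auto e = std::make_shared<Expr>();
  e->kind = Expr::Kind::kEquals;
  e->column = column;
  e->value = value;
  return e;
}

ExprPtr Nary(Expr::Kind kind, std::vector<ExprPtr> children) {
  auto e = std::make_shared<Expr>();
  e->kind = kind;
  e->children = std::move(children);
  return e;
}

// Hypothetical callback wrapping one row group's bloom filters:
// returns false only if `value` is definitely absent from `column`.
using MightContainFn = std::function<bool(const std::string&, int64_t)>;

// Replaces equality leaves the bloom filter rules out with literal false,
// then folds literals through AND / OR.
ExprPtr SimplifyWithBloomFilter(const ExprPtr& expr,
                                const MightContainFn& might_contain) {
  switch (expr->kind) {
    case Expr::Kind::kLiteral:
      return expr;
    case Expr::Kind::kEquals:
      return might_contain(expr->column, expr->value) ? expr : Literal(false);
    case Expr::Kind::kAnd:
    case Expr::Kind::kOr: {
      const bool is_and = expr->kind == Expr::Kind::kAnd;
      std::vector<ExprPtr> kept;
      for (const ExprPtr& child : expr->children) {
        ExprPtr s = SimplifyWithBloomFilter(child, might_contain);
        if (s->kind == Expr::Kind::kLiteral) {
          // false inside AND, or true inside OR, decides the whole node.
          if (s->literal != is_and) return s;
          continue;  // true inside AND, or false inside OR, is neutral
        }
        kept.push_back(s);
      }
      if (kept.empty()) return Literal(is_and);  // only neutral children left
      if (kept.size() == 1) return kept[0];
      return Nary(expr->kind, std::move(kept));
    }
  }
  return expr;
}
```

If the simplified expression for a row group is the literal false, that row group can be skipped; any other result means it still has to be read and filtered normally.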
I've fixed (1) with |
Sorry for the slow response. For (2), I think you will want to do something similar to arrow/cpp/src/arrow/compute/expression.cc, line 1332 in 9736dde.
I don't think you can use
|
Got it, I'll give it a try.
I see this is still open and I have the same original question: can bloom filters be utilized through the Arrow read API?
Generally, this is not implemented yet. I'll pick it up again after https://github.com/apache/arrow/pull/37400/files is merged.
Describe the usage question you have. Please include as many useful details as possible.
Parquet C++ has implemented BloomFilter; however, neither Arrow FileReader nor anything else ever calls it during reading.
I am confused and want to figure out:
(1) Does Arrow::FileReader plan to support filter pushdown, and if so, when?
(2) When and how will BloomFilter be used by Arrow::FileReader?
Looking forward to your reply, thanks a lot!
Component(s)
Parquet