
[C++][Parquet] Would Arrow::FileReader support filter evaluating and optimize by BloomFilter #33683

Open
Dream-hu opened this issue Jan 16, 2023 · 16 comments
Labels
Component: Parquet Type: usage Issue is a user question

Comments

@Dream-hu

Describe the usage question you have. Please include as many useful details as possible.

Parquet C++ has implemented BloomFilter, but Arrow::FileReader (or anything else) never calls it during reading.
I am confused and want to figure out:
(1) Does Arrow::FileReader have a plan to support filter push-down, and if so, when?
(2) When and how will the BloomFilter be used by Arrow::FileReader?

Looking forward to your reply, thanks a lot!

Component(s)

Parquet

@Dream-hu Dream-hu added the Type: usage Issue is a user question label Jan 16, 2023
@mapleFU
Member

mapleFU commented Jan 17, 2023

Currently, Arrow's Parquet C++ implementation does not support reading / writing the BF.

Impala and parquet-mr support it; maybe you can take a look there.

@kou kou changed the title Would Arrow::FileReader support filter evaluating and optimize by BloomFilter [C++][Parquet] Would Arrow::FileReader support filter evaluating and optimize by BloomFilter Jan 18, 2023
@alippai
Contributor

alippai commented May 10, 2023

I'm a little bit lost here as well. I see that the Arrow 12.0.0 release notes contain your new BF contributions, but that doesn't mean pyarrow.parquet.read_table() would actually use them when checking for value equality. Also, pyarrow doesn't write them either, right?
Are there any issues / draft PRs open for tracking the BF support? Searching GitHub issues didn't come up with good results.

I'm not a fan of creating many issues ahead of time, but this is a feature with a specification (even if it's optional), so it's not likely the requirements or the ecosystem will change much in the future.

@westonpace
Member

Support for reading bloom filters from parquet files into memory was added in 12.0.0. There is an open issue for using this feature to do pushdown filtering here: #27277

The datasets feature was already doing some pushdown using the parquet file statistics. That issue asks to also use the bloom filter for pushdown filtering for datasets.
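The statistics-based pushdown mentioned above can be illustrated with a minimal stdlib-Python sketch (the row-group layout and helper name here are hypothetical, not Arrow's API): a row group whose [min, max] range cannot satisfy an equality predicate is skipped without reading its data.

```python
# Hypothetical sketch of min/max row-group pruning (not Arrow's actual API).
# Each "row group" carries min/max statistics for one column; a predicate
# `col == value` can only match groups whose range contains the value.

def prune_row_groups(row_groups, value):
    """Return indices of row groups that may contain `value`."""
    return [
        i for i, (lo, hi) in enumerate(row_groups)
        if lo <= value <= hi
    ]

groups = [(0, 99), (100, 199), (200, 299)]  # (min, max) per row group
print(prune_row_groups(groups, 150))  # -> [1]: only the second group can match
```

Bloom filters would add a second, finer pruning step on top of this for equality predicates that the min/max range cannot rule out.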

The parquet reader itself hasn't done pushdown in the past, but I'd be generally in favor of moving the pushdown filtering out of the datasets layer and into the file reader layer itself if someone was motivated to do the work. That would be more complex than just adding bloom filter filtering support to the datasets layer though because you'd have to figure out how to formulate filter expressions (you could add a dependency on arrow expressions but I'm not sure if that makes sense in the parquet layer).

@mapleFU
Member

mapleFU commented May 11, 2023

I'll finish the dataset scanner part this month if no one else is interested in it @westonpace

@alippai
Contributor

alippai commented May 11, 2023

Regarding the expressions: maybe it's overkill, but would using the filter subset of Substrait work?

@mapleFU
Member

mapleFU commented May 11, 2023

It all depends on the data distribution and the user's query. It could make a query faster, but in the worst case it may make the query slower.
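A toy bloom filter (stdlib Python; Parquet actually uses a split-block design) shows why the benefit is data-dependent: a negative answer lets the reader skip a row group entirely, but checking the filter always costs CPU, and false positives mean that cost is sometimes paid without saving any I/O.

```python
import hashlib

class ToyBloomFilter:
    """Simplified bloom filter for illustration; not Parquet's split-block filter."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive several bit positions from independent seeded hashes.
        for seed in range(self.num_hashes):
            h = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "maybe present".
        return all(self.bits & (1 << pos) for pos in self._positions(value))

bf = ToyBloomFilter()
for v in ["a", "b", "c"]:
    bf.add(v)
print(bf.might_contain("a"))   # True: inserted values always hit
print(bf.might_contain("zzz")) # usually False, but false positives are possible
```

If the query's values are almost always present in the data, every filter check returns "maybe present" and no row group is skipped, which is the slower worst case.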

@westonpace
Member

I'll finish the dataset scanner part this month if no one else is interested in it @westonpace

That would be great, thanks.

@westonpace
Member

Maybe it's overkill, but would using the filter subset of Substrait work?

That probably is overkill though it would work if someone had a desire. I believe bloom filters are only useful for equality / inequality. The statistics support comparison. So you probably just need =,!=,<,>,<=,>=. The simplest thing to do might be to do what we used to do for the old python datasets and accept disjunctive normal form:

Predicates are expressed using an Expression or using the disjunctive normal form (DNF), like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean logical combinations of single column predicates. The innermost tuples each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective and multiple column predicate. Finally, the most outer list combines these filters as a disjunction (OR).
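The DNF form quoted above can be sketched as a small stdlib-Python evaluator (illustrative only, not the pyarrow implementation): each inner list of `(column, op, value)` tuples is ANDed, and the outer list is ORed.

```python
import operator

# Map DNF operator strings to Python comparison functions.
OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       ">": operator.gt, "<=": operator.le, ">=": operator.ge}

def matches_dnf(row, dnf):
    """row: dict of column -> value; dnf: [[(col, op, value), ...], ...]."""
    return any(
        all(OPS[op](row[col], val) for col, op, val in conjunction)
        for conjunction in dnf
    )

# (x = 0 AND y > 5) OR (x = 1)
dnf = [[("x", "=", 0), ("y", ">", 5)], [("x", "=", 1)]]
print(matches_dnf({"x": 0, "y": 10}, dnf))  # True: first conjunction holds
print(matches_dnf({"x": 0, "y": 1}, dnf))   # False: neither conjunction holds
```

For pushdown, the reader would evaluate each conjunction against a row group's statistics (and, for `=`, its bloom filter) instead of against individual rows.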

@mapleFU
Member

mapleFU commented Jun 1, 2023

Sorry for the late reply, I've been a bit busy these days. I found that applying the bloom filter is not trivial: it might improve performance, and it might not. Should I add a use_bloom_filter option in ParquetFragmentScanOptions?

/// \brief Per-scan options for Parquet fragments
class ARROW_DS_EXPORT ParquetFragmentScanOptions : public FragmentScanOptions {
 public:
  ParquetFragmentScanOptions();
  std::string type_name() const override { return kParquetTypeName; }

  /// Reader properties. Not all properties are respected: memory_pool comes from
  /// ScanOptions.
  std::shared_ptr<parquet::ReaderProperties> reader_properties;
  /// Arrow reader properties. Not all properties are respected: batch_size comes from
  /// ScanOptions. Additionally, dictionary columns come from
  /// ParquetFileFormat::ReaderOptions::dict_columns.
  std::shared_ptr<parquet::ArrowReaderProperties> arrow_reader_properties;
};

@westonpace

@westonpace
Member

Sorry for the late reply, I've been a bit busy these days. I found that applying the bloom filter is not trivial: it might improve performance, and it might not. Should I add a use_bloom_filter option in ParquetFragmentScanOptions?

Yes, that would be a good place for it. We would want a comment that provides users with enough information to help make the correct choice. For example "This feature allows parquet bloom filters to be used to reduce the amount of data that needs to be read from the disk. However, applying these filters can be expensive and, if the filter is not very selective, may cost more CPU time than they save." (I don't know if that is the actual reason, feel free to modify as appropriate based on your testing)

@mapleFU
Member

mapleFU commented Jun 23, 2023

I have two problems with this:

First, I found it a bit hard to implement using the current framework. Our input is an expression tree, and the bloom filter can transform:

  • ("eq", field_ref, literal)
  • ("is_in", field_ref, literals...)

into a "false" in the row-group expressions. So I need to:

  1. Match these expressions.
  2. If I find that we can turn them into false, transform them.

Which expression-handling method can I use for this?

Second, ParquetFileFragment::Subset should ensure the metadata is already loaded, so maybe I should add an option and load all bloom filters into memory. I guess that would be time-consuming. Should I load them all up front, or load them on demand?

@westonpace .

@mapleFU
Member

mapleFU commented Jun 23, 2023

I've fixed (1) with ModifyExpression, but (2) remains a problem

@westonpace
Member

Sorry for the slow response. For (2) I think you will want to do something similar to SimplifyWithGuarantee:

Result<Expression> SimplifyWithGuarantee(Expression expr,

I don't think you can use SimplifyWithGuarantee directly because the bloom filter is a weird sort of guarantee. So I think you will need a new method. Specifically for (2) I think you want to:

  • Replace the is_in/eq in the AST with literal(false).
  • Run Canonicalize
  • Run FoldConstants
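The steps above can be sketched with a toy expression tree in stdlib Python (illustrative only; Arrow's Expression, Canonicalize, and FoldConstants are C++ APIs with different shapes): predicates the bloom filter proves cannot match are replaced with a literal False, and constant folding then collapses the tree.

```python
# Toy AST: ("and", a, b), ("or", a, b), ("eq", field, literal), or a bool literal.

def replace_pruned(expr, pruned_fields):
    """Replace eq-predicates on bloom-filter-pruned fields with literal False."""
    if isinstance(expr, bool):
        return expr
    op, *args = expr
    if op == "eq":
        field, _ = args
        return False if field in pruned_fields else expr
    return (op,) + tuple(replace_pruned(a, pruned_fields) for a in args)

def fold_constants(expr):
    """Fold boolean constants in and/or nodes, mimicking a FoldConstants pass."""
    if isinstance(expr, bool):
        return expr
    op, *args = expr
    if op in ("and", "or"):
        folded = [fold_constants(a) for a in args]
        if op == "and":
            if False in folded:
                return False          # AND with False is False
            rest = [a for a in folded if a is not True]
        else:
            if True in folded:
                return True           # OR with True is True
            rest = [a for a in folded if a is not False]
        if not rest:
            return op == "and"        # empty AND -> True, empty OR -> False
        return rest[0] if len(rest) == 1 else (op,) + tuple(rest)
    return expr

# (x = 1 AND y = 2): the bloom filter says no row in this group has x == 1.
expr = ("and", ("eq", "x", 1), ("eq", "y", 2))
print(fold_constants(replace_pruned(expr, {"x"})))  # False: row group skipped
```

When the whole row-group expression folds to False, the scanner can skip that row group; otherwise the residual expression is still evaluated against the rows as before.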

@mapleFU
Member

mapleFU commented Jun 29, 2023

Got it, I'll give it a try.

@arthurpassos

I see this is still open and I have the same question as the original one: can bloom filters be utilized through the Arrow read API?

@mapleFU

@mapleFU
Member

mapleFU commented Apr 9, 2024

Generally it's not implemented yet. I'll pick that up again after https://github.com/apache/arrow/pull/37400/files is merged.
