
[C++][Parquet] Would Arrow::FileReader support filter evaluating and optimize by BloomFilter #33683

Open
Dream-hu opened this issue Jan 16, 2023 · 16 comments
Labels
Component: Parquet Type: usage Issue is a user question

Comments

@Dream-hu

Describe the usage question you have. Please include as many useful details as possible.

Parquet C++ has implemented BloomFilter, but Arrow::FileReader (or anything else) never calls it during reading.
I am confused and want to figure out:
(1) Does Arrow::FileReader have a plan to support filter push-down, and if so, when?
(2) When and how will the BloomFilter be used by Arrow::FileReader?

Looking forward to your reply, thanks a lot!

Component(s)

Parquet

@Dream-hu Dream-hu added the Type: usage Issue is a user question label Jan 16, 2023
@mapleFU
Member

mapleFU commented Jan 17, 2023

Currently, Arrow's Parquet C++ implementation does not support reading / writing the BF.

Impala and parquet-mr support it; maybe you can take a look there.

@kou kou changed the title Would Arrow::FileReader support filter evaluating and optimize by BloomFilter [C++][Parquet] Would Arrow::FileReader support filter evaluating and optimize by BloomFilter Jan 18, 2023
@alippai
Contributor

alippai commented May 10, 2023

I'm a little bit lost here as well. I see that the Arrow 12.0.0 release notes contain your new BF contributions, but that doesn't mean pyarrow.parquet.read_table() would actually use them when checking for value equality. Also, pyarrow doesn't write them either, right?
Are there any issues / draft PRs open for tracking the BF support? Searching GitHub issues didn't come up with good results.

I'm not a fan of creating many issues ahead of time, but this is a feature with a specification (even if it's optional), so it's not likely the requirements or the ecosystem will change much in the future.

@westonpace
Member

Support for reading bloom filters from parquet files into memory was added in 12.0.0. There is an open issue for using this feature to do pushdown filtering here: #27277

The datasets feature was already doing some pushdown using the parquet file statistics. That issue asks to also use the bloom filter for pushdown filtering for datasets.
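The statistics-based pushdown mentioned above can be illustrated with a minimal stdlib-Python sketch (the row-group layout and helper name here are hypothetical, not Arrow's API): a row group whose [min, max] range cannot satisfy an equality predicate is skipped without reading its data.

```python
# Hypothetical sketch of min/max row-group pruning (not Arrow's actual API).
# Each "row group" carries min/max statistics for one column; a predicate
# `col == value` can only match groups whose range contains the value.

def prune_row_groups(row_groups, value):
    """Return indices of row groups that may contain `value`."""
    return [
        i for i, (lo, hi) in enumerate(row_groups)
        if lo <= value <= hi
    ]

groups = [(0, 99), (100, 199), (200, 299)]  # (min, max) per row group
print(prune_row_groups(groups, 150))  # -> [1]: only the second group can match
```

Bloom filters would add a second, finer pruning step on top of this for equality predicates that the min/max range cannot rule out.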

The parquet reader itself hasn't done pushdown in the past, but I'd be generally in favor of moving the pushdown filtering out of the datasets layer and into the file reader layer itself if someone was motivated to do the work. That would be more complex than just adding bloom filter filtering support to the datasets layer though because you'd have to figure out how to formulate filter expressions (you could add a dependency on arrow expressions but I'm not sure if that makes sense in the parquet layer).

@mapleFU
Member

mapleFU commented May 11, 2023

I'll finish the dataset scanner part this month if no one else is interested in it @westonpace

@alippai
Contributor

alippai commented May 11, 2023

Regarding the expressions: maybe it's overkill, but would using the filter subset of Substrait work?

@mapleFU
Member

mapleFU commented May 11, 2023

It all depends on the data distribution and the user's query. It could make a query faster, but in the worst case it may make the query slower.
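A toy bloom filter (stdlib Python; Parquet actually uses a split-block design) shows why the benefit is data-dependent: a negative answer lets the reader skip a row group entirely, but checking the filter always costs CPU, and false positives mean that cost is sometimes paid without saving any I/O.

```python
import hashlib

class ToyBloomFilter:
    """Simplified bloom filter for illustration; not Parquet's split-block filter."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive several bit positions from independent seeded hashes.
        for seed in range(self.num_hashes):
            h = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "maybe present".
        return all(self.bits & (1 << pos) for pos in self._positions(value))

bf = ToyBloomFilter()
for v in ["a", "b", "c"]:
    bf.add(v)
print(bf.might_contain("a"))   # True: inserted values always hit
print(bf.might_contain("zzz")) # usually False, but false positives are possible
```

If the query's values are almost always present in the data, every filter check returns "maybe present" and no row group is skipped, which is the slower worst case.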

@westonpace
Member

I'll finish the dataset scanner part this month if no one else is interested in it @westonpace

That would be great, thanks.

@westonpace
Member

Maybe it's overkill, but would using the filter subset of Substrait work?

That probably is overkill though it would work if someone had a desire. I believe bloom filters are only useful for equality / inequality. The statistics support comparison. So you probably just need =,!=,<,>,<=,>=. The simplest thing to do might be to do what we used to do for the old python datasets and accept disjunctive normal form:

Predicates are expressed using an Expression or using the disjunctive normal form (DNF), like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean logical combinations of single column predicates. The innermost tuples each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective and multiple column predicate. Finally, the most outer list combines these filters as a disjunction (OR).
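The DNF form quoted above can be sketched as a small stdlib-Python evaluator (illustrative only, not the pyarrow implementation): each inner list of `(column, op, value)` tuples is ANDed, and the outer list is ORed.

```python
import operator

# Map DNF operator strings to Python comparison functions.
OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       ">": operator.gt, "<=": operator.le, ">=": operator.ge}

def matches_dnf(row, dnf):
    """row: dict of column -> value; dnf: [[(col, op, value), ...], ...]."""
    return any(
        all(OPS[op](row[col], val) for col, op, val in conjunction)
        for conjunction in dnf
    )

# (x = 0 AND y > 5) OR (x = 1)
dnf = [[("x", "=", 0), ("y", ">", 5)], [("x", "=", 1)]]
print(matches_dnf({"x": 0, "y": 10}, dnf))  # True: first conjunction holds
print(matches_dnf({"x": 0, "y": 1}, dnf))   # False: neither conjunction holds
```

For pushdown, the reader would evaluate each conjunction against a row group's statistics (and, for `=`, its bloom filter) instead of against individual rows.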

@mapleFU
Member

mapleFU commented Jun 1, 2023

Sorry for the late reply, I've been a bit busy these days. I found that applying the bloom filter is not trivial: it might improve performance, and it might not. Should I add a use_bloom_filter option in ParquetFragmentScanOptions?

/// \brief Per-scan options for Parquet fragments
class ARROW_DS_EXPORT ParquetFragmentScanOptions : public FragmentScanOptions {
 public:
  ParquetFragmentScanOptions();
  std::string type_name() const override { return kParquetTypeName; }

  /// Reader properties. Not all properties are respected: memory_pool comes from
  /// ScanOptions.
  std::shared_ptr<parquet::ReaderProperties> reader_properties;
  /// Arrow reader properties. Not all properties are respected: batch_size comes from
  /// ScanOptions. Additionally, dictionary columns come from
  /// ParquetFileFormat::ReaderOptions::dict_columns.
  std::shared_ptr<parquet::ArrowReaderProperties> arrow_reader_properties;
};

@westonpace

@westonpace
Member

Sorry for the late reply, I've been a bit busy these days. I found that applying the bloom filter is not trivial: it might improve performance, and it might not. Should I add a use_bloom_filter option in ParquetFragmentScanOptions?

Yes, that would be a good place for it. We would want a comment that provides users with enough information to help make the correct choice. For example "This feature allows parquet bloom filters to be used to reduce the amount of data that needs to be read from the disk. However, applying these filters can be expensive and, if the filter is not very selective, may cost more CPU time than they save." (I don't know if that is the actual reason, feel free to modify as appropriate based on your testing)

@mapleFU
Member

mapleFU commented Jun 23, 2023

I have two problems with this:

First, I found it a bit hard to implement using the current framework. Our input is an expression tree, and the bloom filter can transform:

  • ("eq", field_ref, literal)
  • ("is_in", field_ref, literals...)

into a "false" in the row-group expressions. So I need to:

  1. Match these expressions.
  2. If I find that we can turn them into false, transform them.

Which expression-handling method can I use for this?

Second, ParquetFileFragment::Subset should ensure the metadata is already loaded, so maybe I should add an option and load all bloom filters into memory. I guess that would be time-consuming. Should I load them all up front, or load them on demand?

@westonpace .

@mapleFU
Member

mapleFU commented Jun 23, 2023

I've fixed (1) with ModifyExpression, but (2) remains a problem

@westonpace
Member

Sorry for the slow response. For (2) I think you will want to do something similar to SimplifyWithGuarantee:

Result<Expression> SimplifyWithGuarantee(Expression expr,

I don't think you can use SimplifyWithGuarantee directly because the bloom filter is a weird sort of guarantee. So I think you will need a new method. Specifically for (2) I think you want to:

  • Replace the is_in/eq in the AST with literal(false).
  • Run Canonicalize
  • Run FoldConstants
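The steps above can be sketched with a toy expression tree in stdlib Python (illustrative only; Arrow's Expression, Canonicalize, and FoldConstants are C++ APIs with different shapes): predicates the bloom filter proves cannot match are replaced with a literal False, and constant folding then collapses the tree.

```python
# Toy AST: ("and", a, b), ("or", a, b), ("eq", field, literal), or a bool literal.

def replace_pruned(expr, pruned_fields):
    """Replace eq-predicates on bloom-filter-pruned fields with literal False."""
    if isinstance(expr, bool):
        return expr
    op, *args = expr
    if op == "eq":
        field, _ = args
        return False if field in pruned_fields else expr
    return (op,) + tuple(replace_pruned(a, pruned_fields) for a in args)

def fold_constants(expr):
    """Fold boolean constants in and/or nodes, mimicking a FoldConstants pass."""
    if isinstance(expr, bool):
        return expr
    op, *args = expr
    if op in ("and", "or"):
        folded = [fold_constants(a) for a in args]
        if op == "and":
            if False in folded:
                return False          # AND with False is False
            rest = [a for a in folded if a is not True]
        else:
            if True in folded:
                return True           # OR with True is True
            rest = [a for a in folded if a is not False]
        if not rest:
            return op == "and"        # empty AND -> True, empty OR -> False
        return rest[0] if len(rest) == 1 else (op,) + tuple(rest)
    return expr

# (x = 1 AND y = 2): the bloom filter says no row in this group has x == 1.
expr = ("and", ("eq", "x", 1), ("eq", "y", 2))
print(fold_constants(replace_pruned(expr, {"x"})))  # False: row group skipped
```

When the whole row-group expression folds to False, the scanner can skip that row group; otherwise the residual expression is still evaluated against the rows as before.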

@mapleFU
Member

mapleFU commented Jun 29, 2023

Got it, I'll give it a try.

@arthurpassos

I see this is still open and I have the same question as the original one: can bloom filters be utilized through the Arrow read API?

@mapleFU

@mapleFU
Member

mapleFU commented Apr 9, 2024

Generally it's not implemented yet. I'll pick that up again after https://github.com/apache/arrow/pull/37400/files is merged.
