Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add expr::like and expr::notlike to pruning logic #508

Closed
wants to merge 2 commits into from

Conversation

nevi-me
Copy link
Contributor

@nevi-me nevi-me commented Jun 5, 2021

Which issue does this PR close?

Closes #507.

Rationale for this change

Extending pruning to include string columns with LIKE

What changes are included in this PR?

Checks if a LIKE and NOT LIKE condition don't start with %, and converts them into a EQ filter.

Are there any user-facing changes?

No

@nevi-me
Copy link
Contributor Author

nevi-me commented Jun 5, 2021

@alamb is it enough to add the like and not like where I added them? Not sure of where else I need to change.

@Dandandan I'm unable to configure the TPC benchmark data (rather, converting the files to Parquet).

If you don't mind, may you please check if Q{14|16|20} perform any better with this change? They use like and not like that can be pruned.

@codecov-commenter
Copy link

codecov-commenter commented Jun 5, 2021

Codecov Report

Merging #508 (1ee63dd) into master (a9d04ca) will increase coverage by 0.06%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #508      +/-   ##
==========================================
+ Coverage   76.07%   76.14%   +0.06%     
==========================================
  Files         155      155              
  Lines       26544    26626      +82     
==========================================
+ Hits        20194    20274      +80     
- Misses       6350     6352       +2     
Impacted Files Coverage Δ
datafusion/src/physical_optimizer/pruning.rs 93.05% <100.00%> (+0.78%) ⬆️
datafusion/src/optimizer/constant_folding.rs 91.31% <0.00%> (-0.38%) ⬇️
datafusion-cli/src/lib.rs 0.00% <0.00%> (ø)
datafusion-cli/src/main.rs 0.00% <0.00%> (ø)
datafusion/src/logical_plan/expr.rs 84.96% <0.00%> (+0.36%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a9d04ca...1ee63dd. Read the comment docs.

@Dandandan
Copy link
Contributor

@alamb is it enough to add the like and not like where I added them? Not sure of where else I need to change.

@Dandandan I'm unable to configure the TPC benchmark data (rather, converting the files to Parquet).

If you don't mind, may you please check if Q{14|16|20} perform any better with this change? They use like and not like that can be pruned.

I can try!
In my experience / to my knowlegde pruning only matters on sorted / bucketed / colocated data. The data generated by the TPC-H benchmark is very well distributed by default without doing some kind of sorting.

match &**right {
// If the literal is a 'starts_with'
Expr::Literal(ScalarValue::Utf8(Some(string)))
if !string.starts_with('%') =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about patterns like pat1%pat2 ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also consider escaped percent characters in the pattern. Example: LIKE '100\% %'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think as we can evaluate the like expression anyway, it might be easier to support like / not like to the full extent instead of only "startswith".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only focused on expressions that don't start with %, under the assumption that they would be a starts_with. I don't think we can support anything other than a starts_with because we translate the queries to min LtEq value && value LtEq max.

Or how would LIKE '100\% %' be evaluated?

Copy link
Contributor

@Dandandan Dandandan Jun 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, that makes sense 👍 (I think escaping might be an issue in arrow-rs too)

@Dandandan
Copy link
Contributor

@alamb is it enough to add the like and not like where I added them? Not sure of where else I need to change.

@Dandandan I'm unable to configure the TPC benchmark data (rather, converting the files to Parquet).

If you don't mind, may you please check if Q{14|16|20} perform any better with this change? They use like and not like that can be pruned.

Ah I see, it has some patterns that don't match any value at all... However we don't support query 14/16/20 yet:

#165
#167
#171

@nevi-me
Copy link
Contributor Author

nevi-me commented Jun 5, 2021

However we don't support query 14/16/20 yet:

Aw :( it would have been great to see what impact the small change has.

Maybe @alamb will see better results in iOX given that their data would likely have patterns that could benefit from this pruning.

There's also parquet column indices that were introduced in 2.5.0. I'd like to work on them, as that's where we'll see bigger read improvements on sorted data.

@alamb
Copy link
Contributor

alamb commented Jun 5, 2021

Thanks @nevi-me -- this looks great. I agree iOX may very well benefit from this as regex are common in our query workload. I will try and review this PR carefully tomorrow

@Dandandan
Copy link
Contributor

Dandandan commented Jun 5, 2021

@nevi-me

I did some checking with data from TPC-H:

Before:

CREATE EXTERNAL TABLE T STORED AS PARQUET LOCATION '../benchmarks/parquet/lineitem';
select l_orderkey from T where l_comment like '1%';
0 rows in set. Query took 150 milliseconds.
> 
select l_orderkey from T where l_comment like '{%';
0 rows in set. Query took 143 milliseconds.

After:

CREATE EXTERNAL TABLE T STORED AS PARQUET LOCATION '../benchmarks/parquet/lineitem';
select l_orderkey from T where l_comment like '1%';
0 rows in set. Query took 148 milliseconds.
> 
select l_orderkey from T where l_comment like '{%';
0 rows in set. Query took 43 milliseconds.

So looks like 👍

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @nevi-me -- this is awesome.

I think there are a few items that need to be fixed (mentioned in comments), but overall the idea here is 💯 -- thank you so much

Maybe @alamb will see better results in iOX given that their data would likely have patterns that could benefit from this pruning.

I think this is likely in IOx because we have several columns (string) columns that often are very very low cardinality (like 8 distinct values for 100k rows) and matched with regexp style matches

@@ -548,7 +549,7 @@ fn build_predicate_expression(
// allow partial failure in predicate expression generation
// this can still produce a useful predicate when multiple conditions are joined using AND
Err(_) => {
return Ok(logical_plan::lit(true));
return Ok(unhandled);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 thanks I forgot that

if !string.starts_with('%') =>
{
let scalar_expr =
Expr::Literal(ScalarValue::Utf8(Some(string.replace('%', ""))));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if just removing % is correct:

For example in a pattern like foo%bar would be converted to foobar and when compared with a value of fooaaabar would be deemed "out of range" by this logic, even though it matches the original predicate foo%bar.

If instead, for foo%bar we used foo (only use the string up to the first unescaped %) I think then the logic applies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed and changed tests

#[test]
fn row_group_predicate_not_like() -> Result<()> {
let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
// test LIKE operator that can't be converted to a 'starts_with'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// test LIKE operator that can't be converted to a 'starts_with'
// test NOT LIKE operator that can't be converted to a 'starts_with'

#[test]
fn row_group_predicate_not_starts_with() -> Result<()> {
let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
// test LIKE operator that can't be converted to a 'starts_with'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// test LIKE operator that can't be converted to a 'starts_with'
// test NOT LIKE operator that can't be converted to a 'starts_with'

fn row_group_predicate_not_starts_with() -> Result<()> {
let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
// test LIKE operator that can't be converted to a 'starts_with'
let expr = col("c1").not().like(lit("Banana%"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a difference between !(a LIKE b) and a NOT LIKE b -- so to test the NOT LIKE operator above this should be something like

Suggested change
let expr = col("c1").not().like(lit("Banana%"));
let expr = col("c1").not_like(lit("Banana%");

https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/logical_plan/expr.rs#L455-L457

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This explains why the filter was negated, thanks!

alamb
alamb previously approved these changes Jun 7, 2021
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nevi-me

jorgecarleitao
jorgecarleitao previously approved these changes Jun 8, 2021
Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ready to go. Thanks a lot, @nevi-me !

if !string.starts_with('%') =>
{
// Split the string to get the first part before '%'
let split = string.split('%').next().unwrap().to_string();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

won't this unwrap panic if the string does not contain any %? (if "like" always requires that, maybe we should throw an error instead?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like does not require %

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially wondered the same thing -- lol! But my conclusion was "no it won't panic"

I made a quick playground that shows this working https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=049f4c1640386ff99c3b5e07085e0889

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, won't panic because String::split() will always return at least 1 result, the full string if there's nothing to spilt by

@Dandandan
Copy link
Contributor

@nevi-me do you also want to address escaping the percentage character? \%

I know like_utf8 is broken in Arrow but it might be confusing to introduce this error at different parts.

\% should just match the literal % character.
E.g. nevi-\%x% should use nevi-%x as start, not nevi-\ as is the case currently.

@nevi-me
Copy link
Contributor Author

nevi-me commented Jun 8, 2021

@nevi-me do you also want to address escaping the percentage character? \%

I know like_utf8 is broken in Arrow but it might be confusing to introduce this error at different parts.

\% should just match the literal % character.
E.g. nevi-\%x% should use nevi-%x as start, not nevi-\ as is the case currently.

@alamb @jorgecarleitao please don't merge this yet, so I can address the above.

@alamb alamb dismissed stale reviews from jorgecarleitao and themself June 8, 2021 21:33

Avoid accidentally merging, at Nevi's request, until fixed escaping of %,

@alamb alamb added the datafusion Changes in the datafusion crate label Jun 10, 2021
@alamb
Copy link
Contributor

alamb commented Jun 27, 2021

Marking as draft so it is clearer from the list of PRs that there is planned work for this one

@alamb alamb marked this pull request as draft June 27, 2021 10:53
@alamb alamb added the stale-pr label Jul 13, 2021
@alamb
Copy link
Contributor

alamb commented Aug 20, 2021

Closing stale PRs to keep PR review list manageable. Please reopen if that is a mistake

@alamb alamb closed this Aug 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support pruning on string columns using starts_with
6 participants