Support "IS TRUE/FALSE" syntax #3189

Closed: sarahyurick wants to merge 16 commits

Contributor

@sarahyurick sarahyurick commented Aug 17, 2022

Which issue does this PR close?

Closes #3159

What changes are included in this PR?

So far, I've added some basic changes so that IS TRUE works on a scalar (ColumnarValue::Scalar(scalar)), like SELECT TRUE IS TRUE as t, SELECT FALSE IS TRUE as t, SELECT 2 IS TRUE as t, etc.

Mainly, I am looking for feedback on how to handle the ColumnarValue::Array(array) case.

Please also let me know if there are any major issues with this implementation so far. Originally, @alamb suggested using IsNotDistinctFrom, but I was having trouble figuring out how that was supposed to work. Since we can already parse IS TRUE, the initial error (without the changes from this PR) is Error: NotImplemented("Unsupported ast node IsTrue(Identifier(Ident { value: \"b\", quote_style: None })) in sqltorel"). Whenever I tried to link IsTrue to IsNotDistinctFrom, I ran into trouble because IsTrue looks like <expr> IS TRUE while IsNotDistinctFrom looks like <expr> IS NOT DISTINCT FROM <expr>. Maybe I'm missing something obvious, but I decided to mimic the implementations for IsNull and IsNotNull, since they don't have a RHS expression either.

Thanks in advance!

Update: IS TRUE should be fully functional.
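
For readers following along, here is a minimal sketch of the semantics being implemented, using plain Option<bool> as a stand-in for a nullable boolean scalar. The function names are illustrative only and are not the ones used in this PR; the sketch assumes the input is already boolean or NULL.

```rust
/// `x IS TRUE`: true only for a non-null `true`. The result is never NULL.
fn is_true(value: Option<bool>) -> bool {
    value == Some(true)
}

/// `x IS FALSE`: true only for a non-null `false`. The result is never NULL.
fn is_false(value: Option<bool>) -> bool {
    value == Some(false)
}
```

The point of the sketch is that, unlike `x = TRUE`, neither predicate propagates NULL: a NULL input yields false in both cases.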

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Changes to the physical-expr crates sql SQL Planner labels Aug 17, 2022
@sarahyurick sarahyurick changed the title [DRAFT] Support "IS TRUE/FALSE" syntax Support "IS TRUE/FALSE" syntax Aug 17, 2022
@andygrove
Member

@sarahyurick I tried some expr IS TRUE type queries in Postgres and it only supports the case where expr is a boolean type. Does Dask SQL have different behavior to this?

@sarahyurick
Contributor Author

sarahyurick commented Aug 18, 2022

> @sarahyurick I tried some expr IS TRUE type queries in Postgres and it only supports the case where expr is a boolean type. Does Dask SQL have different behavior to this?

Good point, thanks! For all other data types, Dask SQL main currently throws the error Cannot apply 'IS TRUE' to arguments of type '<BIGINT> IS TRUE'. Supported form(s): '<BOOLEAN> IS TRUE', so it would probably be best to match this.
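
A rough sketch of what such a check could look like, written against a toy type enum rather than DataFusion's real DataType. The SqlType enum, sql_name, and check_is_true_operand are all hypothetical names, and accepting NULL alongside BOOLEAN is an assumption, not confirmed behavior of either engine.

```rust
/// Toy stand-in for a SQL data type (hypothetical, not DataFusion's DataType).
#[derive(Debug, Clone, Copy, PartialEq)]
enum SqlType {
    Boolean,
    Null,
    BigInt,
    Varchar,
}

impl SqlType {
    /// Upper-case name used in the error message.
    fn sql_name(self) -> &'static str {
        match self {
            SqlType::Boolean => "BOOLEAN",
            SqlType::Null => "NULL",
            SqlType::BigInt => "BIGINT",
            SqlType::Varchar => "VARCHAR",
        }
    }
}

/// Accept boolean (and, by assumption, NULL) operands; reject everything else
/// with a message shaped like the Dask SQL error quoted above.
fn check_is_true_operand(operand: SqlType) -> Result<(), String> {
    match operand {
        SqlType::Boolean | SqlType::Null => Ok(()),
        other => Err(format!(
            "Cannot apply 'IS TRUE' to arguments of type '<{}> IS TRUE'. Supported form(s): '<BOOLEAN> IS TRUE'",
            other.sql_name()
        )),
    }
}
```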

@andygrove
Member

@sarahyurick Could you target this to the sqlparser-0.21 branch? We'll need to merge PRs there in the order specified in #3192

Also, see discussion on this feature branch approach in #3191

@andygrove andygrove mentioned this pull request Aug 18, 2022
5 tasks
@sarahyurick sarahyurick changed the base branch from master to sqlparser-0.21 August 18, 2022 17:38
@andygrove andygrove changed the title Support "IS TRUE/FALSE" syntax [sqlparser-0.21] Support "IS TRUE/FALSE" syntax Aug 18, 2022
@sarahyurick
Contributor Author

IS TRUE and IS FALSE should both be fully functional. I'm not sure what's going on with the proto files, though.

After those are fixed, it should just be formatting checks and tests, if any.

@andygrove andygrove changed the title [sqlparser-0.21] Support "IS TRUE/FALSE" syntax Support "IS TRUE/FALSE" syntax Aug 18, 2022
@sarahyurick sarahyurick changed the base branch from sqlparser-0.21 to master August 18, 2022 20:14
Co-authored-by: Andy Grove <andygrove73@gmail.com>
@codecov-commenter

codecov-commenter commented Aug 18, 2022

Codecov Report

Merging #3189 (affa48d) into master (929eb6d) will decrease coverage by 0.12%.
The diff coverage is 42.28%.

@@            Coverage Diff             @@
##           master    #3189      +/-   ##
==========================================
- Coverage   85.87%   85.75%   -0.13%     
==========================================
  Files         291      293       +2     
  Lines       52885    53033     +148     
==========================================
+ Hits        45415    45476      +61     
- Misses       7470     7557      +87     
Impacted Files Coverage Δ
datafusion/core/src/datasource/listing/helpers.rs 95.01% <ø> (ø)
datafusion/core/src/physical_plan/planner.rs 80.55% <0.00%> (-0.49%) ⬇️
datafusion/expr/src/expr.rs 81.76% <0.00%> (-3.64%) ⬇️
datafusion/expr/src/expr_rewriter.rs 85.34% <0.00%> (-0.75%) ⬇️
datafusion/expr/src/expr_visitor.rs 62.19% <0.00%> (-1.56%) ⬇️
datafusion/expr/src/utils.rs 90.76% <ø> (ø)
...tafusion/optimizer/src/common_subexpr_eliminate.rs 93.81% <0.00%> (-0.49%) ⬇️
datafusion/optimizer/src/simplify_expressions.rs 83.76% <0.00%> (-0.18%) ⬇️
datafusion/physical-expr/src/expressions/mod.rs 100.00% <ø> (ø)
datafusion/physical-expr/src/planner.rs 92.36% <0.00%> (-2.91%) ⬇️
... and 9 more

@sarahyurick
Contributor Author

Should I also add tests in proto/src/lib.rs? I'm not quite sure what those would look like.

Member

@andygrove andygrove left a comment

LGTM! Thanks @sarahyurick

Contributor

@alamb alamb left a comment

One thing I would recommend is adding some sort of SQL level test -- like https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sql/expr.rs#L257 for IS NULL

The rationale for adding this would be to ensure all the plumbing is hooked up through the planner and optimizer

| Expr::IsNotNull(_)
| Expr::IsTrue(_)
| Expr::IsFalse(_)
| Expr::Exists { .. } => Ok(false),
Contributor

👍

Comment on lines +78 to +83
let array_len = array.len();
let mut result_builder = BooleanBuilder::new(array_len);
for i in 0..array_len {
    result_builder.append_value(!bool_array.is_null(i) && !bool_array.value(i));
}
Ok(ColumnarValue::Array(Arc::new(result_builder.finish())))
Contributor

You can probably do the same thing with less code (and more efficiently) using the FromIter method:

Suggested change
- let array_len = array.len();
- let mut result_builder = BooleanBuilder::new(array_len);
- for i in 0..array_len {
-     result_builder.append_value(!bool_array.is_null(i) && !bool_array.value(i));
- }
- Ok(ColumnarValue::Array(Arc::new(result_builder.finish())))
+ let result: BooleanArray = bool_array.iter()
+     .map(|v| v.map(|v| !v).or(Some(false)))
+     .collect();

Contributor

(same comment applies to is_true.rs)
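
To illustrate why the two forms agree, here is a dependency-free sketch that uses a slice of Option<bool> in place of an Arrow BooleanArray. This is only an analogy for the IS FALSE kernel discussed above; the function names are made up for the example, and the real code would collect into a BooleanArray instead of a Vec.

```rust
/// Index-based loop, mirroring the builder version under review.
fn is_false_loop(values: &[Option<bool>]) -> Vec<Option<bool>> {
    let mut out = Vec::with_capacity(values.len());
    for i in 0..values.len() {
        let v = values[i];
        // Non-null and false => true; everything else (including null) => false.
        out.push(Some(v.is_some() && !v.unwrap_or(true)));
    }
    out
}

/// Iterator version, mirroring the suggested change.
fn is_false_iter(values: &[Option<bool>]) -> Vec<Option<bool>> {
    values
        .iter()
        .map(|&v| v.map(|b| !b).or(Some(false)))
        .collect()
}
```

Both versions always produce a non-null output element, which is the property IS FALSE needs.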

Comment on lines +1885 to +1887
SQLExpr::IsTrue(expr) => Ok(Expr::IsTrue(Box::new(
    self.sql_expr_to_logical_expr(*expr, schema, ctes)?,
))),
Contributor

I think you might be able to avoid having to extend Expr and add a PhysicalExpr at all if you rewrote IS TRUE in the SQL planner to IS NOT DISTINCT FROM.

Something like:

Suggested change
- SQLExpr::IsTrue(expr) => Ok(Expr::IsTrue(Box::new(
-     self.sql_expr_to_logical_expr(*expr, schema, ctes)?,
- ))),
+ SQLExpr::IsTrue(expr) => Ok(Expr::BinaryExpr {
+     left: Box::new(self.sql_expr_to_logical_expr(*expr, schema, ctes)?),
+     op: Operator::IsNotDistinctFrom,
+     right: Box::new(lit(true)),
+ }),

That way a query like SELECT x IS TRUE becomes SELECT x IS NOT DISTINCT FROM TRUE

Here is an example from postgres showing they are equivalent:

alamb=# select column1 is true, column1 is not distinct from true from (values (true), (false), (null)) as sq;
 ?column? | ?column? 
----------+----------
 t        | t
 f        | f
 f        | f
(3 rows)

The same type of transformation applies to IS FALSE

Then this PR would likely be 8 lines of code and then the sql level tests
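
As a quick sanity check of the equivalence shown in the Postgres output above, here is a small sketch that models nullable booleans as Option<bool>. The helper names are illustrative, not DataFusion APIs.

```rust
/// Null-safe equality: NULL is "not distinct" from NULL, and the result is never NULL.
fn is_not_distinct_from(left: Option<bool>, right: Option<bool>) -> bool {
    left == right
}

/// `x IS TRUE` expressed as `x IS NOT DISTINCT FROM TRUE`.
fn is_true_via_rewrite(x: Option<bool>) -> bool {
    is_not_distinct_from(x, Some(true))
}

/// `x IS FALSE` expressed as `x IS NOT DISTINCT FROM FALSE`.
fn is_false_via_rewrite(x: Option<bool>) -> bool {
    is_not_distinct_from(x, Some(false))
}
```

For the three inputs true, false, and NULL, is_true_via_rewrite gives t, f, f, matching both columns of the query result.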

Member

My preference would be to keep the logical plan as close to the original query as possible. Different query engines will have different approaches for how to implement this in the physical plan. I also think that it is better UX for the user to see IsTrue/False in the logical plan if that's what their query contained.

I think it is a good idea for DataFusion to map the logical IsTrue/IsFalse to the physical expression IsNotDistinctFrom, which removes the need for the new physical expressions here.
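
One way to picture this split is the toy model below: the logical plan keeps IsTrue/IsFalse as written, and only the lowering step turns them into a null-safe comparison against a literal (IS NOT DISTINCT FROM, as in the earlier suggestion). The LogicalExpr and PhysicalExpr enums and the lower function are hypothetical and far simpler than DataFusion's real types.

```rust
/// Toy logical expression: keeps the user's IS TRUE / IS FALSE visible.
#[derive(Debug, Clone, PartialEq)]
enum LogicalExpr {
    Column(String),
    IsTrue(Box<LogicalExpr>),
    IsFalse(Box<LogicalExpr>),
}

/// Toy physical expression: has no dedicated IS TRUE / IS FALSE node.
#[derive(Debug, Clone, PartialEq)]
enum PhysicalExpr {
    Column(String),
    Literal(bool),
    IsNotDistinctFrom(Box<PhysicalExpr>, Box<PhysicalExpr>),
}

/// Lower a logical expression to a physical one, reusing the null-safe comparison.
fn lower(expr: &LogicalExpr) -> PhysicalExpr {
    match expr {
        LogicalExpr::Column(name) => PhysicalExpr::Column(name.clone()),
        LogicalExpr::IsTrue(inner) => PhysicalExpr::IsNotDistinctFrom(
            Box::new(lower(inner)),
            Box::new(PhysicalExpr::Literal(true)),
        ),
        LogicalExpr::IsFalse(inner) => PhysicalExpr::IsNotDistinctFrom(
            Box::new(lower(inner)),
            Box::new(PhysicalExpr::Literal(false)),
        ),
    }
}
```

With this shape, an EXPLAIN of the logical plan would still show IsTrue, while execution reuses the existing comparison kernel.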

Contributor Author

Okay, I think that makes sense to me. The only thing I was wondering was how this would handle non-boolean/non-null inputs, since we should error for all other datatypes?

Member

Yes, that is related to the work I am starting in #3222. I think we need to do some type validation here so that if we see <expr> IS TRUE where expr is not boolean, then we either add a cast or throw an error if it cannot be cast to boolean.

@alamb
Contributor

alamb commented Aug 31, 2022

BTW @sarahyurick -- thank you for sticking with this. It is great to see the functionality coming along so nicely 👌
