ARROW-11822: [Rust][Datafusion] Support case sensitive for function #9600
wqc200 wants to merge 44 commits into apache:master from wqc200:master
ARROW-11778: [Rust] Cast from LargeUtf8 to Numerical and temporal types
alamb
left a comment
Thanks @wqc200 -- this is great functionality to have.
I would personally like to have datafusion follow postgres when it comes to case sensitivity (rather than having it be an option).
@andygrove / @seddonm1 / @Dandandan what do you think about making DF follow postgres's lead (and effectively convert everything to lowercase, except identifiers explicitly quoted)?
For example Database --> database but "Database" --> Database?
Here is a blog post that I found that talks about postgres case sensitivity:
https://blog.xojo.com/2016/09/28/about-postgresql-case-sensitivity/
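The folding rule proposed above (unquoted identifiers become lowercase, double-quoted identifiers keep their case) could be sketched roughly as follows. This is only an illustration: `normalize_ident` is a hypothetical helper, not a DataFusion or sqlparser function, and it assumes the identifier arrives as raw text with its quotes still attached. Escaped quotes inside a quoted identifier are not handled.

```rust
/// Hypothetical helper (not part of DataFusion): fold an identifier the way
/// the comment above describes Postgres doing it.
fn normalize_ident(raw: &str) -> String {
    let is_quoted = raw.len() >= 2 && raw.starts_with('"') && raw.ends_with('"');
    if is_quoted {
        // A double-quoted identifier keeps its exact case; only the quotes are stripped.
        raw[1..raw.len() - 1].to_string()
    } else {
        // An unquoted identifier is folded to lowercase.
        raw.to_lowercase()
    }
}
```

With this sketch, `normalize_ident("Database")` yields `database`, while `normalize_ident("\"Database\"")` yields `Database`.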
This would be great to have indeed (and to start doing some more testing on); I think ATM we are both case sensitive and case insensitive. I agree it would be best to default to PostgreSQL case insensitivity without it being an option, to avoid making the code more complex than needed.
Thank you very much! I'm going to do this in two steps. Step one: support case-sensitive functionality. Step two: make it case-insensitive by default and fix the test cases.
@alamb
I think it is good that we default to Postgres dialect and case-sensitivity but I would like to make it possible for users to choose other dialects and case-sensitivity if there isn't a huge burden in doing so. I can see DataFusion being incredibly useful in projects where it is necessary to mimic another database system.
Makes sense @wqc200 -- I will try and review this PR carefully tomorrow.
rust/datafusion/src/sql/planner.rs
  /// Generate a logic plan from an SQL select
- fn select_to_plan(&self, select: &Select) -> Result<LogicalPlan> {
+ pub fn select_to_plan(&self, select: &Select) -> Result<LogicalPlan> {
Does this need to be made pub?
Oh, this has nothing to do with the current PR. I'm using this function in my project, so I added pub.
/// Represents the call of a built-in scalar function with a set of arguments.
ScalarFunction {
    /// The input name of the function
    input_name: String,
Also, I looked through the diff and I couldn't figure out what input_name is actually used for. Is this new field needed?
This is mainly used to display the function name as the user entered it.
Without input_name, the output column would be md5(Utf8("a")) instead of mD5(Utf8("a")).
For example:
mysql> select mD5('a');
+----------------------------------+
| mD5(Utf8("a"))                   |
+----------------------------------+
| 0cc175b9c0f1b6a831c399e269772661 |
+----------------------------------+
1 row in set (0.01 sec)
FWIW here is what postgres does for the same query (uses the lowercase, canonical function name)
alamb=# select mD5('a');
md5
----------------------------------
0cc175b9c0f1b6a831c399e269772661
(1 row)
Yes, I fixed it; the output now uses lowercase, as in Postgres. I added an enum with two variants: LikeMySQL and LikePostgreSQL.
select mD5('a');
md5
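The two display styles discussed in this thread could be modelled roughly like this. The enum name `CaseStyle` and the variant names come from the conversation; `display_name` and its signature are assumptions made for illustration and are not taken from the PR's diff.

```rust
/// How a function name is rendered in the output column header.
/// Variant names follow the discussion; the rest is an illustrative assumption.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CaseStyle {
    /// Echo the name exactly as the user typed it, as in the MySQL output above.
    LikeMySQL,
    /// Show the canonical lowercase name, as in the Postgres output above.
    LikePostgreSQL,
}

/// Hypothetical helper: choose the name displayed for a function call,
/// given the name as the user entered it (`input_name`).
fn display_name(style: CaseStyle, input_name: &str) -> String {
    match style {
        CaseStyle::LikeMySQL => input_name.to_string(),
        CaseStyle::LikePostgreSQL => input_name.to_lowercase(),
    }
}
```

For `select mD5('a')`, `display_name(CaseStyle::LikeMySQL, "mD5")` keeps `mD5`, while `display_name(CaseStyle::LikePostgreSQL, "mD5")` gives `md5`.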
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
alamb
left a comment
Thank you very much @wqc200 -- I think this PR's code is now ready to go, except that it needs some basic tests that show the difference in behavior when setting CaseStyle and case_sensitive (not changing the defaults).
The purpose of such tests is to ensure no one breaks this feature accidentally when making changes elsewhere in the future.
I am willing to write these tests so this PR can be merged, given how much time you have spent on this already -- let me know if you would like help to do so.
Ok
merge official branch
merge offical branch
This PR has been open for more than a month. When will it be merged?
@wqc200 -- from my perspective all this PR is waiting for is tests:
I do have writing such tests on my personal work list, but I have not yet had a chance to write them.
merge from official branch
While trying to write tests for this PR I discovered some inconsistencies:
- Aggregate functions already (prior to this PR) are checked using case insensitive comparison (e.g. `select MAX(x)` and `select max(x)` both work - the code is here)
- Scalar functions are checked using case sensitive comparison
This PR has the strange behavior that the case_sensitive setting applies only to scalar functions (aggregates still always use case insensitive comparisons).
Thus I plan to break this PR into two new PRs -- one for the CaseStyle setting and one that proposes to make function name lookup always case insensitive (and thus consistent between aggregates and scalar functions)
I spent some time working on case sensitive functions this morning and have a proposed PR: #9827. I think the exercise of writing tests to demonstrate the behavior was a good one. After #9827 is sorted out, I plan to give similar treatment to the part of this PR that defines how identifiers are shown in the query output.
merge from official branch
…r functions and aggregates

Broken out from #9600 by @wqc200. Note this does not contain the part of #9600 that controls the output display of functions.

# Rationale

Aggregate functions are checked using case insensitive comparison (e.g. `select MAX(x)` and `select max(x)` both work - the code is [here](https://github.com/apache/arrow/blob/356c300c5ee1e2b23a83652514af11e3a731d596/rust/datafusion/src/physical_plan/aggregates.rs#L75)).

However, scalar functions, user defined aggregates, and user defined functions are checked using case sensitive comparisons (e.g. `select sqrt(x)` works while `select SQRT` does not).

Postgres always uses case insensitive comparison:

```
alamb=# select sqrt(x) from foo;
 sqrt
------
(0 rows)

alamb=# select SQRT(x) from foo;
 sqrt
------
(0 rows)
```

# Changes

Always use case insensitive comparisons for unquoted identifier comparison, both for consistency within DataFusion as well as consistency with Postgres (and the SQL standard).

Adds tests that demonstrate the behavior.

# Notes

This PR changes how user defined functions are resolved in SQL queries. If a user registers two functions with names `"my_sqrt"` and `"MY_SQRT"`, previously they could both be called individually. After this PR `my_sqrt` will be called unless the user specifically put `"MY_SQRT"` (in quotes) in their query.

Closes #9827 from alamb/case_insensitive_functions

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
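The resolution rule in the Notes of this commit message can be illustrated with a small stand-alone sketch. It is not the code from #9827: `Registry`, `register`, `resolve`, and `example_registry` are hypothetical names, the stored values are plain labels rather than real functions, and the sketch assumes the function name reaches the lookup as raw SQL text with any double quotes still attached.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a function registry; values are labels only.
struct Registry {
    functions: HashMap<String, &'static str>,
}

impl Registry {
    fn new() -> Self {
        Registry { functions: HashMap::new() }
    }

    /// Register a function under the exact name given.
    fn register(&mut self, name: &str, label: &'static str) {
        self.functions.insert(name.to_string(), label);
    }

    /// Resolve a name as written in SQL: a double-quoted name is looked up
    /// exactly, an unquoted name is folded to lowercase first.
    fn resolve(&self, sql_name: &str) -> Option<&'static str> {
        let quoted =
            sql_name.len() >= 2 && sql_name.starts_with('"') && sql_name.ends_with('"');
        let key = if quoted {
            sql_name[1..sql_name.len() - 1].to_string()
        } else {
            sql_name.to_lowercase()
        };
        self.functions.get(&key).copied()
    }
}

/// Build the two-function scenario described in the Notes.
fn example_registry() -> Registry {
    let mut registry = Registry::new();
    registry.register("my_sqrt", "lowercase udf");
    registry.register("MY_SQRT", "uppercase udf");
    registry
}
```

Under these assumptions, both `my_sqrt` and unquoted `MY_SQRT` resolve to the function registered as `my_sqrt`, and only the quoted form `"MY_SQRT"` reaches the uppercase registration.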
merge from official
The Apache Arrow Rust community is moving the Rust implementation into its own dedicated github repositories, arrow-rs and arrow-datafusion. It is likely we will not merge this PR into this repository. Please see the mailing-list thread for more details. We expect the process to take a few days and will follow up with a migration plan for the in-flight PRs.
You're making me go back and forth. I've been working on this PR for months, but you just won't merge it. Now you have created a new repository, and I need to do this PR again. May I ask what you really want to do?
Hi @wqc200, I am sorry this has been a frustrating experience for you. The reason I have not merged this PR is that it does not have tests for the new features that were added. Also, we implemented the case insensitive comparison this PR describes in #9827. I suggest:
#10096 has removed the arrow implementation from this repository (it now resides in https://github.com/apache/arrow-rs and https://github.com/apache/arrow-datafusion) in the hopes of streamlining the development process. Please re-target this PR (let us know if you need help doing so) to one/both of the new repositories. Thank you for understanding and helping to make arrow-rs and datafusion better.
SELECT database() and SELECT DATABASE () are the same and can be queried normally