-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Query: Improve translation of String's StartsWith, EndsWith and Contains #474
Comments
@anpete here is the bug you asked me to file. |
@divega Thanks, keep em coming. |
…With and Contains We were returning wrong results for StartsWith, EndsWith and Contains with patters containing wildcard characters (%, _) Fix is to improve the translation as follows: Contains: (CHARINDEX(pattern, string) > 0) OR pattern = '' StartsWith: (string LIKE pattern + "%" AND CHARINDEX(pattern, string) = 1) OR pattern = '' EndsWith: (CHARINDEX(REVERSE(pattern), REVERSE(string)) > 0) OR pattern = '' We can sometimes remove the final term (pattern = ''), i.e. when the pattern is a constant. If it's value is '', we just return true, otherwise the final term is redundant and will not be generated. We can't really do same trick for parameters without changing the way caching works (currently it only looks at the nullability of the parameter, but in this case the value itself is important)
…With and Contains We were returning wrong results for StartsWith, EndsWith and Contains with patters containing wildcard characters (%, _) Fix is to improve the translation as follows: Contains: (CHARINDEX(pattern, string) > 0) OR pattern = '' StartsWith: (string LIKE pattern + "%" AND CHARINDEX(pattern, string) = 1) OR pattern = '' EndsWith: (CHARINDEX(REVERSE(pattern), REVERSE(string)) > 0) OR pattern = '' We can sometimes remove the final term (pattern = ''), i.e. when the pattern is a constant. If it's value is '', we just return true, otherwise the final term is redundant and will not be generated. We can't really do same trick for parameters without changing the way caching works (currently it only looks at the nullability of the parameter, but in this case the value itself is important)
…With and Contains We were returning wrong results for StartsWith, EndsWith and Contains with patters containing wildcard characters (%, _) Fix is to improve the translation as follows: Contains: (CHARINDEX(pattern, string) > 0) OR pattern = '' StartsWith: (string LIKE pattern + "%" AND CHARINDEX(pattern, string) = 1) OR pattern = '' EndsWith: (RIGHT(string, LEN(pattern) = pattern) OR pattern = '' We can sometimes remove the final term (pattern = ''), i.e. when the pattern is a constant. If it's value is '', we just return true, otherwise the final term is redundant and will not be generated. We can't really do same trick for parameters without changing the way caching works (currently it only looks at the nullability of the parameter, but in this case the value itself is important)
…With and Contains We were returning wrong results for StartsWith, EndsWith and Contains with patters containing wildcard characters (%, _) Fix is to improve the translation as follows: Contains: (CHARINDEX(pattern, string) > 0) OR pattern = '' StartsWith: (string LIKE pattern + "%" AND CHARINDEX(pattern, string) = 1) OR pattern = '' EndsWith: (RIGHT(string, LEN(pattern) = pattern) OR pattern = '' We can sometimes remove the final term (pattern = ''), i.e. when the pattern is a constant. If it's value is '', we just return true, otherwise the final term is redundant and will not be generated. We can't really do same trick for parameters without changing the way caching works (currently it only looks at the nullability of the parameter, but in this case the value itself is important)
…With and Contains We were returning wrong results for StartsWith, EndsWith and Contains with patters containing wildcard characters (%, _) Fix is to improve the translation as follows: Contains: (CHARINDEX(pattern, string) > 0) OR pattern = '' StartsWith: (string LIKE pattern + "%" AND CHARINDEX(pattern, string) = 1) OR pattern = '' EndsWith: (RIGHT(string, LEN(pattern) = pattern) OR pattern = '' We can sometimes remove the final term (pattern = ''), i.e. when the pattern is a constant. If it's value is '', we just return true, otherwise the final term is redundant and will not be generated. We can't really do same trick for parameters without changing the way caching works (currently it only looks at the nullability of the parameter, but in this case the value itself is important)
…With and Contains We were returning wrong results for StartsWith, EndsWith and Contains with patters containing wildcard characters (%, _) Fix is to improve the translation as follows: Contains: (CHARINDEX(pattern, string) > 0) OR pattern = '' StartsWith: (string LIKE pattern + "%" AND CHARINDEX(pattern, string) = 1) OR pattern = '' EndsWith: (RIGHT(string, LEN(pattern) = pattern) OR pattern = '' We can sometimes remove the final term (pattern = ''), i.e. when the pattern is a constant. If it's value is '', we just return true, otherwise the final term is redundant and will not be generated. We can't really do same trick for parameters without changing the way caching works (currently it only looks at the nullability of the parameter, but in this case the value itself is important)
…With and Contains We were returning wrong results for StartsWith, EndsWith and Contains with patters containing wildcard characters (%, _) Fix is to improve the translation as follows: Contains: (CHARINDEX(pattern, string) > 0) OR pattern = '' StartsWith: (string LIKE pattern + "%" AND CHARINDEX(pattern, string) = 1) OR pattern = '' EndsWith: (RIGHT(string, LEN(pattern) = pattern) OR pattern = '' We can sometimes remove the final term (pattern = ''), i.e. when the pattern is a constant. If it's value is '', we just return true, otherwise the final term is redundant and will not be generated. We can't really do same trick for parameters without changing the way caching works (currently it only looks at the nullability of the parameter, but in this case the value itself is important)
…With and Contains We were returning wrong results for StartsWith, EndsWith and Contains with patters containing wildcard characters (%, _) Fix is to improve the translation as follows: Contains: (CHARINDEX(pattern, string) > 0) OR pattern = '' StartsWith: (string LIKE pattern + "%" AND CHARINDEX(pattern, string) = 1) OR pattern = '' EndsWith: (RIGHT(string, LEN(pattern) = pattern) OR pattern = '' We can sometimes remove the final term (pattern = ''), i.e. when the pattern is a constant. If it's value is '', we just return true, otherwise the final term is redundant and will not be generated. We can't really do same trick for parameters without changing the way caching works (currently it only looks at the nullability of the parameter, but in this case the value itself is important)
@maumar Well, I wasn't advocating for sniffing parameter values, but since you mentioned it 😄: I assume you mean that things are not structured in a way that we can easily sniff parameters values for Re sniffing constants, as @roji mentioned, we are already doing it at https://github.com/aspnet/EntityFramework/blob/dev/src/Microsoft.EntityFrameworkCore.SqlServer/Query/ExpressionTranslators/Internal/SqlServerStartsWithOptimizedTranslator.cs#L39. I think we could consider removing the unnecessary I do see that there is a trade off between query compilation and SQL performance that we don't understand very well. And I suspect the real world scenario here is parameters, despite the fact that all the sample queries @jemiller0 provided in #7429 use constants. |
@divega yes, just how now we are looking at nullabilities of all parameters and not only the ones involved in null semantics-relevant operations. |
I have a newbie question. I'm not even sure how you would use a parameter with EF. I'm writing LINQ queries. As far as I knew, EF always just converted the LINQ queries to SQL queries with constants. However, the queries in my examples were generated using Dynamic LINQ. So, maybe that has something to do with it. I'm using Telerik's RadGrid UI component for ASP.NET Web Forms which generates the WHERE and ORDER BY clauses based on user input. Maybe if I weren't using that, the queries would be using parameters by default. I did notice that calls to .Skip() and .Take() were using parameters. |
@jemiller0 in general you can force parameters into your queries by using variables in your query, e.g.
|
@maumar Thanks. I just tested it and see what you mean. I didn't realize that's how it worked. Does EF Core use server-side prepared statements for these? As far as I know EF 6 didn't, at least for INSERT/UPDATE statements and I always wished it did because I think it would have improved performance. |
What @maumar said. The general rule is that variables that get captured in lambda expressions are translated as parameters. Thanks for the data point that Telerik's component generate arguments as constants. |
EF6 uses a similar set of rules. |
I had to do create a function to clean up the Dynamic LINQ strings that Telerik produces. By default they appeared to be more geared towards LINQ to Objects. I had to clean them up a bunch so that they would get translated into something more reasonable that would work better for use with EF. Since it's using Dynamic LINQ, I'm not sure how I would tell it to use parameters if I wanted to. Note, I'm using the grid's "advanced binding" method. I used to use EntityDataSource, but, found that the advanced method was more flexible and also EntityDataSource isn't supported with EF Core as far as I know. I much prefer using LINQ. The only thing I'm leery of is using Dynamic LINQ which is barely supported by Microsoft. It was in some example code from Microsoft, but, someone made a NuGet package out of it which is what I'm using. As far as I know Entity SQL isn't supported in EF Core. It would be great if Dynamic LINQ was something that was officially supported. |
@jemiller0 AFAIK EF Core currently doesn't ever prepare statements (this is tracked by #5459). Note, however, that prepared statements don't really do much in SqlServer which transparently caches query plans server-side. For PostgreSQL, where preparing is vital for performance, Npgsql 3.2.0 (just released) introduces an automatic preparation feature at the driver level, where frequently-used queries will be implicitly prepared without EF Core needing to do anything about it. |
Thanks for the valuable discussion. To summarize, at least in Npgsql I'm going to have the StartsWith() translator:
As a very minor implementation note, wouldn't it be slightly better to to replace CHARINDEX with |
Based on discussion in dotnet/efcore#474 (comment)
@roji I think you found the solution to the problem I've been having with .StartsWith() on SQL Server. I just tried using LEFT(LEN()) like you mentioned and it is FASTER that doing a LIKE alone. 0 seconds versus 2 seconds on the problem query that I had. Awesome! Also, thanks for he info on prepared statements. I think they would still be good to have for INSERT and UPDATE statements. BTW, I started playing around with PostgreSQL recently and was impressed by what a small memory footprint it has, at least in the default configuration. Also, of the DBMSs that I've tested, I have found it to have the fastest INSERT speed. It is as if it streams the data directly to disk. The main issue that I have with it is that as far as I know, there's no way to configure it for case insensitivity like you can with SQL Server, at least on Windows. I guess it works on other platforms, but, I haven't had a chance to try it yet. At any rate, unless I'm missing something, it looks like if the SQL Server provider could use the LEFT(LEN()) method that you mentioned, it looks like it could indeed speed things up. I haven't had to test it much, but, right off the bat, it looks like it cured the problem that I had with my problem query. Maybe it allows the backend to use a hash table lookup or something. |
@jemiller0 uh, happy to have helped although I didn't really mean to :) Please note that my proposal was to use LIKE AND LEFT(LEN()) and not LEFT(LEN()) alone, simply because I assumed that LEFT(LEN()) wouldn't be index-optimized. This retains the original logic of using LIKE first for speed, then filtering out false positives with something (CHARINDEX or LEFT(LEN())). So I'm not clear if you're still doing LIKE AND LEFT(LEN()) or have switched to LEFT(LEN()) on its own. I'll let the EF Core team comment further on the usefulness of LEFT(LEN()) for SqlServer. Regarding prepared statements, I don't think there's any relevance to the statement type (SELECT, INSERT, UPDATE...). All of them greatly benefit from preparation in PostgreSQL, whereas in SqlServer I'm assuming all statement types are implicitly cached without the need for explicit preparation (here's a doc page on this). It may be worth testing to confirm actual SqlServer behavior. Regardless, it seems like a good to implement preparation in EF Core simply to have all providers benefit from it, and EF Core may be in a good position to know what to prepare and what not to prepare. If you want to continue this conversation it may be better to do so in #5459. Regarding case sensitivity in PostgreSQL, unquoted identifiers are always folded to lowercase, whereas quoted ones maintain case (this is why Npgsql EF Core provider systematically quotes all identifiers). Since this is also off-topic feel free to open an issue in the Npgsql repo to continue discussing. |
When I tested it, I used LIKE and LEFT(LEN()) together. The query took 00:00:00.789. I just tried the query without LIKE and it took 00:00:02.021. I ran it several times and it looks like it's consistently faster with LIKE included. So, having the LIKE seems to improve performance. Otherwise, it would be nice to get rid of it altogether and then not have to worry about escaping special characters. With regard to prepared statements, the main thing I'm thinking about with regard to INSERT/UPDATE statements is that you are only sending the parameters after the first time through. So, it would cut down network traffic. Regarding case sensitivity, the issue is not case sensitivity of identifiers (I ran into that annoyance as well, but, it's less of an issue), but, of the data in the database itself. I.e. you need to use special PostgreSQL specific column types to get case insensitivity. I.e. you can't configure a case insensitive collation like SQL Server uses by default. Or at least that appears to be the case on Windows. I'm assuming it may work on other platforms. |
@jemiller0 re prepared statements, what you say it just as true of SELECT as it is of INSERT/UPDATE: it's very beneficial to have PostgreSQL parse a SELECT statement once and then send only parameters after the first time. Re case sensitivity you're right, and I'm pretty sure it's no different on non-Windows platforms. You do have the citext extension (which I think you were referring to), but as you say it's not the same as a global case insensitive collation. |
@roji You're right about SELECTs. I wasn't thinking correctly. I was thinking about it in the context of some loader applications that I have where I have 1 select and millions of INSERTS/UPDATEs, but, you are right. Especially, when you have things like N+1 queries happening. |
The trailing "OR @ParamValue = ''" might break "sargablity". I think, if I'm reading things correctly, this is a bit of left over code from when it was using CHARINDEX (that method needed the OR.... the "LIKE" method does not). See @divega comment on 7 Feb (#474 (comment)). My quick and dirty testing in SQL seems to support this. Here's the setup:
Here's the test:
With the "OR" commented out you get: |
@rmacfadyen thanks for raising this. We have created #11881 to track further optimizations of this translation. |
Our Entity Framework MySQL connector had been producing SQL like this: (`c`.`player_name` LIKE CONCAT('L\-','%') AND (POSITION('L\-' IN `c`.`player_name`) = 1)) Neglecting to escape backslashes in the LIKE, and not returning any players of the actual faction. This fails MySQL's requirements for LIKEs - From https://dev.mysql.com/doc/refman/5.6/en/string-comparison-functions.html: > Because MySQL uses C escape syntax in strings (for example, \n to represent a > newline character), you must double any \ that you use in LIKE strings. For > example, to search for \n, specify it as \\n. To search for \, specify it as > \\\\; this is because the backslashes are stripped once by the parser and > again when the pattern match is made, leaving a single backslash to be > matched against. This becomes eight backslashes in our code because C# needs them escaping too, so it becomes 2^3=8 backslashes. However, we cannot just do the escaping ourselves in our code, because of that extra POSITION call added by Entity Framework (see dotnet/efcore#474) which would be broken by the escaping. Therefore, what we instead do is add another special case around faction tags, this time to do the faction name check for this particular group as raw SQL, producing a query that looks more like this: SELECT ... FROM ( SELECT * FROM server_sessions WHERE player_name LIKE 'L\\\\-%' ) AS `c` WHERE ... session_start stuff ... ORDER BY `c`.`session_id` Credit to @westonpace for help verifying I'm not going mad and working around this problem.
@vishalsinhvaghela that looks like a bug in the MySQL provider, please report it to the maintaniers of that provider. |
PROVIDERS BEWARE:
Linq translation for methods Contains, EndsWith and StartsWith that we have in the Relational package uses LIKE operator, which may return incorrect results if the value parameter (what we are searching for) contains wildcard characters, e.g. '%' or '_'.
This issue addresses SqlServer and Sqlite providers, but all other providers will still use the old translation. Each provider that can be affected by this should implement their own MethodCallTranslators for Contains, EndsWith and StartsWith.
Currently in EF7 a LINQ query like this:
Gets translated to SQL like this (note that I am simplifying the query and expanding parameter values for clarity):
However, in order to return correct results, a LINQ query like this:
Should be translated to SQL like this:
The escaping accounts for SQL wildcard characters in the input string which should not be treated as wildcards (we can add a separate
Like()
method for passing patterns, but that belongs in a separate work item).When the input string is store correlated (e.g. is another column in the database instead of parameter or a literal in the query) using LIKE in the translation correctly becomes more difficult, e.g. it would be hard to perform the required escaping in SQL.
In general for cases in which LIKE doesn't work well we can fall back to alternative translations that don't rely on LIKE, e.g. for
String.StartsWith()
:Note that
CHARINDEX()
won't match an empty string butString.StartsWith("")
always returntrue
, that's why we add the Prefix ='' condition.The main disadvantage of this translation is that it is not sargable. That can be addressed with a hybrid translation, e.g.:
This should be quick to evaluate using an index because the LIKE condition should be able to take advantage of the index to produce fairly selective results and the second condition will filter out false positives returned by
LIKE
.Notice that this alternative removes the need to fiddle with the input value: we no longer need to escape wildcards because in the worse case they will produce false positive matches which the CHARINDEX() based condition will still be able to filter out.
Also notice that based on the current query caching design we wouldn't need to always produce this more complex translation. Instead, we could sniff into the argument of
String.StartsWith()
and pivot on it to produce different translations, e.g.:CHARINDEX()
LIKE
Similar approaches can be used for
String.EndsWith()
andString.Contains()
. However for these methods LIKE does not really contribute to the performance since the beginning of the input value cannot be used to perform index lookups, so it should be ok to produce a translation that doesn't use LIKE at all.The text was updated successfully, but these errors were encountered: