Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query: Improve translation of String's StartsWith, EndsWith and Contains #474

Closed
divega opened this issue Jul 31, 2014 · 37 comments
Closed
Assignees
Labels
area-query closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. providers-beware type-bug
Milestone

Comments

@divega
Copy link
Contributor

divega commented Jul 31, 2014

PROVIDERS BEWARE:

Linq translation for methods Contains, EndsWith and StartsWith that we have in the Relational package uses LIKE operator, which may return incorrect results if the value parameter (what we are searching for) contains wildcard characters, e.g. '%' or '_'.

This issue addresses SqlServer and Sqlite providers, but all other providers will still use the old translation. Each provider that can be affected by this should implement their own MethodCallTranslators for Contains, EndsWith and StartsWith.

Currently in EF7 a LINQ query like this:

var things = Things.Where(t => t.Name.StartsWith("a"));

Gets translated to SQL like this (note that I am simplifying the query and expanding parameter values for clarity):

SELECT * FROM Things WHERE Name LIKE 'a%' ;

However, in order to return correct results, a LINQ query like this:

var underscoreAThings = Things.Where(t => t.Name.StartsWith("_a"));

Should be translated to SQL like this:

SELECT * FROM Things WHERE Name LIKE '~_a%' ESCAPE '~';

The escaping accounts for SQL wildcard characters in the input string which should not be treated as wildcards (we can add a separate Like() method for passing patterns, but that belongs in a separate work item).

When the input string is store correlated (e.g. is another column in the database instead of parameter or a literal in the query) using LIKE in the translation correctly becomes more difficult, e.g. it would be hard to perform the required escaping in SQL.

In general for cases in which LIKE doesn't work well we can fall back to alternative translations that don't rely on LIKE, e.g. for String.StartsWith():

var underscoreAThings = Things.Where(t => t.Name.StartsWith(t.Prefix));
SELECT * FROM Things WHERE CHARINDEX(Prefix, Name) = 1 OR Prefix='';

Note that CHARINDEX() won't match an empty string but String.StartsWith("") always return true, that's why we add the Prefix ='' condition.

The main disadvantage of this translation is that it is not sargable. That can be addressed with a hybrid translation, e.g.:

SELECT * FROM Things WHERE Name LIKE Prefix+'%' AND (CHARINDEX(Prefix, Name) = 1 OR Prefix = '');

This should be quick to evaluate using an index because the LIKE condition should be able to take advantage of the index to produce fairly selective results and the second condition will filter out false positives returned by LIKE.

Notice that this alternative removes the need to fiddle with the input value: we no longer need to escape wildcards because in the worse case they will produce false positive matches which the CHARINDEX() based condition will still be able to filter out.

Also notice that based on the current query caching design we wouldn't need to always produce this more complex translation. Instead, we could sniff into the argument of String.StartsWith() and pivot on it to produce different translations, e.g.:

  1. If the value is opaque (i.e. it comes from the store) or if it contains a wildcard character, then produce the condition based on CHARINDEX()
  2. If the value does not contain a wildcard character in the first position then we can emit the condition based on LIKE

Similar approaches can be used for String.EndsWith() and String.Contains(). However for these methods LIKE does not really contribute to the performance since the beginning of the input value cannot be used to perform index lookups, so it should be ok to produce a translation that doesn't use LIKE at all.

@divega
Copy link
Contributor Author

divega commented Jul 31, 2014

@anpete here is the bug you asked me to file.

@anpete
Copy link
Contributor

anpete commented Jul 31, 2014

@divega Thanks, keep em coming.

@rowanmiller rowanmiller added this to the 1.0.0-rc1 milestone Aug 1, 2014
@rowanmiller rowanmiller modified the milestones: 7.0.0-rc1, 7.0.0 Nov 25, 2014
@divega divega changed the title Translation of String's StartsWith, EndsWith and Contains to LIKE should use escaping Improve translation of String's StartsWith, EndsWith and Contains Feb 16, 2015
@rowanmiller rowanmiller changed the title Improve translation of String's StartsWith, EndsWith and Contains Query: Improve translation of String's StartsWith, EndsWith and Contains Dec 8, 2015
@rowanmiller rowanmiller removed the pri1 label Dec 8, 2015
@rowanmiller rowanmiller modified the milestones: Backlog, 7.0.0 Dec 8, 2015
@maumar maumar modified the milestones: 1.1.0, Backlog Jul 22, 2016
@maumar maumar self-assigned this Jul 22, 2016
maumar added a commit that referenced this issue Jul 27, 2016
…With and Contains

We were returning wrong results for StartsWith, EndsWith and Contains with patters containing wildcard characters (%, _)

Fix is to improve the translation as follows:

Contains:
(CHARINDEX(pattern, string) > 0) OR pattern = ''

StartsWith:
(string LIKE pattern + "%" AND CHARINDEX(pattern, string) = 1) OR pattern = ''

EndsWith:
(CHARINDEX(REVERSE(pattern), REVERSE(string)) > 0) OR pattern = ''

We can sometimes remove the final term (pattern = ''), i.e. when the pattern is a constant. If it's value is '', we just return true, otherwise the final term is redundant and will not be generated.
We can't really do same trick for parameters without changing the way caching works (currently it only looks at the nullability of the parameter, but in this case the value itself is important)
maumar added a commit that referenced this issue Jul 27, 2016
…With and Contains

We were returning wrong results for StartsWith, EndsWith and Contains with patters containing wildcard characters (%, _)

Fix is to improve the translation as follows:

Contains:
(CHARINDEX(pattern, string) > 0) OR pattern = ''

StartsWith:
(string LIKE pattern + "%" AND CHARINDEX(pattern, string) = 1) OR pattern = ''

EndsWith:
(CHARINDEX(REVERSE(pattern), REVERSE(string)) > 0) OR pattern = ''

We can sometimes remove the final term (pattern = ''), i.e. when the pattern is a constant. If it's value is '', we just return true, otherwise the final term is redundant and will not be generated.
We can't really do same trick for parameters without changing the way caching works (currently it only looks at the nullability of the parameter, but in this case the value itself is important)
maumar added a commit that referenced this issue Jul 29, 2016
…With and Contains

We were returning wrong results for StartsWith, EndsWith and Contains with patters containing wildcard characters (%, _)

Fix is to improve the translation as follows:

Contains:
(CHARINDEX(pattern, string) > 0) OR pattern = ''

StartsWith:
(string LIKE pattern + "%" AND CHARINDEX(pattern, string) = 1) OR pattern = ''

EndsWith:
(RIGHT(string, LEN(pattern) = pattern) OR pattern = ''

We can sometimes remove the final term (pattern = ''), i.e. when the pattern is a constant. If it's value is '', we just return true, otherwise the final term is redundant and will not be generated.
We can't really do same trick for parameters without changing the way caching works (currently it only looks at the nullability of the parameter, but in this case the value itself is important)
maumar added a commit that referenced this issue Jul 29, 2016
…With and Contains

We were returning wrong results for StartsWith, EndsWith and Contains with patters containing wildcard characters (%, _)

Fix is to improve the translation as follows:

Contains:
(CHARINDEX(pattern, string) > 0) OR pattern = ''

StartsWith:
(string LIKE pattern + "%" AND CHARINDEX(pattern, string) = 1) OR pattern = ''

EndsWith:
(RIGHT(string, LEN(pattern) = pattern) OR pattern = ''

We can sometimes remove the final term (pattern = ''), i.e. when the pattern is a constant. If it's value is '', we just return true, otherwise the final term is redundant and will not be generated.
We can't really do same trick for parameters without changing the way caching works (currently it only looks at the nullability of the parameter, but in this case the value itself is important)
maumar added a commit that referenced this issue Jul 29, 2016
…With and Contains

We were returning wrong results for StartsWith, EndsWith and Contains with patters containing wildcard characters (%, _)

Fix is to improve the translation as follows:

Contains:
(CHARINDEX(pattern, string) > 0) OR pattern = ''

StartsWith:
(string LIKE pattern + "%" AND CHARINDEX(pattern, string) = 1) OR pattern = ''

EndsWith:
(RIGHT(string, LEN(pattern) = pattern) OR pattern = ''

We can sometimes remove the final term (pattern = ''), i.e. when the pattern is a constant. If it's value is '', we just return true, otherwise the final term is redundant and will not be generated.
We can't really do same trick for parameters without changing the way caching works (currently it only looks at the nullability of the parameter, but in this case the value itself is important)
maumar added a commit that referenced this issue Jul 29, 2016
…With and Contains

We were returning wrong results for StartsWith, EndsWith and Contains with patters containing wildcard characters (%, _)

Fix is to improve the translation as follows:

Contains:
(CHARINDEX(pattern, string) > 0) OR pattern = ''

StartsWith:
(string LIKE pattern + "%" AND CHARINDEX(pattern, string) = 1) OR pattern = ''

EndsWith:
(RIGHT(string, LEN(pattern) = pattern) OR pattern = ''

We can sometimes remove the final term (pattern = ''), i.e. when the pattern is a constant. If it's value is '', we just return true, otherwise the final term is redundant and will not be generated.
We can't really do same trick for parameters without changing the way caching works (currently it only looks at the nullability of the parameter, but in this case the value itself is important)
maumar added a commit that referenced this issue Aug 4, 2016
…With and Contains

We were returning wrong results for StartsWith, EndsWith and Contains with patters containing wildcard characters (%, _)

Fix is to improve the translation as follows:

Contains:
(CHARINDEX(pattern, string) > 0) OR pattern = ''

StartsWith:
(string LIKE pattern + "%" AND CHARINDEX(pattern, string) = 1) OR pattern = ''

EndsWith:
(RIGHT(string, LEN(pattern) = pattern) OR pattern = ''

We can sometimes remove the final term (pattern = ''), i.e. when the pattern is a constant. If it's value is '', we just return true, otherwise the final term is redundant and will not be generated.
We can't really do same trick for parameters without changing the way caching works (currently it only looks at the nullability of the parameter, but in this case the value itself is important)
maumar added a commit that referenced this issue Aug 4, 2016
…With and Contains

We were returning wrong results for StartsWith, EndsWith and Contains with patters containing wildcard characters (%, _)

Fix is to improve the translation as follows:

Contains:
(CHARINDEX(pattern, string) > 0) OR pattern = ''

StartsWith:
(string LIKE pattern + "%" AND CHARINDEX(pattern, string) = 1) OR pattern = ''

EndsWith:
(RIGHT(string, LEN(pattern) = pattern) OR pattern = ''

We can sometimes remove the final term (pattern = ''), i.e. when the pattern is a constant. If it's value is '', we just return true, otherwise the final term is redundant and will not be generated.
We can't really do same trick for parameters without changing the way caching works (currently it only looks at the nullability of the parameter, but in this case the value itself is important)
@divega
Copy link
Contributor Author

divega commented Feb 8, 2017

@maumar Well, I wasn't advocating for sniffing parameter values, but since you mentioned it 😄: I assume you mean that things are not structured in a way that we can easily sniff parameters values for StartsWith() calls alone and record whether they contains wildcard characters, and so we we would end up having to do it for all parameters on every query, correct?

Re sniffing constants, as @roji mentioned, we are already doing it at https://github.com/aspnet/EntityFramework/blob/dev/src/Microsoft.EntityFrameworkCore.SqlServer/Query/ExpressionTranslators/Internal/SqlServerStartsWithOptimizedTranslator.cs#L39. I think we could consider removing the unnecessary CHARINDEX() condition if there are no wildcard in the constant.

I do see that there is a trade off between query compilation and SQL performance that we don't understand very well. And I suspect the real world scenario here is parameters, despite the fact that all the sample queries @jemiller0 provided in #7429 use constants.

@maumar
Copy link
Contributor

maumar commented Feb 8, 2017

@divega yes, just how now we are looking at nullabilities of all parameters and not only the ones involved in null semantics-relevant operations.

@jemiller0
Copy link

I have a newbie question. I'm not even sure how you would use a parameter with EF. I'm writing LINQ queries. As far as I knew, EF always just converted the LINQ queries to SQL queries with constants. However, the queries in my examples were generated using Dynamic LINQ. So, maybe that has something to do with it. I'm using Telerik's RadGrid UI component for ASP.NET Web Forms which generates the WHERE and ORDER BY clauses based on user input. Maybe if I weren't using that, the queries would be using parameters by default. I did notice that calls to .Skip() and .Take() were using parameters.

@maumar
Copy link
Contributor

maumar commented Feb 8, 2017

@jemiller0 in general you can force parameters into your queries by using variables in your query, e.g.

var id = 7;
var query1 = ctx.Customers.Where(c => c.Id == id); // will produce query with parameter
var query2 = ctx.Customers.Where(c => c.Id == 7); // will produce query with constant

@jemiller0
Copy link

@maumar Thanks. I just tested it and see what you mean. I didn't realize that's how it worked. Does EF Core use server-side prepared statements for these? As far as I know EF 6 didn't, at least for INSERT/UPDATE statements and I always wished it did because I think it would have improved performance.

@divega
Copy link
Contributor Author

divega commented Feb 8, 2017

What @maumar said. The general rule is that variables that get captured in lambda expressions are translated as parameters. Skip() and Take() are an exception because their arguments are not lambda expressions but they get parameterized anyway.

Thanks for the data point that Telerik's component generate arguments as constants.

@divega
Copy link
Contributor Author

divega commented Feb 8, 2017

EF6 uses a similar set of rules.

@jemiller0
Copy link

I had to do create a function to clean up the Dynamic LINQ strings that Telerik produces. By default they appeared to be more geared towards LINQ to Objects. I had to clean them up a bunch so that they would get translated into something more reasonable that would work better for use with EF. Since it's using Dynamic LINQ, I'm not sure how I would tell it to use parameters if I wanted to. Note, I'm using the grid's "advanced binding" method. I used to use EntityDataSource, but, found that the advanced method was more flexible and also EntityDataSource isn't supported with EF Core as far as I know. I much prefer using LINQ. The only thing I'm leery of is using Dynamic LINQ which is barely supported by Microsoft. It was in some example code from Microsoft, but, someone made a NuGet package out of it which is what I'm using. As far as I know Entity SQL isn't supported in EF Core. It would be great if Dynamic LINQ was something that was officially supported.

@roji
Copy link
Member

roji commented Feb 8, 2017

@jemiller0 AFAIK EF Core currently doesn't ever prepare statements (this is tracked by #5459). Note, however, that prepared statements don't really do much in SqlServer which transparently caches query plans server-side. For PostgreSQL, where preparing is vital for performance, Npgsql 3.2.0 (just released) introduces an automatic preparation feature at the driver level, where frequently-used queries will be implicitly prepared without EF Core needing to do anything about it.

@roji
Copy link
Member

roji commented Feb 8, 2017

Thanks for the valuable discussion. To summarize, at least in Npgsql I'm going to have the StartsWith() translator:

  1. Check whether the pattern is constant or not.
  2. If constant, escape everything client-side in C# and send a simple LIKE (with backslash being the PostgreSQL default escape character)
  3. Otherwise (parameters, store values), send LIKE AND STRPOS (PG's CHARINDEX equivalent) just like you guys are doing today.

As a very minor implementation note, wouldn't it be slightly better to to replace CHARINDEX with LEFT(LEN(<pattern>)), similar to how EndsWith() is currently implemented? This would avoid going through the entire string, searching for the pattern.

roji added a commit to npgsql/efcore.pg that referenced this issue Feb 8, 2017
@jemiller0
Copy link

@roji I think you found the solution to the problem I've been having with .StartsWith() on SQL Server. I just tried using LEFT(LEN()) like you mentioned and it is FASTER that doing a LIKE alone. 0 seconds versus 2 seconds on the problem query that I had. Awesome!

Also, thanks for he info on prepared statements. I think they would still be good to have for INSERT and UPDATE statements.

BTW, I started playing around with PostgreSQL recently and was impressed by what a small memory footprint it has, at least in the default configuration. Also, of the DBMSs that I've tested, I have found it to have the fastest INSERT speed. It is as if it streams the data directly to disk. The main issue that I have with it is that as far as I know, there's no way to configure it for case insensitivity like you can with SQL Server, at least on Windows. I guess it works on other platforms, but, I haven't had a chance to try it yet.

At any rate, unless I'm missing something, it looks like if the SQL Server provider could use the LEFT(LEN()) method that you mentioned, it looks like it could indeed speed things up. I haven't had to test it much, but, right off the bat, it looks like it cured the problem that I had with my problem query. Maybe it allows the backend to use a hash table lookup or something.

@roji
Copy link
Member

roji commented Feb 8, 2017

@jemiller0 uh, happy to have helped although I didn't really mean to :) Please note that my proposal was to use LIKE AND LEFT(LEN()) and not LEFT(LEN()) alone, simply because I assumed that LEFT(LEN()) wouldn't be index-optimized. This retains the original logic of using LIKE first for speed, then filtering out false positives with something (CHARINDEX or LEFT(LEN())). So I'm not clear if you're still doing LIKE AND LEFT(LEN()) or have switched to LEFT(LEN()) on its own. I'll let the EF Core team comment further on the usefulness of LEFT(LEN()) for SqlServer.

Regarding prepared statements, I don't think there's any relevance to the statement type (SELECT, INSERT, UPDATE...). All of them greatly benefit from preparation in PostgreSQL, whereas in SqlServer I'm assuming all statement types are implicitly cached without the need for explicit preparation (here's a doc page on this). It may be worth testing to confirm actual SqlServer behavior. Regardless, it seems like a good to implement preparation in EF Core simply to have all providers benefit from it, and EF Core may be in a good position to know what to prepare and what not to prepare. If you want to continue this conversation it may be better to do so in #5459.

Regarding case sensitivity in PostgreSQL, unquoted identifiers are always folded to lowercase, whereas quoted ones maintain case (this is why Npgsql EF Core provider systematically quotes all identifiers). Since this is also off-topic feel free to open an issue in the Npgsql repo to continue discussing.

@jemiller0
Copy link

When I tested it, I used LIKE and LEFT(LEN()) together. The query took 00:00:00.789. I just tried the query without LIKE and it took 00:00:02.021. I ran it several times and it looks like it's consistently faster with LIKE included. So, having the LIKE seems to improve performance. Otherwise, it would be nice to get rid of it altogether and then not have to worry about escaping special characters.

With regard to prepared statements, the main thing I'm thinking about with regard to INSERT/UPDATE statements is that you are only sending the parameters after the first time through. So, it would cut down network traffic.

Regarding case sensitivity, the issue is not case sensitivity of identifiers (I ran into that annoyance as well, but, it's less of an issue), but, of the data in the database itself. I.e. you need to use special PostgreSQL specific column types to get case insensitivity. I.e. you can't configure a case insensitive collation like SQL Server uses by default. Or at least that appears to be the case on Windows. I'm assuming it may work on other platforms.

@roji
Copy link
Member

roji commented Feb 9, 2017

@jemiller0 re prepared statements, what you say it just as true of SELECT as it is of INSERT/UPDATE: it's very beneficial to have PostgreSQL parse a SELECT statement once and then send only parameters after the first time.

Re case sensitivity you're right, and I'm pretty sure it's no different on non-Windows platforms. You do have the citext extension (which I think you were referring to), but as you say it's not the same as a global case insensitive collation.

@jemiller0
Copy link

@roji You're right about SELECTs. I wasn't thinking correctly. I was thinking about it in the context of some loader applications that I have where I have 1 select and millions of INSERTS/UPDATEs, but, you are right. Especially, when you have things like N+1 queries happening.

@rmacfadyen
Copy link

The trailing "OR @ParamValue = ''" might break "sargablity". I think, if I'm reading things correctly, this is a bit of left over code from when it was using CHARINDEX (that method needed the OR.... the "LIKE" method does not). See @divega comment on 7 Feb (#474 (comment)).

My quick and dirty testing in SQL seems to support this. Here's the setup:

create table LikeTest (
	Value varchar(50) not null primary key
)

declare @i int set @i = 1

while @i < 10000 begin
	insert into LikeTest (Value) values (cast(@i as varchar(8)))

	set @i = @i + 1
end

Here's the test:

declare @p varchar(8) set @p = '23'

select 
	*
from LikeTest
where Value like @p +'%' and left(Value, len(@p)) = @p --or @p = ''

With the "OR" commented out you get:
Table 'LikeTest'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
With the "OR" included you get:
Table 'LikeTest'. Scan count 1, logical reads 41, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

@divega
Copy link
Contributor Author

divega commented May 3, 2018

@rmacfadyen thanks for raising this. We have created #11881 to track further optimizations of this translation.

natemcmaster pushed a commit that referenced this issue Oct 31, 2018
dsyalex added a commit to DiscoveryGC/DiscoveryApi that referenced this issue Mar 13, 2019
Our Entity Framework MySQL connector had been producing SQL like this:
(`c`.`player_name` LIKE CONCAT('L\-','%') AND
(POSITION('L\-' IN `c`.`player_name`) = 1))

Neglecting to escape backslashes in the LIKE, and not returning any players of
the actual faction.

This fails MySQL's requirements for LIKEs -
From https://dev.mysql.com/doc/refman/5.6/en/string-comparison-functions.html:
> Because MySQL uses C escape syntax in strings (for example, \n to represent a
> newline character), you must double any \ that you use in LIKE strings. For
> example, to search for \n, specify it as \\n. To search for \, specify it as
> \\\\; this is because the backslashes are stripped once by the parser and
> again when the pattern match is made, leaving a single backslash to be
> matched against.

This becomes eight backslashes in our code because C# needs them escaping too,
so it becomes 2^3=8 backslashes.

However, we cannot just do the escaping ourselves in our code, because of
that extra POSITION call added by Entity Framework (see
dotnet/efcore#474) which would be broken
by the escaping.
Therefore, what we instead do is add another special case around faction tags,
this time to do the faction name check for this particular group as raw SQL,
producing a query that looks more like this:
      SELECT ...
      FROM (
          SELECT * FROM server_sessions WHERE player_name LIKE 'L\\\\-%'
      ) AS `c`
      WHERE ... session_start stuff ...
      ORDER BY `c`.`session_id`

Credit to @westonpace for help verifying I'm not going mad and working around
this problem.
@roji
Copy link
Member

roji commented May 27, 2022

@vishalsinhvaghela that looks like a bug in the MySQL provider, please report it to the maintaniers of that provider.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-query closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. providers-beware type-bug
Projects
None yet
Development

No branches or pull requests

10 participants