Pomelo's 2.2.6+ string comparison methods are case-sensitive by default (in contrast to other providers, which use the database table settings) #996
We can't really tell without seeing the actual LINQ query, but I suspect it looks something like this:

context.OleDsBibt
    .Include(o => o.UcBibExt)
    .Where(u => u.Title.StartsWith("death"))

Pomelo versions below 3.0.0 translate this LINQ query to a case-insensitive comparison (depending on the collation of the underlying column).

Pomelo 3.0.0+ does this correctly, by converting the string first to UTF-8 and then enforcing a binary (case-sensitive) collation on it before applying the comparison. So if you want a case-insensitive comparison, use the overload with an explicit StringComparison:

context.OleDsBibt
    .Include(o => o.UcBibExt)
    .Where(u => u.Title.StartsWith("death", StringComparison.OrdinalIgnoreCase))

In most cases, it is a good idea to take the following paragraph of the EF Core docs to heart and be as explicit as possible when calling string comparison methods.
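To illustrate the LINQ to Objects semantics being mirrored here (the sample data is made up purely for illustration): StartsWith without a StringComparison argument performs an ordinal, case-sensitive comparison, and only the explicit overload opts into ignoring case:

```csharp
using System;
using System.Linq;

class StartsWithDemo
{
    static void Main()
    {
        // Hypothetical sample data, loosely modeled on the titles above.
        var titles = new[] { "Death in Venice", "death of a salesman", "Deadline" };

        // No StringComparison: ordinal, case-sensitive (the behavior Pomelo 3.0.0+ mirrors in SQL).
        var sensitive = titles.Count(t => t.StartsWith("death"));

        // Explicit StringComparison: case-insensitive.
        var insensitive = titles.Count(
            t => t.StartsWith("death", StringComparison.OrdinalIgnoreCase));

        Console.WriteLine(sensitive);   // 1
        Console.WriteLine(insensitive); // 2
    }
}
```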
@lauxjpn Thanks for the quick and detailed response. While I agree that the new behavior is more precise and consistent with LINQ to Objects, IMHO it isn't a desired behavior. If I wanted case-sensitive comparisons, I would have configured the database to be case-sensitive. It would be nice if there was a configuration option that could be used to disable the new behavior. The reason I didn't supply the C# code for the LINQ query is that I'm using Telerik's RadGrid UI control and Dynamic LINQ. The UI component has built-in filter controls and gives you a LINQ WHERE clause string that you can add using Dynamic LINQ. Long story short, in my case it is not as easy as using the StringComparison enum overload of StartsWith(). I can probably use a regex and add it to the string. I will try that next. Hopefully, using the StringComparison enum is compatible with other EF Core providers, as I use PostgreSQL and SQL Server as well.
I tried it and it appears that Dynamic LINQ (the System.Linq.Dynamic NuGet package) doesn't support the StartsWith() overloaded method. The original Dynamic LINQ expression that works is:

The new Dynamic LINQ expression with a case-insensitive comparison:

results in:

Another way to do it is:
However, the extra call to UPPER() or the conversions appear to cause it not to use the index. This is a table with 8 million rows; it needs the index. So, most likely, I will need to test the new interceptor functionality in EF Core 3 and remove the extra conversion logic. While well-intentioned, it is not helpful at all. In fact, it is a total pain and is the kind of thing that is likely going to drive me to switch to Dapper. Every new version of EF Core is a train wreck and a constant source of breaking changes.
@jemiller0 I have to correct myself here in regard to previous edits of this post. It looks like Pomelo is the only provider that is actually translating the string methods with a StringComparison parameter at all, and therefore the only provider that supports these overloads.
That is correct. In fact, Pomelo had been using some
While Pomelo's implementation is technically correct and the other implementations (including SQL Server's) are not, I have to agree now that because no other provider has made the move to support this yet, this is less of a feature and more of an issue when using multiple providers.
As mentioned above, this is not an EF Core issue in this case, but just Pomelo being ahead of everybody else. With EF Core having reached 3.0, there are going to be far fewer breaking changes moving forward than in the past. The same is true for Pomelo, which is quite stable now, though we will still introduce breaking changes in major versions to improve the quality of the provider.

Proposed fix
Because we diverge from everybody else's implementation, we will most likely need to do exactly what you propose. We should discuss the pros and cons of doing it as an opt-in or opt-out setting.

Current Workarounds

Usage of interceptor and regular expression

The simplest (and probably most compatible with future versions of Pomelo) workaround will be to just replace any occurrence of the generated case-sensitive conversion in the SQL with a regular expression, using an interceptor.

Modifying the expression tree

The following approach works for Pomelo, if you can get access to the query (the IQueryable). You can just modify the expression tree generated by Dynamic LINQ and e.g. swap the simple string method calls with their StringComparison overloads. You must activate this behavior only when using Pomelo, because other providers might not support a StringComparison argument.

The following fully functional console program demonstrates this approach:

Program.cs

using System;
using System.Diagnostics;
using System.Linq;
using System.Linq.Expressions;
using System.Reflection;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Logging;
namespace IssueConsoleTemplate
{
public class IceCream
{
public int IceCreamId { get; set; }
public string Name { get; set; }
}
public class Context : DbContext
{
public DbSet<IceCream> IceCreams { get; set; }
protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
optionsBuilder
.UseMySql("server=127.0.0.1;port=3306;user=root;password=;database=Issue996")
.UseLoggerFactory(LoggerFactory.Create(b => b.AddConsole()))
.EnableDetailedErrors()
.EnableSensitiveDataLogging();
}
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
modelBuilder.Entity<IceCream>().HasData(
new IceCream
{
IceCreamId = 1,
Name = "Strawberry",
},
new IceCream
{
IceCreamId = 2,
Name = "Vanilla & Berry",
},
new IceCream
{
IceCreamId = 3,
Name = "Berry's Chocolate Brownie",
}
);
}
}
public class DynamicLinqStringMethodRewriter : ExpressionVisitor
{
private static readonly string[] SupportedStringMethods =
{
nameof(string.StartsWith),
nameof(string.EndsWith),
nameof(string.Contains),
};
public Expression Rewrite(Expression expression)
{
return Visit(expression);
}
protected override Expression VisitMethodCall(MethodCallExpression node)
{
foreach (var supportedStringMethod in SupportedStringMethods)
{
if (node.Method == typeof(string).GetRuntimeMethod(
supportedStringMethod,
new[] {typeof(string)}))
{
var methodWithComparisonParameter = typeof(string).GetRuntimeMethod(
supportedStringMethod,
new[] {typeof(string), typeof(StringComparison)});
var arguments = node.Arguments.ToList();
arguments.Add(Expression.Constant(StringComparison.OrdinalIgnoreCase));
return Expression.Call(
node.Object,
methodWithComparisonParameter,
arguments);
}
}
return base.VisitMethodCall(node);
}
}
public static class DynamicLinqQueryableExtensions
{
public static IQueryable<T> MakeCaseInsensitive<T>(this IQueryable<T> query)
=> query.Provider.CreateQuery<T>(new DynamicLinqStringMethodRewriter()
.Rewrite(query.Expression));
}
internal class Program
{
private static void Main()
{
using var context = new Context();
context.Database.EnsureDeleted();
context.Database.EnsureCreated();
var startsWithResult = context.IceCreams
.Where(i => i.Name.StartsWith("berry"))
.MakeCaseInsensitive()
.ToList();
var endsWithResult = context.IceCreams
.Where(i => i.Name.EndsWith("berry"))
.MakeCaseInsensitive()
.ToList();
var containsResult = context.IceCreams
.Where(i => i.Name.Contains("berry"))
.MakeCaseInsensitive()
.ToList();
// With modifying the expression tree:
Debug.Assert(startsWithResult.Count == 1);
Debug.Assert(endsWithResult.Count == 2);
Debug.Assert(containsResult.Count == 3);
// Without modifying the expression tree:
// Debug.Assert(startsWithResult.Count == 0);
// Debug.Assert(endsWithResult.Count == 1);
// Debug.Assert(containsResult.Count == 1);
}
}
}

Remapping of string comparison methods
Pomelo's behavior prior to 3.0.0 was the following:
Pomelo's behavior since 3.0.0 is the following:
Because Pomelo translates these string methods case-sensitively by default, I see the following pros and cons for keeping the default implementations case-sensitive and adding an opt-in setting for compatibility with other providers. The opt-in setting would make all the default implementations case-insensitive.

Pros:
Cons:
On the other hand, if all major providers are going to keep their default implementations (using the underlying collation) in the future, we could theoretically think about making this an opt-out setting instead and breaking with the current implementation. Anybody have any thoughts on that? /cc @mguinness, @roji, @ajcvickers
@lauxjpn Thanks for the extremely detailed and very helpful response. I apologize for the negativity in my last post. The code that you posted to fix up the expression tree is above and beyond the call of duty and is awesome. I just tested the Pomelo 2.2 provider with a Contains() filter, and it did the search case-insensitively. One thing I realized is that string.Contains() with the StringComparison parameter only exists in .NET Core; .NET Framework doesn't have it. I'm surprised that it doesn't. I'm using ASP.NET Web Forms, so I'm stuck on .NET Framework. I figured I would need to use string.IndexOf() instead. So far I've always just used Contains(). As I mentioned, the UI component that I'm using generates a Dynamic LINQ string for me that I add to my query using the Where() extension method that System.Linq.Dynamic provides that accepts a string. I.e., I'm literally passing in a string like the following rather than a lambda function.
I don't see a "using System.Linq.Dynamic" in your code. Maybe we are referring to different types of "Dynamic LINQ"? The package I'm using (System.Linq.Dynamic) was originally a sample that Microsoft created for LINQ. Then, someone outside Microsoft created a NuGet package for it. As far as I know, it's not really supported, so I know I'm skating on thin ice there. Its apparent lack of support for the StringComparison overloaded methods is an issue with that package, not Pomelo. I was able to use the following interceptor to remove the extra convert.
It's a little dangerous doing a text replace like that, but it seems to be working. If you could add a configuration option to the provider to use the old behavior, it would be appreciated. Otherwise, I'm happy to use the interceptor for the time being. Failing that, I can use the expression tree fix that you provided. That's a more reliable way to do it. Thanks for all the effort you put into writing that up. It's greatly appreciated, and I greatly appreciate you considering adding a configuration option.
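The interceptor snippet itself did not survive in this thread, but the text replacement it describes can be sketched as follows (the class name and regex here are hypothetical illustrations, not the original code, and such a replace is only as reliable as the exact SQL pattern Pomelo emits):

```csharp
using System;
using System.Text.RegularExpressions;

static class SqlConvertStripper
{
    // Hypothetical pattern: matches Pomelo's generated
    // "CONVERT(<expr> USING utf8mb4) COLLATE utf8mb4_bin" wrapper.
    static readonly Regex ConvertPattern = new Regex(
        @"CONVERT\((?<inner>[^()]+) USING utf8mb4\) COLLATE utf8mb4_bin",
        RegexOptions.IgnoreCase);

    // Replaces the whole wrapper with just the inner expression, so the
    // column's own (case-insensitive) collation applies again.
    public static string StripCaseSensitiveConversions(string sql)
        => ConvertPattern.Replace(sql, m => m.Groups["inner"].Value);
}

class Demo
{
    static void Main()
    {
        var sql = "`Title` LIKE CONVERT('death%' USING utf8mb4) COLLATE utf8mb4_bin";
        Console.WriteLine(SqlConvertStripper.StripCaseSensitiveConversions(sql));
        // `Title` LIKE 'death%'
    }
}
```

In an EF Core 3 DbCommandInterceptor, such a replace would be applied to the command's CommandText before execution.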
We are definitely adding some kind of option. It will depend on how methods like these end up being handled by EF Core itself.
You are right. It seems the test is inherited from EF Core's specification test base class, which defines it as:

return AssertQuery<Customer>(
    isAsync,
    cs => cs.Where(c => c.ContactName.Contains("M")),
    entryCount: 19);

So it uses the Contains overload without an explicit StringComparison. The same query is also executed by LINQ to Objects against a local client-side store, and then the results from the database query and the local query are compared with each other. The Northwind database serves as the data source for these tests.

Looking at this now though, the test does not explicitly state that the comparison is expected to be case-insensitive. So if Pomelo's Northwind database uses a case-insensitive collation by default, while SQL Server's uses a case-sensitive collation by default, this phenomenon could also be explained and would suddenly not lead to the conclusion that Contains is supposed to behave case-insensitively.

That is probably what happened here and is the reason the behavior was working in the same way as in other providers (using the column's collation) until 2.2.6. I took another look into this, though only the EF Core team can say for sure. /cc @ajcvickers, @roji

This is also of interest to us now, because I already converted the corresponding tests on our side.
I did not really test it against Dynamic LINQ, but because Dynamic LINQ needs to return an IQueryable, the following should work:

var q = oleContext.Bibs
    .AsQueryable()
    .Where("(BibExt.Contains(\"death\"))")
    .MakeCaseInsensitive();
Looks good, is easy to understand, and is simple enough to maintain, as long as you don't use any explicit case-sensitive querying in your code. I would stick with that until we have decided how to solve this issue.

Don't worry about it. We have all been there, where you urgently need to implement something and then realize that some other fundamental functionality isn't working. So you spend way too much time until you are sure that it's actually some library that is the culprit. And then you have to spend even more time to ensure that this issue hasn't been reported yet, wondering why such basic functionality is apparently not used by anybody else, and whether you are the only one using the library in production code. And then you still need to post a detailed bug report (by which time your blood sugar is already way too low), only to find out that either nobody answers at all, nobody takes you seriously, or somebody just denies that this is actually a problem. I think Hal fixing a light bulb is the best example of this kind of yak shaving.
Thanks again @lauxjpn. On a separate note, it is on my to-do list to try to get case-insensitive searches working on PostgreSQL. I'm working on another project that uses a PostgreSQL database with a schema that I have no control over. As far as I know, there is no way to configure PostgreSQL at the database level for case-insensitivity. PostgreSQL 12 has some new support for case-insensitive collations, but there aren't predefined collations like MySQL and other databases have. As far as I can tell, there is no way to set it at the database level, or even the table level. PostgreSQL does have a CITEXT data type. However, as I mentioned, it is an existing database schema that I have no control over. The way the database is set up, it has indexes on the columns that use the LOWER() function. So, wherever you have a WHERE with a filter, you need to use LOWER() for the index to be used. I've been wondering if there is a way to configure the PostgreSQL EF Core provider to automatically add those. Luckily, I think maybe I can do it using the new interceptor functionality. Though, again, using regexes with a text replace may not be reliable. I'm surprised that PostgreSQL doesn't have a way to configure a case-insensitive collation at the database level like every other DBMS I've worked with does. I don't like the way it wants you to use a non-standard SQL data type (CITEXT) for it. Actually, maybe I can do something like what you did with the expression tree for that. I will have to look into that further. Thanks again for your help and have a nice holiday.
I think Npgsql does support ILIKE:
So depending on your scenario, if you are using EF Core directly in the traditional way, you can just use it like this:

.Where(c => EF.Functions.ILike(c.Name, "foo%"))

Or when using EF Core with Dynamic LINQ, you might be able to do some switcheroo in the spirit of the rewriter of my previous example, and replace the string method calls with EF.Functions.ILike calls. But @roji is the expert and maintainer of Npgsql and will know more about these implementation details. I wish you some nice holidays as well!
A bit of EF Core and PostgreSQL context on this complex and oft-discussed problem :) While it's true that EF strives to mimic LINQ to Objects whenever possible, as a matter of principle we don't do that in all places and at all costs, especially where the SQL involved would be inferior performance-wise and/or too complicated/brittle. I'd add that in my opinion, if a translation of a very trivial/obvious LINQ expression (such as string equality) would result in potentially badly performing SQL (i.e. no index use although that would be reasonably expected), there may be reason to avoid that translation even if it's more correct. Simply put, we tend to avoid coercing databases into behavior which isn't native/natural to them. String comparison is, unfortunately, a domain where databases have very different behaviors, and where C# behavior isn't always possible to replicate. In some databases case sensitivity is determined in advance on a per-database basis (via a collation definition), in others on a column-by-column basis, in others (e.g. PostgreSQL) via a different database type. Determining comparison behavior in advance is typically required because databases have indexes (which need to be set up and maintained in advance), whereas C# doesn't. That's why EF Core providers typically don't attempt to provide case-insensitive comparisons where the database doesn't provide that naturally and efficiently. We also think that users of a specific database are minimally familiar with their database's behavior, and some divergence from LINQ to Objects behavior is tolerable in these cases. Despite the above, IMHO it can still make sense to support comparison operations that explicitly specify a StringComparison - even when the SQL translation would be inefficient - since the user is very explicitly opting into that specific behavior.
But what's important IMHO is that the default behavior (without StringComparison) results in efficient SQL, otherwise we're creating a pretty huge pit of failure for users. @lauxjpn I'm not familiar enough with MySQL and user expectations. But if the translation of the equals operator (or the StartsWith overload without a StringComparison) involves inserting additional SQL nodes (UPPER/LOWER/CONVERT) to force case-sensitivity, and those nodes would typically prevent an index from being used, then it may be worth reconsidering... In any case I'm sure @ajcvickers will have some thoughts on this.
I don't think that's quite how the CITEXT type works: the whole point is that the LOWER details are handled behind the scenes by PostgreSQL, so you (and the EF provider) don't have to worry about it. When you perform a string comparison on a CITEXT column, it will automatically use any indexes on that column without having LOWER specified in the SQL. Here's some SQL to prove the point:

CREATE EXTENSION citext;
CREATE TABLE data (id INT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY, text TEXT, citext CITEXT);
CREATE INDEX IX_text ON data (text);
CREATE INDEX IX_citext ON data (citext);
DO $$BEGIN
FOR i IN 1..10000 LOOP
INSERT INTO data (text, citext) VALUES ('text' || i, 'citext' || i);
END LOOP;
END$$;
EXPLAIN SELECT id FROM data WHERE text = 'TEXT8';
EXPLAIN SELECT id FROM data WHERE citext = 'CITEXT8';

The output of EXPLAIN shows an index scan using ix_citext as desired, and the result is correct. ILIKE is an additional operator that works on both TEXT and CITEXT types, so it's not exactly related; LIKE on a CITEXT should also be case-insensitive. Note that version 3.1 of the Npgsql provider also added some important fixes for CITEXT handling (see npgsql/efcore.pg#388). @jemiller0 are you seeing any different behavior?

@jemiller0 one last point about removing the CONVERT calls: rather than running a regex replace, it would probably be better to hook into EF Core's translation, either providing your own translation for StartsWith (or whatever) without CONVERT, or scanning the tree afterwards and removing unwanted Convert nodes. This would be much less brittle and would also be more efficient (no need to run a regex on each execution).
@roji So the general consensus would then be to keep the default string method behavior (for methods without an explicit StringComparison parameter) as it is, i.e. depending on the underlying database.

It would then be consistent across providers that these methods depend on the actual settings of the underlying database (which might be under the control of the user) and therefore could always work in an optimized way (but might not return the same results across different DBMS). I agree that if the database supports explicit case in-/sensitivity when comparing, while still making use of indices (which we have implemented in Pomelo for most cases), then implementing the methods with an explicit StringComparison parameter makes sense. If the database cannot make use of indices, then the user could also just add explicit conversion calls themselves. Let's see what @ajcvickers thinks about that.

In case this is how it should be, then it might either make sense for EF Core to document that these methods are likely dependent on the underlying database (though this is technically true for every method), or for providers to document their individual behavior of those methods. Personally, I think it would be great to have a provider-specific list of all the translatable methods (and document behaviors where it makes sense), so users can easily look up what is supported.
The two workarounds I posted above do basically that. One translates the method calls differently in EF Core's query pipeline, while the other does not post-process the finished expression tree, but instead pre-processes the LINQ expression before it is handed off to the query pipeline.
Thanks @roji. So far, I haven't used it. I did think it was pretty cool how you can explicitly do a case-sensitive search in MySQL though, using an explicit binary collation.
AFAIK, as a general rule, once you have an index defined on a string column, then simple string operations on that column will take advantage of that index: comparison, length, LIKE, etc. However, once a function such as LOWER/UPPER or CONVERT is applied to the column, the index will no longer be used. It's still possible for the user to define an expression index specifically for that function invocation on that column, but that's quite specific to the operation in question, requires relatively advanced user understanding and also has a cost during row modifications. This is why I prefer that users add ToLower/ToUpper very explicitly - expressing the added operation (which typically prevents index use) rather than for a provider to implicitly add them. In any case, @lauxjpn as you say, let's see with @ajcvickers (and probably the rest of the team).
I agree. I tried to do this partially for Npgsql (e.g. this example; there's more in other pages), but it's definitely not systematic, and it's missing from SQL Server and Sqlite. On the other hand, since we dropped client evaluation in 3.0, users can simply try an expression out to see if it works.
That doesn't make much sense to me. Sure, a pure case-sensitive comparison is faster than case-insensitive, since in principle characters can be simply compared (although Unicode introduces some important complications here - there can be multiple ways to encode the same character). However, as I wrote above, the most important factor in database performance is index usage, and using functions such as LOWER generally prevents index use.
Custom user functions are supported in both EF6 and EF Core. But I'd examine the performance impact of using them first.
That's true, there's no CITEXT(n). I'm not sure that should be a decisive factor for choosing database types, although that's going to depend on your application/scenario. A typical way to work is to have a code model where [StringLength] annotates the maximum length, regardless of whether that impacts your database schema or not; a web framework such as ASP.NET Core would pick up this attribute for length validation. This decouples your database schema from your web framework validation, even as you continue using a single code model as a source of truth for both.
I'm not sure I agree here; string comparisons aren't just about user-facing search boxes... It seems to me that in-database comparison should by default be case-sensitive, just like in most programming languages - but I agree that a case-insensitive option should be available and easy to use.
Yeah, I'm vaguely aware of this but haven't yet had time to look into it. I've opened #1175 to track looking into this.
Regarding case-sensitive searches being faster than case-insensitive ones, that is not something that I would give a lot of weight to myself. It's an argument that I have heard from a number of PostgreSQL users: that PostgreSQL should not support case-insensitive collations because it is slower, and that it is better to create all indexes using LOWER() and compare to lowercase. From an ease-of-use perspective, I don't agree with this. IMHO, in a perfect world, all databases would support case-insensitive and case-sensitive collations at the database level, and developers could use the same code and swap out back-end databases without having to make a bunch of back-end-specific changes. Honestly, given how many advanced features PostgreSQL has, I'm amazed that it lacks this functionality. That would be the chief complaint I have about it. There are a few other annoyances, like having to quote mixed-case identifiers (I really wish they would fix the idiotic behavior where it folds things to lowercase and doesn't preserve case). Other than that, it works well.
I checked this against our string comparison operations that are using explicit conversions and collations.

An index is only used if the collation of the indexed column matches. This goes so far in MySQL that even a conversion of the indexed column to the same charset and collation as itself will invalidate the index (though doing that to the other, literal operand will work):

-- City uses CHARSET utf8mb4 with COLLATION utf8mb4_0900_ai_ci
-- Does use index
SELECT SQL_NO_CACHE *
FROM `Customers`
where `City` like (convert('M%' using utf8mb4) collate utf8mb4_0900_ai_ci);
-- Does not use index
SELECT SQL_NO_CACHE *
FROM `Customers`
where (convert(`City` using utf8mb4) collate utf8mb4_0900_ai_ci) like 'M%';
-- Does not use index
SELECT SQL_NO_CACHE *
FROM `Customers`
where (convert(`City` using utf8mb4) collate utf8mb4_0900_ai_ci) like (convert('M%' using utf8mb4) collate utf8mb4_0900_ai_ci); But then, when doing an exact comparison, the index will additionally be used when the literal operand's collation is changed to a binary one: -- City uses CHARSET utf8mb4 with COLLATION utf8mb4_0900_ai_ci
-- Does use index
SELECT SQL_NO_CACHE *
FROM `Customers`
where `City` = (convert('Berlin' using utf8mb4) collate utf8mb4_bin);
-- Does not use index
SELECT SQL_NO_CACHE *
FROM `Customers`
where `City` = (convert('Berlin' using utf8mb4) collate utf8mb4_0900_as_cs);
-- Does not use index
SELECT SQL_NO_CACHE *
FROM `Customers`
where `City` like (convert('Berlin' using utf8mb4) collate utf8mb4_bin);
-- Does not use index
SELECT SQL_NO_CACHE *
FROM `Customers`
where `City` like (convert('B%' using utf8mb4) collate utf8mb4_bin); Finally, having an index column with a different collation than the default one and comparing it to a string literal that has not been assigned an explicit collation (though the implicit collation will be -- City uses CHARSET latin1 with COLLATION latin1_swedish_ci
-- MySQL 8.0+ uses CHARSET utf8mb4 with COLLATION utf8mb4_0900_ai_ci as a default.
-- Does use index
-- (latin1_swedish_ci, IMPLICIT) like (utf8mb4_0900_ai_ci, EXPLICIT) works
SELECT SQL_NO_CACHE *
FROM `Customers`
where `City` like 'M%';

Anyway, it seems we will need to reoptimize our string method handling for the column-to-column and column-to-literal comparison scenarios again, now that we support charset and collation annotations, and we might need to add some options to control their usage. And we need to make sure that our default implementations honor the database collations again, instead of using the explicit implementations that (depending on the scenario) might not use indices.
I thought about this a bit more, and I think we will categorize this divergence as a regression and will therefore revert it back to its pre-2.2.6 state right away as a patch release.
Thanks for looking into this @lauxjpn, it's great to see this kind of research on MySQL!
That makes a lot of sense. Any operation on a literal is done only once when processing the query, and can even be cached in a query plan (assuming one exists in MySQL) since it's effectively a constant - so there's no reason it would affect index usage. However, operations on a column work very differently, and since they must be performed on each and every row value they typically preclude index usage. Bottom line, for a simple C# comparison I'd expect to see the simple SQL equivalent:
... whose exact behavior would depend on how the database and/or column were set up, and which should always use a naively-configured index without any advanced techniques. But let's see if @ajcvickers confirms these ideas before making any big changes. @bricelam would also probably be interested.

Finally, for StartsWith specifically, note that a simple translation with LIKE is usually insufficient, since LIKE has escape characters (i.e. %, _), so some additional logic is needed to escape those (depending on whether the pattern is a constant or a parameter/column). Take a look at what SQL Server or PostgreSQL do.
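To make the escaping point concrete, here is a generic sketch (not the actual SQL Server or PostgreSQL translator code) of how a constant StartsWith pattern has to be sanitized before being embedded in a LIKE:

```csharp
using System;
using System.Text;

static class LikePatternHelper
{
    // Generic sketch: escape the LIKE wildcards (% and _) and the escape
    // character itself, assuming '\' as the LIKE escape character.
    public static string EscapeLikePattern(string value)
    {
        var builder = new StringBuilder(value.Length);
        foreach (var c in value)
        {
            if (c == '%' || c == '_' || c == '\\')
            {
                builder.Append('\\');
            }
            builder.Append(c);
        }
        return builder.ToString();
    }

    // StartsWith("100%") should match the literal prefix "100%", so the
    // pattern must become "100\%%" rather than "100%%".
    public static string StartsWithPattern(string prefix)
        => EscapeLikePattern(prefix) + "%";
}

class Demo
{
    static void Main()
    {
        Console.WriteLine(LikePatternHelper.StartsWithPattern("100%")); // 100\%%
    }
}
```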
We had a working implementation prior to the 3.0.0 upgrade (or prior to the 2.2.6 upgrade, regarding the methods discussed here).

While I am at it, I added some benchmarking tests to measure the performance of different implementations. As it turns out, our current LOCATE()-based translation is slower than a simple LIKE:

set @iterations = 100000000;
-- 14.516 sec.
SELECT BENCHMARK(@iterations, LOCATE('foo','barfoobar'));
-- 15.844 sec.
SELECT BENCHMARK(@iterations, LOCATE('foo','barfoobar') > 0);
-- 8.484 sec.
SELECT BENCHMARK(@iterations, 'barfoobar' like '%foo%');

So there is still some room for performance improvements.
@lauxjpn Actually, in my database, it seemed like an index was still being used even though you were casting to case-sensitive. The database is set to utf8mb4 case-insensitive, and you were casting to bin for case-sensitive. I would need to go back and double-check, but it's a table with 8 million rows and it was returning pretty fast. I was happy to see that it looked like it was still using the index. Maybe I'm wrong about that, as I didn't do an EXPLAIN.
@jemiller0 Yes, that is expected in your StartsWith case, because of the additional LIKE pre-filtering.

This is a simplified version of your query:

SELECT *
FROM `uc_bib_ext` AS `u`
WHERE `u`.`title` IS NOT NULL
AND `u`.`title` LIKE CONCAT('death', '%')
AND LEFT(`u`.`title`, CHAR_LENGTH(CONVERT('death' USING utf8mb4) COLLATE utf8mb4_bin)) =
CONVERT('death' USING utf8mb4) COLLATE utf8mb4_bin

The conversion with the binary collation happens after the result has already been filtered by the LIKE expression, which can use the index.

If on the other hand you would remove the pre-filtering (which is technically not necessary and would still lead to the same result), the query should perform significantly worse:

SELECT *
FROM `uc_bib_ext` AS `u`
WHERE `u`.`title` IS NOT NULL
/* AND `u`.`title` LIKE CONCAT('death', '%') */ /* Disable the pre-filter. */
AND LEFT(`u`.`title`, CHAR_LENGTH(CONVERT('death' USING utf8mb4) COLLATE utf8mb4_bin)) =
CONVERT('death' USING utf8mb4) COLLATE utf8mb4_bin;
@lauxjpn As far as I know, LOCATE() would never use an index, even without the CONVERT(). If it can use an index, that's news to me. I know with PostgreSQL, you can use a GIN index. I haven't heard of MySQL having that, but I will be happy to hear it if it does.
@jemiller0 I changed the query in my original post to be more in line with your original one. The point I am making here is exactly what you are saying: the moment you remove the pre-filtering (or make it useless, as in the binary-collated LIKE examples above), the index can no longer be used.

But it would definitely be interesting if you test those queries (or something similar) on your large database, compare the performance, and post the results. Maybe I am missing something!
I consider this unexpected, even though the SQL Server provider does implement this behavior. We will therefore override these tests. /cc @ajcvickers, @roji
@smitpatel There is another issue as well: the following code does not handle constant null predicates:

public void ApplyPredicate([NotNull] SqlExpression expression)
{
Check.NotNull(expression, nameof(expression));
if (expression is SqlConstantExpression sqlConstant
&& (bool)sqlConstant.Value)
{
return;
}
if (Limit != null
|| Offset != null)
{

In SQL, predicates can however result in NULL. For example, the following translation returns a LIKE expression with a null pattern, which evaluates to NULL:

// The pattern is constant. Aside from null or empty, we escape all special characters (%, _, \)
// in C# and send a simple LIKE
if (!(constantExpression.Value is string constantString))
{
return _sqlExpressionFactory.Like(
instance,
_sqlExpressionFactory.Constant(null, stringTypeMapping));
} So if we know in advance, that an expression will result in return _sqlExpressionFactory.Constant(null, RelationalTypeMapping.NullMapping); |
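The escaping mentioned in the code comment above (a constant pattern has its %, _ and \ escaped in C# so a simple LIKE can be sent) can be sketched in Python; the function name is illustrative, not the provider's actual implementation:

```python
def escape_like_pattern(value: str, escape_char: str = "\\") -> str:
    # Escape the LIKE wildcards (% and _) and the escape character itself,
    # so that a constant pattern matches literally in a LIKE expression.
    escaped = []
    for ch in value:
        if ch in ("%", "_", escape_char):
            escaped.append(escape_char)
        escaped.append(ch)
    return "".join(escaped)
```

The escaped pattern can then be concatenated with '%' for a StartsWith-style LIKE.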
@lauxjpn I'll let @smitpatel or @roji comment on the specifics here, but see also this discussion about handling empty strings: dotnet/efcore#19402. We had made the call that the empty string is always contained in any other string, but this turned out to be a big can of worms, so we're kind of back to basics here discussing what we should do. |
@ajcvickers Ah, the good old ignorance of trailing spaces in SQL Server :) Thanks for the heads-up! |
Can you post the exception message and stack trace too? |
Should be enough for |
Filed dotnet/efcore#20498 |
@ajcvickers I think it is important that two tests exist for each "optimized" version of these methods. One test of each test pair should execute against a case-insensitive database, table or column, while the other test should do the same against a case-sensitive database, table or column. This should prevent other providers in the future from making the same mistake I did, by assuming the results of those "optimized" methods are based on the documented default behavior of their .NET counterparts. |
Correct, that is misleading but is clarified in Case Sensitivity in String Searches:
The MariaDB documentation for LIKE is a little clearer about case sensitivity:
|
@lauxjpn I filed dotnet/efcore#20501 |
@lauxjpn, I've seen your question in dotnet/efcore#20610 and your recent change in #1057 and wanted to ask a couple of questions. I recently took a deep look at collations (including some MySQL experiments) and want to make sure we're more or less aligned on things. First, you wrote above that "when doing an exact comparison, the index will additionally be used when the literal operand's collation is changed to a binary one", i.e. that equality on a column using some non-binary collation will use an index if the column is compared to a literal with binary collation. I've done a quick test and I couldn't see this behavior... From my tests a column index only ever gets used if the literal has the exact same collation as the column. Code used to test index usage:

SELECT VERSION(); -- 8.0.19-0ubuntu0.19.10.3
DROP TABLE data;
CREATE TABLE data (
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
ci VARCHAR(256) CHARSET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
bin VARCHAR(256) CHARSET utf8mb4 COLLATE utf8mb4_bin);
CREATE INDEX ix_ci ON data(ci);
CREATE INDEX ix_bin ON data(bin);
DROP PROCEDURE IF EXISTS myloop;
DELIMITER //
CREATE PROCEDURE myloop()
BEGIN
DECLARE i INT DEFAULT 0;
START TRANSACTION;
WHILE (i < 50000) DO
INSERT INTO data (ci, bin) values (i, i);
SET i = i+1;
END WHILE;
COMMIT;
END;
//
CALL myloop();
-- No explicit collations, indexes always used
EXPLAIN ANALYZE SELECT id FROM data WHERE ci='hello';
EXPLAIN ANALYZE SELECT id FROM data WHERE bin='hello';
-- Collate on column, indexes never used
EXPLAIN ANALYZE SELECT id FROM data WHERE ci COLLATE utf8mb4_0900_ai_ci='hello';
EXPLAIN ANALYZE SELECT id FROM data WHERE bin COLLATE utf8mb4_bin='hello';
-- Collate on literal, uses index only when the collation matches the column's
EXPLAIN ANALYZE SELECT id FROM data WHERE ci='hello' COLLATE utf8mb4_0900_ai_ci; -- yes
EXPLAIN ANALYZE SELECT id FROM data WHERE ci='hello' COLLATE utf8mb4_bin; -- no
EXPLAIN ANALYZE SELECT id FROM data WHERE bin='hello' COLLATE utf8mb4_0900_ai_ci; -- no
EXPLAIN ANALYZE SELECT id FROM data WHERE bin='hello' COLLATE utf8mb4_bin; -- yes
-- Same as above, explicitly specifying the charset
EXPLAIN ANALYZE SELECT id FROM data WHERE ci=(CONVERT('hello' USING utf8mb4) COLLATE utf8mb4_0900_ai_ci);
EXPLAIN ANALYZE SELECT id FROM data WHERE ci=(CONVERT('hello' USING utf8mb4) COLLATE utf8mb4_bin);
EXPLAIN ANALYZE SELECT id FROM data WHERE bin=(CONVERT('hello' USING utf8mb4) COLLATE utf8mb4_0900_ai_ci);
EXPLAIN ANALYZE SELECT id FROM data WHERE bin=(CONVERT('hello' USING utf8mb4) COLLATE utf8mb4_bin);
Now, looking at #1057, the provider seems to continue to translate string operations with StringComparison, applying the LCASE function for StringComparison values that ignore case, and a binary collation for the others.
We discussed the question of translating StringComparison overloads (see dotnet/efcore#1222 (comment)), but in the end decided against it (at least unless new information emerges). First, IMHO translating a very common/widely-used method such as string.Equals with StringComparison, which will prevent index usage, seems like a pretty big pit of failure - I'd expect users to happily use it without realizing the big perf implication it brings. In addition, choosing exactly which collation to use is tricky (see my comment above about non-case differences), so I think it's preferable for users to explicitly specify the collation they want. The alternative to StringComparison is for users to use the new EF.Functions.Collate. It makes users specify collations in all cases, and hopefully to also think about index/perf implications (because it's a special function, with documentation, and they've had to discover it). Of course, your provider can do whatever it wants and diverge from other providers :) But I'd just like to see if there's anything MySQL-specific here that points this way, and see if we can possibly align things if that makes sense. |
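For reference, the two translation strategies being debated can be sketched in Python (illustrative only; the provider emits SQL with LCASE or a binary collation, not these functions):

```python
# StringComparison.Ordinal: translated by forcing a binary collation on the
# literal, i.e. an exact byte-for-byte comparison.
def ordinal_equals(a: str, b: str) -> bool:
    return a == b

# StringComparison.OrdinalIgnoreCase: translated via LCASE on both operands,
# which is what prevents the database from using a plain index on the column.
def ordinal_ignore_case_equals(a: str, b: str) -> bool:
    return a.lower() == b.lower()
```

The lowercasing variant mirrors why an index cannot help: the column value is transformed before the comparison, so the stored (indexed) values no longer match the search key directly.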
I slightly changed your code (probably insignificant here) and ran it on my end, with different results (or possibly a different interpretation of the same results):

SELECT VERSION(); -- 8.0.18
drop database if exists `CollationTests`;
create database if not exists `CollationTests`;
use `CollationTests`;
DROP TABLE if exists data;
CREATE TABLE data (
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
cs VARCHAR(256) CHARSET utf8mb4 COLLATE utf8mb4_0900_as_cs,
ci VARCHAR(256) CHARSET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
bin VARCHAR(256) CHARSET utf8mb4 COLLATE utf8mb4_bin);
CREATE INDEX ix_cs ON data(cs);
CREATE INDEX ix_ci ON data(ci);
CREATE INDEX ix_bin ON data(bin);
DROP PROCEDURE IF EXISTS myloop;
DELIMITER //
CREATE PROCEDURE myloop()
BEGIN
DECLARE i INT DEFAULT 0;
START TRANSACTION;
WHILE (i < 50000) DO
INSERT INTO data (cs, ci, bin) values (i, i, i);
SET i = i+1;
END WHILE;
COMMIT;
END;
//
CALL myloop();
-- No explicit collations, indexes always used
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE cs='hello'; -- yes
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE ci='hello'; -- yes
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE bin='hello'; -- yes
-- Collate on column, all indices ARE always used (Non-Unique Key Lookup)
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE cs COLLATE utf8mb4_0900_as_cs='hello'; -- yes
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE ci COLLATE utf8mb4_0900_ai_ci='hello'; -- yes
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE bin COLLATE utf8mb4_bin='hello'; -- yes
-- Collate on literal, uses the index when the collation matches the column's (fully), or a binary collation is used (partially).
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE cs='hello' COLLATE utf8mb4_0900_as_cs; -- yes
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE cs='hello' COLLATE utf8mb4_bin; -- yes* (Index Range Scan/Partial Index Scan)
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE ci='hello' COLLATE utf8mb4_0900_ai_ci; -- yes
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE ci='hello' COLLATE utf8mb4_bin; -- yes* (Index Range Scan/Partial Index Scan)
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE bin='hello' COLLATE utf8mb4_0900_ai_ci; -- no (expected): ai/ci on bin column
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE bin='hello' COLLATE utf8mb4_bin; -- yes
-- Same as above, explicitly specifying the charset
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE cs=(CONVERT('hello' USING utf8mb4) COLLATE utf8mb4_0900_as_cs); -- yes
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE cs=(CONVERT('hello' USING utf8mb4) COLLATE utf8mb4_bin); -- yes* (Index Range Scan/Partial Index Scan)
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE ci=(CONVERT('hello' USING utf8mb4) COLLATE utf8mb4_0900_ai_ci); -- yes
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE ci=(CONVERT('hello' USING utf8mb4) COLLATE utf8mb4_bin); -- yes* (Index Range Scan/Partial Index Scan)
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE bin=(CONVERT('hello' USING utf8mb4) COLLATE utf8mb4_0900_ai_ci); -- no (expected): ai/ci on bin column
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE bin=(CONVERT('hello' USING utf8mb4) COLLATE utf8mb4_bin); -- yes

As you can see, a binary collation on the literal still allows at least a partial index scan. Please verify my results again on your end, if you've got the time.
That is correct. The methods have already been translated in
That is correct. There is room for optimization here, where we would check for a collation specified for a used property and suppress the
If
That is correct. It is the simplest way to implement a similar behavior, without the need to examine possible collations of possible columns or to maintain a list of known mappings between CS/CI collations. So there is room for optimization here as well, if the community requests it, but the current behavior can at least partially use an existing index.
Yes, this is a valid concern. We could think about issuing a warning or introducing an option to disable this behavior. At least, this should be documented.
This will definitely become a valid option, once we support this! But to use it efficiently will require advanced database performance knowledge, that only a few users will have. So providing a simple way to accomplish the same for common scenarios still seems the way to go here.
Yes, having a standard documentation template of supported functions and their provider-specific implementations, which every provider would fill out, would definitely make things easier for users that need to support multiple providers (or migrate projects from one provider to another). Or a provider could just state that all the default implementations as defined in the EF Core docs (template) are used, except for the following implementations, and then just list the provider-specific divergences. |
Apologies, my bad... Am a bit new to MySQL and my database UI (Rider...) just shows the top line of the EXPLAIN result by default, which hides the index lookup 🤨. I can now see the results. I must say I'm surprised that a binary index can be used to speed up a case-insensitive query in any way (the opposite is much less surprising).
FWIW support for EF.Functions.Collate is about to go into EF Core and will probably not require your provider to actually do anything.
This is where I'm a bit confused... It seems to me that the current translations with StringComparisons are far more dangerous in terms of performance than EF.Functions.Collate will be, simply because the natural behavior for users would be to use them as-is. A warning could help with this, although our experience has shown that warnings haven't been extremely effective (a good example is the client evaluation we used to have). At the very least, it seems that whatever advanced/performance knowledge will be required to use EF.Functions.Collate is also needed in order to efficiently use the StringComparison overloads today. At the end of the day, string comparison and collations is a good example of a big mismatch between the .NET world and the database world, among other things because of the subtle (and at the same time crucial) effect of indexes. In this kind of scenario we've generally tried to not hide the complexity, but to confront the user with it so they can make informed decisions. That's why I'm a bit scared about the StringComparison overloads. Anyway, all that's just my opinion... |
I generally share your concerns here. If we hadn't had support for this before, I would probably think twice about introducing it now. But since we already have it in place, I tend to keep supporting it and to just document it properly. I don't see any concrete dangers here however. The worst scenario should be, that the query does not use an existing index and that there could be corner cases in accent and special char handling, depending on the chosen
That is one of the main points I disagree with. Knowing, e.g., that an index will partially be used when doing a binary comparison, but only if it is used on the literal value, is not common knowledge, and the information is not straightforward to obtain. Getting this right by yourself when using EF.Functions.Collate requires exactly that kind of knowledge. I think that the main body of this discussion here is about performance optimization. Pomelo provides both approaches:
The ramifications: If the performance of any of those two approaches is not sufficient in practice (which is more likely for the second approach), then this will be determined by the user when it becomes a problem. Then the usual steps to optimize a slow-performing query can be applied here, as they need to be for other slow-performing queries as well. I do see the very real possibility that people might choose a StringComparison overload without being aware of the performance implications. We could think about marking these methods as obsolete and removing them for EF Core 5, or only translating them in the future on an opt-in basis. I am not sure this is necessary though. As long as we document the potential performance issues, not every translated method needs to perform as well as similar methods. It should be noted that there is also the opportunity for other providers to embrace the same approach that Pomelo uses, and provide additional (potentially not optimized) comparison mechanisms out-of-the-box, thereby making this method translation support the default (though I don't think this will happen). After all, these are very basic and common query scenarios that really any provider should somehow support. I think the actual underlying issue here, which comes to light in this discussion, is that there is no natural way for users to tell which methods can be translated and what the ramifications of using them are. And providers currently have only limited ways of naturally providing this information to the user. I personally believe that this is one of the most important issues that needs to be tackled. Because currently, you either need serious development experience with EF Core to intuitively tell what providers usually can translate, or you have to test every query with a trial-and-error approach, in hopes it will run as expected. Having some kind of code-time support (like an analyzer or add-in, where providers can plug in their specifics, e.g. additional or different method descriptions, that will provide IntelliSense support for the supported methods and will warn if something unsupported is being used) would be a game-changer. If this could also show the SQL that is currently being generated for a given query, that would be the cherry on top. /cc @ajcvickers |
Fair enough, removing something and breaking people definitely has its own ramifications.
But isn't that knowledge (or at least most of it) required today as well? For one thing, your users have to somehow know that StringComparison.*IgnoreCase prevents any index from being used. For another, I'm not sure what the exact impact of partial index use is (i.e. how good it is compared to a full matching index). I do understand your point that there's value in applying a binary collation via StringComparison.Ordinal - it does seem to work partially, which is specific to MySQL AFAIK. But at the very least for the insensitive translation, IMHO there's little value in translating to LCASE - which the user could easily add themselves. The way I see it, the complexity is there and IMHO can't really be hidden or glossed over... Each provider should probably have a guidance page specifically on case-sensitivity and collations.
That's a very important point, and the crux of the matter is what default/expected/naive code produces - ideally, idiomatic LINQ constructs which users produce naively should produce code that runs fast, and much as possible, doesn't need to be optimized. Users indeed always have the option of diving in and using EF.Functions.Collations and having full freedom there - we're in agreement about that. Note that there are analyzers out there which actually flag string comparisons without StringComparison, guiding the user to add them; if this causes no index to be used (or even a perf degradation due to partial index use), that's quite a pit of failure.
I think we differ here quite considerably. I don't think EF (or its providers) should have as a goal to translate anything the user can express in LINQ. If a translation maps well and provides reasonable perf that's great, otherwise IMHO it's better to avoid providing a translation, while ideally pointing the user (via the exception message) to their options. One of the weak points we've had IMO is uninformative "translation failed" messages which don't point users towards workarounds/solutions; FWIW that's exactly what we're doing in dotnet/efcore#20663. So to summarize, the strategy I have in mind is that if something is translated, the user should be able to assume that the translation is both correct and reasonably efficient. Re code-time support for knowing what's translatable, I agree that would be great; I'm a bit scared of the perf implications of that, since running the EF Core query pipeline from an analyzer as people type code could be quite slow. Another point to keep in mind, is that we think that the vast majority of people are using one provider only (setting aside InMemory for testing); multi-provider scenarios seem rare. So it's expected for people to have a bit of a learning curve as they learn what their provider supports, and then things should be relatively smooth. But of course documentation should definitely be improved in this area, at the very least. Just to be clear, I'm very happy we're having this conversation even if we don't agree on everything, and I've already learned quite a bit (e.g. the MySQL binary index and partial index use). |
That is a valid approach. In my opinion, the golden path leads along the balance between great perf and common user requirements support. If we can provide the latter reasonably well, I am all game. What reasonably well means, is open to interpretation of course.
I think a simple way to do this would be for providers to supply some kind of XML file that could be read by IntelliSense, to mark the supported methods in some way (bold, a colored star, etc.) and provide custom method descriptions that can be displayed instead of the default ones (or in addition to them). This would be better than any documentation. (I haven't looked into this at all from a technical standpoint, so I have not the slightest idea if this is supported, how challenging this would be or how much inter-team coordination this would require. I only know that extending IntelliSense is generally possible, and that libraries ship their own XML files for their own classes/methods.)
Absolutely! I consider this kind of discussions as very interesting and productive. If everybody would agree on everything, this would probably not represent many of the relevant user views appropriately. |
@lauxjpn I lost track of what the current status of this is. Is the provider back to using case insensitive by default? |
@jemiller0 Yeah, this issue has become a bit long.
The provider is back to using the casing that is defined by the collation of the column (currently only in the nightly builds, but it will be part of the next release). If your collation for a given column is case-insensitive (e.g. a *_ci collation), comparisons against that column are case-insensitive as well. If on the other hand a case-sensitive collation (e.g. *_cs or *_bin) is used, comparisons are case-sensitive. This is how other providers implement it, how we implemented it before 3.0.0 and how it should be implemented. See the code of #1057 for further details. Feel free to take the nightly build out for a spin and report back, if you like to and have the time. |
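The restored default can be sketched in Python (illustrative, not provider code; treating any collation with a _ci suffix as case-insensitive is a simplification of MySQL's collation naming):

```python
def column_equals(column_value: str, literal: str,
                  collation: str = "utf8mb4_0900_ai_ci") -> bool:
    # Without an explicit StringComparison, the column's collation decides:
    # a *_ci collation compares case-insensitively, everything else
    # (*_cs, *_bin) compares case-sensitively.
    if collation.endswith("_ci"):
        return column_value.lower() == literal.lower()
    return column_value == literal
```

So the same LINQ query yields different results depending on the collation configured for the column, which matches the pre-3.0.0 behavior and that of other providers.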
Thanks @lauxjpn. I appreciate yours and everyone else's meticulous attention to details like this. Hopefully, this will be a 3.x release? I'm out of luck in 5 no thanks to Microsoft switching to .NET Standard 2.1 and dropping support for .NET Framework. I'm using this with several Web Forms projects that aren't getting completely rewritten. I hope the 3.x version of EF Core continues to get fixes. |
Yes, this will be part of the next Pomelo release, which will be
I think it was a good decision of the EF Core team to switch back to .NET Standard 2.0 for EF Core 3.1 (remember, EF Core 3.0 was already using .NET Standard 2.1). But at some point, we all need to move on. Personally, I am a big fan of Microsoft's paradigm shift to actually evolve .NET from now on, instead of carrying around all the previous dead weight until eternity. But of course this comes at the price of keeping up with its evolution.
.NET Framework 4.8 will be with us forever as well and will get security fixes, though there will be no active development there anymore. But if you just need to maintain legacy projects, staying on .NET Framework 4.8 should be fine.
Take a look at the official .NET Core Support Policy:
But even after December 3, 2022, if EF Core is your only serious dependency, as long as you hide the EF Core access behind your WebForms layer, there should be no reason why EF Core 3.1 should not continue to run as before (and as secure as before). Depending on how well layered your app is, you should also be able to split your app into the WebForms specific stuff and everything else without too much effort. Then you should at least be able to run both components from different processes, one using .NET Framework 4.8 (WebForms) and the other using the latest version of .NET Core/.NET 5+. This should only be necessary, however, if you need to continue active development on the project. This should not be necessary for a legacy app that is solely on life support. |
The issue
Using the 2.2 provider with a LINQ query with a Where() and a StartsWith() filter generates the following SQL
The same query using the 3.1 provider generates the following SQL
Note the extra calls to CONVERT() and the COLLATE clause. This is causing the query to not return any results when it should.
Why is it doing this? Is there a way to turn it off?
Further technical details
MySQL version: 5.7.28
Operating system: RedHat Linux
Pomelo.EntityFrameworkCore.MySql version: 3.1.0
Other details about my project setup:
The server's default character set is set to utf8mb4 using the following settings
[client]
default-character-set = utf8mb4
[mysql]
default-character-set = utf8mb4