Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pomelo's 2.2.6+ string comparison methods are case-sensitive by default (in contrast to other providers, which use the database table settings) #996

Closed
jemiller0 opened this issue Dec 22, 2019 · 50 comments · Fixed by #1057
Assignees
Milestone

Comments

@jemiller0
Copy link

The issue

Using the 2.2 provider with a LINQ query with a Where() and a StartsWith() filter generates the following SQL

SELECT `b`.`bib_id` AS `Id`, `b`.`fast_add` AS `FastAdd`, `b`.`staff_only` AS `StaffOnly`, `b`.`created_by` AS `CreationUserName`, `b`.`date_created` AS `CreationTime`, `b`.`updated_by` AS `LastWriteUserName`, `b`.`date_updated` AS `LastWriteTime`, `b`.`status`, `b`.`status_updated_by` AS `StatusLastWriteUserName`, `b`.`status_updated_date` AS `StatusLastWriteTime`, `b.BibExt`.`title` AS `BibExt`
FROM `ole_ds_bib_t` AS `b`
LEFT JOIN `uc_bib_ext` AS `b.BibExt` ON `b`.`bib_id` = `b.BibExt`.`id`
WHERE `b.BibExt`.`title` LIKE CONCAT('death', '%') AND (LEFT(`b.BibExt`.`title`, CHAR_LENGTH('death')) = 'death')
ORDER BY `LastWriteTime` DESC
LIMIT 100 OFFSET 0

The same query using the 3.1 provider generates the following SQL

SELECT `o`.`bib_id` AS `Id`, `o`.`fast_add` AS `FastAdd`, `o`.`staff_only` AS `StaffOnly`, `o`.`created_by` AS `CreationUserName`, `o`.`date_created` AS `CreationTime`, `o`.`updated_by` AS `LastWriteUserName`, `o`.`date_updated` AS `LastWriteTime`, `o`.`status` AS `Status`, `o`.`status_updated_by` AS `StatusLastWriteUserName`, `o`.`status_updated_date` AS `StatusLastWriteTime`, `u`.`title` AS `BibExt`
FROM `ole_ds_bib_t` AS `o`
LEFT JOIN `uc_bib_ext` AS `u` ON `o`.`bib_id` = `u`.`id`
WHERE `u`.`title` IS NOT NULL AND ((`u`.`title` LIKE CONCAT('death', '%')) AND (LEFT(`u`.`title`, CHAR_LENGTH(CONVERT('death' USING utf8mb4) COLLATE utf8mb4_bin)) = CONVERT('death' USING utf8mb4) COLLATE utf8mb4_bin))
ORDER BY `o`.`date_updated` DESC
LIMIT 100 OFFSET 0

Note the extra calls to CONVERT() and the COLLATE clause. This is causing the query to not return any results when it should be.

Why is it doing this? Is there a way to turn it off?

Further technical details

MySQL version: 5.7.28
Operating system: RedHat Linux
Pomelo.EntityFrameworkCore.MySql version: 3.1.0

Other details about my project setup:

The server's default character set is set to utf8mb4 using the following settings

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

@jemiller0 jemiller0 changed the title CONVERT() and COLLATE being added to SQL generated from Where() with StartsWith() filter incorrectly returns no results CONVERT() and COLLATE being added to SQL generated for Where() with StartsWith() filter incorrectly returns no results Dec 22, 2019
@jemiller0 jemiller0 changed the title CONVERT() and COLLATE being added to SQL generated for Where() with StartsWith() filter incorrectly returns no results CONVERT() and COLLATE being added to SQL generated from Where() with StartsWith() filter incorrectly returns no results Dec 22, 2019
@lauxjpn lauxjpn self-assigned this Dec 22, 2019
@lauxjpn
Copy link
Collaborator

lauxjpn commented Dec 22, 2019

We can't really tell without seeing the actual LINQ query, but i suspect your StartsWith() call to look similar to the following:

context.OleDsBibt
    .Include(o => o.UcBibExt)
    .Where(u => u.Title.StartsWith("death"))

Pomelo versions below 3.0.0 translate this LINQ query to a case-insensitive LIKE statement, which is incorrect for StartsWith(string) (see .NET Core 2.2 docs):

This method performs a word (case-sensitive and culture-sensitive) comparison using the current culture.

Pomelo 3.0.0+ does this correctly, by converting the string first to UTF-8 and then enforcing a binary (case-sensitive) collation on it before applying the LIKE operation.

So if you are using StartsWith(string) as the method call, but expect a case-insensitive search, use an overload where you can explicitly specify the string comparison method instead, like StartsWith(String, StringComparison) (see .NET Core 3.1 docs):

context.OleDsBibt
    .Include(o => o.UcBibExt)
    .Where(u => u.Title.StartsWith("death", StringComparison.OrdinalIgnoreCase))

There is currently no difference between Ordinal, CurrentCulture and InvariantCulture when translating calls with string comparison parameters to SQL in Pomelo. The only distinction being made is between case-sensitivity and case-insensitivity.


In most cases, it is a good idea to take the following paragraph of the EF Core docs to heart and be as explicit as possible when calling string comparison methods:

Notes to Callers

As explained in Best Practices for Using Strings, we recommend that you avoid calling string comparison methods that substitute default values and instead call methods that require parameters to be explicitly specified. To determine whether a string begins with a particular substring by using the string comparison rules of the current culture, call the StartsWith(String, StringComparison) method overload with a value of CurrentCulture for its comparisonType parameter.

@jemiller0
Copy link
Author

@lauxjpn Thanks for the quick and detailed response. While I agree that the new behavior is more precise and is consistent with LINQ to Objects, IMHO, it isn't a desired behavior. If I wanted case sensitive comparisons, I would have configured the database to be case sensitive. It would be nice if there was a configuration option that could be used to disable the new behavior.

The reason I didn't supply the C# code for the LINQ query is because I'm using Telerik's RadGrid UI control and Dynamic LiNQ. The UI component has built-in filter controls and it gives you a LINQ WHERE clause string that you can add using Dynamic LINQ. Long story short, in my case, it is not as easy as using the StringComparison enum overload of StartsWith(). I can probably use a regex and add it to the string. I will try that next. Hopefully, using the StringComparison enum is compatible with other EF Core providers, as I use PostgreSQL and SQL Server as well.

@jemiller0
Copy link
Author

I tried it and it appears that Dynamic LINQ (System.Linq.Dynamic NuGet package) doesn't support the StartsWith() overloaded method.

Original Dynamic LINQ expression that works is

(BibExt.StartsWith("death"))

New Dynamic LINQ expression with case-insensitive comparison

(BibExt.StartsWith("death", StringComparison.OrdinalIgnoreCase))

results in

System.Linq.Dynamic.ParseException: 'No property or field 'StringComparison' exists in type '<>f__AnonymousType22911''
`

Another way to do it is

(BibExt.ToUpper().StartsWith("death".ToUpper()))
Which results in

SELECT `o`.`bib_id` AS `Id`, `o`.`fast_add` AS `FastAdd`, `o`.`staff_only` AS `StaffOnly`, `o`.`created_by` AS `CreationUserName`, `o`.`date_created` AS `CreationTime`, `o`.`updated_by` AS `LastWriteUserName`, `o`.`date_updated` AS `LastWriteTime`, `o`.`status` AS `Status`, `o`.`status_updated_by` AS `StatusLastWriteUserName`, `o`.`status_updated_date` AS `StatusLastWriteTime`, `u`.`title` AS `BibExt`
FROM `ole_ds_bib_t` AS `o`
LEFT JOIN `uc_bib_ext` AS `u` ON `o`.`bib_id` = `u`.`id`
WHERE UPPER(`u`.`title`) IS NOT NULL AND ((UPPER(`u`.`title`) LIKE CONCAT('DEATH', '%')) AND (LEFT(UPPER(`u`.`title`), CHAR_LENGTH(CONVERT('DEATH' USING utf8mb4) COLLATE utf8mb4_bin)) = CONVERT('DEATH' USING utf8mb4) COLLATE utf8mb4_bin))
ORDER BY `o`.`date_updated` DESC
LIMIT @__p_1 OFFSET @__p_0

However, the extra call to UPPER() or the conversions appears to cause it not to use the index.

This is a table with 8 million rows. It needs the index.

So, most likely, I will need to test the new interceptor functionality in EF Core 3 and remove the extra conversion logic. While well intentioned, it is not helpful at all. In fact, it is a total pain and is the kind of thing that is likely going to drive me to switch to Dapper. Every new version of EF Core is a train wreck and a constant source of breaking changes.

@lauxjpn
Copy link
Collaborator

lauxjpn commented Dec 23, 2019

Hopefully, using the StringComparison enum is compatible with other EF Core providers, as I use PostgreSQL and SQL Server as well.

@jemiller0 I have to correct myself here in regards to previous edits of this post.

It looks like it is only Pomelo that is actually translating string methods with a StringComparison parameter. There are still two issues open for EF Core about this (dotnet/efcore#18881 and dotnet/efcore#1222) and none of the other major providers seems to have implemented the behavior yet.

So it looks like that Pomelo is the only provider that supports StringComparison as a parameter for string methods as of now.

However, the extra call to UPPER() or the conversions appears to cause it not to use the index.

That is correct. In fact, Pomelo had been using some UPPER()/LOWER() mechanism in some distant past versions, but switched to collation based comparisons to make use of indices.

While well intentioned, it is not helpful at all.

While Pomelo's implementation is technically correct and the other implementations (including SQL Server's) are not, I have to agree now that because no other provider has made the move to support this yet, this is less of a feature and more of an issue when using multiple providers.

In fact, it is a total pain and is the kind of thing that is likely going to drive me to switch to Dapper. Every new version of EF Core is a train wreck and a constant source of breaking changes.

As mentioned above, this is not an EF Core issue in this case, but just Pomelo being ahead of everybody else.

With EF Core having reached 3.0, there are going to be far fewer breaking changes moving forward than in the past. The same is true for Pomelo, which is quite stable now. Though we will still introduce breaking changes in major versions, to improve the quality of the provider for future versions.

Proposed fix

It would be nice if there was a configuration option that could be used to disable the new behavior.

Because we diverge from everybody else's implementation, we will most likely need to do exactly what you propose.

We should discuss the pros and cons of doing it as an opt-in or opt-out setting.

Current Workarounds

Usage of interceptor and regular expression

So, most likely, I will need to test the new interceptor functionality in EF Core 3 and remove the extra conversion logic.

The simplest (and probably most compatible way with future versions of Pomelo) will be to just replace any occurrence of utf8mb4_bin with your desired collation.

Modifying the expression tree

The following approach works for Pomelo, if you can get access to the query (IQueryable object) that Dynamic LINQ generates for you.

You can just modify the expression tree generated by Dynamic LINQ and e.g. swap the String.StartsWith(string) call with a String.StartsWith(string, StringComparison) call.

You must activate this behavior only when using Pomelo, because other providers might not support a StringComparision parameter yet. You would also need to slightly modify your query (add one extension method call).

The following fully functional console program demonstrates this approach:

Program.cs
using System;
using System.Diagnostics;
using System.Linq;
using System.Linq.Expressions;
using System.Reflection;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Logging;

namespace IssueConsoleTemplate
{
    public class IceCream
    {
        public int IceCreamId { get; set; }
        public string Name { get; set; }
    }

    public class Context : DbContext
    {
        public DbSet<IceCream> IceCreams { get; set; }

        protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
        {
            optionsBuilder
                .UseMySql("server=127.0.0.1;port=3306;user=root;password=;database=Issue996")
                .UseLoggerFactory(LoggerFactory.Create(b => b.AddConsole()))
                .EnableDetailedErrors()
                .EnableSensitiveDataLogging();
        }

        protected override void OnModelCreating(ModelBuilder modelBuilder)
        {
            modelBuilder.Entity<IceCream>().HasData(
                new IceCream
                {
                    IceCreamId = 1,
                    Name = "Strawberry",
                },
                new IceCream
                {
                    IceCreamId = 2,
                    Name = "Vanilla & Berry",
                },
                new IceCream
                {
                    IceCreamId = 3,
                    Name = "Berry's Chocolate Brownie",
                }
            );
        }
    }

    public class DynamicLinqStringMethodRewriter : ExpressionVisitor
    {
        private static readonly string[] SupportedStringMethods =
        {
            nameof(string.StartsWith),
            nameof(string.EndsWith),
            nameof(string.Contains),
        };

        public Expression Rewrite(Expression expression)
        {
            return Visit(expression);
        }

        protected override Expression VisitMethodCall(MethodCallExpression node)
        {
            foreach (var supportedStringMethod in SupportedStringMethods)
            {
                if (node.Method == typeof(string).GetRuntimeMethod(
                        supportedStringMethod,
                        new[] {typeof(string)}))
                {
                    var methodWithComparisonParameter = typeof(string).GetRuntimeMethod(
                        supportedStringMethod,
                        new[] {typeof(string), typeof(StringComparison)});

                    var arguments = node.Arguments.ToList();
                    arguments.Add(Expression.Constant(StringComparison.OrdinalIgnoreCase));

                    return Expression.Call(
                        node.Object,
                        methodWithComparisonParameter,
                        arguments);
                }
            }

            return base.VisitMethodCall(node);
        }
    }
    
    public static class DynamicLinqQueryableExtensions
    {
        public static IQueryable<T> MakeCaseInsensitive<T>(this IQueryable<T> query)
            => query.Provider.CreateQuery<T>(new DynamicLinqStringMethodRewriter()
                .Rewrite(query.Expression));
    }

    internal class Program
    {
        private static void Main()
        {
            using var context = new Context();

            context.Database.EnsureDeleted();
            context.Database.EnsureCreated();

            var startsWithResult = context.IceCreams
                .Where(i => i.Name.StartsWith("berry"))
                .MakeCaseInsensitive()
                .ToList();

            var endsWithResult = context.IceCreams
                .Where(i => i.Name.EndsWith("berry"))
                .MakeCaseInsensitive()
                .ToList();

            var containsResult = context.IceCreams
                .Where(i => i.Name.Contains("berry"))
                .MakeCaseInsensitive()
                .ToList();

            // With modifying the expression tree:
            Debug.Assert(startsWithResult.Count == 1);
            Debug.Assert(endsWithResult.Count == 2);
            Debug.Assert(containsResult.Count == 3);

            // Without modifying the expression tree:
            // Debug.Assert(startsWithResult.Count == 0);
            // Debug.Assert(endsWithResult.Count == 1);
            // Debug.Assert(containsResult.Count == 1);
        }
    }
}

Remapping of String methods without StringComparison parameter

The following approach works for Pomelo, if you cannot get access to the query (IQueryable object) that Dynamic LINQ generates for you.

You can just remap the function calls without StringComparison parameters inside Pomelo/EF Core. This makes use of internal Pomelo classes.

You must activate this behavior only when using Pomelo, because other providers might not support a StringComparision parameter yet.

The following fully functional console program demonstrates this approach:

Program.cs
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.CodeAnalysis;
using System.Linq;
using System.Reflection;
using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.Query;
using Microsoft.EntityFrameworkCore.Query.SqlExpressions;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;
using Pomelo.EntityFrameworkCore.MySql.Query.ExpressionTranslators.Internal;
using Pomelo.EntityFrameworkCore.MySql.Query.Internal;

namespace IssueConsoleTemplate
{
    public class DynamicLinqCompatibleMySqlMethodCallTranslatorProvider : MySqlMethodCallTranslatorProvider
    {
        public DynamicLinqCompatibleMySqlMethodCallTranslatorProvider(
            [NotNull] RelationalMethodCallTranslatorProviderDependencies dependencies)
            : base(dependencies)
        {
            var sqlExpressionFactory = dependencies.SqlExpressionFactory;

            AddTranslators(new IMethodCallTranslator[]
            {
                new DynamicLinqCompatibleMySqlStringMethodTranslator(sqlExpressionFactory),
            });
        }

        private class DynamicLinqCompatibleMySqlStringMethodTranslator : IMethodCallTranslator
        {
            private static readonly MethodInfo _startsWithMethodInfo
                = typeof(string).GetRuntimeMethod(nameof(string.StartsWith), new[] {typeof(string)});

            private static readonly MethodInfo _containsMethodInfo
                = typeof(string).GetRuntimeMethod(nameof(string.Contains), new[] {typeof(string)});

            private static readonly MethodInfo _endsWithMethodInfo
                = typeof(string).GetRuntimeMethod(nameof(string.EndsWith), new[] {typeof(string)});

            private readonly MySqlSqlExpressionFactory _sqlExpressionFactory;

            public DynamicLinqCompatibleMySqlStringMethodTranslator(ISqlExpressionFactory sqlExpressionFactory)
            {
                _sqlExpressionFactory = (MySqlSqlExpressionFactory) sqlExpressionFactory;
            }

            public virtual SqlExpression Translate(
                SqlExpression instance,
                MethodInfo method,
                IReadOnlyList<SqlExpression> arguments)
            {
                if (_containsMethodInfo.Equals(method))
                {
                    // Use OrdinalIgnoreCase instead of Ordinal.
                    return new MySqlStringComparisonMethodTranslator(_sqlExpressionFactory)
                        .MakeContainsExpression(
                            instance,
                            arguments[0],
                            _sqlExpressionFactory.Constant(StringComparison.OrdinalIgnoreCase));
                }

                if (_startsWithMethodInfo.Equals(method))
                {
                    // Use CurrentCultureIgnoreCase instead of CurrentCulture.
                    return new MySqlStringComparisonMethodTranslator(_sqlExpressionFactory)
                        .MakeStartsWithExpression(
                            instance,
                            arguments[0],
                            _sqlExpressionFactory.Constant(StringComparison.CurrentCultureIgnoreCase));
                }

                if (_endsWithMethodInfo.Equals(method))
                {
                    // Use CurrentCultureIgnoreCase instead of CurrentCulture.
                    return new MySqlStringComparisonMethodTranslator(_sqlExpressionFactory)
                        .MakeEndsWithExpression(
                            instance,
                            arguments[0],
                            _sqlExpressionFactory.Constant(StringComparison.CurrentCultureIgnoreCase));
                }

                return null;
            }
        }
    }

    public class IceCream
    {
        public int IceCreamId { get; set; }
        public string Name { get; set; }
    }

    public class Context : DbContext
    {
        public DbSet<IceCream> IceCreams { get; set; }

        protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
        {
            var serviceProvider = new ServiceCollection()
                .AddEntityFrameworkMySql()
                .AddSingleton<IMethodCallTranslatorProvider, DynamicLinqCompatibleMySqlMethodCallTranslatorProvider>()
                .AddLogging(b => b.AddConsole())
                .BuildServiceProvider();
            
            optionsBuilder
                .UseInternalServiceProvider(serviceProvider)
                .UseMySql("server=127.0.0.1;port=3306;user=root;password=;database=Issue996")
                .EnableDetailedErrors()
                .EnableSensitiveDataLogging();
        }

        protected override void OnModelCreating(ModelBuilder modelBuilder)
        {
            modelBuilder.Entity<IceCream>().HasData(
                new IceCream
                {
                    IceCreamId = 1,
                    Name = "Strawberry",
                },
                new IceCream
                {
                    IceCreamId = 2,
                    Name = "Vanilla & Berry",
                },
                new IceCream
                {
                    IceCreamId = 3,
                    Name = "Berry's Chocolate Brownie",
                }
            );
        }
    }

    internal class Program
    {
        private static void Main()
        {
            using var context = new Context();

            context.Database.EnsureDeleted();
            context.Database.EnsureCreated();

            var startsWithResult = context.IceCreams.Where(i => i.Name.StartsWith("berry")).ToList();
            var endsWithResult = context.IceCreams.Where(i => i.Name.EndsWith("berry")).ToList();
            var containsResult = context.IceCreams.Where(i => i.Name.Contains("berry")).ToList();

            // With using DynamicLinqCompatibleMySqlMethodCallTranslatorProvider:
            Debug.Assert(startsWithResult.Count == 1);
            Debug.Assert(endsWithResult.Count == 2);
            Debug.Assert(containsResult.Count == 3);
            
            // Without using DynamicLinqCompatibleMySqlMethodCallTranslatorProvider:
            // Debug.Assert(startsWithResult.Count == 0);
            // Debug.Assert(endsWithResult.Count == 1);
            // Debug.Assert(containsResult.Count == 1);
        }
    }
}

@lauxjpn lauxjpn changed the title CONVERT() and COLLATE being added to SQL generated from Where() with StartsWith() filter incorrectly returns no results Pomelo's string comparison methods are case-sensitive by default (in contrast to other providers) Dec 23, 2019
@lauxjpn
Copy link
Collaborator

lauxjpn commented Dec 23, 2019

We should discuss the pros and cons of doing it as an opt-in or opt-out setting.

Pomelo's behavior prior to 3.0.0 was the following:

Method Case-sensitivity
String.Contains(string) case-sensitive
String.StartsWith(string) same as collation
String.EndsWith(string) same as collation
String.IndexOf(string) same as collation

Pomelo's behavior since 3.0.0 is the following:

Method Case-sensitivity
String.Contains(string) case-sensitive
String.StartsWith(string) case-sensitive
String.EndsWith(string) case-sensitive
String.IndexOf(string) case-sensitive

Because Pomelo translates the StringComparison parameter methods as well, this is not an issue but a feature as long as there aren't multiple providers being used.

I see the following pros and cons for keeping Pomelo's default implementations case-sensitive and adding an opt-in setting for compatibility with other providers. The opt-in setting would make all the default implementations case-insensitive, including Contains, that was already case-sensitive prior to Pomelo 3.0.0:

Pros:

  • It generally makes sense to keep the default provider behavior as close as possible to the .NET Core behavior.
  • Other providers will likely implement StringComparison translation support in the future as well, in which case they might want to change their string.Contains(string) implementation to be case-sensitive as well.
  • This will keep the current behavior as the default and therefore does not introduce a sudden breaking change in behavior after the major 3.0 release.

Cons:

  • Users of previous Pomelo versions might expect StartsWith, EndsWith and IndexOf (but not Contains) to be case-insensitive by default.
  • To work nicely together with other providers, users need to add an option when setting up their DbContext.

On the other hand, if all major providers are going to keep their default implementations (using the underlying collation) in the future, we could theoretically think about making this an opt-out setting and breaking with the current (and in case of Contains with previous) version.

Anybody any thoughts on that?

/cc @mguinness, @roji, @ajcvickers

@jemiller0
Copy link
Author

@lauxjpn Thanks for the extremely detailed and very helpful response. I apologize for the negativity in my last post. The code that you posted to fix up the expression tree is above and beyond the call of duty and is awesome.

I just tested the Pomelo 2.2 provider with a Contains() filter, and it did the search case-insensitive.

One thing I realized is that the string.Contains() with the StringComparson parameter only exists in .NET Core. .NET Framework doesn't have it. I'm surprised that it didn't have it.

I'm using ASP.NET Web Forms, so, I'm stuck on .NET Framework. I figured I would need to use string.IndexOf() instead. So far I've always just used Contains(). As I mentioned, the UI component that I'm using generates a Dynamic LINQ string for me that I add to my query using the Where() extension method that System.Linq.Dynamic provides that accepts a string.

I.e. I'm literally passing in a string like the following rather than a lambda function.

var q = oleContext.Bibs.AsQueryable();
q = q.Where("(BibExt.Contains(\"death\"))");

I don't see a "using System.Linq.Dynamic" in your code. Maybe we are referring to different types of "Dynamic LINQ"?

The package I'm using (System.Linq.Dynamic) was originally a sample that Microsoft created for LINQ. Then, someone outside Microsoft created a NuGet package for it. As far as I know, it's not really supported. So, I know I'm skating on thin ice on that. It's apparent lack of support for the StringComparison overloaded methods is an issue with that, not Pomelo.

I was able to use the following interceptor to remove the extra convert.

        public class MySqlCommandInterceptor : DbCommandInterceptor
        {
            public override InterceptionResult<DbDataReader> ReaderExecuting(DbCommand command, CommandEventData eventData, InterceptionResult<DbDataReader> result)
            {
                command.CommandText = Regex.Replace(command.CommandText, @"CONVERT\(('.*?') USING utf8mb4\) COLLATE utf8mb4_bin", "$1");
                return base.ReaderExecuting(command, eventData, result);
            }
        }

It's a little dangerous doing a text replace like that, but, it seems to be working.

If you could add a configuration option to the provider to use the old behavior, it would be appreciated. Otherwise, I'm happy to use the interceptor for the time being. Failing that, I can use the expression tree fix that you provided. That's a more reliable way to do it. Thanks for all the effort you put in writing that up. It's greatly appreciated, and I greatly appreciate you considering adding a configuration option.

@lauxjpn
Copy link
Collaborator

lauxjpn commented Dec 23, 2019

If you could add a configuration option to the provider to use the old behavior, it would be appreciated. Otherwise, I'm happy to use the interceptor for the time being.

We are definitely adding some kind of option. It will depend on how methods like Contains are intended to be used by the EF Core team, whether we will make the old behavior the default again or whether it will just be opt-in.


I just tested the Pomelo 2.2 provider with a Contains() filter, and it did the search case-insensitive.

You are right. It seems the Contains change was introduced by me in my first PR to the repo when upgrading the provider and tests to 2.2.6. I remember that I did it to make one of the 13.000 EF Core tests run correctly. So I did some research on this again and I think it is MySqlSimpleQueryTest.String_Contains_Literal(bool isAsync).

Its base class SimpleQueryTestBase<TFixture> implements a query like this:

return AssertQuery<Customer>(
    isAsync,
    cs => cs.Where(c => c.ContactName.Contains("M")),
    entryCount: 19);

So it uses the Contains() method and expects 19 customers to be returned from the Northwind database, all of them containing the character M in their name.

The same query is also executed by LINQ to Objects against a local client sided store and then the results from the database query and the local query are compared with each other.

The Northwind Customers contain 15 names with only a lower case m, 18 names with only an upper case M and 1 name containing both cases.

So there are 19 names with an upper case M, which is what the test is validating. But the test is also using Contains(string), so the EF Core test implies that the Contains(string) call needs to search for the names in a case-sensitive manner. Otherwise it would return 34 customers, which is what happend in previous versions of Pomelo that implemented Contains(string).

Looking at this now though, the test does not explicitly state that Contains(string) has to always search in a case-sensitive manner (by implementing it this way). Only that it needs to do this for this particular Northwind database test.

So if Pomelo's Northwind database uses a case-insensitive collation by default, while SQL Server's uses a case-sensitive collation by default, this phenomenon could also be explained and would suddenly not lead to the conclusion, that Contains(string) searches in a case-sensitive manner, but that Contains(string) searches with whatever database collation is defined for the column and that it is supposed to be case-sensitive for this Northwind database column (and that the database collation Pomelo is using for the column is just wrong).

That is probably what happened here and is the reason the behavior was working in the same way as in other providers (using the columns collation) until 2.2.6.

I took another look into the Northwind.sql script that the SQL Server tests are using and I couldn't find any reference to any collation at all. So it might be a quite hidden implicit implementation detail, that made us implement the change in the first place.

Though only the EF Core team can say for sure. /cc @ajcvickers, @roji

This is also of interest to us now, because I already converted the Northwind.sql script to a MySQL compatible one, which we intend to use for future testing (currently we still rely on EF Core's database creation methods and manually add some views after that; see dotnet/efcore#18111). So we might need to add some default collation to the script.


I don't see a "using System.Linq.Dynamic" in your code. Maybe we are referring to different types of "Dynamic LINQ"?

I did not really test it against Dynamic LINQ, but because Dynamic LINQ needs to return an IQueryable at some point, it will work with that as well. Using the rewriter and extension from my previous sample program, your code snippet could be modified like the following and should just work:

var q = oleContext.Bibs
    .AsQueryable()
    .Where("(BibExt.Contains(\"death\"))")
    .MakeCaseInsensitive();

I was able to use the following interceptor to remove the extra convert.
It's a little dangerous doing a text replace like that, but, it seems to be working.

Looks good, is easy to understand and simple enough to maintain, as long as you don't use any explicit case-sensitive querying in your code. I would stick with that until we decided how to solve this issue.


I apologize for the negativity in my last post.

Don't worry about it. We have all been there, where you urgently need to implement something and then realize that some other fundamental functionality isn't working. So you spent way too much time until you are sure, that it's actually some library that is the culprit. And then you have to spend even more time to ensure, that this issue hasn't been reported yet, wondering why such basic functionality is apparently not used by anybody else, and whether you are the only one using the library in production code. And then you still need to post a detailed bug report (by which time your blood sugar is already way too low), only to find out that either nobody answers at all, nobody is taking you seriously or somebody just denies that this is actually a problem.

I think Hal fixing a light bulb is the best example of this kind of yak shaving.

@jemiller0
Copy link
Author

Thanks again @lauxjpn.

On a separate note, it is on my to do list to try to get case-insensitive searches working on PostgreSQL. I'm working on another project that uses a PostgreSQL database that has a schema that I have no control over. As far as I know, there is no way to configure PostgreSQL at the database-level for case-insensitivity. PostgreSQL 12 has some new support for case-insensitive collations, but, there aren't predefined collations like MySQL and other databases have. Aa far as I can tell, there is no way to set it at the database-level, or even the table level. PostgreSQL does have a CITEXT data type. However, as I mentioned, it is an existing database schema that I have no control over. The way the database is setup, it has indexes on the columns that use the LOWER() function. So, where ever you have a WHERE with a filter, you need to use LOWER() for the index to be used. I've been wondering if there is a way to configure the PostgreSQL EF Core provider to automatically add those. Luckily, I think maybe I can do it using the new interceptor functionality. Though, again, using regexes with a text replace may not be reliable. I'm surprised that PostgreSQL doesn't have a way to configure case-insensitive collation at the database level like every other DBMS I've worked with does. I don't like the way it wants you to use a non-standard SQL data type (CITEXT) for it.

Actually, maybe I do something like what you did with the expression tree for that. I will have to look into that further.

Thanks again for you help and have a nice holiday.

@lauxjpn
Copy link
Collaborator

lauxjpn commented Dec 23, 2019

I think Npgsql does support ILIKE:

The key word ILIKE can be used instead of LIKE to make the match case-insensitive according to the active locale. This is not in the SQL standard but is a PostgreSQL extension.

So depending on your scenario, if you are using EF Core directly in the traditional way, you can just use it like this:

.Where(c => EF.Functions.ILike(c.Name, "foo%"))

Or when using EF Core with Dynamic LINQ, you might be able to do some switcheroo in the spirit of the rewriter of my previous example, and replace Contains, StartsWith and EndsWith with corresponding EF.Functions.ILike() calls.

But @roji is the expert and maintainer of Npgsql and will know more about these implementation details.

I wish you some nice holidays as well!

@roji
Copy link

roji commented Dec 24, 2019

A bit of EF Core and PostgreSQL context on this complex and oft-discussed problem :)

While it's true that EF strives to mimic LINQ to Objects to object whenever possible, as a matter of principle we don't do that in all places and at all costs, especially where the SQL involved would be inferior performance-wise and/or too complicated/brittle. I'd add to that that in my opinion, if a translation of a very trivial/obvious LINQ expression (such as string equality) would result in a potentially badly performing SQL (i.e. no index use although that would be reasonably expected), there may be reason to avoid that translation even if it's more correct. Simply put, we tend to avoid coercing databases to behavior which isn't native/natural to them.

String comparisons is, unfortunately, a domain where databases have very different behaviors, and where C# behavior isn't always possible to replicate. In some databases case sensitivity is determined in advance on a database basis (via a collation definition), in others on column-by-column basis, in others (e.g. PostgreSQL) via a different database type. Determining comparison behavior in advance is typically required because databases have indexes (which need to be set up and maintained in advance), whereas C# doesn't. That's why EF Core providers typically don't attempt to provide case-insensitive comparisons where the database doesn't provide that naturally and efficiently. We also think that users of a specific database are minimally familiar with their database behavior, and some divergence from LINQ to Objects behavior to the database behavior is tolerable in these cases.

Despite the above, IMHO it can still make sense to support comparison operations that explicitly specify StringComparison - even when the SQL translation would be inefficient - since the user is very explicitly opting into that specific behavior. But what's important IMHO is that the default behavior (without StringComparison) results in efficient SQL, otherwise we're creating a pretty huge pit of failure for users.

@lauxjpn I'm not familiar enough with MySQL and user expectations. But if the translation of the equals operator (or the StartsWith overload without a StringComparison) involves inserting additional SQL nodes (UPPER/LOWER/CONVERT) to force case-sensitivity, and those nodes would typically prevent an index from being used, then it may be worth reconsidering... In any case I'm sure @ajcvickers will have some thoughts on this.

PostgreSQL does have a CITEXT data type. However, as I mentioned, it is an existing database schema that I have no control over. The way the database is setup, it has indexes on the columns that use the LOWER() function. So, where ever you have a WHERE with a filter, you need to use LOWER() for the index to be used. I've been wondering if there is a way to configure the PostgreSQL EF Core provider to automatically add those.

I don't think that's quite how the CITEXT type works: the whole point is that the details of executing LOWER are executed behind the scenes by PostgreSQL, so you (and the EF provider) don't have to worry about it. When you perform a string comparison on a CITEXT column, it will automatically use any indexes on that column without having LOWER specified in SQL. Here's some SQL to prove the point:

CREATE EXTENSION citext;
CREATE TABLE data (id INT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY, text TEXT, citext CITEXT);
CREATE INDEX IX_text ON data (text);
CREATE INDEX IX_citext ON data (citext);

DO $$BEGIN
 FOR i IN 1..10000 LOOP
   INSERT INTO data (text, citext) VALUES ('text' || i, 'citext' || i);
 END LOOP;
END$$;

EXPLAIN SELECT id FROM data WHERE text = 'TEXT8';
EXPLAIN SELECT id FROM data WHERE citext = 'CITEXT8';

The output of EXPLAIN shows an index scan using ix_citext as desired, and the result is correct. ILIKE is an additional operator that works on both TEXT and CITEXT types, so it's not exactly related; LIKE on a CITEXT should also be case-insensitive. Note that version 3.1 of the Npgsql provider also added some important fixes for CITEXT handling (see npgsql/efcore.pg#388).

@jemiller0 are you seeing any different behavior?

@jemiller0 one last point about removing the CONVERT calls; rather than running a regex replace, it would probably be better to hook into EF Core's translation, either providing your own translation for StartsWith (or whatever) without CONVERT, or scanning the tree afterwards and removing unwanted Convert nodes. This would be much less brittle and would also be more efficient (no need to run a regex on each execution).

@lauxjpn
Copy link
Collaborator

lauxjpn commented Dec 24, 2019

@roji So the general consensus would be then to keep the default string method behavior (for methods without explicit StringComparison parameter) to just use the current database, table, column or collation behavior (whatever applies to the DBMS in question) regarding case-sensitivity.

It would be consistent then across providers, that these methods depend on the actual settings of the underlying database (which might be under the control of the user) and therefore could always work in an optimized way (but might not return the same results across different DBMS).

I agree that if the database supports explicit case in-/sensitivity when comparing, while still making use of indices (which we have implemented in Pomelo for most cases), then implementing the methods with explicit StringComparison parameter makes sense.

If the database cannot make use of indices, then the user could also just add some .ToUpper()/.ToLower() conversion to his LINQ expression instead, though having all providers translating these methods would result in a better experience in multi provider projects.

Let's see what @ajcvickers thinks about that. In case this is how it should be, then it might either make sense for EF Core to document, that these methods likely dependent on the underlying database (though this is technically true for every method), or for providers to document their individual behavior of those methods.

Personally, I think it would be great to have a provider specific list of all the translatable methods (and document behaviors where it makes sense), so users can easily lookup what is supported.

@jemiller0 one last point about removing the CONVERT calls; rather than running a regex replace, it would probably be better to hook into EF Core's translation, either providing your own translation for StartsWith (or whatever) without CONVERT, or scanning the tree afterwards and removing unwanted Convert nodes. This would be much less brittle and would also be more efficient (no need to run a regex on each execution).

The two workarounds I posted above do basically that. One is translating the method calls differently in EF Core's query pipeline, while the other does not post-process the finished expression tree, but instead pre-processes the LINQ expression before it is handed off to the query pipeline.

@jemiller0
Copy link
Author

jemiller0 commented Dec 24, 2019

Thanks @roji. So far, I haven't used CITEXT or ILIKE. The database that I'm using was designed by someone else and they are using a convention where they are explicitly calling LOWER() on string columns for all the indexes and when they do filtering in a WHERE clause. They are also calling a custom function f_unaccent() which removes accented characters. I switched to using Dapper for some of my code and I was able to add calls to lower(f_unaccent()) where needed. It's on my to do list to get that working for EF Core. If I remember correctly, there is a way to define a DbFunction which gets converted to SQL. I don't remember if that was for EF 6 or EF Core off hand. Regarding CITEXT, one thing I don't like about that, is that it appears to not allow you to include a length value like with VARCHAR(n)? I like having length values since I use that info in a reverse engineering tool that I have which generates web pages for viewing the database. Having the length value is helpful so that I can know whether to use a text box or a multi-line text box in the UI. Personally, I am hoping that one day PostgreSQL adds case-insensitive collation support at the database-level. Ideally, I would like to use SQL standard data types and just use LIKE rather than ILIKE (which as far as I know is PostgreSQL specific). Version 12 has some new support, but, I wasn't able to get it to work, and I think it had to be set at the column-level. I tried it, but, it didn't seem to work. I was wondering if it was only intended to be used for sorting and not comparisons. There's a definite chance that I just didn't know what I was doing though and did something wrong. Ideally, I would like it at the database-level, so, that things work like other databases I work with such as SQL Server and MySQL. Historically, I have always configured the database as case-insensitive, which is usually the default from what I've seen. I think maybe Oracle defaults to case sensitive. I don't have much experience with Oracle. Admittedly, I guess comparison operations are slower for case-insensitive. So, using a LOWER() in an index would be faster. I have no idea, how much faster it is in practice. Personally, from an ease of use perspective, I prefer everything be case-insensitive by default. I can't think of a case off hand where a user would want to search a field in a case-sensitive manner. I could see a benefit to something like a foreign key value, but, I always use INTs for primary keys.

I did think that was pretty cool how you can explicitly do a case-sensitive search in MySQL though, using CONVERT() like how the new Pomelo provider does it. I didn't know that was possible. It would come in handy if I ever needed to do that. It's just that that isn't what I want by default.

@roji
Copy link

roji commented Dec 24, 2019

AFAIK, as a general rule, once you have an index defined on a string column, then simple string operations on that column will take advantage of that index: comparison, length, LIKE, etc. However, once a function such as LOWER/UPPER or CONVERT is applied to the column, the index will no longer be used. It's still possible for the user to define an expression index specifically for that function invocation on that column, but that's quite specific to the operation in question, requires relatively advanced user understanding and also has a cost during row modifications. This is why I prefer that users add ToLower/ToUpper very explicitly - expressing the added operation (which typically prevents index use) rather than for a provider to implicitly add them.

In any case, @lauxjpn as you say, let's see with @ajcvickers (and probably the rest of the team).

Personally, I think it would be great to have a provider specific list of all the translatable methods (and document behaviors where it makes sense), so users can easily lookup what is supported.

I agree. I tried to do this partially for Npgsql (e.g. this example, there's more in other pages). Personally, I think it would be great to have a provider specific list of all the translatable methods (and document behaviors where it makes sense), so users can easily lookup what is supported.)) but it's definitely not systematic and missing from SQL Server an Sqlite. On the other hand, since we dropped client evaluation in 3.0 users can simply try an expression out to see if it works.

Admittedly, I guess comparison operations are slower for case-insensitive. So, using a LOWER() in an index would be faster.

That doesn't make much sense to me. Sure, a pure case-sensitive comparison is faster than case-insensitive, since in principle characters can be simply compared (although Unicode introduces some important complications here - there can be multiple ways to encode the same character). However, as I wrote above, the most important factor in database performance is index usage, and using functions such as LOWER generally prevents index use.

@jemiller0,

If I remember correctly, there is a way to define a DbFunction which gets converted to SQL. I don't remember if that was for EF 6 or EF Core off hand

Custom user functions are supports in both EF6 and EF Core. But I'd examine the performance impact of using lower(f_unaccent()) everything (i.e. what it means for indices, as I wrote above).

Regarding CITEXT, one thing I don't like about that, is that it appears to not allow you to include a length value like with VARCHAR(n)? I like having length values since I use that info in a reverse engineering tool that I have which generates web pages for viewing the database

That's true, there's no CITEXT(n). I'm not sure that should be a decisive factor for choosing database types, although that's going to depend on your application/scenario. A typical way to work, is to have a code model where you have [StringLength] annotating maximum length, regardless of whether that impacts your database schema or not; a web framework such as ASP.NET Core would pick up this attribute for length validation. This decouples your database schema from your web framework validation, even as you continue using a single code model as a source of truth for both.

Personally, from an ease of use perspective, I prefer everything be case-insensitive by default. I can't think of a case off hand where a user would want to search a field in a case-sensitive manner.

I'm not sure I agree here; string comparisons aren't just about user-facing search boxes... It seems to me that in-database comparison should be default by case-insensitive just like most programming languages - but I agree that a case-insensitive option should be available and easy to use.

Version 12 has some new support

Yeah, I'm vaguely aware of this but haven't had yet time to look into it. I've opened #1175 to track looking into this.

@jemiller0
Copy link
Author

Regarding case-sensitive searches being faster then case-insensitive, that is not something that I would give a lot of weight to myself. It's an argument that I have heard from a number of PostgreSQL users. The argument that I have heard is that PostgreSQL should not support case-insensitive collations because it is slower and it is better to create all indexes using LOWER() and compare to lowercase. From an ease of use perspective, I don't agree with this. IMHO, in a perfect world, all databases would support case-insensitive and case-sensitive collations at the database level and developers could use the same code and swap out back end databases and not have to make a bunch of back end specific changes. Honestly, given how many advanced features PostgreSQL has, I'm amazed that it lacks this functionality. That would be the chief complaint I have about it. There are a few other annoyances, like having to quote mixed-case identifiers (I really wish they would fix the idiotic behavior where it folds things to lowercase and doesn't preserve case). Other than that, it works well.

@lauxjpn
Copy link
Collaborator

lauxjpn commented Dec 25, 2019

AFAIK, as a general rule, once you have an index defined on a string column, then simple string operations on that column will take advantage of that index: comparison, length, LIKE, etc. However, once a function such as LOWER/UPPER or CONVERT is applied to the column, the index will no longer be used.

I checked this against our string comparison operations, that are using CONVERT() to ensure the charset and then use either a case-insensitive or binary collation to force a CI or CS comparison, and what you state is also true for MySQL.

An index is only used, if the collation of the indexed column matches. This goes so far in MySQL, that even a conversion of the indexed column to the same charset and collation of itself will invalidate the index (though doing that to the other literal operand will work):

-- City uses CHARSET utf8mb4 with COLLATION utf8mb4_0900_ai_ci

-- Does use index
SELECT SQL_NO_CACHE * 
FROM `Customers`
where `City` like (convert('M%' using utf8mb4) collate utf8mb4_0900_ai_ci);

-- Does not use index
SELECT SQL_NO_CACHE * 
FROM `Customers`
where (convert(`City` using utf8mb4) collate utf8mb4_0900_ai_ci) like 'M%';

-- Does not use index
SELECT SQL_NO_CACHE * 
FROM `Customers`
where (convert(`City` using utf8mb4) collate utf8mb4_0900_ai_ci) like (convert('M%' using utf8mb4) collate utf8mb4_0900_ai_ci);

But then, when doing an exact comparison, the index will additionally be used when the literal operand's collation is changed to a binary one:

-- City uses CHARSET utf8mb4 with COLLATION utf8mb4_0900_ai_ci

-- Does use index
SELECT SQL_NO_CACHE * 
FROM `Customers`
where `City` = (convert('Berlin' using utf8mb4) collate utf8mb4_bin);

-- Does not use index
SELECT SQL_NO_CACHE * 
FROM `Customers`
where `City` = (convert('Berlin' using utf8mb4) collate utf8mb4_0900_as_cs);

-- Does not use index
SELECT SQL_NO_CACHE * 
FROM `Customers`
where `City` like (convert('Berlin' using utf8mb4) collate utf8mb4_bin);

-- Does not use index
SELECT SQL_NO_CACHE * 
FROM `Customers`
where `City` like (convert('B%' using utf8mb4) collate utf8mb4_bin);

Finally, having an index column with a different collation than the default one and comparing it to a string literal that has not been assigned an explicit collation (though the implicit collation will be utf8mb4_0900_ai_ci on MySQL 8+) will use the index as well:

-- City uses CHARSET latin1 with COLLATION latin1_swedish_ci
-- MySQL 8.0+ uses CHARSET utf8mb4 with COLLATION utf8mb4_0900_ai_ci as a default.

-- Does use index
-- (latin1_swedish_ci, IMPLICIT) like (utf8mb4_0900_ai_ci, EXPLICIT) works
SELECT SQL_NO_CACHE * 
FROM `Customers`
where `City` like 'M%';

Anyway, it seems we will need to reoptimize our string method handling for the column-to-column and column-to-literal comparison scenarios again, now that we support charset and collation annotations, and might need to add some options to control their usage. And we need to make sure, that our default implementations honor the database collations again, instead of using the explicit implementations that (depending on the scenario) might not use indices.

@lauxjpn
Copy link
Collaborator

lauxjpn commented Dec 25, 2019

I thought about this a bit more, and I think we will categorize this divergence as a regression and will therefore revert it back to its pre-2.2.6 state right away as a patch release.

@lauxjpn lauxjpn changed the title Pomelo's string comparison methods are case-sensitive by default (in contrast to other providers) Pomelo's 2.2.6+ string comparison methods are case-sensitive by default (in contrast to other providers) Dec 25, 2019
@roji
Copy link

roji commented Dec 25, 2019

Thanks for looking into this @lauxjpn, it's great to see this kind of research on MySQL!

This goes so far in MySQL, that even a conversion of the indexed column to the same charset and collation of itself will invalidate the index (though doing that to the other literal operand will work)

That makes a lot of sense. Any operation on a literal is done only once when processing the query, and can even be cached in a query plan (assuming one exists in MySQL) since it's effectively a constant - so there's no reason it would affect index usage. However, operations on a column work very differently, and since they must be performed on each and every row value they typically preclude index usage.

Bottom line, for a simple C# comparison I'd expect to see the simple SQL equivalent:

db.Cities.Where(c => c.Name == "Berlin") => WHERE Name = 'Berlin'
db.Cities.Where(c => c.Name.StartsWith("B")) => WHERE Name LIKE 'B%'

... whose exact behavior would depend on how the database and/or column were set up, and which should always use a naively-configured index without any advanced techniques. But let's see if @ajcvickers confirms these ideas before making any big changes. @bricelam would also probably be interested.

Finally, for StartsWith specifically note that a simple translation with LIKE is usually insufficient since LIKE has escape characters (i.e. %, _), so some additional logic is needed to escape those (depending on whether the pattern is a constant or a pattern/column). Take a look at what SQL Server or PostgreSQL do.

@lauxjpn
Copy link
Collaborator

lauxjpn commented Dec 25, 2019

We had a working implementation prior to the 3.0.0 upgrade (or prior to the 2.2.6 upgrade regarding Contains()) and also a SQL Server code based one for some time in our 3.0.0 alphas before I "simplified" it by just calling the StringComparison implementations. So one of those two implementations should do.

While I am at it, I added some benchmarking tests, to measure different implementation performances. As it turns out, our Contains() implementation, that uses LOCATE('foo', 'barfoobar') > 0 takes twice as long as a traditional 'barfoobar' LIKE '%foo%' (none of the queries make use of indices when run against a table, which makes sense):

set @iterations = 100000000;

-- 14.516 sec.
SELECT BENCHMARK(@iterations, LOCATE('foo','barfoobar'));

-- 15.844 sec.
SELECT BENCHMARK(@iterations, LOCATE('foo','barfoobar') > 0);

-- 8.484 sec.
SELECT BENCHMARK(@iterations, 'barfoobar' like '%foo%');

So there is still some room for performance improvements.

@jemiller0
Copy link
Author

@lauxjpn Actually, in my database, it seemed like an index was still being used even though you were casting it to case-sensitive. The database is set for utf8mb4 case-insensitive, and you were casting it to bin for case-sensitive. I would need to go back and double-check, but, it's a table with 8 million rows in it and it was returning pretty fast. I was happy to see that it looked like it still was using the index. Maybe I'm wrong about that as I didn't do an EXPLAIN.

@lauxjpn
Copy link
Collaborator

lauxjpn commented Dec 25, 2019

@jemiller0 Yes, that is expected in your StartsWith() case.

This is a simplified version of your query:

SELECT *
FROM `uc_bib_ext` AS `u`
WHERE `u`.`title` IS NOT NULL
AND `u`.`title` LIKE CONCAT('death', '%')
AND LEFT(`u`.`title`, CHAR_LENGTH(CONVERT('death' USING utf8mb4) COLLATE utf8mb4_bin)) = 
    CONVERT('death' USING utf8mb4) COLLATE utf8mb4_bin

The conversion with the binary collation happens after the result got already filtered by the LIKE clause (that uses the index), so there is only little work left for the LEFT(utf8mb4_bin) = utf8mb4_bin part.

If on the other hand you would remove the pre-filtering (that is technically not necessary and would still lead to the same result), the query should perform significantly worse:

SELECT *
FROM `uc_bib_ext` AS `u`
WHERE `u`.`title` IS NOT NULL
/* AND `u`.`title` LIKE CONCAT('death', '%') */ /* Disable the pre-filter. */
AND LEFT(`u`.`title`, CHAR_LENGTH(CONVERT('death' USING utf8mb4) COLLATE utf8mb4_bin)) = 
    CONVERT('death' USING utf8mb4) COLLATE utf8mb4_bin;

@jemiller0
Copy link
Author

@lauxjpn As far as I know, LOCATE() would never use an index, even without the CONVERT()? If it can use an index, that's new news to me. I know with PostgreSQL, you can use a GIN index. I haven't heard of MySQL having that, but, I will be happy to hear if it does.

@lauxjpn
Copy link
Collaborator

lauxjpn commented Dec 25, 2019

@jemiller0 I changed to query in my original post to be more in line with your original one.

The point I am making here exactly what you are saying. The moment you remove the pre-filtering (or make it useless as in the LOCATE() example, with a LIKE '%' before the LOCATE()), your query over 8 Mio. rows should perform significantly worse.

But it would definitely be interesting if you test those queries (or something similar) on you large database, compare the performance and post the results. Maybe I am missing something!

@lauxjpn
Copy link
Collaborator

lauxjpn commented Apr 2, 2020

The FunkyDataQueryTestBase.String_contains_on_argument_with_wildcard_column() test assumes, that searching for an empty string in a NULL value should result in a match.

I consider this unexpected, even though the SQL Server provider does implement this behavior.

We will therefore override theses tests.

(This applies only to the String_contains_on_argument_with_wildcard_* tests, the String_starts_with_* and String_ends_with_* tests work as expected.)

/cc @ajcvickers, @roji

@lauxjpn
Copy link
Collaborator

lauxjpn commented Apr 2, 2020

@smitpatel There is another issue as well:

The following code does not handle NULL values for predicates:

https://github.com/dotnet/efcore/blob/2aaaa35a2e50a9106bbe0f98052bb90486fd4a26/src/EFCore.Relational/Query/SqlExpressions/SelectExpression.cs#L228-L240

public void ApplyPredicate([NotNull] SqlExpression expression)
{
    Check.NotNull(expression, nameof(expression));


    if (expression is SqlConstantExpression sqlConstant
        && (bool)sqlConstant.Value)
    {
        return;
    }


    if (Limit != null
        || Offset != null)
    {

In SQL, predicates can however result in NULL values. The following code is an example, where the SQL Server provider creates predicate, that will result in NULL when executed on the server:

https://github.com/dotnet/efcore/blob/2aaaa35a2e50a9106bbe0f98052bb90486fd4a26/src/EFCore.SqlServer/Query/Internal/SqlServerStringMethodTranslator.cs#L312-L319

// The pattern is constant. Aside from null or empty, we escape all special characters (%, _, \)
// in C# and send a simple LIKE
if (!(constantExpression.Value is string constantString))
{
    return _sqlExpressionFactory.Like(
        instance,
        _sqlExpressionFactory.Constant(null, stringTypeMapping));
}

So if we know in advance, that an expression will result in NULL, we should be able to optimize/replace it by returning a constant expression of NULL. Unfortunately ApplyPredicate currently throws in this case, if I return something like the following:

return _sqlExpressionFactory.Constant(null, RelationalTypeMapping.NullMapping);

@ajcvickers
Copy link

@lauxjpn I'll let @smitpatel or @roji comment on the specifics here, but see also this discussion about handling empty strings: dotnet/efcore#19402 We had made the call that the empty string is always contained in any other string, but this turned out to be a big can of worms, so we're kind of back to basics here discussing what we should do.

@lauxjpn
Copy link
Collaborator

lauxjpn commented Apr 2, 2020

but see also this discussion about handling empty strings: dotnet/efcore#19402

@ajcvickers Ah, the good old ignorance of trailing spaces in SQL Server :) Thanks for the heads-up!

@smitpatel
Copy link

Can you post exception message and stack trace too?

@lauxjpn
Copy link
Collaborator

lauxjpn commented Apr 2, 2020

Can you post exception message and stack trace too?

@smitpatel

System.NullReferenceException: Object reference not set to an instance of an object.
  at Microsoft.EntityFrameworkCore.Query.SqlExpressions.SelectExpression.ApplyPredicate(SqlExpression expression) in E:\Sources\EntityFrameworkCore-3.1\src\EFCore.Relational\Query\SqlExpressions\SelectExpression.cs:212
  at Microsoft.EntityFrameworkCore.Query.RelationalQueryableMethodTranslatingExpressionVisitor.TranslateWhere(ShapedQueryExpression source, LambdaExpression predicate) in E:\Sources\EntityFrameworkCore-3.1\src\EFCore.Relational\Query\RelationalQueryableMethodTranslatingExpressionVisitor.cs:943
  at Microsoft.EntityFrameworkCore.Query.QueryableMethodTranslatingExpressionVisitor.VisitMethodCall(MethodCallExpression methodCallExpression) in E:\Sources\EntityFrameworkCore-3.1\src\EFCore\Query\QueryableMethodTranslatingExpressionVisitor.cs:427
  at Microsoft.EntityFrameworkCore.Query.RelationalQueryableMethodTranslatingExpressionVisitor.VisitMethodCall(MethodCallExpression methodCallExpression) in E:\Sources\EntityFrameworkCore-3.1\src\EFCore.Relational\Query\RelationalQueryableMethodTranslatingExpressionVisitor.cs:71
  at at System.Linq.Expressions.MethodCallExpression.Accept(ExpressionVisitor visitor)
  at at System.Linq.Expressions.ExpressionVisitor.Visit(Expression node)
  at Microsoft.EntityFrameworkCore.Query.QueryableMethodTranslatingExpressionVisitor.VisitMethodCall(MethodCallExpression methodCallExpression) in E:\Sources\EntityFrameworkCore-3.1\src\EFCore\Query\QueryableMethodTranslatingExpressionVisitor.cs:58
  at Microsoft.EntityFrameworkCore.Query.RelationalQueryableMethodTranslatingExpressionVisitor.VisitMethodCall(MethodCallExpression methodCallExpression) in E:\Sources\EntityFrameworkCore-3.1\src\EFCore.Relational\Query\RelationalQueryableMethodTranslatingExpressionVisitor.cs:71
  at at System.Linq.Expressions.MethodCallExpression.Accept(ExpressionVisitor visitor)
  at at System.Linq.Expressions.ExpressionVisitor.Visit(Expression node)
  at Microsoft.EntityFrameworkCore.Query.QueryCompilationContext.CreateQueryExecutor[TResult](Expression query) in E:\Sources\EntityFrameworkCore-3.1\src\EFCore\Query\QueryCompilationContext.cs:70
  at Microsoft.EntityFrameworkCore.Storage.Database.CompileQuery[TResult](Expression query, Boolean async) in E:\Sources\EntityFrameworkCore-3.1\src\EFCore\Storage\Database.cs:71
  at Microsoft.EntityFrameworkCore.Query.Internal.QueryCompiler.CompileQueryCore[TResult](IDatabase database, Expression query, IModel model, Boolean async) in E:\Sources\EntityFrameworkCore-3.1\src\EFCore\Query\Internal\QueryCompiler.cs:106
  at Microsoft.EntityFrameworkCore.Query.Internal.QueryCompiler.<>c__DisplayClass9_0`1.<Execute>b__0() in E:\Sources\EntityFrameworkCore-3.1\src\EFCore\Query\Internal\QueryCompiler.cs:96
  at Microsoft.EntityFrameworkCore.Query.Internal.CompiledQueryCache.GetOrAddQueryCore[TFunc](Object cacheKey, Func`1 compiler) in E:\Sources\EntityFrameworkCore-3.1\src\EFCore\Query\Internal\CompiledQueryCache.cs:84
  at Microsoft.EntityFrameworkCore.Query.Internal.CompiledQueryCache.GetOrAddQuery[TResult](Object cacheKey, Func`1 compiler) in E:\Sources\EntityFrameworkCore-3.1\src\EFCore\Query\Internal\CompiledQueryCache.cs:59
  at Microsoft.EntityFrameworkCore.Query.Internal.QueryCompiler.Execute[TResult](Expression query) in E:\Sources\EntityFrameworkCore-3.1\src\EFCore\Query\Internal\QueryCompiler.cs:92
  at Microsoft.EntityFrameworkCore.Query.Internal.EntityQueryProvider.Execute[TResult](Expression expression) in E:\Sources\EntityFrameworkCore-3.1\src\EFCore\Query\Internal\EntityQueryProvider.cs:79
  at Microsoft.EntityFrameworkCore.Query.Internal.EntityQueryable`1.GetEnumerator() in E:\Sources\EntityFrameworkCore-3.1\src\EFCore\Query\Internal\EntityQueryable`.cs:94
  at at System.Collections.Generic.LargeArrayBuilder`1.AddRange(IEnumerable`1 items)
  at at System.Collections.Generic.EnumerableHelpers.ToArray[T](IEnumerable`1 source)
  at at System.Linq.Enumerable.ToArray[TSource](IEnumerable`1 source)
  at Microsoft.EntityFrameworkCore.TestUtilities.QueryAsserter`1.AssertQuery[TResult](Func`2 actualQuery, Func`2 expectedQuery, Func`2 elementSorter, Action`2 elementAsserter, Boolean assertOrder, Int32 entryCount, Boolean isAsync, String testMethodName) in E:\Sources\EntityFrameworkCore-3.1\test\EFCore.Specification.Tests\TestUtilities\QueryAsserter.cs:90

Should be enough for (bool)sqlConstant.Value to check for null first (null cannot be cast to bool) or something similar.

@smitpatel
Copy link

Filed dotnet/efcore#20498

@lauxjpn
Copy link
Collaborator

lauxjpn commented Apr 2, 2020

@ajcvickers I think it is important that two tests for each "optimized" version of Contains(), StartsWith(), EndsWith() and IndexOf() (e.i. the version without StringComparison parameter, that just uses whatever option/collation is defined in the database) are added to the EF Core test suite.

One test of each test pair should execute against a case insensitive database, table or column, while the other test should do the same against a case sensitive database, table or column.

This should prevent other providers in the future from making the same mistake I did, by assuming the results of those "optimized" methods is based on the documented default behavior of their .NET System.String method counterparts, instead of the implicit default collation assumed by the EF Core tests.

@mguinness
Copy link
Collaborator

The MySQL docs state the follollowing in 3.3.4.7 Pattern Matching:

In MySQL, SQL patterns are case-insensitive by default.

This turns out to be false. I checked all major MySQL versions since 5.6 for their pattern behavior regarding LIKE, and they use the underlying collation for comparison.

Correct, that is misleading but is clarified in Case Sensitivity in String Searches:

The default character set and collation are utf8mb4 and utf8mb4_0900_ai_ci, so nonbinary string comparisons are case-insensitive by default. This means that if you search with col_name LIKE 'a%', you get all column values that start with A or a. To make this search case-sensitive, make sure that one of the operands has a case-sensitive or binary collation.

The MariaDB documentation for LIKE is a little clearer about case sensitivity:

LIKE performs case-insensitive substring matches if the collation for the expression and pattern is case-insensitive. For case-sensitive matches, declare either argument to use a binary collation using COLLATE, or coerce either of them to a BINARY string using CAST.

@ajcvickers
Copy link

@lauxjpn I filed dotnet/efcore#20501

@roji
Copy link

roji commented Apr 20, 2020

@lauxjpn, I've seen your question in dotnet/efcore#20610 and your recent change in #1057 and wanted to ask a couple questions. I recently took a deep loop at collations (include some MySQL experiments) and want to make sure we're more or less aligned on things.

First, you wrote above that "when doing an exact comparison, the index will additionally be used when the literal operand's collation is changed to a binary one", i.e. that equality on a column using some non-binary collation will use an index if the column is compared to a literal with binary collation. I've done a quick test and I couldn't see this behavior... From my tests a column index only ever gets used if the literal has the exact same collation as the column.

Code used to test index usage
SELECT VERSION(); -- 8.0.19-0ubuntu0.19.10.3

DROP TABLE data;
CREATE TABLE data (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ci VARCHAR(256) CHARSET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
    bin VARCHAR(256) CHARSET utf8mb4 COLLATE utf8mb4_bin);
CREATE INDEX ix_ci ON data(ci);
CREATE INDEX ix_bin ON data(bin);

DROP PROCEDURE IF EXISTS myloop;
DELIMITER //
CREATE PROCEDURE myloop()
BEGIN
    DECLARE i INT DEFAULT 0;
    
    START TRANSACTION;
    WHILE (i < 50000) DO
            INSERT INTO data (ci, bin) values (i, i);
            SET i = i+1;
        END WHILE;
    COMMIT;
END;
//
CALL myloop(); 

-- No explicit collations, indexes always used
EXPLAIN ANALYZE SELECT id FROM data WHERE ci='hello';
EXPLAIN ANALYZE SELECT id FROM data WHERE bin='hello';

-- Collate on column, indexes never used
EXPLAIN ANALYZE SELECT id FROM data WHERE ci COLLATE utf8mb4_0900_ai_ci='hello';
EXPLAIN ANALYZE SELECT id FROM data WHERE bin COLLATE utf8mb4_bin='hello';

-- Collate on literal, uses index only when the collation matches the column's
EXPLAIN ANALYZE SELECT id FROM data WHERE ci='hello' COLLATE utf8mb4_0900_ai_ci; -- yes
EXPLAIN ANALYZE SELECT id FROM data WHERE ci='hello' COLLATE utf8mb4_bin; -- no
EXPLAIN ANALYZE SELECT id FROM data WHERE bin='hello' COLLATE utf8mb4_0900_ai_ci; -- no
EXPLAIN ANALYZE SELECT id FROM data WHERE bin='hello' COLLATE utf8mb4_bin; -- yes

-- Same as above, explicitly specifying the charset
EXPLAIN ANALYZE SELECT id FROM data WHERE ci=(CONVERT('hello' USING utf8mb4) COLLATE utf8mb4_0900_ai_ci);
EXPLAIN ANALYZE SELECT id FROM data WHERE ci=(CONVERT('hello' USING utf8mb4) COLLATE utf8mb4_bin);
EXPLAIN ANALYZE SELECT id FROM data WHERE bin=(CONVERT('hello' USING utf8mb4) COLLATE utf8mb4_0900_ai_ci);
EXPLAIN ANALYZE SELECT id FROM data WHERE bin=(CONVERT('hello' USING utf8mb4) COLLATE utf8mb4_bin);

Now, looking at #1057, the provider seems to continue to translate string operations with StringComparison, applying the LCASE function for StringComparison values that have IgnoresCase, and a COLLATE utfmb4_bin collation for the others (case-sensitive). The problems I see with this are:

  • LCASE causes indexes to not be used. Note that this seems to be the case even if a case-insensitive collation is already specified on the column (and therefore StringComparison isn't necessary).
  • utf8mb4_bin also causes indexes to not be used (when the column has another collation).
  • In addition, utf8mb4_bin isn't just case-insensitive - it is a binary comparison. For at least some values of StringComparison (CurrentCulture), this will cause binary comparison even when the column's collation isn't binary at all. I'm referring to cases where a collation is case-sensitive, but still disregards certain binary differences (e.g. accents, German ß vs. ss, etc); it seems like this would produce unexpected results, as the user simply specified CurrentCulture and did not request binary comparison.

We discussed the question of translating StringComparison overloads (see dotnet/efcore#1222 (comment)), but in the end decided against it (at least unless new information emerges). First, IMHO translating a very common/widely-used method such as string.Equals with StringComparison, which will prevent index usage, seems like a pretty big big of failure - I'd expect users to happily use it without realizing the big perf implication it brings. In addition, choosing exactly which collation to use is tricky (see my comment above about non-case differences), so I think it's preferable for users to explicitly specify the collation they want.

The alternative to StringComparison is for users to use the new EF.Functions.Collate. It makes users specify collations in all cases, and hopefully to also think about index/perf implications (because it's a special function, with documentation, and they've had to discover it).

Of course, your provider can do whatever it wants and diverge from other providers :) But I'd just like to see if there's anything MySQL-specific here that points this way, and see if we can possibly align things if that makes sense.

@lauxjpn
Copy link
Collaborator

lauxjpn commented Apr 20, 2020

@lauxjpn, I've seen your question in dotnet/efcore#20610 and your recent change in #1057 and wanted to ask a couple questions. I recently took a deep loop at collations (include some MySQL experiments) and want to make sure we're more or less aligned on things.

First, you wrote above that "when doing an exact comparison, the index will additionally be used when the literal operand's collation is changed to a binary one", i.e. that equality on a column using some non-binary collation will use an index if the column is compared to a literal with binary collation. I've done a quick test and I couldn't see this behavior... From my tests a column index only ever gets used if the literal has the exact same collation as the column.

I slightly changed your code (probably insignificant here) and ran it on my end, with different results (or possibly a different interpretation of the same results):

SELECT VERSION(); -- 8.0.18

drop database if exists `CollationTests`;
create database if not exists `CollationTests`;
use `CollationTests`;

DROP TABLE if exists data;
CREATE TABLE data (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    cs VARCHAR(256) CHARSET utf8mb4 COLLATE utf8mb4_0900_as_cs,
    ci VARCHAR(256) CHARSET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
    bin VARCHAR(256) CHARSET utf8mb4 COLLATE utf8mb4_bin);
CREATE INDEX ix_cs ON data(cs);
CREATE INDEX ix_ci ON data(ci);
CREATE INDEX ix_bin ON data(bin);

DROP PROCEDURE IF EXISTS myloop;
DELIMITER //
CREATE PROCEDURE myloop()
BEGIN
    DECLARE i INT DEFAULT 0;
    
    START TRANSACTION;
    WHILE (i < 50000) DO
            INSERT INTO data (cs, ci, bin) values (i, i, i);
            SET i = i+1;
        END WHILE;
    COMMIT;
END;
//
CALL myloop(); 

-- No explicit collations, indexes always used
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE cs='hello'; -- yes
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE ci='hello'; -- yes
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE bin='hello'; -- yes

-- Collate on column, all indices ARE always used (Non-Unique Key Lookup)
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE cs COLLATE utf8mb4_0900_as_cs='hello'; -- yes
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE ci COLLATE utf8mb4_0900_ai_ci='hello'; -- yes
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE bin COLLATE utf8mb4_bin='hello'; -- yes

-- Collate on literal, uses the index when the collation matches the column's (fully), or a binary collation is used (partially).
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE cs='hello' COLLATE utf8mb4_0900_as_cs; -- yes
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE cs='hello' COLLATE utf8mb4_bin; -- yes* (Index Range Scan/Partial Index Scan)
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE ci='hello' COLLATE utf8mb4_0900_ai_ci; -- yes
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE ci='hello' COLLATE utf8mb4_bin; -- yes* (Index Range Scan/Partial Index Scan)
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE bin='hello' COLLATE utf8mb4_0900_ai_ci; -- no (expected): ai/ci on bin column
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE bin='hello' COLLATE utf8mb4_bin; -- yes

-- Same as above, explicitly specifying the charset
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE cs=(CONVERT('hello' USING utf8mb4) COLLATE utf8mb4_0900_as_cs); -- yes
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE cs=(CONVERT('hello' USING utf8mb4) COLLATE utf8mb4_bin); -- yes* (Index Range Scan/Partial Index Scan)
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE ci=(CONVERT('hello' USING utf8mb4) COLLATE utf8mb4_0900_ai_ci); -- yes
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE ci=(CONVERT('hello' USING utf8mb4) COLLATE utf8mb4_bin); -- yes* (Index Range Scan/Partial Index Scan)
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE bin=(CONVERT('hello' USING utf8mb4) COLLATE utf8mb4_0900_ai_ci); -- no (expected): ai/ci on bin column
EXPLAIN ANALYZE SELECT SQL_NO_CACHE id FROM data WHERE bin=(CONVERT('hello' USING utf8mb4) COLLATE utf8mb4_bin); -- yes

As you can see, 'hello' COLLATE utf8mb4_bin results in an index range scan (which is a partial index scan). This is more efficient than a full index scan, but less efficient than a lookup.

Please verify my results again on your end, if you've got the time.

Now, looking at #1057, the provider seems to continue to translate string operations with StringComparison, applying the LCASE function for StringComparison values that have IgnoresCase, and a COLLATE utfmb4_bin collation for the others (case-sensitive). The problems I see with this are:

That is correct. The methods have already been translated in 2.2, so removing support for them between 3.1.1 and 3.2.0 would not be SemVer compliant, even if we wanted to remove them (which I am not convinced of right now).

LCASE causes indexes to not be used. Note that this seems to be the case even if a case-insensitive collation is already specified on the column (and therefore StringComparison isn't necessary).

That is correct. There is room for optimization here, where we would check for a collation specified for a used property and suppress the LCASE() in such cases. But this is advanced stuff I usually would only look into if requested by the community. Methods with a StringComparison parameter are likely to not use indices. We should document this, but this is not unexpected by us.

utf8mb4_bin also causes indexes to not be used.

If utf8mb4_bin is specified as a columns collation, then it is used. Also, the index on a column using a different collation is partially used, if compared to a utf8mb4_bin string literal.

In addition, utf8mb4_bin isn't just case-insensitive - it is a binary comparison. For at least some values of StringComparison (CurrentCulture), this will cause binary comparison even when the column's collation isn't binary at all. I'm referring to cases where a collation is case-sensitive, but still disregards certain binary differences (e.g. accents, German ß vs. ss, etc); it seems like this would produce unexpected results, as the user simply specified CurrentCulture and did not request binary comparison.

That is correct. It is the simplest way to implement a similar behavior, without the need to examine possible collations of possible columns and maintaining a list of known mappings between CS/CI collations. So there is room for optimization here as well, if the community requests it, but the current behavior can at least partially use an existing index.

We discussed the question of translating StringComparison overloads (see dotnet/efcore#1222 (comment)), but in the end decided against it (at least unless new information emerges). First, IMHO translating a very common/widely-used method such as string.Equals with StringComparison, which will prevent index usage, seems like a pretty big big of failure - I'd expect users to happily use it without realizing the big perf implication it brings. In addition, choosing exactly which collation to use is tricky (see my comment above about non-case differences), so I think it's preferable for users to explicitly specify the collation they want.

Yes, this is a valid concern. We could think about issuing a warning or introducing an option to disable this behavior. At least, this should be documented.

The alternative to StringComparison is for users to use the new EF.Functions.Collate. It makes users specify collations in all cases, and hopefully to also think about index/perf implications - because it's a special function, with documentation, and they've had to discover it).

This will definitely become a valid option, once we support this! But to use it efficiently will require advanced database performance knowledge, that only a few users will have. So providing a simple way to accomplish the same for common scenarios still seems the way to go here.

Of course, your provider can do whatever it wants and diverge from other providers :) But I'd just like to see if there's anything MySQL-specific here that points this way, and see if we can possibly align things if that makes sense.

Yes, having a standard documentation of supported functions (template) and their provider specific implementation, that every provider would fill out, would definitely make things easier for users that need to support multiple providers (or migrate projects from one provider to another).
The tables could then be diffed by the user.

Or a provider could just state, that all the default implementations as defined in the EF Core docs (template) are used, except for the following implementations, and then just lists the provider specific divergences.

@roji
Copy link

roji commented Apr 20, 2020

Please verify my results again on your end, if you've got the time.

Apologies, my bad... Am a bit new to MySQL and my database UI (Rider...) just shows the top line of the EXPLAIN result by default, which hides the index lookup 🤨 I can now see the results. I must say I'm surprised that a binary index can be used speed up a case-insensitive query in any way (the opposite is much less surprising).

The alternative to StringComparison is for users to use the new EF.Functions.Collate. It makes users specify collations in all cases, and hopefully to also think about index/perf implications - because it's a special function, with documentation, and they've had to discover it).

This will definitely become a valid option, once we support this!

FWIW support for EF.Functions.Collate is about to go into EF Core and will probably not require your provider to actually do anything.

But to use it efficiently will require advanced database performance knowledge, that only a few users will have. So providing a simple way to accomplish the same for common scenarios still seems the way to go here.

This is where I'm a bit confused... It seems to be that the current translations with StringComparisons are far more dangerous in terms of performance than EF.Functions.Collate will be, simply because the natural behavior for users would be to use them as-is. A warning could help with this, although our experience has shown that warnings haven't been extremely effective (a good example is the client evaluation we used to have). At the very least, it seems that whatever advanced/performance knowledge will be required to use EF.Functions.Collate is also needed in order to efficiently use the StringComparison overloads today.

At the end of the day, string comparison and collations is a good example of a big mismatch between the .NET world and the database world, among other things because of the subtle (and at the same time crucial) effect of indexes. In this kind of scenario we've generally tried to not hide the complexity, but to confront the user with it so they can make informed decisions. That's why I'm a bit scared about the StringComparison overloads.

Anyway, all that's just my opinion...

@lauxjpn
Copy link
Collaborator

lauxjpn commented Apr 20, 2020

This is where I'm a bit confused... It seems to be that the current translations with StringComparisons are far more dangerous in terms of performance than EF.Functions.Collate will be, simply because the natural behavior for users would be to use them as-is. A warning could help with this, although our experience has shown that warnings haven't been extremely effective (a good example is the client evaluation we used to have).

At the end of the day, string comparison and collations is a good example of a big mismatch between the .NET world and the database world, among other things because of the subtle (and at the same time crucial) effect of indexes. In this kind of scenario we've generally tried to not hide the complexity, but to confront the user with it so they can make informed decisions. That's why I'm a bit scared about the StringComparison overloads.

I generally share your concerns here. If we hadn't had support for this before, I would probably think twice about introducing it now. But since we already have it in place, I tend to keep supporting it and to just document it properly.

I don't see any concrete dangers here however. The worst scenario should be, that the query does not use an existing index and that there could be corner cases in accent and special char handling, depending on the chosen StringComparison value.

At the very least, it seems that whatever advanced/performance knowledge will be required to use EF.Functions.Collate is also needed in order to efficiently use the StringComparison overloads today.

That is one of the main points I disagree with. Knowing e.g., that an index will partially be used, when doing a binary comparison, but only if used on the literal value, is not common knowledge and the information is not strait forward to obtain.

Getting this right by yourself when using EF.Functions.Collate(), is unrealistic for the general user base.

I think that the main body of this discussion here is about performance optimization. Pomelo provides both approaches:

  • the performance optimized approach by using the default string methods implementations, that users can optimize themselves by manually making sure, that the appropriate collation and index has been defined for the column
  • the freedom-in-choice approach, where users can choose different comparison modes in accordance with their needs, even if those do not match with the collation that has been defined for a given column

The ramifications:

If the performance of any of those two approaches is not sufficient in practice (which is more likely for the second approach), then this will be determined by the user when this becomes a problem.

Then the usual steps to optimize the slow performing query can be applied here, as they need to for other slow performing queries as well.

I do see the very real possibility, that people might choose a StringComparison parameter method over their default counterpart, because this is a best practice in general .NET development. But this might just be the price for providing a simple and natural toolkit, to compare strings.

We could think about marking these methods as obsolete and removing them for EF Core 5, or only translating them in the future on an opt-in basis. I am not sure this is necessary though. As long as we document the potential performance issues, not every translated method needs to perform as good as similar methods.

It should be noted, that there is also the opportunity for other providers to embrace the same approach that Pomelo uses, and provide additional (potentially not optimized) comparison mechanisms out-of-the-box and thereby make this method translation support the default (though I don't think this will happen). After all, these are very basic and common query scenarios that really any provider should somehow support.


I think the actual underlying issue here, that comes to light in this discussion is, that there is no natural way for users to tell, which methods can be translated and what the ramifications of using them are. And providers currently have only limited ways of naturally providing this information to the user.

I personally believe that this is one of the most important issues that needs to be tackled. Because currently, you either need serious development experience with EF Core to intuitively tell, what providers usually can translate, or you have to test every query with a trial-and-error approach, in hopes it will run as expected.

Having some kind of code-time support (like an analyzer or add-in, where providers can plugin their specifics, e.g. additional or different method descriptions), that will provide IntelliSense support for the supported methods and will warn if something unsupported is being used), would be a game-changer. If this could also show the SQL that is currently being generated for a given query, that would be the cherry on top. /cc @ajcvickers

@roji
Copy link

roji commented Apr 20, 2020

If we hadn't had support for this before, I would probably think twice about introducing it now. But since we already have it in place, I tend to keep supporting it and to just document it properly.

Fair enough, removing something and breaking people definitely has its own ramifications.

That is one of the main points I disagree with. Knowing e.g., that an index will partially be used, when doing a binary comparison, but only if used on the literal value, is not common knowledge and the information is not strait forward to obtain.

But isn't that knowledge (or at least most of it) required today as well? For one thing, your users have to somehow know that StringComparison.*IgnoresCase prevents any index from being used. For another, I'm not sure what the exact impact of partial index use is (i.e. how good it is compared to a full matching index). I do understand your point that there's value in applying a binary collation via StringComparison.Ordinal - it does seem to work partially, which is specific to MySQL AFAIK. But at the very least for the insensitive translation, IMHO there's little value in translating to LCASE - which the user could easily add themselves.

The way I see it, the complexity is there and IMHO can't really be hidden or glossed over... Each provider should probably have a guidance page specifically on case-sensitivity and collations.

I think that the main body of this discussion here is about performance optimization. Pomelo provides both approaches:
[...]
I do see the very real possibility, that people might choose a StringComparison parameter method over their default counterpart, because this is a best practice in general .NET development. But this might just be the price for providing a simple and natural toolkit, to compare strings.

That's a very important point, and the crux of the matter is what default/expected/naive code produces - ideally, idiomatic LINQ constructs which users produce naively should produce code that runs fast, and much as possible, doesn't need to be optimized. Users indeed always have the option of diving in and using EF.Functions.Collations and having full freedom there - we're in agreement about that.

Note that there are analyzers out there which actually flag string comparisons without StringComparison, guiding the user to add them; if this causes no index to be used (or even a perf degradation due to partial index use), that's quite a pit of failure.

It should be noted, that there is also the opportunity for other providers to embrace the same approach that Pomelo uses, and provide additional (potentially not optimized) comparison mechanisms out-of-the-box and thereby make this method translation support the default (though I don't think this will happen). After all, these are very basic and common query scenarios that really any provider should somehow support.
[...]
I think the actual underlying issue here, that comes to light in this discussion is, that there is no natural way for users to tell, which methods can be translated and what the ramifications of using them are. And providers currently have only limited ways of naturally providing this information to the user.

I think we differ here quite considerably. I don't think EF (or its providers) should have as a goal to translate anything the user can express in LINQ. If a translation maps well and provides reasonable perf that's great, otherwise IMHO it's better to avoid providing a translation, while ideally pointing the user (via the exception message) to their options. One of the weak points we've had IMO is uninformative "translation failed" messages which points users towards workarounds/solutions; FWIW that's exactly what we're doing in dotnet/efcore#20663. So to summarize, the strategy I have in mind is that if something is translated, the user should be able to assume that the translation is both correct and reasonably efficient.

Re code-time support for knowing what's translatable, I agree that would be great; I'm a bit scared of the perf implications of that, since running the EF Core query pipeline from an analyzer as people type code could be quite slow. Another point to keep in mind, is that we think that the vast majority of people are using one provider only (setting aside InMemory for testing); multi-provider scenarios seem rare. So it's expected for people to have a bit of a learning curve as they learn what their provider supports, and then things should be relatively smooth. But of course documentation should definitely be improved in this area, at the very least.

Just to be clear, I'm very happy we're having this conversation even if don't agree on everything, and I've already learned quite a bit (e.g. the MySQL binary index and partial index use).

@lauxjpn
Copy link
Collaborator

lauxjpn commented Apr 20, 2020

I think we differ here quite considerably. I don't think EF (or its providers) should have as a goal to translate anything the user can express in LINQ. If a translation maps well and provides reasonable perf that's great, otherwise IMHO it's better to avoid providing a translation, while ideally pointing the user (via the exception message) to their options.

That is a valid approach. In my opinion, the golden path leads along the balance between great perf and common user requirements support. If we can provide the later reasonably well, I am all game. What reasonably well means, is open to interpretation of course.

Re code-time support for knowing what's translatable, I agree that would be great; I'm a bit scared of the perf implications of that, since running the EF Core query pipeline from an analyzer as people type code could be quite slow.

I think a simple way to do this would be for providers to supply some kind of xml file, that could be read by IntelliSense, to mark the supported methods in some way (bold, colored star etc.) and provide custom method descriptions that can be displayed instead of the default ones (or in addition to them). This would be better than any documentation. (I haven't looked into this at all from a technical standpoint, so I have not the slightest idea, if this is supported, how challenging this would be or how much inter-team coordination this would require. I only know, that extending IntelliSense is generally possible, and that libraries ship their own xml files for their own classes/methods).

Just to be clear, I'm very happy we're having this conversation even if don't agree on everything, and I've already learned quite a bit (e.g. the MySQL binary index and partial index use).

Absolutely! I consider this kind of discussions as very interesting and productive. If everybody would agree on everything, this would probably not represent many of the relevant user views appropriately.

@jemiller0
Copy link
Author

@lauxjpn I lost track of what he current status of this is. Is the provider back to using case insensitive by default?

@lauxjpn
Copy link
Collaborator

lauxjpn commented Apr 20, 2020

I lost track of what he current status of this is.

@jemiller0 Yeah, this issue has become a bit long.

Is the provider back to using case insensitive by default?

The provider is back to using the casing that is defined by the collation of the column (currently only nightly build, but will be part of 3.2.0).

If your collation for a given column is utf8mb4_0900_as_cs for example, the casing used (if no explicit StringComparison parameter is specified) is case sensitiv.

If on the other hand a utf8mb4_0900_ai_ci collation has been defined for the column, case insensitive comparisons will be used.

This is how other providers implement it, how we implemented it before 3.0.0 and how it should be implemented. See the code of #1057 for further details.

Feel free to take the nightly build out for a spin and report back, if you like to and have the time.

@jemiller0
Copy link
Author

Thanks @lauxjpn. I appreciate yours and everyone else's meticulous attention to details like this. Hopefully, this will be a 3.x release? I'm out of luck in 5 no thanks to Microsoft switching to .NET Standard 2.1 and dropping support for .NET Framework. I'm using this with several Web Forms projects that aren't getting completely rewritten. I hope the 3.x version of EF Core continues to get fixes.

@lauxjpn
Copy link
Collaborator

lauxjpn commented Apr 20, 2020

Hopefully, this will be a 3.x release?

Yes, this will be part of the next Pomelo release, which will be 3.2.0:

The provider is back to using the casing that is defined by the collation of the column (currently only nightly build, but will be part of 3.2.0).


I'm out of luck in 5 no thanks to Microsoft switching to .NET Standard 2.1 and dropping support for .NET Framework.

I think it was a good decision of the EF Core team to switch back to .NET Standard 2.0 for EF Core 3.1 (remember, EF Core 3.0 was already using .NET Standard 2.1).

But at some point, we all need to move on. Personally, I am a big fan of Microsoft's paradigm shift to actually evolve .NET from now on, instead of carrying around all the previous dead weight until eternity. But of course this comes at the price of keeping up with its evolution.

I'm using this with several Web Forms projects that aren't getting completely rewritten.

.NET Framework 4.8 will be with us forever as well and will get security fixes, though there will be no active development there anymore. But if you just need to maintain legacy projects, staying on .NET Framework 4.8 should be fine.

I hope the 3.x version of EF Core continues to get fixes.

Take a look at the official .NET Core Support Policy:

Version End of Support
.NET Core 3.1 (LTS) December 3, 2022

But even after December 3, 2022, if EF Core is your only serious dependency, as long as you hide the EF Core access behind your WebForms layer, there should be not reason why EF Core 3.1 should not continue to run as before (and as secure as before).


Depending on how well layered your app is, you should also be able to split your app into the WebForms specific stuff and everything else without too much effort.

Then you should at least be able to run both components from different processes, one using .NET Framwork 4.8 (WebForms) and the other using the latest version of .NET Core/.NET 5+.

This should only be necessary however, if you need to continue actively development on the project. This should not be necessary for a legacy app that is solely on life support.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment