Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query: Sargability of string literals in combination with non-Unicode columns #4686

Closed
divega opened this issue Mar 2, 2016 · 29 comments
Closed
Assignees
Labels
area-perf closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. type-enhancement
Milestone

Comments

@divega
Copy link
Contributor

divega commented Mar 2, 2016

PR #4667 "fixed" issue #4622 by making all string literals Unicode in our SQL generation.

But it has been claimed that SQL Server cannot leverage indexes when comparing a Unicode literal or parameter against a non-Unicode column. For an example, see http://stackoverflow.com/questions/5828621/.

Rather than reopening #4622, I am creating a new issue to discuss the priority of sargability in this scenario on its own.

Whenever we decide to improve this we should verify in which scenarios the claim is valid and then we can avoid creating Unicode literals in those cases.

@rowanmiller rowanmiller added this to the Backlog milestone Mar 7, 2016
@rowanmiller rowanmiller changed the title Sargability of string literals in combination with non-Unicode columns Query: Sargability of string literals in combination with non-Unicode columns Mar 7, 2016
@mikes-gh
Copy link
Contributor

Just came across this with latest vnext rc2 bits. Wondered why my query was suddenly so slow
In the following example CustSuppRef is varchar and indexed in SQL server 2008 r2

It was because
Generated SQL was

SELECT [c].[TransRef], [c].[CommentLineNo], [c].[Comments], [c].[TimeStamp]
FROM [CommentDetails] AS [c]
INNER JOIN (
    SELECT DISTINCT [t].[TransRef]
    FROM [Transactions] AS [t]
    WHERE [t].[CustSuppRef] IN (N'5802685') AND (([t].[TransCode] = 0) OR ([t].[TransCode] = 1))
) AS [t2] ON [c].[TransRef] = [t2].[TransRef]
ORDER BY [t2].[TransRef]

Taking 2 seconds

In SSMS I altered the query to

SELECT [c].[TransRef], [c].[CommentLineNo], [c].[Comments], [c].[TimeStamp]
FROM [CommentDetails] AS [c]
INNER JOIN (
    SELECT DISTINCT [t].[TransRef]
    FROM [Transactions] AS [t]
    WHERE [t].[CustSuppRef] IN ('5802685') AND (([t].[TransCode] = 0) OR ([t].[TransCode] = 1))
) AS [t2] ON [c].[TransRef] = [t2].[TransRef]
ORDER BY [t2].[TransRef

Taking a few miliseconds. (Note no N on literal '5802685')

So it looks like treating all string literals as Unicode will hurt performance in this scenario dramatically
I would change this from enhancement to bug please!!.

Due to CustSuppRef being varchar and indexed
Any workarounds in the meantime?

@ErikEJ
Copy link
Contributor

ErikEJ commented Mar 11, 2016

Is custsuppref mapped AS varchar In Your model?

@mikes-gh
Copy link
Contributor

  entity.Property(e => e.CustSuppRef)
                    .HasColumnType("varchar(15)");

Yes.

@mikes-gh
Copy link
Contributor

Heres the code

  public ICollection<Invoice> GetInvoices(int[] invoiceNumber)
        {
            if (invoiceNumber == null) return null;

            string[] invoiceNoStrings = invoiceNumber.Select(x => x.ToString()).ToArray();
            ICollection<Models.Transactions> invoiceTransactions = _tradingCompanyDbContext.Transactions
                .Include(t => t.AccountNoNavigation).ThenInclude(a => a.InvTypeNavigation)
                .Include(t => t.AccountNoNavigation).ThenInclude(a => a.PaymentFreqNavigation)
                .Include(t => t.InvoiceDetails).ThenInclude(i => i.StockNoNavigation)
                .Include(t => t.DiscountDetails)
                .Include(t => t.CommentDetails)
                .Where(t => invoiceNoStrings.Contains(t.CustSuppRef)
                    && (t.TransCode == 0 || t.TransCode == 1))
                .AsNoTracking()
                .ToList();

            if (invoiceTransactions.Count != 0)
            {
                Models.CompanyDetails companyDetail = _tradingCompanyDbContext.CompanyDetails
                    .AsNoTracking()
                    .FirstOrDefault();

                return GetInvoiceViewModel(companyDetail, invoiceTransactions);
            }
            else
            {
                return null;
            }

        }

@divega divega removed this from the Backlog milestone Mar 11, 2016
@divega
Copy link
Contributor Author

divega commented Mar 11, 2016

Clearing up milestone based on the new data.

@mikes-gh
Copy link
Contributor

BTW the generated sql I posted is just part of 4 queries generated by the code. They all exhibit the same issue.
So code that used to be almost instant now takes over 8 secs.

@mikes-gh
Copy link
Contributor

Will this fix make it to rc2?
If I can be of any further help let me know

@mikes-gh
Copy link
Contributor

@divega @ErikEJ Are there any workarounds.
I'm currently blocked because of this issue.

@ErikEJ
Copy link
Contributor

ErikEJ commented Mar 16, 2016

Custom version of EFCore.Relational, or use FromSQL ?

@gdoron
Copy link

gdoron commented Mar 16, 2016

Can't EF check what DB type is the string being filtered, and if it's an NVARCHAR use unicode literal and if's VARCHAR is non-unicode literal?

And there should be something similar to EntityFunctions.AsNonUnicode in EF core that people can use to customize it for weird cases where this rule of thumb isn't good because the index is (for God knows why) VARCHAR and the column in the table is NVARCHAR.

@mikes-gh
Copy link
Contributor

@rowanmiller
Are you saying (from the labels) no support for varchar columns with indexes until RTM #4667 broke this scenario very recently.
Varchar is still common place in database first senarios.

@divega
Copy link
Contributor Author

divega commented Mar 16, 2016

Can't EF check what DB type is the string being filtered, and if it's an NVARCHAR use unicode literal and if's VARCHAR is non-unicode literal?

Yes, that is a good description of the fix that we want to implement.

Ideally we would recognize all C# relational operators and (as we already saw in @mikes-gh's example) usages of Enumerable.Contains() in both positive and negated form, between a property (for which we may know it is non-Unicode) and a literal, parameter or arbitrary expression including those (for which we only know it is a string), etc.

In practice I would expect the solution to do a reasonable effort to recognize those patterns and to leave the rest for some explicit way to specify that a string should be treated as non-Unicode (in EF 6 we had DbFunctions.AsNonUnicode() for that).

We can try to get something in soon for this but I doubt it will be ready for RC2 (which is rather close).

@ajcvickers
Copy link
Contributor

@divega @rowanmiller Maybe we could do just the AsNonUnicode part for RC2 so that there is a reasonable workaround?

@mikes-gh
Copy link
Contributor

@ajcvickers that would be great.
This is the first bug to block me with no easy workaround.

@mikes-gh
Copy link
Contributor

This is interesting.
http://sqlblog.com/blogs/marco_russo/archive/2006/10/21/unicode-varchar-nvarchar-and-index-usage-in-sql-server.aspx

It seems we can supply a non-Unicode literal or parameter as a predicate against an nvarchar field and index can be used.
It just can't apply a Unicode literal as a predicate to a varchar indexed field.

Sounds like a strong argument for reverting #4667

@divega
Copy link
Contributor Author

divega commented Mar 17, 2016

It seems we can supply a non-Unicode literal or parameter as a predicate against an nvarchar field and index can be used.

Even if I that is the case I don't think reverting #4667 is the right solution. That change actually fixed a bug. See #4781.

@ajcvickers Sounds good to me if that is viable.

@mikes-gh
Copy link
Contributor

Ah sorry yes sometimes forgot the world is not all on the same alphabet.

@mikes-gh
Copy link
Contributor

This make interesting reading

http://www.olcot.co.uk/sql-blogs/revised-difference-between-collation-sql_latin1_general_cp1_ci_as-and-latin1_general_ci_as

It appears if we use a windows collation (Latin1_General_CI_AS) on the database, index seek can still be used when comparing an nvarchar predicate to a varchar column.
My current collation is SQL_Latin1_General_CP1_CI which would cause an index scan on my large table hence the dramatic performance drop.

Why is this? The reason behind all of this is down to the actual differences between the two collations. The SQL_Latin1_General_CP1_CI_AS collation is a SQL collation and the rules around sorting data for unicode and non-unicode data are different. The Latin1_General_CI_AS collation is a Windows collation and the rules around sorting unicode and non-unicode data are the same. A Windows collation as per this example can still use an index if comparing unicode and non-unicode data albeit with a slight performance hit. A SQL collation cannot do this as shown above and comparing nvarchar data to varchar removes the ability to perform an index seek.

Of course this is not something to be done lightly but in my case I think it will be a good option as I see no downsides from moving form the SQL collation to the Wndows collation which seems to be the recommended approach anyway now.

I will report back.

@mikes-gh
Copy link
Contributor

OK good news to report

SELECT [c].[TransRef], [c].[CommentLineNo], [c].[Comments], [c].[TimeStamp]
FROM [CommentDetails] AS [c]
INNER JOIN (
    SELECT DISTINCT [t].[TransRef]
    FROM [Transactions] AS [t]
    WHERE [t].[CustSuppRef] IN (N'4802685') AND (([t].[TransCode] = 0) OR ([t].[TransCode] = 1))
) AS [t2] ON [c].[TransRef] = [t2].[TransRef]
ORDER BY [t2].[TransRef]

Now runs similar times for N'4802685'' or '4802685''

So basically changing database, table and column collation from

SQL_Latin1_General_CP1_CI
to
Latin1_General_CI_AS
Helps with this issue. (Although generating correct sql depending on data type is the ultimate solution)

BTW My sql server installation is set to Latin1_General_CI_AS so any new databases would be this anyway.
The databases in question are quite old so the SQL collation has followed them from a previous install of SQL server.

Changing collation is a bit time consuming but for me its worth it as I now have a more future proof database. I can also still benefit from smaller index size of varchar for large 3mil plus row table and SQL Server now knows how to compare nvarchar to varchar efficiently.
EFCore now works for me again 😃

Hope this helps someone else.

@smitpatel
Copy link
Contributor

#4937 added functionality to infer the Unicode-ness of string literal & parameters for common cases. (when one side of operator is Column/property).

@divega
Copy link
Contributor Author

divega commented Mar 31, 2016

@smitpatel should we close this issue now? From my understanding of what you checked in, it doesn't seem that the explicit method is compelling anymore.

@smitpatel
Copy link
Contributor

There are complex cases which may not be inferred easily which may need method.
one example is this disabled test.
https://github.com/aspnet/EntityFramework/blob/dev/test/Microsoft.EntityFrameworkCore.FunctionalTests/GearsOfWarQueryTestBase.cs#L1014
The issue is we can infer the Unicode-ness when one side is property and traverse the other side like that but we still do not have concept of Unicode-ness of return value. There can be other complex cases like that which may need explicit method.

We can track it in new issue or here only. Either way is fine with me

@mikes-gh
Copy link
Contributor

mikes-gh commented Apr 1, 2016

General question... How do I know when this fix goes to cidev.
I would like to test my senario.

@smitpatel
Copy link
Contributor

@mikes-gh - 20401 was successful build number which took this commit in. It should be in cidev packages now.

@smitpatel
Copy link
Contributor

filed #4978 for complex cases. Closing this.

@smitpatel smitpatel modified the milestones: 1.0.0-rc2, 1.0.0 Apr 4, 2016
@mikes-gh
Copy link
Contributor

mikes-gh commented Apr 5, 2016

@smitpatel thanks on hols at mo but will report back in this thread in acouple of days

@mikes-gh
Copy link
Contributor

@smitpatel
ComfrimedTSQL now without 'N and optimal execution plan as a result from my example above

.Where(t => invoiceNoStrings.Contains(t.CustSuppRef)

Many thanks

@mk9753
Copy link

mk9753 commented May 3, 2017

I think I'm coming across something related to this issue...

There is a column set up as varchar:
modelBuilder.Entity<Quotes>().Property(s => s.StripInsuredCorporationName).HasColumnType("varchar(500)");

Query:
var tmp = (from q in context.Quotes where (q.StripInsuredCorporationName.Contains(strippedName)) select new { q.QuoteID });

The resulting SQL:
SELECT [q].[QuoteID] FROM [tblQuotes] AS [q] WHERE CHARINDEX(N'Test', [q].[StripInsuredCorporationName]) > 0

Since the column is defined as varchar, not sure what is causing the N to be there.

Thanks

@divega
Copy link
Contributor Author

divega commented May 3, 2017

@mk9753 could you create a new issue, complete with information about the EF Core version you are using?

@ajcvickers ajcvickers modified the milestones: 1.0.0-rc2, 1.0.0 Oct 15, 2022
@ajcvickers ajcvickers added the closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. label Oct 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-perf closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. type-enhancement
Projects
None yet
Development

No branches or pull requests

8 participants