Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SqlString comparison regression bug #99507

Closed
clement911 opened this issue Mar 10, 2024 · 23 comments
Closed

SqlString comparison regression bug #99507

clement911 opened this issue Mar 10, 2024 · 23 comments

Comments

@clement911
Copy link

Description

The SqlString comparison logic returns incorrect result for some string

Reproduction Steps

using System;
using System.Data.SqlTypes;

namespace ConsoleApp2
{
    internal class Program
    {
        static void Main(string[] args)
        {
            bool areEqual = new SqlString("").Equals(new SqlString("�"));

            //Expecting true
            //Getting true on .NET framework 4.7.2
            //Getting false on .NET 8
            Console.WriteLine(areEqual);
        }
    }
}

Expected behavior

Console prints true

Actual behavior

Console prints false

Regression?

It's a regression.
It works fine in .NET framework 4.7.2 so it appears the bug is limited to .NET core

Known Workarounds

Use NLS instead of ICU but this has other consequences throughout the app that we don't want.

Configuration

.NET 8 Windows x64

Other information

The class SqlString is designed to behave like a SQL string and it doesn't behave like SQL anymore, which causes some issues for us and could cause similar issues for any apps that rely on this class and upgrades to .NET5+.

As an example, if one runs the following command on Sql Server or Sql Azure, it will print 'Equal', which proves that, from a sql perspective, both string should be equal (using the default latin collation). The SqlString should reflect this same behavior.

SELECT CASE WHEN N'' = N'' THEN 'Equal' ELSE 'Unequal' END

SqlString relies on System.Globalization.CompareOptions internally.
I understand that there is a flag to use NLS instead of ICU, however we don't want to apply this flag globally.

What we probably need are new options System.Globalization.CompareOptions.UseNLS and System.Data.SqlTypes.SqlCompareOptions.UseNLS so that we can opt-in to use NLS on a per-call basis (as opposed to the current global flag)

Related to this issue.

Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-globalization
See info in area-owners.md if you want to be subscribed.

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Mar 10, 2024
@clement911
Copy link
Author

Will this bug be addressed?

@clement911
Copy link
Author

Can we please get an update on this bug?

We continue to find more characters that are not handled properly and that we must hard code and handle differently.

For example, the characters in the following list are equal to empty string in sql server. But as far as SqlString is concerned, they are different.
✅ ʻ ػ

It's all the more frustrating because SqlString didn't use to have this bug before ICU was introduced.

Copy link
Contributor

Tagging subscribers to this area: @roji, @ajcvickers
See info in area-owners.md if you want to be subscribed.

@jkotas
Copy link
Member

jkotas commented May 22, 2024

Globalization tables are moving target. If SqlString assumes that the globalization tables are fixed, it sounds like a problem that needs to be fixed in SqlString. @roji @ajcvickers Thoughts?

@roji
Copy link
Member

roji commented May 22, 2024

SqlString is a SqlClient support type, @David-Engel any input on the above?

@SamMonoRT
Copy link
Member

@cheenamalhotra - is this something you can take a look at or re-assign?

@cheenamalhotra
Copy link
Member

I see SqlString is part of System.Data.Common package, we could have taken a look if it were in System.Data.SqlClient or the Microsoft.Data.SqlClient namespaces.

@roji
Copy link
Member

roji commented Aug 27, 2024

@cheenamalhotra @David-Engel @saurabh500 we probably need to clarify ownership here... As far as I know, the Sql* types (SqlString, SqlBoolean) are used only by SqlClient, and by no other database provider (both Microsoft.Data.SqlClient and System.Data.SqlClient reference it). It is also referenced in System.Data itself - by DataAdapter/DataTable - which is why (in my mental model) these types couldn't be extracted out to Microsoft.Data.SqlClient (this is all an unfortunate situation and an original lack of a clear separation between System.Data and SqlClient).

Is the above correct?

@roji roji modified the milestone: Future Aug 27, 2024
@roji
Copy link
Member

roji commented Aug 27, 2024

@clement911

As an example, if one runs the following command on Sql Server or Sql Azure, it will print 'Equal', which proves that, from a sql perspective, both string should be equal (using the default latin collation). The SqlString should reflect this same behavior.

SELECT CASE WHEN N'�' = N'' THEN 'Equal' ELSE 'Unequal' END

I think this assumption is incorrect: SQL Server/Azure can be configured with any default collation, so the above query will return different results based on which server/database you'd execute it against. So I don't think it makes sense to expect SqlString to employ a specific collation.

At the end of the day, SqlString really is just a string wrapper with some specific database behavior, but I'm not convinced it should be trying to do something specific with regards to collation.

There's of course the question of the behavioral change from .NET Framework - that may be worth looking into; but I suspect whatever explains the change is much wider than SqlString.

@jkotas
Copy link
Member

jkotas commented Aug 27, 2024

There's of course the question of the behavioral change from .NET Framework - that may be worth looking into

This is expected consequence of switching from NLS to ICU for globalization. https://learn.microsoft.com/en-us/dotnet/core/extensions/globalization-icu has more details.

@roji
Copy link
Member

roji commented Aug 27, 2024

Thanks @jkotas, everything is clear.

I'm going to close this as by by-design, but @cheenamalhotra @David-Engel @saurabh500 @clement911 if you have any comments let me know, we can always reopen.

@roji roji closed this as not planned Won't fix, can't repro, duplicate, stale Aug 27, 2024
@roji roji removed the untriaged New issue has not been triaged by the area owner label Aug 27, 2024
@David-Engel
Copy link
Contributor

@cheenamalhotra @David-Engel @saurabh500 we probably need to clarify ownership here... As far as I know, the Sql* types (SqlString, SqlBoolean) are used only by SqlClient, and by no other database provider (both Microsoft.Data.SqlClient and System.Data.SqlClient reference it). It is also referenced in System.Data itself - by DataAdapter/DataTable - which is why (in my mental model) these types couldn't be extracted out to Microsoft.Data.SqlClient (this is all an unfortunate situation and an original lack of a clear separation between System.Data and SqlClient).

Is the above correct?

@roji
I believe that's correct, yes.

@clement911
Copy link
Author

@roji
Yes this is an "expected" consequence of the transition of NLS to ICU for .NET string, but that doesn't mean that the new behavior is "by-design" for SqlString.

The point of SqlString is to provide an API that can compare strings in the same way that Sql Server does.
The bulk of the implementation code of SqlString deals with comparing string together, in a way that mimics Sql Server.

So IMO this is still a regression.

And yes, the comparison in SQL server depends on the collation.
SqlString also already supports the behaviour of all collations, by allowing multiple options to be passed in the SqlCompareOptions.

See

public enum SqlCompareOptions
{
None = 0,
IgnoreCase = 1,
IgnoreNonSpace = 2,
IgnoreKanaType = 8,
IgnoreWidth = 16,
BinarySort2 = 16384,
BinarySort = 32768,
}

These options are the same options that are used by the various collations, and that determine the comparison logic.

For example, to use Latin1_General_100_BIN2, we use SqlCompareOptions.BinarySort2.

To use collation Latin1_General_100_CI_AI_SC, we can use SqlCompareOptions.IgnoreCase | SqlCompareOptions.IgnoreNonSpace | SqlCompareOptions.IgnoreKanaType | SqlCompareOptions.IgnoreWidth. Or at least, we could use i. This doesn't always work anymore because of this bug.

In EF core, it poses a problem for us.
For example, "ABC�" and "ABC" are the same as far as SQL server is concerned (in default collation).
If there is a key or unique index on this value, EF will happily let us create two rows with each value. But this will create an exception when trying to persist the changes, because SQL server will complain that it cannot insert duplicate values.
We have worked around this limitation by using SqlString to compare the values before persisting. If some values are considered equal, we can create a single row instead of 2. But this doesn't work anymore since SqlString doesn't reflect the Sql Server behaviour anymore.

@roji
Copy link
Member

roji commented Aug 28, 2024

SqlString also already supports the behaviour of all collations, by allowing multiple options to be passed in the SqlCompareOptions.

The SqlCompareOptions enum is far, far away from "supporting the behavior of all collations". Collations go way beyond stuff like case sensitivity and Kana width - a collation fully defines how text is compared (and sorted), and rules really do differ across languages. For example, a case-insensitive comparison of i and I would yield true in most cases (it's the same letter i, in lower and upper case), but in Turkish it would yield false (these are two different letters).

So once again, you SqlString comparisons may have been mimicking the SQL Server behavior in your very specific scenario (with a specific collation), but it certainly does not (and cannot) do that in the general case, where arbitrary collations may be used.

The point of SqlString is to provide an API that can compare strings in the same way that Sql Server does.
The bulk of the implementation code of SqlString deals with comparing string together, in a way that mimics Sql Server.

As written above, that isn't something that can be done 100% reliably, among other things precisely because SqlString can't possibly know which collation is actually going to be used in the database.

Stepping back, I generally find the approach of trying to reimplement database operations in a 100% reliable way rather dubious, and unfeasible in the general case. Maintaining .NET types which replicate the database comparison behavior is impossible because of collations - I'm sure that if I look into it I'll find various other subtle discrepancies between the client- and server-side behavior. No other database drivers attempt to provide this, and for good reason.

If there is a key or unique index on this value, EF will happily let us create two rows with each value. But this will create an exception when trying to persist the changes, because SQL server will complain that it cannot insert duplicate values.

I really do think trying to anticipate when SQL Server will throw a unique constraint violation is the wrong way to go, and am interested in why you think it's important to do that, rather than just attempting to insert, getting an error and handling it.

@clement911
Copy link
Author

SqlString can actually handle turkish locale and other locales.

Example:

using System.Data.SqlTypes;
using System.Globalization;

var compareOptions = SqlCompareOptions.IgnoreCase;
var turkishLCID = new CultureInfo("tr-TR").LCID;

SqlString string1 = new SqlString("i", turkishLCID, compareOptions);
SqlString string2 = new SqlString("I", turkishLCID, compareOptions);

if (string1.Equals(string2))
    Console.WriteLine("The strings are equal.");
else
    Console.WriteLine("The strings are not equal.");

This will print The strings are not equal.

I really do think trying to anticipate when SQL Server will throw a unique constraint violation is the wrong way to go, and am interested in why you think it's important to do that, rather than just attempting to insert, getting an error and handling it.

@roji
Sure, I can expand on our specific scenario.
I will simplify the code but here is the gist.
We have the concept of Customer and CustomerTag in our EF model.

public class Customer
{
    [Key]
    public long Id { get; private set; }
    
    //other properties
}

public class CustomerTag
{
    [Key]
    public long CustomerId { get; private set; }

    [Key]
    public string TagName {get; private set;}
    
    //other properties
}

The CustomerTag key is the composite (CustomerId, TagName) and the collation of TagName is Latin1_General_100_CI_AI_SC

When a user uses the app, they update the tags that apply to the customer and the app will save them in the DB.

public async Task UpdateCustomer(long customerId, string[] tags)
{
   var ctx = //get a context
   foreach (string tag in tags)
   {
      ctx.Add(new CustomerTag(customerId, tag);
   }
   await ctx.SaveChangesAsync();
}

This works well most of the time, but it can fail.
For example, if the given tags are ["ABC�", "VIP, "123", "ABC"], sql server with throw an error because both ABC� and ABC are considered equal.
We use the SqlString class to deduplicate the array. But this doesn't work anymore because of the bug mentioned in this issue.

If we were to retry, it would keep failing, as we don't know how to deduplicate the array!

Note that if instead of TagName, we had instead TagNumber (int), EF would be able catch the duplicate when calling Add(tagNumber) with the same number twice. But for strings, it doesn't realize the duplicate.

@roji
Copy link
Member

roji commented Aug 29, 2024

SqlString can actually handle turkish locale and other locales.

That's good to know, but the main point is that for Turkish users, the previous (.NET Framework) wasn't working out of the box, since their database collation wasn't the default; they had to manually tell SqlString how to compare the strings (just like you can manually do so now, if you want). In other words, it is impossible for SqlString to reliably replicate the comparison behavior of your database, so whatever default behavior it has is going to be wrong for a lot of users. That default may have been good for you in .NET Framework, but that's just because of your specific scenario, and there's no reason to consider the behavioral change in SqlString as different, special or worse compared to any other consequence of the switch to ICU.

If we were to retry, it would keep failing, as we don't know how to deduplicate the array!

If I were to write this code, I'd simply execute SaveChangesAsync after each Add, and catch and ignore unique constraint violations. If performance is important in this particular code path, you can keep your current code which attempts to batch all the inserts, and only if a unique constraint violation is thrown, fall back to the slower one-by-one insert.

I do understand that this is more code and is annoying... But I really do believe that it's the only way to do this properly; trying to anticipate and replicate database behavior (of any kind, not just comparison!) at the client simply can't be done reliably in the general case.

@clement911
Copy link
Author

@roji

Yes, a database has a default collation and SqlString of course doesn't know what collation that is, but you can provide the SqlCompareOptions and the culture code, which in a sense provides the collation. Indeed, the collation names follow a specific pattern that indicates which options are turned on, for example the "_CI_AI" suffix means Case-insensitive, accent-insensitive, kana-insensitive, width-insensitive.
For simplicity, I didn't set any culture or options in my repro for this issue, but one can reproduce the bug with a specific culture and options as well. That is, even if you provide the options and culture that match a specific collation, the comparison behavior between Sql Server and SqlString doesn't always match. Not anymore at least. It used to work before the switch to ICU.

If I were to write this code, I'd simply execute SaveChangesAsync after each Add, and catch and ignore unique constraint violations.

I understand but this is a work around that doesn't always fit with certain patterns. I provided a simplified example. In reality, we use the Unit of work pattern (that EF is very good at handling). Several components add their changes to the context, and once that's done, all the changes are to be persisted at the same time in a single transaction (with SaveChanges). So it's not that simple to use this work around. But this in a EF tangent. If SqlString behaved correctly, we wouldn't have to resort to this kind of work around.

But I really do believe that it's the only way to do this properly; trying to anticipate and replicate database behavior (of any kind, not just comparison!) at the client simply can't be done reliably in the general case.

I agree it's not easy but I'm not sure why it can't be done reliably. Sql server follows some sorting rules, which can be replicated.

For example, SqlString implements the binary and binary2 collation sorting algorithms, as they are defined in sql server. (I agree that binary comparisons are easier because they are not culture dependent).
It also knows to ignore trailing spaces, which is again specific to sql server.
(SqlDecimal also has deep knowledge on how exactly to manipule sql decimals.)

It used to work. And now it doesn't.

If there is no appetite or resources to fix it, so be it. But then maybe SqlString's comparison logic should be deprecated. IMO it's probably better to have no API than to have an API that doesn't work as expected.

BTW, I always appreciate your thoughtful comments, even if I don't always agree 👍

@jkotas
Copy link
Member

jkotas commented Aug 30, 2024

you can provide the SqlCompareOptions and the culture code, which in a sense provides the collation

The collation rules are not set in stone. They change over the time like other globalization data. You would also need to provide some identification of the collation rules to use.

@roji
Copy link
Member

roji commented Aug 30, 2024

[...] but you can provide the SqlCompareOptions and the culture code, which in a sense provides the collation. [...]

To make sure we're not conflating things, here's a quick summary:

First, there's "things used to just work out-of-the-box with the default, without me having to configure anything explicitly on SqlString". I think it's clear that this argument doesn't hold; if this was the case for you, it was only the case because the previous .NET default behavior happened to match your SQL Server default - that was simply lucky, and anyone changing their SQL Server collation wouldn't get this. So in the general case, users wanting to replicate server string comparison must tell SqlString which collation to use - this was the case before the change to ICU, and is the case after as well. Therefore, the switch to ICU was certainly a behavioral breaking change, but is not a regression or something that requires addressing in SqlString.

Which brings us to the second point:

For simplicity, I didn't set any culture or options in my repro for this issue, but one can reproduce the bug with a specific culture and options as well. That is, even if you provide the options and culture that match a specific collation, the comparison behavior between Sql Server and SqlString doesn't always match. Not anymore at least. It used to work before the switch to ICU.

Here, it seems that you're saying that since the switch to ICU, it is impossible to get SqlString to compare like the SQL Server behavior, for at least some SQL Server collation. ICU is a very comprehensive collations library, and it's pretty unlikely that there's such a case; maybe SqlString doesn't allow enough flexibility in specifying the precise collation, or maybe SQL Server has some very specific behavior which happened to also exist in the previous, pre-ICU .NET behavior - I'm not sure. To understand this better, I'd want a full repro, with the exact character in question and a minimal .NET repro (console program).

I understand but this is a work around that doesn't always fit with certain patterns. I provided a simplified example. In reality, we use the Unit of work pattern (that EF is very good at handling). Several components add their changes to the context, and once that's done, all the changes are to be persisted at the same time in a single transaction (with SaveChanges). So it's not that simple to use this work around. But this in a EF tangent. If SqlString behaved correctly, we wouldn't have to resort to this kind of work around.

The point is that your workaround simply doesn't work in general case. Sure, you can dedpulicate tags cilent-side before attempting to store them, but how do you make sure the tags you're inserting doesn't conflict with tags that already exist in the database? You can load the existing tags before saving to again deduplicate the tags about to be inserted, but then how do you make sure that someone doesn't concurrently insert a new tag which conflicts with one of the tags, after you've loaded the database tags but before insertion?

I'm trying to say that regardless of collations or SqlString, the idea of trying to anticipate and avoid a unique constraint violation simply doesn't work in the general case. It might work in constrained scenarios - e.g. you know that nobody else is updating the table concurrently - but that's generally not a good way to write a database application.

Note that databases usually have some mechanism to "insert or ignore", i.e. instruct the database to simply silently ignore unique constraint violations, which is what you're really looking for; PostgreSQL has INSERT ... ON CONFLICT DO NOTHING, and on SQL Server that's typically achieved via MERGE (though that's not atomic/concurrency-safe). EF unfortunately doesn't expose a way of leveraging that at the moment: that's tracked by dotnet/efcore#16949.

But I really do believe that it's the only way to do this properly; trying to anticipate and replicate database behavior (of any kind, not just comparison!) at the client simply can't be done reliably in the general case.

I agree it's not easy but I'm not sure why it can't be done reliably. Sql server follows some sorting rules, which can be replicated.

My point wasn't necessarily about strings, more about the general case - databases support a large range of complex types, which can be stored and have a unique index over them. Imagine types that represent spatial data (complex polygons), or any arbitrary such type (see the basic PG type list); we'd need to have corresponding .NET types which 100% reliably replicate the database comparison logic - that's not really feasible.

@tannergooding
Copy link
Member

Just wondering, what was the behavior here for SQL Server installed on Linux?

Did it carry some private copy of NLS (or its own localization library) or did it default to using ICU as that’s the primary localization library on Linux systems and so the behavior would’ve already differed based on the OS that SQL Server was running against? — I expect that even for NLS everywhere, breaking changes have happened over time to support newer Unicode standards as well, or other breaks such as https://learn.microsoft.com/en-us/windows/win32/win7appqual/nls-sorting-changes have happened and which may be impactful to such scenarios and have caused differences over time. I would expect the same if SQL Server carried its own localization library as it would not have perfect parity with the older NLS behavior in all cases

@roji
Copy link
Member

roji commented Aug 30, 2024

@tannergooding yeah, SQL Server on Linux runs over its own OS abstraction layer where many components were carried over from Windows for maximum compatibility; so that in general it behaves the same with high fidelity regardless of whether it's running on Linux or Windows. I suspect that means it doesn't run on ICU on Linux (just as it doesn't on Windows) - but I'm not sure.

@jkotas
Copy link
Member

jkotas commented Aug 30, 2024

Yes. SQL Server carries multiple private copies of NLS data. You would not want OS update or SQL server update that came with NLS data update to invalidate your database. https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support explains it in detail.

BTW: You can see this in .NET Framework sources too. SQL Server CLR integration had to follow the same regime as the rest of the SQL Server and it count use the OS default globalization data that are a moving target. .NET Framework implementation of System.Globalization had hooks to override the native globalization implementation per AppDomain that SQL CLR used to ensure that the globalization behavior matches the configuration of your database.

@github-actions github-actions bot locked and limited conversation to collaborators Oct 2, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

7 participants