-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SqlString comparison regression bug #99507
Comments
Tagging subscribers to this area: @dotnet/area-system-globalization |
Will this bug be addressed? |
Can we please get an update on this bug? We continue to find more characters that are not handled properly and that we must hard code and handle differently. For example, the characters in the following list are equal to empty string in sql server. But as far as SqlString is concerned, they are different. It's all the more frustrating because SqlString didn't use to have this bug before ICU was introduced. |
Tagging subscribers to this area: @roji, @ajcvickers |
Globalization tables are moving target. If SqlString assumes that the globalization tables are fixed, it sounds like a problem that needs to be fixed in SqlString. @roji @ajcvickers Thoughts? |
SqlString is a SqlClient support type, @David-Engel any input on the above? |
@cheenamalhotra - is this something you can take a look at or re-assign? |
I see SqlString is part of System.Data.Common package, we could have taken a look if it were in System.Data.SqlClient or the Microsoft.Data.SqlClient namespaces. |
@cheenamalhotra @David-Engel @saurabh500 we probably need to clarify ownership here... As far as I know, the Sql* types (SqlString, SqlBoolean) are used only by SqlClient, and by no other database provider (both Microsoft.Data.SqlClient and System.Data.SqlClient reference it). It is also referenced in System.Data itself - by DataAdapter/DataTable - which is why (in my mental model) these types couldn't be extracted out to Microsoft.Data.SqlClient (this is all an unfortunate situation and an original lack of a clear separation between System.Data and SqlClient). Is the above correct? |
I think this assumption is incorrect: SQL Server/Azure can be configured with any default collation, so the above query will return different results based on which server/database you'd execute it against. So I don't think it makes sense to expect SqlString to employ a specific collation. At the end of the day, SqlString really is just a string wrapper with some specific database behavior, but I'm not convinced it should be trying to do something specific with regards to collation. There's of course the question of the behavioral change from .NET Framework - that may be worth looking into; but I suspect whatever explains the change is much wider than SqlString. |
This is expected consequence of switching from NLS to ICU for globalization. https://learn.microsoft.com/en-us/dotnet/core/extensions/globalization-icu has more details. |
Thanks @jkotas, everything is clear. I'm going to close this as by by-design, but @cheenamalhotra @David-Engel @saurabh500 @clement911 if you have any comments let me know, we can always reopen. |
@roji |
@roji The point of So IMO this is still a regression. And yes, the comparison in SQL server depends on the collation. See runtime/src/libraries/System.Data.Common/ref/System.Data.Common.cs Lines 3062 to 3071 in 5d3bac1
These options are the same options that are used by the various collations, and that determine the comparison logic. For example, to use Latin1_General_100_BIN2, we use To use collation Latin1_General_100_CI_AI_SC, we can use In EF core, it poses a problem for us. |
The SqlCompareOptions enum is far, far away from "supporting the behavior of all collations". Collations go way beyond stuff like case sensitivity and Kana width - a collation fully defines how text is compared (and sorted), and rules really do differ across languages. For example, a case-insensitive comparison of i and I would yield true in most cases (it's the same letter i, in lower and upper case), but in Turkish it would yield false (these are two different letters). So once again, you SqlString comparisons may have been mimicking the SQL Server behavior in your very specific scenario (with a specific collation), but it certainly does not (and cannot) do that in the general case, where arbitrary collations may be used.
As written above, that isn't something that can be done 100% reliably, among other things precisely because SqlString can't possibly know which collation is actually going to be used in the database. Stepping back, I generally find the approach of trying to reimplement database operations in a 100% reliable way rather dubious, and unfeasible in the general case. Maintaining .NET types which replicate the database comparison behavior is impossible because of collations - I'm sure that if I look into it I'll find various other subtle discrepancies between the client- and server-side behavior. No other database drivers attempt to provide this, and for good reason.
I really do think trying to anticipate when SQL Server will throw a unique constraint violation is the wrong way to go, and am interested in why you think it's important to do that, rather than just attempting to insert, getting an error and handling it. |
Example: using System.Data.SqlTypes;
using System.Globalization;
var compareOptions = SqlCompareOptions.IgnoreCase;
var turkishLCID = new CultureInfo("tr-TR").LCID;
SqlString string1 = new SqlString("i", turkishLCID, compareOptions);
SqlString string2 = new SqlString("I", turkishLCID, compareOptions);
if (string1.Equals(string2))
Console.WriteLine("The strings are equal.");
else
Console.WriteLine("The strings are not equal."); This will print
@roji public class Customer
{
[Key]
public long Id { get; private set; }
//other properties
}
public class CustomerTag
{
[Key]
public long CustomerId { get; private set; }
[Key]
public string TagName {get; private set;}
//other properties
} The When a user uses the app, they update the tags that apply to the customer and the app will save them in the DB. public async Task UpdateCustomer(long customerId, string[] tags)
{
var ctx = //get a context
foreach (string tag in tags)
{
ctx.Add(new CustomerTag(customerId, tag);
}
await ctx.SaveChangesAsync();
} This works well most of the time, but it can fail. If we were to retry, it would keep failing, as we don't know how to deduplicate the array! Note that if instead of |
That's good to know, but the main point is that for Turkish users, the previous (.NET Framework) wasn't working out of the box, since their database collation wasn't the default; they had to manually tell SqlString how to compare the strings (just like you can manually do so now, if you want). In other words, it is impossible for SqlString to reliably replicate the comparison behavior of your database, so whatever default behavior it has is going to be wrong for a lot of users. That default may have been good for you in .NET Framework, but that's just because of your specific scenario, and there's no reason to consider the behavioral change in SqlString as different, special or worse compared to any other consequence of the switch to ICU.
If I were to write this code, I'd simply execute SaveChangesAsync after each Add, and catch and ignore unique constraint violations. If performance is important in this particular code path, you can keep your current code which attempts to batch all the inserts, and only if a unique constraint violation is thrown, fall back to the slower one-by-one insert. I do understand that this is more code and is annoying... But I really do believe that it's the only way to do this properly; trying to anticipate and replicate database behavior (of any kind, not just comparison!) at the client simply can't be done reliably in the general case. |
Yes, a database has a default collation and
I understand but this is a work around that doesn't always fit with certain patterns. I provided a simplified example. In reality, we use the Unit of work pattern (that EF is very good at handling). Several components add their changes to the context, and once that's done, all the changes are to be persisted at the same time in a single transaction (with SaveChanges). So it's not that simple to use this work around. But this in a EF tangent. If SqlString behaved correctly, we wouldn't have to resort to this kind of work around.
I agree it's not easy but I'm not sure why it can't be done reliably. Sql server follows some sorting rules, which can be replicated. For example, It used to work. And now it doesn't. If there is no appetite or resources to fix it, so be it. But then maybe BTW, I always appreciate your thoughtful comments, even if I don't always agree 👍 |
The collation rules are not set in stone. They change over the time like other globalization data. You would also need to provide some identification of the collation rules to use. |
To make sure we're not conflating things, here's a quick summary: First, there's "things used to just work out-of-the-box with the default, without me having to configure anything explicitly on SqlString". I think it's clear that this argument doesn't hold; if this was the case for you, it was only the case because the previous .NET default behavior happened to match your SQL Server default - that was simply lucky, and anyone changing their SQL Server collation wouldn't get this. So in the general case, users wanting to replicate server string comparison must tell SqlString which collation to use - this was the case before the change to ICU, and is the case after as well. Therefore, the switch to ICU was certainly a behavioral breaking change, but is not a regression or something that requires addressing in SqlString. Which brings us to the second point:
Here, it seems that you're saying that since the switch to ICU, it is impossible to get SqlString to compare like the SQL Server behavior, for at least some SQL Server collation. ICU is a very comprehensive collations library, and it's pretty unlikely that there's such a case; maybe SqlString doesn't allow enough flexibility in specifying the precise collation, or maybe SQL Server has some very specific behavior which happened to also exist in the previous, pre-ICU .NET behavior - I'm not sure. To understand this better, I'd want a full repro, with the exact character in question and a minimal .NET repro (console program).
The point is that your workaround simply doesn't work in general case. Sure, you can dedpulicate tags cilent-side before attempting to store them, but how do you make sure the tags you're inserting doesn't conflict with tags that already exist in the database? You can load the existing tags before saving to again deduplicate the tags about to be inserted, but then how do you make sure that someone doesn't concurrently insert a new tag which conflicts with one of the tags, after you've loaded the database tags but before insertion? I'm trying to say that regardless of collations or SqlString, the idea of trying to anticipate and avoid a unique constraint violation simply doesn't work in the general case. It might work in constrained scenarios - e.g. you know that nobody else is updating the table concurrently - but that's generally not a good way to write a database application. Note that databases usually have some mechanism to "insert or ignore", i.e. instruct the database to simply silently ignore unique constraint violations, which is what you're really looking for; PostgreSQL has INSERT ... ON CONFLICT DO NOTHING, and on SQL Server that's typically achieved via MERGE (though that's not atomic/concurrency-safe). EF unfortunately doesn't expose a way of leveraging that at the moment: that's tracked by dotnet/efcore#16949.
My point wasn't necessarily about strings, more about the general case - databases support a large range of complex types, which can be stored and have a unique index over them. Imagine types that represent spatial data (complex polygons), or any arbitrary such type (see the basic PG type list); we'd need to have corresponding .NET types which 100% reliably replicate the database comparison logic - that's not really feasible. |
Just wondering, what was the behavior here for SQL Server installed on Linux? Did it carry some private copy of NLS (or its own localization library) or did it default to using ICU as that’s the primary localization library on Linux systems and so the behavior would’ve already differed based on the OS that SQL Server was running against? — I expect that even for NLS everywhere, breaking changes have happened over time to support newer Unicode standards as well, or other breaks such as https://learn.microsoft.com/en-us/windows/win32/win7appqual/nls-sorting-changes have happened and which may be impactful to such scenarios and have caused differences over time. I would expect the same if SQL Server carried its own localization library as it would not have perfect parity with the older NLS behavior in all cases |
@tannergooding yeah, SQL Server on Linux runs over its own OS abstraction layer where many components were carried over from Windows for maximum compatibility; so that in general it behaves the same with high fidelity regardless of whether it's running on Linux or Windows. I suspect that means it doesn't run on ICU on Linux (just as it doesn't on Windows) - but I'm not sure. |
Yes. SQL Server carries multiple private copies of NLS data. You would not want OS update or SQL server update that came with NLS data update to invalidate your database. https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support explains it in detail. BTW: You can see this in .NET Framework sources too. SQL Server CLR integration had to follow the same regime as the rest of the SQL Server and it count use the OS default globalization data that are a moving target. .NET Framework implementation of System.Globalization had hooks to override the native globalization implementation per AppDomain that SQL CLR used to ensure that the globalization behavior matches the configuration of your database. |
Description
The SqlString comparison logic returns incorrect result for some string
Reproduction Steps
Expected behavior
Console prints true
Actual behavior
Console prints false
Regression?
It's a regression.
It works fine in .NET framework 4.7.2 so it appears the bug is limited to .NET core
Known Workarounds
Use NLS instead of ICU but this has other consequences throughout the app that we don't want.
Configuration
.NET 8 Windows x64
Other information
The class
SqlString
is designed to behave like a SQL string and it doesn't behave like SQL anymore, which causes some issues for us and could cause similar issues for any apps that rely on this class and upgrades to .NET5+.As an example, if one runs the following command on Sql Server or Sql Azure, it will print 'Equal', which proves that, from a sql perspective, both string should be equal (using the default latin collation). The
SqlString
should reflect this same behavior.SqlString
relies onSystem.Globalization.CompareOptions
internally.I understand that there is a flag to use NLS instead of ICU, however we don't want to apply this flag globally.
What we probably need are new options
System.Globalization.CompareOptions.UseNLS
andSystem.Data.SqlTypes.SqlCompareOptions.UseNLS
so that we can opt-in to use NLS on a per-call basis (as opposed to the current global flag)Related to this issue.
The text was updated successfully, but these errors were encountered: