Improving docs with regard to how existing string APIs deal with globalization #21249
Comments
To clarify - there's no critical need to publish these contents as-is in its own article. But this text contains basically the sum of everything that developers need to know to be successful with these APIs and to migrate their applications. If this information can somehow find its way into the relevant docs and these docs could all be linked together, that should help improve the experience. I've linked to a few of the existing docs + breaking change notices throughout the draft text.
@carlossanlop who is the right docs person to tag, who can perhaps help us get this into a doc relatively briskly, to help folks at GA time?
Adding @gewarren
I think this article should be separate to the breaking change article, and suggest that it lives in this section: https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/. Also, regarding the code analyzers, why are we recommending to install the FxCop NuGet package when the analyzers are included with the .NET 5.0 SDK?
In a perfect world, I am going to take this information and immediately apply it to our existing code bases, which are currently net472, to proactively fix these issues prior to dealing with them during the .NET 5 lift. Given the scale of our code base (~20 million lines), anything that gets me there before it becomes an issue is far preferable to reactive efforts.
That's a good question. The gesture of installing the analyzers via NuGet was the only way I could figure out how to actually get them to light up over my existing code.
I believe the package name is |
I think the breaking change section at https://docs.microsoft.com/dotnet/core/compatibility/3.1-5.0#globalization-apis-use-icu-libraries-on-windows should at least contain the It's pretty hard to make the connection from just reading the "Globalization APIs use ICU libraries on Windows" description to "this potentially changes string.IndexOf() behavior in my app" so having that code sample would be nice. I agree that for details we should link to this GitHub issue or an eventual docs article.
We've received a few reports of customers being surprised by behavioral changes resulting from our NLS -> ICU conversion on Windows. In just the past few days both dotnet/runtime#43802 and dotnet/runtime#43736 were opened regarding this change. The change is called out at https://docs.microsoft.com/dotnet/core/compatibility/3.1-5.0#globalization-apis-use-icu-libraries-on-windows, but because it has the potential to be very impactful I wonder if it deserves its own article. This other article can provide more fleshed-out examples and go into further details.
/cc @tarekgh @jeffhandley @ericstj @adamsitnik @safern @danmosemsft, who all provided feedback offline during the creation of this document.
Docs folks, what do you think of including something like this? Is there a logical place to put this information?
Improving the developer experience with regard to default string globalization
Summary
.NET 5 introduces a runtime behavioral change where globalization APIs now use ICU by default across all supported platforms. This is a departure from earlier versions of .NET Framework and .NET Core, which utilized the operating system's NLS functionality when running on Windows. See .NET Globalization and ICU for more information on these changes, including compatibility switches which can revert the application to Windows's older behavior.
Why was this change introduced?
This change was introduced to unify .NET's globalization behavior across all supported operating systems. It also provides the ability for applications to bundle their own globalization libraries rather than depend on the OS's built-in libraries. See the breaking change notification for more information.
How might behavioral differences manifest themselves, and how can developers guard against these?
Developers might be using functions like `string.IndexOf(string)` without calling the overload which takes a `StringComparison` argument, inadvertently taking a dependency on culture-specific behavior when they had intended instead to perform an ordinal search. Since NLS (used by earlier .NET Core and .NET Framework versions on Windows) and ICU implement different logic in their linguistic comparers, methods like `string.IndexOf(string)` might return unexpected values.
This can manifest itself even in places where developers aren't always expecting globalization facilities to be active. For example, the following code can produce a different answer depending on the current runtime.
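A minimal sketch of such a call, assuming an input string that contains a Windows-style line break. The literal and the variable names are illustrative assumptions. The culture-sensitive result is deliberately left unannotated because it depends on whether the runtime uses NLS, ICU, or invariant globalization mode.

```csharp
using System;

string s = "Hello\r\nworld!";

// Culture-sensitive overload: no StringComparison is given, so the search uses
// the current culture's linguistic comparer. NLS-based runtimes locate the "\n"
// inside "\r\n"; ICU-based runtimes can report -1 instead.
int linguisticIdx = s.IndexOf("\n");
Console.WriteLine(linguisticIdx);

// Ordinal overload: an exact char-by-char search, independent of the runtime's
// globalization library.
int ordinalIdx = s.IndexOf("\n", StringComparison.Ordinal);
Console.WriteLine(ordinalIdx); // 6

// The char overload searches ordinally by default.
int charIdx = s.IndexOf('\n');
Console.WriteLine(charIdx); // 6
```

Only the first call is sensitive to the NLS-to-ICU change; the other two behave the same on every runtime.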
Option 1: Enable code analyzers to help detect possibly-buggy call sites
To help guard against any surprising behaviors here, we recommend installing the Microsoft.CodeAnalysis.FxCopAnalyzers NuGet package into your project. This package includes the code analysis rules CA1307 and CA1309, which help flag code which might inadvertently be using a linguistic comparer when an ordinal comparer was likely intended.
For example:
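A sketch of the kind of call site these rules are meant to flag, next to a corrected version. The `GetString` helper and its return value are hypothetical placeholders, not part of any real API.

```csharp
using System;

// Hypothetical helper standing in for any source of input text.
static string GetString() => "a,b,c";

string s = GetString();

// Potentially incorrect: no StringComparison is given, so the search is
// linguistic and the answer might vary by culture and runtime.
int flaggedIdx = s.IndexOf(",");

// Corrected: the caller states explicitly that a literal match is wanted.
int ordinalIdx = s.IndexOf(",", StringComparison.Ordinal);

// Also fine: the char overload searches ordinally by default.
int charIdx = s.IndexOf(',');

Console.WriteLine($"{flaggedIdx} {ordinalIdx} {charIdx}");
```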
Similarly, when instantiating a sorted collection of strings or sorting an existing string-based collection, specify an explicit comparer.
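A short sketch of passing an explicit comparer when sorting or building sorted collections. The sample data is an illustrative assumption.

```csharp
using System;
using System.Collections.Generic;

var names = new List<string> { "b", "a", "C", "B" };

// List<T>.Sort() with no arguments would fall back to the culture-sensitive
// default string comparer; here the comparer is stated explicitly.
names.Sort(StringComparer.Ordinal);

// Ordinal ordering follows code points, so uppercase letters come first.
Console.WriteLine(string.Join(",", names)); // B,C,a,b

// Arrays accept a comparer in the same way.
string[] array = { "b", "a" };
Array.Sort(array, StringComparer.Ordinal);

// Sorted collections take the comparer at construction time.
var set = new SortedSet<string>(StringComparer.OrdinalIgnoreCase) { "dog", "Dog", "cat" };
Console.WriteLine(set.Count); // 2, because "dog" and "Dog" are the same key

var map = new SortedDictionary<string, int>(StringComparer.Ordinal) { ["b"] = 2, ["a"] = 1 };
Console.WriteLine(string.Join(",", map.Keys)); // a,b
```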
For more information on these code analyzer rules, including when it might be appropriate to suppress these rules in your own code base, consult the following articles.
Option 2: Revert back to NLS behaviors when running .NET 5 apps on Windows
Developers can also follow the steps in the .NET Globalization and ICU document to revert .NET 5 applications back to older NLS behaviors when running on Windows. This is an application-wide compatibility switch and must be set at the application level. Individual libraries cannot opt-in or opt-out of this behavior. We strongly recommend developers use the CA1307 and CA1309 analyzer rules mentioned above to help improve code hygiene and discover any existing latent bugs.
What APIs are affected?
Most .NET applications should not encounter any unexpected behaviors due to the .NET 5 changes. However, due to the number of affected APIs and how foundational these APIs are to the wider .NET ecosystem, developers should be aware of the potential for .NET 5 to introduce unwanted behaviors or to expose latent bugs which already exist in the application code.
A non-exhaustive list of affected APIs follows:
System.String.Compare
System.String.EndsWith
System.String.IndexOf
System.String.StartsWith
System.String.ToLower
System.String.ToLowerInvariant
System.String.ToUpper
System.String.ToUpperInvariant
System.Globalization.TextInfo (most members)
System.Globalization.CompareInfo (most members)
System.Array.Sort (when sorting arrays of strings)
System.Collections.Generic.List<T>.Sort (when the list elements are strings)
System.Collections.Generic.SortedDictionary<TKey, TValue> (when the keys are strings)
System.Collections.Generic.SortedList<TKey, TValue> (when the keys are strings)
System.Collections.Generic.SortedSet<T> (when the set contains strings)
All of the above APIs use linguistic string searching and comparison using the thread's current culture by default. The differences between linguistic and ordinal searching and comparison are called out in the section Ordinal vs. linguistic search and comparison below.
Because ICU implements linguistic string comparisons differently from NLS, Windows-based applications which upgrade to .NET 5 from an earlier version of .NET Core or .NET Framework and which call one of the above APIs may notice that the above APIs begin exhibiting different behaviors.
Exceptions:
If the caller passes an explicit `StringComparison` or `CultureInfo` parameter, that parameter will override the API's default behavior.
`System.String` members where the first parameter is of type `char` (e.g., `string.IndexOf(char)`) use ordinal searching by default unless the caller passes an explicit `StringComparison` parameter which specifies `CurrentCulture[IgnoreCase]` or `InvariantCulture[IgnoreCase]`.
Ordinal vs. linguistic search and comparison
Ordinal (also known as non-linguistic) search and comparison decomposes a string into its individual `char` elements and performs a char-by-char search or comparison. For example, the strings `"dog"` and `"dog"` compare as equal under an `Ordinal` comparer since the two strings consist of the exact same sequence of chars. However, `"dog"` and `"Dog"` will compare as not equal under an `Ordinal` comparer because they do not consist of the exact same sequence of chars (uppercase `'D'`'s code point `U+0044` occurs before lowercase `'d'`'s code point `U+0064`, resulting in `"Dog"` sorting before `"dog"`).
An `OrdinalIgnoreCase` comparer still operates on a char-by-char basis, but it eliminates case differences while performing the operation. Under an `OrdinalIgnoreCase` comparer, the char pairs `'d'` and `'D'` compare as equal, as do the char pairs `'á'` and `'Á'`. But the unaccented char `'a'` will compare as not equal to the accented char `'á'`.
Some examples of this are provided in the table below:
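| String 1 | String 2 | `Ordinal` comparison | `OrdinalIgnoreCase` comparison |
| --- | --- | --- | --- |
| `"dog"` | `"dog"` | equal | equal |
| `"dog"` | `"Dog"` | not equal | equal |
| `"resume"` | `"Resume"` | not equal | equal |
| `"resume"` | `"résumé"` | not equal | not equal |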
Unicode also allows strings to have several different in-memory representations. For example, an e-acute (é) can be represented in two possible ways:
A single literal `'é'` character (also written as `'\u00E9'`).
A literal unaccented `'e'` character, followed by a combining accent modifier character `'\u0301'`.
This means that the following four strings will all result in `"résumé"` when displayed, even though their constituent pieces are different. The strings use a combination of literal `'é'` characters or literal unaccented `'e'` characters plus the combining accent modifier `'\u0301'`.
`"r\u00E9sum\u00E9"`
`"r\u00E9sume\u0301"`
`"re\u0301sum\u00E9"`
`"re\u0301sume\u0301"`
Under an ordinal comparer, none of these strings will compare as equal to each other. This is because they all contain different underlying char sequences, even though when they're rendered to the screen they all look the same.
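The ordinal comparisons discussed so far can be checked with a short sketch. The string pairs are the ones from the comparison table earlier in this section; the variable names are illustrative assumptions.

```csharp
using System;

// Pairs from the comparison table; "\u00E9" is the precomposed e-acute.
(string Left, string Right)[] pairs =
{
    ("dog", "dog"),
    ("dog", "Dog"),
    ("resume", "Resume"),
    ("resume", "r\u00E9sum\u00E9"),
};

foreach (var (left, right) in pairs)
{
    bool ordinal = string.Equals(left, right, StringComparison.Ordinal);
    bool ignoreCase = string.Equals(left, right, StringComparison.OrdinalIgnoreCase);
    Console.WriteLine($"{left} / {right}: Ordinal={ordinal}, OrdinalIgnoreCase={ignoreCase}");
}

// Ordinal ordering follows code points: 'D' (U+0044) comes before 'd' (U+0064).
bool dogSortsFirst = string.CompareOrdinal("Dog", "dog") < 0;
Console.WriteLine(dogSortsFirst); // True

// The four in-memory representations of the same displayed word.
string[] forms =
{
    "r\u00E9sum\u00E9",
    "r\u00E9sume\u0301",
    "re\u0301sum\u00E9",
    "re\u0301sume\u0301",
};

// Count how many distinct pairs of forms are ordinally equal.
int equalPairs = 0;
for (int i = 0; i < forms.Length; i++)
{
    for (int j = i + 1; j < forms.Length; j++)
    {
        if (string.Equals(forms[i], forms[j], StringComparison.Ordinal))
        {
            equalPairs++;
        }
    }
}
Console.WriteLine(equalPairs); // 0
```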
When performing a `string.IndexOf(..., StringComparison.Ordinal)` operation, the runtime will look for an exact substring match.
Ordinal search and comparison routines are never affected by the current thread's culture setting.
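A sketch of the exact-match behavior of an ordinal `IndexOf`, using the four representations of the word listed earlier. The array name and loop are illustrative assumptions.

```csharp
using System;

string[] forms =
{
    "r\u00E9sum\u00E9",   // both accents precomposed: no plain 'e' present
    "r\u00E9sume\u0301",  // second accent decomposed: plain 'e' at index 5
    "re\u0301sum\u00E9",  // first accent decomposed: plain 'e' at index 1
    "re\u0301sume\u0301", // both accents decomposed: first plain 'e' at index 1
};

int[] indexes = new int[forms.Length];
for (int i = 0; i < forms.Length; i++)
{
    // Ordinal search matches only the exact char 'e' (U+0065).
    indexes[i] = forms[i].IndexOf("e", StringComparison.Ordinal);
    Console.WriteLine(indexes[i]);
}
// Prints -1, 5, 1, 1 on separate lines.

// OrdinalIgnoreCase folds case only; it does not strip accents.
int upperIdx = "r\u00E9sum\u00E9".IndexOf("E", StringComparison.OrdinalIgnoreCase);
Console.WriteLine(upperIdx); // -1
```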
Linguistic search and comparison routines decompose a string into collation elements and perform searches or comparisons on these elements. There's not necessarily a 1:1 mapping between a string's chars and its constituent collation elements. For example, a string of length 2 may consist of only a single collation element. When two strings are compared in a linguistic-aware fashion, the comparer is checking whether the two strings' collation elements have the same semantic meaning, even if the string's literal chars are different.
Consider again the string `"résumé"` and its four different representations. The table below shows each representation broken down into its collation elements.
| String | As collation elements |
| --- | --- |
| `"r\u00E9sum\u00E9"` | `"r" + "\u00E9" + "s" + "u" + "m" + "\u00E9"` |
| `"r\u00E9sume\u0301"` | `"r" + "\u00E9" + "s" + "u" + "m" + "e\u0301"` |
| `"re\u0301sum\u00E9"` | `"r" + "e\u0301" + "s" + "u" + "m" + "\u00E9"` |
| `"re\u0301sume\u0301"` | `"r" + "e\u0301" + "s" + "u" + "m" + "e\u0301"` |
A collation element corresponds loosely to what readers would think of as a single character or cluster of characters. It's conceptually similar to a grapheme cluster but encompasses a somewhat larger umbrella.
Under a linguistic comparer, exact matches are not necessary. Collation elements are instead compared based on their semantic meaning. For example, a linguistic comparer will treat the substrings `"\u00E9"` and `"e\u0301"` as equal since they both semantically mean "a lowercase e with an acute accent modifier". This allows the `IndexOf` method to match the substring `"e\u0301"` within a larger string containing the semantically equivalent substring `"\u00E9"`.
Culture-aware search and comparison routines are a special form of linguistic search and comparison routines. Under a culture-aware comparer, the concept of a collation element is extended to include information specific to the specified culture.
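The matching of `"e\u0301"` against `"\u00E9"` described above can be sketched as follows. This sketch assumes `StringComparison.InvariantCulture` as a way to request a linguistic search that does not depend on the thread's culture. The linguistic result is left unannotated because it depends on whether the runtime uses NLS, ICU, or invariant globalization mode.

```csharp
using System;

string precomposed = "r\u00E9sum\u00E9"; // uses the single char U+00E9
string decomposedE = "e\u0301";           // 'e' followed by a combining acute accent

// Linguistic search: compares collation elements, so the decomposed sequence
// can match the equivalent precomposed char.
int linguisticIdx = precomposed.IndexOf(decomposedE, StringComparison.InvariantCulture);
Console.WriteLine($"Linguistic: {linguisticIdx}");

// Ordinal search: the exact char sequence U+0065 U+0301 is not present.
int ordinalIdx = precomposed.IndexOf(decomposedE, StringComparison.Ordinal);
Console.WriteLine($"Ordinal: {ordinalIdx}"); // Ordinal: -1
```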
For example, in the Hungarian alphabet, when the two characters <dz> appear back-to-back they are considered their own unique letter distinct from either <d> or <z>. This means that when <dz> is seen in a string, a Hungarian culture-aware comparer will treat it as a single collation element.
| String | As collation elements | Comparer |
| --- | --- | --- |
| `"endz"` | `"e" + "n" + "d" + "z"` | standard linguistic comparer |
| `"endz"` | `"e" + "n" + "dz"` | Hungarian culture-aware comparer |
When using a Hungarian culture-aware comparer, this means that the string `"endz"` does not end with the substring `"z"`, as <dz> and <z> are considered collation elements with different semantic meaning.
.NET also offers the invariant globalization mode. This opt-in mode disables code paths which deal with linguistic search and comparison routines. In this mode, all operations use Ordinal or OrdinalIgnoreCase behaviors, regardless of what `CultureInfo` or `StringComparison` argument the caller provides. See the articles Run-time configuration options for globalization and .NET Core Globalization Invariant Mode for more information.
Security implications
If an application is using an affected API for filtering, we recommend enabling the CA1307 and CA1309 rules mentioned above to help locate places where a linguistic search may have inadvertently been used in place of an ordinal search. Code patterns like the following may be susceptible to security exploits.
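A sketch of the kind of filtering code in question, together with an ordinal variant. Both method names are hypothetical, and the first method is shown only to illustrate the risky pattern.

```csharp
using System;

// Risky pattern: the string overload of IndexOf is linguistic by default,
// so the result can differ between runtimes and cultures.
// Shown for illustration only; do not use this for filtering.
static bool ContainsHtmlSensitiveCharactersUnsafe(string input) =>
    input.IndexOf("<") >= 0 || input.IndexOf("&") >= 0;

// Ordinal variant: matches the literal chars on every runtime.
static bool ContainsHtmlSensitiveCharacters(string input) =>
    input.IndexOf("<", StringComparison.Ordinal) >= 0 ||
    input.IndexOf("&", StringComparison.Ordinal) >= 0;

Console.WriteLine(ContainsHtmlSensitiveCharactersUnsafe("a<b"));
Console.WriteLine(ContainsHtmlSensitiveCharacters("a<b"));   // True
Console.WriteLine(ContainsHtmlSensitiveCharacters("plain")); // False
```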
Because the `string.IndexOf(string)` method uses a linguistic search by default, it is possible for a string to contain a literal `'<'` or `'&'` character and for the `string.IndexOf(string)` routine to return -1, indicating that the search substring was not found. The code analyzer rules CA1307 and CA1309 will flag such call sites and alert the developer that there is a potential problem.
Default search and comparison types
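The table below lists the default search and comparison types for various string and string-like APIs. If the caller provides an explicit `CultureInfo` or `StringComparison` parameter, that parameter will be honored over any default.
| API | Default behavior | Remarks |
| --- | --- | --- |
| `string.Compare` | CurrentCulture | |
| `string.CompareTo` | CurrentCulture | |
| `string.Contains` | Ordinal | |
| `string.EndsWith` | Ordinal | (when the first parameter is a `char`) |
| `string.EndsWith` | CurrentCulture | (when the first parameter is a `string`) |
| `string.Equals` | Ordinal | |
| `string.GetHashCode` | Ordinal | |
| `string.IndexOf` | Ordinal | (when the first parameter is a `char`) |
| `string.IndexOf` | CurrentCulture | (when the first parameter is a `string`) |
| `string.IndexOfAny` | Ordinal | |
| `string.LastIndexOf` | Ordinal | (when the first parameter is a `char`) |
| `string.LastIndexOf` | CurrentCulture | (when the first parameter is a `string`) |
| `string.LastIndexOfAny` | Ordinal | |
| `string.Replace` | Ordinal | |
| `string.Split` | Ordinal | |
| `string.StartsWith` | Ordinal | (when the first parameter is a `char`) |
| `string.StartsWith` | CurrentCulture | (when the first parameter is a `string`) |
| `string.ToLower` | CurrentCulture | |
| `string.ToLowerInvariant` | InvariantCulture | |
| `string.ToUpper` | CurrentCulture | |
| `string.ToUpperInvariant` | InvariantCulture | |
| `string.Trim` | Ordinal | |
| `string.TrimEnd` | Ordinal | |
| `string.TrimStart` | Ordinal | |
| `string == string` | Ordinal | |
| `string != string` | Ordinal | |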
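Unlike `string` APIs, all `MemoryExtensions` APIs perform Ordinal searches and comparisons by default, with the following exceptions.
| API | Default behavior | Remarks |
| --- | --- | --- |
| `MemoryExtensions.ToLower` | CurrentCulture | (when passed a null `CultureInfo` argument) |
| `MemoryExtensions.ToLowerInvariant` | InvariantCulture | |
| `MemoryExtensions.ToUpper` | CurrentCulture | (when passed a null `CultureInfo` argument) |
| `MemoryExtensions.ToUpperInvariant` | InvariantCulture | |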
A consequence of the above is that when converting code from consuming `string` to consuming `ReadOnlySpan<char>`, behavioral changes may be introduced inadvertently.
The recommended way to address this is to pass an explicit `StringComparison` parameter to these APIs. The code analyzer rules CA1307 and CA1309 can assist with this.
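The difference between `string` and `ReadOnlySpan<char>` defaults, and the explicit-parameter fix, can be sketched as follows. The sample text and variable names are illustrative assumptions.

```csharp
using System;

string str = "Hello world";
ReadOnlySpan<char> span = str.AsSpan();

// string: a culture-aware (linguistic) comparison by default.
bool stringDefault = str.StartsWith("Hello");

// ReadOnlySpan<char>: an ordinal (non-linguistic) comparison by default.
bool spanDefault = span.StartsWith("Hello");

// With an explicit StringComparison, both calls use the same rules.
bool stringOrdinal = str.StartsWith("Hello", StringComparison.Ordinal);
bool spanOrdinal = span.StartsWith("Hello", StringComparison.Ordinal);

Console.WriteLine($"{stringDefault} {spanDefault} {stringOrdinal} {spanOrdinal}");
```

For plain ASCII input like this the four results normally agree; the difference matters for inputs where linguistic and ordinal rules diverge, such as the combining-character and line-break cases discussed earlier.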