Improving docs with regard to how existing string APIs deal with globalization #21249

GrabYourPitchforks · 2020-10-27T21:56:32Z

We've received a few reports of customers being surprised by behavioral changes resulting from our NLS -> ICU conversion on Windows. In just the past few days both dotnet/runtime#43802 and dotnet/runtime#43736 were opened regarding this change. The change is called out at https://docs.microsoft.com/dotnet/core/compatibility/3.1-5.0#globalization-apis-use-icu-libraries-on-windows, but because it has the potential to be very impactful I wonder if it deserves its own article. This other article can provide more fleshed-out examples and go into further details.

/cc @tarekgh @jeffhandley @ericstj @adamsitnik @safern @danmosemsft, who all provided feedback offline during the creation of this document.

Docs folks, what do you think of including something like this? Is there a logical place to put this information?

Improving the developer experience with regard to default string globalization

Summary

.NET 5 introduces a runtime behavioral change where globalization APIs now use ICU by default across all supported platforms. This is a departure from earlier versions of .NET Framework and .NET Core, which utilized the operating system's NLS functionality when running on Windows. See .NET Globalization and ICU for more information on these changes, including compatibility switches which can revert the application to Windows's older behavior.

Why was this change introduced?

This change was introduced to unify .NET's globalization behavior across all supported operating systems. It also provides the ability for applications to bundle their own globalization libraries rather than depend on the OS's built-in libraries. See the breaking change notification for more information.

How might behavioral differences manifest themselves, and how can developers guard against these?

Developers might call functions like string.IndexOf(string) without using the overload that takes a StringComparison argument, inadvertently taking a dependency on culture-specific behavior when they intended to perform an ordinal search. Since NLS (used by earlier .NET Core and .NET Framework versions on Windows) and ICU implement different logic in their linguistic comparers, methods like string.IndexOf(string) might return unexpected values.

This can manifest itself even in places where developers aren't always expecting globalization facilities to be active. For example, the following code can produce a different answer depending on the current runtime.

string s = "Hello\r\nworld!";
int idx = s.IndexOf("\n");
Console.WriteLine(idx);

// The above snippet prints:
// '6' when running on .NET Framework (Windows)
// '6' when running on .NET Core 2.x - 3.x (Windows)
// '-1' when running on .NET 5 (Windows)
// '-1' when running on .NET Core 2.x - 3.x or .NET 5 (non-Windows)
// '6' when running on .NET Core 2.x or .NET 5 (in invariant mode)
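
One way to make the snippet above deterministic is to request an ordinal search explicitly, or to search for a char rather than a string. The following is a minimal sketch; the class and method names (NewlineSearch, FindNewlineOrdinal, FindNewlineChar) are hypothetical and exist only for illustration.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class NewlineSearch
{
    // Explicit StringComparison.Ordinal: matches the exact char sequence "\n".
    public static int FindNewlineOrdinal(string s) =>
        s.IndexOf("\n", StringComparison.Ordinal);

    // The char overload of IndexOf performs an ordinal search by default.
    public static int FindNewlineChar(string s) => s.IndexOf('\n');
}
```

Both methods return 6 for the input "Hello\r\nworld!", since neither call consults the linguistic comparer.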

Option 1: Enable code analyzers to help detect possibly-buggy call sites

To help guard against any surprising behaviors here, we recommend installing the Microsoft.CodeAnalysis.FxCopAnalyzers NuGet package into your project. This package includes the code analysis rules CA1307 and CA1309, which help flag code which might inadvertently be using a linguistic comparer when an ordinal comparer was likely intended.

For example:

//
// Potentially incorrect code - answer might vary based on locale
//
string s = GetString();
int idx = s.IndexOf(","); // produces analyzer warning CA1307
Console.WriteLine(idx);

//
// Corrected code - matches the literal substring ","
//
string s = GetString();
int idx = s.IndexOf(",", StringComparison.Ordinal);
Console.WriteLine(idx);

//
// Corrected code (alternative) - searches for the literal ',' character
//
string s = GetString();
int idx = s.IndexOf(',');
Console.WriteLine(idx);

Similarly, when instantiating a sorted collection of strings or sorting an existing string-based collection, specify an explicit comparer.

//
// Potentially incorrect code - behavior might vary based on locale
//
SortedSet<string> mySet = new SortedSet<string>();
List<string> list = GetListOfStrings();
list.Sort();

//
// Corrected code - uses ordinal sorting; doesn't vary by locale
//
SortedSet<string> mySet = new SortedSet<string>(StringComparer.Ordinal);
List<string> list = GetListOfStrings();
list.Sort(StringComparer.Ordinal);
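
To make the effect of ordinal sorting concrete, here is a small sketch. The class and method names (OrdinalSortDemo, SortOrdinal) are hypothetical; only List<T>.Sort and StringComparer.Ordinal come from the text above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class, for illustration only.
public static class OrdinalSortDemo
{
    // Returns a new list ordered by char (UTF-16 code unit) values.
    // The current culture plays no role in the result.
    public static List<string> SortOrdinal(IEnumerable<string> items)
    {
        var sorted = new List<string>(items);
        sorted.Sort(StringComparer.Ordinal);
        return sorted;
    }
}
```

For the input "banana", "Apple", "apple", "Banana", this returns "Apple", "Banana", "apple", "banana": uppercase ASCII letters have smaller code points than lowercase ones, so they sort first under an ordinal comparer.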

For more information on these code analyzer rules, including when it might be appropriate to suppress these rules in your own code base, consult the documentation for CA1307 and CA1309.

Option 2: Revert back to NLS behaviors when running .NET 5 apps on Windows

Developers can also follow the steps in the .NET Globalization and ICU document to revert .NET 5 applications back to older NLS behaviors when running on Windows. This is an application-wide compatibility switch and must be set at the application level. Individual libraries cannot opt in to or opt out of this behavior. Even when using this switch, we strongly recommend developers use the CA1307 and CA1309 analyzer rules mentioned above to help improve code hygiene and discover any existing latent bugs.
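
As a rough sketch of what this switch looks like, the runtime setting System.Globalization.UseNls can be set in the application's runtimeconfig.json. Treat the fragment below as an illustration and check the .NET Globalization and ICU document for the authoritative steps, including the equivalent project-file and environment-variable forms.

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.Globalization.UseNls": true
    }
  }
}
```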

What APIs are affected?

Most .NET applications should not encounter any unexpected behaviors due to the .NET 5 changes. However, due to the number of affected APIs and how foundational these APIs are to the wider .NET ecosystem, developers should be aware of the potential for .NET 5 to introduce unwanted behaviors or to expose latent bugs which already exist in the application code.

A non-exhaustive list of affected APIs follows:

System.String.Compare
System.String.EndsWith
System.String.IndexOf
System.String.StartsWith
System.String.ToLower
System.String.ToLowerInvariant
System.String.ToUpper
System.String.ToUpperInvariant
System.Globalization.TextInfo (most members)
System.Globalization.CompareInfo (most members)
System.Array.Sort (when sorting arrays of strings)
System.Collections.Generic.List<T>.Sort (when the list elements are strings)
System.Collections.Generic.SortedDictionary<TKey, TValue> (when the keys are strings)
System.Collections.Generic.SortedList<TKey, TValue> (when the keys are strings)
System.Collections.Generic.SortedSet<T> (when the set contains strings)

All of the above APIs use linguistic string searching and comparison using the thread's current culture by default. The differences between linguistic and ordinal searching and comparison are called out in the section Ordinal vs. linguistic search and comparison below.

Because ICU implements linguistic string comparisons differently from NLS, Windows-based applications that upgrade to .NET 5 from an earlier version of .NET Core or .NET Framework and call one of the above APIs may notice that those APIs begin to behave differently.

Exceptions:

If an API accepts an explicit StringComparison or CultureInfo parameter, that parameter will override the API's default behavior.
System.String members where the first parameter is of type char (e.g., string.IndexOf(char)) use ordinal searching by default unless the caller passes an explicit StringComparison parameter which specifies CurrentCulture[IgnoreCase] or InvariantCulture[IgnoreCase].

See the section Default search and comparison types later in this document for a more detailed analysis of each string API's default behavior.

Ordinal vs. linguistic search and comparison

See the article Best Practices for Using Strings in .NET for further information.

Ordinal (also known as non-linguistic) search and comparison decomposes a string into its individual char elements and performs a char-by-char search or comparison. For example, the strings "dog" and "dog" compare as equal under an Ordinal comparer since the two strings consist of the exact same sequence of chars. However, "dog" and "Dog" will compare as not equal under an Ordinal comparer because they do not consist of the exact same sequence of chars (uppercase 'D''s code point U+0044 occurs before lowercase 'd''s code point U+0064, resulting in "Dog" sorting before "dog").

An OrdinalIgnoreCase comparer still operates on a char-by-char basis, but it eliminates case differences while performing the operation. Under an OrdinalIgnoreCase comparer, the chars 'd' and 'D' compare as equal, as do the chars 'á' and 'Á'. But the unaccented char 'a' will compare as not equal to the accented char 'á'.

Some examples of this are provided in the table below:

String 1	String 2	`Ordinal` comparison	`OrdinalIgnoreCase` comparison
`"dog"`	`"dog"`	equal	equal
`"dog"`	`"Dog"`	not equal	equal
`"resume"`	`"Resume"`	not equal	equal
`"resume"`	`"résumé"`	not equal	not equal
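
The rows of the table can be checked with a short sketch like the following. The class and method names (OrdinalEqualityDemo, EqualOrdinal, EqualOrdinalIgnoreCase) are hypothetical; the calls to string.Equals with a StringComparison argument are the standard API.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class OrdinalEqualityDemo
{
    // Exact char-by-char equality.
    public static bool EqualOrdinal(string a, string b) =>
        string.Equals(a, b, StringComparison.Ordinal);

    // Char-by-char equality that ignores case differences only.
    public static bool EqualOrdinalIgnoreCase(string a, string b) =>
        string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
}
```

For example, EqualOrdinal("dog", "Dog") returns false while EqualOrdinalIgnoreCase("dog", "Dog") returns true, and both methods return false for "resume" versus "r\u00E9sum\u00E9" because ignoring case does not ignore accents.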

Unicode also allows strings to have several different in-memory representations. For example, an e-acute (é) can be represented in two possible ways:

A single literal 'é' character (also written as '\u00E9').
A literal unaccented 'e' character, followed by a combining accent modifier character '\u0301'.

This means that the following four strings will all result in "résumé" when displayed, even though their constituent pieces are different. The strings use a combination of literal 'é' characters or literal unaccented 'e' characters plus the combining accent modifier '\u0301'.

"r\u00E9sum\u00E9"
"r\u00E9sume\u0301"
"re\u0301sum\u00E9"
"re\u0301sume\u0301"

Under an ordinal comparer, none of these strings will compare as equal to each other. This is because they all contain different underlying char sequences, even though when they're rendered to the screen they all look the same.
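
A minimal sketch of this point follows; the class and member names (ResumeRepresentations, Forms, CountOrdinalEqualPairs) are hypothetical and used only for illustration.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class ResumeRepresentations
{
    // The four representations of "résumé" listed above.
    public static readonly string[] Forms =
    {
        "r\u00E9sum\u00E9",
        "r\u00E9sume\u0301",
        "re\u0301sum\u00E9",
        "re\u0301sume\u0301",
    };

    // Counts how many distinct pairs compare as equal under an ordinal comparer.
    public static int CountOrdinalEqualPairs()
    {
        int equalPairs = 0;
        for (int i = 0; i < Forms.Length; i++)
        {
            for (int j = i + 1; j < Forms.Length; j++)
            {
                if (string.Equals(Forms[i], Forms[j], StringComparison.Ordinal))
                {
                    equalPairs++;
                }
            }
        }
        return equalPairs;
    }
}
```

CountOrdinalEqualPairs() returns 0. The strings do not even share a length: they contain 6, 7, 7, and 8 chars respectively.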

When performing a string.IndexOf(..., StringComparison.Ordinal) operation, the runtime looks for an exact substring match. The results are as follows.

Console.WriteLine("resume".IndexOf("e", StringComparison.Ordinal)); // prints '1'
Console.WriteLine("r\u00E9sum\u00E9".IndexOf("e", StringComparison.Ordinal)); // prints '-1'
Console.WriteLine("r\u00E9sume\u0301".IndexOf("e", StringComparison.Ordinal)); // prints '5'
Console.WriteLine("re\u0301sum\u00E9".IndexOf("e", StringComparison.Ordinal)); // prints '1'
Console.WriteLine("re\u0301sume\u0301".IndexOf("e", StringComparison.Ordinal)); // prints '1'
Console.WriteLine("resume".IndexOf("E", StringComparison.OrdinalIgnoreCase)); // prints '1'
Console.WriteLine("r\u00E9sum\u00E9".IndexOf("E", StringComparison.OrdinalIgnoreCase)); // prints '-1'
Console.WriteLine("r\u00E9sume\u0301".IndexOf("E", StringComparison.OrdinalIgnoreCase)); // prints '5'
Console.WriteLine("re\u0301sum\u00E9".IndexOf("E", StringComparison.OrdinalIgnoreCase)); // prints '1'
Console.WriteLine("re\u0301sume\u0301".IndexOf("E", StringComparison.OrdinalIgnoreCase)); // prints '1'

Ordinal search and comparison routines are never affected by the current thread's culture setting.

Linguistic search and comparison routines decompose a string into collation elements and perform searches or comparisons on these elements. There's not necessarily a 1:1 mapping between a string's chars and its constituent collation elements. For example, a string of length 2 may consist of only a single collation element. When two strings are compared in a linguistic-aware fashion, the comparer is checking whether the two strings' collation elements have the same semantic meaning, even if the strings' literal chars are different.

Consider again the string "résumé" and its four different representations. The table below shows each representation broken down into its collation elements.

String	As collation elements
`"r\u00E9sum\u00E9"`	`"r" + "\u00E9" + "s" + "u" + "m" + "\u00E9"`
`"r\u00E9sume\u0301"`	`"r" + "\u00E9" + "s" + "u" + "m" + "e\u0301"`
`"re\u0301sum\u00E9"`	`"r" + "e\u0301" + "s" + "u" + "m" + "\u00E9"`
`"re\u0301sume\u0301"`	`"r" + "e\u0301" + "s" + "u" + "m" + "e\u0301"`

Roughly speaking, a collation element corresponds to what readers would think of as a single character or cluster of characters. It's conceptually similar to a grapheme cluster but is a somewhat broader concept.

Under a linguistic comparer, exact matches are not necessary. Collation elements are instead compared based on their semantic meaning. For example, a linguistic comparer will treat the substrings "\u00E9" and "e\u0301" as equal since they both semantically mean "a lowercase e with an acute accent modifier". This allows the IndexOf method to match the substring "e\u0301" within a larger string containing the semantically equivalent substring "\u00E9", as shown in the sample below.

Console.WriteLine("r\u00E9sum\u00E9".IndexOf("e")); // prints '-1' (not found)
Console.WriteLine("r\u00E9sum\u00E9".IndexOf("e\u0301")); // prints '1'
Console.WriteLine("\u00E9".IndexOf("e\u0301")); // prints '0'

As a consequence of this, two strings of different lengths may compare as equal if a linguistic comparison is used. Callers should take care not to special-case logic dealing with string length in such scenarios.

Culture-aware search and comparison routines are a special form of linguistic search and comparison routines. Under a culture-aware comparer, the concept of a collation element is extended to include information specific to the specified culture.

For example, in the Hungarian alphabet, when the two characters <dz> appear back-to-back they are considered their own unique letter distinct from either <d> or <z>. This means that when <dz> is seen in a string, a Hungarian culture-aware comparer will treat it as a single collation element.

String	As collation elements	Remarks
`"endz"`	`"e" + "n" + "d" + "z"`	(using a standard linguistic comparer)
`"endz"`	`"e" + "n" + "dz"`	(using a Hungarian culture-aware comparer)

When using a Hungarian culture-aware comparer, this means that the string "endz" does not end with the substring "z", as <dz> and <z> are considered collation elements with different semantic meaning.

// Set thread culture to Hungarian
CultureInfo.CurrentCulture = CultureInfo.GetCultureInfo("hu-HU");
Console.WriteLine("endz".EndsWith("z")); // Prints 'False'

// Set thread culture to invariant culture
CultureInfo.CurrentCulture = CultureInfo.InvariantCulture;
Console.WriteLine("endz".EndsWith("z")); // Prints 'True'

Behavioral note: Linguistic and culture-aware comparers can undergo behavioral adjustments from time to time. Both ICU and the older Windows NLS facility are updated to account for how world languages change. See the blog post Locale (culture) data churn for more information. The Ordinal comparer's behavior will never change since it's performing exact bitwise searching and comparison. However, the OrdinalIgnoreCase comparer's behavior may change as Unicode grows to encompass more character sets and corrects omissions in existing casing data.

Usage note: The comparers StringComparison.InvariantCulture and StringComparison.InvariantCultureIgnoreCase are linguistic comparers that are not culture-aware. That is, these comparers understand concepts such as the accented character é having multiple possible underlying representations and that all such representations should be treated equal. But non-culture-aware linguistic comparers won't contain special handling for <dz> as distinct from <d> or <z>, as shown above. They also won't special-case characters like the German Eszett (ß).

.NET also offers the invariant globalization mode. This opt-in mode disables code paths which deal with linguistic search and comparison routines. In this mode, all operations use Ordinal or OrdinalIgnoreCase behaviors, regardless of what CultureInfo or StringComparison argument the caller provides. See the articles Run-time configuration options for globalization and .NET Core Globalization Invariant Mode for more information.

Security implications

If an application is using an affected API for filtering, we recommend enabling the CA1307 and CA1309 rules mentioned above to help locate places where a linguistic search may have inadvertently been used in place of an ordinal search. Code patterns like the following may be susceptible to security exploits.

//
// THIS SAMPLE CODE IS INCORRECT.
// DO NOT USE IT IN PRODUCTION.
//
public bool ContainsHtmlSensitiveCharacters(string input)
{
    if (input.IndexOf("<") >= 0) { return true; }
    if (input.IndexOf("&") >= 0) { return true; }
    return false;
}

Because the string.IndexOf(string) method uses a linguistic search by default, it is possible for a string to contain a literal '<' or '&' character and for the string.IndexOf(string) routine to return -1, indicating that the search substring was not found. The code analyzer rules CA1307 and CA1309 will flag such call sites and alert the developer that there is a potential problem.
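
One possible correction, sketched under the assumption that the intent is to detect the literal '<' and '&' characters, is to use the char overloads of IndexOf, which are ordinal by default. The containing class name HtmlFilter is hypothetical, and the method is made static here only to keep the sketch self-contained.

```csharp
using System;

// Hypothetical containing class, for illustration only.
public static class HtmlFilter
{
    // Uses the char overloads of IndexOf, which perform ordinal searches,
    // so a literal '<' or '&' is always detected regardless of culture.
    public static bool ContainsHtmlSensitiveCharacters(string input)
    {
        if (input is null) { throw new ArgumentNullException(nameof(input)); }
        return input.IndexOf('<') >= 0 || input.IndexOf('&') >= 0;
    }
}
```

Passing StringComparison.Ordinal to the string overloads, as in input.IndexOf("<", StringComparison.Ordinal), would be an equivalent fix.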

Default search and comparison types

The table below lists the default search and comparison types for various string and string-like APIs. If the caller provides an explicit CultureInfo or StringComparison parameter, that parameter will be honored over any default.

API	Default behavior	Remarks
`string.Compare`	CurrentCulture
`string.CompareTo`	CurrentCulture
`string.Contains`	Ordinal
`string.EndsWith`	Ordinal	(when the first parameter is a `char`)
`string.EndsWith`	CurrentCulture	(when the first parameter is a `string`)
`string.Equals`	Ordinal
`string.GetHashCode`	Ordinal
`string.IndexOf`	Ordinal	(when the first parameter is a `char`)
`string.IndexOf`	CurrentCulture	(when the first parameter is a `string`)
`string.IndexOfAny`	Ordinal
`string.LastIndexOf`	Ordinal	(when the first parameter is a `char`)
`string.LastIndexOf`	CurrentCulture	(when the first parameter is a `string`)
`string.LastIndexOfAny`	Ordinal
`string.Replace`	Ordinal
`string.Split`	Ordinal
`string.StartsWith`	Ordinal	(when the first parameter is a `char`)
`string.StartsWith`	CurrentCulture	(when the first parameter is a `string`)
`string.ToLower`	CurrentCulture
`string.ToLowerInvariant`	InvariantCulture
`string.ToUpper`	CurrentCulture
`string.ToUpperInvariant`	InvariantCulture
`string.Trim`	Ordinal
`string.TrimEnd`	Ordinal
`string.TrimStart`	Ordinal
`string == string`	Ordinal
`string != string`	Ordinal

Unlike string APIs, all MemoryExtensions APIs perform Ordinal searches and comparisons by default, with the following exceptions.

API	Default behavior	Remarks
`MemoryExtensions.ToLower`	CurrentCulture	(when passed a null `CultureInfo` argument)
`MemoryExtensions.ToLowerInvariant`	InvariantCulture
`MemoryExtensions.ToUpper`	CurrentCulture	(when passed a null `CultureInfo` argument)
`MemoryExtensions.ToUpperInvariant`	InvariantCulture

A consequence of the above is that when converting code from consuming string to consuming ReadOnlySpan<char>, behavioral changes may be introduced inadvertently. An example of this follows.

string str = GetString();
if (str.StartsWith("Hello")) { /* do something */ } // this is a CULTURE-AWARE (linguistic) comparison

ReadOnlySpan<char> span = str.AsSpan();
if (span.StartsWith("Hello")) { /* do something */ } // this is an ORDINAL (non-linguistic) comparison

The recommended way to address this is to pass an explicit StringComparison parameter to these APIs. The code analyzer rules CA1307 and CA1309 can assist with this.

string str = GetString();
if (str.StartsWith("Hello", StringComparison.Ordinal)) { /* do something */ } // ordinal comparison

ReadOnlySpan<char> span = str.AsSpan();
if (span.StartsWith("Hello", StringComparison.Ordinal)) { /* do something */ } // ordinal comparison


danmoseley · 2020-10-27T22:10:38Z

cc @PriyaPurkayastha @marklio

GrabYourPitchforks · 2020-10-27T22:14:36Z

To clarify - there's no critical need to publish these contents as-is in its own article. But this text contains basically the sum of everything that developers need to know to be successful with these APIs and to migrate their applications. If this information can somehow find its way into the relevant docs and these docs could all be linked together, that should help improve the experience. I've linked to a few of the existing docs + breaking change notices throughout the draft text.

danmoseley · 2020-10-27T22:19:24Z

@carlossanlop who is the right docs person to tag, who can perhaps help us get this into a doc relatively briskly, to help folks at GA time?

PriyaPurkayastha · 2020-10-27T22:30:01Z

Adding @gewarren
Thanks for putting this together @GrabYourPitchforks
I think a lot of this is useful information that needs to be included in the breaking change doc https://docs.microsoft.com/en-us/dotnet/core/compatibility/3.1-5.0#globalization-apis-use-icu-libraries-on-windows. It probably makes the breaking change doc a bit lengthy but it clearly explains the change in behavior that customers will see due to the breaking change.

gewarren · 2020-10-27T22:44:49Z

I think this article should be separate to the breaking change article, and suggest that it lives in this section: https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/.

Also, regarding the code analyzers, why are we recommending to install the FxCop NuGet package when the analyzers are included with the .NET 5.0 SDK?

aolszowka · 2020-10-28T14:15:35Z

@gewarren

> Also, regarding the code analyzers, why are we recommending to install the FxCop NuGet package when the analyzers are included with the .NET 5.0 SDK?

In a perfect world I am going to take this information and immediately apply it to our existing code bases, which are currently net472, to proactively fix these issues prior to dealing with them during the .NET 5 lift. Anything that can get me there before it is an issue (due to the scale of our code base, ~20 million lines) is super preferable to reactionary efforts.

GrabYourPitchforks · 2020-10-28T17:33:25Z

> Also, regarding the code analyzers, why are we recommending to install the FxCop NuGet package when the analyzers are included with the .NET 5.0 SDK?

That's a good question. The gesture of installing the analyzers via NuGet was the only way I could figure out how to actually get them to light up over my existing code.

safern · 2020-10-28T17:43:02Z

I believe the package name is Microsoft.CodeAnalysis.NetAnalyzers and we also include in the SDK Microsoft.CodeAnalysis.CSharp.CodeStyle -- I would think these analyzers would be in the first one though, but not sure.

akoeplinger · 2020-10-29T20:31:25Z

I think the breaking change section at https://docs.microsoft.com/dotnet/core/compatibility/3.1-5.0#globalization-apis-use-icu-libraries-on-windows should at least contain the How might behavioral differences manifest themselves, and how can developers guard against these? example from @GrabYourPitchforks description.

It's pretty hard to make the connection from just reading the "Globalization APIs use ICU libraries on Windows" description to "this potentially changes string.IndexOf() behavior in my app" so having that code sample would be nice.

I agree that for details we should link to this GitHub issue or an eventual docs article.

GrabYourPitchforks added the doc-idea and breaking-change labels Oct 27, 2020

dotnet-bot added the ⌚ Not Triaged label Oct 27, 2020

GrabYourPitchforks mentioned this issue Oct 28, 2020

Improving the developer experience with regard to default string globalization dotnet/runtime#43956

Open

gewarren self-assigned this Oct 29, 2020

gewarren added the 🏁 Release: .NET 5 and P1 labels and removed the ⌚ Not Triaged and breaking-change labels Oct 29, 2020

gewarren mentioned this issue Nov 4, 2020

Behavior changes due to NLS -> ICU switch on Windows #21333

Merged

gewarren closed this as completed in #21333 Nov 4, 2020

gewarren mentioned this issue Nov 4, 2020

Update ICU breaking change with practical examples #21345

Merged

GrabYourPitchforks mentioned this issue Nov 9, 2020

Why does this code have different results on different OS dotnet/core#5522

Closed

danmoseley mentioned this issue Sep 28, 2021

FXVersion.TryParse fails to parse version on Thai culture dotnet/sdk#21518

Closed

ygoe mentioned this issue Feb 18, 2023

Typos #34169

Closed
