Improving docs with regard to how existing string APIs deal with globalization #21249

Closed
GrabYourPitchforks opened this issue Oct 27, 2020 · 9 comments · Fixed by #21333
Labels
🏁 Release: .NET 5 Work items for the .NET 5 release doc-idea Indicates issues that are suggestions for new topics [org][type][category]

Comments

@GrabYourPitchforks
Member

We've received a few reports of customers being surprised by behavioral changes resulting from our NLS -> ICU conversion on Windows. In just the past few days both dotnet/runtime#43802 and dotnet/runtime#43736 were opened regarding this change. The change is called out at https://docs.microsoft.com/dotnet/core/compatibility/3.1-5.0#globalization-apis-use-icu-libraries-on-windows, but because it has the potential to be very impactful I wonder if it deserves its own article. This other article can provide more fleshed-out examples and go into further details.

/cc @tarekgh @jeffhandley @ericstj @adamsitnik @safern @danmosemsft, who all provided feedback offline during the creation of this document.

Docs folks, what do you think of including something like this? Is there a logical place to put this information?


Improving the developer experience with regard to default string globalization

Summary

.NET 5 introduces a runtime behavioral change where globalization APIs now use ICU by default across all supported platforms. This is a departure from earlier versions of .NET Framework and .NET Core, which utilized the operating system's NLS functionality when running on Windows. See .NET Globalization and ICU for more information on these changes, including compatibility switches which can revert the application to Windows's older behavior.

Why was this change introduced?

This change was introduced to unify .NET's globalization behavior across all supported operating systems. It also provides the ability for applications to bundle their own globalization libraries rather than depend on the OS's built-in libraries. See the breaking change notification for more information.

How might behavioral differences manifest themselves, and how can developers guard against these?

Developers might be using functions like string.IndexOf(string) without calling the overload which takes a StringComparison argument, inadvertently taking a dependency on culture-specific behavior when they had intended instead to perform an ordinal search. Since NLS (used by earlier .NET Core and .NET Framework versions on Windows) and ICU implement different logic in their linguistic comparers, methods like string.IndexOf(string) might return unexpected values.

This can manifest itself even in places where developers aren't always expecting globalization facilities to be active. For example, the following code can produce a different answer depending on the current runtime.

string s = "Hello\r\nworld!";
int idx = s.IndexOf("\n");
Console.WriteLine(idx);

// The above snippet prints:
// '6' when running on .NET Framework (Windows)
// '6' when running on .NET Core 2.x - 3.x (Windows)
// '-1' when running on .NET 5 (Windows)
// '-1' when running on .NET Core 2.x - 3.x or .NET 5 (non-Windows)
// '6' when running on .NET Core 2.x or .NET 5 (in invariant mode)
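
For comparison, here is a minimal sketch of the same search written with an explicit ordinal comparison. It assumes C# 9 top-level statements; the variable names ordinalIdx and charIdx are illustrative only. Because neither call goes through a linguistic comparer, the result should not depend on the runtime, the operating system, or the current culture.

```csharp
using System;

string s = "Hello\r\nworld!";

// Ordinal search: compares raw char values, so NLS vs. ICU makes no difference.
int ordinalIdx = s.IndexOf("\n", StringComparison.Ordinal);
Console.WriteLine(ordinalIdx); // prints '6'

// The char overload performs an ordinal search by default.
int charIdx = s.IndexOf('\n');
Console.WriteLine(charIdx); // prints '6'
```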

Option 1: Enable code analyzers to help detect possibly-buggy call sites

To help guard against any surprising behaviors here, we recommend installing the Microsoft.CodeAnalysis.FxCopAnalyzers NuGet package into your project. This package includes the code analysis rules CA1307 and CA1309, which help flag code which might inadvertently be using a linguistic comparer when an ordinal comparer was likely intended.

For example:

//
// Potentially incorrect code - answer might vary based on locale
//
string s = GetString();
int idx = s.IndexOf(","); // produces analyzer warning CA1307
Console.WriteLine(idx);

//
// Corrected code - matches the literal substring ","
//
string s = GetString();
int idx = s.IndexOf(",", StringComparison.Ordinal);
Console.WriteLine(idx);

//
// Corrected code (alternative) - searches for the literal ',' character
//
string s = GetString();
int idx = s.IndexOf(',');
Console.WriteLine(idx);

Similarly, when instantiating a sorted collection of strings or sorting an existing string-based collection, specify an explicit comparer.

//
// Potentially incorrect code - behavior might vary based on locale
//
SortedSet<string> mySet = new SortedSet<string>();
List<string> list = GetListOfStrings();
list.Sort();

//
// Corrected code - uses ordinal sorting; doesn't vary by locale
//
SortedSet<string> mySet = new SortedSet<string>(StringComparer.Ordinal);
List<string> list = GetListOfStrings();
list.Sort(StringComparer.Ordinal);
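
To make the effect of ordinal sorting concrete, the following sketch sorts a small, made-up list (the sample data is hypothetical, not taken from any real application). Ordinal sorting orders strings by raw char value, so uppercase ASCII letters sort ahead of lowercase ones regardless of locale.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sample data, for illustration only.
List<string> words = new List<string> { "b", "A", "a", "B" };

// Ordinal sort: 'A' (U+0041) and 'B' (U+0042) precede 'a' (U+0061) and 'b' (U+0062).
words.Sort(StringComparer.Ordinal);

Console.WriteLine(string.Join(", ", words)); // prints 'A, B, a, b'
```

If a user-facing, alphabetical ordering is what the application actually needs, an ordinal sort is probably the wrong tool; in that case, passing an explicit culture-aware comparer at least makes the intent visible at the call site.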

For more information on these code analyzer rules, including when it might be appropriate to suppress these rules in your own code base, consult the reference articles for CA1307 and CA1309.

Option 2: Revert back to NLS behaviors when running .NET 5 apps on Windows

Developers can also follow the steps in the .NET Globalization and ICU document to revert .NET 5 applications back to older NLS behaviors when running on Windows. This is an application-wide compatibility switch and must be set at the application level. Individual libraries cannot opt-in or opt-out of this behavior. We strongly recommend developers use the CA1307 and CA1309 analyzer rules mentioned above to help improve code hygiene and discover any existing latent bugs.
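
As a rough sketch of what this switch looks like: the runtime setting is named System.Globalization.UseNls, and assuming the application is configured through a runtimeconfig.json file, the entry would look something like the fragment below. Treat it as an illustration and consult the .NET Globalization and ICU document for the authoritative steps, including the project-file and environment-variable alternatives.

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.Globalization.UseNls": true
    }
  }
}
```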

What APIs are affected?

Most .NET applications should not encounter any unexpected behaviors due to the .NET 5 changes. However, due to the number of affected APIs and how foundational these APIs are to the wider .NET ecosystem, developers should be aware of the potential for .NET 5 to introduce unwanted behaviors or to expose latent bugs which already exist in the application code.

A non-exhaustive list of affected APIs follows:

All of the above APIs use linguistic string searching and comparison using the thread's current culture by default. The differences between linguistic and ordinal searching and comparison are called out in the section Ordinal vs. linguistic search and comparison below.

Because ICU implements linguistic string comparisons differently from NLS, Windows-based applications which upgrade to .NET 5 from an earlier version of .NET Core or .NET Framework and which call one of the above APIs may notice that the above APIs begin exhibiting different behaviors.

Exceptions:

  • If an API accepts an explicit StringComparison or CultureInfo parameter, that parameter will override the API's default behavior.
  • System.String members where the first parameter is of type char (e.g., string.IndexOf(char)) use ordinal searching by default unless the caller passes an explicit StringComparison parameter which specifies CurrentCulture[IgnoreCase] or InvariantCulture[IgnoreCase].

See the section Default search and comparison types later in this document for a more detailed analysis of each string API's default behavior.

Ordinal vs. linguistic search and comparison

See the article Best Practices for Using Strings in .NET for further information.

Ordinal (also known as non-linguistic) search and comparison decomposes a string into its individual char elements and performs a char-by-char search or comparison. For example, the strings "dog" and "dog" compare as equal under an Ordinal comparer since the two strings consist of the exact same sequence of chars. However, "dog" and "Dog" will compare as not equal under an Ordinal comparer because they do not consist of the exact same sequence of chars. The code point of uppercase 'D' (U+0044) occurs before the code point of lowercase 'd' (U+0064), resulting in "Dog" sorting before "dog".

An OrdinalIgnoreCase comparer still operates on a char-by-char basis, but it eliminates case differences while performing the operation. Under an OrdinalIgnoreCase comparer, the char pairs 'd' and 'D' compare as equal, as do the char pairs 'á' and 'Á'. But the unaccented char 'a' will compare as not equal to the accented char 'á'.

Some examples of this are provided in the table below:

String 1   | String 2   | Ordinal comparison | OrdinalIgnoreCase comparison
"dog"      | "dog"      | equal              | equal
"dog"      | "Dog"      | not equal          | equal
"resume"   | "Resume"   | not equal          | equal
"resume"   | "résumé"   | not equal          | not equal
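
The rows above can be checked with a few calls to string.Equals and string.CompareOrdinal. This is a small sketch using C# 9 top-level statements; the variable name order is illustrative.

```csharp
using System;

// "dog" vs. "Dog": different chars, so Ordinal says not equal.
Console.WriteLine(string.Equals("dog", "Dog", StringComparison.Ordinal)); // prints 'False'

// OrdinalIgnoreCase removes the case difference only.
Console.WriteLine(string.Equals("dog", "Dog", StringComparison.OrdinalIgnoreCase)); // prints 'True'

// Accents are not case differences, so "resume" and "résumé" remain unequal.
Console.WriteLine(string.Equals("resume", "r\u00E9sum\u00E9", StringComparison.OrdinalIgnoreCase)); // prints 'False'

// 'D' (U+0044) has a smaller code point than 'd' (U+0064), so "Dog" sorts first.
int order = string.CompareOrdinal("Dog", "dog");
Console.WriteLine(order < 0); // prints 'True'
```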

Unicode also allows strings to have several different in-memory representations. For example, an e-acute (é) can be represented in two possible ways:

  • A single literal 'é' character (also written as '\u00E9').
  • A literal unaccented 'e' character, followed by a combining accent modifier character '\u0301'.

This means that the following four strings will all result in "résumé" when displayed, even though their constituent pieces are different. The strings use a combination of literal 'é' characters or literal unaccented 'e' characters plus the combining accent modifier '\u0301'.

  • "r\u00E9sum\u00E9"
  • "r\u00E9sume\u0301"
  • "re\u0301sum\u00E9"
  • "re\u0301sume\u0301"

Under an ordinal comparer, none of these strings will compare as equal to each other. This is because they all contain different underlying char sequences, even though when they're rendered to the screen they all look the same.

When performing a string.IndexOf(..., StringComparison.Ordinal) operation, the runtime will look for an exact substring match. This produces the following results.

Console.WriteLine("resume".IndexOf("e", StringComparison.Ordinal)); // prints '1'
Console.WriteLine("r\u00E9sum\u00E9".IndexOf("e", StringComparison.Ordinal)); // prints '-1'
Console.WriteLine("r\u00E9sume\u0301".IndexOf("e", StringComparison.Ordinal)); // prints '5'
Console.WriteLine("re\u0301sum\u00E9".IndexOf("e", StringComparison.Ordinal)); // prints '1'
Console.WriteLine("re\u0301sume\u0301".IndexOf("e", StringComparison.Ordinal)); // prints '1'
Console.WriteLine("resume".IndexOf("E", StringComparison.OrdinalIgnoreCase)); // prints '1'
Console.WriteLine("r\u00E9sum\u00E9".IndexOf("E", StringComparison.OrdinalIgnoreCase)); // prints '-1'
Console.WriteLine("r\u00E9sume\u0301".IndexOf("E", StringComparison.OrdinalIgnoreCase)); // prints '5'
Console.WriteLine("re\u0301sum\u00E9".IndexOf("E", StringComparison.OrdinalIgnoreCase)); // prints '1'
Console.WriteLine("re\u0301sume\u0301".IndexOf("E", StringComparison.OrdinalIgnoreCase)); // prints '1'

Ordinal search and comparison routines are never affected by the current thread's culture setting.

Linguistic search and comparison routines decompose a string into collation elements and perform searches or comparisons on these elements. There's not necessarily a 1:1 mapping between a string's chars and its constituent collation elements. For example, a string of length 2 may consist of only a single collation element. When two strings are compared in a linguistic-aware fashion, the comparer is checking whether the two strings' collation elements have the same semantic meaning, even if the string's literal chars are different.

Consider again the string "résumé" and its four different representations. The table below shows each representation broken down into its collation elements.

String                 | As collation elements
"r\u00E9sum\u00E9"     | "r" + "\u00E9" + "s" + "u" + "m" + "\u00E9"
"r\u00E9sume\u0301"    | "r" + "\u00E9" + "s" + "u" + "m" + "e\u0301"
"re\u0301sum\u00E9"    | "r" + "e\u0301" + "s" + "u" + "m" + "\u00E9"
"re\u0301sume\u0301"   | "r" + "e\u0301" + "s" + "u" + "m" + "e\u0301"

Loosely speaking, a collation element corresponds to what readers would think of as a single character or cluster of characters. It's conceptually similar to a grapheme cluster but encompasses a somewhat larger umbrella.

Under a linguistic comparer, exact matches are not necessary. Collation elements are instead compared based on their semantic meaning. For example, a linguistic comparer will treat the substrings "\u00E9" and "e\u0301" as equal since they both semantically mean "a lowercase e with an acute accent modifier". This allows the IndexOf method to match the substring "e\u0301" within a larger string containing the semantically equivalent substring "\u00E9", as shown in the sample below.

Console.WriteLine("r\u00E9sum\u00E9".IndexOf("e")); // prints '-1' (not found)
Console.WriteLine("r\u00E9sum\u00E9".IndexOf("e\u0301")); // prints '1'
Console.WriteLine("\u00E9".IndexOf("e\u0301")); // prints '0'

As a consequence of this, two strings of different lengths may compare as equal if a linguistic comparison is used. Callers should take care not to special-case logic dealing with string length in such scenarios.
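
The length mismatch can be seen with two of the "résumé" representations from earlier. This is a sketch only: the variable names are illustrative, and the last line is expected to print 'True' when ICU or NLS is active, but would print 'False' if the application runs in invariant globalization mode, which is why no output is asserted for it.

```csharp
using System;

string precomposed = "r\u00E9sum\u00E9";   // uses the single char U+00E9 twice
string decomposed = "re\u0301sume\u0301";  // uses 'e' + combining accent U+0301 twice

Console.WriteLine(precomposed.Length); // prints '6'
Console.WriteLine(decomposed.Length);  // prints '8'

// Different char sequences, so an ordinal comparison reports them as unequal.
Console.WriteLine(string.Equals(precomposed, decomposed, StringComparison.Ordinal)); // prints 'False'

// Linguistic comparison: the result depends on the globalization mode in use.
Console.WriteLine(string.Equals(precomposed, decomposed, StringComparison.InvariantCulture));
```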

Culture-aware search and comparison routines are a special form of linguistic search and comparison routines. Under a culture-aware comparer, the concept of a collation element is extended to include information specific to the specified culture.

For example, in the Hungarian alphabet, when the two characters <dz> appear back-to-back they are considered their own unique letter distinct from either <d> or <z>. This means that when <dz> is seen in a string, a Hungarian culture-aware comparer will treat it as a single collation element.

String   | As collation elements   | Remarks
"endz"   | "e" + "n" + "d" + "z"   | (using a standard linguistic comparer)
"endz"   | "e" + "n" + "dz"        | (using a Hungarian culture-aware comparer)

When using a Hungarian culture-aware comparer, this means that the string "endz" does not end with the substring "z", as <dz> and <z> are considered collation elements with different semantic meaning.

// Set thread culture to Hungarian
CultureInfo.CurrentCulture = CultureInfo.GetCultureInfo("hu-HU");
Console.WriteLine("endz".EndsWith("z")); // Prints 'False'

// Set thread culture to invariant culture
CultureInfo.CurrentCulture = CultureInfo.InvariantCulture;
Console.WriteLine("endz".EndsWith("z")); // Prints 'True'

Behavioral note: Linguistic and culture-aware comparers can undergo behavioral adjustments from time to time. Both ICU and the older Windows NLS facility are updated to account for how world languages change. See the blog post Locale (culture) data churn for more information. The Ordinal comparer's behavior will never change since it's performing exact bitwise searching and comparison. However, the OrdinalIgnoreCase comparer's behavior may change as Unicode grows to encompass more character sets and corrects omissions in existing casing data.

Usage note: The comparers StringComparison.InvariantCulture and StringComparison.InvariantCultureIgnoreCase are linguistic comparers that are not culture-aware. That is, these comparers understand concepts such as the accented character é having multiple possible underlying representations and that all such representations should be treated equal. But non-culture-aware linguistic comparers won't contain special handling for <dz> as distinct from <d> or <z>, as shown above. They also won't special-case characters like the German Eszett (ß).

.NET also offers the invariant globalization mode. This opt-in mode disables code paths which deal with linguistic search and comparison routines. In this mode, all operations use Ordinal or OrdinalIgnoreCase behaviors, regardless of what CultureInfo or StringComparison argument the caller provides. See the articles Run-time configuration options for globalization and .NET Core Globalization Invariant Mode for more information.
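
For illustration, and assuming the application is configured through a runtimeconfig.json file, opting in to invariant mode would look roughly like the fragment below. The setting name System.Globalization.Invariant comes from the runtime configuration documentation; check the linked articles for the project-file and environment-variable equivalents before relying on this sketch.

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.Globalization.Invariant": true
    }
  }
}
```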

Security implications

If an application is using an affected API for filtering, we recommend enabling the CA1307 and CA1309 rules mentioned above to help locate places where a linguistic search may have inadvertently been used in place of an ordinal search. Code patterns like the following may be susceptible to security exploits.

//
// THIS SAMPLE CODE IS INCORRECT.
// DO NOT USE IT IN PRODUCTION.
//
public bool ContainsHtmlSensitiveCharacters(string input)
{
    if (input.IndexOf("<") >= 0) { return true; }
    if (input.IndexOf("&") >= 0) { return true; }
    return false;
}

Because the string.IndexOf(string) method uses a linguistic search by default, it is possible for a string to contain a literal '<' or '&' character and for the string.IndexOf(string) routine to return -1, indicating that the search substring was not found. The code analyzer rules CA1307 and CA1309 will flag such call sites and alert the developer that there is a potential problem.
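
One possible correction is sketched below, under the assumption that the intent is an exact character match. It uses string.IndexOfAny(char[]), which always performs an ordinal search; passing StringComparison.Ordinal to IndexOf(string, StringComparison) would work equally well. The sample inputs are made up for illustration.

```csharp
using System;

// Corrected sketch: an ordinal search cannot skip over a literal '<' or '&'.
bool ContainsHtmlSensitiveCharacters(string input)
{
    return input.IndexOfAny(new[] { '<', '&' }) >= 0;
}

Console.WriteLine(ContainsHtmlSensitiveCharacters("a < b")); // prints 'True'
Console.WriteLine(ContainsHtmlSensitiveCharacters("R&D"));   // prints 'True'
Console.WriteLine(ContainsHtmlSensitiveCharacters("plain")); // prints 'False'
```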

Default search and comparison types

The table below lists the default search and comparison types for various string and string-like APIs. If the caller provides an explicit CultureInfo or StringComparison parameter, that parameter will be honored over any default.

API                       | Default behavior   | Remarks
string.Compare            | CurrentCulture     |
string.CompareTo          | CurrentCulture     |
string.Contains           | Ordinal            |
string.EndsWith           | Ordinal            | (when the first parameter is a char)
string.EndsWith           | CurrentCulture     | (when the first parameter is a string)
string.Equals             | Ordinal            |
string.GetHashCode        | Ordinal            |
string.IndexOf            | Ordinal            | (when the first parameter is a char)
string.IndexOf            | CurrentCulture     | (when the first parameter is a string)
string.IndexOfAny         | Ordinal            |
string.LastIndexOf        | Ordinal            | (when the first parameter is a char)
string.LastIndexOf        | CurrentCulture     | (when the first parameter is a string)
string.LastIndexOfAny     | Ordinal            |
string.Replace            | Ordinal            |
string.Split              | Ordinal            |
string.StartsWith         | Ordinal            | (when the first parameter is a char)
string.StartsWith         | CurrentCulture     | (when the first parameter is a string)
string.ToLower            | CurrentCulture     |
string.ToLowerInvariant   | InvariantCulture   |
string.ToUpper            | CurrentCulture     |
string.ToUpperInvariant   | InvariantCulture   |
string.Trim               | Ordinal            |
string.TrimEnd            | Ordinal            |
string.TrimStart          | Ordinal            |
string == string          | Ordinal            |
string != string          | Ordinal            |
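
One practical consequence of these defaults is that string.Contains(string) and string.IndexOf(string) can disagree about the same input, because the first is ordinal and the second is culture-sensitive. The sketch below illustrates this; the middle line deliberately has no expected output, since its result depends on whether NLS, ICU, or invariant mode is in effect.

```csharp
using System;

string s = "Hello\r\nworld!";

// string.Contains(string) is ordinal by default.
Console.WriteLine(s.Contains("\n")); // prints 'True'

// string.IndexOf(string) uses the current culture by default, so this
// result can vary by runtime and platform.
Console.WriteLine(s.IndexOf("\n") >= 0);

// With an explicit ordinal comparison, the two APIs agree.
Console.WriteLine(s.IndexOf("\n", StringComparison.Ordinal) >= 0); // prints 'True'
```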

Unlike string APIs, all MemoryExtensions APIs perform Ordinal searches and comparisons by default, with the following exceptions.

API                                 | Default behavior   | Remarks
MemoryExtensions.ToLower            | CurrentCulture     | (when passed a null CultureInfo argument)
MemoryExtensions.ToLowerInvariant   | InvariantCulture   |
MemoryExtensions.ToUpper            | CurrentCulture     | (when passed a null CultureInfo argument)
MemoryExtensions.ToUpperInvariant   | InvariantCulture   |

A consequence of the above is that when converting code from consuming string to consuming ReadOnlySpan<char>, behavioral changes may be introduced inadvertently. An example of this follows.

string str = GetString();
if (str.StartsWith("Hello")) { /* do something */ } // this is a CULTURE-AWARE (linguistic) comparison

ReadOnlySpan<char> span = str.AsSpan();
if (span.StartsWith("Hello")) { /* do something */ } // this is an ORDINAL (non-linguistic) comparison

The recommended way to address this is to pass an explicit StringComparison parameter to these APIs. The code analyzer rules CA1307 and CA1309 can assist with this.

string str = GetString();
if (str.StartsWith("Hello", StringComparison.Ordinal)) { /* do something */ } // ordinal comparison

ReadOnlySpan<char> span = str.AsSpan();
if (span.StartsWith("Hello", StringComparison.Ordinal)) { /* do something */ } // ordinal comparison

@danmoseley
Member

cc @PriyaPurkayastha @marklio

@GrabYourPitchforks
Member Author

To clarify - there's no critical need to publish these contents as-is in its own article. But this text contains basically the sum of everything that developers need to know to be successful with these APIs and to migrate their applications. If this information can somehow find its way into the relevant docs and these docs could all be linked together, that should help improve the experience. I've linked to a few of the existing docs + breaking change notices throughout the draft text.

@danmoseley
Member

@carlossanlop who is the right docs person to tag, who can perhaps help us get this into a doc relatively briskly, to help folks at GA time?

@PriyaPurkayastha

Adding @gewarren
Thanks for putting this together @GrabYourPitchforks
I think a lot of this is useful information that needs to be included in the breaking change doc https://docs.microsoft.com/en-us/dotnet/core/compatibility/3.1-5.0#globalization-apis-use-icu-libraries-on-windows. It probably makes the breaking change doc a bit lengthy but it clearly explains the change in behavior that customers will see due to the breaking change.

@gewarren
Contributor

I think this article should be separate to the breaking change article, and suggest that it lives in this section: https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/.

Also, regarding the code analyzers, why are we recommending to install the FxCop NuGet package when the analyzers are included with the .NET 5.0 SDK?

@aolszowka

@gewarren

> Also, regarding the code analyzers, why are we recommending to install the FxCop NuGet package when the analyzers are included with the .NET 5.0 SDK?

In a perfect world, I am going to take this information and immediately apply it to our existing code bases, which are currently net472, to proactively fix these issues before dealing with them during the .NET 5 lift. Given the scale of our code base (~20 million lines), anything that can get me there before it becomes an issue is far preferable to reactive efforts.

@GrabYourPitchforks
Member Author

> Also, regarding the code analyzers, why are we recommending to install the FxCop NuGet package when the analyzers are included with the .NET 5.0 SDK?

That's a good question. The gesture of installing the analyzers via NuGet was the only way I could figure out how to actually get them to light up over my existing code.

@safern
Member

safern commented Oct 28, 2020

I believe the package name is Microsoft.CodeAnalysis.NetAnalyzers and we also include in the SDK Microsoft.CodeAnalysis.CSharp.CodeStyle -- I would think these analyzers would be in the first one though, but not sure.

@akoeplinger
Member

akoeplinger commented Oct 29, 2020

I think the breaking change section at https://docs.microsoft.com/dotnet/core/compatibility/3.1-5.0#globalization-apis-use-icu-libraries-on-windows should at least contain the example from the "How might behavioral differences manifest themselves, and how can developers guard against these?" section of @GrabYourPitchforks's description.

It's pretty hard to make the connection from just reading the "Globalization APIs use ICU libraries on Windows" description to "this potentially changes string.IndexOf() behavior in my app" so having that code sample would be nice.

I agree that for details we should link to this GitHub issue or an eventual docs article.
