Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Breaking change with string.IndexOf(string) from .NET Core 3.0 -> .NET 5.0 #43736

Closed
jbogard opened this issue Oct 22, 2020 · 78 comments · Fixed by #57078
Closed

Breaking change with string.IndexOf(string) from .NET Core 3.0 -> .NET 5.0 #43736

jbogard opened this issue Oct 22, 2020 · 78 comments · Fixed by #57078
Labels
area-System.Globalization question Answer questions and provide assistance, not an issue with source code or documentation.
Milestone

Comments

@jbogard
Copy link
Contributor

jbogard commented Oct 22, 2020

Description

I'm extending a package to support .NET 5.0 and ran into a breaking change. Given the console application:

using System;

namespace ConsoleApp
{
    class Program
    {
        static void Main(string[] args)
        {
            var actual = "Detail of supported commands\n============\n## Documentation produced for DelegateDecompiler, version 0.28.0 on Thursday, 22 October 2020 16:03\n\r\nThis file documents what linq commands **DelegateDecompiler** supports when\r\nworking with [Entity Framework Core](https://docs.microsoft.com/en-us/ef/core/) (EF).\r\nEF has one of the best implementations for converting Linq `IQueryable<>` commands into database\r\naccess commands, in EF's case T-SQL. Therefore it is a good candidate for using in our tests.\r\n\r\nThis documentation was produced by compaired direct EF Linq queries against the same query implemented\r\nas a DelegateDecompiler's `Computed` properties. This produces a Supported/Not Supported flag\r\non each command type tested. Tests are groups and ordered to try and make finding things\r\neasier.\r\n\r\nSo, if you want to use DelegateDecompiler and are not sure whether the linq command\r\nyou want to use will work then clone this project and write your own tests.\r\n(See [How to add a test](HowToAddMoreTests.md) documentation on how to do this). \r\nIf there is a problem then please fork the repository and add your own tests. \r\nThat will make it much easier to diagnose your issue.\r\n\r\n*Note: The test suite has only recently been set up and has only a handful of tests at the moment.\r\nMore will appear as we move forward.*\r\n\r\n\r\n### Group: Unit Test Group\n#### [My Unit Test1](../TestGroup01UnitTestGroup/Test01MyUnitTest1):\n- Supported\n  * Good1 (line 1)\n  * Good2 (line 2)\n\r\n#### [My Unit Test2](../TestGroup01UnitTestGroup/Test01MyUnitTest2):\n- Supported\n  * Good1 (line 1)\n  * Good2 (line 2)\n\r\n\r\n\nThe End\n";

            var expected = "\n#### [My Unit Test2](";

            Console.WriteLine($"actual.Contains(expected): {actual.Contains(expected)}");
            Console.WriteLine($"actual.IndexOf(expected): {actual.IndexOf(expected)}");
        }
    }
}

I get different results based on the runtime from .NET Core 3.0 -> .NET 5.0:

.NET Core 3.0:

actual.Contains(expected): True
actual.IndexOf(expected): 1475

.NET 5.0:

actual.Contains(expected): True
actual.IndexOf(expected): -1

Configuration

Windows 10 Pro Build 19041 x64
.NET Core 3.1.9
.NET 5.0.0-rc.2.20475.5

Regression?

Yes, it worked through .NET Core 3.1.9

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-Meta untriaged New issue has not been triaged by the area owner labels Oct 22, 2020
@danmoseley
Copy link
Member

@tarekgh

@danmoseley danmoseley added this to the 5.0.0 milestone Oct 22, 2020
@tarekgh tarekgh added area-System.Globalization question Answer questions and provide assistance, not an issue with source code or documentation. and removed untriaged New issue has not been triaged by the area owner area-Meta labels Oct 22, 2020
@ghost
Copy link

ghost commented Oct 22, 2020

Tagging subscribers to this area: @tarekgh, @safern, @krwq
See info in area-owners.md if you want to be subscribed.

@tarekgh
Copy link
Member

tarekgh commented Oct 22, 2020

This is by design as in .NET 5.0 we have switched using ICU instead of NLS. You can look at https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu for more details.

You have the option to use the config switch System.Globalization.UseNls to go back to old behavior but we don't recommend doing that as ICU is more correct and moving forward using ICU will give consistency across OS's.

@tarekgh tarekgh closed this as completed Oct 22, 2020
@tarekgh
Copy link
Member

tarekgh commented Oct 22, 2020

forgot to say, if you want IndexOf behave as Contains, you should use Ordinal comparisons at that time.

actual.IndexOf(expected, StringComparison.Ordinal)

@safern
Copy link
Member

safern commented Oct 22, 2020

You have the option to use the config switch System.Globalization.UseNls to go back to old behavior but we don't recommend doing that as ICU is more correct and moving forward using ICU will give consistency across OS's.

Yeah, if you run this code in Unix targeting netcoreapp3.1 you will see this same behavior:

 santifdezm  DESKTOP-1J7TFMI  ~  experimental  indexof  $   /home/santifdezm/experimental/indexof/bin/Debug/netcoreapp3.1/linux-x64/publish/indexof
actual.Contains(expected): True
actual.IndexOf(expected): -1

and as @tarekgh with Ordinal comparison it returns the expected result.

 santifdezm  DESKTOP-1J7TFMI  ~  experimental  indexof  $  /home/santifdezm/experimental/indexof/bin/Debug/net5.0/linux-x64/publish/indexof
actual.Contains(expected): True
actual.IndexOf(expected): 1475

@ericstj
Copy link
Member

ericstj commented Oct 22, 2020

I think this is failing because the mix of \r\n and \n in the source string. If I replace all instances of \r\n with \n it works. Same is true if I make everything \r\n. It's just the mix that is resulting in different results from ICU's comparison.

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Oct 22, 2020

Issue as reported on Twitter was that within the 5.0 runtime, repeated calls to string.IndexOf with the same inputs was giving different results on each call.

https://twitter.com/jbogard/status/1319381273585061890?s=21

Edit: Above was a misunderstanding.

@safern
Copy link
Member

safern commented Oct 22, 2020

@GrabYourPitchforks can we update the issue title then? As this is technical a breaking change, but this happens on Windows and Unix... right?

@GrabYourPitchforks
Copy link
Member

I pinged Jimmy offline for clarification. It's possible I misunderstood his original issue report. 280-char forums aren't always efficient at communicating bugs clearly. ;)

@tarekgh
Copy link
Member

tarekgh commented Oct 22, 2020

Just to clarify, Contains API is performing ordinal operation. IndexOf without any string comparison flags is linguistic operation and not ordinal. If want to compare Contains behavior with IndexOf, need to use IndexOf(expected, StringComparison.Ordinal).
If need to learn more about the difference, https://docs.microsoft.com/en-us/dotnet/csharp/how-to/compare-strings is useful link.

@GrabYourPitchforks
Copy link
Member

I received clarification on Twitter. The app isn't calling IndexOf in a loop. This is just a standard 3.0 vs. 5.0 behavioral difference report.

@tarekgh
Copy link
Member

tarekgh commented Oct 22, 2020

@GrabYourPitchforks could you share the link https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu on your twitter replies and mention we have a config switch to go back to old behavior?

@safern
Copy link
Member

safern commented Oct 22, 2020

I received clarification on Twitter. The app isn't calling IndexOf in a loop. This is just a standard 3.0 vs. 5.0 behavioral difference report.

Thanks, @GrabYourPitchforks... based on this, closing it as by design.

@safern safern closed this as completed Oct 22, 2020
@tarekgh
Copy link
Member

tarekgh commented Oct 22, 2020

To add more here, if you want to get the old behavior without switching back to NLS, you can do

	CultureInfo.CurrentCulture.CompareInfo.IndexOf(actual, expected, CompareOptions.IgnoreSymbols)

or

    actual.IndexOf(expected, StringComparison.Ordinal)

instead of

	actual.IndexOf(expected)

and you should get the desired behavior.

@danmoseley danmoseley changed the title Breaking change with string.IndexOf from .NET Core 3.0 -> .NET 5.0 Breaking change with string.IndexOf(string) from .NET Core 3.0 -> .NET 5.0 Oct 23, 2020
@ForNeVeR
Copy link
Contributor

I can't see anything about \r\n vs \n in the linked ICU-related documentation (https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu).

Was it really a planned change?

@tarekgh
Copy link
Member

tarekgh commented Oct 23, 2020

@ForNeVeR it will be hard to list every single difference between ICU and NLS. the document is talking about the main change of switching to ICU. As I pointed before earlier, it is not right to compare the results of Contains with IndexOf without the StringComparison parameters. I have listed above a couple of ways you can get the previous behavior if desired to. From the report of this issue, it looks to me the usage of IndexOf should use Ordinal option and it is incorrect to use the linguistic comparisons in such case. Using linguistic comparison in such case can depend on the current culture which possible to give different results on different environments.

Was it really a planned change?

Yes switching to ICU is intentional change for different reasons. Windows currently promoting using ICU over NLS. ICU is the future anyway. Also, ICU will give the opportunity to have consistent behaviors across Windows/Linux/OSX or any supported platforms. Using ICU will give the opportunity to the apps to customize the globalization behavior if they desired to.

As the document indicated, you still have the option to switch back to old behavior if you want to.

@hypersw
Copy link

hypersw commented Oct 23, 2020

Ouch, the referenced doc says that ICU/NLS behavior on Windows might silently switch based on the icu.dll availability in the actual environment. This might come as a big surprise for published self-contained apps. I'd expect .NET to ship ICU to work around this issue if the switch were decided on, and as ICU is not available on all target environments. This optional runtime dependency makes things even funnier.

@tarekgh
Copy link
Member

tarekgh commented Oct 23, 2020

I'd expect .NET to ship ICU to work around this issue if the switch were decided on, and as ICU is not available on all target environments. This optional runtime dependency makes things even funnier.

The ICU now is published as NuGet package. The apps can use such packages to have the self contained app ensure having ICU. look at the app-local section in the doc. In short, the app has full control on the behavior want to get.

@ForNeVeR
Copy link
Contributor

ForNeVeR commented Oct 24, 2020

@tarekgh, I agree that the different results between Contains and IndexOf aren't the problem per se.

The problem is clearly IndexOf which is unable to find one ASCII-only string inside of another ASCII-only string (I'm not sure there ever was any locale-dependent behavior imposed on ASCII-only strings!).

This isn't something I would expect from any locale/NLS/ICU-related changes; in fact, I couldn't think of any other programming language/runtime behaving like that.

Here's a simplified test case, broken (I mean, giving me totally unexpected result) on .NET 5 RC 2:

var actual = "\n\r\nTest";
var expected = "\nTest";

Console.WriteLine($"actual.IndexOf(expected): {actual.IndexOf(expected)}"); // => -1

Should it really work like that? Also, why? What does it trying to do actually?

Was it really a planned change?

Yes switching to ICU is intentional change for different reasons.

I'm sorry, but I don't believe this was a planned change, so I'd like to emphasize: I couldn't imagine anyone planning such a change. Like, folks from .NET team sat together and discussed:

Does string "\n\r\nTest" contain "\nTest" with ICU enabled? No, clearly doesn't!

And nobody complained? Not a chance!

This doesn't look like a planned or expected change, and instead looks like a very serious bug, a big compatibility blocker. Because of that, new and ported .NET applications won't properly work on the new runtime, because they won't be able to find substrings inside of string!

Why does ICU care about the line endings anyway? Do some locales have their own locale-specific line endings?

P.S. Yes, you could argue that one should really always call some variant of culture-independent IndexOf, like with the ordinal flag. But, if you've decided to break it that hard in .NET 5, couldn't you just make it to use the sane, ordinal default then? I think it would break less applications than the current change we're seeing in .NET 5 RC 2.

Also, I think we all understand that, despite IndexOf always behaving in a culture-specific manner, there're tons of code in the wild that use IndexOf without the ordinal flags, and that code used to work (in some/most cases, at least). And it will stop working after .NET 5 update.

@naine
Copy link

naine commented Nov 5, 2020

@petarrepac Don't get me wrong, I understand that. But as has been pointed out multiple times in this thread:

  1. There is a plan, and it is to provide a runtime configuration switch.
  2. The behaviour claimed to be breaking is the existing behaviour of .NET on all non-windows platforms.
  3. This should only effect code that performs culture-sensitive operations where ordinal was intended.

Given the last two points it is probably reasonable to assume this affects a fairly small percentage of projects.

100% it is fair to ask about this, but the people who write comments like the one I quoted are often just assuming that no consideration was put in and written before trying to understand the bigger picture behind the change.

@safern
Copy link
Member

safern commented Nov 9, 2020

Hello, all. We wanted to give a brief summary of the actions that we took when this issue was open and at the end why we decided to keep the default on Windows 10 May 2019 Update or later to be ICU for .NET 5.0.

When the issue was opened, we started some internal discussions about the potential impact and pain that this could've had with our customers given the inconsistency in between Contains(string) which is Ordinal and IndexOf(string) being culture-aware, plus having other APIs being culture-aware by default when operating over a string, but being Ordinal when operating over Span<char> or ReadOnlySpan<char>. So after discussions over this issue, we started doing analysis on NuGet packages that might be affected and gathered data to get a clear picture. This were our findings:

  • "\n" is #​30 on the "most-used string literals" passed to the IndexOf, LastIndexOf, EndsWith, StartsWith, Compare, and CompareTo.
  • 1% of callsites to String.EndsWith are with a search string that starts with \n. Any of these callsites where the string being tested contains "windows-style line endings" would be broken.
  • There are 2040 package ids hosted on NuGet.org where a version contains an assembly with an at-risk callsite.
  • The vast majority of these callsites are to EndsWith and IndexOf, which is consistent with the hypothesis that this pattern is used often for naive line-breaking algorithms.

From these 2040 packages that had an at-risk callsite, only 539 support some version of .NET Standard or .NET Core, so that means that only 0.54% packages listed in NuGet.org are likely to being exposed to the break.

We looked at packages in the list of 539 potentially-affected package ids to get an idea of the actual impact on those. We looked at the top 70 (by download counts), 20 didn't expose the pattern in the latest version, from the ones that exposed the pattern we could only look at 32 that had a permisive license:

  • 14 were not subject to the bug
  • 13 were potentially broken
  • 5 were uncertain (there was no clear indication of breaks due to defensive coding for diverse line break patterns or other mitigations).

So this means, that from the top 70 packages by download only 18% were potentially impacted.

These percentages are stacked and not against the total number of packages on NuGet.org which is 229,536. So if we used the total number of packages and total number of downloads in NuGet.org, we would be looking at 539 potentially affected packages out of 229,536 which is 0.24%.

And while it's great for us to analyze libraries, nuget represents only a small fraction of the C# code out there. And even if someone owns the code, a) bugs may not be easy to track down, and b) they may not actually have the source anymore.

However this was a good source of data to conclude that, while this could be a very notable change in behavior, it is a break that already happened on Unix when reading inputs that might contain Windows Line Endings (which might not be that common).

In .NET Core and .NET 5+, we're striving towards consistency across OSs and given the impact of this change, it felt like the right thing to do. We do care about compatibility and hence are providing a compat runtime switch for people to be able to go back to legacy behavior.

Also, a conclusion from the packages that we could inspect, given that the behavior is different on Unix already, we did see a lot of defensive programming against this issue, to mitigate potential breaks across OSs.

To add to this, globalization can change any time as we're a thin wrapper across the OS, so it felt like the right thing at the moment to just be the same wrapper on all OSs that we support.

As part of this we improved our documentation with practical examples, roslyn analyzer rules and how affected code can mitigate this.

Thank you for all your valuable feedback since it always takes us to a better place and we will try to keep improving this experience for .NET 6, as discussed on: #43956

Since we understand the pain that this may cause because of the differences in between line endings across Unix and Windows, we're keeping this issue open and we will investigate a possible way to mitigate the \r\n case for .NET 5.0.x which should be part of a servicing release.

@mattleibow
Copy link
Member

There is also a difference with char and string:

Console.WriteLine("com/test/test/test/a/b/ʹ$ʹ".IndexOf("$"));
Console.WriteLine("com/test/test/test/a/b/ʹ$ʹ".IndexOf('$'));
-1
24

@tarekgh
Copy link
Member

tarekgh commented Nov 20, 2020

@mattleibow when using character search, it perform ordinal search. The doc https://docs.microsoft.com/en-us/dotnet/api/system.string.indexof?view=net-5.0#System_String_IndexOf_System_Char_ which is stating This method performs an ordinal (culture-insensitive) search. If you do the string search with the ordinal option, you'll get the same result as the character.

Console.WriteLine("com/test/test/test/a/b/ʹ$ʹ".IndexOf("$", StringComparison.Ordinal));

@adegeo
Copy link

adegeo commented Dec 7, 2020

It seems that CA1307 only triggers on indexof(char) but not indexof(string). Is there a rule for the string version? This seems more important as the default of char automatically uses ordinal and the breaking change doesn't really affect this, while the behavior of string has changed and you need to specify ordinal to restore behavior in some cases.

How do you detect indexof(string)?

Found it, it's rule CA1310. Our docs are wrong for https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu#use-nls-instead-of-icu and don't mention this specific variation. I'll update those docs.

@tarekgh
Copy link
Member

tarekgh commented Jan 5, 2021

@xanatos I believe @mattleibow report was regarding when using the NLS and not ICU. This should repro consistently on .NET Core when running on Windows.
to be able to see it on 5.0, try add the following to your csproj to enable NLS mode.

    <RuntimeHostConfigurationOption Include="System.Globalization.UseNls" Value="true" />

@tarekgh
Copy link
Member

tarekgh commented Aug 11, 2021

#10689

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Aug 17, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Sep 16, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Globalization question Answer questions and provide assistance, not an issue with source code or documentation.
Projects
None yet
Development

Successfully merging a pull request may close this issue.