Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[browser][non-icu] HybridGlobalization indexing #85254

Merged
merged 21 commits into from
May 18, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
47 changes: 47 additions & 0 deletions docs/design/features/hybrid-globalization.md
Original file line number Diff line number Diff line change
Expand Up @@ -198,3 +198,50 @@ Web API does not expose locale-sensitive endsWith/startsWith function. As a work

- `IgnoreSymbols`
Only comparisons that do not skip character types are allowed. E.g. `IgnoreSymbols` skips symbol-chars in comparison/indexing. All `CompareOptions` combinations that include `IgnoreSymbols` throw `PlatformNotSupportedException`.


**String indexing**

Affected public APIs:
- CompareInfo.IndexOf
- CompareInfo.LastIndexOf
- String.IndexOf
- String.LastIndexOf

Web API does not expose locale-sensitive indexing function. There is a discussion on adding it: https://github.com/tc39/ecma402/issues/506. In the current state, as a workaround, locale-sensitive string segmenter combined with locale-sensitive comparison is used. This approach, beyond having the same compare option limitations as described under **String comparison**, has additional limitations connected with the workaround used. Information about additional limitations:

- Support depends on [`Intl.segmenter's support`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter#browser_compatibility).

- `IgnoreSymbols`

Only comparisons that ignore types of characters but do not skip them are allowed. E.g. `IgnoreCase` ignores type (case) of characters but `IgnoreSymbols` skips symbol-chars in comparison/indexing. All `CompareOptions` combinations that include `IgnoreSymbols` throw `PlatformNotSupportedException`.

- Some letters consist of more than one grapheme.

Using locale-sensitive segmenter `Intl.Segmenter(locale, { granularity: "grapheme" })` does not guarantee that string will be segmented by letters but by graphemes. E.g. in `cs-CZ` and `sk-SK` "ch" is 1 letter, 2 graphemes. The following code with `HybridGlobalization` switched off returns -1 (not found) while with `HybridGlobalization` switched on, it returns 1.

``` C#
new CultureInfo("sk-SK").CompareInfo.IndexOf("ch", "h"); // -1 or 1
```

- Some graphemes consist of more than one character.
E.g. `\r\n` that represents two characters in C#, is treated as one grapheme by the segmenter:

``` JS
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
Array.from(segmenter.segment("\r\n")) // {segment: '\r\n', index: 0, input: '\r\n'}
```

Because we are comparing grapheme-by-grapheme, character `\r` or character `\n` will not be found in `\r\n` string when `HybridGlobalization` is switched on.

- Some graphemes have multi-grapheme equivalents.
E.g. in `de-DE` ß (%u00DF) is one letter and one grapheme and "ss" is one letter and is recognized as two graphemes. Web API's equivalent of `IgnoreNonSpace` treats them as the same letter when comparing. Similar case: dz (%u01F3) and dz.
``` JS
"ß".localeCompare("ss", "de-DE", { sensitivity: "case" }); // 0
```

Using `IgnoreNonSpace` for these two with `HybridGlobalization` off, also returns 0 (they are equal). However, the workaround used in `HybridGlobalization` will compare them grapheme-by-grapheme and will return -1.

``` C#
new CultureInfo("de-DE").CompareInfo.IndexOf("strasse", "stra\u00DFe", 0, CompareOptions.IgnoreNonSpace); // 0 or -1
```
1 change: 1 addition & 0 deletions docs/workflow/trimming/feature-switches.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@ configurations but their defaults might vary as any SDK can set the defaults dif
| EventSourceSupport | System.Diagnostics.Tracing.EventSource.IsSupported | Any EventSource related code or logic is trimmed when set to false |
| InvariantGlobalization | System.Globalization.Invariant | All globalization specific code and data is trimmed when set to true |
| PredefinedCulturesOnly | System.Globalization.PredefinedCulturesOnly | Don't allow creating a culture for which the platform does not have data |
| HybridGlobalization | System.Globalization.Hybrid | Properties connected with the mixed: platform-specific + icu-based globalization will be trimmed |
| UseSystemResourceKeys | System.Resources.UseSystemResourceKeys | Any localizable resources for system assemblies is trimmed when set to true |
| HttpActivityPropagationSupport | System.Net.Http.EnableActivityPropagation | Any dependency related to diagnostics support for System.Net.Http is trimmed when set to false |
| UseNativeHttpHandler | System.Net.Http.UseNativeHttpHandler | HttpClient uses by default platform native implementation of HttpMessageHandler if set to true. |
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -15,5 +15,8 @@ internal static unsafe partial class JsGlobalization

[MethodImplAttribute(MethodImplOptions.InternalCall)]
internal static extern unsafe bool EndsWith(out string exceptionMessage, in string culture, char* str1, int str1Len, char* str2, int str2Len, global::System.Globalization.CompareOptions options);

[MethodImplAttribute(MethodImplOptions.InternalCall)]
internal static extern unsafe int IndexOf(out string exceptionMessage, in string culture, char* str1, int str1Len, char* str2, int str2Len, global::System.Globalization.CompareOptions options, bool fromBeginning);
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -33,13 +33,16 @@ public static IEnumerable<object[]> IndexOf_TestData()
yield return new object[] { s_invariantCompare, "foobardzsdzs", "rddzs", 0, 12, CompareOptions.Ordinal, -1, 0 };

// Slovak
yield return new object[] { s_slovakCompare, "ch", "h", 0, 2, CompareOptions.None, -1, 0 };
// Android has its own ICU, which doesn't work well with slovak
if (!PlatformDetection.IsAndroid && !PlatformDetection.IsLinuxBionic)
if (!PlatformDetection.IsHybridGlobalizationOnBrowser)
{
yield return new object[] { s_slovakCompare, "chodit hore", "HO", 0, 11, CompareOptions.IgnoreCase, 7, 2 };
yield return new object[] { s_slovakCompare, "ch", "h", 0, 2, CompareOptions.None, -1, 0 };
// Android has its own ICU, which doesn't work well with slovak
if (!PlatformDetection.IsAndroid && !PlatformDetection.IsLinuxBionic)
{
yield return new object[] { s_slovakCompare, "chodit hore", "HO", 0, 11, CompareOptions.IgnoreCase, 7, 2 };
}
yield return new object[] { s_slovakCompare, "chh", "h", 0, 3, CompareOptions.None, 2, 1 };
}
yield return new object[] { s_slovakCompare, "chh", "h", 0, 3, CompareOptions.None, 2, 1 };

// Turkish
// Android has its own ICU, which doesn't work well with tr
Expand All @@ -63,16 +66,24 @@ public static IEnumerable<object[]> IndexOf_TestData()
yield return new object[] { s_invariantCompare, "Exhibit \u00C0", "a\u0300", 0, 9, CompareOptions.IgnoreCase, 8, 1 };
yield return new object[] { s_invariantCompare, "Exhibit \u00C0", "a\u0300", 0, 9, CompareOptions.OrdinalIgnoreCase, -1, 0 };
yield return new object[] { s_invariantCompare, "FooBar", "Foo\u0400Bar", 0, 6, CompareOptions.Ordinal, -1, 0 };
yield return new object[] { s_invariantCompare, "TestFooBA\u0300R", "FooB\u00C0R", 0, 11, CompareOptions.IgnoreNonSpace, 4, 7 };
yield return new object[] { s_invariantCompare, "TestFooBA\u0300R", "FooB\u00C0R", 0, 11, supportedIgnoreNonSpaceOption, 4, 7 };
yield return new object[] { s_invariantCompare, "o\u0308", "o", 0, 2, CompareOptions.None, -1, 0 };
yield return new object[] { s_invariantCompare, "\r\n", "\n", 0, 2, CompareOptions.None, 1, 1 };
if (PlatformDetection.IsHybridGlobalizationOnBrowser)
{
yield return new object[] { s_invariantCompare, "\r\n", "\n", 0, 2, CompareOptions.None, -1, 0 };
}
else
{
yield return new object[] { s_invariantCompare, "\r\n", "\n", 0, 2, CompareOptions.None, 1, 1 };
}

// Weightless characters
yield return new object[] { s_invariantCompare, "", "\u200d", 0, 0, CompareOptions.None, 0, 0 };
yield return new object[] { s_invariantCompare, "hello", "\u200d", 1, 3, CompareOptions.IgnoreCase, 1, 0 };

// Ignore symbols
yield return new object[] { s_invariantCompare, "More Test's", "Tests", 0, 11, CompareOptions.IgnoreSymbols, 5, 6 };
if (!PlatformDetection.IsHybridGlobalizationOnBrowser)
yield return new object[] { s_invariantCompare, "More Test's", "Tests", 0, 11, CompareOptions.IgnoreSymbols, 5, 6 };
yield return new object[] { s_invariantCompare, "More Test's", "Tests", 0, 11, CompareOptions.None, -1, 0 };
yield return new object[] { s_invariantCompare, "cbabababdbaba", "ab", 0, 13, CompareOptions.None, 2, 2 };

Expand Down Expand Up @@ -127,19 +138,22 @@ public static IEnumerable<object[]> IndexOf_TestData()
}

// Inputs where matched length does not equal value string length
yield return new object[] { s_invariantCompare, "abcdzxyz", "\u01F3", 0, 8, CompareOptions.IgnoreNonSpace, 3, 2 };
yield return new object[] { s_invariantCompare, "abc\u01F3xyz", "dz", 0, 7, CompareOptions.IgnoreNonSpace, 3, 1 };
yield return new object[] { s_germanCompare, "abc Strasse Strasse xyz", "stra\u00DFe", 0, 23, CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace, 4, 7 };
yield return new object[] { s_germanCompare, "abc Strasse Strasse xyz", "xtra\u00DFe", 0, 23, CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace, -1, 0 };
yield return new object[] { s_germanCompare, "abc stra\u00DFe stra\u00DFe xyz", "Strasse", 0, 21, CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace, 4, 6 };
yield return new object[] { s_germanCompare, "abc stra\u00DFe stra\u00DFe xyz", "Xtrasse", 0, 21, CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace, -1, 0 };
if (!PlatformDetection.IsHybridGlobalizationOnBrowser)
{
yield return new object[] { s_germanCompare, "abc Strasse Strasse xyz", "stra\u00DFe", 0, 23, supportedIgnoreCaseIgnoreNonSpaceOptions, 4, 7 };
yield return new object[] { s_germanCompare, "abc stra\u00DFe stra\u00DFe xyz", "Strasse", 0, 21, supportedIgnoreCaseIgnoreNonSpaceOptions, 4, 6 };
yield return new object[] { s_invariantCompare, "abcdzxyz", "\u01F3", 0, 8, supportedIgnoreNonSpaceOption, 3, 2 };
yield return new object[] { s_invariantCompare, "abc\u01F3xyz", "dz", 0, 7, supportedIgnoreNonSpaceOption, 3, 1 };
}
yield return new object[] { s_germanCompare, "abc Strasse Strasse xyz", "xtra\u00DFe", 0, 23, supportedIgnoreCaseIgnoreNonSpaceOptions, -1, 0 };
yield return new object[] { s_germanCompare, "abc stra\u00DFe stra\u00DFe xyz", "Xtrasse", 0, 21, supportedIgnoreCaseIgnoreNonSpaceOptions, -1, 0 };
}

public static IEnumerable<object[]> IndexOf_Aesc_Ligature_TestData()
{
bool useNls = PlatformDetection.IsNlsGlobalization;
// Searches for the ligature \u00C6
string source1 = "Is AE or ae the same as \u00C6 or \u00E6?";
string source1 = "Is AE or ae the same as \u00C6 or \u00E6?"; // 3 failures here
yield return new object[] { s_invariantCompare, source1, "AE", 8, 18, CompareOptions.None, useNls ? 24 : -1, useNls ? 1 : 0};
yield return new object[] { s_invariantCompare, source1, "ae", 8, 18, CompareOptions.None, 9 , 2};
yield return new object[] { s_invariantCompare, source1, "\u00C6", 8, 18, CompareOptions.None, 24, 1 };
Expand All @@ -157,7 +171,7 @@ public static IEnumerable<object[]> IndexOf_Aesc_Ligature_TestData()
public static IEnumerable<object[]> IndexOf_U_WithDiaeresis_TestData()
{
// Searches for the combining character sequence Latin capital letter U with diaeresis or Latin small letter u with diaeresis.
string source = "Is \u0055\u0308 or \u0075\u0308 the same as \u00DC or \u00FC?";
string source = "Is \u0055\u0308 or \u0075\u0308 the same as \u00DC or \u00FC?"; // 7 failures here
yield return new object[] { s_invariantCompare, source, "U\u0308", 8, 18, CompareOptions.None, 24, 1 };
yield return new object[] { s_invariantCompare, source, "u\u0308", 8, 18, CompareOptions.None, 9, 2 };
yield return new object[] { s_invariantCompare, source, "\u00DC", 8, 18, CompareOptions.None, 24, 1 };
Expand Down Expand Up @@ -229,8 +243,11 @@ static void RunSpanIndexOfTest(CompareInfo compareInfo, ReadOnlySpan<char> sourc
valueBoundedMemory.MakeReadonly();

Assert.Equal(expected, compareInfo.IndexOf(sourceBoundedMemory.Span, valueBoundedMemory.Span, options));
Assert.Equal(expected, compareInfo.IndexOf(sourceBoundedMemory.Span, valueBoundedMemory.Span, options, out int actualMatchLength));
Assert.Equal(expectedMatchLength, actualMatchLength);
if (!PlatformDetection.IsHybridGlobalizationOnBrowser)
{
Assert.Equal(expected, compareInfo.IndexOf(sourceBoundedMemory.Span, valueBoundedMemory.Span, options, out int actualMatchLength));
Assert.Equal(expectedMatchLength, actualMatchLength);
}

if (TryCreateRuneFrom(value, out Rune rune))
{
Expand Down Expand Up @@ -273,7 +290,7 @@ public void IndexOf_UnassignedUnicode()
bool useNls = PlatformDetection.IsNlsGlobalization;
int expectedMatchLength = (useNls) ? 6 : 0;
IndexOf_String(s_invariantCompare, "FooBar", "Foo\uFFFFBar", 0, 6, CompareOptions.None, useNls ? 0 : -1, expectedMatchLength);
IndexOf_String(s_invariantCompare, "~FooBar", "Foo\uFFFFBar", 0, 7, CompareOptions.IgnoreNonSpace, useNls ? 1 : -1, expectedMatchLength);
IndexOf_String(s_invariantCompare, "~FooBar", "Foo\uFFFFBar", 0, 7, supportedIgnoreNonSpaceOption, useNls ? 1 : -1, expectedMatchLength);
}

[Fact]
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -79,7 +79,7 @@ public static IEnumerable<object[]> IsPrefix_TestData()

// Platform differences
// in HybridGlobalization on Browser we use TextEncoder that is not supported for v8 and the manual decoding works like NLS
bool behavesLikeNls = PlatformDetection.IsNlsGlobalization ||
bool behavesLikeNls = PlatformDetection.IsNlsGlobalization ||
(PlatformDetection.IsHybridGlobalizationOnBrowser && !PlatformDetection.IsBrowserDomSupportedOrNodeJS);
if (behavesLikeNls)
{
Expand Down Expand Up @@ -114,9 +114,6 @@ public static IEnumerable<object[]> IsPrefix_TestData()
{
yield return new object[] { s_invariantCompare, "dzxyz", "\u01F3", supportedIgnoreNonSpaceOption, true, 2 };
yield return new object[] { s_invariantCompare, "\u01F3xyz", "dz", supportedIgnoreNonSpaceOption, true, 1 };
}
if (!PlatformDetection.IsHybridGlobalizationOnBrowser)
{
yield return new object[] { s_germanCompare, "Strasse xyz", "stra\u00DFe", supportedIgnoreCaseIgnoreNonSpaceOptions, true, 7 };
yield return new object[] { s_germanCompare, "stra\u00DFe xyz", "Strasse", supportedIgnoreCaseIgnoreNonSpaceOptions, true, 6 };
}
Expand Down
Loading