Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[browser][non-icu] HybridGlobalization indexing #84471

Closed
wants to merge 10 commits into from

Conversation

ilonatommy
Copy link
Member

Implements a chunk of web-api based globalization. Is a part of HybridGlobalization feature and contributes to #79989.

Old, icu-based private API: GlobalizationNative_IndexOf, GlobalizationNative_LastIndexOf

New, non-icu private API: IndexOf.

Affected public API (see: tests in CompareInfoTests.IndexOf and CompareInfoTests.LastIndexOf.cs):

  • CompareInfo.IndexOf
  • CompareInfo.LastIndexOf
  • String.IndexOf
  • String.LastIndexOf

All changes in behavior are listed in docs\design\features\hybrid-globalization.md

cc @SamMonoRT

@ilonatommy ilonatommy self-assigned this Apr 7, 2023
@ilonatommy ilonatommy requested a review from kg as a code owner April 7, 2023 09:54
@ghost
Copy link

ghost commented Apr 7, 2023

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

Issue Details

Implements a chunk of web-api based globalization. Is a part of HybridGlobalization feature and contributes to #79989.

Old, icu-based private API: GlobalizationNative_IndexOf, GlobalizationNative_LastIndexOf

New, non-icu private API: IndexOf.

Affected public API (see: tests in CompareInfoTests.IndexOf and CompareInfoTests.LastIndexOf.cs):

  • CompareInfo.IndexOf
  • CompareInfo.LastIndexOf
  • String.IndexOf
  • String.LastIndexOf

All changes in behavior are listed in docs\design\features\hybrid-globalization.md

cc @SamMonoRT

Author: ilonatommy
Assignees: ilonatommy
Labels:

arch-wasm, area-System.Globalization

Milestone: -

@kg
Copy link
Member

kg commented Apr 7, 2023

Something I should have noticed in previous reviews: In HybridGlobalization right now you pass strings as ptr+length pairs. This means that any time an interned string is passed across the boundary to JS, it has to be decoded to a new JS string that gets thrown away.

If you were to pass MonoString ** (i.e. ref str from C#) instead and use the existing string decoding API in the JS bindings, that API supports interning. This could be significant performance-wise for things like a needle-in-haystack search if the needle is interned, because you would be able to search for i.e. the literal "hello" from a C# source file (literals are interned) a thousand times while only decoding it once. It is a common pattern for either the needle or haystack to be interned. You can access this functionality safely via the conv_string_root API

export function conv_string_root(root: WasmRoot<MonoString>): string | null {
paired with an external root.

For cases where you are indexing into the char* buffer one element at a time and never making a temporary JS string, it's fine to not use this. It seems like it would probably be worth applying this to IndexOf/LastIndexOf and comparisons, but case conversion doesn't seem likely to need interning.

@ilonatommy
Copy link
Member Author

Something I should have noticed in previous reviews: In HybridGlobalization right now you pass strings as ptr+length pairs. This means that any time an interned string is passed across the boundary to JS, it has to be decoded to a new JS string that gets thrown away.

If you were to pass MonoString ** (i.e. ref str from C#) instead and use the existing string decoding API in the JS bindings, that API supports interning. This could be significant performance-wise for things like a needle-in-haystack search if the needle is interned, because you would be able to search for i.e. the literal "hello" from a C# source file (literals are interned) a thousand times while only decoding it once. It is a common pattern for either the needle or haystack to be interned. You can access this functionality safely via the conv_string_root API

export function conv_string_root(root: WasmRoot<MonoString>): string | null {

paired with an external root.
For cases where you are indexing into the char* buffer one element at a time and never making a temporary JS string, it's fine to not use this. It seems like it would probably be worth applying this to IndexOf/LastIndexOf and comparisons, but case conversion doesn't seem likely to need interning.

Thank you, I did not know about interning. The original motivation for using pointers was optimizing memory usage - to avoid creating new "return" strings on JS side when we can update the string allocated by the managed code. I would like to ask @pavelsavara for opinion here because he hinted me to the current approach. Changing the way HybridGlobalization handles strings is not a lot of work and we can do it at any point.

@pavelsavara
Copy link
Member

I don't remember the details of that conversation anymore. Looking at the PR now, I wonder why we are using icall instead of the new generated interop. With the later, you would get the string interning without any effort and trouble with GC roots.

@ilonatommy
Copy link
Member Author

ilonatommy commented Apr 12, 2023

I don't remember the details of that conversation anymore. Looking at the PR now, I wonder why we are using icall instead of the new generated interop. With the later, you would get the string interning without any effort and trouble with GC roots.

That is true but using new Interop inside of System.Private.CoreLib requires referencing System.Runtime.InteropServices.JavaScript and we cannot reference any assembly from CoreLib. That was why new Interop was not used.

@pavelsavara
Copy link
Member

requires referencing System.Runtime.InteropServices.JavaScript ...

I see. In that case Kate's suggesting is better.

/// <returns></returns>
// ToDo: clean up with IndexOfOrdinalHelper from .Icu
private unsafe int IndexOfOrdinalIgnoreCaseHelperJS(out string exceptionMessage, ReadOnlySpan<char> source, ReadOnlySpan<char> target, CompareOptions options, int* matchLengthPtr, bool fromBeginning)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it better to move this function https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Globalization/CompareInfo.Icu.cs#L95 to CompareInfo.cs and call from ICU and WebAssembly? From first look implementations look identical.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It differs with one goto: InteropCall. I tried to merge these two implementations: icu + wasm by passing the delegate to the InteropCall but unfortunately delegates do not support unsafe parameters (pointers), so the cleanup failed. Generally all CompareInfo should get refactored because it's a huge class and I was thinking about doing it in follow-up, when collations on HybridGlobalization would be already working.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I have some concerns about refactoring these functions. Originally, ICU has 2 very similar functions that differ only with lines

                        if (char.IsAsciiLetterLower(valueChar))
                            valueChar = (char)(valueChar - 0x20);
                        if (char.IsAsciiLetterLower(targetChar))
                            targetChar = (char)(targetChar - 0x20);

but they have this comment at the top:

        /// Duplicate of IndexOfOrdinalHelper that also handles ignore case. Can't converge both methods
        /// as the JIT wouldn't be able to optimize the ignoreCase path away.

An attempt to merge Wasm + Icu functions is similar to the situation described. @ViktorHofer, you are the author of the comment, could you help us? Can you think about an effective way of refactoring e.g. IndexOfOrdinalIgnoreCaseHelper and IndexOfOrdinalIgnoreCaseHelperJS into one function, when they differ only with the contents of goto InteropCall instruction?

}
}

private unsafe int IndexOfOrdinalHelperJS(out string exceptionMessage, ReadOnlySpan<char> source, ReadOnlySpan<char> target, CompareOptions options, int* matchLengthPtr, bool fromBeginning)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for this. Looks like most of the logic can be reused.

@ilonatommy
Copy link
Member Author

I see. In that case Kate's suggesting is better.

Planning to first fix it for main (#84695), then here. Converting to draft.

@ilonatommy ilonatommy marked this pull request as draft April 12, 2023 15:26
@kg
Copy link
Member

kg commented Apr 22, 2023

Minor note (not important): The locale.split("-")[0] === "ja" in compare_strings probably creates gc pressure and could be hurting performance. I wouldn't worry about it much unless split and/or GCs are eating time in profiles though.

You could probably replace it with locale.startsWith("ja-") I think?

@ilonatommy
Copy link
Member Author

You could probably replace it with locale.startsWith("ja-") I think?

You are right, startsWith is correct in this context. We are using ~0.1% on Split currently (in comparison, not indexing, this implementation of indexing is very slow and needs to be re-designed). I don't see any impact on the length of compare_string execution when I change the function to startsWith, though.

export function segment_string_locale_sensitive(string: string, locale: string | undefined) : Intl.SegmentData[]
{
const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
return Array.from(segmenter.segment(string));
Copy link
Member

@pavelsavara pavelsavara Apr 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return the enumerator instead. It would be allocating less objects on average, when we find earlier match.

Copy link
Member Author

@ilonatommy ilonatommy Apr 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here: https://github.com/ilonatommy/runtime/tree/hg-indexing-faster. This PR actually should be already closed, not drafted. Sorry if it confused you.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indexer is stll slow, thoug. For 64kB it's ~2100ms, which is unacceptable. That's why the branch is still WIP and not posted as a PR.

@ilonatommy ilonatommy closed this Apr 24, 2023
@ghost ghost locked as resolved and limited conversation to collaborators May 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants