feat(core): devolve normalization to js 🙀 #11541

Merged
srl295 merged 11 commits into master from feat/core/9467-devolve-norm-to-js3-epic-ldml on Jun 13, 2024

Conversation

srl295
Member

@srl295 srl295 commented May 24, 2024

  • major redo of actions_normalize - into UTF-32 and not using ICU directly
  • add some utilities: u32len, u32dup, context_items_from_utf32

Wasm not working as I write this, but:

Future items for this issue:

  • TODO: regex for wasm - coming soon
  • TODO: remove ICU from wasm (later PR)

(for now, probably should redo context tests eventually)

For: #9467

@keymanapp-test-bot skip

…urn off ICU

- always set to 0 for now (keep ICU around)
- set KMN_IN_LDML_TESTS in tests to keep ICU there for test and comparison
- add core_icu.cpp and put some utils there.

#9467
- add a normalize_nfd() which takes a single codepoint
- temporarily keep ICU in actions_normalize.cpp and ldml_transforms.cpp
- expand wasm opts in unit tests
- major redo of actions_normalize - into UTF-32 and not using ICU directly
- add some utilities: u32len, u32dup, context_items_from_utf32

#9467
@srl295 srl295 self-assigned this May 24, 2024
@keymanapp-test-bot keymanapp-test-bot bot added the user-test-missing User tests have not yet been defined for the PR label May 24, 2024
@keymanapp-test-bot keymanapp-test-bot bot added this to the A18S3 milestone May 24, 2024
@github-actions github-actions bot added core/ Keyman Core feat labels May 24, 2024
Base automatically changed from feat/core/9467-devolve-norm-to-js2-epic-ldml to master May 27, 2024 15:26
- add core/tools build tree with custom targets
- add to core/build.sh to generate nfd_table.h
- test_unicode to validate Unicode version and compare NFD to actual ICU
- currently, linear search of the table.
@srl295 srl295 force-pushed the feat/core/9467-devolve-norm-to-js3-epic-ldml branch from c53bcac to 53a6638 Compare June 4, 2024 16:15
@github-actions github-actions bot added common/resources/ Build infrastructure common/web/ and removed common/web/ labels Jun 4, 2024
- use RLE encoding, thanks @mcdurdin
- much smaller table and faster lookup

Fixes: #9467
@srl295
Member Author

srl295 commented Jun 4, 2024

meson compile -C ${KEYMAN_ROOT}/core/build/mac-x86_64/$BUILDER_CONFIGURATION tools/norm_data && mv -v ${KEYMAN_ROOT}/core/build/mac-x86_64/$BUILDER_CONFIGURATION/tools/nfd_table.h ${KEYMAN_ROOT}/resources/standards-data/unicode-character-database/

Any ideas on how best to call into meson to update this file?

I looked at build-fixtures in kmc-ldml, but that's in developer, which doesn't have architectures.

@srl295
Member Author

srl295 commented Jun 5, 2024

I think this could be ready for review. Will split regex into a follow-on PR.

@srl295 srl295 marked this pull request as ready for review June 7, 2024 03:35
@srl295 srl295 requested a review from rc-swag as a code owner June 7, 2024 03:35
@srl295 srl295 requested a review from mcdurdin June 7, 2024 03:35
@srl295 srl295 requested a review from jahorton June 7, 2024 13:56
@mcdurdin mcdurdin modified the milestones: A18S3, A18S4 Jun 7, 2024
Contributor

@jahorton jahorton left a comment

I'm less familiar with this area of the codebase, so forgive me if I have a lot of concerns and/or questions.

  1. What role does the NFD boundary table serve?

If we're not in WASM mode, then ICU should be available. If we are in WASM mode, we'll be leveraging JS normalization. My naive first instinct says that neither mode should need the table as a result, but it's obvious you wouldn't have written the boundary-table code if that were true.

  2. I see that a number of changes are moving to std::u32string as the intermediary string type so that WASM won't need the icu::UnicodeString type in the future. That makes sense.

* Helper to convert icu::UnicodeString to a UTF-32 km_core_usv buffer,
* nul-terminated
*/
inline km_core_usv *unicode_string_to_usv(icu::UnicodeString& src) {
Contributor

If I'm tracking the changes correctly, this removal is fine due to no longer using icu::UnicodeString& as the standard intermediary string type.

Replacing it with std::u32string here and elsewhere... appears to avoid the need for all the ICU-related assertions, etc that previously existed here?

assert(!cached_context_string.isBogus());
assert(!app_context_string.isBogus());
if(output.isBogus() || cached_context_string.isBogus() || app_context_string.isBogus()) {
if (!context_items_to_unicode_string(app_context, app_context_string)) {
Contributor

The new format of the method - returning an error code/flag instead of the ICU-string object directly - allows us to reuse its internal assertions + error-throwing instead of needing the previously-WET style being cleaned up here.

Assuming I'm parsing things correctly. (Took a while to ferret this detail out.)

@@ -49,16 +48,20 @@ km_core_usv *unicode_string_to_usv(icu::UnicodeString& src) {
* @return false if failure
*/
bool normalize(const icu::Normalizer2 *n, std::u16string &str, UErrorCode &status) {
Contributor

Why are we moving things to std::u32string when std::u16string is what we're using on the primary normalize method?

Member Author

Actually, most processing is in UTF-32… The 16-bit stuff was only because that's ICU's native API…
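
To illustrate the point about working in UTF-32 rather than ICU's 16-bit units, here is a minimal sketch of widening UTF-16 to UTF-32. The function name `utf16_to_utf32` is hypothetical and not taken from this PR; once text is in UTF-32, each element is a whole code point, so later processing does not need to reason about surrogate pairs.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the PR): widen UTF-16 to UTF-32.
// Well-formed surrogate pairs are combined into one code point;
// unpaired surrogates are passed through unchanged.
std::u32string utf16_to_utf32(const std::u16string &in) {
  std::u32string out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size(); i++) {
    const char32_t unit = in[i];
    const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
    if (is_high && i + 1 < in.size()) {
      const char32_t low = in[i + 1];
      if (low >= 0xDC00 && low <= 0xDFFF) {
        // Combine the pair into a supplementary-plane code point.
        out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
        i++;  // the low surrogate has been consumed
        continue;
      }
    }
    out.push_back(unit);
  }
  return out;
}
```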

Comment on lines 230 to 236
/**
 * Helper to convert icu::UnicodeString to a UTF-32 km_core_usv buffer,
 * nul-terminated
 */
km_core_usv *string_to_usv(const std::u32string& src) {
  return km::core::kmx::u32dup(src.c_str());
}
Contributor

Uh... so std::u32string doesn't look the same as icu::UnicodeString to me. Is the comment saying that we're basically just using the type to store icu::UnicodeString values without the need to link in its type for WASM, with this method doing the actual conversion?

Without the comment, my naive sight-reading of the code suggests that this is simply cloning the string. It looks like km_core_usv is almost just an alias for std::u32string, but something tells me that's probably not entirely right.

Member Author

The comment is wrong here… If this function is retained, it's just a way to make the ICU and non-ICU versions a little bit more parallel

Contributor

I'll have to take your word for it - I don't have sufficient context to review details related to your comment.

Comment on lines +119 to +123
km_core_usv *km::core::kmx::u32dup(const km_core_usv *src) {
  km_core_usv *dup = new km_core_usv[u32len(src) + 1];
  memcpy(dup, src, (u32len(src) + 1) * sizeof(src[0]));
  return dup;
}
Contributor

So, the .c_str() returned by std::u32string is equivalent to km_core_usv? Meaning it's std-string wrapper stuff that makes the difference between the two types?

Contributor

Though... I think that this means we're allocating new memory here that lacks an inherent memory management scheme. I only see one new free call in this PR, and a new delete or two. One of the deletes is in cleanup after a failed assertion, which is good, but I'm not having an easy time tracing all the codepaths that will need memory management based on this.

Member

This is patterned after C stdlib strdup. Caller is responsible for memory management.

Member Author

Right, and in both of the callers, the ICU and non-ICU, the buffer is deleted when it's done.
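
For readers following the ownership question, the strdup-style contract might be sketched as below. This is a standalone approximation, not the PR's code: `usv` stands in for `km_core_usv` (assumed here to be a 32-bit character type), and `u32dup_owned` is a hypothetical wrapper added only to show one way a caller could make the `delete[]` automatic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>

using usv = char32_t;  // assumption: stand-in for km_core_usv

// Count code points up to, but not including, the terminating nul.
std::size_t u32len(const usv *s) {
  assert(s != nullptr);
  std::size_t n = 0;
  while (s[n] != 0) n++;
  return n;
}

// strdup-style copy: the caller owns the result and must delete[] it.
usv *u32dup(const usv *src) {
  const std::size_t count = u32len(src) + 1;  // include the nul
  usv *dup = new usv[count];
  std::memcpy(dup, src, count * sizeof(usv));
  return dup;
}

// Hypothetical wrapper (not in the PR): ownership is tied to a
// unique_ptr, so the buffer is released on every code path.
std::unique_ptr<usv[]> u32dup_owned(const usv *src) {
  return std::unique_ptr<usv[]>(u32dup(src));
}
```

With the raw form, each call site pairs `u32dup` with a `delete[]`, as described in the replies above. The wrapper trades that manual pairing for scope-based cleanup, which may or may not suit an API that hands buffers across a C boundary.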

@mcdurdin
Member

> 1. What role does the NFD boundary table serve?

NFD boundary data is not available in the web Intl libraries.

@srl295
Member Author

srl295 commented Jun 11, 2024

  1. The NFD boundary is part of the marker normalization algorithm. JavaScript normalization doesn't expose the NFD boundary so the table is needed.

  2. UnicodeString is a heavy class and not a simple type. Removing it is part of shedding the ICU dependency for wasm.

A C++ wasm wrapper of ICU4C may be of general interest. May write this up later.
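
As a rough picture of what a boundary table lookup could look like, the sketch below stores runs of code points that do not have an NFD boundary before them and answers the question with a binary search. Everything here is an assumption for illustration: the `Run` layout, the names, and the two sample runs (a handful of combining marks with non-zero combining class) are not the contents or format of the generated nfd_table.h.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>

using usv = char32_t;  // assumption: stand-in for km_core_usv

// Hypothetical run-length entry: `length` consecutive code points,
// starting at `start`, that have NO NFD boundary before them.
struct Run {
  usv start;
  std::uint32_t length;
};

// Illustrative and deliberately incomplete sample data, sorted by start.
static const Run kNoBoundaryRuns[] = {
    {0x0300, 5},  // U+0300..U+0304
    {0x0316, 4},  // U+0316..U+0319
};

// True when `ch` is not covered by any run in the table.
bool has_nfd_boundary_before(usv ch) {
  // Find the first run that starts after ch, then step back one.
  auto it = std::upper_bound(
      std::begin(kNoBoundaryRuns), std::end(kNoBoundaryRuns), ch,
      [](usv value, const Run &run) { return value < run.start; });
  if (it == std::begin(kNoBoundaryRuns)) {
    return true;  // ch precedes every run
  }
  --it;
  return (ch - it->start) >= it->length;
}
```

Storing runs instead of individual code points keeps the table small, and a sorted table allows a logarithmic search rather than a linear scan.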


// collect the raw list of chars that do NOT have a boundary before them.
std::vector<km_core_usv> noBoundary;
for (km_core_usv ch = 0; ch < 0x10FFFF; ch++) {
Contributor

Suggested change
- for (km_core_usv ch = 0; ch < 0x10FFFF; ch++) {
+ for (km_core_usv ch = 0; ch < km::core::kmx::Uni_MAX_CODEPOINT; ch++) {

Member

If we are starting at 0, why are we finishing at 0x10FFFE instead of 0x10FFFF? 😁
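
The off-by-one raised here can be seen in a small sketch: with a strict `<` the loop stops at U+10FFFE, while an inclusive `<=` visits U+10FFFF as well. The constant `kMaxCodepoint` and the helper `for_each_codepoint` are hypothetical names; the value 0x10FFFF for the maximum code point is assumed to match `Uni_MAX_CODEPOINT`, which is not shown in this conversation.

```cpp
#include <cstddef>

// Assumption: same value as km::core::kmx::Uni_MAX_CODEPOINT.
constexpr char32_t kMaxCodepoint = 0x10FFFF;

// Hypothetical helper: visit every code point from 0 through
// kMaxCodepoint inclusive. char32_t is at least 32 bits wide, so the
// counter cannot wrap before the comparison fails.
template <typename Visitor>
void for_each_codepoint(Visitor visit) {
  for (char32_t ch = 0; ch <= kMaxCodepoint; ch++) {
    visit(ch);
  }
}
```

An inclusive bound visits 0x110000 values in total, one more than the strict comparison in the snippet under review.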

Contributor

@jahorton jahorton left a comment

Apologies if I wasn't able to cover everything you'd want in a review, but I get the feeling I'd need significant additional context to be provided in order to provide more in-depth feedback.

What I can review, LGTM.

@srl295
Member Author

srl295 commented Jun 13, 2024

> Apologies if I wasn't able to cover everything you'd want in a review, but I get the feeling I'd need significant additional context to be provided in order to provide more in-depth feedback.
>
> What I can review, LGTM.

Thanks… If a walk-through would be profitable at some point, let me know

- per review comments

Fixes: #9467

Co-authored-by: rc-swag <58423624+rc-swag@users.noreply.github.com>
@srl295 srl295 merged commit edd42d1 into master Jun 13, 2024
18 checks passed
@srl295 srl295 deleted the feat/core/9467-devolve-norm-to-js3-epic-ldml branch June 13, 2024 19:28
@keyman-server
Collaborator

Changes in this pull request will be available for download in Keyman version 18.0.56-alpha
