feat(core): devolve normalization to js 🙀 #11541

srl295 · 2024-05-24T23:23:09Z

Major redo of actions_normalize: it now works in UTF-32 and no longer uses ICU directly.
Adds some utilities: u32len, u32dup, context_items_from_utf32.

Wasm not working as I write this, but:

TODO: NFD table from feat(developer): ldml load ccc from icu 🙀 #10515 (but just for core)

Future items for this issue:

TODO: regex for wasm - coming soon
TODO: remove ICU from wasm (later PR)

(for now, probably should redo context tests eventually)

For: #9467

@keymanapp-test-bot skip


keymanapp-test-bot · 2024-05-24T23:23:21Z

User Test Results

Test specification and instructions

User tests are not required

Test Artifacts


srl295 · 2024-06-04T17:33:29Z

keyman/core/build.sh

Line 183 in fdb2e95

    
           meson compile -C ${KEYMAN_ROOT}/core/build/mac-x86_64/$BUILDER_CONFIGURATION tools/norm_data && mv -v  ${KEYMAN_ROOT}/core/build/mac-x86_64/$BUILDER_CONFIGURATION/tools/nfd_table.h ${KEYMAN_ROOT}/resources/standards-data/unicode-character-database/

Any ideas on how best to call into meson to update this file?

I looked at build-fixtures in kmc-ldml, but that's in developer, which doesn't have architectures.

srl295 · 2024-06-05T20:50:18Z

I think this could be ready for review. I will split regex out into a follow-on PR.

jahorton

I'm less familiar with this area of the codebase, so forgive me if I have a lot of concerns and/or questions.

What role does the NFD boundary table serve?

If we're not in WASM mode, then ICU should be available. If we are in WASM mode, we'll be leveraging JS normalization. My naive first instinct says that neither mode should need the table as a result, but it's obvious you wouldn't have written the boundary-table code if that were true.

I see that a number of changes are moving to std::u32string as the intermediary string type so that WASM won't need the icu::UnicodeString type in the future. That makes sense.

jahorton · 2024-06-11T03:38:25Z

core/src/actions_normalize.cpp

- * Helper to convert icu::UnicodeString to a UTF-32 km_core_usv buffer,
- * nul-terminated
- */
-inline km_core_usv *unicode_string_to_usv(icu::UnicodeString& src) {


If I'm tracking the changes correctly, this removal is fine due to no longer using icu::UnicodeString& as the standard intermediary string type.

Replacing it with std::u32string here and elsewhere... appears to avoid the need for all the ICU-related assertions, etc that previously existed here?

jahorton · 2024-06-11T03:42:45Z

core/src/actions_normalize.cpp

-  assert(!cached_context_string.isBogus());
-  assert(!app_context_string.isBogus());
-  if(output.isBogus() || cached_context_string.isBogus() || app_context_string.isBogus()) {
+  if (!context_items_to_unicode_string(app_context, app_context_string)) {


The new format of the method - returning an error code/flag instead of the ICU-string object directly - allows us to reuse its internal assertions + error-throwing instead of needing the previously-WET style being cleaned up here.

Assuming I'm parsing things correctly. (Took a while to ferret this detail out.)
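The pattern described above (return a success flag and write the string through an out-parameter) can be sketched as follows. This is only an illustration: `ContextItemSketch`, `ItemType`, and `context_items_to_u32string_sketch` are hypothetical names, and skipping markers is a simplification rather than what Core's real context conversion does.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified context item: either a character or a marker.
enum class ItemType { Char, Marker };
struct ContextItemSketch {
  ItemType type;
  char32_t value;
};

// True for Unicode scalar values: U+0000..U+10FFFF excluding surrogates.
bool is_scalar_value(char32_t ch) {
  return ch <= 0x10FFFF && !(ch >= 0xD800 && ch <= 0xDFFF);
}

// Writes the characters of `items` into `out` and reports success.
// All validation lives here, so callers need a single `if (!...)` check
// instead of repeating per-string validity tests.
bool context_items_to_u32string_sketch(const std::vector<ContextItemSketch> &items,
                                       std::u32string &out) {
  out.clear();
  for (const auto &item : items) {
    if (item.type == ItemType::Marker) {
      continue;  // simplification for this sketch: markers are skipped
    }
    if (!is_scalar_value(item.value)) {
      out.clear();  // leave no partial result behind on failure
      return false;
    }
    out.push_back(item.value);
  }
  assert(out.size() <= items.size());  // one output element per char item at most
  return true;
}
```

A caller would then write `if (!context_items_to_u32string_sketch(items, text)) { /* report the error */ }`, which mirrors the shape of the added line in the diff above.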

jahorton · 2024-06-11T03:46:11Z

core/src/core_icu.cpp

@@ -49,16 +48,20 @@ km_core_usv *unicode_string_to_usv(icu::UnicodeString& src) {
 * @return false if failure
 */
 bool normalize(const icu::Normalizer2 *n, std::u16string &str, UErrorCode &status) {


Why are we moving things to std::u32string when std::u16string is what we're using on the primary normalize method?

Actually, most processing is in UTF-32. The 16-bit code was only there because that's ICU's native API.
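To illustrate the difference between the two encodings discussed here, the following is a minimal, standalone sketch of decoding UTF-16 into UTF-32. The function name `utf16_to_utf32_sketch` is hypothetical and is not part of Core or ICU; replacing unpaired surrogates with U+FFFD is a choice made for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes UTF-16 into UTF-32. Unpaired surrogates become U+FFFD.
std::u32string utf16_to_utf32_sketch(const std::u16string &in) {
  std::u32string out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size(); i++) {
    const char32_t unit = in[i];
    const bool is_lead = unit >= 0xD800 && unit <= 0xDBFF;
    const bool is_trail = unit >= 0xDC00 && unit <= 0xDFFF;
    if (is_lead && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
      // A valid surrogate pair: combine both units into one code point.
      const char32_t trail = in[++i];
      out.push_back(0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00));
    } else if (is_lead || is_trail) {
      out.push_back(0xFFFD);  // unpaired surrogate
    } else {
      out.push_back(unit);    // BMP code point, copied as-is
    }
  }
  // UTF-32 never needs more elements than the UTF-16 input had code units.
  assert(out.size() <= in.size());
  return out;
}
```

Once text is in UTF-32, each element is one code point, so per-code-point work such as a boundary lookup needs no surrogate handling.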

jahorton · 2024-06-11T03:51:51Z

core/src/util_normalize.cpp

+/**
+ * Helper to convert icu::UnicodeString to a UTF-32 km_core_usv buffer,
+ * nul-terminated
+ */
+km_core_usv *string_to_usv(const std::u32string& src) {
+  return km::core::kmx::u32dup(src.c_str());
+}


Uh... so std::u32string doesn't look the same as icu::UnicodeString to me. Is the comment saying that we're basically just using the type to store icu::UnicodeString values without the need to link in its type for WASM, with this method doing the actual conversion?

Without the comment, my naive sight-reading of the code suggests that this is simply cloning the string. It looks like km_core_usv is almost just an alias for std::u32string, but something tells me that's probably not entirely right.

The comment is wrong here… If this function is retained, it's just a way to make the ICU and non-ICU versions a little bit more parallel

I'll have to take your word for it - I don't have sufficient context to review details related to your comment.

jahorton · 2024-06-11T03:53:21Z

core/src/kmx/kmx_xstring.cpp

+km_core_usv *km::core::kmx::u32dup(const km_core_usv *src) {
+  km_core_usv *dup = new km_core_usv[u32len(src) + 1];
+  memcpy(dup, src, (u32len(src) + 1) * sizeof(src[0]));
+  return dup;
+}


So, the .c_str() returned by std::u32string is equivalent to km_core_usv? Meaning it's std-string wrapper stuff that makes the difference between the two types?

Though... I think that this means we're allocating new memory here that lacks an inherent memory management scheme. I only see one new free call in this PR, and a new delete or two. One of the deletes is in cleanup after a failed assertion, which is good, but I'm not having an easy time tracing all the codepaths that will need memory management based on this.

This is patterned after C stdlib strdup. The caller is responsible for memory management.

Right, and in both callers (ICU and non-ICU), the buffer is deleted once the caller is done with it.
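The ownership contract discussed in this thread can be sketched as below. This is a standalone illustration, not Core's code: `usv_t` stands in for `km_core_usv` (assumed here to be compatible with `char32_t`, since the diff passes `std::u32string::c_str()` to `u32dup`), and `u32len_sketch`, `u32dup_sketch`, and `u32dup_owned` are hypothetical names. The `std::unique_ptr` wrapper is one possible way to make the `delete[]` automatic; the PR itself relies on callers deleting the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>

// Assumption: stand-in for Core's km_core_usv.
using usv_t = char32_t;

// Length, in code points, of a nul-terminated UTF-32 buffer.
std::size_t u32len_sketch(const usv_t *s) {
  assert(s != nullptr);
  std::size_t n = 0;
  while (s[n] != 0) {
    n++;
  }
  return n;
}

// strdup-style copy: allocates with new[], so the caller must delete[] it.
usv_t *u32dup_sketch(const usv_t *src) {
  const std::size_t n = u32len_sketch(src) + 1;  // include the terminator
  usv_t *dup = new usv_t[n];
  std::memcpy(dup, src, n * sizeof(usv_t));
  return dup;
}

// Hypothetical wrapper: unique_ptr<usv_t[]> calls delete[] automatically.
std::unique_ptr<usv_t[]> u32dup_owned(const std::u32string &src) {
  return std::unique_ptr<usv_t[]>(u32dup_sketch(src.c_str()));
}
```

With the raw form, every code path that receives the pointer has to end in `delete[]`; with the wrapper, the buffer is released when the `unique_ptr` goes out of scope, including on early returns.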

mcdurdin · 2024-06-11T23:48:19Z

> What role does the NFD boundary table serve?

NFD boundary data is not available in the web Intl libraries.

srl295 · 2024-06-11T23:56:13Z

The NFD boundary is part of the marker normalization algorithm. JavaScript normalization doesn't expose the NFD boundary, so the table is needed.
icu::UnicodeString is a heavy class, not a simple type. Removing it is part of shedding the ICU dependency for Wasm.

A C++ Wasm wrapper of ICU4C may be of general interest. I may write this up later.
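As a rough illustration of how a static boundary table could be queried, here is a minimal sketch of a run-length-encoded lookup using binary search. The layout of the real generated `nfd_table.h` is not shown in this thread, so `NoBoundaryRun`, `kSampleRuns`, and `has_nfd_boundary_before_sketch` are all hypothetical, and the two runs are hand-picked sample ranges of combining marks rather than generated data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>

// Hypothetical RLE entry: `length` consecutive code points, starting at
// `start`, that do NOT have an NFD boundary before them.
struct NoBoundaryRun {
  char32_t start;
  std::uint16_t length;
};

// Illustrative sample only, sorted by `start`; a real table would be generated.
static const NoBoundaryRun kSampleRuns[] = {
    {0x0300, 0x4F},  // U+0300..U+034E
    {0x0483, 0x05},  // U+0483..U+0487
};

// Returns true if `ch` is not covered by any run, i.e. has a boundary before it.
bool has_nfd_boundary_before_sketch(char32_t ch) {
  const NoBoundaryRun *begin = std::begin(kSampleRuns);
  const NoBoundaryRun *end = std::end(kSampleRuns);
  // Binary search requires the runs to be sorted by start.
  assert(std::is_sorted(begin, end, [](const NoBoundaryRun &a, const NoBoundaryRun &b) {
    return a.start < b.start;
  }));
  // Find the first run that starts after ch; the candidate is the one before it.
  const NoBoundaryRun *it = std::upper_bound(
      begin, end, ch,
      [](char32_t value, const NoBoundaryRun &run) { return value < run.start; });
  if (it == begin) {
    return true;  // ch precedes every run
  }
  const NoBoundaryRun &run = *(it - 1);
  return ch >= run.start + run.length;
}
```

Storing runs instead of individual code points keeps the table small, and a sorted table allows a logarithmic lookup instead of a linear scan.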

rc-swag · 2024-06-13T04:16:31Z

core/src/util_normalize_table_generator.cpp

+
+  // collect the raw list of chars that do NOT have a boundary before them.
+  std::vector<km_core_usv> noBoundary;
+  for (km_core_usv ch = 0; ch < 0x10FFFF; ch++) {


Suggested change

-  for (km_core_usv ch = 0; ch < 0x10FFFF; ch++) {
+  for (km_core_usv ch = 0; ch < km::core::kmx::Uni_MAX_CODEPOINT; ch++) {

If we are starting at 0, why are we finishing at 0x10FFFE instead of 0x10FFFF? 😁
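The off-by-one raised here can be checked with a small standalone sketch. `kMaxCodepoint` is an assumption that mirrors what `Uni_MAX_CODEPOINT` is expected to hold (0x10FFFF), and `last_visited` is a hypothetical helper written only for this illustration.

```cpp
#include <cassert>

// Assumption: same value as km::core::kmx::Uni_MAX_CODEPOINT.
constexpr char32_t kMaxCodepoint = 0x10FFFF;

// Returns the last code point visited by a loop with an exclusive upper bound.
char32_t last_visited(char32_t limit_exclusive) {
  assert(limit_exclusive > 0);
  char32_t last = 0;
  for (char32_t ch = 0; ch < limit_exclusive; ch++) {
    last = ch;
  }
  return last;
}
```

With these definitions, `last_visited(kMaxCodepoint)` stops at 0x10FFFE, while `last_visited(kMaxCodepoint + 1)` reaches 0x10FFFF. Writing the loop condition as `ch <= kMaxCodepoint` would be the equivalent inclusive form.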

jahorton

Apologies if I wasn't able to cover everything you'd want in a review, but I get the feeling I'd need significant additional context to be provided in order to provide more in-depth feedback.

What I can review, LGTM.

srl295 · 2024-06-13T12:19:13Z

> Apologies if I wasn't able to cover everything you'd want in a review, but I get the feeling I'd need significant additional context to be provided in order to provide more in-depth feedback.
>
> What I can review, LGTM.

Thanks… If a walk-through would be profitable at some point, let me know


keyman-server · 2024-06-14T18:04:35Z

Changes in this pull request will be available for download in Keyman version 18.0.56-alpha

srl295 added 3 commits May 24, 2024 09:24

feat(core): add a KMN_NO_ICU internal switch to start being able to t…

465c4bf

…urn off ICU - always set to 0 for now (keep ICU around) - set KMN_IN_LDML_TESTS in tests to keep ICU there for test and comparison - add core_icu.cpp and put some utils there. #9467

feat(core): move more normalization logic into JS

a80a0a7

- add a normalize_nfd() which takes a single codepoint - temporarily keep ICU in actions_normalize.cpp and ldml_transforms.cpp - expand wasm opts in unit tests

feat(core): move more normalization logic into JS

5d42024

- major redo of actions_normalize - into UTF-32 and not using ICU directly - add some utilities: u32len, u32dup, context_items_from_utf32 #9467

srl295 self-assigned this May 24, 2024

keymanapp-test-bot bot added the user-test-missing User tests have not yet been defined for the PR label May 24, 2024

keymanapp-test-bot bot added this to the A18S3 milestone May 24, 2024

github-actions bot added core/ Keyman Core feat labels May 24, 2024

Base automatically changed from feat/core/9467-devolve-norm-to-js2-epic-ldml to master May 27, 2024 15:26

github-actions bot added common/ common/web/ labels May 31, 2024

feat(core): generate and use static table in wasm for NFD boundary

53a6638

- add core/tools build tree with custom targets - add to core/build.sh to generate nfd_table.h - test_unicode to validate Unicode version and compare NFD to actual ICU - currently, linear search of the table.

srl295 force-pushed the feat/core/9467-devolve-norm-to-js3-epic-ldml branch from c53bcac to 53a6638 Compare June 4, 2024 16:15

Merge remote-tracking branch 'upstream/master' into feat/core/9467-de…

7fbeea8

…volve-norm-to-js3-epic-ldml

github-actions bot added common/resources/ Build infrastructure common/web/ and removed common/web/ labels Jun 4, 2024

feat(core): speedup NFD boundary table

e967386

- use RLE encoding, thanks @mcdurdin - much smaller table and faster lookup Fixes: #9467

github-actions bot added common/web/ and removed common/web/ labels Jun 4, 2024

feat(core): build improvements for --update-unicode option

fdb2e95

Fixes: #9467

github-actions bot added common/web/ and removed common/web/ labels Jun 4, 2024

Merge branch 'master' into feat/core/9467-devolve-norm-to-js3-epic-ldml

0665d82

github-actions bot added common/web/ and removed common/web/ labels Jun 4, 2024

srl295 marked this pull request as ready for review June 7, 2024 03:35

srl295 requested a review from rc-swag as a code owner June 7, 2024 03:35

Merge branch 'master' into feat/core/9467-devolve-norm-to-js3-epic-ldml

51172c3

srl295 requested a review from mcdurdin June 7, 2024 03:35

github-actions bot added common/ common/resources/ Build infrastructure common/web/ and removed common/ common/resources/ Build infrastructure common/web/ labels Jun 7, 2024

srl295 requested a review from jahorton June 7, 2024 13:56

mcdurdin modified the milestones: A18S3, A18S4 Jun 7, 2024

jahorton reviewed Jun 11, 2024


rc-swag reviewed Jun 13, 2024


jahorton approved these changes Jun 13, 2024


chore(core): update comments and remove a raw numeric literal

263d0e3

- per review comments Fixes: #9467 Co-authored-by: rc-swag <58423624+rc-swag@users.noreply.github.com>

github-actions bot added common/ common/resources/ Build infrastructure common/web/ and removed common/ common/resources/ Build infrastructure common/web/ labels Jun 13, 2024

srl295 merged commit edd42d1 into master Jun 13, 2024
18 checks passed

srl295 deleted the feat/core/9467-devolve-norm-to-js3-epic-ldml branch June 13, 2024 19:28
