-
-
Notifications
You must be signed in to change notification settings - Fork 112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(core): devolve normalization to js 🙀 #11541
Conversation
…urn off ICU - always set to 0 for now (keep ICU around) - set KMN_IN_LDML_TESTS in tests to keep ICU there for test and comparison - add core_icu.cpp and put some utils there. #9467
- add a normalize_nfd() which takes a single codepoint - temporarily keep ICU in actions_normalize.cpp and ldml_transforms.cpp - expand wasm opts in unit tests
- major redo of actions_normalize - into UTF-32 and not using ICU directly - add some utilities: u32len, u32dup, context_items_from_utf32 #9467
User Test ResultsTest specification and instructions User tests are not required Test Artifacts
|
- add core/tools build tree with custom targets - add to core/build.sh to generate nfd_table.h - test_unicode to validate Unicode version and compare NFD to actual ICU - currently, linear search of the table.
c53bcac
to
53a6638
Compare
…volve-norm-to-js3-epic-ldml
Line 183 in fdb2e95
any ideas on how to best call into meson for updating this file? i looked at |
I think this could be ready for review. Will split regex to a followon PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm less familiar with this area of the codebase, so forgive me if I have a lot of concerns and/or questions.
- What role does the NFD boundary table serve?
If we're not in WASM mode, then ICU should be available. If we are in WASM mode, we'll be leveraging JS normalization. My naive first instinct says that neither mode should need the table as a result, but it's obvious you wouldn't have written the boundary-table code if that were true.
- I see that a number of changes are moving to
std::u32string
as the intermediary string type so that WASM won't need theicu::UnicodeString
type in the future. That makes sense.
core/src/actions_normalize.cpp
Outdated
* Helper to convert icu::UnicodeString to a UTF-32 km_core_usv buffer, | ||
* nul-terminated | ||
*/ | ||
inline km_core_usv *unicode_string_to_usv(icu::UnicodeString& src) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm tracking the changes correctly, this removal is fine due to no longer using icu::UnicodeString&
as the standard intermediary string type.
Replacing it with std::u32string
here and elsewhere... appears to avoid the need for all the ICU-related assertions, etc that previously existed here?
assert(!cached_context_string.isBogus()); | ||
assert(!app_context_string.isBogus()); | ||
if(output.isBogus() || cached_context_string.isBogus() || app_context_string.isBogus()) { | ||
if (!context_items_to_unicode_string(app_context, app_context_string)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new format of the method - returning an error code/flag instead of the ICU-string object directly - allows us to reuse its internal assertions + error-throwing instead of needing the previously-WET style being cleaned up here.
Assuming I'm parsing things correctly. (Took a while to ferret this detail out.)
@@ -49,16 +48,20 @@ km_core_usv *unicode_string_to_usv(icu::UnicodeString& src) { | |||
* @return false if failure | |||
*/ | |||
bool normalize(const icu::Normalizer2 *n, std::u16string &str, UErrorCode &status) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we moving things to std::u32string
when std::u16string
is what we're using on the primary normalize
method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, most processing is in UTF 32… The 16 bit stuff was only because that's ICUs native API…
core/src/util_normalize.cpp
Outdated
/** | ||
* Helper to convert icu::UnicodeString to a UTF-32 km_core_usv buffer, | ||
* nul-terminated | ||
*/ | ||
km_core_usv *string_to_usv(const std::u32string& src) { | ||
return km::core::kmx::u32dup(src.c_str()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh... so std::u32string
doesn't look the same as icu::UnicodeString
to me. Is the comment saying that we're basically just using the type to store icu::UnicodeString
values without the need to link in its type for WASM, with this method doing the actual conversion?
Without the comment, my naive sight-reading of the code suggests that this is simply cloning the string. It looks like km_core_usv
is almost just an alias for std::u32string
, but something tells me that's probably not entirely right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is wrong here… If this function is retained, it's just a way to make the ICU and non-ICU versions a little bit more parallel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have to take your word for it - I don't have sufficient context to review details related to your comment.
km_core_usv *km::core::kmx::u32dup(const km_core_usv *src) { | ||
km_core_usv *dup = new km_core_usv[u32len(src) + 1]; | ||
memcpy(dup, src, (u32len(src) + 1) * sizeof(src[0])); | ||
return dup; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, the .c_str()
returned by std::u32string
is equivalent to km_core_usv
? Meaning it's std-string wrapper stuff that makes the difference between the two types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though... I think that this means we're allocating new memory here that lacks an inherent memory management scheme. I only see one new free
call in this PR, and a new delete
or two. One of the deletes
is in cleanup after a failed assertion, which is good, but I'm not having an easy time tracing all the codepaths that will need memory management based on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is patterned after C stdlib strdup
. Caller is reponsible for memory management.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right and in both of the callers, the ICU and non-ICU, the buffer is deleted when it's done
NFD boundary data is not available in the web Intl libraries. |
A c++ wasm wrapper of icu4c may be of general interest. May write this up later. |
|
||
// collect the raw list of chars that do NOT have a boundary before them. | ||
std::vector<km_core_usv> noBoundary; | ||
for (km_core_usv ch = 0; ch < 0x10FFFF; ch++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (km_core_usv ch = 0; ch < 0x10FFFF; ch++) { | |
for (km_core_usv ch = 0; ch < km::core::kmx::Uni_MAX_CODEPOINT; ch++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are starting at 0, why are we finishing at 0x10FFFE instead of 0x10FFFF? 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies if I wasn't able to cover everything you'd want in a review, but I get the feeling I'd need significant additional context to be provided in order to provide more in-depth feedback.
What I can review, LGTM.
Thanks… If a walk-through would be profitable at some point, let me know |
- per review comments Fixes: #9467 Co-authored-by: rc-swag <58423624+rc-swag@users.noreply.github.com>
Changes in this pull request will be available for download in Keyman version 18.0.56-alpha |
Wasm not working as I write this, but:
Future items for this issue:
(for now, probably should redo context tests eventually)
For: #9467
@keymanapp-test-bot skip