-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change icuexportdata trie format to improve normalizer performance #5813
Conversation
ICU4C issue: https://unicode-org.atlassian.net/browse/ICU-22968 |
ICU4C PR: unicode-org/icu#3269 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Landable for the purpose of 2.0, but I think this could have a couple more pointers in the docs and be more encapsulated.
/// Getting a zero from this trie means that you need | ||
/// to make another lookup from `DecompositionDataV1::trie`. | ||
pub struct DecompositionDataV2<'data> { | ||
/// Trie for decomposition. | ||
#[cfg_attr(feature = "serde", serde(borrow))] | ||
pub trie: CodePointTrie<'data, u32>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: I feel like the packed code logic is all scattered. Can we use a structured NormalizationTrieValue(pub u32)
type that has convenience methods for getting all the fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that what you suggest would be better for encapsulation. However, given that prior to this PR there was no such encapsulation and I'm already way over my time budget for this, I would very much prefer landing this ASAP (before 2.0 and before this bitrots) without such a refactoring and leaving the refactoring as a follow-up.
Once the ICU4C side lands, this PR needs an update to take a newer export zip in datagen. |
I realized while updating the tag that I think this change makes datagen incompatible with older ICU tags. The data coming from icuexportdata changed structure; it wasn't just some new additions. Are we ok with that? |
I pushed 4 commits to the branch (ignore the force-push; it was to fix a merge issue I made and did not change any of hsivonen's commits). CI is now all green except for a clippy issue. |
I'm fine with this breaking for 2.0. |
I think we pretty much have to be. The performance improvement here is just too good to reject in order to enable the use of old data (which would seem like a very niche use case if there even is a use case). Also, my understanding from prior discussions was that we had agreed we're OK with a data compatibility break like this for 2.0. Thanks for updating this with the new data export. |
Thanks! Landed. |
Why was half of this change done in ICU4C? Couldn't this transformation have been applied in datagen? I'm asking this for two reasons:
Edit: and
|
Because we started icuexportdata early on by putting the trie builder side is in the ICU4C repo.
That would have involved keeping around at least part of the ICU4X 1.5 code for interpreting the old data while rewriting parts of icuexportdata in Rust. Implementing #4602 for the normalizer could make sense as a RIIR project, but then it would make sense to work from UCD and not from whatever the shape of runtime data happened to be at ICU4X 1.5. (But, as discussed previously, the hardest part that makes ICU4X dependent on ICU4C is the collation data builder.) Part of the problem that this PR fixed is that the ICU4X 1.x normalizer data format was a decomposing normalizer format plus hacks enable a composing normalizer instead of being designed to support a composing normalizer. It wasn't at all designed to support transformation by datagen.
I think freezing the icuexportdata format for normalization as it happened to be at ICU4X 1.5 would have been anti-useful, since the format isn't meant to be transformed by datagen. |
With the fast trie type, I see this kind of performance improvement: