Investigate performance impact of rearranging "can combine backwards" bit #4967

Open
hsivonen opened this issue May 30, 2024 · 2 comments
Labels
2.0-breaking Changes that are breaking API changes A-performance Area: Performance (CPU, Memory) C-collator Component: Collation, normalization

Comments

@hsivonen
Member

For characters that are their own decomposition, the least significant bit signifies "can combine backwards". As of Unicode 16, this information is also needed for complex decompositions, but the same bit was already taken, so the second-least-significant bit is used (by #4860).

Investigate the performance impact of flipping around the two bit allocations for complex decompositions and unifying the "can combine backwards" bit check.
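
To make the proposal concrete, here is a minimal Rust sketch of the two schemes. It is a model under stated assumptions, not the actual ICU4X trie value layout: the `Kind` enum, the use of a plain `u32`, and all function names are hypothetical. The only details taken from the description above are that bit 0 carries "can combine backwards" for self-decomposing characters and bit 1 carries it for complex decompositions.

```rust
// Hypothetical model of the bit allocation described in this issue.
// `Kind`, the `u32` value type, and all function names are assumptions
// made for illustration; they are not ICU4X APIs.

#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    /// The character is its own decomposition.
    SelfDecomposition,
    /// The character has a complex decomposition.
    Complex,
}

const BIT0: u32 = 0b01; // least significant bit
const BIT1: u32 = 0b10; // second-least-significant bit

/// Current scheme: the bit to test depends on the kind of value,
/// so the caller has to branch before it can read the flag.
fn can_combine_backwards_current(kind: Kind, value: u32) -> bool {
    match kind {
        Kind::SelfDecomposition => value & BIT0 != 0,
        Kind::Complex => value & BIT1 != 0,
    }
}

/// Proposed data change, modeled as a swap of the two low bits of a
/// complex-decomposition value. In practice this would happen once,
/// when the data is generated.
fn swap_low_bits(value: u32) -> u32 {
    let b0 = value & BIT0;
    let b1 = (value & BIT1) >> 1;
    (value & !(BIT0 | BIT1)) | (b0 << 1) | b1
}

/// Unified check after the swap: one mask, no dependence on the kind.
fn can_combine_backwards_unified(value: u32) -> bool {
    value & BIT0 != 0
}
```

For any complex value `v`, `can_combine_backwards_unified(swap_low_bits(v))` agrees with `can_combine_backwards_current(Kind::Complex, v)`, and self-decomposing values need no change. Whether removing the branch is measurably faster is exactly what this issue asks to be benchmarked.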

@hsivonen hsivonen added A-performance Area: Performance (CPU, Memory) C-collator Component: Collation, normalization 2.0-breaking Changes that are breaking API changes labels May 30, 2024
@sffc sffc added this to the ICU4X 2.0 milestone May 30, 2024
@sffc
Member

sffc commented May 30, 2024

Seems like something that would be beneficial to do in 2.0. Anyone can take this and @hsivonen has left enough of a trail. Perhaps @echeran

@sffc
Member

sffc commented Nov 2, 2024

Estimation of 2.0 status: time to land normalization performance?

  • @hsivonen Is there time to land normalization data struct performance?
  • @Manishearth Next two weeks... I think we're very close. We've been chipping away at the small things. I'm going by the plan for beta.
  • @sffc I think a data struct change could still land in 2.0 final. It doesn't need to be in beta.
  • @Manishearth Can you describe the nature of the changes?
  • @hsivonen (1) Decomposition doesn't have a bit to say if there is ... (2) The trivial bit. (3) The K normalizations are supplementary tries as opposed to duplicated data. (4) It checks eagerly for Hangul instead of having a trie value for it. It would be nice to make these better.
  • @Manishearth For changing data for 2.0 final, we can mostly do that. 2.0 beta is complete more or less. We have mid-Dec for 2.0 final.
  • @sffc If you land those things sometime in Q4, there is a high chance that we can get those in.
  • @hsivonen I only expect the controversy to be the change to the data struct. Other than that, I expect it to be, "benchmarks are improved".
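
For readers unfamiliar with item (4) in the notes above, the following Rust sketch shows what an eager Hangul check can look like: a range test on the code point that runs before any trie lookup, followed by the arithmetic decomposition defined by the Unicode Standard. The constants are the standard Hangul syllable constants; the function name and return type are assumptions for illustration and do not reflect the ICU4X implementation.

```rust
// Standard Hangul syllable constants from the Unicode Standard.
const S_BASE: u32 = 0xAC00; // first precomposed syllable
const L_BASE: u32 = 0x1100; // first leading consonant
const V_BASE: u32 = 0x1161; // first vowel
const T_BASE: u32 = 0x11A7; // one before the first trailing consonant
const T_COUNT: u32 = 28;
const N_COUNT: u32 = 588; // V_COUNT (21) * T_COUNT (28)
const S_COUNT: u32 = 11172; // L_COUNT (19) * N_COUNT

/// Hypothetical helper: eagerly tests whether `c` is a precomposed Hangul
/// syllable and, if so, decomposes it arithmetically without a trie lookup.
/// Returns `None` for everything else, which would then go to the trie.
fn decompose_hangul_eagerly(c: char) -> Option<(char, char, Option<char>)> {
    // A single wrapping subtraction plus one comparison covers both ends
    // of the range U+AC00..=U+D7A3.
    let s_index = (c as u32).wrapping_sub(S_BASE);
    if s_index >= S_COUNT {
        return None;
    }
    let l = char::from_u32(L_BASE + s_index / N_COUNT)?;
    let v = char::from_u32(V_BASE + (s_index % N_COUNT) / T_COUNT)?;
    let t_index = s_index % T_COUNT;
    let t = if t_index == 0 {
        None
    } else {
        char::from_u32(T_BASE + t_index)
    };
    Some((l, v, t))
}
```

The alternative mentioned in the notes would mark Hangul syllables with a dedicated trie value, so that non-Hangul input does not pay for the range test on every character.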

Projects
Status: Investigate
Development

No branches or pull requests

2 participants