Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Require opt-in for search collation datagen #2708

Closed
hsivonen opened this issue Oct 3, 2022 · 12 comments · Fixed by #2789
Closed

Require opt-in for search collation datagen #2708

hsivonen opened this issue Oct 3, 2022 · 12 comments · Fixed by #2789
Assignees
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-small Size: One afternoon (small bug fix or enhancement)

Comments

@hsivonen
Copy link
Member

hsivonen commented Oct 3, 2022

Currently, if you follow the data management tutorial, you end up with data for the search collations.

(Steps to repro: run icu4x-datagen --cldr-tag latest --icuexport-tag latest --out my-data --format dir --all-keys --all-locales and then find my-data/ | grep search)

This is bad, because we don't have a search API, so the search collations are in practice useless extra data.

For Web browsers that want to keep exposing the search collations via non-search API for compatibility with the present state of the Web, there should be some opt-in option to force the generation of search collation data. However, for just about everyone else, it's a data size bug to have the search collation data generated.

@hsivonen hsivonen added the C-data-infra Component: provider, datagen, fallback, adapters label Oct 3, 2022
@hsivonen
Copy link
Member Author

hsivonen commented Oct 3, 2022

That is, without an opt-in, collations whose -u-co- subcomponent starts with "search" should be excluded. (Note that the spec makes starting with the thing that distinguished searchness so that in the current state of CLDR "sealchjl" matches and in the future other similar refinements could match.)

@sffc sffc self-assigned this Oct 3, 2022
@robertbastian
Copy link
Member

The solution could be to make the --all-locales option filter out -u-co-search locales. Specifying them explicitly would still work.

@robertbastian
Copy link
Member

Actually I think --all-locales should continue to generate all possible locales. It's not a flag that clients should generally use.

On the Rust API we should replace the locales: &[Locale] input with something cleverer like locales: impl Fn(&Locale) -> bool. We can then use this to filter out search collations, or rare Han collations (#2709). On the CLI we could offer some presets like --all-locales-except-legacy-collations.

@zbraniecki
Copy link
Member

Is our CLI part of the stability guarantees? If no, I'm comfortable with --all-locales-except-legacy-collations as an iterative step. if yes, then I'd prefer to avoid one-offs like this and systematize template file (yaml?) with keys+locales matrix, and/or exceptions (where legacy collations would be an available exception).

@robertbastian
Copy link
Member

Adding an extra flag wouldn't be a breaking change anyway. Where we really backed ourselves into a corner is on the Rust API side, where we cannot easily provide functions to generate lists of locales (as we do for keys), because they would depend on data and keys.

@sffc
Copy link
Member

sffc commented Oct 3, 2022

When it comes to whether or not to include extension keywords, I think it makes the most sense to just add the configurations as new options in datagen.

The fixed-size locale list allows us to perform smarter locale filtering (#834) and it should remain the primary entry point, although it may have been more appropriate to make it a list of LanguageIdentifier instead of Locale.

locales: impl Fn(&Locale) -> bool is something we should support, too, because it is highly general, but it should be a power user feature.

@zbraniecki
Copy link
Member

Adding an extra flag wouldn't be a breaking change anyway.

My question is how easy it will be to remove a flag when we generalize the solution and won't need --all-locales-but-with-exceptionA.

@sffc
Copy link
Member

sffc commented Oct 3, 2022

The intent behind --locales and --all-locales was to enable language/script/region tuples, which is a bit of a misnomer. It should have nothing to do with whether or not -u-co-search is enabled.

We already have --collation-root. Just add another flag like --without-collation-search or --without-collation-legacy.

@hsivonen
Copy link
Member Author

hsivonen commented Oct 4, 2022

I think this one isn't only about having some flag to omit the search collations but also having ergonomics such that the option that seems the most natural to pick doesn't generate data that we don't have a (proper) API for (search/searchjl).

Managing exclusion of other collations that are often unwanted but don't have a clear line like "we don't have an API for these" is a harder datagen UI problem.

@hsivonen
Copy link
Member Author

hsivonen commented Oct 4, 2022

(Regarding "legacy": Arguably all "traditional" collations are legacy, but in terms of data size and in terms of someone perhaps wishing to use them, they aren't on the same level of data size issue as big5han and gb2312, which I failed to find any use case justification for when I did repo archeology. My inference is that big5han and gb2312 originated by analogy from the Japanese standard collation applied to Traditional Chinese and Simplified Chinese, respectively, and they were renamed instead of getting removed from ICU when more appropriate collations were introduced. AFAICT, the standard for -u-co- works backwards from what ICU had at the time of writing it down in the standard. Arguably *-u-co-unihan collations aren't legacy but otherwise anecdotally in low user demand. I filed #2709 separately, because it's not as clear as "we don't have a search API".)

@sffc
Copy link
Member

sffc commented Oct 4, 2022

I'm okay if we exclude search by default, and you can pass --include-collation-search to add it.

@hsivonen
Copy link
Member Author

hsivonen commented Oct 5, 2022

I'm okay if we exclude search by default, and you can pass --include-collation-search to add it.

To me, this seems like a good way forward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-small Size: One afternoon (small bug fix or enhancement)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants