-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make ResourcePath compatible with [Var]ZeroVec #243
Comments
@zbraniecki had some ideas on using an |
Additional discussion: consider keeping the input order of variant subtags when constructing objects, and sort them as a separate step, which can be the same step as the full locale identifier canonicalization (#218). When implementing the separate sorting step, we can have functions to perform either UTS 35 sorting or BCP 47 sorting. |
About BCP 47 “correct order”: https://www.rfc-editor.org/rfc/rfc5646.html#section-4.1 page 57 point 6 “Use variant subtags sparingly and in the correct order.” See also https://www.rfc-editor.org/rfc/rfc5646.html#section-3.1.8 page 31, 4th paragraph “The 'Prefix' also indicates when variant subtags make sense when used together [...] and what the relative ordering of variants is supposed to be.” IMO the BCP 47 variant algorithm is a bit baroque (and not very explicitly spelled out). That's why I think it's defensible to preserve the input order of variants, except that an equality test and matching etc. then have to deal with potentially mis-ordered variant subtags. (And sort or treat as sets etc.) Selecting “the first variant subtag” is therefore error-prone. Note that for the example of "sl-rozaj-biske-1994" LDML ASCII order of variant subtags is the opposite order of what BCP 47 calls “correct”. If using an LSRV structure, then I would move all of the variant subtags into the extended data structure if there is more than one. |
Forgive the long semi-on-topic comment. The baroque nature of variant ordering in BCP47 is necessary due to the use of
The Slovenian dialect subtags are indeed complicated and do represent the most (um...) "expressive" use of the subtag ordering scheme. They are also some of the earliest variants registered. Of the other variants, the only other one that has an ordering with another variant (as opposed to just a non-variant tag prefix) that is currently registered is The issue here is that there is nothing that would prevent new variant subtags that depend on other subtags being registered tomorrow--and the existence of some that already have a dependency structure seems like something not to mess with. Most of the other variants that share a prefix don't make sense together and thus the order only matters to the content's author. When it comes to locale-based operations, variant subtags are primarily a nuisance. I'm glad they exist as a "relief valve" for random sub-language registration requests. That said, they are mainly "default ignorable". I am kind of curious why we need to expend any effort reordering them--BCP47 "remove from right" fallback says that the the subtag order matters. If you give me a tag, I should expect it to match or be processed in the subtag order received (and not some other order). Thus the first subtag is the first one received. Woe betide you if you use |
Would it be beneficial to deprecate the subtag prefix fields? It seems in
all the years we only got 2 (Addison's example), and they cover a very
small slice of population.
Keep a simple 1-variant case that covers all CLDR languages, or even
n-variant cases but without ordering requirement.
чет, 22. окт 2020. у 14:48 Addison Phillips <notifications@github.com> је
написао/ла:
… Forgive the long semi-on-topic comment.
The baroque nature of variant ordering in BCP47 is necessary due to the
use of Prefix fields. For the example tag (sl-rozaj-biske-1994), the
subtag biske has a Prefix of sl-rozaj. It's technically *wrong* to use
the biske subtag in front of rozaj and, indeed, biske is a dialect of
rozaj (Resian). Similarly the 1994 subtag has these Prefix fields:
Prefix: sl-rozaj
Prefix: sl-rozaj-biske
Prefix: sl-rozaj-njiva
Prefix: sl-rozaj-osojs
Prefix: sl-rozaj-solba
The Slovenian dialect subtags are indeed complicated and do represent the
most (um...) "expressive" use of the subtag ordering scheme. They are also
some of the earliest variants registered. Of the other variants, the only
other one that has an ordering with another variant (as opposed to just a
non-variant tag prefix) that is currently registered is alalc97, whose
prefix says ja-Latn-hepburn but for which there is a comment saying that
ja-Latn-alalc97 is "the preferred tag". The Prefix is probably a clerical
error.
The issue here is that there is nothing that would prevent new variant
subtags that depend on other subtags being registered tomorrow--and the
existence of some that already *have* a dependency structure seems like
something not to mess with. Most of the other variants that share a prefix
don't make sense together and thus the order only matters to the content's
author.
When it comes to locale-based operations, variant subtags are primarily a
nuisance. I'm glad they exist as a "relief valve" for random sub-language
registration requests. That said, they are mainly "default ignorable". I am
kind of curious why we need to expend any effort reordering them--BCP47
"remove from right" fallback says that the the subtag order matters. If you
give me a tag, I should expect it to match or be processed in the subtag
order received (and not some other order). Thus the first subtag is the
first one received. Woe betide you if you use en-boont-basiceng and
expect it to match en-basiceng.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#243 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA7GEKX2GQLYXWHXVS2B5FLSMCSDZANCNFSM4RODIXGA>
.
|
That's not precisely true. In all these years there have only been that many prefixed-by-another-variant variants. But there are plenty of variants (most with only a few exceptions) that have a non-variant prefix and quite a number of these share a Prefix with another variant and thus might reasonably be used together. In these cases, the user choosing the tag defines the order/dependency of meaning. This seems inconvenient, since it means that Variants by their very nature only cover a very small segment of the population--they are mostly, as I said, "default ignorable". Are any of the variant subtags useful in CLDR? Note that the Prefix field is "sort-of" normative in BCP47, but a tag can be considered I guess what I'm saying is: variants are an ordered list of opaque subtags. Unless/until a variant has meaning in a locale (which I don't think any do?) they are just part of the JSRV fallback/matching chain that might influence resource selection but not much else. Take them in the order given, casefold lowercase, and don't do much (preferably anything) else with them. But I think it is a bad idea to limit them to 1 or apply reordering/lose the ordering, since that's not compatible with BCP47. Is there a reason we'd break that which I'm not understanding? |
I agree about (not) reordering. My motivation for creating a new type alongside of Locale and LanguageIdentifier that limits the number of variant subtags is because it's desirable in Rust to have a type that is copyable (via memcpy) without the possibility that it holds a reference to memory somewhere else. This is why I was proposing LSRV. It's not a full Unicode Language Identifier, but it would get the job done for 99.9% of cases, including 100% of cases that come from CLDR. |
2020-11-20: Put LSRV in the icu_provider crate for now. For the sorting (or not sorting) of variants, @zbraniecki pointed out that it comes down to BCP47 not distinguishing between multiple variants and multipart variants. I suggested that we can solve this problem with data, in #218. |
I decided to do this a little differently. I'm going to model the new ResourceOptions as a non-copyable bag of optional fields. In this model, we don't need LanguageIdentifier to be copyable. |
I'm reopening this issue because now we have a concrete reason to try to make |
On this subject, it does appear that a small but nontrivial fraction of the ICU4X data loading infrastructure is spent in the "resource_path_to_string" function. Ideally we can avoid the allocation of a string when fetching data from the data provider. |
I think we should keep ResourcePath variable-width and ResourceKey fixed-width. However, we should:
|
Blocked by #2032 |
Name bikeshed aside, what do you think about
LanguageIdentifier would have a convenience function that lets you get an iterator over the variants; the first variant is pulled from the LSRV, and remaining variants are pulled from the boxed slice.
We can have
impl From<LSRV> for LanguageIdentifier
. In the other direction, we could have a methodLSRV truncate(&self)
.@zbraniecki
The text was updated successfully, but these errors were encountered: