Make ResourcePath compatible with [Var]ZeroVec #243

sffc · 2020-09-16T02:41:25Z

Name bikeshed aside, what do you think about

#[derive(Copy, Clone)]
pub struct LSRV {
  pub language: subtags::Language,
  pub script: Option<subtags::Script>,
  pub region: Option<subtags::Region>,
  pub variant: Option<subtags::Variant>,
}

#[derive(Clone)]
pub struct LanguageIdentifier {
  lsrv: LSRV,
  additional_variants: Box<[subtags::Variant]>,
}

LanguageIdentifier would have a convenience function that lets you get an iterator over the variants; the first variant is pulled from the LSRV, and remaining variants are pulled from the boxed slice.

We can have impl From<LSRV> for LanguageIdentifier. In the other direction, we could have a method LSRV truncate(&self).

@zbraniecki

The text was updated successfully, but these errors were encountered:

sffc · 2020-10-22T17:36:56Z

@zbraniecki had some ideas on using an enum to make LanguageIdentifier copyable, by toggling it between a single-variant-subtag and a multi-variant-subtag.

sffc · 2020-10-22T18:45:02Z

Additional discussion: consider keeping the input order of variant subtags when constructing objects, and sort them as a separate step, which can be the same step as the full locale identifier canonicalization (#218). When implementing the separate sorting step, we can have functions to perform either UTS 35 sorting or BCP 47 sorting.

CC @markusicu @macchiati

markusicu · 2020-10-22T21:13:39Z

About BCP 47 “correct order”:

https://www.rfc-editor.org/rfc/rfc5646.html#section-4.1 page 57 point 6 “Use variant subtags sparingly and in the correct order.”

See also https://www.rfc-editor.org/rfc/rfc5646.html#section-3.1.8 page 31, 4th paragraph “The 'Prefix' also indicates when variant subtags make sense when used together [...] and what the relative ordering of variants is supposed to be.”

IMO the BCP 47 variant algorithm is a bit baroque (and not very explicitly spelled out). That's why I think it's defensible to preserve the input order of variants, except that an equality test and matching etc. then have to deal with potentially mis-ordered variant subtags. (And sort or treat as sets etc.)

Selecting “the first variant subtag” is therefore error-prone.

Note that for the example of "sl-rozaj-biske-1994" LDML ASCII order of variant subtags is the opposite order of what BCP 47 calls “correct”.

If using an LSRV structure, then I would move all of the variant subtags into the extended data structure if there is more than one.

aphillips · 2020-10-22T21:48:29Z

Forgive the long semi-on-topic comment.

The baroque nature of variant ordering in BCP47 is necessary due to the use of Prefix fields. For the example tag (sl-rozaj-biske-1994), the subtag biske has a Prefix of sl-rozaj. It's technically wrong to use the biske subtag in front of rozaj and, indeed, biske is a dialect of rozaj (Resian). Similarly the 1994 subtag has these Prefix fields:

Prefix: sl-rozaj
Prefix: sl-rozaj-biske
Prefix: sl-rozaj-njiva
Prefix: sl-rozaj-osojs
Prefix: sl-rozaj-solba

The Slovenian dialect subtags are indeed complicated and do represent the most (um...) "expressive" use of the subtag ordering scheme. They are also some of the earliest variants registered. Of the other variants, the only other one that has an ordering with another variant (as opposed to just a non-variant tag prefix) that is currently registered is heploc, whose prefix says ja-Latn-hepburn but for which there is a comment saying that ja-Latn-alalc97 is "the preferred tag" (as the subtag has been deprecated).

The issue here is that there is nothing that would prevent new variant subtags that depend on other subtags being registered tomorrow--and the existence of some that already have a dependency structure seems like something not to mess with. Most of the other variants that share a prefix don't make sense together and thus the order only matters to the content's author.

When it comes to locale-based operations, variant subtags are primarily a nuisance. I'm glad they exist as a "relief valve" for random sub-language registration requests. That said, they are mainly "default ignorable". I am kind of curious why we need to expend any effort reordering them--BCP47 "remove from right" fallback says that the the subtag order matters. If you give me a tag, I should expect it to match or be processed in the subtag order received (and not some other order). Thus the first subtag is the first one received. Woe betide you if you use en-boont-basiceng and expect it to match en-basiceng.

nciric · 2020-10-23T19:04:32Z

Would it be beneficial to deprecate the subtag prefix fields? It seems in all the years we only got 2 (Addison's example), and they cover a very small slice of population. Keep a simple 1-variant case that covers all CLDR languages, or even n-variant cases but without ordering requirement. чет, 22. окт 2020. у 14:48 Addison Phillips <notifications@github.com> је написао/ла:

…

Forgive the long semi-on-topic comment. The baroque nature of variant ordering in BCP47 is necessary due to the use of Prefix fields. For the example tag (sl-rozaj-biske-1994), the subtag biske has a Prefix of sl-rozaj. It's technically *wrong* to use the biske subtag in front of rozaj and, indeed, biske is a dialect of rozaj (Resian). Similarly the 1994 subtag has these Prefix fields: Prefix: sl-rozaj Prefix: sl-rozaj-biske Prefix: sl-rozaj-njiva Prefix: sl-rozaj-osojs Prefix: sl-rozaj-solba The Slovenian dialect subtags are indeed complicated and do represent the most (um...) "expressive" use of the subtag ordering scheme. They are also some of the earliest variants registered. Of the other variants, the only other one that has an ordering with another variant (as opposed to just a non-variant tag prefix) that is currently registered is alalc97, whose prefix says ja-Latn-hepburn but for which there is a comment saying that ja-Latn-alalc97 is "the preferred tag". The Prefix is probably a clerical error. The issue here is that there is nothing that would prevent new variant subtags that depend on other subtags being registered tomorrow--and the existence of some that already *have* a dependency structure seems like something not to mess with. Most of the other variants that share a prefix don't make sense together and thus the order only matters to the content's author. When it comes to locale-based operations, variant subtags are primarily a nuisance. I'm glad they exist as a "relief valve" for random sub-language registration requests. That said, they are mainly "default ignorable". I am kind of curious why we need to expend any effort reordering them--BCP47 "remove from right" fallback says that the the subtag order matters. If you give me a tag, I should expect it to match or be processed in the subtag order received (and not some other order). Thus the first subtag is the first one received. Woe betide you if you use en-boont-basiceng and expect it to match en-basiceng. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#243 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA7GEKX2GQLYXWHXVS2B5FLSMCSDZANCNFSM4RODIXGA> .

aphillips · 2020-10-23T20:17:38Z

It seems in all the years we only got 2 (Addison's example), and they cover a very small slice of population.

That's not precisely true. In all these years there have only been that many prefixed-by-another-variant variants. But there are plenty of variants (most with only a few exceptions) that have a non-variant prefix and quite a number of these share a Prefix with another variant and thus might reasonably be used together. In these cases, the user choosing the tag defines the order/dependency of meaning. This seems inconvenient, since it means that en-boont-basiceng != en-basiceng-boont even though they apparently mean much the same thing.

Variants by their very nature only cover a very small segment of the population--they are mostly, as I said, "default ignorable". Are any of the variant subtags useful in CLDR?

Note that the Prefix field is "sort-of" normative in BCP47, but a tag can be considered valid even if it ignores the Prefix field, e.g. tlh-Cyrl-AQ-boont is valid even though boont has a prefix of en and ought not be used with any other language tag.

I guess what I'm saying is: variants are an ordered list of opaque subtags. Unless/until a variant has meaning in a locale (which I don't think any do?) they are just part of the JSRV fallback/matching chain that might influence resource selection but not much else. Take them in the order given, casefold lowercase, and don't do much (preferably anything) else with them.

But I think it is a bad idea to limit them to 1 or apply reordering/lose the ordering, since that's not compatible with BCP47. Is there a reason we'd break that which I'm not understanding?

sffc · 2020-10-26T09:49:20Z

But I think it is a bad idea to limit them to 1 or apply reordering/lose the ordering, since that's not compatible with BCP47. Is there a reason we'd break that which I'm not understanding?

I agree about (not) reordering.

My motivation for creating a new type alongside of Locale and LanguageIdentifier that limits the number of variant subtags is because it's desirable in Rust to have a type that is copyable (via memcpy) without the possibility that it holds a reference to memory somewhere else. This is why I was proposing LSRV. It's not a full Unicode Language Identifier, but it would get the job done for 99.9% of cases, including 100% of cases that come from CLDR.

sffc · 2020-11-20T19:00:15Z

2020-11-20: Put LSRV in the icu_provider crate for now.

For the sorting (or not sorting) of variants, @zbraniecki pointed out that it comes down to BCP47 not distinguishing between multiple variants and multipart variants. I suggested that we can solve this problem with data, in #218.

sffc · 2020-12-09T00:26:08Z

This issue can be done alongside #244. Waiting on #196 first.

sffc · 2020-12-23T08:24:14Z

I decided to do this a little differently. I'm going to model the new ResourceOptions as a non-copyable bag of optional fields. In this model, we don't need LanguageIdentifier to be copyable.

sffc · 2021-11-19T20:53:04Z

I'm reopening this issue because now we have a concrete reason to try to make ResourcePath Sized and Copy. It is used as a key in the ZeroMap, and Sized+Copy keys are measurably faster than unsized ones like str.

sffc · 2021-11-20T05:24:40Z

On this subject, it does appear that a small but nontrivial fraction of the ICU4X data loading infrastructure is spent in the "resource_path_to_string" function. Ideally we can avoid the allocation of a string when fetching data from the data provider.

sffc · 2021-12-21T07:35:45Z

I think we should keep ResourcePath variable-width and ResourceKey fixed-width. However, we should:

Allow ResourceKey to be represented in 4 bytes (Revisit resource_key identifier being TinyStr16 #1148)
Make ResourcePath implement VarULE, which will help us remove the expensive resource_path_to_string function.

sffc · 2022-06-24T18:21:21Z

Blocked by #2032

sffc added T-enhancement Type: Nice-to-have but not required C-locale Component: Locale identifiers, BCP47 discuss Discuss at a future ICU4X-SC meeting labels Sep 16, 2020

sffc self-assigned this Sep 17, 2020

sffc added this to the 2020 Q4 milestone Sep 17, 2020

sffc mentioned this issue Sep 18, 2020

Add icu-locale macros #220

Merged

sffc assigned zbraniecki and unassigned sffc Oct 22, 2020

sffc assigned sffc and unassigned zbraniecki Nov 20, 2020

sffc added A-performance Area: Performance (CPU, Memory) and removed discuss Discuss at a future ICU4X-SC meeting labels Nov 20, 2020

sffc added C-data-infra Component: provider, datagen, fallback, adapters and removed C-locale Component: Locale identifiers, BCP47 labels Dec 4, 2020

sffc added the blocked A dependency must be resolved before this is actionable label Dec 9, 2020

sffc mentioned this issue Dec 14, 2020

Add Locale support for likely subtags #417

Closed

sffc removed the blocked A dependency must be resolved before this is actionable label Dec 23, 2020

sffc mentioned this issue Dec 23, 2020

Renaming and refactoring DataKey, DataEntry, and other traits; adding ErasedDataProvider #431

Merged

sffc closed this as completed in #431 Jan 5, 2021

sffc mentioned this issue Jan 11, 2021

Add LocaleCanonicalizer to support likely subtags #423

Merged

sffc mentioned this issue Jun 28, 2021

Make LanguageIdentifier and/or Locale compatible with [Var]ZeroVec #831

Closed

sffc modified the milestones: 2020 Q4, ICU4X 0.6 Nov 19, 2021

sffc added the v1 label Nov 19, 2021

sffc reopened this Nov 19, 2021

sffc added the S-medium Size: Less than a week (larger bug fix or enhancement) label Nov 19, 2021

sffc changed the title ~~New idea for LSRV~~ Make ResourcePath compatible with ZeroVec via LSRV type Nov 19, 2021

sffc changed the title ~~Make ResourcePath compatible with ZeroVec via LSRV type~~ Make ResourcePath compatible with [Var]ZeroVec Dec 21, 2021

sffc added S-epic Size: Major project (create smaller child issues) and removed S-medium Size: Less than a week (larger bug fix or enhancement) labels Feb 1, 2022

This was referenced Feb 1, 2022

Migrate BlobDataProvider to ZeroMap2d #1567

Closed

Split DataProvider into ResourceProvider and DynProvider #1554

Merged

sffc removed the v1 label Apr 1, 2022

sffc modified the milestones: ICU4X 0.6, ICU4X 1.0 (Features) May 25, 2022

This was referenced Jun 3, 2022

Refactor ZeroMap2d and add get_by functions #1876

Merged

Update Locale and LanguageIdentifier comparison names #2020

Merged

Use get_by in BlobDataProvider lookup function #2033

Merged

sffc added the blocked A dependency must be resolved before this is actionable label Jun 24, 2022

sffc mentioned this issue Jun 25, 2022

Add more strict_cmp_iter and migrate ResourceOptions::strict_cmp #2114

Merged

sffc closed this as completed in #2033 Jun 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make ResourcePath compatible with [Var]ZeroVec #243

Make ResourcePath compatible with [Var]ZeroVec #243

sffc commented Sep 16, 2020 •

edited

Loading

sffc commented Oct 22, 2020

sffc commented Oct 22, 2020

markusicu commented Oct 22, 2020

aphillips commented Oct 22, 2020 •

edited

Loading

nciric commented Oct 23, 2020 via email

aphillips commented Oct 23, 2020

sffc commented Oct 26, 2020

sffc commented Nov 20, 2020

sffc commented Dec 9, 2020

sffc commented Dec 23, 2020

sffc commented Nov 19, 2021

sffc commented Nov 20, 2021

sffc commented Dec 21, 2021

sffc commented Jun 24, 2022

Make ResourcePath compatible with [Var]ZeroVec #243

Make ResourcePath compatible with [Var]ZeroVec #243

Comments

sffc commented Sep 16, 2020 • edited Loading

sffc commented Oct 22, 2020

sffc commented Oct 22, 2020

markusicu commented Oct 22, 2020

aphillips commented Oct 22, 2020 • edited Loading

nciric commented Oct 23, 2020 via email

aphillips commented Oct 23, 2020

sffc commented Oct 26, 2020

sffc commented Nov 20, 2020

sffc commented Dec 9, 2020

sffc commented Dec 23, 2020

sffc commented Nov 19, 2021

sffc commented Nov 20, 2021

sffc commented Dec 21, 2021

sffc commented Jun 24, 2022

sffc commented Sep 16, 2020 •

edited

Loading

aphillips commented Oct 22, 2020 •

edited

Loading