Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make ResourcePath compatible with [Var]ZeroVec #243

Closed
sffc opened this issue Sep 16, 2020 · 14 comments · Fixed by #431 or #2033
Closed

Make ResourcePath compatible with [Var]ZeroVec #243

sffc opened this issue Sep 16, 2020 · 14 comments · Fixed by #431 or #2033
Assignees
Labels
A-performance Area: Performance (CPU, Memory) blocked A dependency must be resolved before this is actionable C-data-infra Component: provider, datagen, fallback, adapters S-epic Size: Major project (create smaller child issues) T-enhancement Type: Nice-to-have but not required

Comments

@sffc
Copy link
Member

sffc commented Sep 16, 2020

Name bikeshed aside, what do you think about

#[derive(Copy, Clone)]
pub struct LSRV {
  pub language: subtags::Language,
  pub script: Option<subtags::Script>,
  pub region: Option<subtags::Region>,
  pub variant: Option<subtags::Variant>,
}

#[derive(Clone)]
pub struct LanguageIdentifier {
  lsrv: LSRV,
  additional_variants: Box<[subtags::Variant]>,
}

LanguageIdentifier would have a convenience function that lets you get an iterator over the variants; the first variant is pulled from the LSRV, and remaining variants are pulled from the boxed slice.

We can have impl From<LSRV> for LanguageIdentifier. In the other direction, we could have a method LSRV truncate(&self).

@zbraniecki

@sffc sffc added T-enhancement Type: Nice-to-have but not required C-locale Component: Locale identifiers, BCP47 discuss Discuss at a future ICU4X-SC meeting labels Sep 16, 2020
@sffc sffc self-assigned this Sep 17, 2020
@sffc sffc added this to the 2020 Q4 milestone Sep 17, 2020
@sffc
Copy link
Member Author

sffc commented Oct 22, 2020

@zbraniecki had some ideas on using an enum to make LanguageIdentifier copyable, by toggling it between a single-variant-subtag and a multi-variant-subtag.

@sffc
Copy link
Member Author

sffc commented Oct 22, 2020

Additional discussion: consider keeping the input order of variant subtags when constructing objects, and sort them as a separate step, which can be the same step as the full locale identifier canonicalization (#218). When implementing the separate sorting step, we can have functions to perform either UTS 35 sorting or BCP 47 sorting.

CC @markusicu @macchiati

@markusicu
Copy link
Member

About BCP 47 “correct order”:

https://www.rfc-editor.org/rfc/rfc5646.html#section-4.1 page 57 point 6 “Use variant subtags sparingly and in the correct order.”

See also https://www.rfc-editor.org/rfc/rfc5646.html#section-3.1.8 page 31, 4th paragraph “The 'Prefix' also indicates when variant subtags make sense when used together [...] and what the relative ordering of variants is supposed to be.”

IMO the BCP 47 variant algorithm is a bit baroque (and not very explicitly spelled out). That's why I think it's defensible to preserve the input order of variants, except that an equality test and matching etc. then have to deal with potentially mis-ordered variant subtags. (And sort or treat as sets etc.)

Selecting “the first variant subtag” is therefore error-prone.

Note that for the example of "sl-rozaj-biske-1994" LDML ASCII order of variant subtags is the opposite order of what BCP 47 calls “correct”.

If using an LSRV structure, then I would move all of the variant subtags into the extended data structure if there is more than one.

@sffc sffc assigned zbraniecki and unassigned sffc Oct 22, 2020
@aphillips
Copy link
Member

aphillips commented Oct 22, 2020

Forgive the long semi-on-topic comment.

The baroque nature of variant ordering in BCP47 is necessary due to the use of Prefix fields. For the example tag (sl-rozaj-biske-1994), the subtag biske has a Prefix of sl-rozaj. It's technically wrong to use the biske subtag in front of rozaj and, indeed, biske is a dialect of rozaj (Resian). Similarly the 1994 subtag has these Prefix fields:

Prefix: sl-rozaj
Prefix: sl-rozaj-biske
Prefix: sl-rozaj-njiva
Prefix: sl-rozaj-osojs
Prefix: sl-rozaj-solba

The Slovenian dialect subtags are indeed complicated and do represent the most (um...) "expressive" use of the subtag ordering scheme. They are also some of the earliest variants registered. Of the other variants, the only other one that has an ordering with another variant (as opposed to just a non-variant tag prefix) that is currently registered is heploc, whose prefix says ja-Latn-hepburn but for which there is a comment saying that ja-Latn-alalc97 is "the preferred tag" (as the subtag has been deprecated).

The issue here is that there is nothing that would prevent new variant subtags that depend on other subtags being registered tomorrow--and the existence of some that already have a dependency structure seems like something not to mess with. Most of the other variants that share a prefix don't make sense together and thus the order only matters to the content's author.

When it comes to locale-based operations, variant subtags are primarily a nuisance. I'm glad they exist as a "relief valve" for random sub-language registration requests. That said, they are mainly "default ignorable". I am kind of curious why we need to expend any effort reordering them--BCP47 "remove from right" fallback says that the the subtag order matters. If you give me a tag, I should expect it to match or be processed in the subtag order received (and not some other order). Thus the first subtag is the first one received. Woe betide you if you use en-boont-basiceng and expect it to match en-basiceng.

@nciric
Copy link
Contributor

nciric commented Oct 23, 2020 via email

@aphillips
Copy link
Member

It seems in all the years we only got 2 (Addison's example), and they cover a very small slice of population.

That's not precisely true. In all these years there have only been that many prefixed-by-another-variant variants. But there are plenty of variants (most with only a few exceptions) that have a non-variant prefix and quite a number of these share a Prefix with another variant and thus might reasonably be used together. In these cases, the user choosing the tag defines the order/dependency of meaning. This seems inconvenient, since it means that en-boont-basiceng != en-basiceng-boont even though they apparently mean much the same thing.

Variants by their very nature only cover a very small segment of the population--they are mostly, as I said, "default ignorable". Are any of the variant subtags useful in CLDR?

Note that the Prefix field is "sort-of" normative in BCP47, but a tag can be considered valid even if it ignores the Prefix field, e.g. tlh-Cyrl-AQ-boont is valid even though boont has a prefix of en and ought not be used with any other language tag.

I guess what I'm saying is: variants are an ordered list of opaque subtags. Unless/until a variant has meaning in a locale (which I don't think any do?) they are just part of the JSRV fallback/matching chain that might influence resource selection but not much else. Take them in the order given, casefold lowercase, and don't do much (preferably anything) else with them.

But I think it is a bad idea to limit them to 1 or apply reordering/lose the ordering, since that's not compatible with BCP47. Is there a reason we'd break that which I'm not understanding?

@sffc
Copy link
Member Author

sffc commented Oct 26, 2020

But I think it is a bad idea to limit them to 1 or apply reordering/lose the ordering, since that's not compatible with BCP47. Is there a reason we'd break that which I'm not understanding?

I agree about (not) reordering.

My motivation for creating a new type alongside of Locale and LanguageIdentifier that limits the number of variant subtags is because it's desirable in Rust to have a type that is copyable (via memcpy) without the possibility that it holds a reference to memory somewhere else. This is why I was proposing LSRV. It's not a full Unicode Language Identifier, but it would get the job done for 99.9% of cases, including 100% of cases that come from CLDR.

@sffc
Copy link
Member Author

sffc commented Nov 20, 2020

2020-11-20: Put LSRV in the icu_provider crate for now.

For the sorting (or not sorting) of variants, @zbraniecki pointed out that it comes down to BCP47 not distinguishing between multiple variants and multipart variants. I suggested that we can solve this problem with data, in #218.

@sffc sffc assigned sffc and unassigned zbraniecki Nov 20, 2020
@sffc sffc added A-performance Area: Performance (CPU, Memory) and removed discuss Discuss at a future ICU4X-SC meeting labels Nov 20, 2020
@sffc sffc added C-data-infra Component: provider, datagen, fallback, adapters and removed C-locale Component: Locale identifiers, BCP47 labels Dec 4, 2020
@sffc
Copy link
Member Author

sffc commented Dec 9, 2020

This issue can be done alongside #244. Waiting on #196 first.

@sffc sffc added the blocked A dependency must be resolved before this is actionable label Dec 9, 2020
@sffc sffc removed the blocked A dependency must be resolved before this is actionable label Dec 23, 2020
@sffc
Copy link
Member Author

sffc commented Dec 23, 2020

I decided to do this a little differently. I'm going to model the new ResourceOptions as a non-copyable bag of optional fields. In this model, we don't need LanguageIdentifier to be copyable.

@sffc
Copy link
Member Author

sffc commented Nov 19, 2021

I'm reopening this issue because now we have a concrete reason to try to make ResourcePath Sized and Copy. It is used as a key in the ZeroMap, and Sized+Copy keys are measurably faster than unsized ones like str.

@sffc sffc reopened this Nov 19, 2021
@sffc sffc added the S-medium Size: Less than a week (larger bug fix or enhancement) label Nov 19, 2021
@sffc sffc changed the title New idea for LSRV Make ResourcePath compatible with ZeroVec via LSRV type Nov 19, 2021
@sffc
Copy link
Member Author

sffc commented Nov 20, 2021

On this subject, it does appear that a small but nontrivial fraction of the ICU4X data loading infrastructure is spent in the "resource_path_to_string" function. Ideally we can avoid the allocation of a string when fetching data from the data provider.

@sffc
Copy link
Member Author

sffc commented Dec 21, 2021

I think we should keep ResourcePath variable-width and ResourceKey fixed-width. However, we should:

  1. Allow ResourceKey to be represented in 4 bytes (Revisit resource_key identifier being TinyStr16 #1148)
  2. Make ResourcePath implement VarULE, which will help us remove the expensive resource_path_to_string function.

@sffc sffc changed the title Make ResourcePath compatible with ZeroVec via LSRV type Make ResourcePath compatible with [Var]ZeroVec Dec 21, 2021
@sffc sffc added S-epic Size: Major project (create smaller child issues) and removed S-medium Size: Less than a week (larger bug fix or enhancement) labels Feb 1, 2022
@sffc sffc removed the v1 label Apr 1, 2022
@sffc sffc modified the milestones: ICU4X 0.6, ICU4X 1.0 (Features) May 25, 2022
@sffc sffc added the blocked A dependency must be resolved before this is actionable label Jun 24, 2022
@sffc
Copy link
Member Author

sffc commented Jun 24, 2022

Blocked by #2032

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-performance Area: Performance (CPU, Memory) blocked A dependency must be resolved before this is actionable C-data-infra Component: provider, datagen, fallback, adapters S-epic Size: Major project (create smaller child issues) T-enhancement Type: Nice-to-have but not required
Projects
None yet
5 participants