-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make LanguageIdentifier and/or Locale compatible with [Var]ZeroVec #831
Comments
2021-07-29:
Conclusion: Keep this issue open and fix it when it is needed. |
I have a proposal for how to do this. I will use The VarULE type for Locale can be All ICU4X APIs that take an immutable Locale should take a An optional extension, for which I would advocate strongly, is that
To be clear, the whole premise of this post is that we want to store Locale and LanguageIdentifier in a VarZeroVec, and in order to do this, we need a cheap way to get the VarULE. Currently, we stringify the Locale or LanguageIdentifier whenever we need to use it as a VZV key, which is far from ideal. If we can have "pre-computed" strings float along with Locale or LanguageIdentifier objects, we can be a lot* more efficient. * Based on preliminary observational data that the stringification of the resource path is the most expensive part of the data loading machinery outside of Serde. Needs approval from: |
I think this is a good plan. Perhaps |
I don't like this direction. It seems like a very significant increase in complexity of the very fundamental type of the system. Every Locale will contain stringified version of itself at all times enabling potential inconsistency between fields and strings and requiring updating both at all times. My read is that the direction you are putting us on is that esoteric internals of data provider machinery are reshaping fundamental types of ICU4x in ways that makes them complicated. That complication is only justified when the type is considered for data provider use. I would like to make sure that Locale or Language identifier are optimal canonical types akin to String, Vec, Duration etc I understand that there is a tension between the two and that we are seeking the balance but the proposal feels analogous to me as if stdlib proposed that String contains additional fields that are only necessary for when used in a std::collections which is sufficiently core to stdlib that may justify increased complexity and additional cost for other uses. I'm wondering if we should accept that data provider is not a good candidate to use such primitives and we don't use String, Vec in it for a reason and we shouldn't use Locale in it. Instead, Data Provider should use its own type,.optimized for its own use, and aim at cheap conversion to "public" type. Or, maybe we should do the reverse, follow what you suggested, and introduce a new type, meant for public use that is akin to the current Locale? |
Actually, my model was that locale would primarily be a wrapper around a string, and the other fields can exist as optimizations but need not be so. As in, we should be looking at locales as strings first, potentially with parsed offsets stored in. Most use of the locale is just being passed around and tested for equality, which can happen via the string. |
Ah, I think this is where we disagree. I see Locale as a data structure that can be stringified, you see it as a string with some guarantees. I think there's a tradeoff to both approaches, I'm concerned that you're focused on your immediate use case and in result biased in evaluating the optimal solution.
I don't think this is accurate for all use cases. Efficient storing, cheap canonicalization and low memory cost of allocating is another. TinyStr is 10k+ times more efficient on many of those operations than Strings are. |
Eh, I wouldn't really consider this "my" use case, it's not something I've been thinking about that much. Overall I am okay with the approach proposed here but I'm not championing it. In that case I suspect having a separate LocaleStr type that is used in data (and can be easily converted) is probably the way to go. This does still mean LocaleCanonicalizer would be using string locales in its data. |
I did a mental shortcut here, I'm sorry. I meant "use case you are considering".
I think the canonicalizer may want to handle both with separate pathways. |
Thanks for the replies. I'll reply point-by-point:
It's a new invariant that we need to enforce, yes, but I wouldn't call it a "significant increase in complexity". The simple implementation to enforce the new invariant (consistency between the stack fields and the LocaleStr pointer) is to simply re-generate the string whenever a mutation operation happens, and throw out the old string. It's basically just pre-computing and caching the stringification. I would not advocate for anything more complicated than that unless there is a clear need.
A few things to unpack here. First, what I'm trying to do here is to solve one of the oldest problems in ICU4X: The problem of how to efficiently represent and marshal locales, such that they support high efficiency and i18n correctness both inside and outside of the data provider. I have never been completely happy with the shape of Locale in ICU4X, and as a result, I've opened at least a half dozen issues pointing out limitations of the Locale data model in the last two years, the first being #52, one of the first 100 issues of the project. I feel more confident now than I have at any point in the course of the last two years that what I'm proposing solves our problems. Second, the data provider is a primary consumer of locales. Within the ICU4X components, locales serve two primary purposes: (1) to select the best data file based on vertical fallback, and (2) to carry user preferences. I therefore feel that the needs of data provider should, in fact, carry a fairly high level of importance when talking about the shape of the locale classes. Third, more broadly, I would like to stop looking at zero-copy as being an "esoteric internal of data provider machinery". For better or worse, over the course of the last year, zero-copy has become the lingua franca for basically everything data-related in ICU4X. We should embrace that, rather than treating it as an annoyance.
A few more things to unpack. First, I had basically this exact concern about Second, most Rust stdlib types do work well with Rust stdlib collections, because they were designed that way. If Locale is built for ICU4X, then it should work well with ICU4X's collections. Third, many Rust stdlib types have both an Owned version and a zero-copy version. String derefs to
Sure. Let's look at alternatives. I see two general models for how to represent a locale in a [Var]ZeroVec:
Using LSRV has a few problems:
Using The problem it introduces is that the data model of LocaleStr is fundamentally different from Locale (one is a tuple of TinyStr, and the other is a string slice). We need to convert from one model to the other when performing lookup and comparison, for example. We need to be able to check whether a
Yes. This is another approach. However, I would point out that all formatter constructors (like DateTimeFormat::try_new) need the data provider type, because they immediately pass the locale down into the data provider. This is fine, but it means that Locale is essentially downgraded to a "builder" that is not useful on its own when interacting with ICU4X. But maybe that's what we want! |
I believe that is a position that i disagree with, as stated above in response to Manish. Data provider is a primary consumer within icu4x while Locale is a struct useful also outside and independent of it. I'm afraid that this position is fundamentally different from how you and Manish evaluate it, and hence your optimal design proposition is different from mine. |
To be clear: The line you quoted was not intended as a statement of a position; it was intended as a statement of reality. My position is: "the reality of how Locale is being used in ICU4X should influence the design of Locale". |
icu4x::Locale founding code - I presented here alternative use cases with alternative considerations to the ones that you are considering. |
Acknowledged.
My full statement was:
I claim that is an undisputable fact, aka "reality". Note that I say "a primary consumer" (not "the"), and then explain that I mean this to be withinn the ICU4X components. I'd like if we can agree that data provider is, indeed, a primary consumer of the locale class in ICU4X components, so that we can debate the subjective statement that "the reality of how Locale is being used in ICU4X should influence the design of Locale". |
@zbraniecki as I've said before, this is not my position, overall I am rather ambivalent on the final choice being made here. My personal values here are:
Some things which I think are nice to have, but are likely to be in conflict:
Overall I don't care much about the performance of variants: If there is some parsing cost incurred by them that is fine by me, but I care more about parsing costs imposed on language/country/script. Worth clarifying separately: do others feel this way? Personally, I see a wide range of solutions here that satisfy my needs and touch on the requirements above. I'll list them below. For the purposes of the entries below, I shall use A. "LocaleStr everywhere"This is a class of solution where LocaleStr is used everywhere. The rough model here is "locales are primarily strings in interchange, but they may have other things for convenience" A1. Vanilla LocaleStr everywhereThis is Shane's original proposal, which essentially sets Pros:
Cons:
A2. LocaleStr everywhere, some preparsing in LocaleThis is Shane's proposed variant, where Pros:
Cons:
A3. LocaleStr everywhere, but LocaleStr itself is 🧐 fancierThis is a solution where struct LocaleStr {
language: Range<usize>,
script: Range<usize>, // null if empty (same for below)
region: Range<usize>,
variants: Range<usize>,
n_variants: usize, // maybe?
data: str
} where the different elements index into Locale can then just be When deserialized, this type will check all internal invariants, after which we don't need to worry about them Pros:
Cons:
B. Two separate typesHere, To make B1. Very basic LocaleStrHere, Pros:
Cons:
B2. 🧐 Fancier LocaleStr
Pros:
Cons:
B3 🧐 🧐 Very fancy LocaleStr
struct LocaleStr {
lang: TinyAsciiStr<4>,
script: TinyAsciiStr<4>,
region: TinyAsciiStr<4>,
n_variants: usize, // maybe?
variants: str,
}
Pros:
Cons:
As I said, overall I don't have strong opinions on what to pick here, and I've probably not listed all of the options I would be okay with, but here is what I see the design space as being. |
I agree.
I have quite complicated position on this point. I think that in almost all performance critical runtime code this should not be the case, but it's a strong position weakly held.
"performance of variants" is a vague term here. If Locale contains variants, and variants increase memory cost of an instance, is it performance we care about? I think it is. Same for common operations. I think the problem is that we identify "common operations" to be very different between Data Provider and a runtime use of locale logic with no data provider scenario.
Thank you for documenting those options. I think my current mental model leads me to B1, but I wouldn't be opposed to other B's. By that I mean that Data Provider needs fast/cheap serialize/deserialize, but completely doesn't care about maximize/minimize/canonicalize or any advanced matches on subtags (that may change with introduction of language matcher heuristics tho). This furthers me into position that the solution is to separate those types for their respective use cases. |
@zbraniecki ah, to clarify "performance of variants" I meant perf of getters, not memory: basically, are we okay with solutions where getters have to just in time parse variants if needed |
Intuitively - I agree |
Thanks Manish for the summary. Lots of topics to unpack. I will put them under headings. Ultra-Fast ULE-to-StackIf we can hyper-optimize The other direction, going from Locale to LocaleStr, is a less common operation and can be implemented via Data Provider Needs Should Dictate ICU4X Component APIsLet me explicitly state an implied assumption from some of my previous proposals. Within the ICU4X components, locales shall be passed around in the form that DataProvider wants when the function is using the locale to load data. This means that What the above paragraph means is that if we choose a string-like approach in data provider, then my position is that the ICU4X components should also take a string-like argument. This is part of why I was pushing for A2, because then the functions can take Said another way, although I am fine with Obviously, LocaleCanonicalizer and similar classes that use locales for something other than data loading should take the locale in the form that they need. Perhaps this is the source of some of my strong feelings in this thread. I feel like I am being told that the Locale that was designed for LocaleCanonicalizer is the Locale that we should be using by default through ICU4X. But I feel that the Locale we use by default in ICU4X components should be the Locale optimized for DataProvider. Maximized LocalesI should also note that this discussion hinges on the resolution to #1462. It sounds like that thread is currently leaning toward "data provider operates on maximized locales". Therefore, the final argument type that data provider needs may actually end up being something more along the lines of However, this does not decouple us from the problem of a stack type versus a stringy type. VariantsI definitely agree that I don't care too much about "get" operations on variants and extensions. I care much more about performance of other operations in the locale lifecycle. |
Class C: Hybrid LSR + Variants approachI haven't fully thought through the implications, but should we explore something like // assumes that we use TinyStrNeo such that TinyStr is ULE
struct Locale {
language: TinyStr,
script: TinyStr,
region: TinyStr,
variants_and_extensions: Box<VariantsAndExtensionsStr>
}
struct LocaleStr {
language: TinyStr,
script: TinyStr,
region: TinyStr,
variants_and_extensions: VariantsAndExtensionsStr
} |
That was not my intent. I believe we need a type for Data Provider that is optimized for use within it. It seems to me that this is not the same type that should be used as a foundational type for Rust external use.
My mental model leads me toward |
Given #1589, here's what needs to be done on this issue.
These changes will unblock #243. |
We're currently looking at B1/B2/B3. We need a LocaleStr representation that is cheap to compare with Locale (and which is compact in memory). |
Things we have decided on:
Things we have yet to decide on:
We have the B1/B2/B3 options, and to quickly restate them:
Cost for comparisons decreases going down this list, but cost for converting to string goes up. Overall since we do not care too much about string conversions (do we?) I feel like solutions like B3 are more acceptable. However, this is what B3 looks like right now (call this B3α) struct LocaleStr {
lang: TinyAsciiStr<4>,
script: TinyAsciiStr<4>,
region: TinyAsciiStr<4>,
n_variants: usize, // maybe?
variants: str,
} This has two problems:
A slightly more efficient approach (B3β) for An alternate approach (B3γ) might be I'm leaning towards B3β for a first pass. |
I lean toward B1 as a first pass, because a function to perform a fast comparison between Locale and its corresponding BCP-47 string seems like a generally useful feature, and if we have that function, B1 is basically free: we don't even need to make Between the different versions of B3, I lean toward B3γ, which can be implemented after #1443 is done. However, if B1 ends up being sufficiently efficient, maybe we won't even need B3. |
Discussion:
(discussion incomplete) |
Since no one else is taking this issue, I will take it, since I need it. |
The solution I implemented in #1713 is sufficient for lookup of locales in a ZeroVec. Simply store a My solution does not solve loading locales from a ZeroVec. However, the following things can be done:
Note that since I am therefore going to close this issue, because the immediate problem of lookup is solved, and we have several concrete options for the problem of loading. |
We are going to have many situations where we want to make a zerovec of Locale or LanguageIdentifier objects.
The general solution is to serialize Locale or LanguageIdentifer to a string, and then use VarZeroVec as the data store. Alternatively, we could introduce the LSRV struct proposed and then abandoned in #243.
The text was updated successfully, but these errors were encountered: