Implement and re-implement auxiliary keys #3632

sffc · 2023-07-05T08:40:17Z

The model for auxiliary keys is discussed in #1441. We need it for at least two components (currency and display names, #3260).

sffc · 2023-08-17T18:43:57Z

What separator should we use?

@younies - Slash makes sense because "/length/meter" is nice
@robertbastian - Is there a performance difference?
@sffc - Probably not
@Manishearth - Maybe we should allow slash inside the aux key. But then it could make sense for the main separator not be slash.
@sffc - If we allow slashes inside the aux key then probably the main separator should be / so that directory structures make more sense
@robertbastian - This assumes FsDataProvider will use to_string as is, it could just to_string the locale and aux key separately
@younies - If all are / it looks like a URL which is good
@Manishearth - FsDataProvider doesn't need to have a 1:1 mapping, but I don't necesarilly think it needs to be clean; it doesn't need to be "path is the path" mapping. It's nice with sorting if the aux keys clump together.
@robertbastian - We've been talking about single aux keys. We could encode layered aux keys on the API. So then length/meter is exposed as a multipart aux key. And then we should use the same separator for everything.
@sffc - Let me explain the sorting property. en$USD < en-GB and also en < en-GB so we can compare the locid portion and that implies the sort order for the whole string, and we only need to look at the aux key if the locids are exactly equal. But en/USD > en-GB so we can no longer separate those two steps in the comparison operation.
@sffc - Is there anyone here who prefers a character other than / if the sorting code weren't an issue?
@Manishearth - I do like having the clean sorting property.
@robertbastian - Understand that you want sorting consistency, don't actually think that's the case. We just have to generate our data in a way that sorts the same. E.g. in baked provider we can just define our own sort that doesn't need to match string sorting
@sffc - I def think that the sort order where string is same as semantic is v important. Def cases where you want to serialize as strings and do easy binary search. On a case sensitive fs it's also useful for the sort order to be the same so the list of files is presorted. E.g. a url data provider may want the sort order to be the same. I think sort order is more important than slashes looking pretty. Not something I think is a hard blocker but giving precedence.
@Manishearth - We need data loading infrastructure to be fast, so I would prefer for search and other things to be straightforward. A string binary search is a good way of doing that.
@robertbastian - I don't understand the arguments about filesystem or URLs; binary search isn't used there. We use it in bake provider and blob provider. We have upgraded types in those situations so the sort order of the strings doesn't have to match sort order on stringified types.
@sffc - The fact that these are locales is not a property that needs to be tracked through the stack. We turn the locales into strings, and it is a useful property that the strings and locales sort in the same order. We shouldn't break that property if we don't need to for a good reason.
@robertbastian - The strings in Blob and Baked provider are implementation details to allow fast lookup for DataLocales. We don't need the property that they sort the same as strings
@Manishearth - My main constraint is that I want the sort property maintained: DataLocale should sort such that the auxiliary keys get grouped together. Whether or not their string representation also sorts that way is a nice-to-have. We might at some point have locales with and without auxiliary in the same data key, and locales should not mix.

Competing priorities:

a) Locale sort should be the same as string sort (@sffc)
b) Locales with auxiliary keys should be sorted adjacent to the corresponding locale without an auxiliary key (@Manishearth)
c) Slashes are the character that makes most logical sense (@younies)
d) For multipart auxiliary keys, the separators should be the same (@robertbastian)
e) The solution should have maximal performance

@robertbastian - I'd prioritise them b > d > c > a
@sffc - I'd prioritize them a > b > d > c
@Manishearth b > a > c > d

Options on the table:

en#length#meter / en+length+meter
- Satisfies a, b, d, e. Not c
en#length/meter
- Satisfies a, b, (c), e. Not d
en/length/meter with / sorting before -
- Satisfies b, c, d, e. Not a
en/length/meter with / sorting after -
- Satisfies a, c, d, e. Not b

Seems option 1 is the best overall. Now we have to pick the separator.
@Someone: # has more special meaning in programming langauges and URLs
+ liked by: @Manishearth @skius @robertbastian
# liked by: (@sffc)
Decision: en+length+meter

LGTM: @Manishearth @younies @sffc @robertbastian @skius

sffc · 2023-08-18T00:27:36Z

For future discussion (before 1.3):

It occurred to me that another repr could be ShortVec<TinyStr8>, the same as the repr for variants and extension keywords. However, this would limit const construction to a single 8-byte aux key. On the upside, we could weasel out of all our sorting issues by sticking the aux key in -x- in the DataLocale. There was a much older design that used this model.

sffc · 2023-08-18T22:48:23Z

Reopening to finalize the discussion above

sffc · 2023-08-22T15:51:37Z

Follow-up for datagen filtering: #3907

sffc · 2023-08-24T18:17:19Z

@robertbastian - I was thinking, we could leave it as a string for now, and later we could figure out how it would work, but we might not be able to change that in 2.0?
@skius - The 8-character limit would require a different transliterator ID design, because right now it's a BCP-47 t id.
@robertbastian - Each subtag is at most 8 characters.
@sffc - en-US+foo-bar-baz would be permited, but not en-US+foobarbaz
@robertbastian - What other components do we want to use this for: currency, translit, display names
@zbraniecki - why aren't we using unicode -t- stuff?
@skius - because it's not in DataLocale
@sffc - We are using it, just in the aux key, like und+en-t-zh
@Manishearth - The transliterator IDs already fit in shortvecs like this, so it doesn't harm the use case for transliterator
@sffc - If we added in the subtag restriction, we could serialize the DataLocale as en-US-x-foo-bar-baz
@robertbastian - If we did that, can we use the private use extensions inner type? It's an important implementation detail because I would like DataLocale to not diverge too much from Locale
@sffc Any opposition to using -x for the subtags
@robertbastian - This has the string sorting issue, yes? If the region starts with Y or Z, yes?
@sffc - It should work fine; comparison works in terms of subtag comparison.
@younies we need more than one subtag, since for the units you can say en-US-mu-...
@sffc - DataLocale supports Unicode extension keywords already.
@Manishearth - Right now the things that we're talking about are display names (region/variant), transforms, and currencies.
@robertbastian - Yes, please try this. Whether we use -x- or + is a bikeshed. Please try whichever is easier.
@Manishearth: +1 internal detail
@skius - Can we represent multiple auxiliary keys with -x-?
@sffc - We could have multiple -x- yes? und-x-foo-x-bar
@Manishearth - Or und-x-foo-bar-separato-baz-quux if multiple -x- isn't allowed
@zbraniecki - The syntax is: Subtag, 1..=8, s.is_ascii_alphanumeric(),
@Manishearth what if the user uses -x themselves
@robertbastian Nah, there's no easy way for that to end up in a DataLocale
@skius - How do we separate these?
1. foo-separato-bar => und-x-foo-separato-bar vs
2. foo, bar => und-x-foo-x-bar
@sffc - Is the new -x- thing more desirable than the existing solution?

Conclusion: implement the -x- thing, but defer mostly to the implementer's call.

LGTM: @sffc @Manishearth @skius

robertbastian · 2023-09-05T19:12:54Z

Thought: we should doc-hide auxiliary keys in 1.3, everything that uses them is experimental and we should also consider them experimental

sffc · 2023-10-13T00:04:37Z

I have the -x- auxiliary key in #4144. It makes the subtag comparison code simpler and it also adds what I consider the nice property that we bring back 1-to-1 mapping between Locale and DataLocale.

Given this change, and given that no one has every been super happy with the AuxiliaryKeys name, I'd like to propose that we change it to AuxiliarySubtags.

sffc · 2023-10-19T18:07:06Z

Discussion:

Use "auxiliary subtags" as the name
Investigate and very likely implement the use of the Private subtags struct from icu_locid inside DataLocale
Investigate and very likely remove the public AuxiliaryKeys type from icu_provider

LGTM: @sffc @Manishearth @robertbastian

We still have #3632 open to continue bikeshedding on API design.

robertbastian · 2024-05-16T12:27:30Z

The changes to auxiliary keys can be done now, as they're experimental, I could get started on those. Unless @sffc is working on it, as he's assigned?

sffc · 2024-05-16T15:13:35Z

I did the original implementation of AuxiliaryKeys. You're welcome to take the implementation of DataKeyAttributes.

In preparation for #3632 There are two cases: * `DataLocales` that are passed into ICU4X constructors: * These are constructed with `locale!("foo").into()`, or `"foo".parse()` * When changing a constructor to preferences, the code doesn't need to be changed if we add a `From<Locale>` impl for the preferences * `DataLocales` that are put into the `DataRequest::locale` field. * These are constructed with `langid!("foo").into()`, or `"foo".parse()` * When changing the field to `&LanguageIdentifier`, these can be clippy-cleaned-up, as the `.into()` will become redundant. I've taken care to avoid having intermediate `Locale` or `LanguageIdentifier` variables, everything that is macro-constructed is immediately `into()`ed. This will simplify find-and-replace later.

robertbastian · 2024-05-21T10:43:47Z

Actually due to features, this is quite messy. Currently this is all controlled by the icu_provider/experimental feature, as soon as anything moves out of DataLocale, all consumers will have to have a feature for key attributes. This includes baked data, so I don't think this is feasible before 2.0 after all.

robertbastian · 2024-05-30T18:53:42Z

Would struct DataKeyAttributes<'a>(&'a str) cover all our requirements? I think all planned use cases workd with strs: datetime has fixed static strs, transliterator has to create a String anyway, currencies and locale display names will be able to borrow from TinyAsciiStr subtags, what else is there?

sffc · 2024-05-30T20:48:14Z

Display names might need to load "en-GB" which is two subtags and requires making a string to glue them into &str. Which might be fine

Datetime might have combinations for date and time like ym0d-jms

Hmmm, &dyn Writeable? (but we don't like trait objects)

Or how about just &[Subtag]

robertbastian · 2024-05-31T06:46:37Z

I'd prefer using TinyAsciiStr<8> over Subtag, I want to break the association that data key attributes have anything to do with locales. However I don't like to be restricted to 8 characters, if we used &[&str] that can still be borrowed from locale!("en-GB").

robertbastian · 2024-05-31T08:17:10Z

&[&str] (and &[TinyAsciiStr]) is double the pointers for single-element attributes compared to &str in both postcard and baked data.

sffc · 2024-05-31T18:10:27Z

&[TinyAsciiStr] has only 1 pointer.

robertbastian · 2024-05-31T18:20:25Z

I don't see what the repetition gains us? Having a slice costs 16 bytes.

sffc · 2024-05-31T18:23:21Z

&str and &[TinyAsciiStr] are the same size.

The "repetition" (tinystr slice) allows us to represent concepts such as en-GB and ym0d-jms without allocating anything.

I'm not 100% convinced which one is better. Just laying out the facts.

robertbastian · 2024-05-31T18:28:49Z

Both en-GB and ym0d-jms are slightly shorter with &str though, because with TinyAsciiStr even a two-character tag takes 8 bytes (if that's the tag length we're going with).

robertbastian · 2024-05-31T18:29:46Z

The data representation is a different discussion from the API representation though.

sffc · 2024-05-31T20:17:29Z

If we end up using something tinystr-based, it would be nice (and required for performance) to make it in the API, and we might want to remove the string constructors form the API.

#3632 --------- Co-authored-by: Shane F. Carr <shane@unicode.org>

#3632 * baked: use `(&str, &str)` if needed, otherwise `&str` as before * blob: use U+001E ("record separator") as the separator * fs: use additional directory level

robertbastian · 2024-06-20T13:44:05Z

Conclusion:

pub struct DataMarkerAttributes {
    // Validated to be non-empty ASCII alphanumeric + hyphen
    value: str,
}

This is cheapest to compare (unlike SmallVec or other schemes), and most flexible
DataRequest uses borrowed types anyway, in iter_requests we return Box<DataMarkerAttributes>

LGTM: @sffc @robertbastian

#3632

sffc added T-core Type: Required functionality C-data-infra Component: provider, datagen, fallback, adapters S-medium Size: Less than a week (larger bug fix or enhancement) labels Jul 5, 2023

sffc added this to the 1.4 Blocking ⟨P1⟩ milestone Jul 5, 2023

sffc self-assigned this Jul 5, 2023

robertbastian modified the milestones: 1.4 Blocking ⟨P1⟩, 1.3 Blocking ⟨P1⟩ Jul 25, 2023

Manishearth mentioned this issue Aug 15, 2023

Split DateSymbols data #3865

Closed

sffc mentioned this issue Aug 15, 2023

Add AuxiliaryKey to DataLocale #3872

Merged

sffc added the discuss Discuss at a future ICU4X-SC meeting label Aug 18, 2023

sffc closed this as completed in #3872 Aug 18, 2023

sffc reopened this Aug 18, 2023

sffc removed the discuss Discuss at a future ICU4X-SC meeting label Aug 24, 2023

sffc mentioned this issue Sep 21, 2023

Make auxiliary key code experimental #4061

Merged

sffc modified the milestones: 1.3 Blocking ⟨P1⟩, 1.4 Blocking ⟨P1⟩ Sep 21, 2023

sffc mentioned this issue Oct 12, 2023

Change aux key to -x- #4144

Merged

sffc added the discuss-priority Discuss at the next ICU4X meeting label Oct 13, 2023

sffc removed the discuss-priority Discuss at the next ICU4X meeting label Oct 19, 2023

sffc mentioned this issue Oct 25, 2023

Add AuxiliaryKeys::from_subtag #4223

Merged

sffc added a commit that referenced this issue Oct 26, 2023

Add AuxiliaryKeys::from_subtag (#4223)

e826a2e

We still have #3632 open to continue bikeshedding on API design.

sffc removed this from the 1.4 Blocking ⟨P1⟩ milestone Nov 14, 2023

sffc moved this from Unclaimed for sprint to Small breakage (defer to end) in icu4x 2.0 May 9, 2024

robertbastian mentioned this issue May 16, 2024

Unifying DataLocale construction #4912

Merged

sffc changed the title ~~Implement auxiliary keys~~ Implement and re-implement auxiliary keys May 16, 2024

sffc assigned robertbastian and sffc and unassigned sffc May 16, 2024

robertbastian mentioned this issue May 31, 2024

Moving AuxiliaryKeys to DataRequest as DataKeyAttributes #4981

Merged

robertbastian added a commit that referenced this issue May 31, 2024

Moving AuxiliaryKeys to DataRequest as DataKeyAttributes (#4981)

0ab2630

#3632 --------- Co-authored-by: Shane F. Carr <shane@unicode.org>

robertbastian mentioned this issue Jun 1, 2024

Replace DataLocale by LanguageIdentifier on DataRequest #4984

Closed

robertbastian mentioned this issue Jun 11, 2024

Remove -x- encoding for marker attributes #5038

Merged

robertbastian added a commit that referenced this issue Jun 14, 2024

Remove -x- encoding for marker attributes (#5038)

63fa8da

#3632 * baked: use `(&str, &str)` if needed, otherwise `&str` as before * blob: use U+001E ("record separator") as the separator * fs: use additional directory level

robertbastian mentioned this issue Jul 31, 2024

Attribute filtering during export #5320

Merged

robertbastian added a commit that referenced this issue Aug 6, 2024

Attribute filtering during export (#5320)

35ef707

#3632

sffc closed this as completed Sep 17, 2024

github-project-automation bot moved this from Small breakage (defer to end) to Done in icu4x 2.0 Sep 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement and re-implement auxiliary keys #3632

Implement and re-implement auxiliary keys #3632

sffc commented Jul 5, 2023

sffc commented Aug 17, 2023

sffc commented Aug 18, 2023

sffc commented Aug 18, 2023

sffc commented Aug 22, 2023

sffc commented Aug 24, 2023

robertbastian commented Sep 5, 2023

sffc commented Oct 13, 2023

sffc commented Oct 19, 2023

robertbastian commented May 16, 2024

sffc commented May 16, 2024

robertbastian commented May 21, 2024

robertbastian commented May 30, 2024

sffc commented May 30, 2024

robertbastian commented May 31, 2024

robertbastian commented May 31, 2024 •

edited

Loading

sffc commented May 31, 2024

robertbastian commented May 31, 2024

sffc commented May 31, 2024 •

edited

Loading

robertbastian commented May 31, 2024

robertbastian commented May 31, 2024

sffc commented May 31, 2024

robertbastian commented Jun 20, 2024

Implement and re-implement auxiliary keys #3632

Implement and re-implement auxiliary keys #3632

Comments

sffc commented Jul 5, 2023

sffc commented Aug 17, 2023

sffc commented Aug 18, 2023

sffc commented Aug 18, 2023

sffc commented Aug 22, 2023

sffc commented Aug 24, 2023

robertbastian commented Sep 5, 2023

sffc commented Oct 13, 2023

sffc commented Oct 19, 2023

robertbastian commented May 16, 2024

sffc commented May 16, 2024

robertbastian commented May 21, 2024

robertbastian commented May 30, 2024

sffc commented May 30, 2024

robertbastian commented May 31, 2024

robertbastian commented May 31, 2024 • edited Loading

sffc commented May 31, 2024

robertbastian commented May 31, 2024

sffc commented May 31, 2024 • edited Loading

robertbastian commented May 31, 2024

robertbastian commented May 31, 2024

sffc commented May 31, 2024

robertbastian commented Jun 20, 2024

robertbastian commented May 31, 2024 •

edited

Loading

sffc commented May 31, 2024 •

edited

Loading