Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement and re-implement auxiliary keys #3632

Closed
sffc opened this issue Jul 5, 2023 · 27 comments · Fixed by #3872
Closed

Implement and re-implement auxiliary keys #3632

sffc opened this issue Jul 5, 2023 · 27 comments · Fixed by #3872
Assignees
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-epic Size: Major project (create smaller child issues) T-core Type: Required functionality

Comments

@sffc
Copy link
Member

sffc commented Jul 5, 2023

The model for auxiliary keys is discussed in #1441. We need it for at least two components (currency and display names, #3260).

@sffc sffc added T-core Type: Required functionality C-data-infra Component: provider, datagen, fallback, adapters S-medium Size: Less than a week (larger bug fix or enhancement) labels Jul 5, 2023
@sffc sffc added this to the 1.4 Blocking ⟨P1⟩ milestone Jul 5, 2023
@sffc sffc self-assigned this Jul 5, 2023
@sffc
Copy link
Member Author

sffc commented Aug 17, 2023

What separator should we use?

  • @younies - Slash makes sense because "/length/meter" is nice
  • @robertbastian - Is there a performance difference?
  • @sffc - Probably not
  • @Manishearth - Maybe we should allow slash inside the aux key. But then it could make sense for the main separator not be slash.
  • @sffc - If we allow slashes inside the aux key then probably the main separator should be / so that directory structures make more sense
  • @robertbastian - This assumes FsDataProvider will use to_string as is, it could just to_string the locale and aux key separately
  • @younies - If all are / it looks like a URL which is good
  • @Manishearth - FsDataProvider doesn't need to have a 1:1 mapping, but I don't necesarilly think it needs to be clean; it doesn't need to be "path is the path" mapping. It's nice with sorting if the aux keys clump together.
  • @robertbastian - We've been talking about single aux keys. We could encode layered aux keys on the API. So then length/meter is exposed as a multipart aux key. And then we should use the same separator for everything.
  • @sffc - Let me explain the sorting property. en$USD < en-GB and also en < en-GB so we can compare the locid portion and that implies the sort order for the whole string, and we only need to look at the aux key if the locids are exactly equal. But en/USD > en-GB so we can no longer separate those two steps in the comparison operation.
  • @sffc - Is there anyone here who prefers a character other than / if the sorting code weren't an issue?
  • @Manishearth - I do like having the clean sorting property.
  • @robertbastian - Understand that you want sorting consistency, don't actually think that's the case. We just have to generate our data in a way that sorts the same. E.g. in baked provider we can just define our own sort that doesn't need to match string sorting
  • @sffc - I def think that the sort order where string is same as semantic is v important. Def cases where you want to serialize as strings and do easy binary search. On a case sensitive fs it's also useful for the sort order to be the same so the list of files is presorted. E.g. a url data provider may want the sort order to be the same. I think sort order is more important than slashes looking pretty. Not something I think is a hard blocker but giving precedence.
  • @Manishearth - We need data loading infrastructure to be fast, so I would prefer for search and other things to be straightforward. A string binary search is a good way of doing that.
  • @robertbastian - I don't understand the arguments about filesystem or URLs; binary search isn't used there. We use it in bake provider and blob provider. We have upgraded types in those situations so the sort order of the strings doesn't have to match sort order on stringified types.
  • @sffc - The fact that these are locales is not a property that needs to be tracked through the stack. We turn the locales into strings, and it is a useful property that the strings and locales sort in the same order. We shouldn't break that property if we don't need to for a good reason.
  • @robertbastian - The strings in Blob and Baked provider are implementation details to allow fast lookup for DataLocales. We don't need the property that they sort the same as strings
  • @Manishearth - My main constraint is that I want the sort property maintained: DataLocale should sort such that the auxiliary keys get grouped together. Whether or not their string representation also sorts that way is a nice-to-have. We might at some point have locales with and without auxiliary in the same data key, and locales should not mix.

Competing priorities:

a) Locale sort should be the same as string sort (@sffc)
b) Locales with auxiliary keys should be sorted adjacent to the corresponding locale without an auxiliary key (@Manishearth)
c) Slashes are the character that makes most logical sense (@younies)
d) For multipart auxiliary keys, the separators should be the same (@robertbastian)
e) The solution should have maximal performance

Options on the table:

  1. en#length#meter / en+length+meter
    • Satisfies a, b, d, e. Not c
  2. en#length/meter
    • Satisfies a, b, (c), e. Not d
  3. en/length/meter with / sorting before -
    • Satisfies b, c, d, e. Not a
  4. en/length/meter with / sorting after -
    • Satisfies a, c, d, e. Not b
  • Seems option 1 is the best overall. Now we have to pick the separator.
  • @Someone: # has more special meaning in programming langauges and URLs
  • + liked by: @Manishearth @skius @robertbastian
  • # liked by: (@sffc)
  • Decision: en+length+meter

LGTM: @Manishearth @younies @sffc @robertbastian @skius

@sffc
Copy link
Member Author

sffc commented Aug 18, 2023

For future discussion (before 1.3):

It occurred to me that another repr could be ShortVec<TinyStr8>, the same as the repr for variants and extension keywords. However, this would limit const construction to a single 8-byte aux key. On the upside, we could weasel out of all our sorting issues by sticking the aux key in -x- in the DataLocale. There was a much older design that used this model.

@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label Aug 18, 2023
@sffc
Copy link
Member Author

sffc commented Aug 18, 2023

Reopening to finalize the discussion above

@sffc sffc reopened this Aug 18, 2023
@sffc
Copy link
Member Author

sffc commented Aug 22, 2023

Follow-up for datagen filtering: #3907

@sffc
Copy link
Member Author

sffc commented Aug 24, 2023

  • @robertbastian - I was thinking, we could leave it as a string for now, and later we could figure out how it would work, but we might not be able to change that in 2.0?
  • @skius - The 8-character limit would require a different transliterator ID design, because right now it's a BCP-47 t id.
  • @robertbastian - Each subtag is at most 8 characters.
  • @sffc - en-US+foo-bar-baz would be permited, but not en-US+foobarbaz
  • @robertbastian - What other components do we want to use this for: currency, translit, display names
  • @zbraniecki - why aren't we using unicode -t- stuff?
  • @skius - because it's not in DataLocale
  • @sffc - We are using it, just in the aux key, like und+en-t-zh
  • @Manishearth - The transliterator IDs already fit in shortvecs like this, so it doesn't harm the use case for transliterator
  • @sffc - If we added in the subtag restriction, we could serialize the DataLocale as en-US-x-foo-bar-baz
  • @robertbastian - If we did that, can we use the private use extensions inner type? It's an important implementation detail because I would like DataLocale to not diverge too much from Locale
  • @sffc Any opposition to using -x for the subtags
  • @robertbastian - This has the string sorting issue, yes? If the region starts with Y or Z, yes?
  • @sffc - It should work fine; comparison works in terms of subtag comparison.
  • @younies we need more than one subtag, since for the units you can say en-US-mu-...
  • @sffc - DataLocale supports Unicode extension keywords already.
  • @Manishearth - Right now the things that we're talking about are display names (region/variant), transforms, and currencies.
  • @robertbastian - Yes, please try this. Whether we use -x- or + is a bikeshed. Please try whichever is easier.
  • @Manishearth: +1 internal detail
  • @skius - Can we represent multiple auxiliary keys with -x-?
  • @sffc - We could have multiple -x- yes? und-x-foo-x-bar
  • @Manishearth - Or und-x-foo-bar-separato-baz-quux if multiple -x- isn't allowed
  • @zbraniecki - The syntax is: Subtag, 1..=8, s.is_ascii_alphanumeric(),
  • @Manishearth what if the user uses -x themselves
  • @robertbastian Nah, there's no easy way for that to end up in a DataLocale
  • @skius - How do we separate these?
    1. foo-separato-bar => und-x-foo-separato-bar vs
    2. foo, bar => und-x-foo-x-bar
  • @sffc - Is the new -x- thing more desirable than the existing solution?

Conclusion: implement the -x- thing, but defer mostly to the implementer's call.

LGTM: @sffc @Manishearth @skius

@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Aug 24, 2023
@robertbastian
Copy link
Member

Thought: we should doc-hide auxiliary keys in 1.3, everything that uses them is experimental and we should also consider them experimental

@sffc
Copy link
Member Author

sffc commented Oct 13, 2023

I have the -x- auxiliary key in #4144. It makes the subtag comparison code simpler and it also adds what I consider the nice property that we bring back 1-to-1 mapping between Locale and DataLocale.

Given this change, and given that no one has every been super happy with the AuxiliaryKeys name, I'd like to propose that we change it to AuxiliarySubtags.

@sffc sffc added the discuss-priority Discuss at the next ICU4X meeting label Oct 13, 2023
@sffc
Copy link
Member Author

sffc commented Oct 19, 2023

Discussion:

  • Use "auxiliary subtags" as the name
  • Investigate and very likely implement the use of the Private subtags struct from icu_locid inside DataLocale
  • Investigate and very likely remove the public AuxiliaryKeys type from icu_provider

LGTM: @sffc @Manishearth @robertbastian

@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Oct 19, 2023
sffc added a commit that referenced this issue Oct 26, 2023
We still have #3632 open to continue bikeshedding on API design.
@sffc sffc removed this from the 1.4 Blocking ⟨P1⟩ milestone Nov 14, 2023
@sffc sffc moved this from Unclaimed for sprint to Small breakage (defer to end) in icu4x 2.0 May 9, 2024
@robertbastian
Copy link
Member

The changes to auxiliary keys can be done now, as they're experimental, I could get started on those. Unless @sffc is working on it, as he's assigned?

@sffc sffc changed the title Implement auxiliary keys Implement and re-implement auxiliary keys May 16, 2024
@sffc sffc assigned robertbastian and sffc and unassigned sffc May 16, 2024
@sffc
Copy link
Member Author

sffc commented May 16, 2024

I did the original implementation of AuxiliaryKeys. You're welcome to take the implementation of DataKeyAttributes.

robertbastian added a commit that referenced this issue May 21, 2024
In preparation for #3632 

There are two cases:
* `DataLocales` that are passed into ICU4X constructors:
* These are constructed with `locale!("foo").into()`, or `"foo".parse()`
* When changing a constructor to preferences, the code doesn't need to
be changed if we add a `From<Locale>` impl for the preferences
* `DataLocales` that are put into the `DataRequest::locale` field.
* These are constructed with `langid!("foo").into()`, or `"foo".parse()`
* When changing the field to `&LanguageIdentifier`, these can be
clippy-cleaned-up, as the `.into()` will become redundant.

I've taken care to avoid having intermediate `Locale` or
`LanguageIdentifier` variables, everything that is macro-constructed is
immediately `into()`ed. This will simplify find-and-replace later.
@robertbastian
Copy link
Member

Actually due to features, this is quite messy. Currently this is all controlled by the icu_provider/experimental feature, as soon as anything moves out of DataLocale, all consumers will have to have a feature for key attributes. This includes baked data, so I don't think this is feasible before 2.0 after all.

@robertbastian
Copy link
Member

Would struct DataKeyAttributes<'a>(&'a str) cover all our requirements? I think all planned use cases workd with strs: datetime has fixed static strs, transliterator has to create a String anyway, currencies and locale display names will be able to borrow from TinyAsciiStr subtags, what else is there?

@sffc
Copy link
Member Author

sffc commented May 30, 2024

Display names might need to load "en-GB" which is two subtags and requires making a string to glue them into &str. Which might be fine

Datetime might have combinations for date and time like ym0d-jms

Hmmm, &dyn Writeable? (but we don't like trait objects)

Or how about just &[Subtag]

@robertbastian
Copy link
Member

I'd prefer using TinyAsciiStr<8> over Subtag, I want to break the association that data key attributes have anything to do with locales. However I don't like to be restricted to 8 characters, if we used &[&str] that can still be borrowed from locale!("en-GB").

@robertbastian
Copy link
Member

robertbastian commented May 31, 2024

&[&str] (and &[TinyAsciiStr]) is double the pointers for single-element attributes compared to &str in both postcard and baked data.

@sffc
Copy link
Member Author

sffc commented May 31, 2024

&[TinyAsciiStr] has only 1 pointer.

@robertbastian
Copy link
Member

I don't see what the repetition gains us? Having a slice costs 16 bytes.

@sffc
Copy link
Member Author

sffc commented May 31, 2024

&str and &[TinyAsciiStr] are the same size.

The "repetition" (tinystr slice) allows us to represent concepts such as en-GB and ym0d-jms without allocating anything.

I'm not 100% convinced which one is better. Just laying out the facts.

@robertbastian
Copy link
Member

Both en-GB and ym0d-jms are slightly shorter with &str though, because with TinyAsciiStr even a two-character tag takes 8 bytes (if that's the tag length we're going with).

@robertbastian
Copy link
Member

The data representation is a different discussion from the API representation though.

@sffc
Copy link
Member Author

sffc commented May 31, 2024

If we end up using something tinystr-based, it would be nice (and required for performance) to make it in the API, and we might want to remove the string constructors form the API.

robertbastian added a commit that referenced this issue May 31, 2024
#3632

---------

Co-authored-by: Shane F. Carr <shane@unicode.org>
robertbastian added a commit that referenced this issue Jun 14, 2024
#3632 

* baked: use `(&str, &str)` if needed, otherwise `&str` as before
* blob: use U+001E ("record separator") as the separator
* fs: use additional directory level
@robertbastian
Copy link
Member

Conclusion:

pub struct DataMarkerAttributes {
    // Validated to be non-empty ASCII alphanumeric + hyphen
    value: str,
}
  • This is cheapest to compare (unlike SmallVec or other schemes), and most flexible
  • DataRequest uses borrowed types anyway, in iter_requests we return Box<DataMarkerAttributes>

LGTM: @sffc @robertbastian

@sffc sffc closed this as completed Sep 17, 2024
@github-project-automation github-project-automation bot moved this from Small breakage (defer to end) to Done in icu4x 2.0 Sep 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-epic Size: Major project (create smaller child issues) T-core Type: Required functionality
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants