Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AuxiliaryKey to DataLocale #3872

Merged
merged 27 commits into from
Aug 18, 2023
Merged

Add AuxiliaryKey to DataLocale #3872

merged 27 commits into from
Aug 18, 2023

Conversation

sffc
Copy link
Member

@sffc sffc commented Aug 15, 2023

Fixes #3632

@sffc sffc changed the title Initial auxiliary key APIs Add AuxiliaryKey to DataLocale Aug 16, 2023
@sffc sffc marked this pull request as ready for review August 16, 2023 02:24
@sffc
Copy link
Member Author

sffc commented Aug 16, 2023

I think this is ready modulo the decision on the internal data type for the auxiliary key. Did I miss anything?

pub struct AuxiliaryKey {
// DISCUSS: SmallStr? TinyStrAuto?
// DISCUSS: Make this a dynamically sized type so references can be taken?
value: String,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be Cow<str> to allow const-constructing with a data_locale! macro. Also resizing doesn't seem to be use case, so there's no need for separate length and capacity.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that Cow<str> uses a String in the owned variant.

How about something like

enum BoxOrStatic {
    Box(Box<str>),
    Static(&'static str),
    Tiny(TinyAsciiStr<23>),
}

The size of that is 24 bytes, which is the same as a String.

https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=5d7c515c26f12e820c932959599bfb56

The main downside is that code which gets a &str from that thing is going to need to do more branching.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah it does, I somehow thought it'd be Box<str>. That enum looks good.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Manishearth Opinions on the enum before I spend time implementing it? I'm a little concerned about the branching and resulting code size but I haven't measured it yet.

@@ -134,6 +139,10 @@ impl Writeable for DataLocale {
sink.write_str("-u-")?;
self.keywords.write_to(sink)?;
}
if let Some(aux) = self.aux.as_ref() {
sink.write_char('$')?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ has special meaning in string literals in languages we want to target, e.g. Dart. Is there a problem with /?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can probably make / work but it doesn't automatically work in the sorting function strict_cmp (the hot path for data lookup). It's easier to implement $ because '$' < '-'. I could also use # which has the same property.

If we decide that / is what we really want, I'll make it work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the restriction is separator < -, there are !#$%&'*+,. My ranking of these is

  1. + or & are great because the auxiliary key adds something to the data locale
  2. # is a nice neutral separator
  3. !, * could be used as separators, but they're usually numeric operator
  4. % and $ have the escaping issue
  5. ' and , are too typographic

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other options above - would be @, which we already use in data keys, and ;.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented + as decided in #3632 (comment)

Manishearth
Manishearth previously approved these changes Aug 16, 2023
Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this. Same question as Rob about the separator

/// a
/// );
/// }
/// ```
pub fn strict_cmp(&self, other: &[u8]) -> Ordering {
// TODO: Make this work with aux keys
let subtags = other.split(|b| *b == b'-');
let mut pipe_iter = other.split(|b| *b == b'$');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's put the aux key separator in a constant somewhere

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed by adding pub const fn AuxiliaryKey::separator()

@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label Aug 16, 2023
@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Aug 17, 2023
@sffc
Copy link
Member Author

sffc commented Aug 17, 2023

@robertbastian PTAL at the new multi-aux-key APIs.

@@ -468,10 +468,46 @@ impl BakedExporter {
let mut map = BTreeMap::new();
let mut statics = Vec::new();

#[derive(Copy, Clone)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like overkill compared to

&first_locale.to_ascii_uppercase().replace('-', "_").replace(AuxiliaryKey::separator() as char, "__")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged but my version saves an allocation

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the complexity tradeoff here is not worth it for datagen

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also thinking of Either::<[char], [char, char]> which is concise but we don't have an Either dep

Manishearth
Manishearth previously approved these changes Aug 18, 2023
@sffc sffc merged commit 70c4727 into unicode-org:main Aug 18, 2023
25 checks passed
@sffc sffc deleted the auxkey branch August 18, 2023 22:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement and re-implement auxiliary keys
3 participants