Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new Datagen API and CLI with FallbackOptions #4710

Merged
merged 42 commits into from
Mar 27, 2024

Conversation

sffc
Copy link
Member

@sffc sffc commented Mar 19, 2024

@sffc sffc marked this pull request as ready for review March 19, 2024 23:12
@sffc
Copy link
Member Author

sffc commented Mar 19, 2024

This PR doesn't export any APIs, but it plumbs the old APIs through the new APIs with no detectable output diffs. I can merge this now or keep working in this PR.

provider/datagen/src/driver.rs Outdated Show resolved Hide resolved
provider/datagen/src/driver.rs Outdated Show resolved Hide resolved
provider/datagen/src/driver.rs Outdated Show resolved Hide resolved
return false;
};
let mut it = config.locales.iter();
matches!((it.next(), it.next()), (Some(lid), None) if lid == &LocaleWithExpansion::wildcard())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you make wildcard() an associated const you can match against it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I'd rather keep it a function because it's more flexible.

provider/datagen/src/driver.rs Outdated Show resolved Hide resolved
provider/datagen/src/driver.rs Outdated Show resolved Hide resolved
provider/datagen/src/driver.rs Outdated Show resolved Hide resolved
@Manishearth Manishearth removed their request for review March 21, 2024 09:53
provider/datagen/src/driver.rs Outdated Show resolved Hide resolved
pub runtime_fallback_location: Option<RuntimeFallbackLocation>,
pub deduplication_strategy: Option<DeduplicationStrategy>,
/// private: set via constructor
locales: HashSet<LocaleWithExpansion>,
pub und_inclusion: Option<UndInclusion>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you special-casing und? It will either be included due to fallback, or the user can include it through @und.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question; I wanted to make it easier to exclude und, which is desirable when building language packs. (Frank was having this problem building a language pack for Chrome.) Strictly speaking, this can be modeled as a LocaleFamily variant, such as "ancestors and descendants except for und". However, I don't see a super clean way to handle it. The options would be:

  1. --locales en-001 implies und, en, ... and the user needs to specify --without-und to disable und (what this changelist implies)
    • Pros: Puts the special case of und as its own flag for users who need this behavior
    • Cons: Entanglement of options
  2. --locales en-001 implies en, ..., (remove und from LocaleFamily with_ancestors) and --with-und (default behavior) includes und, and the user can specify --without-und to disable the inclusion of und
    • Pros: A cleaner model with less conflict between options
    • Cons: Still an extra flag to imply this behavior
  3. --locales en-001 implies und, en, ..., and a new LocaleFamily type _en-001 implies en, ...
    • Pros: Clean model with all locale resolution handling in the same mechanism
    • Cons: Every locale needs to be prefixed with _ in order to avoid including und. Need to also invent a syntax for the "ancestors, no descendants, no und" mode.
  4. --locales en-001 implies und, en, ..., and a new LocaleFamily type !und removes und from the selection
    • Pros: All locale resolution happens in the same option. New include/exclude syntax could be used to model other things.
    • Cons: Include/exclude syntax could be error-prone. We now have an ordered list of rules instead of an unordered set of families.

Do you see anything else?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was expecting we'd do something like 3. Including whole locale families but excluding und is a niche (and risky) enough behaviour that introducing a new locale family symbol shouldn't be the main problem. If the main use case is language packs, there'd usually only be a single und-less family per invocation anyway.

I don't understand why we need a "ancestors, no descendants, no und" mode. und is an ancestor, and it's not special. If you're building en language packs, you don't want to duplicate en-001 the same way you don't want to duplicate und.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do in a follow-up

provider/datagen/src/driver.rs Show resolved Hide resolved
provider/datagen/src/driver.rs Show resolved Hide resolved
provider/datagen/src/driver.rs Outdated Show resolved Hide resolved
pub runtime_fallback_location: Option<RuntimeFallbackLocation>,
pub deduplication_strategy: Option<DeduplicationStrategy>,
/// private: set via constructor
locales: HashSet<LocaleWithExpansion>,
pub und_inclusion: Option<UndInclusion>,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question; I wanted to make it easier to exclude und, which is desirable when building language packs. (Frank was having this problem building a language pack for Chrome.) Strictly speaking, this can be modeled as a LocaleFamily variant, such as "ancestors and descendants except for und". However, I don't see a super clean way to handle it. The options would be:

  1. --locales en-001 implies und, en, ... and the user needs to specify --without-und to disable und (what this changelist implies)
    • Pros: Puts the special case of und as its own flag for users who need this behavior
    • Cons: Entanglement of options
  2. --locales en-001 implies en, ..., (remove und from LocaleFamily with_ancestors) and --with-und (default behavior) includes und, and the user can specify --without-und to disable the inclusion of und
    • Pros: A cleaner model with less conflict between options
    • Cons: Still an extra flag to imply this behavior
  3. --locales en-001 implies und, en, ..., and a new LocaleFamily type _en-001 implies en, ...
    • Pros: Clean model with all locale resolution handling in the same mechanism
    • Cons: Every locale needs to be prefixed with _ in order to avoid including und. Need to also invent a syntax for the "ancestors, no descendants, no und" mode.
  4. --locales en-001 implies und, en, ..., and a new LocaleFamily type !und removes und from the selection
    • Pros: All locale resolution happens in the same option. New include/exclude syntax could be used to model other things.
    • Cons: Include/exclude syntax could be error-prone. We now have an ordered list of rules instead of an unordered set of families.

Do you see anything else?

pub runtime_fallback_location: Option<RuntimeFallbackLocation>,
pub deduplication_strategy: Option<DeduplicationStrategy>,
/// private: set via constructor
locales: HashSet<LocaleWithExpansion>,
pub und_inclusion: Option<UndInclusion>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was expecting we'd do something like 3. Including whole locale families but excluding und is a niche (and risky) enough behaviour that introducing a new locale family symbol shouldn't be the main problem. If the main use case is language packs, there'd usually only be a single und-less family per invocation anyway.

I don't understand why we need a "ancestors, no descendants, no und" mode. und is an ancestor, and it's not special. If you're building en language packs, you don't want to duplicate en-001 the same way you don't want to duplicate und.

/// Sets this driver to generate the given locales assuming runtime fallback.
///
/// Use the [`langid!`] macro from the prelude to create an
/// explicit list, or [`DatagenProvider::locales_for_coverage_levels`] for CLDR coverage levels.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...or create language families

@sffc
Copy link
Member Author

sffc commented Mar 26, 2024

Plan for this PR:

  • Keep the (verbose) API agreed on in the issue
  • Remove include_und option
  • OK to add LocaleFamily::without_ancestors but could be in a follow-up

LGTM: @robertbastian

@sffc sffc requested a review from a team as a code owner March 26, 2024 22:17
@sffc sffc changed the title Add plumbing for LocalesWithFallback Datagen API Add new Datagen API and CLI with FallbackOptions Mar 27, 2024
provider/datagen/src/bin/datagen/args.rs Outdated Show resolved Hide resolved
.map(|l| l.parse::<LanguageIdentifier>())
.map(|l| {
l.parse::<LanguageIdentifier>()
.map(LocaleFamily::with_descendants)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you're doing this in a follow-up: here, as on the main CLI, we should parse directly to LocaleFamily.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reminder. Working on it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you fixed it in datagen but not here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah no you didn't change it in datagen, that's why CI is failing

provider/datagen/src/bin/datagen/mod.rs Show resolved Hide resolved
provider/datagen/src/driver.rs Outdated Show resolved Hide resolved
provider/datagen/src/driver.rs Outdated Show resolved Hide resolved
provider/datagen/src/driver.rs Outdated Show resolved Hide resolved
Comment on lines 225 to 227
let mut sorted_locales =
locales.iter().map(ToString::to_string).collect::<Vec<_>>();
sorted_locales.sort();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found something cool for this: https://internals.rust-lang.org/t/add-a-vec-sorted-function/11948/3. We already have itertools in datagen

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that it returns std::vec::IntoIter, so it still does the same thing under the hood, just a bit nicer syntax and a bit more prone to error because if you call any other iterator methods like .map() then you end up double-allocating. It also seems a bit risky that std::vec::IntoIter::collect uses specialization. But I can do this if you like.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just think it's nicer because it avoids an intermediate variable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not blocking

ffi/dart/tools/datagen/build.rs Show resolved Hide resolved
@sffc
Copy link
Member Author

sffc commented Mar 27, 2024

@robertbastian CI is passing as of a245cbb. I started adding the LocaleFamily string parsing logic (thanks for the reminder) but don't have it quite ready yet. Do you approve this pending the remainder of the internal plumbing for LocaleFamily in the CLI? Also, if you prefer I can roll back to a245cbb (or c5ddea3, which has some renames and hopefully passes CI) and merge that and put the LocaleFamily parsing in a follow-up.

@robertbastian
Copy link
Member

I approve at c5ddea3, and also the rest if you get CI to pass.

robertbastian
robertbastian previously approved these changes Mar 27, 2024
@@ -94,7 +94,7 @@ fn main() -> eyre::Result<()> {
let locales_intermediate: LanguageIdentifiersOrLocaleFamilies = match config.locales {
config::LocaleInclude::All => LocaleFamilies(vec![LocaleFamily::full()]),
config::LocaleInclude::None => LanguageIdentifiers(vec![]),
config::LocaleInclude::Explicit(set) => LanguageIdentifiers(set.into_iter().collect()),
config::LocaleInclude::Explicit(set) => LocaleFamilies(set.into_iter().collect()),
config::LocaleInclude::CldrSet(levels) => LanguageIdentifiers(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: I think we've been treating cldr sets as families

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This locales_intermediate just preserves the type things come in as. They get converted to LocaleFamilies in with_fallback and they stay as LanguageIdentifiers in without_fallback.

Comment on lines +121 to +124
LocaleFamilies(lfs) => lfs
.into_iter()
.map(|family| family.write_to_string().parse())
.collect::<Result<Vec<LanguageIdentifier>, icu_locid::ParserError>>()?,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This logic is a bit strange but I think it gets the job done. In Clap we parse the list to a Vec<LocaleFamily>, and then here I re-parse it as a Vec<LanguageIdentifier>. I guess another way to do this would be to parse to a Vec<String> in Clap and then pick the concrete type here.

@sffc
Copy link
Member Author

sffc commented Mar 27, 2024

This is ready to land. @robertbastian last approved it modulo making CI/tests pass, which is now the case.

@sffc sffc merged commit 7397d7d into unicode-org:main Mar 27, 2024
30 checks passed
@sffc sffc deleted the datagen-neo-driver branch March 27, 2024 20:33
@sffc sffc removed the request for review from a team March 27, 2024 20:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants