Support custom / pluggable "formatters" (beside Date/Time/Number ...) #22

Closed
jamuhl opened this issue Jan 28, 2020 · 15 comments
Labels
requirements Issues related with MF requirements list resolve-candidate This issue appears to have been answered or resolved, and may be closed soon.

Comments

@jamuhl

jamuhl commented Jan 28, 2020

While having the built-in formatters (numbers, dates, lists, relative dates) is awesome, sometimes there is a need to define a custom format (as simple as lowercasing, uppercasing, ...).

Example:

"You are {what, uppercase}."

i18next uses a function of the form (value, format, lng, options) => value to allow those (https://github.com/i18next/i18next/blob/master/src/defaults.js#L55)
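
For illustration, here is a minimal sketch of what such a developer-supplied format function could look like. Only the `(value, format, lng, options) => value` shape comes from the i18next default linked above. The small `interpolate` helper and the way it parses `{name, format}` placeholders are assumptions made to keep the example self-contained; they are not i18next's actual interpolation code.

```javascript
// Developer-supplied formatter following the (value, format, lng, options) => value shape.
function format(value, formatName, lng, options) {
  switch (formatName) {
    case "uppercase":
      return String(value).toLocaleUpperCase(lng);
    case "lowercase":
      return String(value).toLocaleLowerCase(lng);
    default:
      // Unknown or missing format names leave the value untouched.
      return String(value);
  }
}

// Hypothetical helper: replaces {name} and {name, format} placeholders in a message.
function interpolate(message, values, lng) {
  return message.replace(
    /\{\s*(\w+)\s*(?:,\s*(\w+)\s*)?\}/g,
    (match, name, formatName) =>
      name in values ? format(values[name], formatName, lng, {}) : match
  );
}

console.log(interpolate("You are {what, uppercase}.", { what: "awesome" }, "en"));
// prints "You are AWESOME."
```

Placeholders without a known value are left as they are, so a missing argument stays visible in the output instead of silently disappearing.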

See previous comments:

@romulocintra romulocintra added the requirements Issues related with MF requirements list label Jan 28, 2020
@Fleker

Fleker commented Jan 28, 2020

Is this supposed to refer to a comment that the translator would implement, or some sort of extensible markup that a custom renderer would handle (and a default renderer would ignore)?

@jamuhl
Author

jamuhl commented Jan 28, 2020

@Fleker not sure. In the case of i18next the developer specifies that custom format code, so I guess the second of your assumptions: markup that a custom renderer would handle (and a default renderer would ignore). (I would just replace the word renderer with formatter, as nothing gets rendered.)

@longlho

longlho commented Jan 29, 2020

Wouldn't HTML markup support encompass this already? Having a function that might modify specific words arbitrarily isn't necessarily safe IMO, since it might change the context of the sentence completely.

@zbraniecki
Member

We have a number of "custom" builtins in Fluent and a number of requests for more. Custom functions are fairly often environment specific. (we're talking about specifying their behavior better)
Here are some examples:

  1. PLATFORM

Firefox/Gecko has a lot of per-platform strings. They're either messages that have a different value depending on the platform, or accesskeys that differ per platform, etc.
They also differ per locale. Some locales use the same message across platforms, others have a separate macOS one but cluster all other platforms, while other locales have a separate Linux one but cluster all others.
Here's how it looks:

In JS we have:

// Map AppConstants.platform to the platform names used as variant keys.
let PLATFORM = () => {
  switch (AppConstants.platform) {
    case "linux":
    case "android":
      return AppConstants.platform;
    case "win":
      return "windows";
    case "macosx":
      return "macos";
    default:
      return "other";
  }
};

// Register the custom function with the bundle.
new FluentBundle(locale, { functions: { PLATFORM } });

and in Rust it looks like this:

#[derive(Debug)]
#[repr(C)]
pub enum FluentPlatform {
    Linux,
    Windows,
    Macos,
    Android,
    Other,
}
    bundle.add_function("PLATFORM", |_args, _named_args| {
        match crate::ffi::FluentBuiltInGetPlatform() {
            FluentPlatform::Linux => "linux".into(),
            FluentPlatform::Windows => "windows".into(),
            FluentPlatform::Macos => "macos".into(),
            FluentPlatform::Android => "android".into(),
            FluentPlatform::Other => "other".into(),
        }
    }).expect("Failed to add a function to the bundle.");

and then localizers can do:
example 1:

enable-password-sync-notification-message =
  { PLATFORM() ->
      [windows] Want your logins everywhere you use { -brand-product-name }? Go to your { -sync-brand-short-name } Options and select the Logins checkbox.
     *[other] Want your logins everywhere you use { -brand-product-name }? Go to your { -sync-brand-short-name } Preferences and select the Logins checkbox.
  }

example 2:

navbar-tooltip-instruction =
    .value = { PLATFORM() ->
        [macos] Pull down to show history
       *[other] Right-click or pull down to show history
    }

example 3:

profiles-opendir = 
    { PLATFORM() ->
        [macos] Show in Finder
        [windows] Open Folder
       *[other] Open Directory
    }

example 4:

findbar-highlight-all2 =
    .label = Highlight All
    .accesskey = { PLATFORM() ->
        [macos] l
       *[other] a
    }
    .tooltiptext = Highlight all occurrences of the phrase

And in some locales, the localizers may not have a distinct term for Preferences in the macOS HIG, so they'd just do:

enable-password-sync-notification-message = Want your logins everywhere you use { -brand-product-name }? Go to your { -sync-brand-short-name } Preferences and select the Logins checkbox.

Example: source in en-US with a custom selector, and Italian and Czech translations without it.

  2. terms/genders as selectors

Fluent has the concept of terms that in some locales have genders and then can be used as selectors. Here's an example of a string in English:

search-results-help-link = Need help? Visit <a data-l10n-name="url">{ -brand-short-name } Support</a>

and the equivalent in Czech:

search-results-help-link =
    Potřebujete pomoc? Navštivte <a data-l10n-name="url">Podporu { -brand-short-name.gender ->
        [masculine] { -brand-short-name(case: "gen") }
        [feminine] { -brand-short-name(case: "gen") }
        [neuter] { -brand-short-name(case: "gen") }
       *[other] aplikace { -brand-short-name }
    }</a>

(I'd prefer each variant to contain the whole sentence, but ignore that for the sake of this conversation).

  3. Formal/Informal
key = { TONE() ->
        [formal] ...
       *[informal] ...
    }

requested

  4. Capitalization
{ CAPITALIZATION(DATETIME($isoString, weekday: "long"), "lower") }
{ CAPITALIZATION(DATETIME($isoString, weekday: "long"), "title") }

This one should eventually be handled by DATETIME via displayContext, but it is an example of a solution that can be implemented locally for a product without having to change the Fluent library or wait for it to add support for display contexts.

  5. Time of the day
greetings = { TIME_OF_THE_DAY() ->
    [morning] Good morning
   *[other] Hello
}
  6. Arithmetic
floor-msg = { $level ->
    [-2] on basement floor B{ $level }
    [-1] on basement floor B
    [0] on ground floor
    [one] on floor 2
   *[other] on floor { ADD($level, 1) }
}

requested

  7. Lists
photo-msg = { LIST($names) } liked your photo.

requested

  8. Screen width
reload-desc = { SCREEN_WIDTH() ->
    [narrow] Warn me before redirect or reload.
   *[wide] Warn me when websites try to redirect or reload the page.
}

Here we were playing with the idea of enabling responsive localization, much like responsive CSS today, which would allow localizers in locales where it matters (say, German, while Chinese doesn't need it) to specify different variants depending on the available space and let Fluent adapt: Demo video

These are just examples from our production and from issues filed in the Fluent repo with requests for "how to solve X".
Custom functions enable users to solve many of those problems without having to wait for advancement of the localization system they use.
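
To make the mechanism concrete, here is a simplified sketch of how a runtime might resolve a message whose selector is backed by a custom function. It is not Fluent's implementation: the `resolve` helper, the message object shape (`selector`, `variants`, `defaultVariant`), and the placeholder locale `xx` are all assumptions for illustration.

```javascript
// Resolve a message that is either a plain string or a selector with variants.
// Hypothetical message shape: { selector, variants, defaultVariant }.
function resolve(message, functions) {
  if (typeof message === "string") return message;
  const key = functions[message.selector]();
  const hasVariant = Object.prototype.hasOwnProperty.call(message.variants, key);
  // Fall back to the default variant when no variant matches the selector value.
  return message.variants[hasVariant ? key : message.defaultVariant];
}

// One locale uses a PLATFORM selector; another gets by with a single string.
const messages = {
  "en-US": {
    "profiles-opendir": {
      selector: "PLATFORM",
      variants: {
        macos: "Show in Finder",
        windows: "Open Folder",
        other: "Open Directory",
      },
      defaultVariant: "other",
    },
  },
  // "xx" is a placeholder locale that needs no per-platform variants.
  xx: { "profiles-opendir": "(single translation for all platforms)" },
};

// Custom functions supplied by the host application (hard-coded here).
const onMac = { PLATFORM: () => "macos" };
const onLinux = { PLATFORM: () => "linux" };

console.log(resolve(messages["en-US"]["profiles-opendir"], onMac));   // Show in Finder
console.log(resolve(messages["en-US"]["profiles-opendir"], onLinux)); // Open Directory
console.log(resolve(messages.xx["profiles-opendir"], onLinux));
```

The calling code is the same for both locales. Whether a message varies by platform is decided inside each locale's data.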

@nbouvrette
Collaborator

I think it depends on which problems we would like to solve. If our focus is on linguistic issues, then I do see the value for "formatters" (or cases), but I think it would be important to predefine them and even simplify them wherever possible. But this seems more of an inflection discussion.

For example, you could use this syntax to automatically format "A or An" based on the value of the variables:

{length, singular {{#, indefiniteArticle} minute walk.} plural {{#, indefiniteArticle} minutes walk.}}

Now if we are talking about non-linguistic examples:

  • Capitalization: this could be interesting to explore in cases where titles in en-US and en-GB could have different expectations, but could also be solved with CSS or other text libraries depending on the scenario.
  • Platform: also looks like this could be handled by logic in the code rather than at the string/syntax level but I would be curious to hear what benefits you see compared to separating those strings
  • Layout or screen width: would also think this is more of a presentation responsibility and have different versions of the strings that can be displayed differently with CSS or other mechanisms

I think that if we don't use this type of feature to focus on linguistic problems, we might end up in the loss of translation memory leverage and higher translation cost.

@zbraniecki
Member

Capitalization: this could be interesting to explore in cases where titles in en-US and en-GB could have different expectations, but could also be solved with CSS or other text libraries depending on the scenario.

How would you resolve that via CSS? If the localizer needs capitalization of a word, how would they communicate it to the CSS?

Platform: also looks like this could be handled by logic in the code rather than at the string/syntax level but I would be curious to hear what benefits you see compared to separating those strings

Again, how would you resolve it in code logic if 90 locales don't need a per-platform selector and one locale does?
If you don't separate concerns, you end up leaking the complexity of the most complex language onto all of them by requiring 91 locales to specify a per-platform string because one locale needs it.

Layout or screen width: would also think this is more of a presentation responsibility and have different versions of the strings that can be displayed differently with CSS or other mechanisms

Similar to the previous one. How would you let locales that need variants provide them, without requiring all locales to provide them?

@nbouvrette
Collaborator

How would you resolve that via CSS? If the localizer needs capitalization of a word, how would they communicate it to the CSS?

You are right that it would be impossible for the linguist to do this on their end, but I see only 4 practical scenarios where this could apply:

  1. The variable's values are also present in other strings and can be capitalized manually there (of course if it's used in different contexts this could create other issues...)

  2. Presumably, this is not just a one-off style for one language, but the linguist could ask the engineer to add HTML markup around the variable, and then apply conditional styling only for this locale by tracking locale metadata at the document level. Of course, having communication between different parties like this does imply a solid process (and/or tools) in place to allow that. Here is what the output of such communication could look like:

<div class="what">You are <em>{what}</em>.</div>
html[lang=en-gb] .what em {
	text-transform: uppercase;
}
  3. Similar to the previous one, if we are talking about general capitalization (for example, en-US expects capitalized headers), you could have one version of the English strings for both en-US and en-GB and tweak headers with CSS:
html[lang=en-us] h1 {
	text-transform: capitalize;
}
  4. And the last option would be to not use a variable at all and then the linguist could capitalize it in the original string. This one is actually quite important because bad concatenation is one of the worst i18n offenders I have seen in use. When variables are used for the wrong reasons, it effectively breaks i18n.

But, I do think there is something to explore with capitalization. I would just recommend documenting good practical use cases upfront (and maybe even guidelines that we could re-use later in a document) before implementing such a feature, otherwise, it could also end up being used for the wrong reasons.

@nbouvrette
Collaborator

Platform: also looks like this could be handled by logic in the code rather than at the string/syntax level but I would be curious to hear what benefits you see compared to separating those strings

Again, how would you resolve it in code logic if 90 locales don't need a per-platform selector and one locale does?

But if I understand correctly, taking the platform example, you are saying that a platform (operating system) would have completely different behavior in 1 locale only? Do you have an example for this? Unless I missed something, I could not find it in your original example and I'm having a hard time picturing it.

The way I see this, and I don't have a lot of personal experience with this scenario, but you would possibly have 1 string per platform when it comes to a platform related topic. For example:

opendir-linux = Open Directory
opendir-windows = Open Folder
opendir-macos = Show in Finder
opendir-android = Open Directory
showString(`opendir-${PLATFORM}`);

Layout or screen width: would also think this is more of a presentation responsibility and have different versions of the strings that can be displayed differently with CSS or other mechanisms

Similar to the previous one. How would you let locales that need variants provide them, without requiring all locales to provide them?

You are right, TMSes expect symmetrical input/output keys when translating strings which is why Fluent's Multi-variant Message can be quite powerful. The main challenges I see around it, as it works today (having little experience using the syntax):

  • It seems to be used a lot to solve non-linguistic problems. For example for the layout, you could solve it this way:
reload-desc-narrow = Warn me before redirect or reload.
reload-desc-wide = Warn me when websites try to redirect or reload the page.
<p class="narrow">{reload-desc-narrow}</p>
<p class="wide">{reload-desc-wide}</p>
.wide {
	display: none;
}

@media (min-width: 30rem) {
	.wide {
		display: initial;
	}

	.narrow {
		display: none;
	}
}

You could have potential repetition or unused strings in some languages, but the solution is simple and requires no markup while fitting nicely within existing TMSes with good translation memory leverage.

  • The other challenge is that for linguistic problems, there are no predefined variants, which can make it very challenging to keep them consistent and/or to validate/audit them. But this is why, to me, this seemed more of a discussion to continue in the inflection thread.

@zbraniecki
Member

zbraniecki commented Feb 2, 2020

But if I understand correctly, taking the platform example, you are saying that a platform (operating system) would have completely different behavior in 1 locale only? Do you have an example for this?

Sure.

downloads-shortcut =
    .key = J

one-off in locale X:

downloads-shortcut =
    .key = { PLATFORM() ->
        [linux] Y
       *[other] J
    }

The result is that a localizer can select a different shortcut for a given platform if needed, without requiring developers to alter the code and/or imposing the burden of managing a Linux-specific string on all locales.

The way I see this, and I don't have a lot of personal experience with this scenario, but you would possibly have 1 string per platform when it comes to a platform related topic. For example:

Your example requires that all locales provide all four strings so that some of them can use some of the variants.

It seems to be used a lot to solve non-linguistic problems. For example for the layout, you could solve it this way:

The example solution you're providing requires all locales to provide narrow/wide variants, while only several may need it.

I think our conversation boils down to an observation that drove Fluent's design: the factors that affect the ability to produce a high-quality translation of a translation unit differ per locale.
With 60, 100, or more locales in play, you are either asking the developer to establish a social contract with the localizer based on a union of all the factors, or you allow per-locale variations.

Fluent is heavy on the latter side because we wanted to offer the flexibility while minimizing the "leaking" of complexity from one locale to another or from a locale to a developer.

Historically at Mozilla we used an approach similar to the one you're giving and gettext also used that (if any locale needs a plural, all locales provide a plural).
This resulted not only in the issue I described above, which required developers to make decisions they're not well equipped to make (should we provide a per-width string (2)? per-platform (2x4)? per-gender (2x4x3)? where do we stop?), but also in an additional problem: the developer is not even well equipped to decide whether there should be variants and what kind of variants there should be.
They usually end up taking their personal linguistic skills and trying to extrapolate them to other cultures.
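
The growth in that parenthetical is multiplicative, which a few lines of arithmetic make visible. This is only a sketch: the factor names and sizes mirror the (2), (2x4), (2x4x3) figures above, and the locale count of 100 is an assumed round number.

```javascript
// Number of variants each factor contributes, as in the figures above.
const factors = { width: 2, platform: 4, gender: 3 };

// Strings one locale must provide for a single message if every
// combination of factors gets its own variant.
function variantCount(factorSizes) {
  return Object.values(factorSizes).reduce((product, size) => product * size, 1);
}

const perLocale = variantCount(factors);
const locales = 100; // assumed locale count

console.log(perLocale);           // 24
console.log(perLocale * locales); // 2400
```

Under the "union of all factors" approach every locale carries all 24 variants of the message, whether or not its language distinguishes any of them.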

For example, it's very easy to see why pluralization is the only well-addressed variant selection mechanism if you observe that English has pluralization, but not declension, and gender cases are limited etc.

Fluent uses a concept of separation of concerns - a developer should never have to make decisions about the localization, and if there's any scenario that a localizer wants to solve for their locale, it should not impose any complexity for another.

I recognize that this is a particular position and one can take another. I'd only argue that examples you provide are not really scalable and should not be considered a "solution".
We can (and should) set a boundary and decide what we are not going to solve. Some of the examples above may be cases of things we will decide not to solve.

But I don't believe only linguistic issues should be solvable via localization system.
My boundary is whether a need or variant selection is locale specific, or common across all locales.

Custom selectors and formatters provide functionality that is developer-independent and doesn't leak across locales.
I believe we should aim to support them, and aim to provide a good specification that helps TMS/CAT tools reason about them.

@nbouvrette
Copy link
Collaborator

Thanks @zbraniecki for the extra context. I think it helps (at least for me) to better understand the strategy behind Fluent. There are still some areas that are not clear to me, and I think we could break down each approach into pros and cons to get a better picture.

We are getting away from the original topic of this thread, I don't know if we should start a new one?

Some observations so far, let's imagine we try to break this into 2 schools of thought:

Focused on linguistic problems (just made up a name for the approach I was proposing)

  • Each linguistic problem would have predefined selectors and rules per language (plural is an excellent example and should be extended to cover more cases)
  • A hybrid (Fluent-like syntax) approach could also be possible but once linguistic selectors would become available, they would be the recommended approach
  • Everything that is not a linguistic problem (for example layout or configuration of keys) could use normal key/values.
  • This will come with the challenge that the author would be responsible to know if multiple variants of a string are required or should be able to fix/adapt it given linguist feedback
  • Repetition would be preferred over using syntax for non-linguistic problems (e.g. avoid using select)
  • Easy to adopt by most existing TMSes and file formats
  • The syntax can be simple and used directly by linguists

Full flexibility (this is the best name I could come up with for Fluent)

  • A powerful syntax that can solve not only complex linguistic problems but also problems that are closer to the application or even markup
  • Well integrated with code, and can also avoid repetition in a certain language
  • Gives more control to individual linguists without creating too many dependencies between languages
  • It might be tricky to expand globally (need better validation tools or TMS integration?)
  • Can also be hard to enforce best practices or correct usage of the syntax because of its flexibility

Fluent uses a concept of separation of concerns - a developer should never have to make decisions about the localization, and if there's any scenario that a localizer wants to solve for their locale, it should not impose any complexity for another.

But I'm still having a hard time understanding how Fluent can solve some of the issues in the method I was proposing. For example, the linguist needs to know what the PLATFORM values are to be able to use it. What happens if you add a new platform: does this automatically trigger localization requests?

Isn't it the same, or even maybe more complex than coming up ahead of time, knowing which platform you support and having 1 string for each when you author the string? If a developer adds a new platform, he should also remember to update the related strings. In a continuous localization setup, this would automatically trigger new localization requests.

I recognize that this is a particular position and one can take another. I'd only argue that the examples you provide are not really scalable and should not be considered a "solution".

I'm also curious about how this can fit into big commercial TMSes. Do you have details on this, or maybe you had something else in mind? To me, this is also a very important point to consider when talking about scalability.

@romulocintra romulocintra removed the requirements Issues related with MF requirements list label Feb 18, 2020
@alabamenhu

The way I see this, and I don't have a lot of personal experience with this scenario, but you would possibly have 1 string per platform when it comes to a platform related topic. For example:

opendir-linux = Open Directory
opendir-windows = Open Folder
opendir-macos = Show in Finder
opendir-android = Open Directory
showString(`opendir-${PLATFORM}`);

So, let's say I can't determine the platform, or I add a new one (maybe, opendir-web). What is the best translation out of the four to use as a default if a specific translation isn't available?

For many reasons, the different platforms might have the optimal default translations. For example, font in Spanish is translated as either the historically-correct tipo (de letra) used by Macs, or Microsoft's Spanglish fuente. In a general purpose program, I'd probably go with fuente as the default even though I personally want to barf when I see it. But in a publishing program... the traditional is a more acceptable default. The programmer is not qualified to know which one is the best default, and besides the fact that there's no great way to put that kind of logic into code, the programmer really shouldn't have to worry about which strings have platform differences.
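
A small sketch of the problem being described, under the assumption that strings are looked up by concatenated keys as in the quoted example. The `showString` signature and the `fallbackPlatform` parameter are hypothetical. The point is that with one key per platform, the choice of fallback has to live in code.

```javascript
const strings = {
  "opendir-linux": "Open Directory",
  "opendir-windows": "Open Folder",
  "opendir-macos": "Show in Finder",
  "opendir-android": "Open Directory",
};

// Hypothetical lookup: one key per platform, fallback chosen by the caller.
function showString(baseKey, platform, fallbackPlatform) {
  const exact = strings[`${baseKey}-${platform}`];
  if (exact !== undefined) return exact;
  // No string for this platform: the programmer must decide which of the
  // existing translations is the best default.
  return strings[`${baseKey}-${fallbackPlatform}`];
}

console.log(showString("opendir", "macos", "linux")); // Show in Finder
console.log(showString("opendir", "web", "linux"));   // Open Directory
```

Here the `"linux"` fallback is hard-coded at the call site and applies to every locale. With a default variant inside the message, each localizer picks the default for their own language.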

That's the whole idea of Fluent: let the programmer focus on programmer stuff, and let the localizer focus on localizer stuff. Sure, it can make localization more complex, but as tools start developing around it, things should get much easier on the localizer for the complex stuff.

I can already imagine there being a fairly extensive set of terms that handle some of the most common intra-language issues, and a tool automatically signalling to the translator that, for instance, they probably shouldn't hardcode ordenador, but suggest instead the term {-computer} so that some countries get ordenador, others get computador and others get computadora, and subsequently automatically adding the term to the file.

@nbouvrette
Collaborator

That's the whole idea of Fluent: let the programmer focus on programmer stuff, and let the localizer focus on localizer stuff. Sure, it can make localization more complex, but as tools start developing around it, things should get much easier on the localizer for the complex stuff.

You are right, Fluent is quite powerful (probably the most powerful syntax around). But as you mentioned, the integration with existing tools is still in progress. One of the big challenges is that traditionally, source assets cannot be modified by translation management systems. This means that regardless of the syntax, the linguist will need to be involved during authoring, or major changes will need to happen in existing tools. But even if you could modify the source to add new "variants" of a string, how will the code use them? Unless it's fully self-contained and used within other strings, there is some collaboration that needs to happen, regardless of the solution, to get optimal localization.

The big challenge I see ahead is when you start mixing non-linguistic problems (like white labeling, OSes or even A/B testing) with a linguistic solution - where do we draw the line?

@mihnita mihnita added the requirements Issues related with MF requirements list label Sep 24, 2020
@aphillips aphillips added the resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. label Jun 28, 2023
@aphillips
Member

The whole idea of a registry is basically addressing this issue. Closing this as "addressed". Open a new specific issue if you think something is missing.

@macchiati
Member

Is there an issue for which the functions should be in the standard registry? That should go beyond what ICU has as part of MF1.0, eg CurrencyAmount (currency+number), Measure (unit+number+usage), etc.

@aphillips
Member

@macchiati The short answer is "no, I don't think so", although I think there is an issue that tracks MFv1 compatibility (#361) including this comment from me.

We have an agenda item for 2023-07-03 to discuss the registry, starting with "will we have a standard one?"

I believe that we should have a standard registry and that it should go beyond what is in MF1 to embrace other formatters. An important question is "what criteria should be applied to inclusion in the standard registry?", since implementations would be required to provide the items in the standard registry with, presumably, the options specified there.

There is some debate about options. @eemeli (and others) have expressed a desire to use JS's array of options. Others, such as @mihnita and myself, prefer skeletons for certain operations that have them. There exist mappings (and implementation support) to move between these, as well as compromises (including allowing for implementation-specific extension, e.g. ICU4J might support skeletons as an extension to standardized option bags).
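
As a rough illustration of such a mapping, the sketch below converts a handful of ICU date skeleton fields into an `Intl.DateTimeFormat` option bag. The `skeletonToOptions` helper and its field table are assumptions written for this example and cover only a small subset of fields; real skeleton handling is considerably broader.

```javascript
// Partial mapping from ICU date skeleton fields to Intl.DateTimeFormat options.
const FIELD_OPTIONS = {
  y: { year: "numeric" },
  yy: { year: "2-digit" },
  M: { month: "numeric" },
  MMM: { month: "short" },
  MMMM: { month: "long" },
  d: { day: "numeric" },
  dd: { day: "2-digit" },
  E: { weekday: "short" },
  EEEE: { weekday: "long" },
};

// Hypothetical helper: turn a skeleton such as "yMMMd" into an option bag.
function skeletonToOptions(skeleton) {
  const options = {};
  // Split into runs of the same letter, e.g. "yMMMd" -> ["y", "MMM", "d"].
  for (const field of skeleton.match(/(.)\1*/g) ?? []) {
    const mapped = FIELD_OPTIONS[field];
    if (!mapped) throw new Error(`Unsupported skeleton field: ${field}`);
    Object.assign(options, mapped);
  }
  return options;
}

const options = skeletonToOptions("yMMMd");
console.log(options); // { year: 'numeric', month: 'short', day: 'numeric' }

// The resulting bag can be passed straight to the standard formatter.
const formatter = new Intl.DateTimeFormat("en", options);
console.log(formatter.format(new Date(2020, 0, 28)));
```

An implementation that prefers skeletons could accept them as an extension and translate them to the standardized options internally, along these lines.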

Separately there are two add-on opportunities: (1) implementation-specific registry additions (e.g. ECMA-402 might add JS specific options to e.g. datetime) and (2) user-specific registry additions (e.g. user-defined formatters) and we'd need to define how those interact with each other.
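
One possible way to picture these layers is a lookup that consults user additions first, then implementation additions, then the standard registry. This is purely a sketch: the `createRegistry` helper, the formatter names, and above all the precedence order are assumptions, since how the layers interact is exactly what remains to be defined.

```javascript
// Hypothetical layered registry. Layers are passed lowest precedence first.
function createRegistry(...layers) {
  return {
    lookup(name) {
      // Search from the highest-precedence layer down to the standard one.
      for (let i = layers.length - 1; i >= 0; i--) {
        if (Object.prototype.hasOwnProperty.call(layers[i], name)) {
          return layers[i][name];
        }
      }
      throw new Error(`Unknown formatter: ${name}`);
    },
  };
}

// Illustrative contents only; none of these names come from a specification.
const standard = {
  string: (value) => String(value),
  number: (value, locale) => new Intl.NumberFormat(locale).format(value),
};
const implementationSpecific = {
  list: (values, locale) => new Intl.ListFormat(locale).format(values),
};
const userDefined = {
  uppercase: (value, locale) => String(value).toLocaleUpperCase(locale),
};

const registry = createRegistry(standard, implementationSpecific, userDefined);

console.log(registry.lookup("uppercase")("awesome", "en")); // AWESOME
console.log(registry.lookup("string")(42));                 // 42
```

Letting later layers shadow earlier ones is only one option. A stricter design could reject additions that collide with a standard name.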

In any case, I'd like us to break out specific items rather than having a giant "registry" issue with everything in it 🙈
