Add Pattern Selection #333

eemeli · 2023-02-01T07:40:29Z

This is an attempt to explicitly document how case/variant selection happens with messages that have a when Selector statement. The method presented here should match the current implementations (ICU4J, Intl.MessageFormat polyfill) in general shape, though some specifics such as error handling may be slightly different.

The overall intent is to minimally but sufficiently define selection, such that two implementations that use similar custom selector functions will make the same selection when given the same input message and formatting context. In a number of places details are left for each implementation to fill in for itself, as each may have different internal representations of resolved and unresolved values and may perform value matching in different ways.

By necessity, the method definition needs to use more formal language than what we have so far in the spec. For that, I've borrowed some of the conventions of the TC39 spec and hope that it's sufficiently readable as is, without us needing separate definitions of e.g. what a "list" is.

An earlier draft of this PR was reviewed by @stasm.

Co-authored-by: Stanisław Małolepszy <sta@malolepszy.org>

aphillips

I think there may be a problem with defining match/when selectors in the way described here. This description requires the first step (which you call "Setup", but which, upon reflection, should describe the processing of the match line in the pattern) to resolve a single value without reference to the available values in the Variant set. Some existing Selectors, notably the plural one, need access to the variant set in order to do that.

Perhaps it would be better to think of match as an operator and feeding it the when statements in order to return a single pattern string? We can still debate whether the match is an ordered greedy one (as you have here) or seeks the "best match". I think it would work better and it would allow selector functions to define "match" however they need to.

aphillips · 2023-02-01T16:36:36Z

spec/formatting.md

+1. Let _res_ be a new empty list of resolved values that support selection.
+2. For each Expression _exp_ of the message's Selector Expressions,
+   1. Let _rv_ be the resolved value of _exp_.
+   2. If selection is supported for _rv_:


I don't know what this means? rv is just a value and all we're doing is resolving what the value is (we're not performing the selection yet). When is selection not supported for a value?

eemeli · 2023-02-02

For example, let's presume that we're formatting to a non-string target and a variable $img resolves to an HTML <img> element. We should be perfectly fine using $img in a placeholder in a pattern, but what happens if the message has a when {$img}? This line is here to allow an implementation to say that in that case, we won't even try matching anything against the <img>.

aphillips · 2023-02-01T16:39:18Z

spec/formatting.md

+      1. Append _rv_ as the last element of the list _res_.
+   3. Else:
+      1. Emit a Selection Error.
+      2. Let _nomatch_ be a resolved value for which selection always fails.


Should this be defined externally rather than inline?

These instructions could be simpler as:

1. For each expression exp ... etc ...
   1. Let rv be the resolved value of exp, or nomatch if the expression cannot be resolved.
   2. Append rv as the last element of res.
   3. If rv is nomatch, emit an error.

eemeli · 2023-02-02

To me, it's clearer to provide the definition of nomatch inline, and to have only one thing happening on each line of the algorithm. Also, the last step of your proposed method requires an equality comparison between potentially non-primitive values. We should avoid such comparisons if at all possible, even if it makes the method have a couple more steps or indentation levels.
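To make the setup steps under discussion concrete, here is a minimal Python sketch. It is not taken from the PR or from any implementation: the names `Resolved`, `NoMatch`, `resolve_selectors`, and the `supports_selection` flag are all hypothetical, and plain string comparison stands in for the real key test. It illustrates how a dedicated nomatch object whose test always fails avoids any equality comparison between resolved values.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Resolved:
    """Hypothetical resolved value; real implementations define their own shape."""
    value: Any
    supports_selection: bool = True

    def matches(self, key: str) -> bool:
        # Assumption: string comparison stands in for the real key test.
        return str(self.value) == key


class NoMatch:
    """Resolved value for which selection always fails."""

    def matches(self, key: str) -> bool:
        return False


def resolve_selectors(
    expressions: List[str],
    resolve: Callable[[str], Resolved],
    emit_error: Callable[[str], None],
) -> list:
    res = []  # list of resolved values that support selection
    for exp in expressions:
        rv = resolve(exp)
        if rv.supports_selection:
            res.append(rv)
        else:
            emit_error(f"Selection Error: {exp} does not support selection")
            res.append(NoMatch())
    return res


# Usage: $img resolves to something that cannot be matched against keys.
env = {"$count": Resolved(1), "$img": Resolved(object(), supports_selection=False)}
errors: List[str] = []
res = resolve_selectors(["$count", "$img"], env.__getitem__, errors.append)
```

Note that the error is emitted once during setup, and later key tests against the second selector simply fail without comparing the value to anything.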

aphillips · 2023-02-01T16:43:02Z

spec/formatting.md

+      2. Let _nomatch_ be a resolved value for which selection always fails.
+      3. Append _nomatch_ as the last element of the list _res_.
+
+The shape of the resolved values must be determined by each implementation,


The normative must seems hard to enforce here. What is this text trying to ensure?

eemeli · 2023-02-02

This is trying to be explicit about the parts of the selection process that are left for the implementation, and makes it clear that the spec explicitly leaves out the shape of any resolved values, or how to work with them.

My personal preference would be for the spec to be built in terms of explicit, well-defined resolved values, but @mihnita in particular has strongly pushed back on this. Unfortunately, this means that it's tricky to talk about e.g. what the value of $num is here:

let $num = {(1) :number minimumFractionDigits=1}
match {$num}
when one {You have {$num} thing}
when * {You have {$num} things}

I would find it much easier to say that $num resolves to be an instance of a MessageNumber with a value 1 and an options bag { minimumFractionDigits: 1 }. When you use a MessageNumber as a selector, this is how the matching works, and when you use it in a placeholder, that is what you get when formatting to a string, and other is the behaviour when formatting to parts.

But taking considerations of realpolitik into account, we appear to need to define the spec without any such MessageNumber constructs, and hence end up with this circumlocution around resolved values.

echeran · 2023-02-04

> But taking considerations of realpolitik into account...

Hmm, we should continue to stay focused on our objective technical arguments, and of course doing our usual due diligence around them (multiple alternatives, pros & cons, eval criteria for a preference). I'm not sure that we're all so far apart in our thinking on this topic, so giving in to non-technical concerns slows us down from achieving our best work.

I would like to get a clearer picture on the previous discussion from @mihnita. It could be here, but also could be in a meeting. Maybe after hearing more details, it would help us identify the precise technical sticking points? And hopefully my response to the other comment helps add some context to the topic by identifying how we might be talking past each other.

aphillips · 2023-02-01T16:53:12Z

spec/formatting.md

+Using _res_,
+the Variants are iterated in source order and the following test is performed
+to find one with all of its keys matching the Selector Expressions:


This could be clearer. I think you're describing a greedy matcher in which the order of the variant matrix matters--it returns the first variant that matches all of the conditions. I would rephrase this (although note that I'm not in agreement with the matching described):

For each Variant, test if its VariantKeys match the values in res

Such a matcher requires the developer, translator, tooling, and runtime to keep the serialized order of the matrix intact end-to-end (including when various languages explode the matrix). I think this is an unnecessary burden that I would like to avoid.

As we've seen elsewhere, some values can match more than one value in the variant key set. For example, the value 1 in a plural matcher in the en locale can match both the value 1 and the keyword one. The value 1 is a better match, but not the only match. So:

let $foo = 1 // this will be for the plural
let $bar = (moo)
match {$foo $bar}
when one bar { no match because bar!=moo }
when one * { unfortunate match }
when 1 moo { we want this one }
... etc...

The description you have here would not work for plural selection because the value can be either 1 or the keyword one, but not both. In the current implementation, the plural formatter has both the array of VariantKeys and the value to evaluate against them.
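The difference between the two strategies can be sketched in Python. This is an illustration only, not the PR's algorithm or any implementation's behaviour: the numeric scores, the simplified English plural test, and the final `* *` fallback variant (standing in for the elided "etc") are assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple

Variant = Tuple[Tuple[str, ...], str]      # (keys, pattern)
Matcher = Callable[[object, str], int]     # score: 0 means "no match"


def plural_en(value: object, key: str) -> int:
    """Hypothetical scoring: 3 = exact number, 2 = plural category, 0 = no match."""
    if key == str(value):
        return 3
    category = "one" if value == 1 else "other"  # simplified English integers only
    return 2 if key == category else 0


def string_eq(value: object, key: str) -> int:
    return 3 if key == value else 0


def variant_scores(selectors: List[Tuple[Matcher, object]], keys: Tuple[str, ...]):
    # The catch-all key always matches, with the lowest non-zero score.
    scores = tuple(1 if k == "*" else m(v, k) for (m, v), k in zip(selectors, keys))
    return scores if all(scores) else None


def first_match(selectors, variants: List[Variant]) -> Optional[str]:
    # Greedy: source order decides precedence.
    for keys, pattern in variants:
        if variant_scores(selectors, keys) is not None:
            return pattern
    return None


def best_match(selectors, variants: List[Variant]) -> Optional[str]:
    # Order-independent: highest score tuple wins, ties fall back to source order.
    scored = [(variant_scores(selectors, keys), pattern) for keys, pattern in variants]
    candidates = [(s, p) for s, p in scored if s is not None]
    return max(candidates, key=lambda sp: sp[0])[1] if candidates else None


selectors = [(plural_en, 1), (string_eq, "moo")]
variants = [
    (("one", "bar"), "no match because bar!=moo"),
    (("one", "*"), "unfortunate match"),
    (("1", "moo"), "we want this one"),
    (("*", "*"), "fallback"),  # assumed; the original example elides the rest
]
```

With these inputs, `first_match` returns the `one *` variant because it appears first, while `best_match` prefers `1 moo` regardless of where it sits in the matrix.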

echeran · 2023-02-02T02:58:17Z

spec/formatting.md

+      3. Append _nomatch_ as the last element of the list _res_.
+
+The shape of the resolved values must be determined by each implementation,
+along with the manner of determining their support for selection.


The text is underspecified/unclear here and/or missing an important point. The text is saying that an implementation of MessageFormat 2 should determine whether a resolved value is valid for selection. But based on what @markusicu and others have been saying for a while, in fact, the responsibility for selecting goes together with formatting.

The example of plurals makes it obvious. A number can always match OTHER aka *, and it can match an exact match expression (ex: =7). It can also match one of the plural categories: ZERO, ONE, TWO, FEW, MANY. (Side note: that reiterates Addison's comment about the care needed in matching.) The point I want to add and emphasize, on top of that, is that formatting affects the matching, as in the plural case. The number 1200000 in French matches the OTHER plural category, but 1.2M matches the plural category MANY.

Similarly, how to do matching is a concern that belongs alongside the formatting implementation for this type of formatting / this value type. Whether the strings =7, =1200000, *, or MANY are matched by the formatted numbers that I will serialize here as 1.2M and 1200000 requires a whole set of rules.

Another example of how the manner of matching can be specific to the selector/formatter or value type is how semantic versions get matched via "greater than or equal to" logic.

So at the least, it would help to call out:

- The input and formatted values are needed for matching
- The MF2 implementation should be invoking a selection function, and maybe it exposes the match function predicate it uses internally
- Given that formatted values are needed for matching, we should say that formatter functions are a prerequisite for selector functions

eemeli · 2023-02-02

I'd like to push back a little on formatting being an explicit requirement for selection. In many cases I agree that formatting is an implicit practical requirement, but that's an implementation detail. The CLDR rules you link to depend on Plural Rule Operands, which at least in JS are calculated from a locale-independent formatting of the input value that's relatively easy to parse. For example, digit grouping is not done and the decimal digit is always a period (.).

Theoretically, it would be possible to determine these operand values without the intermediate formatting step, and in certain cases it might be possible to reuse the formatting if it happens to match the expected output for the current locale and options. But in practice the reasonable thing to do is to format & re-parse the input number within the plural selector, and separately format the input number for string output.

With the French millions case, for instance, the plural selection ends up depending on the formatting to 1.2M, while the formatting of a matching placeholder would end up as 1,2 M.
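As a rough illustration of the "format and re-parse" approach, the following Python sketch derives a few CLDR plural operands from a locale-independent decimal string and applies the English cardinal rule for `one` (`i = 1 and v = 0`). The `Operands` class and function names are hypothetical, only four operands are computed, and compact-notation exponents are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operands:
    n: float  # absolute value of the source number
    i: int    # integer digits
    v: int    # number of visible fraction digits, with trailing zeros
    f: int    # visible fraction digits as an integer, with trailing zeros


def operands(source: str) -> Operands:
    """Parse a plain decimal string: no grouping, '.' as the decimal separator."""
    text = source.lstrip("+-")
    integer, _, fraction = text.partition(".")
    return Operands(
        n=abs(float(source)),
        i=int(integer),
        v=len(fraction),
        f=int(fraction) if fraction else 0,
    )


def english_cardinal(source: str) -> str:
    """English cardinal categories: 'one' when i = 1 and v = 0, else 'other'."""
    op = operands(source)
    return "one" if op.i == 1 and op.v == 0 else "other"
```

Under this sketch, `english_cardinal("1")` gives `one` while `english_cardinal("1.0")` gives `other`, which is why formatting options such as minimumFractionDigits=1 can change which variant a number selects.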

echeran · 2023-02-03T00:21:54Z

> I'd like to push back a little on formatting being an explicit requirement for selection. In many cases I agree that formatting is an implicit practical requirement, but that's an implementation detail. ... [PluralRuleOperands] at least in JS are calculated from a locale-independent formatting of the input value that's relatively easy to parse. For example, digit grouping is not done and the decimal digit is always a period (.). ... With the French millions case, for instance, the plural selection ends up depending on the formatting to 1.2M, while the formatting of a matching placeholder would end up as 1,2 M.

We might be partially talking past each other due to overloaded terminology, and there's still an unaddressed point.

Formatting usually means returning a string, but in ICU, when formatting type X, an intermediate representation produced after applying some locale-specific processing is sometimes called FormattedX. This intermediate pre-processed state still occurs before the formatting symbols are applied when getting the final formatted string (ex: FormattedNumber is intermediate; it still has a toString() method, but can also be an input to selection in PluralRules.select()). Instead of using a string adhering to a grammar like in JS, ICU4X uses a more structured type (FixedDecimal) for the intermediate value that can still be used for selection and formatting to string. That avoids the redundant parsing of strings to reconstruct that information, which the JS string approach requires.

This preprocessing step is still usually handled by a formatter, so we still have that as a dependency in this example. In practice, the logic for selection is going to be closely related to the logic for formatting. These things are often intertwined. The other higher-level point that still seems unaccounted for is that how the selection occurs is nontrivial (it's not string equality).

The spec still should be specific on who or how that selection should be done. As it stands currently, it's underspecified. And the text instead asserts to the effect that "the implementation will determine ...", which implies the MF2 implementation. But that's not the appropriate level of responsibility for matching / selection, which should be closer to the formatting.

eemeli · 2023-02-03

> We might be partially talking past each other due to overloaded terminology

Yeah, it's sounding like what I'm referring to as a "resolved value" could be represented in ICU4J by the FormattedX entities. At least in my headcanon, the whole process of what could be called "formatting" makes more sense to split into two: "resolution" and "formatting". In the first, you gather up all the information you need in order to do e.g. formatting or selection, and in the second you take that information and you emit a value in the final representation that you need.

> [...] In practice, the logic for selection is going to be closely related to the logic for formatting. These things are often intertwined. [...]

Definitely agree, for plural selection.

> The spec still should be specific on who or how that selection should be done. As it stands currently, it's underspecified. And the text instead asserts to the effect that "the implementation will determine ...", which implies the MF2 implementation. But that's not the appropriate level of responsibility for matching / selection, which should be closer to the formatting.

While I agree that the behaviour of a plural selector (be it :number or :plural) must be well defined for use in an implementation, I do not think that definition belongs in this specification.

Do you think we ought to include something like what I mention in this comment in the MF2 spec, though? As I understand it, a FormattedNumber should work as an implementation of a "MessageNumber".

echeran · 2023-02-04

> in my headcanon ... In ["resolution"], you gather up all the information you need in order to do e.g. formatting or selection...

As far as confusing terminology goes, "resolution" still sounds a little too vague. This initial preprocessing step during the overall work of "formatting" (input value -> string) also depends on locale information. Ex: for compact notation numbers, the exponent that you use is informed by the grouping strategy for numbers in the locale. (@eggrobin, did I get that example right?)

> The spec still should be specific on who or how that selection should be done. As it stands currently, it's underspecified. And the text instead asserts to the effect that "the implementation will determine ...", which implies the MF2 implementation. But that's not the appropriate level of responsibility for matching / selection, which should be closer to the formatting.

> While I agree that the behaviour of a plural selector (be it :number or :plural) must be well defined for use in an implementation, I do not think that definition belongs in this specification.

To clarify, I'm not saying that we should define the behavior of plural selector in the spec. What I'm saying is that this PR codifies an algorithm for a first match strategy, but the notion of "match" between a selector value and variant key is underspecified. We are in agreement that the notion of match has to be implemented separately for each type of selector, and those implementation details are not a concern for the spec text. But what I am saying is that we also clearly can't leave the story at that, since the proposed algorithm in spec text is built on top of an assumption of a specific notion of match (equality, ex: string equality), and we know from the plural selector example that that specific notion is insufficient to cover all cases, so it needs to be generalized.

And so, I think we already agree that there is a clear connection there between the proposed high-level algorithm for variant key selection in MF2 and the selector-specific notion of match. In order to resolve the problem that I described above, just as we have done before when designing for things like formatting functions, the proper way to achieve both:

- generalizing the notion of matching from a simple string equality comparison predicate function to an impl-specific predicate function
- decoupling those impl-specific notions of matching from the high-level algorithm

...is to have an interface representing the selection predicate function.

That is the specificity that the current proposed algorithm needs. Without doing so, the algorithm text in the PR here will ignore a design problem that we know that we have. Having an interface to represent a specific selector's impl-specific behavior achieves proper simplicity through decoupling that makes for a good design around this problem.
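One way to picture such an interface is the Python sketch below. Everything in it is hypothetical (the `Selector` protocol, `StringSelector`, `VersionAtLeast`, and `select_pattern` are invented for illustration), and it uses the first-match strategy only because that is what the PR proposes. The high-level loop never inspects values itself; it only calls each selector's own predicate.

```python
from typing import List, Optional, Protocol, Sequence, Tuple


class Selector(Protocol):
    """Hypothetical interface: each selector decides what 'match' means."""

    def matches(self, key: str) -> bool: ...


class StringSelector:
    def __init__(self, value: str) -> None:
        self.value = value

    def matches(self, key: str) -> bool:
        return key == self.value  # plain string equality


class VersionAtLeast:
    """Matches when the runtime version is greater than or equal to the key."""

    def __init__(self, version: str) -> None:
        self.version = self._parse(version)

    @staticmethod
    def _parse(text: str) -> Tuple[int, ...]:
        return tuple(int(part) for part in text.split("."))

    def matches(self, key: str) -> bool:
        return self.version >= self._parse(key)


def select_pattern(
    selectors: Sequence[Selector],
    variants: List[Tuple[Tuple[str, ...], str]],
) -> Optional[str]:
    # First variant whose keys are all catch-alls or accepted by their selector.
    for keys, pattern in variants:
        if all(k == "*" or s.matches(k) for s, k in zip(selectors, keys)):
            return pattern
    return None


variants = [
    (("3.0", "*"), "new"),
    (("2.0", "linux"), "mid"),
    (("*", "*"), "old"),
]
picked = select_pattern([VersionAtLeast("2.3.1"), StringSelector("linux")], variants)
```

Here `picked` is the `mid` pattern: version 2.3.1 is not at least 3.0, but it is at least 2.0, and the second key matches by equality.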

> Do you think we ought to include something like what I mention in this comment in the MF2 spec, though? As I understand it, a FormattedNumber should work as an implementation of a "MessageNumber".

I think what you were mentioning in that comment is the same as what I have been saying so far with "selection depends on formatting". But maybe that wasn't clear because we used different words to describe the same structured pre-processed form ("formatting" as in .formatToParts() in ECMA-402 vs. "resolved value").

The way you describe it leaves me unsure whether we agree on how best to design for this. Rather than trying to specify the type of the value in the pre-processed structured form (can we call this form the "Preformatting Structured Parts" maybe?), I think we could once again define an interface for the function that returns the value. That would allow us to decouple separate concerns cleanly and simply. But I think we agree in principle, if I understand correctly.

And it would help us all to eventually find some consistent precise naming, too.

echeran · 2023-02-02T03:01:22Z

spec/formatting.md

+this selection will always succeed.
+Variants after one with all catch-all keys will never be selected.
+
+### Examples


We should have an example with a plural/number formatting+selection to address the comments regarding implicit assumptions about matching.

echeran · 2023-02-03T00:35:35Z

More specifically, the intention here is that when we include an example of a plurals selection message, we have a real use case where matching is not as simple as just string comparison. We should be able to run it through the above described algorithm verbiage and verify that everything we need is accounted for in the text.

eemeli · 2023-02-02T08:02:44Z

@aphillips:

> Perhaps it would be better to think of match as an operator and feeding it the when statements in order to return a single pattern string? We can still debate whether the match is an ordered greedy one (as you have here) or seeks the "best match". I think it would work better and it would allow selector functions to define "match" however they need to.

The outcome of the resolution meetings with the CLDR-TC a year ago included this on selection:

Selecting variant messages based on selectors
a. Use a first-match approach.
b. Any specially ordering needs to be done on the tooling side.

Effectively, this means that unlike in MF1, in MF2 the order of the variants defines their precedence. So an MF1 message

{num, plural, one{ONE} =1{1!} other{OTHER}}

would need to be represented like this in MF2

match {$num :plural}
when 1 {1!}
when one {ONE}
when * {OTHER}

in order to keep the same precedence of the =1 exact match over the one category match.
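The tooling-side ordering this implies could look something like the following Python sketch. It is only an assumption about what such tooling might do: the ranking policy (exact numeric keys, then keywords, then the catch-all, compared column by column) and the names `key_rank` and `order_for_first_match` are invented for illustration.

```python
import re
from typing import List, Tuple

Variant = Tuple[Tuple[str, ...], str]  # (keys, pattern)

_NUMBER = re.compile(r"-?\d+(\.\d+)?")


def key_rank(key: str) -> int:
    """Lower rank sorts earlier: exact numbers, then keywords, then '*'."""
    if key == "*":
        return 2
    if _NUMBER.fullmatch(key):
        return 0
    return 1


def order_for_first_match(variants: List[Variant]) -> List[Variant]:
    # sorted() is stable, so variants with equal ranks keep their source order.
    return sorted(variants, key=lambda v: tuple(key_rank(k) for k in v[0]))


# The MF1 message above, with cases in their original order.
mf1_order = [(("one",), "ONE"), (("1",), "1!"), (("*",), "OTHER")]
mf2_order = order_for_first_match(mf1_order)
```

After sorting, the keys appear in the order 1, one, *, matching the MF2 message shown above.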

I would very strongly prefer not reopening this particular decision.

echeran · 2023-02-03T00:44:28Z

spec/formatting.md

+         1. Return False to indicate failure.
+2. Return True to indicate success.
+
+The manner of testing _key_ against _sel_ must be defined by each implementation.


This is the place in the text that I was referring to in the earlier comment. It seems to suggest that the matching / testing of the runtime value element _sel_ against the element _key_ of the VariantKey should be defined by the implementation, which sounds like the implementation of MF2. If not, then the wording is confusing.

And regardless, we still want to be more specific about what part of MF2 is responsible for the matching logic of _sel_ to _key_ since we know from plurals that we can't assume the trivial case (value equality comparison) to be sufficient.

eemeli · 2023-02-14T11:18:28Z

Marked this as a draft to indicate that yesterday's MFWG call identified at least the following dependencies for this PR, which will need to be resolved in separate issues/PRs:

- A potential reconsideration of a "best match" approach, as opposed to our current "first match" selection
- Including an explicit definition of resolved/preprocessed/intermediate values as a spec-internal utility interface

Once those have been resolved, this PR may need to be correspondingly updated to match.

eemeli · 2023-03-29T09:37:27Z

Closing; will iterate on this and open a new PR with a column-first selection method.

Add Pattern Selection

9d57cdf

Co-authored-by: Stanisław Małolepszy <sta@malolepszy.org>

eemeli added the Agenda+ Requested for upcoming teleconference label Feb 1, 2023

eemeli requested review from aphillips, stasm, zbraniecki and mihnita February 1, 2023 07:40

aphillips requested changes Feb 1, 2023

echeran requested changes Feb 2, 2023

eemeli commented Feb 2, 2023

eemeli commented Feb 2, 2023

eemeli added 2 commits February 2, 2023 21:47

Apply suggestions from code review

5f28cda

Update spec/formatting.md

91f88c1

echeran reviewed Feb 3, 2023

eemeli marked this pull request as draft February 14, 2023 11:12

eemeli removed the Agenda+ Requested for upcoming teleconference label Feb 14, 2023

mihnita mentioned this pull request Mar 3, 2023

What does "backward-compatible with MF1" really means? #361

Closed

eemeli closed this Mar 29, 2023

eemeli deleted the selection branch March 29, 2023 09:38

eemeli mentioned this pull request Mar 29, 2023

Add column-first pattern selection #372

Merged
