Add Pattern Selection #333

Closed
wants to merge 3 commits into from

Collaborator

@eemeli eemeli commented Feb 1, 2023

This is an attempt to explicitly document how case/variant selection happens with messages that have a when Selector statement. The method presented here should match the current implementations (ICU4J, Intl.MessageFormat polyfill) in general shape, though some specifics such as error handling may be slightly different.

The overall intent is to minimally but sufficiently define selection, such that two implementations that use similar custom selector functions will make the same selection when given the same input message and formatting context. In a number of places, details are left for each implementation to fill in for itself, as each may have different internal representations of resolved and unresolved values and may perform value matching in different ways.

By necessity, the method definition needs to use more formal language than what we have so far in the spec. For that, I've borrowed some of the conventions of the TC39 spec and hope that it's sufficiently readable as is, without us needing separate definitions of e.g. what a "list" is.

An earlier draft of this PR was reviewed by @stasm.

Co-authored-by: Stanisław Małolepszy <sta@malolepszy.org>
@eemeli eemeli added the Agenda+ Requested for upcoming teleconference label Feb 1, 2023
Member

@aphillips aphillips left a comment

I think there may be a problem with defining match/when selectors in the way described here. This description requires the first step (which you call "Setup", but which, upon reflection, should describe the processing of the match line in the pattern) to resolve a single value without reference to the available values in the Variant set. Some existing Selectors, notably the plural one, need access to the variant set in order to do that.

Perhaps it would be better to think of match as an operator and feeding it the when statements in order to return a single pattern string? We can still debate whether the match is an ordered greedy one (as you have here) or seeks the "best match". I think it would work better and it would allow selector functions to define "match" however they need to.

spec/formatting.md
1. Let _res_ be a new empty list of resolved values that support selection.
2. For each Expression _exp_ of the message's Selector Expressions,
   1. Let _rv_ be the resolved value of _exp_.
   2. If selection is supported for _rv_:
Member

I don't know what this means? rv is just a value and all we're doing is resolving what the value is (we're not performing the selection yet). When is selection not supported for a value?

Collaborator Author

For example, let's presume that we're formatting to a non-string target and a variable $img resolves to an HTML <img> element. We should be perfectly fine using $img in a placeholder in pattern, but what happens if the message has a when {$img}? This line is here to allow an implementation to say that in that case, we won't even try matching anything against the <img>.

      1. Append _rv_ as the last element of the list _res_.
   3. Else:
      1. Emit a Selection Error.
      2. Let _nomatch_ be a resolved value for which selection always fails.
Member

Should this be defined externally rather than inline?


These instructions could be simpler as:

  1. For each expression exp ... etc ...
    1. Let rv be the resolved value of exp, or nomatch if the expression cannot be resolved.
    2. Append rv as the last element of res
    3. If rv is nomatch emit error

Collaborator Author

To me, it's clearer to provide the definition of nomatch inline, and to have only one thing happening on each line of the algorithm. Also, the last step of your proposed method requires an equality comparison between potentially non-primitive values. We should avoid that if at all possible, even if it makes the method a couple of steps longer or adds indentation levels.
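
As an illustration of the setup steps discussed in this thread, here is a minimal TypeScript sketch. It is not taken from the PR: the `ResolvedValue` interface, `makeNomatch`, and `resolveSelectors` are hypothetical names, and the PR explicitly leaves the shape of resolved values to each implementation. The sketch appends a fresh always-failing value instead of comparing values for identity, in line with the concern above about equality on non-primitive values.

```typescript
// Hypothetical shape of a resolved value; the PR leaves this to implementations.
interface ResolvedValue {
  supportsSelection(): boolean;
  matchesKey(key: string): boolean;
}

// _nomatch_: a resolved value for which selection always fails.
function makeNomatch(): ResolvedValue {
  return { supportsSelection: () => true, matchesKey: () => false };
}

// Setup: build the list _res_, one entry per selector expression.
function resolveSelectors(
  selectors: ReadonlyArray<() => ResolvedValue>,
  emitError: (message: string) => void,
): ResolvedValue[] {
  const res: ResolvedValue[] = [];
  for (const resolve of selectors) {
    const rv = resolve();
    if (rv.supportsSelection()) {
      res.push(rv);
    } else {
      emitError("Selection Error");
      res.push(makeNomatch());
    }
  }
  return res;
}

// Usage: a string value supports selection, an <img>-like value does not.
const textValue = (s: string): ResolvedValue => ({
  supportsSelection: () => true,
  matchesKey: (key) => key === s,
});
const imageValue: ResolvedValue = {
  supportsSelection: () => false,
  matchesKey: () => false,
};
const selectionErrors: string[] = [];
const resolved = resolveSelectors(
  [() => textValue("moo"), () => imageValue],
  (message) => { selectionErrors.push(message); },
);
// One flag per selector: does it match the key "moo"?
const flags = resolved.map((rv) => rv.matchesKey("moo"));
```

In this sketch the second selector reports one Selection Error and is replaced by a value that never matches, so only variants with a catch-all key in that position could still be selected.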

      2. Let _nomatch_ be a resolved value for which selection always fails.
      3. Append _nomatch_ as the last element of the list _res_.

The shape of the resolved values must be determined by each implementation,
Member

The normative must seems hard to enforce here. What is this text trying to ensure?

Collaborator Author

This is trying to be explicit about the parts of the selection process that are left for the implementation, and makes it clear that the spec explicitly leaves out the shape of any resolved values, or how to work with them.

My personal preference would be for the spec to be built in terms of explicit, well-defined resolved values, but @mihnita in particular has strongly pushed back on this. Unfortunately, this means that it's tricky to talk about e.g. what the value of $num is here:

let $num = {(1) :number minimumFractionDigits=1}
match {$num}
when one {You have {$num} thing}
when * {You have {$num} things}

I would find it much easier to say that $num resolves to an instance of a MessageNumber with a value 1 and an options bag { minimumFractionDigits: 1 }. The spec could then state: when a MessageNumber is used as a selector, this is how the matching works; when it is used in a placeholder, that is what you get when formatting to a string; and this other thing is the behaviour when formatting to parts.

But taking considerations of realpolitik into account, we appear to need to define the spec without any such MessageNumber constructs, and hence end up with this circumlocution around resolved values.
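
To make the idea above concrete, here is a hedged TypeScript sketch of what such a resolved number could look like. Everything in it is an assumption for illustration: `MessageNumber` is only a name used in this discussion, `toDecimalString`, `englishCategory`, and `matchesKey` are hypothetical helpers, and the plural logic is an English-only stand-in (category "one" requires an integer value of 1 with no visible fraction digits) for a real plural rules implementation.

```typescript
// Hypothetical: "MessageNumber" is only a name used in this discussion.
interface MessageNumber {
  value: number;
  options: { minimumFractionDigits?: number };
}

// Locale-independent decimal string, padded to minimumFractionDigits.
// Simplification: assumes String(value) is plain decimal (no exponent).
function toDecimalString(num: MessageNumber): string {
  const min = num.options.minimumFractionDigits ?? 0;
  const [int = "0", frac = ""] = String(num.value).split(".");
  const padded = frac + "0".repeat(Math.max(0, min - frac.length));
  return padded === "" ? int : `${int}.${padded}`;
}

// English-only stand-in for a plural rule: "one" needs integer 1
// and no visible fraction digits; everything else is "other".
function englishCategory(num: MessageNumber): "one" | "other" {
  return toDecimalString(num).replace(/^-/, "") === "1" ? "one" : "other";
}

// Selection: numeric keys are exact matches, other keys are categories.
function matchesKey(num: MessageNumber, key: string): boolean {
  if (/^-?\d+(\.\d+)?$/.test(key)) return Number(key) === num.value;
  return key === englishCategory(num);
}

const plain: MessageNumber = { value: 1, options: {} };
const withFraction: MessageNumber = {
  value: 1,
  options: { minimumFractionDigits: 1 },
};
```

Under these assumptions, `plain` matches the key `one`, while `withFraction` (which would be rendered as `1.0`) does not, so the options bag influences selection as well as formatting.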

Collaborator

> But taking considerations of realpolitik into account...

Hmm, we should stay focused on our objective technical arguments, doing our usual due diligence around them (multiple alternatives, pros & cons, eval criteria for a preference). I'm not sure that we're all so far apart in our thinking on this topic, so giving in to non-technical concerns slows us down in achieving our best work.

I would like to get a clearer picture on the previous discussion from @mihnita. It could be here, but also could be in a meeting. Maybe after hearing more details, it would help us identify the precise technical sticking points? And hopefully my response to the other comment helps add some context to the topic by identifying how we might be talking past each other.

spec/formatting.md
Comment on lines +50 to +52
Using _res_,
the Variants are iterated in source order and the following test is performed
to find one with all of its keys matching the Selector Expressions:
Member

This could be clearer. I think you're describing a greedy matcher in which the order of the variant matrix matters--it returns the first variant that matches all of the conditions. I would rephrase this (although note that I'm not in agreement with the matching described):

For each Variant, test if its VariantKeys match the values in res


Such a matcher requires the developer, translator, tooling, and runtime to keep the serialized order of the matrix intact end-to-end (including when various languages explode the matrix). I think this is an unnecessary burden that I would like to avoid.


As we've seen elsewhere, some values can match more than one value in the variant key set. For example, the value 1 in a plural matcher in the en locale can match both the value 1 and the keyword one. The value 1 is a better match, but not the only match. So:

let $foo = 1  // this will be for the plural
let $bar = (moo)
match {$foo $bar}
when one bar { no match because bar!=moo }
when one * { unfortunate match }
when 1 moo { we want this one }
... etc...

The description you have here would not work for plural selection because the value can be either 1 or the keyword one, but not both. In the current implementation, the plural formatter has both the array of VariantKeys and the value to evaluate against them.
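
The ordering concern in the example above can be reproduced with a small TypeScript sketch of a first-match selector. This is an illustration only, not the PR's algorithm text: `selectFirstMatch`, `Matcher`, and `Variant` are hypothetical names, and the two matchers assume a per-selector match function in which the value 1 accepts both the exact key `1` and the English category `one`.

```typescript
type Matcher = (key: string) => boolean;
interface Variant { keys: string[]; pattern: string }

// Greedy first match: return the first variant, in source order,
// whose every key is either the catch-all "*" or accepted by the
// matcher of the selector in the same position.
function selectFirstMatch(
  selectors: Matcher[],
  variants: Variant[],
): string | undefined {
  for (const variant of variants) {
    const allMatch = variant.keys.every((key, i) => {
      const matches = selectors[i];
      return key === "*" || (matches !== undefined && matches(key));
    });
    if (allMatch) return variant.pattern;
  }
  return undefined;
}

// Assumed matchers for $foo = 1 (English plural) and $bar = (moo).
const fooMatches: Matcher = (key) => key === "1" || key === "one";
const barMatches: Matcher = (key) => key === "moo";

const variants: Variant[] = [
  { keys: ["one", "bar"], pattern: "no match because bar!=moo" },
  { keys: ["one", "*"], pattern: "unfortunate match" },
  { keys: ["1", "moo"], pattern: "we want this one" },
];

const inSourceOrder = selectFirstMatch([fooMatches, barMatches], variants);
const reordered = selectFirstMatch(
  [fooMatches, barMatches],
  [...variants].reverse(),
);
```

With the variants in source order, the sketch selects the `one *` variant; with the order reversed, it selects the `1 moo` variant. That is the dependency on serialized order described in this comment.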

      3. Append _nomatch_ as the last element of the list _res_.

The shape of the resolved values must be determined by each implementation,
along with the manner of determining their support for selection.
Collaborator

The text is underspecified/unclear here and/or missing an important point. The text is saying that an implementation of MessageFormat 2 should determine whether a resolved value is valid for selection. But based on what @markusicu and others have been saying for a while, in fact, the responsibility for selecting goes together with formatting.

The example of plurals makes it obvious. A number can always match OTHER (aka *), and it can match an exact-match expression (e.g. =7). It can also match one of the plural categories: ZERO, ONE, TWO, FEW, MANY. (Side note: that reiterates Addison's comment about the care needed in matching.) The point I want to add and emphasize, on top of that, is that formatting affects the matching, as in the plural case. The number 1200000 in French matches the OTHER plural category, but 1.2M matches the plural category MANY.

Similarly, how to do matching is a concern that belongs alongside the formatting implementation for this type of formatting / this value type. Whether the strings =7, =1200000, *, or MANY are matched by the formatted numbers that I will serialize here as 1.2M and 1200000 requires a whole set of rules.

Another example of how the manner of matching can be specific to the selector/formatter or value type is how semantic versions get matched via "greater than or equal to" logic.

So at the least, it would help to call out:

  1. The input and formatted values are needed for matching
  2. The MF2 implementation should be invoking a selection function, and maybe it exposes the match function predicate it uses internally
  3. Given that formatted values are needed for matching, we should say that formatter functions are a prerequisite for selector functions

Collaborator Author

I'd like to push back a little on formatting being an explicit requirement for selection. In many cases I agree that formatting is an implicit practical requirement, but that's an implementation detail. The CLDR rules you link to depend on Plural Rule Operands, which at least in JS are calculated from a locale-independent formatting of the input value that's relatively easy to parse. For example, digit grouping is not done and the decimal separator is always a period (.).

Theoretically, it would be possible to determine these operand values without the intermediate formatting step, and in certain cases it might be possible to reuse the formatting if it happens to match the expected output for the current locale and options. But in practice the reasonable thing to do is to format & re-parse the input number within the plural selector, and separately format the input number for string output.

With the French millions case, for instance, the plural selection ends up depending on the formatting to 1.2M, while the formatting of a matching placeholder would end up as 1,2 M.
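
The "format and re-parse" approach described above can be sketched in TypeScript as follows. This is an assumption-laden illustration, not code from any implementation: `pluralOperands` and `PluralOperands` are hypothetical names, the input is assumed to be a plain decimal string with no grouping and a period as the decimal separator, and the compact-notation exponent operand (the one that matters for the French 1.2M case) is deliberately left out.

```typescript
// The plural operands n, i, v, w, f, t, derived from a decimal string.
interface PluralOperands {
  n: number; // absolute value of the source number
  i: number; // integer digits
  v: number; // count of visible fraction digits, with trailing zeros
  w: number; // count of visible fraction digits, without trailing zeros
  f: number; // visible fraction digits, with trailing zeros
  t: number; // visible fraction digits, without trailing zeros
}

// Hypothetical helper: parses a locale-independent decimal string.
// Compact-notation exponents are not handled in this sketch.
function pluralOperands(source: string): PluralOperands {
  const match = /^-?(\d+)(?:\.(\d+))?$/.exec(source);
  if (match === null) {
    throw new RangeError(`Not a plain decimal string: ${source}`);
  }
  const intDigits = match[1] ?? "0";
  const fracDigits = match[2] ?? "";
  const trimmed = fracDigits.replace(/0+$/, "");
  return {
    n: Math.abs(Number(source)),
    i: Number(intDigits),
    v: fracDigits.length,
    w: trimmed.length,
    f: fracDigits === "" ? 0 : Number(fracDigits),
    t: trimmed === "" ? 0 : Number(trimmed),
  };
}

const operands = pluralOperands("1.50");
```

For the input `"1.50"` the sketch yields `i = 1`, `v = 2`, `w = 1`, `f = 50`, and `t = 5`, which shows why the string form (and not just the numeric value 1.5) carries information needed for selection.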

Collaborator

> I'd like to push back a little on formatting being an explicit requirement for selection. In many cases I agree that formatting is an implicit practical requirement, but that's an implementation detail. ... [PluralRuleOperands] at least in JS are calculated from a locale-independent formatting of the input value that's relatively easy to parse. For example, digit grouping is not done and the decimal digit is always a period .. ... With the French millions case, for instance, the plural selection ends up depending on the formatting to 1.2M, while the formatting of a matching placeholder would end up as 1,2 M.

We might be partially talking past each other due to overloaded terminology, and there's still an unaddressed point.

Formatting usually means returning a string, but in ICU, when formatting type X, an intermediate representation produced after some locale-specific processing is sometimes called FormattedX. This intermediate pre-processed state occurs before the formatting symbols are applied to get the final formatted string. (E.g., FormattedNumber is intermediate: it still has a toString() method, but can also be an input to selection in PluralRules.select().) Instead of using a string adhering to a grammar, as in JS, ICU4X uses a more structured type (FixedDecimal) for the intermediate value, which can still be used for both selection and formatting to string. That avoids the redundant parsing of strings to reconstruct that information, which the JS string approach requires.

This preprocessing step is still usually handled by a formatter, so we still have that as a dependency in this example. In practice, the logic for selection is going to be closely related to the logic for formatting. These things are often intertwined. The other, higher-level point that still seems unaccounted for is that how the selection occurs is nontrivial (it's not string equality).

The spec still should be specific on who or how that selection should be done. As it stands currently, it's underspecified. And the text instead asserts to the effect that "the implementation will determine ...", which implies the MF2 implementation. But that's not the appropriate level of responsibility for matching / selection, which should be closer to the formatting.

Collaborator Author

> We might be partially talking past each other due to overloaded terminology

Yeah, it's sounding like what I'm referring to as a "resolved value" could be represented in ICU4J by the FormattedX entities. At least in my headcanon, the whole process of what could be called "formatting" makes more sense to split into two: "resolution" and "formatting". In the first, you gather up all the information you need in order to do e.g. formatting or selection, and in the second you take that information and you emit a value in the final representation that you need.

> [...] In practice, the logic for selection is going to be closely related to the logic for formatting. These things are often intertwined. [...]

Definitely agree, for plural selection.

> The spec still should be specific on who or how that selection should be done. As it stands currently, it's underspecified. And the text instead asserts to the effect that "the implementation will determine ...", which implies the MF2 implementation. But that's not the appropriate level of responsibility for matching / selection, which should be closer to the formatting.

While I agree that the behaviour of a plural selector (be it :number or :plural) must be well defined for use in an implementation, I do not think that definition belongs in this specification.

Do you think we ought to include something like what I mention in this comment in the MF2 spec, though? As I understand it, a FormattedNumber should work as an implementation of a "MessageNumber".

Collaborator

> in my headcanon ... In ["resolution"], you gather up all the information you need in order to do e.g. formatting or selection...

As far as confusing terminology goes, "resolution" still sounds a little too vague. This initial preprocessing step during the overall work of "formatting" (input value -> string) also depends on locale information. E.g., for compact notation numbers, the exponent that you use is informed by the grouping strategy for numbers in the locale. (@eggrobin, did I get that example right?)

> The spec still should be specific on who or how that selection should be done. As it stands currently, it's underspecified. And the text instead asserts to the effect that "the implementation will determine ...", which implies the MF2 implementation. But that's not the appropriate level of responsibility for matching / selection, which should be closer to the formatting.

> While I agree that the behaviour of a plural selector (be it :number or :plural) must be well defined for use in an implementation, I do not think that definition belongs in this specification.

To clarify, I'm not saying that we should define the behavior of plural selector in the spec. What I'm saying is that this PR codifies an algorithm for a first match strategy, but the notion of "match" between a selector value and variant key is underspecified. We are in agreement that the notion of match has to be implemented separately for each type of selector, and those implementation details are not a concern for the spec text. But what I am saying is that we also clearly can't leave the story at that, since the proposed algorithm in spec text is built on top of an assumption of a specific notion of match (equality, ex: string equality), and we know from the plural selector example that that specific notion is insufficient to cover all cases, so it needs to be generalized.

And so, I think we already agree that there is a clear connection there between the proposed high-level algorithm for variant key selection in MF2 and the selector-specific notion of match. In order to resolve the problem that I described above, just as we have done before when designing for things like formatting functions, the proper way to achieve both:

  1. generalizing the notion of matching from simple string equality comparison predicate function to an impl-specific predicate function
  2. decoupling those impl-specific notions of matching from the high-level algorithm

...is to have an interface representing the selection predicate function.

That is the specificity that the current proposed algorithm needs. Without doing so, the algorithm text in the PR here will ignore a design problem that we know that we have. Having an interface to represent a specific selector's impl-specific behavior achieves proper simplicity through decoupling that makes for a good design around this problem.

> Do you think we ought to include something like what I mention in this comment in the MF2 spec, though? As I understand it, a FormattedNumber should work as an implementation of a "MessageNumber".

I think what you were mentioning in that comment is the same as what I have been saying so far with "selection depends on formatting". But maybe that wasn't clear because we used different words to describe the same structured pre-processed form ("formatting" as in .formatToParts() in ECMA-402 vs. "resolved value").

The way you describe it leaves me unsure whether we agree on how best to design for this. Rather than trying to specify the type of the value in the pre-processed structured form (can we call this form the "Preformatting Structured Parts" maybe?), I think we could once again define an interface for the function that returns the value. That would allow us to decouple separate concerns cleanly and simply. But I think we agree in principle, if I understand correctly.

And it would help us all to eventually find some consistent precise naming, too.
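
The interface-based decoupling proposed in this comment could look roughly like the TypeScript sketch below. All names (`Selector`, `keyMatches`, `stringSelector`, `minVersionSelector`, `compareVersions`) are hypothetical, the version comparison is a simplified numeric comparison and not a full semantic-versioning implementation, and treating `*` as a catch-all in the high-level helper is an assumption of this sketch.

```typescript
// Hypothetical interface: each selector supplies its own notion of "match".
interface Selector<T> {
  matches(value: T, key: string): boolean;
}

// High-level helper that knows nothing about how matching works.
function keyMatches<T>(selector: Selector<T>, value: T, key: string): boolean {
  return key === "*" || selector.matches(value, key);
}

// Trivial case: string equality.
const stringSelector: Selector<string> = {
  matches: (value, key) => value === key,
};

// Simplified dotted-number comparison; negative if a < b, positive if a > b.
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Non-trivial case: a key matches when value >= key.
const minVersionSelector: Selector<string> = {
  matches: (value, key) => compareVersions(value, key) >= 0,
};

const newEnough = keyMatches(minVersionSelector, "2.1.0", "2.0");
const tooOld = keyMatches(minVersionSelector, "1.9.0", "2.0");
```

The point of the design is that `keyMatches` stays the same whether matching means equality, "greater than or equal to", or plural category lookup; only the `Selector` implementation changes.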

this selection will always succeed.
Variants after one with all catch-all keys will never be selected.

### Examples
Collaborator

We should have an example with a plural/number formatting+selection to address the comments regarding implicit assumptions about matching.

Collaborator

More specifically, the intention here is that when we include an example of a plurals selection message, we have a real use case where matching is not as simple as just string comparison. We should be able to run it through the algorithm described above and verify that everything we need is accounted for in the text.

Collaborator Author

eemeli commented Feb 2, 2023

@aphillips:
> Perhaps it would be better to think of match as an operator and feeding it the when statements in order to return a single pattern string? We can still debate whether the match is an ordered greedy one (as you have here) or seeks the "best match". I think it would work better and it would allow selector functions to define "match" however they need to.

The outcome of the resolution meetings with the CLDR-TC a year ago included this on selection:

  1. Selecting variant messages based on selectors
    a. Use a first-match approach.
    b. Any special ordering needs to be done on the tooling side.

Effectively, this means that unlike in MF1, in MF2 the order of the variants defines their precedence. So an MF1 message

{num, plural, one{ONE} =1{1!} other{OTHER}}

would need to be represented like this in MF2

match {$num :plural}
when 1 {1!}
when one {ONE}
when * {OTHER}

in order to keep the same precedence of the =1 exact match over the one category match.

I would very strongly prefer not reopening this particular decision.
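
As a sketch of the kind of tooling-side ordering this implies, the following TypeScript reorders MF1 plural cases so that exact matches come first, plural categories next, and `other` last (as `*`). The names `Mf1Case`, `rank`, and `toMf2Variants` are hypothetical, and the sketch only emits `when` lines for a single selector; it is not a converter used by any implementation.

```typescript
// One MF1 plural case, e.g. { selector: "=1", text: "1!" }.
interface Mf1Case { selector: string; text: string }

// Precedence: exact matches, then plural categories, then the fallback.
function rank(selector: string): number {
  if (selector.startsWith("=")) return 0;
  return selector === "other" ? 2 : 1;
}

// Hypothetical helper: emit MF2 "when" lines in first-match order.
function toMf2Variants(cases: Mf1Case[]): string[] {
  return cases
    .map((c, index) => ({ c, index }))
    // Keep the original order among cases of equal rank.
    .sort((a, b) => rank(a.c.selector) - rank(b.c.selector) || a.index - b.index)
    .map(({ c }) => {
      let key = c.selector;
      if (key.startsWith("=")) key = key.slice(1);
      else if (key === "other") key = "*";
      return `when ${key} {${c.text}}`;
    });
}

// The MF1 message from this comment: {num, plural, one{ONE} =1{1!} other{OTHER}}
const whenLines = toMf2Variants([
  { selector: "one", text: "ONE" },
  { selector: "=1", text: "1!" },
  { selector: "other", text: "OTHER" },
]);
```

For the MF1 message above, the sketch produces the three `when` lines in the same order as the MF2 message shown in this comment.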

spec/formatting.md
1. Return False to indicate failure.
2. Return True to indicate success.

The manner of testing _key_ against _sel_ must be defined by each implementation.
Collaborator

This is the place in the text that I was referring to in the earlier comment. It seems to suggest that the matching / testing of the runtime value element _sel_ to the element _key_ of the VariantKey, should be defined by the implementation, which sounds like implementation of MF2. If not, then the wording is confusing.

And regardless, we still want to be more specific about what part of MF2 is responsible for the matching logic of _sel_ to _key_ since we know from plurals that we can't assume the trivial case (value equality comparison) to be sufficient.

@eemeli eemeli marked this pull request as draft February 14, 2023 11:12
@eemeli eemeli removed the Agenda+ Requested for upcoming teleconference label Feb 14, 2023
Collaborator Author

eemeli commented Feb 14, 2023

Marked this as a draft to indicate that yesterday's MFWG call identified at least the following dependencies for this PR, which will need to be resolved in separate issues/PRs:

  • A potential reconsideration of a "best match" approach, as opposed to our current "first match" selection
  • Including an explicit definition of resolved/preprocessed/intermediate values as a spec-internal utility interface

Once those have been resolved, this PR may need to be correspondingly updated to match.

Collaborator Author

eemeli commented Mar 29, 2023

Closing; will iterate on this and open a new PR with a column-first selection method.

@eemeli eemeli closed this Mar 29, 2023
@eemeli eemeli deleted the selection branch March 29, 2023 09:38
3 participants