Initial AST dump #2

zbraniecki · 2020-06-17T22:12:57Z

For the initial work, I suggest we take the fluent-rs AST: https://github.com/projectfluent/fluent-rs/blob/master/fluent-syntax/src/ast.rs

and design a vastly simplified subset of it that captures a single Message.

Something along the lines of:

pub struct Message {
    pub value: Pattern,
    pub comment: Option<String>,
}

pub struct Pattern {
    pub elements: Vec<PatternElement>,
}

pub enum PatternElement {
    TextElement(String),
    Placeable(Expression),
}

pub struct Variant {
    pub key: VariantKey,
    pub value: Pattern,
    pub default: bool,
}

pub enum VariantKey {
    Identifier(Identifier),
    NumberLiteral(String),
}

pub enum InlineExpression {
    StringLiteral {
        value: String,
    },
    NumberLiteral {
        value: String,
    },
    FunctionReference {
        id: String,
        argument: Option<Identifier>,
    },
    VariableReference {
        id: Identifier,
    },
}

pub struct Identifier {
    pub name: String,
}

pub enum Expression {
    InlineExpression(InlineExpression),
    SelectExpression {
        selector: InlineExpression,
        variants: Vec<Variant>,
    },
}

zbraniecki · 2020-06-17T22:23:18Z

Should we overall start with AST or EBNF? Fluent's EBNF is here: https://github.com/projectfluent/fluent/blob/master/spec/fluent.ebnf

zbraniecki · 2020-06-17T23:48:43Z

I chose this subset because I think it captures the essence of multiple valuable traits of Fluent that I would like to offer for consideration for MF 2.0:

- It encodes a single Pattern as a list of textual parts and placeables (see Placeable vs Placeholder vs ? #4)
- It allows placeables to be selector expressions or inline expressions
- Developer/Env provided functions can be used in the selector or in the inline

This lets each environment do:

// Function as a formatter
Today is { DATETIME($now) }.

and

// Function as a selector
You have { PLURAL($emailCount) ->
    [one] one email
  *[other] { $emailCount } emails
}

which addresses the part of the MF2.0 purpose of "being more flexible" - unicode-org/message-format-wg#84

In particular, it makes PLURAL just one of many possible formatters/selectors, ensuring that any system that supports PLURAL will support all functions.
I'm not strongly opinionated on whether functions used as formatters and as selectors should be the same thing, but I haven't found a reason for them not to be, so I'm initially offering them as the same AST node.
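
To make the mapping concrete, here is a rough sketch of how the "function as a selector" example above might be encoded with the proposed subset. The type definitions repeat the ones from the first comment (the derives are added only for convenience); the helpers `ident`, `text` and `email_count_message` are hypothetical names used for illustration, not part of fluent-rs.

```rust
// Types repeated from the proposed subset; derives added for convenience.
#[derive(Debug, Clone, PartialEq)]
pub struct Identifier { pub name: String }

#[derive(Debug, Clone, PartialEq)]
pub enum InlineExpression {
    StringLiteral { value: String },
    NumberLiteral { value: String },
    FunctionReference { id: String, argument: Option<Identifier> },
    VariableReference { id: Identifier },
}

#[derive(Debug, Clone, PartialEq)]
pub enum VariantKey { Identifier(Identifier), NumberLiteral(String) }

#[derive(Debug, Clone, PartialEq)]
pub struct Variant { pub key: VariantKey, pub value: Pattern, pub default: bool }

#[derive(Debug, Clone, PartialEq)]
pub enum Expression {
    InlineExpression(InlineExpression),
    SelectExpression { selector: InlineExpression, variants: Vec<Variant> },
}

#[derive(Debug, Clone, PartialEq)]
pub enum PatternElement { TextElement(String), Placeable(Expression) }

#[derive(Debug, Clone, PartialEq)]
pub struct Pattern { pub elements: Vec<PatternElement> }

#[derive(Debug, Clone, PartialEq)]
pub struct Message { pub value: Pattern, pub comment: Option<String> }

// Hypothetical helpers (not in the original proposal) to cut boilerplate.
fn ident(name: &str) -> Identifier { Identifier { name: name.to_string() } }
fn text(s: &str) -> PatternElement { PatternElement::TextElement(s.to_string()) }

/// Builds: You have { PLURAL($emailCount) -> [one] one email *[other] { $emailCount } emails }
pub fn email_count_message() -> Message {
    let one = Variant {
        key: VariantKey::Identifier(ident("one")),
        value: Pattern { elements: vec![text("one email")] },
        default: false,
    };
    let other = Variant {
        key: VariantKey::Identifier(ident("other")),
        value: Pattern {
            elements: vec![
                PatternElement::Placeable(Expression::InlineExpression(
                    InlineExpression::VariableReference { id: ident("emailCount") },
                )),
                text(" emails"),
            ],
        },
        default: true, // corresponds to the `*` marker in the textual form
    };
    Message {
        value: Pattern {
            elements: vec![
                text("You have "),
                PatternElement::Placeable(Expression::SelectExpression {
                    selector: InlineExpression::FunctionReference {
                        id: "PLURAL".to_string(),
                        argument: Some(ident("emailCount")),
                    },
                    variants: vec![one, other],
                }),
            ],
        },
        comment: Some("Function as a selector".to_string()),
    }
}
```

The formatter case (`DATETIME($now)`) would use the same `FunctionReference` node, only wrapped in `Expression::InlineExpression` instead of being the selector of a `SelectExpression`.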

echeran · 2020-06-18T00:50:09Z

I have some comments:

I have been thinking (Parser #3) about the API input data in a way that I think allows us to decouple serialization concerns (file -> syntax -> parsing -> AST). EBNF seems like a cleaner, better approach to address serialization concerns, but the task of structuring the data comes first, I think.
Placeable -> Placeholder :-P
Could the type of Variant.value be Message? I think that better captures the relationship of Variant being a superset/wrapper of a Message in a particular situation -- when that message belongs to a group of messages connected to each other as the "cases" of a "switch/case". And of course, the "switch/case" is implicitly triggered when the placeholders' types have enumerated categories of values (plurals, gender, etc.)
I think we want to generalize the notion of Variant to support the possibility of a Message having more than one placeholder that triggers the "switch/case" behavior. If we have 2 plurals, or a plural and a gender, etc. in one message, then our "cases" correspond to the Cartesian product of the possible values that the placeholders can take on (ex: #{ [ONE, female], [ONE, male], [ONE, other], [OTHER, female], [OTHER, male], [OTHER, other] } ). So instead of Variant.key: VariantKey, maybe Variant.case_vals: HashMap<Identifier, String>? This assumes that we ensure that there is a concept of Placeholder that has a field of type Identifier. And if that makes sense so far, in this scenario I'm describing, the "switch" (select) part of the "switch/case" scenario to which Variant belongs is implicitly defined by the use of placeholders whose types take on a finite enumerated set of values. Maybe the "switch" (select) should be explicit, in which case are we able to support 2 select placeholders/args/vars of different types (ex: one plural, one gender)?
There are other aspects of placeholders that we should consider including as properties of the placeholder, which is how Okapi handles it:
- what relative position type the placeholder has -- standalone or the open or close of a pair
- what function type of placeholder - this could be a different way to encode the selector (plural, gender, free-form, etc.) as "PlaceholderType" in a way that's directly attached to the placeholder
  - I'm not sure if or how often Fluent's selector functions operate > 1 placeholder. The alternate way I'm suggesting here assumes that the formatting function operates on just 1 placeholder, and the formatting fn is determined by the placeholder type attached to a Placeholder.
- my descriptions of Okapi's placeholder's fields stink, but the word "type" is over-overloaded in the source code
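
The Cartesian product of case values mentioned above can be sketched roughly as follows. This is only an illustration: `case_combinations` is a hypothetical helper, placeholder names are plain strings rather than `Identifier`, and a `BTreeMap` stands in for the suggested `HashMap` so that the output order is deterministic.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: enumerates every combination of category values
/// for a list of (placeholder name, possible categories) pairs.
/// Each resulting map plays the role of a `Variant.case_vals` value.
pub fn case_combinations(selectors: &[(&str, &[&str])]) -> Vec<BTreeMap<String, String>> {
    // Start with a single empty combination and extend it one selector at a time.
    let mut combos = vec![BTreeMap::new()];
    for (name, categories) in selectors {
        let mut next = Vec::with_capacity(combos.len() * categories.len());
        for combo in &combos {
            for category in categories.iter() {
                let mut extended = combo.clone();
                extended.insert(name.to_string(), category.to_string());
                next.push(extended);
            }
        }
        combos = next;
    }
    combos
}
```

For one plural placeholder with categories `one`/`other` and one gender placeholder with `female`/`male`/`other`, this yields the six cases listed above, which also shows how quickly the number of variants grows as selectors are added.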

I'll stop there, and hopefully some of that makes sense. I may have misunderstood things about Fluent, so please correct (and @mihnita, chime in on corrections).

filmil · 2020-06-18T00:59:08Z

Should we overall start with AST or EBNF? Fluent's EBNF is here: https://github.com/projectfluent/fluent/blob/master/spec/fluent.ebnf

How about if we started by example, in terms of the use cases we'd like to handle? I personally find it hard to figure out whether an AST or EBNF actually supports what I'd like to do by staring at a wall of text. :)

filmil · 2020-06-18T01:15:58Z

I have been thinking (Parser #3) about the API input data in a way that I think allows us to decouple serialization concerns (file -> syntax -> parsing -> AST). EBNF seems like a cleaner, better approach to address serialization concerns, but the task of structuring the data comes first, I think.

IMHO, examples of what we want structured come even before that.

I think we want to generalize the notion of Variant to support the possibility of a Message having more than one placeholder that triggers the "switch/case" behavior. [...] Maybe the "switch" (select) should be explicit, in which case are we able to support 2 select placeholders/args/vars of different types (ex: one plural, one gender)?

There is a practical point to keeping the "computational" part of a format string separate from human-readable (human-translatable) string as well.

At some point (looking back to the ICU conference last October), it seemed to make sense to separate out parameter binding, values based on those parameters and pattern matching. Especially because I'd like to expand the set of possible transformations beyond plural and gender into inflections and then things get increasingly more interesting.

* I'm not sure if or how often Fluent's selector functions operate > 1 placeholder.  The alternate way I'm suggesting here assumes that the formatting function operates on just 1 placeholder, and the formatting fn is determined by the placeholder type attached to a `Placeholder`.

zbraniecki · 2020-06-18T08:21:58Z

Could the type of Variant.value be Message? I think that better captures the relationship of Variant being a superset/wrapper of a Message in a particular situation -- when that message belongs to a group of messages connected to each other as the "cases" of a "switch/case". And of course, the "switch/case" is implicitly triggered when the placeholders' types have enumerated categories of values (plurals, gender, etc.)

The main difference, in this mini-AST, would be that then each variant could have its own comments. I don't know if there's a value to that?

I think we want to generalize the notion of Variant to support the possibility of a Message having more than one placeholder that triggers the "switch/case" behavior.

Good point. We can achieve it by doing:

#[derive(Debug, PartialEq)]
pub struct Variant {
-    pub key: VariantKey,
+    pub key: Vec<VariantKey>,
    pub value: Pattern,
    pub default: bool,
}

pub enum Expression {
    InlineExpression(InlineExpression),
    SelectExpression {
-        selector: InlineExpression,
+        selector: Vec<InlineExpression>,
        variants: Vec<Variant>,
    },
}

Does it sound good?

I'm not sure if or how often Fluent's selector functions operate > 1 placeholder.

They don't yet in Fluent :( We so far only got to do it via nested selectors:

key = { PLURAL($num) ->
    [one] { GENDER($user) ->
        [masculine] Foo
       *[other] Bar
    }
   *[other] Baz
}

and plan to get back to flattening selectors here: projectfluent/fluent#4 to get

key = { PLURAL($num), GENDER($user) ->
    [one, masculine] Foo
    [one, other] Bar
   *[other] Baz
}

or

key = { PLURAL($num), GENDER($user) ->
    [one, masculine] Foo
    [one, *other] Bar
    [*other] Baz
}

I believe we should support the flattened approach in MF 2.0.
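
As a rough sketch of how resolution over flattened selectors could work, the code below models the second form above, where each key can carry its own `*` default marker. Everything here is an assumption made for illustration: the `Key` type, the `select` function, the "first matching variant wins" rule, and the rule that a variant with fewer keys than selectors matches on a prefix. `VariantKey` and `Pattern` are reduced to strings to keep it short.

```rust
/// One key of a flattened variant; `default` corresponds to a `*` marker.
#[derive(Debug, Clone, PartialEq)]
pub struct Key { pub name: String, pub default: bool }

/// Simplified variant: the pattern is reduced to a plain string.
#[derive(Debug, Clone, PartialEq)]
pub struct Variant { pub keys: Vec<Key>, pub value: String }

/// Returns the first variant whose keys all match the resolved categories.
/// A key matches if it equals the category or is marked as default.
/// Assumption: a variant with fewer keys than selectors matches on a prefix.
pub fn select<'a>(variants: &'a [Variant], resolved: &[&str]) -> Option<&'a Variant> {
    variants.iter().find(|v| {
        v.keys.len() <= resolved.len()
            && v.keys.iter().zip(resolved).all(|(k, r)| k.default || k.name == *r)
    })
}

// Hypothetical helper for building variants from (name, is_default) pairs.
fn variant(keys: &[(&str, bool)], value: &str) -> Variant {
    Variant {
        keys: keys
            .iter()
            .map(|(name, default)| Key { name: name.to_string(), default: *default })
            .collect(),
        value: value.to_string(),
    }
}

/// Encodes: [one, masculine] Foo / [one, *other] Bar / [*other] Baz
pub fn example_variants() -> Vec<Variant> {
    vec![
        variant(&[("one", false), ("masculine", false)], "Foo"),
        variant(&[("one", false), ("other", true)], "Bar"),
        variant(&[("other", true)], "Baz"),
    ]
}
```

Under these assumptions, `select(&example_variants(), &["one", "feminine"])` picks the `Bar` variant, the same result the nested form gives, because the second key of `[one, *other]` is a default.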

zbraniecki · 2020-06-18T08:23:49Z

@stasm

stasm · 2020-06-22T14:00:04Z

Some high-level thoughts about the things mentioned in this thread so far:

Should we overall start with AST or EBNF? Fluent's EBNF is here: https://github.com/projectfluent/fluent/blob/master/spec/fluent.ebnf

I'd suggest starting with the data model alone. No parsing, no EBNF. I think the prototype should be a vehicle for discussion about semantics, use-cases and requirements rather than about the syntax.

It allows placeables to be selector expressions or inline expressions

It would be interesting to experiment with a different approach than the one we know from MessageFormat and Fluent where select expressions go into placeables. I mean the approach where the branching logic happens first, before patterns are defined. I call this the exploded message approach; I'm sure there are better names ;)

Rather than allow (text 1), (select with text 2a, text 2b, text 2c), (text 3), the exploded approach would encode the translation as (select with text 1, text 2, text 3).
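
A minimal sketch of the difference, assuming a single selector and with patterns reduced to plain strings: the hypothetical `explode` function below takes the inline shape (prefix, select variants, suffix) and produces the exploded shape, where every case carries a complete text. The names `Case`, `ExplodedMessage` and `explode` are made up for this illustration.

```rust
/// One branch of an exploded message: a key plus the complete text.
#[derive(Debug, Clone, PartialEq)]
pub struct Case { pub key: String, pub pattern: String }

/// "Exploded" message: the branching happens first, and every case
/// holds a full pattern rather than a fragment of one.
#[derive(Debug, Clone, PartialEq)]
pub struct ExplodedMessage { pub selector: String, pub cases: Vec<Case> }

/// Converts the inline shape (prefix, select with fragments, suffix)
/// into the exploded shape by repeating the shared text in each case.
pub fn explode(
    prefix: &str,
    selector: &str,
    variants: &[(&str, &str)],
    suffix: &str,
) -> ExplodedMessage {
    ExplodedMessage {
        selector: selector.to_string(),
        cases: variants
            .iter()
            .map(|(key, fragment)| Case {
                key: key.to_string(),
                pattern: format!("{}{}{}", prefix, fragment, suffix),
            })
            .collect(),
    }
}
```

The trade-off is visible in the output: shared text is duplicated in every case, but each case is a whole sentence that can be translated on its own.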

The main difference, in this mini-AST, would be that then each variant could have its own comments. I don't know if there's a value to that?

I think there is! In fact, I think it would be interesting to consider what happens if all or most data nodes can have meta data attached to them. Things like: context, comments, examples, whether it can be translated, whether it can be re-positioned in the sentence, which grammatical case is used, etc.
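
One possible way to picture this, purely as a sketch: a generic wrapper that pairs any data node with optional metadata. The `Meta` fields and the `Annotated` type are hypothetical and cover only a few of the kinds of metadata listed above.

```rust
/// Hypothetical metadata record; every field is optional.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Meta {
    pub comment: Option<String>,
    pub example: Option<String>,
    pub translatable: Option<bool>,
}

/// Wraps any data node (message, variant, placeholder, ...) with metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct Annotated<T> {
    pub node: T,
    pub meta: Meta,
}

impl<T> Annotated<T> {
    /// Wraps a node with empty metadata.
    pub fn plain(node: T) -> Self {
        Annotated { node, meta: Meta::default() }
    }

    /// Attaches a comment, builder style.
    pub fn with_comment(mut self, comment: &str) -> Self {
        self.meta.comment = Some(comment.to_string());
        self
    }
}
```

With a wrapper like this, a per-variant comment becomes `Annotated<Variant>` without changing the `Variant` type itself.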

filmil · 2020-06-22T17:19:35Z

Should we overall start with AST or EBNF? Fluent's EBNF is here: https://github.com/projectfluent/fluent/blob/master/spec/fluent.ebnf

I'd suggest starting with the data model alone. No parsing, no EBNF. I think the prototype should be a vehicle for discussion about semantics, use-cases and requirements rather than about the syntax.

Thinking aloud: is there a requirement that MessageFormat 2.0 be encodable as a string? If it were encoded as a struct, it seems like the parsing machinery would not even be needed; or could reuse existing generic parsers like YAML.

zbraniecki · 2020-06-22T18:23:38Z

Rather than allow (text 1), (select with text 2a, text 2b, text 2c), (text 3), the exploded approach would encode the translation as (select with text 1, text 2, text 3).

I agree that it would be interesting to try that. But we need to answer the question about nested selections in such a case.

What happens when you have a PLURAL, GENDER selector and GENDER differs only in category one? What happens when you have three selectors (I know, edge case), say, PLURAL, PLURAL, GENDER?

I think there is! In fact, I think it would be interesting to consider what happens if all or most data nodes can have meta data attached to them. Things like: context, comments, examples, whether it can be translated, whether it can be re-positioned in the sentence, which grammatical case is used, etc.

This may be relatively easy to represent in the datamodel, but may be very very hard to represent in textual form. Maybe it's ok to have a more open datamodel, and let the textual representation be capable of expressing just some of the metadata.

Thinking aloud: is there a requirement that MessageFormat 2.0 be encodable as a string? If it were encoded as a struct, it seems like the parsing machinery would not even be needed; or could reuse existing generic parsers like YAML.

We're not certain yet. For now we focus on non-textual representation, but I expect that for the Web usage we'll want a resource format, similarly to how we don't encode CSS in JSON/YAML, but rather have its own dedicated textual format.
There are many reasons why YAML/JSON is not really the best target for an l10n resource format, and I think we'll want to have an l10n-tailored one later on, maybe even multiple, but the one that gets standardized for the Web is likely to be the dominant one for the foreseeable future.

Bottom line is - I think for now we should focus on AST and data model, but the way we imagine what we want to express should take into account that one day we'll want to express it in a human-readable/writable format.

zbraniecki · 2020-06-22T20:51:18Z

I opened #6 to discuss AST of selectors vs placeholders.

stasm · 2020-06-22T20:57:23Z

What happens when you have a PLURAL, GENDER selector and GENDER differs only in category one? What happens when you have three selectors (I know, edge case), say, PLURAL, PLURAL, GENDER?

That's a great question, and I think it's something we can answer with a prototype :) Thanks for filing #6, I'll continue there.

mihnita · 2020-07-10T01:19:53Z

each variant could have its own comments. I don't know if there's a value to that?

I think there is value.

mihnita · 2020-07-10T01:23:13Z

Should we overall start with AST or EBNF? Fluent's EBNF is here:

TLDR: I am with stasm@ on this one

"I'd suggest starting with the data model alone. No parsing, no EBNF. I think the prototype should be a vehicle for discussion about semantics, use-cases and requirements rather than about the syntax."

So just data model + examples to show that it works.

I think that EBNF focuses too much on the syntax part.

It says stuff like:

  foo := '[' listItems ']';
  listItems := item [',' listItems];

when what we want is really: foo is an array of item(s)

If we look at the EBNF doc used by
https://www.ics.uci.edu/~pattis/ICS-33/lectures/ebnf.pdf
they have a section named "1.6 Syntax versus Semantics" that starts with
"EBNF descriptions specify only syntax: the form in which something is written.
They do not specify semantics: the meaning of what is written"

So in this respect the rust code is more readable:

 pub elements: Vec<PatternElement>,

(or the same thing in proto syntax, repeated PatternElement elements)

mihnita · 2020-07-10T01:32:03Z

is there a requirement that MessageFormat 2.0 be encodable as a string

I think it is. But likely not at this stage.
My hope is that we can come up with a data model, and then define one / several string representations.

That would have several benefits:

- Would allow Fluent / Message / FBT to update the current syntax (if the data model is close enough :-), instead of throwing it all away. Nice for migration
- Would allow us to define syntaxes that are friendlier to the framework used (for example JS can have something json like)
- The unique data model would allow converting between formats (Fluent <=> FBT <=> MessageFormat) and map between any format and the LIOM / XLIFF. So Mozilla can write a Fluent <=> data model filter, ICU a MS <=> data model filter, and we have a common XLIFF <=> data model filter.

zbraniecki mentioned this issue Jun 18, 2020

Encode examples #5

Open

zbraniecki mentioned this issue Jun 22, 2020

Selector vs Placeholder #6

Open

zbraniecki mentioned this issue Jul 9, 2020

Add AST and en/pl examples #7

Open
