Initial AST dump #2

Open
zbraniecki opened this issue Jun 17, 2020 · 15 comments


@zbraniecki
Owner

zbraniecki commented Jun 17, 2020

For the initial work, I suggest we take the fluent-rs AST: https://github.com/projectfluent/fluent-rs/blob/master/fluent-syntax/src/ast.rs

and design a vastly simplified subset of it that captures a single Message.

Something along the lines of:

pub struct Message {
    pub value: Pattern,
    pub comment: Option<String>,
}

pub struct Pattern {
    pub elements: Vec<PatternElement>,
}

pub enum PatternElement {
    TextElement(String),
    Placeable(Expression),
}

pub struct Variant {
    pub key: VariantKey,
    pub value: Pattern,
    pub default: bool,
}

pub enum VariantKey {
    Identifier(Identifier),
    NumberLiteral(String),
}

pub enum InlineExpression {
    StringLiteral {
        value: String,
    },
    NumberLiteral {
        value: String,
    },
    FunctionReference {
        id: String,
        argument: Option<Identifier>,
    },
    VariableReference {
        id: Identifier,
    },
}

pub struct Identifier {
    pub name: String,
}

pub enum Expression {
    InlineExpression(InlineExpression),
    SelectExpression {
        selector: InlineExpression,
        variants: Vec<Variant>,
    },
}
@zbraniecki
Owner Author

Should we overall start with AST or EBNF? Fluent's EBNF is here: https://github.com/projectfluent/fluent/blob/master/spec/fluent.ebnf

@zbraniecki
Owner Author

I chose this subset because I think it captures the essence of multiple valuable traits of Fluent that I would like to offer for consideration for MF 2.0:

  • It encodes a single Pattern as a list of textual parts and placeables (see Placeable vs Placeholder vs ? #4)
  • It allows placeables to be selector expressions or inline expressions
  • Developer/Env provided functions can be used in the selector or in the inline

This allows each environment to do:

// Function as a formatter
Today is { DATETIME($now) }.

and

// Function as a selector
You have { PLURAL($emailCount) ->
    [one] one email
  *[other] { $emailCount } emails
}

which addresses the part of the MF2.0 purpose of "being more flexible" - unicode-org/message-format-wg#84

In particular, it makes PLURAL just one of many possible formatters/selectors, ensuring that any system that supports PLURAL will support all functions.
I'm not strongly opinionated on whether functions as formatters and functions as selectors should be the same thing, but I haven't found a reason for them not to be, so I'm initially offering them as the same AST node.
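
To make the selector example concrete, here is a rough sketch of how that message might be built with the mini-AST proposed at the top of this issue. The type definitions are copied from that proposal (with derives added); the helper names `ident`, `text` and `email_message` are made up for illustration, and reading `default: true` as the `*` marker is an assumption.

```rust
// Types copied from the proposed mini-AST (only the parts needed here).
#[derive(Debug, PartialEq)]
pub struct Identifier { pub name: String }

#[derive(Debug, PartialEq)]
pub enum InlineExpression {
    FunctionReference { id: String, argument: Option<Identifier> },
    VariableReference { id: Identifier },
}

#[derive(Debug, PartialEq)]
pub enum VariantKey { Identifier(Identifier), NumberLiteral(String) }

#[derive(Debug, PartialEq)]
pub struct Variant { pub key: VariantKey, pub value: Pattern, pub default: bool }

#[derive(Debug, PartialEq)]
pub enum Expression {
    InlineExpression(InlineExpression),
    SelectExpression { selector: InlineExpression, variants: Vec<Variant> },
}

#[derive(Debug, PartialEq)]
pub enum PatternElement { TextElement(String), Placeable(Expression) }

#[derive(Debug, PartialEq)]
pub struct Pattern { pub elements: Vec<PatternElement> }

#[derive(Debug, PartialEq)]
pub struct Message { pub value: Pattern, pub comment: Option<String> }

// Hypothetical helpers, only here to keep the construction readable.
fn ident(name: &str) -> Identifier { Identifier { name: name.to_string() } }
fn text(s: &str) -> PatternElement { PatternElement::TextElement(s.to_string()) }

/// Builds the AST for:
/// You have { PLURAL($emailCount) -> [one] one email *[other] { $emailCount } emails }
pub fn email_message() -> Message {
    // The same FunctionReference node serves as the selector here.
    let selector = InlineExpression::FunctionReference {
        id: "PLURAL".to_string(),
        argument: Some(ident("emailCount")),
    };
    let one = Variant {
        key: VariantKey::Identifier(ident("one")),
        value: Pattern { elements: vec![text("one email")] },
        default: false,
    };
    let other = Variant {
        key: VariantKey::Identifier(ident("other")),
        value: Pattern {
            elements: vec![
                PatternElement::Placeable(Expression::InlineExpression(
                    InlineExpression::VariableReference { id: ident("emailCount") },
                )),
                text(" emails"),
            ],
        },
        default: true, // assumed to correspond to the `*` marker
    };
    Message {
        value: Pattern {
            elements: vec![
                text("You have "),
                PatternElement::Placeable(Expression::SelectExpression {
                    selector,
                    variants: vec![one, other],
                }),
            ],
        },
        comment: None,
    }
}
```

The formatter case (`Today is { DATETIME($now) }.`) would use the same `FunctionReference` node wrapped in `Expression::InlineExpression` instead of `SelectExpression`.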

@echeran
Collaborator

echeran commented Jun 18, 2020

I have some comments:

  • I have been thinking (Parser #3) about the API input data in a way that I think allows us to decouple serialization concerns (file -> syntax -> parsing -> AST). EBNF seems like a cleaner, better approach to address serialization concerns, but the task of structuring the data comes first, I think.
  • Placeable -> Placeholder :-P
  • Could the type of Variant.value be Message? I think that better captures the relationship of Variant being a superset/wrapper of a Message in a particular situation -- when that message belongs to a group of messages connected to each other as the "cases" of a "switch/case". And of course, the "switch/case" is implicitly triggered when the placeholders' types have enumerated categories of values (plurals, gender, etc.)
  • I think we want to generalize the notion of Variant to support the possibility of a Message having more than one placeholder that triggers the "switch/case" behavior. If we have 2 plurals, or a plural and a gender, etc. in one message, then our "cases" correspond to the Cartesian product of the possible values that the placeholders can take on (ex: #{ [ONE, female], [ONE, male], [ONE, other], [OTHER, female], [OTHER, male], [OTHER, other] } ). So instead of Variant.key: VariantKey, maybe Variant.case_vals: HashMap<Identifier, String>? This assumes that we ensure that there is a concept of Placeholder that has a field of type Identifier. And if that makes sense so far, in this scenario I'm describing, the "switch" (select) part of the "switch/case" scenario to which Variant belongs is implicitly defined by the use of placeholders whose types take on a finite enumerated set of values. Maybe the "switch" (select) should be explicit, in which case are we able to support 2 select placeholders/args/vars of different types (ex: one plural, one gender)?
  • There are other aspects of placeholders that we should consider including as properties of the placeholder, which is how Okapi handles it:
    • what relative position type the placeholder has -- standalone or the open or close of a pair
    • what function type of placeholder - this could be a different way to encode the selector (plural, gender, free-form, etc.) as "PlaceholderType" in a way that's directly attached to the placeholder
      • I'm not sure if or how often Fluent's selector functions operate on > 1 placeholder. The alternate way I'm suggesting here assumes that the formatting function operates on just 1 placeholder, and the formatting fn is determined by the placeholder type attached to a Placeholder.
    • my descriptions of Okapi's placeholder's fields stink, but the word "type" is over-overloaded in the source code
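
As a rough illustration of the Cartesian-product idea in the list above, the sketch below enumerates every case for a set of placeholders and checks a variant against resolved values. It is not taken from any existing implementation: the `case_vals` field follows the suggestion above, but it uses `BTreeMap<String, String>` instead of `HashMap<Identifier, String>` to keep ordering deterministic, and `all_cases`, `matches` and the `String` value are hypothetical stand-ins.

```rust
use std::collections::BTreeMap;

/// Hypothetical variant keyed by placeholder name -> category.
#[derive(Debug, Clone, PartialEq)]
pub struct Variant {
    pub case_vals: BTreeMap<String, String>,
    pub value: String, // stand-in for a Pattern or Message
}

impl Variant {
    /// True when every placeholder named in this variant has the given category.
    pub fn matches(&self, actual: &BTreeMap<String, String>) -> bool {
        self.case_vals
            .iter()
            .all(|(name, category)| actual.get(name) == Some(category))
    }
}

/// Enumerates every combination of categories for the given placeholders,
/// i.e. the Cartesian product of their category sets.
pub fn all_cases(placeholders: &[(&str, Vec<&str>)]) -> Vec<BTreeMap<String, String>> {
    let mut cases = vec![BTreeMap::new()];
    for (name, categories) in placeholders {
        let mut next = Vec::new();
        for case in &cases {
            for category in categories {
                // Extend each partial case with one more placeholder value.
                let mut extended = case.clone();
                extended.insert(name.to_string(), category.to_string());
                next.push(extended);
            }
        }
        cases = next;
    }
    cases
}
```

With a two-category plural and a three-category gender this yields six cases, which is the growth that makes explicit defaults attractive.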

I'll stop there, and hopefully some of that makes sense. I may have misunderstood things about Fluent, so please correct (and @mihnita, chime in on corrections).

@filmil

filmil commented Jun 18, 2020

Should we overall start with AST or EBNF? Fluent's EBNF is here: https://github.com/projectfluent/fluent/blob/master/spec/fluent.ebnf

How about if we started by example, in terms of the use cases we'd like to handle? I personally find it hard to figure out whether an AST or EBNF actually supports what I'd like to do by staring at a wall of text. :)

@filmil

filmil commented Jun 18, 2020

  • I have been thinking (Parser #3) about the API input data in a way that I think allows us to decouple serialization concerns (file -> syntax -> parsing -> AST). EBNF seems like a cleaner, better approach to address serialization concerns, but the task of structuring the data comes first, I think.

IMHO, examples of what we want structured come even before that.

  • I think we want to generalize the notion of Variant to support the possibility of a Message having more than one placeholder that triggers the "switch/case" behavior. [...] Maybe the "switch" (select) should be explicit, in which case are we able to support 2 select placeholders/args/vars of different types (ex: one plural, one gender)?

There is a practical point to keeping the "computational" part of a format string separate from the human-readable (human-translatable) string as well.

At some point (looking back to the ICU conference last October), it seemed to make sense to separate out parameter binding, values based on those parameters and pattern matching. Especially because I'd like to expand the set of possible transformations beyond plural and gender into inflections and then things get increasingly more interesting.

  • I'm not sure if or how often Fluent's selector functions operate on > 1 placeholder. The alternate way I'm suggesting here assumes that the formatting function operates on just 1 placeholder, and the formatting fn is determined by the placeholder type attached to a Placeholder.

@zbraniecki
Owner Author

zbraniecki commented Jun 18, 2020

  • Could the type of Variant.value be Message? I think that better captures the relationship of Variant being a superset/wrapper of a Message in a particular situation -- when that message belongs to a group of messages connected to each other as the "cases" of a "switch/case". And of course, the "switch/case" is implicitly triggered when the placeholders' types have enumerated categories of values (plurals, gender, etc.)

The main difference, in this mini-AST, would be that then each variant could have its own comments. I don't know if there's a value to that?

  • I think we want to generalize the notion of Variant to support the possibility of a Message having more than one placeholder that triggers the "switch/case" behavior.

Good point. We can achieve it by doing:

#[derive(Debug, PartialEq)]
pub struct Variant {
-    pub key: VariantKey,
+    pub key: Vec<VariantKey>,
    pub value: Pattern,
    pub default: bool,
}

pub enum Expression {
    InlineExpression(InlineExpression),
    SelectExpression {
-        selector: InlineExpression,
+        selector: Vec<InlineExpression>,
        variants: Vec<Variant>,
    },
}

Does it sound good?

  • I'm not sure if or how often Fluent's selector functions operate on > 1 placeholder.

They don't yet in Fluent :( We so far only got to do it via nested selectors:

key = { PLURAL($num) ->
    [one] { GENDER($user) ->
        [masculine] Foo
       *[other] Bar
    }
   *[other] Baz
}

and plan to get back to flattening selectors in projectfluent/fluent#4, to get

key = { PLURAL($num), GENDER($user) ->
    [one, masculine] Foo
    [one, other] Bar
   *[other] Baz
}

or

key = { PLURAL($num), GENDER($user) ->
    [one, masculine] Foo
    [one, *other] Bar
    [*other] Baz
}

I believe we should support the flattened approach in MF 2.0.
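
For discussion, here is one possible reading of the second flattened syntax (`[one, *other]`), sketched as a matcher. Everything in it is an assumption rather than Fluent behavior: each key carries its own default flag, a variant with fewer keys than selectors matches on the keys it has, and the first matching variant wins. The names `Key`, `variant` and `select` are hypothetical, and `String` stands in for `Pattern`.

```rust
/// One key in a flattened variant, e.g. `one` or `*other`.
#[derive(Debug, Clone, PartialEq)]
pub struct Key {
    pub name: String,
    pub default: bool,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Variant {
    pub keys: Vec<Key>,
    pub value: String, // stand-in for a Pattern
}

fn key(s: &str) -> Key {
    // A leading `*` marks the key as the default for its selector.
    match s.strip_prefix('*') {
        Some(name) => Key { name: name.to_string(), default: true },
        None => Key { name: s.to_string(), default: false },
    }
}

/// Convenience constructor: `variant(&["one", "*other"], "Bar")`.
pub fn variant(keys: &[&str], value: &str) -> Variant {
    Variant {
        keys: keys.iter().map(|k| key(k)).collect(),
        value: value.to_string(),
    }
}

/// Returns the first variant in which every key either equals the resolved
/// category at the same position or is marked as a default.
pub fn select<'a>(variants: &'a [Variant], categories: &[&str]) -> Option<&'a str> {
    variants
        .iter()
        .find(|v| {
            v.keys
                .iter()
                .zip(categories)
                .all(|(k, c)| k.default || k.name == *c)
        })
        .map(|v| v.value.as_str())
}
```

Under these assumptions, `(one, masculine)` selects Foo, `(one, feminine)` falls through to Bar, and any other plural category selects Baz, which matches what the nested form above would produce.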

@zbraniecki
Owner Author

@stasm

@stasm

stasm commented Jun 22, 2020

Some high-level thoughts about the things mentioned in this thread so far:

Should we overall start with AST or EBNF? Fluent's EBNF is here: https://github.com/projectfluent/fluent/blob/master/spec/fluent.ebnf

I'd suggest starting with the data model alone. No parsing, no EBNF. I think the prototype should be a vehicle for discussion about semantics, use-cases and requirements rather than about the syntax.

It allows placeables to be selector expressions or inline expressions

It would be interesting to experiment with a different approach than the one we know from MessageFormat and Fluent where select expressions go into placeables. I mean the approach where the branching logic happens first, before patterns are defined. I call this the exploded message approach; I'm sure there are better names ;)

Rather than allow (text 1), (select with text 2a, text 2b, text 2c), (text 3), the exploded approach would encode the translation as (select with text 1, text 2, text 3).
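
A small sketch of what "exploding" could mean mechanically, assuming a simplified pattern made only of text parts and flat selects with string variants. The `Part` type and the `explode` function are hypothetical and ignore nested selects and placeholders inside variants.

```rust
/// Inline form: a pattern is a sequence of text and simple selects.
#[derive(Debug, Clone, PartialEq)]
pub enum Part {
    Text(String),
    /// (variant key, variant text) pairs
    Select(Vec<(String, String)>),
}

/// Exploded form: one complete text per combination of variant keys.
pub fn explode(parts: &[Part]) -> Vec<(Vec<String>, String)> {
    // Start with a single empty message and grow it part by part.
    let mut out = vec![(Vec::new(), String::new())];
    for part in parts {
        match part {
            Part::Text(t) => {
                // Plain text is appended to every message built so far.
                for (_, text) in out.iter_mut() {
                    text.push_str(t);
                }
            }
            Part::Select(variants) => {
                // A select multiplies the messages: one copy per variant.
                let mut next = Vec::new();
                for (keys, text) in &out {
                    for (key, variant_text) in variants {
                        let mut keys = keys.clone();
                        keys.push(key.clone());
                        next.push((keys, format!("{}{}", text, variant_text)));
                    }
                }
                out = next;
            }
        }
    }
    out
}
```

Each resulting entry is a full sentence keyed by the list of variant keys that produced it, so a translator would see whole messages rather than fragments.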

The main difference, in this mini-AST, would be that then each variant could have its own comments. I don't know if there's a value to that?

I think there is! In fact, I think it would be interesting to consider what happens if all or most data nodes can have metadata attached to them. Things like: context, comments, examples, whether it can be translated, whether it can be re-positioned in the sentence, which grammatical case is used, etc.
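
One way this could look in the data model is a generic wrapper that pairs any node with a metadata bag. This is only a sketch: `Meta`, `Node` and the chosen fields are hypothetical, cover just a few of the items listed above, and assume nodes are translatable unless marked otherwise.

```rust
/// Hypothetical metadata bag; the field set is a guess, not a proposal.
#[derive(Debug, Clone, PartialEq)]
pub struct Meta {
    pub comment: Option<String>,
    pub example: Option<String>,
    pub translatable: bool,
}

impl Default for Meta {
    fn default() -> Self {
        // Assumption: nodes are translatable unless marked otherwise.
        Meta { comment: None, example: None, translatable: true }
    }
}

/// Wraps any data node (message, variant, placeholder, ...) with metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct Node<T> {
    pub value: T,
    pub meta: Meta,
}

impl<T> Node<T> {
    pub fn new(value: T) -> Self {
        Node { value, meta: Meta::default() }
    }

    /// Builder-style helper that attaches a comment to the node.
    pub fn with_comment(mut self, comment: &str) -> Self {
        self.meta.comment = Some(comment.to_string());
        self
    }
}

/// Example: a variant (key, text) that carries its own comment.
pub type Variant = Node<(String, String)>;
```

With this shape a per-variant comment costs nothing extra in the model, for example `Node::new(("one".to_string(), "one email".to_string())).with_comment("Singular form")`.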

@filmil

filmil commented Jun 22, 2020

Should we overall start with AST or EBNF? Fluent's EBNF is here: https://github.com/projectfluent/fluent/blob/master/spec/fluent.ebnf

I'd suggest starting with the data model alone. No parsing, no EBNF. I think the prototype should be a vehicle for discussion about semantics, use-cases and requirements rather than about the syntax.

Thinking aloud: is there a requirement that MessageFormat 2.0 be encodable as a string? If it were encoded as a struct, it seems like the parsing machinery would not even be needed; or could reuse existing generic parsers like YAML.

@zbraniecki
Owner Author

Rather than allow (text 1), (select with text 2a, text 2b, text 2c), (text 3), the exploded approach would encode the translation as (select with text 1, text 2, text 3).

I agree that it would be interesting to try that. But we need to answer the question about nested selections in such a case.

What happens when you have a PLURAL, GENDER selector and GENDER differs only in category one? What happens when you have three selectors (I know, edge case), say, PLURAL, PLURAL, GENDER?

I think there is! In fact, I think it would be interesting to consider what happens if all or most data nodes can have metadata attached to them. Things like: context, comments, examples, whether it can be translated, whether it can be re-positioned in the sentence, which grammatical case is used, etc.

This may be relatively easy to represent in the datamodel, but may be very very hard to represent in textual form. Maybe it's ok to have a more open datamodel, and let the textual representation be capable of expressing just some of the metadata.

Thinking aloud: is there a requirement that MessageFormat 2.0 be encodable as a string? If it were encoded as a struct, it seems like the parsing machinery would not even be needed; or could reuse existing generic parsers like YAML.

We're not certain yet. For now we focus on non-textual representation, but I expect that for the Web usage we'll want a resource format, similarly to how we don't encode CSS in JSON/YAML, but rather have its own dedicated textual format.
There are many reasons why YAML/JSON is not really the best target for an l10n resource format, and I think we'll want to have an l10n-tailored one later on, maybe even multiple, but the one that gets standardized for the Web is likely to be the dominant one for the foreseeable future.

Bottom line is - I think for now we should focus on AST and data model, but the way we imagine what we want to express should take into account that one day we'll want to express it in a human-readable/writable format.

@zbraniecki
Owner Author

I opened #6 to discuss AST of selectors vs placeholders.

@stasm

stasm commented Jun 22, 2020

What happens when you have a PLURAL, GENDER selector and GENDER differs only in category one? What happens when you have three selectors (I know, edge case), say, PLURAL, PLURAL, GENDER?

That's a great question, and I think it's something we can answer with a prototype :) Thanks for filing #6, I'll continue there.

@mihnita

mihnita commented Jul 10, 2020

each variant could have its own comments. I don't know if there's a value to that?

I think there is value.

@mihnita

mihnita commented Jul 10, 2020

Should we overall start with AST or EBNF? Fluent's EBNF is here:

TLDR: I am with stasm@ on this one

"I'd suggest starting with the data model alone. No parsing, no EBNF. I think the prototype should be a vehicle for discussion about semantics, use-cases and requirements rather than about the syntax."

So just data model + examples to show that it works.


I think that EBNF focuses too much on the syntax part.

It says stuff like:

  foo := '[' listItems ']';
  listItems := item [',' listItems];

when what we want is really: foo is an array of item(s)

If we look at the EBNF doc at
https://www.ics.uci.edu/~pattis/ICS-33/lectures/ebnf.pdf
they have a section named "1.6 Syntax versus Semantics" that starts with
"EBNF descriptions specify only syntax: the form in which something is written.
They do not specify semantics: the meaning of what is written"

So in this respect the rust code is more readable:

 pub elements: Vec<PatternElement>,

(or the same thing in proto syntax, repeated PatternElement elements)

@mihnita

mihnita commented Jul 10, 2020

is there a requirement that MessageFormat 2.0 be encodable as a string

I think it is. But likely not at this stage.
My hope is that we can come up with a data model, and then define one / several string representations.

That would have several benefits:

  • Would allow Fluent / Message / FBT to update the current syntax (if the data model is close enough :-), instead of throwing it all away. Nice for migration
  • Would allow us to define syntaxes that are friendlier to the framework used (for example JS can have something json like)
  • The unique data model would allow converting between formats (Fluent <=> FBT <=> MessageFormat) and map between any format and the LIOM / XLIFF. So Mozilla can write a Fluent <=> data model filter, ICU a MS <=> data model filter, and we have a common XLIFF <=> data model filter.
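
A minimal sketch of the "one data model, several string representations" idea, assuming a deliberately tiny model of text plus variable references. The `PatternElement` type and both functions are hypothetical; the output shapes follow the Fluent `{ $name }` placeable and the MessageFormat `{name}` argument, and escaping of literal braces is ignored.

```rust
/// Tiny stand-in for the shared data model: text and variable references only.
#[derive(Debug, Clone, PartialEq)]
pub enum PatternElement {
    Text(String),
    Variable(String),
}

/// Serializes to a Fluent-style pattern, e.g. `Hello, { $user }!`.
pub fn to_fluent(pattern: &[PatternElement]) -> String {
    pattern
        .iter()
        .map(|el| match el {
            PatternElement::Text(t) => t.clone(),
            PatternElement::Variable(name) => format!("{{ ${} }}", name),
        })
        .collect()
}

/// Serializes to a MessageFormat-style pattern, e.g. `Hello, {user}!`.
/// Note: literal braces and apostrophes in text are NOT escaped here.
pub fn to_message_format(pattern: &[PatternElement]) -> String {
    pattern
        .iter()
        .map(|el| match el {
            PatternElement::Text(t) => t.clone(),
            PatternElement::Variable(name) => format!("{{{}}}", name),
        })
        .collect()
}
```

A converter between two syntaxes would then be a parser into this model followed by one of these serializers, which is the filter arrangement described in the last bullet above.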
