Defining the canonical syntax #178

echeran · 2021-06-21T02:18:01Z

In the previous extended meeting on 2021-05-31 (meeting notes), we decided to open a parallel discussion on the topic of the syntax. This issue represents the continuation of that discussion, starting with a summary of the ideas from the meeting.

Background

As outlined in the Goals and Non-Goals document, one of the deliverables is "A formal definition of the canonical syntax for representing the data model,..."

Multiple topics were discussed, but the discussion began and ended on the topic of whether there should be a human-friendly syntax in addition to (or instead of) a syntax that directly represents the data model.

Topics discussed

How do we interpret "the canonical syntax" wording?

The Goals and Non-Goals document has the wording "the canonical syntax". This implies that there should be one syntax that we define and maintain.
This choice to define and maintain only one syntax is okay, and it is less important than defining only one single data model, which is the more critical aspect of design.

What considerations are important for defining the syntax?

Selection messages with multiple selector args

Some features are not supported by the current MessageFormat or Fluent syntax, ex: selection messages that select on multiple selector args. Therefore, we need a new format.
Supporting a collection of messages doesn't seem to be something that needs the invention of a new type of syntax or file format.
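To make the multi-selector case above concrete, here is a minimal sketch in Python. The message shape (a list of selector names plus variants keyed by a tuple of selector values), the "other" fallback order, and the helper names `plural_category` and `format_message` are all assumptions made for illustration; none of this is a format the group has defined.

```python
# Hypothetical shape: one variant per combination of selector values.
# "other" acts as the catch-all value for each selector position.
MESSAGE = {
    "selectors": ["gender", "count_category"],
    "variants": {
        ("female", "one"): "She has {count} new message.",
        ("female", "other"): "She has {count} new messages.",
        ("other", "one"): "They have {count} new message.",
        ("other", "other"): "They have {count} new messages.",
    },
}


def plural_category(count: int) -> str:
    """English-only simplification: 'one' for exactly 1, otherwise 'other'."""
    return "one" if count == 1 else "other"


def format_message(message: dict, gender: str, count: int) -> str:
    """Pick the variant matching both selector args, then fill in the count."""
    keys = (gender, plural_category(count))
    variants = message["variants"]
    # Try the exact combination first, then fall back to "other" per position.
    candidates = [
        keys,
        (keys[0], "other"),
        ("other", keys[1]),
        ("other", "other"),
    ]
    for candidate in candidates:
        if candidate in variants:
            return variants[candidate].format(count=count)
    raise KeyError(f"no variant matches {keys!r}")


print(format_message(MESSAGE, "female", 1))
print(format_message(MESSAGE, "male", 3))
```

The point of the sketch is only that selection happens on two arguments at once, which a single nested select on one argument does not express directly.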

Concept of a file

Should selection messages have the syntax of selection self-contained, or should they prompt the need for a larger message approaching a file format? Current MessageFormat can represent a simple message or a selection message, but Fluent designed its own format for collections of messages.
The syntax we define doesn't need to be tied to the concept of a file. In fact, it should not, because that would limit how and where the syntax could be used. There are types of data stores that are neither bounded nor have the inherent linearity that files have, ex: databases, OS message registries, RPCs of a similar nature over the network, etc.

Targeting the web / HTML

Although we should be able to express our data in a format not tied to a file format, so as not to narrow ourselves, we also need to define a file format to target the web. Whatever we design will be a good candidate for a localization system for HTML, and that will require a file format.
So long as we choose a syntax that can be serialized to a stream of bytes, then where those serialized bytes are persisted and how they are transported are concerns that can be decided flexibly by the implementations, ex: store messages in files, in DBs, etc. The syntax of the messages is orthogonal to where and how the serialized message bytes are used, regardless of whether we want to call the syntax a file format or something else.
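The orthogonality described above can be sketched as follows. This is only an illustration under assumptions: the message shape and the `serialize` / `deserialize` helpers are hypothetical, JSON is used as a stand-in for whatever syntax is chosen, and an in-memory buffer and an in-memory SQLite table stand in for "a file" and "a DB".

```python
import io
import json
import sqlite3

# Hypothetical message; the shape is an assumption for illustration only.
message = {
    "id": "greeting",
    "parts": [{"text": "Hello, "}, {"placeholder": "name"}, {"text": "!"}],
}


def serialize(msg: dict) -> bytes:
    """Turn a message into a stream of bytes (JSON encoded as UTF-8)."""
    return json.dumps(msg, sort_keys=True, ensure_ascii=False).encode("utf-8")


def deserialize(data: bytes) -> dict:
    """Inverse of serialize()."""
    return json.loads(data.decode("utf-8"))


payload = serialize(message)

# Store 1: a file-like object (stands in for a file on disk).
buffer = io.BytesIO()
buffer.write(payload)
from_file = deserialize(buffer.getvalue())

# Store 2: a database row (in-memory SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id TEXT PRIMARY KEY, body BLOB)")
conn.execute("INSERT INTO messages VALUES (?, ?)", (message["id"], payload))
row = conn.execute(
    "SELECT body FROM messages WHERE id = ?", ("greeting",)
).fetchone()
from_db = deserialize(row[0])
conn.close()

# The same bytes round-trip identically regardless of where they were kept.
print(from_file == message, from_db == message)
```

Neither store needs to know anything about the message syntax; they only persist bytes.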

Representation of new features

Once we move beyond pure JavaScript, we need to think about how what we're doing will be used by further projects without causing scope creep on our current project.
The choice of syntax is not terribly important because all syntaxes can represent the basic constructs of associative data (maps/objects) and sequential data (lists/arrays), and with those you can represent just about anything.
One factor that is important is the likelihood for adoption, and the prevalence and ecosystem support of a syntax seem to be important for that. So if all syntaxes are equal, then for this reason JSON seems like a natural choice, and we could define a companion schema for it.

Representing collections of messages

Should we define a collection of messages?
Is that also a suggestion for having a hierarchy of messages?
We already agreed on having a higher-level hierarchical grouping of messages early on in discussions of the remaining 2 data model proposals.

Do we want to create a special human-readable syntax?

Distinguishing a human readable syntax from computer friendly representation

There is a difference between how we represent messages in memory / in computers, and how we represent them in a serialized format. The serialized format may need to be optimized to be humane and easy to write, while this serialized format can be parsed into a computer-friendly interoperable representation complete with all the metadata, ex: using YAML or JSON, etc.
The full data of the message is all data. Whether you call parts of it metadata or not, metadata is data. If the data isn't important, don't include it in the message.
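The distinction drawn in this section, between a compact form that is easy to write and a structured form that is easy to process, can be sketched as below. The `{name}` placeholder notation, the `parse` function, and the output shape are all hypothetical; this is not a proposed syntax, only an illustration of a human-oriented string being parsed into JSON-compatible data.

```python
import json
import re

# Hypothetical compact notation: literal text with {name}-style placeholders.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def parse(source: str) -> list:
    """Parse the compact form into a list of text and placeholder parts."""
    parts = []
    position = 0
    for match in PLACEHOLDER.finditer(source):
        # Literal text between the previous placeholder and this one.
        if match.start() > position:
            parts.append({"type": "text", "value": source[position:match.start()]})
        parts.append({"type": "placeholder", "name": match.group(1)})
        position = match.end()
    # Any trailing literal text after the last placeholder.
    if position < len(source):
        parts.append({"type": "text", "value": source[position:]})
    return parts


parsed = parse("Hello, {name}!")
print(json.dumps(parsed, indent=2))
```

A person writes the one-line string; tools exchange the parsed structure, which can carry whatever additional data the data model calls for.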

An analogy to CSS and how applicable that is

Perhaps a better way to frame the above discussion is through this question: why is CSS not expressed in JSON? There were debates in the early days of the web on how to encode CSS, and they chose something other than JSON.
This CSS analogy doesn't apply because of an assumption that is implicit. CSS is a language that is mostly written by programmers and you're editing the text.
Is CSS really written by programmers these days?
Does optimization for presentation really matter? What is special about MessageFormat that requires a specialized syntax?
Would you still create a separate syntax for CSS if you were designing it today?
Yes, you're editing it directly, usually. And more importantly, CSS has multiple complex concepts present: you have XPath style selectors for hierarchical data, there is a "cascading" nature of application of the rules on hierarchical data, and regular expression-style syntax, too. When reasoning about CSS, there is also the context that the CSS rules are executed by an engine in a certain way, and that contributes to the complexity. That level of complexity is not in MessageFormat. The complexity of CSS involves concepts that programmers can deal with, but much of the translation industry work is done by non-programmers.
The CSS analogy is a good analogy, but what is implicit is that we consider CSS a well-designed language. Is it? It represents complex ideas, but hopefully the result of our work in MessageFormat is simpler than CSS.

Relative importance of human-friendly representation to a computer-friendly one

When we talk about syntax for humans, it needs to be easy to read, but when serialized for computers, it needs to be a structure that is easy to process. We can imagine something that the data model can support that we would not necessarily want to be directly in the syntax, but that results from parsing the syntax.
The more important use case is the one in which computers talk. And that goes with the idea that our syntax represents the functionality that we support. This does not preclude the idea that there should be a compact, concise representation for humans, but choosing how to make it concise involves tradeoffs that depend on the implementers and users, and different use cases may result in different tradeoffs and different compact representations. I don't think that we, as a group, should maintain a syntax which limits what you can do with the data model. I prioritize computers over humans when push comes to shove.
We can let other people design the human representation, but it is a fallacy to say that it is not necessary.

Past experience insights + potential future user experience

Computer exchange of the data model is relatively easy and non-controversial if we ensure that the data model is representable in JSON. If supported by JSON, it is easy to guarantee that machine exchange will work. But JSON is not the best representation for humans because it is verbose and hard to write by hand.
We can leave it up to others to decide how they want to create their compact representation. For simple messages, how different will a JSON representation be from a custom compact representation? If the messages are complicated, and if it affects adoption, then maybe a human-friendly representation is important. How often are people editing things by hand that are complicated?
With current MessageFormat, the only way to edit messages is to write them by hand, including complicated messages.
For Fluent, we designed the syntax such that it could be edited by hand by pretty much anyone. Once we started using it, we realized that the only people interested in editing Fluent were mostly programmers, and a few translators with programming experience. There was disappointment that we couldn't convince "regular" translators to use Fluent syntax, and we instead jumped through hoops to hide syntax and design rich UIs to help hide syntax. Translators will favor graphical UIs over syntax, but programmers will favor a text syntax as they write code.
With this experience, I see that editing by hand is a fallback. It is a fallacy to say that because UIs are the primary vehicle for message authoring, that we can discount the fallback to text format authoring, even if it is the minority of use cases.
We can build tooling to handle authoring complicated messages, and for programmers who want to use text editors, there is already tooling to support commonly-used syntaxes like JSON, YAML, etc., if we choose to use them, unlike a custom syntax like current MessageFormat. If programmers have corner case needs that tooling can't fully support, they can author a message and convert to the text representation using tooling, and then edit and copy-paste the text accordingly. That is something we can trust programmers to handle.

aphillips · 2023-06-17T14:55:10Z

The above is an excellent summary with interesting information in it. We have since decided to adopt a message-level syntax and to exclude resource formats from this specific group's work. I'm therefore closing this tracking issue.

romulocintra added the design and requirements labels Sep 30, 2021

eemeli linked a pull request Apr 28, 2022 that will close this issue

Add syntax proposal with EBNF #230

Merged

mihnita added the blocker-candidate label Nov 3, 2022

echeran mentioned this issue Feb 17, 2023

Add explicit whitespace definitions #344

Merged

aphillips closed this as completed Jun 17, 2023
