Define the behavior of backslash #22

stasm · 2017-01-27T15:24:13Z

In #12 (comment) I said we'd need to define the exact behavior of the backslash character \ for the purposes of escaping. This includes defining:

- the list of known escape sequences (\ followed by a space, \t, \n, \*, \[, \{, \u, \\, others?),
- how the Unicode escapes work: is \u20 valid and the same as \u0020?
- the behavior of unknown sequences, like \a (does the backslash take the following character out of the syntax parsing?),
- the behavior for edge cases, like:
```
  foo\bar = Foobar
```
Is that a syntax error? If not, what is the name of the identifier?
```
  foo = Foo\
  bar = Bar
```
Is that an escaped new-line?

Fix projectfluent#12, projectfluent#17, projectfluent#18. With this change, the entire body of a message needs to be indented. This makes error recovery very easy: finding the next message definition is as simple as finding the next identifier with no indentation. It also opens up a number of opportunities: we can remove the `|` syntax for multiline blocks of text and allow line breaks inside of placeables safely. The PR also allows the value to be defined on a new line, making the following examples equivalent: lipsum = Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi pellentesque congue metus, non mattis sem faucibus sit amet. lipsum = Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi pellentesque congue metus, non mattis sem faucibus sit amet. I hope this will help when attributes are present: lipsum = Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi pellentesque congue metus, non mattis sem faucibus sit amet. .attr = Attribute Lastly, quoted patterns are only available inside of placeables and cannot be used directly as values. The exact semantics of \ escapes will be defined in projectfluent#22.

Fix #12, #17, #18. With this change, the entire body of a message must be indented. This makes error recovery very easy: finding the next message definition is as simple as finding the next identifier with no indentation. It also opens up a number of opportunities: we can remove the `|` syntax for multiline blocks of text and allow line breaks inside of placeables safely. The change also allows the value to be defined on a new line, making the following examples equivalent: lipsum = Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi pellentesque congue metus, non mattis sem faucibus sit amet. lipsum = Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi pellentesque congue metus, non mattis sem faucibus sit amet. Lastly, quoted patterns are only available inside of placeables, cannot contain other placeables and cannot be used directly as values. The exact semantics of \ escapes will be defined in #22.

stasm · 2017-03-02T00:46:35Z

A draft proposal:

- Escape sequences are only allowed in the text and quoted-text productions.
- Known escapes are: \\, \*, \[, \{, \uXXXX, \t, \n, \".
  - Do we actually need \t and \n? The whole syntax was designed to make it easy to use white-space and those characters could be written literally, if needed.
- Unicode sequences are only valid with four characters: \u0020 is valid, \u20 is not. This is the same as in JavaScript.
  - ES2015 introduced Unicode code point escapes which are written as \u{XXXXXXX} with any number of X, which allows representing code points from outside of the Basic Multilingual Plane without resorting to surrogate pairs. Non-BMP code points are very rare but the question still remains: are we okay with using surrogate pairs for them if we go with the \uXXXX proposal?
- Escaping any other character returns the character itself and the character is parsed as normal; \a results in a and \ at the end of line results in the EOL character which is parsed as normal. This is different from JavaScript which has a special case for \EOL which is called LineContinuation and technically is not an escape sequence.

For instance, in the following example, the escaped EOL results in a real EOL and it ends the value part of the a variant:
```
foo = {
       *[a] AAA\
        [b] BBB
    }
```
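To make the draft concrete, here is a rough Python sketch of how a text run could be unescaped under these rules. It is only an illustration: the function name `unescape_draft`, the `KNOWN` table, and the choice to leave a lone trailing backslash untouched are assumptions, not part of any Fluent implementation.

```python
import re

# Escapes named in the draft, mapped to the character each one produces.
KNOWN = {"\\": "\\", "*": "*", "[": "[", "{": "{", '"': '"', "t": "\t", "n": "\n"}


def unescape_draft(text: str) -> str:
    """Unescape `text` following the draft proposal (hypothetical helper)."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        # Assumption: a lone backslash at the very end is kept as-is.
        if ch != "\\" or i + 1 == len(text):
            out.append(ch)
            i += 1
            continue
        nxt = text[i + 1]
        if nxt == "u":
            digits = text[i + 2 : i + 6]
            # \u0020 is valid, \u20 is not: exactly four hex digits.
            if re.fullmatch(r"[0-9A-Fa-f]{4}", digits) is None:
                raise ValueError(f"\\u needs exactly four hex digits at offset {i}")
            out.append(chr(int(digits, 16)))
            i += 6
        else:
            # Unknown escapes fall through to the character itself,
            # so \a gives "a" and \EOL gives a real EOL.
            out.append(KNOWN.get(nxt, nxt))
            i += 2
    return "".join(out)
```

With this sketch, `unescape_draft("AAA\\\n")` returns `"AAA\n"`: the escaped EOL is still a real EOL, which is why it ends the value of the `a` variant in the example above.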

@Pike, @zbraniecki — I'd love to hear your thoughts on this. Thanks!

zbraniecki · 2017-03-02T00:56:12Z

sgtm! I'd not do \n, \t until we have a use case.

stasm · 2017-03-02T10:17:22Z

I woke up this morning and I had another idea: what if we tried to use the { "x" } pattern as much as we can? The following is a counter-proposal to the one above.

Firstly, let's talk about the backslash. In a more extreme version of the proposal, it can (a) become a regular literal character. Or, it could (b) escape any character to itself, taking it out of the parse flow.

In (a) we need a new solution for Unicode escapes. Perhaps a new literal: Foo { U+10000 } bar?
In (b) we need four exceptions: \\, \", \uXXXX and \EOL because we don't want to take the EOL out of the parse flow.
Or, (c) we could introduce the U+10000 literals and leave backslash for escaping only the " and EOL (and the \ itself). This makes sense: these are the characters used to end a string.

Special characters occurring in text can be escaped by putting them in placeables. quoted-text doesn't allow more placeables, so the following are valid and unambiguous: { "{" }, { "[" } etc. The one exception is the double quote " itself. In (a) we don't have a way to put it in a quoted-text.

So the question boils down to: how much do we want to limit the quoted-text production? It is mostly used in call-expression and I like the idea of keeping the arguments very simple. But maybe we don't want to limit them too much in case we'd like to have things like WRAP(brand-name, char: "\"") or LIST(users, separator: "\uXXXX") in the future.

stasm · 2017-03-06T18:49:57Z

After more thought I'd like to go back to the first proposal and also make it simpler.

- Escape sequences are only allowed in text and quoted-text.
- Known escape sequences are: \\ for the literal backslash, \" for the literal double quote and \{ for the literal opening brace.
- Any other escaped characters result in the literal character being added to the text content of the production. So, \a is a and \EOL is EOL.
- Other special characters like [ can be written as { "[" } if they happen to be at the beginning of the line and should be part of the text content.
- Using Unicode in FTL is encouraged and as such, we don't offer the \uXXXX sequence at all.

@Pike, @zbraniecki - mind taking another look at this, please?

Pike · 2017-03-08T09:22:16Z

I'm not sure if doing \n->n is a good idea. Or maybe that's something we can warn about in a linter step? It seems like such a ubiquitous assumption that that'd be a newline. And other fall-through escapes. We do have such a warning in compare-locales for .properties, too. Rambling.

For unicode escapes, I've just toyed around with the unicode hex keyboard on the mac. Interestingly, you need to enter surrogate pairs to get to 𝌆, 8 keystrokes away. I wonder if @flodolo or @TheoChevalier have opinions on this as people that actually have to type that unicode stuff.

Apart from that, the latest proposal sounds fine to me.
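For reference, the surrogate pair for 𝌆 (U+1D306) can be worked out with the standard UTF-16 arithmetic. The sketch below is only an illustration of that calculation; the helper name `to_surrogate_pair` is made up for this example.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a non-BMP code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = to_surrogate_pair(ord("𝌆"))
print(f"\\u{high:04X}\\u{low:04X}")  # prints \uD834\uDF06
```

The eight hex digits in the output are the eight keystrokes mentioned above.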

TheoChevalier · 2017-03-08T11:10:48Z

I don’t think not being able to use \uXXXX would be a problem, but I guess people would have to try once to discover it’s not supported? Would using it produce syntax error?

flodolo · 2017-03-08T11:46:56Z

I might have more questions than answers…

- Do we really expect to be able to live without new lines in a string? BTW, we also have \r around in mozilla-central
- What about trying to promote FTL as a file format for other uses, where Fluent is not used as the technology driving the project but just as a parser? Is that excluded as a potential scenario? If it's not, supporting unicode and new lines seems needed.
- IMO all special characters should be treated equally. If [ is a special character, it should be escaped as \[. Think regular expressions for example.
- Also, the idea of having to write something like { "[" } for displaying one character makes me really want to ┻━┻ ︵ヽ(`Д´)ﾉ︵ ┻━┻
- Displaying t when I write \t, because \t is not recognized as a known escape sequence, doesn't sound like a good idea. Here are a few possible scenarios:
  - Code:\t { foo }: I wanted to create a tab, displaying t is bad, I should get rid of the whole escape sequence.
  - Go to c:\documents: I wanted to write a literal \, so it should have been Go to c:\\documents. Again, displaying c:documents doesn't seem like a good idea.
  - %Spx \u00D7 %Spx: I wanted to display %Spx × %Spx, displaying %Spx u00D7 %Spx, awful result.

Maybe dropping the string altogether with an error is a better option.
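The three scenarios are easy to reproduce. Under the March 6 proposal every escape, known or not, yields the character after the backslash, so a one-line substitution is enough to model it. This is a sketch for illustration only; `unescape_fall_through` is a hypothetical name and the snippet ignores placeables entirely.

```python
import re


def unescape_fall_through(text: str) -> str:
    """Drop each backslash and keep the character that follows it."""
    # DOTALL lets "." match a newline too, so an escaped EOL stays an EOL.
    return re.sub(r"\\(.)", r"\1", text, flags=re.DOTALL)


for source in (r"Code:\t { foo }", r"Go to c:\documents", r"%Spx \u00D7 %Spx"):
    print(unescape_fall_through(source))
# prints:
# Code:t { foo }
# Go to c:documents
# %Spx u00D7 %Spx
```

Only the correctly doubled form survives as intended: `unescape_fall_through(r"Go to c:\\documents")` gives `Go to c:\documents`.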

stasm · 2017-03-08T12:02:32Z

> Do we really expect to be able to live without new lines in a string? BTW, we also have \r around in mozilla-central

New-lines are supported by the syntax natively. You don't need to escape them, just write them as normal:

```
foo =
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed
    aliquam dui quis nibh rutrum semper. Vestibulum a enim eget
    orci imperdiet tincidunt nec mattis leo. Aenean faucibus ligula
    turpis, eu tincidunt lorem malesuada eget.
```

> Also, the idea of having to write something like { "[" } for displaying one character makes me really want to ┻━┻ ︵ヽ(`Д´)ﾉ︵ ┻━┻

That would be only necessary if [ happens to be at the beginning of a multiline value. Otherwise, it's not special. Consider:

```
foo = Foo [ Bar ]

bar =
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    { "[" } bar ].
```
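A tool that writes FTL could apply this rule automatically. The sketch below assumes the value is serialized in block form, one text line per indented line, and simply wraps a leading `[` in a quoted placeable; `protect_line_starts` is a hypothetical helper, not an existing serializer API.

```python
def protect_line_starts(lines: list[str]) -> list[str]:
    """Wrap a leading "[" in a quoted placeable so it is read as text."""
    protected = []
    for line in lines:
        if line.startswith("["):
            # Assumption: only "[" at the very start of a line is special.
            line = '{ "[" }' + line[1:]
        protected.append(line)
    return protected


value = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
    "[ bar ].",
]
for line in protect_line_starts(value):
    print("    " + line)
```

A `[` in the middle of a line, as in `Foo [ Bar ]`, is left alone.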

> Code:\t { foo }

What is the advantage of writing \t rather than typing the tab character?

flodolo · 2017-03-08T12:14:24Z

> New-lines are supported by the syntax natively.

Uh, forgot that multiline strings have a different syntax. So, that's not a problem.

> That would be only necessary if [ happens to be at the beginning of a multiline value.

Aren't we asking too much then to localizers working on these files? I know we would prefer them to use tools, where this would be automated and transparent, but it seems to add a lot of complexity.
Can you think of other languages where escaping depends on the position of the character?

> What is the advantage of writing \t rather than typing the tab character?

None, but my understanding is that we're considering the case where someone wrote the string assuming \t (or \n, \r, etc.) would be converted to a tab or a newline, and how to deal with that.

zbraniecki · 2017-03-08T21:25:43Z

> What is the advantage of writing \t rather than typing the tab character?

Some editors define tab behavior to jump between inputs.

stasm · 2017-04-07T16:55:41Z

Last week we met in person and briefly discussed this issue with @Pike, @zbraniecki and @flodolo. Here are the key take-aways from that conversation:

- We don't have to answer all questions right now.
- Prefer to be flexible: don't normalize by default.
- In the future, allow bindings to configure the context's behavior wrt. normalization.
- Unicode escapes are a safety valve.
- Parsing \n to n isn't helpful and produces an unexpected behavior.

With that in mind, I'd like to suggest a minimal specification for our current purposes.

- Escape sequences are only allowed in text and quoted-text.
- Newlines are preserved by the parser. This allows proper serialization.
- Known escape sequences are: \\ for the literal backslash, \" for the literal double quote, \{ for the literal opening brace and \u followed by 4 hex digits for Unicode code points. Representing code points from outside of the Basic Multilingual Plane is made possible with surrogate pairs (two \uXXXX sequences). Using the actual character is encouraged, however.
- Any other escaped characters result in a parsing error. (We might relax this to producing warnings and parsing to a space for instance, but let's start with a stricter approach.)
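A minimal Python sketch of this stricter specification could look as follows. It is an illustration under stated assumptions, not the actual parser: `unescape_strict`, `SIMPLE` and `HEX4` are hypothetical names, a plain `ValueError` stands in for the parsing error, and surrogate pairs are joined by a round trip through UTF-16.

```python
import re

# The three single-character escapes listed in the specification.
SIMPLE = {"\\": "\\", '"': '"', "{": "{"}
HEX4 = re.compile(r"[0-9A-Fa-f]{4}")


def unescape_strict(text: str) -> str:
    """Unescape `text`, rejecting every escape the spec does not list."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        nxt = text[i + 1 : i + 2]  # empty string if the backslash is last
        if nxt in SIMPLE:
            out.append(SIMPLE[nxt])
            i += 2
        elif nxt == "u" and HEX4.fullmatch(text[i + 2 : i + 6]):
            out.append(chr(int(text[i + 2 : i + 6], 16)))
            i += 6
        else:
            raise ValueError(f"unknown escape sequence at offset {i}")
    joined = "".join(out)
    # Two \uXXXX surrogates become one code point after a UTF-16 round trip;
    # an unpaired surrogate makes decode() raise UnicodeDecodeError.
    return joined.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
```

Under these rules `unescape_strict(r"%Spx \u00D7 %Spx")` returns `%Spx × %Spx` and `unescape_strict(r"\uD834\uDF06")` returns `𝌆`, while `unescape_strict(r"Code:\t")` raises an error instead of silently producing `t`.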

zbraniecki · 2017-07-10T18:47:08Z

@stasm is there anything left in this issue?

stasm · 2017-08-04T13:34:02Z

No, I forgot to close this issue. And to tag Syntax Spec 0.3 back in April. D'oh. Thanks.

stasm added this to the 0.3 milestone Jan 27, 2017

stasm mentioned this issue Feb 15, 2017

Require message body to be indented #32

Merged

stasm added the syntax label Feb 16, 2017

stasm mentioned this issue Apr 11, 2017

Define the behavior of backslash escapes #41

Merged

stasm closed this as completed Aug 4, 2017
