Define the behavior of backslash #22

Closed
stasm opened this issue Jan 27, 2017 · 13 comments
@stasm
Contributor

stasm commented Jan 27, 2017

In #12 (comment) I said we'd need to define the exact behavior of the backslash character \ for the purposes of escaping. This includes defining:

  • the list of known escape sequences (\ followed by a space, \t, \n, \*, \[, \{, \u, \\, others?),

  • how the Unicode escapes work: is \u20 valid and the same as \u0020?

  • the behavior of unknown sequences, like \a (does the backslash take the following character out of the syntax parsing?),

  • the behavior for edge-cases, like:

      foo\bar = Foobar
    

    Is that a syntax error? If not, what is the name of the identifier?

      foo = Foo\
      bar = Bar
    

    Is that an escaped new-line?

@stasm stasm added this to the 0.3 milestone Jan 27, 2017
stasm added a commit to stasm/fluent that referenced this issue Feb 15, 2017
Fix projectfluent#12, projectfluent#17, projectfluent#18.

With this change, the entire body of a message needs to be indented. This makes
error recovery very easy: finding the next message definition is as simple as
finding the next identifier with no indentation.

It also opens up a number of opportunities: we can remove the `|` syntax for
multiline blocks of text and allow line breaks inside of placeables safely.

The PR also allows the value to be defined on a new line, making the
following examples equivalent:

    lipsum = Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi
        pellentesque congue metus, non mattis sem faucibus sit amet.

    lipsum
        = Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi
        pellentesque congue metus, non mattis sem faucibus sit amet.

I hope this will help when attributes are present:

    lipsum
        = Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi
        pellentesque congue metus, non mattis sem faucibus sit amet.

        .attr = Attribute

Lastly, quoted patterns are only available inside of placeables and cannot be
used directly as values.

The exact semantics of \ escapes will be defined in projectfluent#22.
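
To illustrate the error-recovery claim in the commit message above, here is a minimal sketch in Python. It is not taken from the Fluent parser: the function name `next_message_line` and the identifier pattern are assumptions made for illustration, and the real grammar may define identifiers differently.

```python
import re

# Assumed identifier shape, for illustration only; the real grammar may differ.
MESSAGE_START = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")


def next_message_line(lines, after):
    """Return the index of the next line that starts with an unindented
    identifier, searching after index `after`; return None if there is none."""
    for index in range(after + 1, len(lines)):
        # re.match anchors at column 0, so indented lines never match.
        if MESSAGE_START.match(lines[index]):
            return index
    return None


lines = [
    "lipsum = Lorem ipsum",
    "    pellentesque {",      # indented body line, possibly broken
    "    .attr = Attribute",   # indented attribute
    "other = Value",
]
print(next_message_line(lines, 0))
```

Because every body line is indented, a parser that hits an error inside `lipsum` can skip ahead to `other` without understanding what went wrong in between.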
@stasm stasm added the syntax label Feb 16, 2017
stasm added a commit to stasm/fluent that referenced this issue Feb 24, 2017
Fix projectfluent#12, projectfluent#17, projectfluent#18.

With this change, the entire body of a message must be indented. This makes
error recovery very easy: finding the next message definition is as simple as
finding the next identifier with no indentation.

It also opens up a number of opportunities: we can remove the `|` syntax for
multiline blocks of text and allow line breaks inside of placeables safely.

The change also allows the value to be defined on a new line, making the
following examples equivalent:

    lipsum = Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi
        pellentesque congue metus, non mattis sem faucibus sit amet.

    lipsum =
        Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi
        pellentesque congue metus, non mattis sem faucibus sit amet.

Lastly, quoted patterns are only available inside of placeables, cannot contain
other placeables and cannot be used directly as values.

The exact semantics of \ escapes will be defined in projectfluent#22.
stasm added a commit that referenced this issue Feb 24, 2017
Fix #12, #17, #18.

With this change, the entire body of a message must be indented. This makes
error recovery very easy: finding the next message definition is as simple as
finding the next identifier with no indentation.

It also opens up a number of opportunities: we can remove the `|` syntax for
multiline blocks of text and allow line breaks inside of placeables safely.

The change also allows the value to be defined on a new line, making the
following examples equivalent:

    lipsum = Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi
        pellentesque congue metus, non mattis sem faucibus sit amet.

    lipsum =
        Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi
        pellentesque congue metus, non mattis sem faucibus sit amet.

Lastly, quoted patterns are only available inside of placeables, cannot contain
other placeables and cannot be used directly as values.

The exact semantics of \ escapes will be defined in #22.
@stasm
Contributor Author

stasm commented Mar 2, 2017

A draft proposal:

  • Escape sequences are only allowed in the text and quoted-text productions.

  • Known escapes are: \\, \*, \[, \{, \uXXXX, \t, \n, \".

    • Do we actually need \t and \n? The whole syntax was designed to make it easy to use white-space and those characters could be written literally, if needed.
  • Unicode sequences are only valid with four characters: \u0020 is valid, \u20 is not. This is the same as in JavaScript.

    • ES2015 introduced Unicode code point escapes, which are written as \u{XXXXXXX} with any number of X. They make it possible to represent code points from outside of the Basic Multilingual Plane without resorting to surrogate pairs. Non-BMP code points are very rare, but the question still remains: are we okay with using surrogate pairs for them if we go with the \uXXXX proposal?
  • Escaping any other character returns the character itself and the character is parsed as normal; \a results in a and \ at the end of line results in the EOL character which is parsed as normal. This is different from JavaScript which has a special case for \EOL which is called LineContinuation and technically is not an escape sequence.

    For instance, in the following example, the escaped EOL results in a real EOL and it ends the value part of the a variant:

    foo = {
           *[a] AAA\
            [b] BBB
        }
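
The draft rules above could be sketched roughly as follows. This is only an illustration in Python, not the Fluent parser: the name `unescape_draft` is hypothetical, raising `ValueError` for a malformed `\u` sequence is an assumption (the draft only says `\u20` is not valid), and a lone backslash at the very end of the input is kept as-is, which is also an assumption.

```python
# Hypothetical sketch of the draft proposal; not the actual Fluent parser.
KNOWN_ESCAPES = {
    "\\": "\\", "*": "*", "[": "[", "{": "{", '"': '"',
    "t": "\t", "n": "\n",
}
HEX_DIGITS = set("0123456789abcdefABCDEF")


def unescape_draft(text):
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        # Ordinary characters (and a trailing lone backslash) pass through.
        if ch != "\\" or i + 1 == len(text):
            out.append(ch)
            i += 1
            continue
        nxt = text[i + 1]
        if nxt == "u":
            # Exactly four hex digits are required: \u0020 yes, \u20 no.
            digits = text[i + 2:i + 6]
            if len(digits) == 4 and all(d in HEX_DIGITS for d in digits):
                out.append(chr(int(digits, 16)))
                i += 6
                continue
            raise ValueError("malformed Unicode escape at index {}".format(i))
        # Unknown escapes return the escaped character itself, so \a is "a"
        # and a backslash before EOL yields the EOL character.
        out.append(KNOWN_ESCAPES.get(nxt, nxt))
        i += 2
    return "".join(out)
```

The sketch only covers the text transformation. Under the draft, the resulting EOL would still be handed back to the parser and end the variant value, which is what the example above shows.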
    

@Pike, @zbraniecki — I'd love to hear your thoughts on this. Thanks!

@zbraniecki
Collaborator

sgtm! I'd not do \n, \t until we have a use case.

@stasm
Contributor Author

stasm commented Mar 2, 2017

I woke up this morning and I had another idea: what if we tried to use the { "x" } pattern as much as we can? The following is a counter-proposal to the one above.

Firstly, let's talk about the backslash. In a more extreme version of the proposal, it can (a) become a regular literal character. Or, it could (b) escape any character to itself, taking it out of the parse flow.

  • In (a) we need a new solution for Unicode escapes. Perhaps a new literal: Foo { U+10000 } bar?

  • In (b) we need four exceptions: \\, \", \uXXXX and \EOL because we don't want to take the EOL out of the parse flow.

  • Or, (c) we could introduce the U+10000 literals and leave backslash for escaping only the " and EOL (and the \ itself). This makes sense: these are the characters used to end a string.

Special characters occurring in text can be escaped by putting them in placeables. quoted-text doesn't allow more placeables, so the following are valid and unambiguous: { "{" }, { "[" } etc. The one exception is the double quote " itself. In (a) we don't have a way to put it in a quoted-text.

So the question boils down to: how much do we want to limit the quoted-text production? It is mostly used in call-expression and I like the idea of keeping the arguments very simple. But maybe we don't want to limit them too much in case we'd like to have things like WRAP(brand-name, char: "\"") or LIST(users, separator: "\uXXXX") in the future.

@stasm
Contributor Author

stasm commented Mar 6, 2017

After more thought I'd like to go back to the first proposal and also make it simpler.

  • Escape sequences are only allowed in text and quoted-text.
  • Known escape sequences are: \\ for the literal backslash, \" for the literal double quote and \{ for the literal opening brace.
  • Any other escaped characters result in the literal character being added to the text content of the production. So, \a is a and \EOL is EOL.
  • Other special characters like [ can be written as { "[" } if they happen to be at the beginning of the line and should be part of the text content.
  • Using Unicode in FTL is encouraged and as such, we don't offer the \uXXXX sequence at all.

@Pike, @zbraniecki - mind taking another look at this, please?

@Pike
Contributor

Pike commented Mar 8, 2017

I'm not sure if doing \n->n is a good idea. Or maybe that's something we can warn about in a linter step? It seems like such a ubiquitous assumption that that'd be a newline. And other fall-through escapes. We do have such a warning in compare-locales for .properties, too. Rambling.

For unicode escapes, I've just toyed around with the unicode hex keyboard on the mac. Interestingly, you need to enter surrogate pairs to get to 𝌆, 8 keystrokes away. I wonder if @flodolo or @TheoChevalier have opinions on this as people that actually have to type that unicode stuff.

Apart from that, the latest proposal sounds fine to me.
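
For reference, the surrogate-pair arithmetic behind those 8 keystrokes can be sketched like this. The helper name `to_surrogate_escapes` is made up for illustration; the computation itself is the standard UTF-16 encoding of a code point above U+FFFF.

```python
def to_surrogate_escapes(char):
    """Spell a single character as one or two \\uXXXX escapes."""
    code = ord(char)
    if code < 0x10000:
        # BMP code points fit in a single four-digit escape.
        return "\\u{:04X}".format(code)
    # Non-BMP code points are split into a high and a low surrogate.
    offset = code - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return "\\u{:04X}\\u{:04X}".format(high, low)


# U+1D306 is the 𝌆 character mentioned above.
print(to_surrogate_escapes(chr(0x1D306)))  # \uD834\uDF06
```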

@TheoChevalier

I don’t think not being able to use \uXXXX would be a problem, but I guess people would have to try once to discover it’s not supported? Would using it produce a syntax error?

@flodolo
Contributor

flodolo commented Mar 8, 2017

I might have more questions than answers…

  1. Do we really expect to be able to live without new lines in a string? BTW, we also have \r around in mozilla-central

  2. What about trying to promote FTL as a file format for other uses, where Fluent is not used as the technology driving the project but just as a parser? Is that excluded as a potential scenario? If it's not, supporting unicode and new lines seems needed.

  3. IMO all special characters should be treated equally. If [ is a special character, it should be escaped as \[. Think regular expressions for example.
    Also, the idea of having to write something like { "[" } for displaying one character makes me really want to ┻━┻ ︵ヽ(`Д´)ノ︵ ┻━┻

  4. Displaying t when I write \t, because \t is not recognized as a known escape sequence, doesn't sound like a good idea. Here are a few possible scenarios:

  • Code:\t { foo }: I wanted to create a tab, displaying t is bad, I should get rid of the whole escape sequence.
  • Go to c:\documents: I wanted to write a literal \, so it should have been Go to c:\\documents. Again, displaying c:documents doesn't seem like a good idea.
  • %Spx \u00D7 %Spx: I wanted to display %Spx × %Spx, displaying %Spx u00D7 %Spx, awful result.

Maybe dropping the string altogether with an error is a better option.
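
The three scenarios can be reproduced with a small sketch of the fall-through rule from the Mar 6 proposal, where a backslash always yields the character that follows it. The function name `unescape_fallthrough` is hypothetical and the sketch ignores placeables entirely; it only shows what happens to the text.

```python
def unescape_fallthrough(text):
    """Replace every backslash-escaped character with the character itself."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            # \\ -> \, \" -> ", \{ -> {, and any other \x -> x.
            out.append(text[i + 1])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


for source in [r"Code:\t { foo }", r"Go to c:\documents", r"%Spx \u00D7 %Spx"]:
    print(unescape_fallthrough(source))
# Code:t { foo }
# Go to c:documents
# %Spx u00D7 %Spx
```

In all three cases the output is silently different from what the author intended, which is the argument for reporting an error instead.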

@stasm
Contributor Author

stasm commented Mar 8, 2017

Do we really expect to be able to live without new lines in a string? BTW, we also have \r around in mozilla-central

New-lines are supported by the syntax natively. You don't need to escape them, just write them as normal:

foo =
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed
    aliquam dui quis nibh rutrum semper. Vestibulum a enim eget
    orci imperdiet tincidunt nec mattis leo. Aenean faucibus ligula
    turpis, eu tincidunt lorem malesuada eget.

Also, the idea of having to write something like { "[" } for displaying one character makes me really want to ┻━┻ ︵ヽ(`Д´)ノ︵ ┻━┻

That would be only necessary if [ happens to be at the beginning of a multiline value. Otherwise, it's not special. Consider:

foo = Foo [ Bar ]
bar =
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    { "[" } bar ].

Code:\t { foo }

What is the advantage of writing \t rather than typing the tab character?

@flodolo
Contributor

flodolo commented Mar 8, 2017

New-lines are supported by the syntax natively.

Uh, forgot that multiline strings have a different syntax. So, that's not a problem.

That would be only necessary if [ happens to be at the beginning of a multiline value.

Aren't we asking too much then to localizers working on these files? I know we would prefer them to use tools, where this would be automated and transparent, but it seems to add a lot of complexity.
Can you think of other languages where escaping depends on the position of the character?

What is the advantage of writing \t rather than typing the tab character?

None, but my understanding is that we're considering the case where someone wrote the string assuming \t (or \n, \r, etc.) would be converted to a tab or a newline, and how to deal with that.

@zbraniecki
Collaborator

What is the advantage of writing \t rather than typing the tab character?

Some editors define tab behavior to jump between inputs.

@stasm
Contributor Author

stasm commented Apr 7, 2017

Last week we met in person and briefly discussed this issue with @Pike, @zbraniecki and @flodolo. Here are the key take-aways from that conversation:

  • We don't have to answer all questions right now.
  • Prefer to be flexible: don't normalize by default.
  • In the future, allow bindings to configure the context's behavior wrt. normalization.
  • Unicode escapes are a safety valve.
  • Parsing \n to n isn't helpful and produces an unexpected behavior.

With that in mind, I'd like to suggest a minimal specification for our current purposes.

  • Escape sequences are only allowed in text and quoted-text.
  • Newlines are preserved by the parser. This allows proper serialization.
  • Known escape sequences are: \\ for the literal backslash, \" for the literal double quote, \{ for the literal opening brace and \u followed by 4 hex digits for Unicode code points. Representing code points from outside of the Basic Multilingual Plane is made possible with surrogate pairs (two \uXXXX sequences). Using the actual character is encouraged, however.
  • Any other escaped characters result in a parsing error. (We might relax this to producing warnings and parsing to a space for instance, but let's start with a stricter approach.)
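
A rough sketch of how this minimal specification might behave, written in Python for illustration only. It is not the Fluent implementation: the names `unescape_strict` and `read_unicode_escape` are hypothetical, `ValueError` stands in for the parsing error, and passing an unpaired surrogate through unchanged is an assumption that the list above does not cover.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")
SIMPLE_ESCAPES = {"\\", '"', "{"}


def read_unicode_escape(text, i):
    # Return the value of a \uXXXX escape starting at index i, or None.
    digits = text[i + 2:i + 6]
    if (text[i:i + 2] == "\\u" and len(digits) == 4
            and all(d in HEX_DIGITS for d in digits)):
        return int(digits, 16)
    return None


def unescape_strict(text):
    out = []
    i = 0
    while i < len(text):
        if text[i] != "\\":
            out.append(text[i])
            i += 1
            continue
        nxt = text[i + 1:i + 2]
        if nxt in SIMPLE_ESCAPES:
            out.append(nxt)
            i += 2
            continue
        unit = read_unicode_escape(text, i)
        if unit is None:
            # Anything else, including \n, \t and \u20, is an error.
            raise ValueError("unknown escape sequence at index {}".format(i))
        i += 6
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate followed by a low surrogate forms one
            # code point outside of the Basic Multilingual Plane.
            low = read_unicode_escape(text, i)
            if low is not None and 0xDC00 <= low <= 0xDFFF:
                unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                i += 6
        out.append(chr(unit))
    return "".join(out)


print(unescape_strict(r"%Spx \u00D7 %Spx"))  # %Spx × %Spx
```

With this behavior `\n` and `\t` are rejected rather than quietly turned into `n` and `t`, which matches the take-away that parsing `\n` to `n` isn't helpful.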

@zbraniecki
Collaborator

@stasm is there anything left in this issue?

@stasm
Contributor Author

stasm commented Aug 4, 2017

No, I forgot to close this issue. And to tag Syntax Spec 0.3 back in April. D'oh. Thanks.

@stasm stasm closed this as completed Aug 4, 2017