Create explicit AST node for escape sequences #195

Pike · 2018-10-23T10:40:03Z

Right now, the runtime behavior of \XXX is only implicitly defined by the js runtime.

To fix this, we should have a dedicated object in our AST tree that represents a escaped character, that is Unicode escapes, and escapes for syntax chars.

This is half of #156 with a different rationale. The AST might look similar to #123 (comment).

The text was updated successfully, but these errors were encountered:

stasm · 2018-10-23T11:15:08Z

I've looked at how other languages deal with it and I didn't find any which would put escape sequences into separate AST nodes. I used https://astexplorer.net/ and https://docs.python.org/3/library/ast.html (also available online at https://python-ast-explorer.com/).

Can we achieve the same thing by specifying in written text how the known escapes, which are well defined by the grammar, should be handled?

zbraniecki · 2018-10-26T11:26:48Z

@Manishearth - can you advise us on this issue?

stasm · 2018-10-26T13:17:10Z

Looking at the ECMA262, the spec has a separate paragraph about the semantics of string values in which it defines the list of known escape sequences and their expect values. Maybe it would be enough to do the same in Fluent's specification?

Manishearth · 2018-10-27T13:28:38Z

Typically escape sequences are handled by the tokenizer, not the parser, and are absent from the AST.

Pike · 2018-10-28T11:42:48Z

One aspect that's perhaps a bit special about the tooling design around Fluent is that we care about round-tripping between parsing and serialization. In that sense, a parsing step that discards information is something we try to avoid.

I'm also looking at this from the POV of other people implementing runtimes. The more explicit our reference implementation is, the easier it should be for others to implement a spec-compliant Fluent implementation. I assume that it's easier to optimize things out of the algorithms in the reference implementation than to add them. That goes along the lines of what Manish said, often runtime implementations will deal with things like escape sequences really early, and throw the detail information away.

I wonder if we should re-introduce a StringExpression as a non-trivial string literal. On the AST side, StringLiteral could be a sibling to EscapeLiteral. That way, StringLiteral would continue to be atomic. And as part of abstract.mjs, we could optimize out a StringExpression that has just one element.

Manishearth · 2018-10-28T11:48:03Z

round-tripping between parsing and serialization

Can you expand on why? You're already filtering out whitespace, presumably.

Most languages deal with this by tagging the AST with spans, so you also know what the original code looked like if you ever need this from a tooling perspective (e.g. if writing an autoformatter). Spans are in general a really convenient way of supporting a lot of related tooling use cases and usually help keep the AST simple and usable.

stasm · 2018-10-29T10:25:03Z

Can you expand on why? You're already filtering out whitespace, presumably.

Tools like Pontoon parse translations stored in Fluent files, let localizers edit them, and then serialize them back to Fluent. It's true that the serialization currently discards the original whitespace and use its own pretty-printing rules. We still want to preserve as much of the original content as possible. If the localizer uses \u00A0, we want to serialize it back as \u00A0 rather than the actual non-breaking space which would be easy to miss by reviewers and future localizers.

stasm · 2018-10-29T10:26:42Z

I'm also looking at this from the POV of other people implementing runtimes. The more explicit our reference implementation is, the easier it should be for others to implement a spec-compliant Fluent implementation.

This could be specified as an implementation note accompanying the StringLiteral production, too, now that we decided to only recognize escape sequences in StringLiterals (#123).

Manishearth · 2018-10-29T10:36:25Z

If the localizer uses \u00A0, we want to serialize it back as \u00A0 rather than the actual non-breaking space which would be easy to miss by reviewers and future localizers.

Well, you can go the other way and serialize as escapes, but that's not great either.

Yeah, you want to use a span here so that the original string is obtainable. Alternatively, store the original string alongside the parsed one in case there were escape sequences. I think for tooling you'll eventually need spans anyway, but storing strings alongside is a valid solution that doesn't require adding support for spans.

stasm · 2018-10-29T11:10:34Z

So then we'd have value and rawValue of sorts, where value has all escape sequences unescaped, and rawValue—not, right? That's what I think I saw in a few AST explorers I tested.

Would you also recommend that rawValue be a simple string or rather a list of strings interleaved with special nodes for escape sequences? To illustrate, would rawValue be StringLiteral {"Privacy\\u00A0Policy"} or (StringLiteral {"Privacy"}, UnicodeEscapeLiteral {160}, StringLiteral {"Policy"})?

stasm · 2018-10-29T11:11:59Z

I should also mention that our tooling parsers do support spans. They are useful in many read-only scenarios, but are they also a good solution when tools like Pontoon allow localizers to edit the content of translations stored in the AST?

Manishearth · 2018-10-29T11:13:58Z

I'd have rawValue be the actual raw value, not any structured variant.

but are they also a good solution when tools like Pontoon allow localizers to edit the content of translations stored in the AST?

Yep, because you don't need to store a rawValue in that case, you can convert the span to a code snippet

Pike · 2018-10-29T11:37:36Z

Please also consider that we're creating Fluent AST from scratch for migrations.

In my experience, spans are often in the way of testing and mocking, as you end up creating fake content with fake offsets into it.

Manishearth · 2018-10-30T03:04:37Z

In that case just store the optional rawValue :)

Spans let you have an optimization so that you don't have to store rawValue, but that's just an optimization

stasm · 2018-11-05T19:34:05Z

Thanks for sharing your thoughts and experiences. I'd like to propose a way forward here. My goal is to keep things as simple as possible while still allowing the specification to be, well, specific about the expected behavior of the escapes.

Following @Manishearth's advice, let's add a raw field to StringLiterals which contains the literal contents of the StringLiterals, escape sequences copied verbatim from source. Let's also add the unescaping logic to abstract.mjs which produces the value field. This way, the unescaping logic will become part of the spec.

"a\u2013z" would parse as (serialized to JSON):

{
    "type": "StringLiteral",
    "raw": "a\\u2013z",
    "value": "a−z"
}

For simplicity, I would make raw present on all StringLiterals, but I'm open to being convinced that it should be optional.

I'll prepare a PR.

stasm · 2018-11-08T12:07:14Z

#203 added a raw field to StringLiterals. The value field now contains the unescaped value of the string literal. The unescaping behavior is now properly specified in abstract.mjs, which I consider a fix to the root problem described in this issue. I think we can close it now. Thanks for the discussion, everyone!

Pike added the syntax label Oct 23, 2018

stasm mentioned this issue Oct 24, 2018

Remove backslash escapes from TextElement #123

Closed

stasm mentioned this issue Nov 6, 2018

Store unescaped content in StringLiteral.value and raw content in StringLiteral.raw #203

Merged

stasm closed this as completed Nov 8, 2018

This was referenced Nov 29, 2018

Implement Syntax 0.8, part3 projectfluent/fluent.js#310

Merged

Consider renaming StringLiteral::raw to StringLiteral::value and StringLiteral::value to StringLiteral::decoded #222

Closed

stasm mentioned this issue Feb 21, 2019

Remove unescaped string literals from the AST #243

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create explicit AST node for escape sequences #195

Create explicit AST node for escape sequences #195

Pike commented Oct 23, 2018

stasm commented Oct 23, 2018 •

edited

Loading

zbraniecki commented Oct 26, 2018

stasm commented Oct 26, 2018

Manishearth commented Oct 27, 2018

Pike commented Oct 28, 2018

Manishearth commented Oct 28, 2018 •

edited

Loading

stasm commented Oct 29, 2018

stasm commented Oct 29, 2018

Manishearth commented Oct 29, 2018

stasm commented Oct 29, 2018 •

edited

Loading

stasm commented Oct 29, 2018

Manishearth commented Oct 29, 2018

Pike commented Oct 29, 2018

Manishearth commented Oct 30, 2018

stasm commented Nov 5, 2018

stasm commented Nov 8, 2018

Create explicit AST node for escape sequences #195

Create explicit AST node for escape sequences #195

Comments

Pike commented Oct 23, 2018

stasm commented Oct 23, 2018 • edited Loading

zbraniecki commented Oct 26, 2018

stasm commented Oct 26, 2018

Manishearth commented Oct 27, 2018

Pike commented Oct 28, 2018

Manishearth commented Oct 28, 2018 • edited Loading

stasm commented Oct 29, 2018

stasm commented Oct 29, 2018

Manishearth commented Oct 29, 2018

stasm commented Oct 29, 2018 • edited Loading

stasm commented Oct 29, 2018

Manishearth commented Oct 29, 2018

Pike commented Oct 29, 2018

Manishearth commented Oct 30, 2018

stasm commented Nov 5, 2018

stasm commented Nov 8, 2018

stasm commented Oct 23, 2018 •

edited

Loading

Manishearth commented Oct 28, 2018 •

edited

Loading

stasm commented Oct 29, 2018 •

edited

Loading