Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create explicit AST node for escape sequences #195

Closed
Pike opened this issue Oct 23, 2018 · 16 comments
Closed

Create explicit AST node for escape sequences #195

Pike opened this issue Oct 23, 2018 · 16 comments
Labels

Comments

@Pike
Copy link
Contributor

Pike commented Oct 23, 2018

Right now, the runtime behavior of \XXX is only implicitly defined by the js runtime.

To fix this, we should have a dedicated object in our AST tree that represents a escaped character, that is Unicode escapes, and escapes for syntax chars.

This is half of #156 with a different rationale. The AST might look similar to #123 (comment).

@Pike Pike added the syntax label Oct 23, 2018
@stasm
Copy link
Contributor

stasm commented Oct 23, 2018

I've looked at how other languages deal with it and I didn't find any which would put escape sequences into separate AST nodes. I used https://astexplorer.net/ and https://docs.python.org/3/library/ast.html (also available online at https://python-ast-explorer.com/).

Can we achieve the same thing by specifying in written text how the known escapes, which are well defined by the grammar, should be handled?

@zbraniecki
Copy link
Collaborator

@Manishearth - can you advise us on this issue?

@stasm
Copy link
Contributor

stasm commented Oct 26, 2018

Looking at the ECMA262, the spec has a separate paragraph about the semantics of string values in which it defines the list of known escape sequences and their expect values. Maybe it would be enough to do the same in Fluent's specification?

@Manishearth
Copy link

Typically escape sequences are handled by the tokenizer, not the parser, and are absent from the AST.

@Pike
Copy link
Contributor Author

Pike commented Oct 28, 2018

One aspect that's perhaps a bit special about the tooling design around Fluent is that we care about round-tripping between parsing and serialization. In that sense, a parsing step that discards information is something we try to avoid.

I'm also looking at this from the POV of other people implementing runtimes. The more explicit our reference implementation is, the easier it should be for others to implement a spec-compliant Fluent implementation. I assume that it's easier to optimize things out of the algorithms in the reference implementation than to add them. That goes along the lines of what Manish said, often runtime implementations will deal with things like escape sequences really early, and throw the detail information away.

I wonder if we should re-introduce a StringExpression as a non-trivial string literal. On the AST side, StringLiteral could be a sibling to EscapeLiteral. That way, StringLiteral would continue to be atomic. And as part of abstract.mjs, we could optimize out a StringExpression that has just one element.

@Manishearth
Copy link

Manishearth commented Oct 28, 2018

round-tripping between parsing and serialization

Can you expand on why? You're already filtering out whitespace, presumably.

Most languages deal with this by tagging the AST with spans, so you also know what the original code looked like if you ever need this from a tooling perspective (e.g. if writing an autoformatter). Spans are in general a really convenient way of supporting a lot of related tooling use cases and usually help keep the AST simple and usable.

@stasm
Copy link
Contributor

stasm commented Oct 29, 2018

Can you expand on why? You're already filtering out whitespace, presumably.

Tools like Pontoon parse translations stored in Fluent files, let localizers edit them, and then serialize them back to Fluent. It's true that the serialization currently discards the original whitespace and use its own pretty-printing rules. We still want to preserve as much of the original content as possible. If the localizer uses \u00A0, we want to serialize it back as \u00A0 rather than the actual non-breaking space which would be easy to miss by reviewers and future localizers.

@stasm
Copy link
Contributor

stasm commented Oct 29, 2018

I'm also looking at this from the POV of other people implementing runtimes. The more explicit our reference implementation is, the easier it should be for others to implement a spec-compliant Fluent implementation.

This could be specified as an implementation note accompanying the StringLiteral production, too, now that we decided to only recognize escape sequences in StringLiterals (#123).

@Manishearth
Copy link

If the localizer uses \u00A0, we want to serialize it back as \u00A0 rather than the actual non-breaking space which would be easy to miss by reviewers and future localizers.

Well, you can go the other way and serialize as escapes, but that's not great either.

Yeah, you want to use a span here so that the original string is obtainable. Alternatively, store the original string alongside the parsed one in case there were escape sequences. I think for tooling you'll eventually need spans anyway, but storing strings alongside is a valid solution that doesn't require adding support for spans.

@stasm
Copy link
Contributor

stasm commented Oct 29, 2018

So then we'd have value and rawValue of sorts, where value has all escape sequences unescaped, and rawValue—not, right? That's what I think I saw in a few AST explorers I tested.

Would you also recommend that rawValue be a simple string or rather a list of strings interleaved with special nodes for escape sequences? To illustrate, would rawValue be StringLiteral {"Privacy\\u00A0Policy"} or (StringLiteral {"Privacy"}, UnicodeEscapeLiteral {160}, StringLiteral {"Policy"})?

@stasm
Copy link
Contributor

stasm commented Oct 29, 2018

I should also mention that our tooling parsers do support spans. They are useful in many read-only scenarios, but are they also a good solution when tools like Pontoon allow localizers to edit the content of translations stored in the AST?

@Manishearth
Copy link

I'd have rawValue be the actual raw value, not any structured variant.

but are they also a good solution when tools like Pontoon allow localizers to edit the content of translations stored in the AST?

Yep, because you don't need to store a rawValue in that case, you can convert the span to a code snippet

@Pike
Copy link
Contributor Author

Pike commented Oct 29, 2018

Please also consider that we're creating Fluent AST from scratch for migrations.

In my experience, spans are often in the way of testing and mocking, as you end up creating fake content with fake offsets into it.

@Manishearth
Copy link

In that case just store the optional rawValue :)

Spans let you have an optimization so that you don't have to store rawValue, but that's just an optimization

@stasm
Copy link
Contributor

stasm commented Nov 5, 2018

Thanks for sharing your thoughts and experiences. I'd like to propose a way forward here. My goal is to keep things as simple as possible while still allowing the specification to be, well, specific about the expected behavior of the escapes.

Following @Manishearth's advice, let's add a raw field to StringLiterals which contains the literal contents of the StringLiterals, escape sequences copied verbatim from source. Let's also add the unescaping logic to abstract.mjs which produces the value field. This way, the unescaping logic will become part of the spec.

"a\u2013z" would parse as (serialized to JSON):

{
    "type": "StringLiteral",
    "raw": "a\\u2013z",
    "value": "a−z"
}

For simplicity, I would make raw present on all StringLiterals, but I'm open to being convinced that it should be optional.

I'll prepare a PR.

@stasm
Copy link
Contributor

stasm commented Nov 8, 2018

#203 added a raw field to StringLiterals. The value field now contains the unescaped value of the string literal. The unescaping behavior is now properly specified in abstract.mjs, which I consider a fix to the root problem described in this issue. I think we can close it now. Thanks for the discussion, everyone!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants