New syntax for Unicode literals #115

Closed
stasm opened this issue May 21, 2018 · 17 comments
Labels
backwards incompatible Old files won't parse in new parsers. forwards incompatible Old parsers won't parse newer files.

Comments

@stasm
Contributor

stasm commented May 21, 2018

Our current Unicode escape sequence uses the \u syntax. This is an established pattern, and we agreed that in general using actual characters should be preferred. However, particularly in the case of non-printable characters, I think we could do better to make them more readable to non-programmers.

I wonder about introducing a new literal type which evaluates to Unicode characters. The syntax would be:

U+[0-9A-F]{4,6}

For example:

# BEFORE
nbsp = An\u00a0example

# AFTER
nbsp = An{U+00A0}example.

To make this a viable solution inside of StringExpressions, we'd need to allow {} placeables inside of StringExpressions which would really mean making them into quoted Patterns.
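
As a rough illustration of the proposal, here is a minimal Python sketch of how a `{U+XXXX}` placeable could be resolved to a character. The pattern follows the `U+[0-9A-F]{4,6}` syntax above; the names `UNICODE_LITERAL` and `expand_unicode_literals` are hypothetical and not part of any Fluent implementation.

```python
import re

# Hypothetical pattern for the proposed placeable: {U+XXXX} with 4 to 6
# uppercase hexdigits, as in the syntax suggested above.
UNICODE_LITERAL = re.compile(r"\{U\+([0-9A-F]{4,6})\}")


def expand_unicode_literals(text: str) -> str:
    """Replace every {U+XXXX} placeable with the character it names."""
    def to_char(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:
            # Six hexdigits can exceed the Unicode range; leave such input as-is.
            return match.group(0)
        return chr(code_point)

    return UNICODE_LITERAL.sub(to_char, text)


# The AFTER example from above, resolved to a string with a no-break space.
print(repr(expand_unicode_literals("An{U+00A0}example")))
```

Note that a six-digit upper bound admits values above U+10FFFF, so a real parser would need to decide whether those are syntax errors or runtime errors; the sketch simply leaves them untouched.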

@stasm
Contributor Author

stasm commented May 21, 2018

To make this a viable solution, we'd need to allow {} placeables inside of StringExpressions which would really mean making them into quoted Patterns.

Or, we could implicitly concatenate expressions which are next to each other: { "word" U+00A0 "another word" }.

@stasm stasm added the forwards incompatible Old parsers won't parse newer files. label May 22, 2018
@Pike
Contributor

Pike commented May 22, 2018

Do we have a way to get the actual breakage? CC @mathjazz @flodolo

My personal take is that I hate the readability problems at the end of \u00a0 as it totally blends into the following text. I also hate change.

I'd not make the {4,6}, if we go for it. The strict length restriction we have on the \u is for exactly the blending-in-trailing-text problem, if we're going for a walled token like {}, we don't need that. I'd even go for /\{u\+[0-9a-f]{1,6}\}/i
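
To make the difference concrete, the following Python sketch compares the originally proposed pattern with the relaxed one from this comment, and shows the blending problem that the fixed length of \u guards against. The variable-length unwalled escape is purely hypothetical and exists only for the illustration.

```python
import re

# Patterns transcribed from the discussion; neither is part of any spec.
STRICT = re.compile(r"\{U\+[0-9A-F]{4,6}\}")                  # original proposal
RELAXED = re.compile(r"\{u\+[0-9a-f]{1,6}\}", re.IGNORECASE)  # walled, 1 to 6 digits

# Hypothetical unwalled escape with a variable number of hexdigits,
# used only to show why \u needs a fixed length.
UNWALLED_VARIABLE = re.compile(r"\\u([0-9a-fA-F]{1,6})")


def accepted_by(pattern: re.Pattern, candidate: str) -> bool:
    """True if the whole candidate string matches the pattern."""
    return pattern.fullmatch(candidate) is not None


for candidate in ["{U+00A0}", "{U+9}", "{u+a0}", "{U+1F600}"]:
    print(candidate, accepted_by(STRICT, candidate), accepted_by(RELAXED, candidate))

# Without braces, a variable-length escape swallows the "e" of "example",
# because "e" is itself a hexdigit.
print(UNWALLED_VARIABLE.match(r"\u00a0example").group(1))  # 00a0e
```

With the braces acting as a wall, the closing `}` ends the token unambiguously, which is why the length restriction becomes a readability choice rather than a parsing necessity.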

@flodolo
Contributor

flodolo commented May 22, 2018

Do we have a way to get the actual breakage?

Define breakage?

@Pike
Contributor

Pike commented May 22, 2018

How many translations and strings contain \u Unicode escapes right now? And how many have over the course of FTL's history? Probably best answered by a Pontoon DB query :-/

@stasm
Contributor Author

stasm commented May 22, 2018

I'd not make the {4,6}, if we go for it.

I suggested the lower bound because I think this is how Unicode code points are specified by convention. Also, {U+9} might look a bit cryptic, while my hope would be that {U+0009} looks more like a Unicode code point.

@Pike
Contributor

Pike commented May 22, 2018

OTOH, {4,6} allows 5; not sure if that was your intent or not. And if we allow 5, why not 3 or 2? (1 being silly, I agree, but then also, why not?)

@flodolo
Contributor

flodolo commented May 22, 2018

We have quite a few, but I don't think we ported anything to FTL (also not sure how they would be ported?). Looking into Pontoon's DB sounds like a good idea though.
https://transvision.mozfr.org/?recherche=%5Cu&repo=gecko_strings&sourcelocale=en-US&locale=it&search_type=strings_entities

@stasm
Contributor Author

stasm commented May 22, 2018

Looking at https://en.wikipedia.org/wiki/Plane_(Unicode), 5 hexdigits is more commonly used than 6.
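
One way to read that observation: with a four-digit minimum, only plane 0 fits in 4 hexdigits, planes 1 to 15 need 5, and only plane 16 needs 6. A small Python sketch of the arithmetic (the helper name `hexdigits_needed` is made up for this example):

```python
def hexdigits_needed(code_point: int) -> int:
    """Number of hexdigits in the conventional U+ notation (minimum 4)."""
    return max(4, len(f"{code_point:X}"))


# Group the 17 Unicode planes (0 through 16) by the width of their
# first code point; every code point in a plane has the same width.
planes_by_width = {}
for plane in range(17):
    width = hexdigits_needed(plane * 0x10000)
    planes_by_width.setdefault(width, []).append(plane)

for width, planes in sorted(planes_by_width.items()):
    print(f"{width} hexdigits: planes {planes[0]} to {planes[-1]}")
```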

@Pike Pike added the backwards incompatible Old files won't parse in new parsers. label May 22, 2018
@Pike
Contributor

Pike commented May 22, 2018

... actually, clarifying question:

Do you intend to remove \u00a0 support from the spec, or just add a second way to do the same thing?

@stasm
Contributor Author

stasm commented May 22, 2018

I'd prefer to have a single way, but as I mentioned in the OP, removing \u has consequences for what's possible in StringExpressions (and by extension, if #90 lands, perhaps in variant keys, too).

Actually, let me file another issue clarifying my current thinking.

@stasm
Contributor Author

stasm commented May 22, 2018

I filed #123 which outlines my plan to simplify the grammar of text inside of TextElements.

@mathjazz
Contributor

>>> # Raw strings keep "\u" literal; a bare "\u" is a syntax error in Python 3.
>>> Entity.objects.filter(resource__format="ftl", string__contains=r"\u").count()
0
>>> Translation.objects.filter(entity__resource__format="ftl", string__contains=r"\u").count()
2

Both matches are false positives:
https://pontoon.mozilla.org/zh-TW/test-pilot-website/experiments.ftl/?string=169100
https://pontoon.mozilla.org/zh-TW/common-voice/messages.ftl/?string=175911

@flodolo
Contributor

flodolo commented May 24, 2018

My preference would be to allow only /\{U\+[0-9A-F]{4,6}\}/, being case sensitive all over the place. That would avoid any possible confusion in writing these literals, and would also help in case of search/replace operations.

Side note: if we do this, we should also think about migrations, and replacing \u00AD with {U+00AD}
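
A migration along those lines could be sketched in Python as follows. This is only an assumption about what such a step might look like, not an existing migration recipe; `migrate_escapes` is a hypothetical name, and the sketch does not try to distinguish an escaped backslash followed by `u` from a real escape.

```python
import re

# Matches the existing four-hexdigit escape, e.g. \u00ad or \u00AD.
OLD_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def migrate_escapes(text: str) -> str:
    """Rewrite \\uXXXX escapes as uppercase {U+XXXX} literals."""
    return OLD_ESCAPE.sub(lambda m: "{U+" + m.group(1).upper() + "}", text)


print(migrate_escapes(r"soft\u00adhyphen"))  # soft{U+00AD}hyphen
```

Uppercasing during the rewrite matches the case-sensitive form suggested above, so a later search for `{U+00AD}` would find every occurrence.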

@stasm
Contributor Author

stasm commented Aug 28, 2018

I haven't been able to find a way to use the proposed Unicode literals inside of StringLiterals without increasing their complexity.

To make this a viable solution inside of StringLiterals, we'd need to allow placeables inside of StringLiterals which would really mean making them quoted Patterns.

I think this would be a major change in the design of Fluent. It would allow quoted patterns as values of named arguments, possibly variant keys (#90), and as a way to nest complex expressions ({FUNC("{ $sel -> ... }")}).

Or, we could implicitly concatenate expressions which are next to each other: { "word" U+00A0 "another word" }.

This in turn would require something like a StringConcatenation node and more runtime resolution logic to handle it.
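
To give a sense of the extra runtime logic, here is a minimal Python sketch of what resolving such a node might involve. The node classes and the `resolve` function are hypothetical stand-ins, not the actual Fluent AST or resolver.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class StringLiteral:
    value: str


@dataclass
class UnicodeLiteral:
    code_point: int  # e.g. 0x00A0 for U+00A0


@dataclass
class StringConcatenation:
    parts: List[Union[StringLiteral, UnicodeLiteral]]


def resolve(node) -> str:
    """Resolve a (hypothetical) expression node to its string value."""
    if isinstance(node, StringLiteral):
        return node.value
    if isinstance(node, UnicodeLiteral):
        return chr(node.code_point)
    if isinstance(node, StringConcatenation):
        # Adjacent expressions are joined without any separator.
        return "".join(resolve(part) for part in node.parts)
    raise TypeError(f"unsupported node: {node!r}")


# Corresponds to: { "word" U+00A0 "another word" }
placeable = StringConcatenation(
    [StringLiteral("word"), UnicodeLiteral(0x00A0), StringLiteral("another word")]
)
print(repr(resolve(placeable)))
```

Even in this reduced form, the resolver gains a new node type and a recursive case, and the parser would need a rule for when adjacent expressions count as one concatenation.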

I'd like to open this issue up for feedback about these two approaches. My own opinion is that they bring in too much complexity for relatively little gain.

If we agree not to change StringLiterals, I suggest we close this issue as wontfix. {U+9} can also be written as {"\u0009"}. While I like the U+X syntax, I don't think it's a good idea to have two ways to cater to the same edge-case.

Furthermore, if we only allow \u escapes in StringLiterals as #123 proposes, we could extend their grammar to allow more or fewer than 4 hexdigits. ECMAScript 2015 calls these Unicode code point escapes.

{"\u{9}"}
{"\u{1F600}"}
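
A minimal Python sketch of unescaping that braced form, assuming 1 to 6 hexdigits between the braces and rejecting values beyond U+10FFFF. The grammar and the name `unescape_code_points` are assumptions for illustration; nothing here is specified by Fluent.

```python
import re

# Assumed grammar: \u{...} with 1 to 6 hexdigits, modelled on the
# ECMAScript 2015 code point escape.
CODE_POINT_ESCAPE = re.compile(r"\\u\{([0-9a-fA-F]{1,6})\}")


def unescape_code_points(literal: str) -> str:
    """Replace each \\u{...} escape with the character it denotes."""
    def to_char(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:
            raise ValueError(f"code point out of range: {match.group(0)}")
        return chr(code_point)

    return CODE_POINT_ESCAPE.sub(to_char, literal)


print(repr(unescape_code_points(r"\u{9}")))                 # '\t'
print(unescape_code_points(r"\u{1F600}") == "\U0001F600")   # True
```

Because the braces delimit the digits, the escape no longer blends into the text that follows it, which addresses the readability concern without adding a new literal type.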

@Pike
Contributor

Pike commented Sep 21, 2018

The question around unicode literals in string literals is hard to solve. Also, {U+00A0} will constrain our ability to relax message identifiers, which might not be the right way to compromise here.

I agree that WONTFIX is a good resolution here.

@stasm
Contributor Author

stasm commented Sep 26, 2018

Agreed. Thanks for your thoughts.

@stasm stasm closed this as completed Sep 26, 2018
@Pike
Contributor

Pike commented Sep 28, 2018

Thank you.


4 participants