New syntax for Unicode literals #115

stasm · 2018-05-21T14:00:07Z

Our current Unicode escape sequence uses the \u syntax. This is an established pattern, and we agreed that in general using actual characters should be preferred. However, particularly in the case of non-printable characters, I think we could do better to make them more readable to non-programmers.

I wonder about introducing a new literal type which evaluates to Unicode characters. The syntax would be:

U+[0-9A-F]{4,6}

For example:

# BEFORE
nbsp = An\u00a0example

# AFTER
nbsp = An{U+00A0}example.

To make this a viable solution inside of StringExpressions, we'd need to allow {} placeables inside of StringExpressions which would really mean making them into quoted Patterns.
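As a rough illustration of what resolving the proposed literal could look like, here is a minimal Python sketch. The function name `expand_unicode_literals` and the regex constant are hypothetical and not part of any Fluent implementation; the pattern simply wraps the proposed `U+[0-9A-F]{4,6}` syntax in a placeable.

```python
import re

# Hypothetical pattern: the proposed literal inside a {} placeable, e.g. {U+00A0}.
UNICODE_LITERAL = re.compile(r"\{U\+([0-9A-F]{4,6})\}")


def expand_unicode_literals(pattern: str) -> str:
    """Replace each {U+XXXX} placeable with the character it names."""

    def to_char(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        # Six hex digits can exceed the last valid code point, U+10FFFF.
        if code_point > 0x10FFFF:
            raise ValueError(f"U+{match.group(1)} is outside the Unicode range")
        return chr(code_point)

    return UNICODE_LITERAL.sub(to_char, pattern)


# The AFTER example from the proposal resolves to a no-break space.
print(repr(expand_unicode_literals("An{U+00A0}example")))
```

Anything that does not match the pattern, such as `{U+9}`, is left untouched by this sketch.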


stasm · 2018-05-21T14:54:45Z

To make this a viable solution, we'd need to allow {} placeables inside of StringExpressions which would really mean making them into quoted Patterns.

Or, we could implicitly concatenate expressions which are next to each other: { "word" U+00A0 "another word" }.

Pike · 2018-05-22T10:56:11Z

Do we have a way to get the actual breakage? CC @mathjazz @flodolo

My personal take is that I hate the readability problems at the end of \u00a0 as it totally blends into the following text. I also hate change.

I'd not make the {4,6}, if we go for it. The strict length restriction we have on the \u is for exactly the blending-in-trailing-text problem, if we're going for a walled token like {}, we don't need that. I'd even go for /\{u\+[0-9a-f]{1,6}\}/i

flodolo · 2018-05-22T10:58:40Z

> Do we have a way to get the actual breakage?

Define breakage?

Pike · 2018-05-22T11:00:03Z

How many translations and strings contain \u Unicode escapes right now? And how many did over the course of FTL's lifetime? Probably best answered by a Pontoon DB query :-/

stasm · 2018-05-22T11:00:44Z

> I'd not make the {4,6}, if we go for it.

I suggested the lower bound because I think this is how Unicode code points are specified by convention. Also, {U+9} might look a bit cryptic, while my hope would be that {U+0009} looks more like a Unicode code point.

Pike · 2018-05-22T11:02:24Z

OTOH, {4,6} allows 5; not sure if that was your intent or not. And if we'd allow 5, why not 3 or 2? (1 being silly, I agree, but then also, why not.)
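To make the difference between the two length rules concrete, the following Python sketch runs a few sample tokens through both patterns discussed so far. The helper name `matches` and the constants are illustrative assumptions; the two regexes are the `{4,6}` upper-case form from the original proposal and the case-insensitive `{1,6}` form suggested above.

```python
import re

# Strict form from the original proposal: 4 to 6 upper-case hex digits.
STRICT = re.compile(r"\{U\+[0-9A-F]{4,6}\}")
# Relaxed form: 1 to 6 hex digits, case-insensitive.
RELAXED = re.compile(r"\{u\+[0-9a-f]{1,6}\}", re.IGNORECASE)


def matches(token: str) -> tuple:
    """Return (accepted by STRICT, accepted by RELAXED) for a whole token."""
    return (bool(STRICT.fullmatch(token)), bool(RELAXED.fullmatch(token)))


for token in ["{U+9}", "{U+0009}", "{U+1F600}", "{u+00a0}", "{U+1234567}"]:
    strict_ok, relaxed_ok = matches(token)
    print(f"{token:12} strict={strict_ok} relaxed={relaxed_ok}")
```

Both forms accept the 5-digit `{U+1F600}`; only the relaxed form accepts `{U+9}` and the lower-case `{u+00a0}`.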

flodolo · 2018-05-22T11:03:06Z

We have quite a few, but I don't think we ported anything to FTL (also not sure how they would be ported?). Looking into Pontoon's DB sounds like a good idea though.
https://transvision.mozfr.org/?recherche=%5Cu&repo=gecko_strings&sourcelocale=en-US&locale=it&search_type=strings_entities

stasm · 2018-05-22T11:04:37Z

Looking at https://en.wikipedia.org/wiki/Plane_(Unicode), 5 hexdigits is more commonly used than 6.

Pike · 2018-05-22T11:12:51Z

... actually, clarifying question:

Do you intend to remove \u00a0 support from the spec, or just add a second way to do the same thing?

stasm · 2018-05-22T11:25:34Z

I'd prefer to have a single way, but as I mentioned in the OP, removing \u has consequences for what's possible in StringExpressions (and by extension, if #90 lands, perhaps in variant keys, too).

Actually, let me file another issue clarifying my current thinking.

stasm · 2018-05-22T11:29:33Z

I filed #123 which outlines my plan to simplify the grammar of text inside of TextElements.

mathjazz · 2018-05-22T11:59:40Z

>>> Entity.objects.filter(resource__format="ftl", string__contains="\u").count()
0
>>> Translation.objects.filter(entity__resource__format="ftl", string__contains="\u").count()
2

Both matches are false positives:
https://pontoon.mozilla.org/zh-TW/test-pilot-website/experiments.ftl/?string=169100
https://pontoon.mozilla.org/zh-TW/common-voice/messages.ftl/?string=175911

flodolo · 2018-05-24T13:17:12Z

My preference would be to allow only /\{U\+[0-9A-F]{4,6}\}/, being case-sensitive throughout. That would avoid any possible confusion in writing these literals, and would also help in case of search/replace operations.

Side note: if we do this, we should also think about migrations, and replacing \u00AD with {U+00AD}
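A migration along these lines could be sketched in Python as below. This is only an assumption about how such a rewrite might work, not an existing migration tool: `migrate_escapes` is a hypothetical helper, and it deliberately ignores edge cases such as an escaped backslash directly before `u`.

```python
import re

# Matches the existing four-digit escape, e.g. \u00AD or \u00ad.
ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def migrate_escapes(text: str) -> str:
    """Rewrite \\uXXXX escapes as upper-case {U+XXXX} literals."""
    return ESCAPE.sub(lambda m: "{U+" + m.group(1).upper() + "}", text)


print(migrate_escapes(r"soft\u00adhyphen"))
```

Upper-casing the digits keeps the output compatible with the case-sensitive form suggested above.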

stasm · 2018-08-28T12:58:31Z

I haven't been able to find a way to use the proposed Unicode literals inside of StringLiterals without increasing their complexity.

To make this a viable solution inside of StringLiterals, we'd need to allow placeables inside of StringLiterals which would really mean making them quoted Patterns.

I think this would be a major change in the design of Fluent. It would allow quoted patterns as values of named arguments, possibly variant keys (#90), and as a way to nest complex expressions ({FUNC("{ $sel -> ... }")}).

Or, we could implicitly concatenate expressions which are next to each other: { "word" U+00A0 "another word" }.

This in turn would require something like a StringConcatenation node and more runtime resolution logic to handle it.
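For a sense of the extra runtime logic involved, here is a minimal Python sketch of the concatenation approach. Only the name StringConcatenation comes from the comment above; the `UnicodeLiteral` node, the field names, and the `resolve` function are hypothetical and do not reflect the actual Fluent AST.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class StringLiteral:
    value: str


@dataclass
class UnicodeLiteral:
    # Hypothetical node for a U+XXXX token, stored as an integer code point.
    code_point: int


@dataclass
class StringConcatenation:
    # Adjacent expressions inside one placeable, in source order.
    parts: List[Union[StringLiteral, UnicodeLiteral]]


def resolve(node) -> str:
    """Resolve a node to the string it evaluates to."""
    if isinstance(node, StringLiteral):
        return node.value
    if isinstance(node, UnicodeLiteral):
        return chr(node.code_point)
    if isinstance(node, StringConcatenation):
        return "".join(resolve(part) for part in node.parts)
    raise TypeError(f"Unsupported node: {node!r}")


# { "word" U+00A0 "another word" }
expr = StringConcatenation(
    [StringLiteral("word"), UnicodeLiteral(0x00A0), StringLiteral("another word")]
)
print(repr(resolve(expr)))
```

Even in this reduced form, the resolver needs one more node type and a recursive case that plain StringLiterals do not require.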

I'd like to open this issue up for feedback about these two approaches. My own opinion is that they bring in too much complexity for relatively little gain.

If we agree not to change StringLiterals, I suggest we close this issue as wontfix. {U+9} can also be written as {"\u0009"}. While I like the U+X syntax, I don't think it's a good idea to have two ways to cater to the same edge-case.

Furthermore, if we only allow \u escapes in StringLiterals as #123 proposes, we could extend their grammar to allow more or fewer than 4 hexdigits. ECMAScript 2015 calls this Unicode code point escapes.

{"\u{9}"}
{"\u{1F600}"}
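As an illustration, unescaping such code point escapes could look like the Python sketch below. It assumes a brace-delimited form with 1 to 6 hex digits, modelled on the ECMAScript 2015 syntax; the function name `unescape_code_points` is hypothetical and this is not how any Fluent parser is known to implement it.

```python
import re

# Brace-delimited escape with a variable number of hex digits, e.g. \u{9}.
CODE_POINT_ESCAPE = re.compile(r"\\u\{([0-9a-fA-F]{1,6})\}")


def unescape_code_points(literal: str) -> str:
    """Replace each \\u{X...} escape in a string literal's content."""

    def to_char(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        # Reject values past the last valid code point, U+10FFFF.
        if code_point > 0x10FFFF:
            raise ValueError(f"Code point out of range: {match.group(1)}")
        return chr(code_point)

    return CODE_POINT_ESCAPE.sub(to_char, literal)


print(repr(unescape_code_points(r"tab\u{9}here")))
print(unescape_code_points(r"\u{1F600}"))
```

Because the braces delimit the escape, the digits cannot blend into the following text, which is the readability problem raised earlier in the thread.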

Pike · 2018-09-21T16:14:48Z

The question around unicode literals in string literals is hard to solve. Also, {U+00A0} will constrain our ability to relax message identifiers, which might not be the right way to compromise here.

I agree that WONTFIX is a good resolution here.

stasm · 2018-09-26T10:21:24Z

Agreed. Thanks for your thoughts.

Pike · 2018-09-28T12:07:05Z

Thank you.

stasm added the "forwards incompatible" label (Old parsers won't parse newer files.) May 22, 2018

Pike added the "backwards incompatible" label (Old files won't parse in new parsers.) May 22, 2018

stasm mentioned this issue May 22, 2018

Remove backslash escapes from TextElement #123

Closed

stasm mentioned this issue Jul 26, 2018

Change AST to allow for zero-copy parsing #156

Open

stasm closed this as completed Sep 26, 2018
