keycap emoji is treated as formatting #646

mikesamuel · 2020-04-24T14:56:46Z

The keycap emoji, *️⃣, used for the '*' telephone button is encoded via a sequence of 3 codepoints:

U+002A (Asterisk)
U+FE0F (Variation Selector-16)
U+20E3 (Combining Enclosing Keycap)

Sometimes CommonMark treats the leading asterisk as a formatting character, as in **️⃣abc** (\x{2A 2A FE0F 20E3 61 62 63 2A 2A}).
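Because two of these code points are invisible or combining, it can help to write the input with explicit escapes and list what it contains. A minimal Python sketch (the helper name `describe` is made up for illustration):

```python
import unicodedata

# The input from the report, written with escapes so that no invisible
# characters hide in the source: * * U+FE0F U+20E3 a b c * *
REPORT_INPUT = "\u002A\u002A\uFE0F\u20E3abc\u002A\u002A"

def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

for line in describe(REPORT_INPUT[:4]):
    print(line)
```

The second asterisk together with the following U+FE0F and U+20E3 forms the keycap sequence; the first asterisk is ordinary text.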

To reproduce

permalink to REPL

Observe that there is a placeholder glyph followed by bold "abc".
Note that the HTML tab shows ️⃣abc.

I expect that instead, the output should contain all three UTF-16 code units for the *️⃣ emoji.

Relevant specifications

Unicode TR#51 explains

ED-14c. emoji keycap sequence — A sequence of the following form:
emoji_keycap_sequence := [0-9#*] \x{FE0F 20E3}
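
The ED-14c production maps directly onto a regular expression. A small Python sketch, where `KEYCAP_RE` and `find_keycaps` are illustrative names rather than anything from a CommonMark implementation:

```python
import re

# ED-14c: one of 0-9, '#' or '*', followed by U+FE0F U+20E3.
KEYCAP_RE = re.compile(r"[0-9#*]\uFE0F\u20E3")

def find_keycaps(text: str) -> list[str]:
    """Return every emoji keycap sequence found in text."""
    return KEYCAP_RE.findall(text)
```

Applied to the input from this report, it picks out only the second asterisk and the two code points that follow it.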

Possibly out of scope, but to get the keycap on the first line of this issue to show up properly in Github flavoured markdown, I needed to precede it with a backslash (\).


jgm · 2020-04-24T16:27:01Z

Simple solution is to backslash-escape it. Commonmark regards its input as a sequence of characters and doesn't know about this keycap encoding (which I'd never heard of before).

But maybe it would be worth changing the spec so that an emphasis character followed by a variation selector (U+FE00..U+FE0F) is always treated as literal.
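
Until the spec changes, the backslash workaround can be applied mechanically before the text reaches a parser. A rough Python sketch, assuming it is acceptable to preprocess the source; the function name is hypothetical, and the regex does not understand code spans or an escaped backslash directly before the asterisk, so it is only a starting point:

```python
import re

# '*' or '_' directly followed by a variation selector (U+FE00..U+FE0F)
# and not already preceded by a backslash.
_EMPH_BEFORE_VS = re.compile(r"(?<!\\)([*_])(?=[\uFE00-\uFE0F])")

def escape_emphasis_before_vs(markdown: str) -> str:
    """Insert a backslash before emphasis characters that start a variation sequence."""
    return _EMPH_BEFORE_VS.sub(r"\\\1", markdown)
```

Text that is already escaped is left alone, so running the function twice gives the same result as running it once.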

Crissov · 2020-04-24T23:45:41Z

This would also apply to digits 0️⃣1️⃣2️⃣3️⃣4️⃣5️⃣6️⃣7️⃣8️⃣9️⃣ and hash mark #️⃣.

wooorm · 2020-07-04T17:26:13Z

@Crissov While that could theoretically be a problem, it wouldn’t occur in practice, right? CM needs a following `.` or `)` for list items, or a space for headings.

*️⃣a*

*a*️⃣

Yields:

️⃣a

a️⃣

I believe that, to change this behavior in CM, we could add FE0F to 2a) of left-flanking delimiter run:

 A [left-flanking delimiter run](@) is
 a [delimiter run] that is (1) not followed by [Unicode whitespace],
-and either (2a) not followed by a [Unicode punctuation character], or
+and either (2a) not followed by a [Unicode punctuation character] or `U+FE0F`, or
 (2b) followed by a [Unicode punctuation character] and
 preceded by [Unicode whitespace] or a [Unicode punctuation character].
 For purposes of this definition, the beginning and the end of
 the line count as Unicode whitespace.

…and to change 2a) of right-flanking delimiter run too:

 A [right-flanking delimiter run](@) is
 a [delimiter run] that is (1) not preceded by [Unicode whitespace],
-and either (2a) not preceded by a [Unicode punctuation character], or
+and either (2a) not preceded by a [Unicode punctuation character] and not followed by `U+FE0F`, or
 (2b) preceded by a [Unicode punctuation character] and
 followed by [Unicode whitespace] or a [Unicode punctuation character].
 For purposes of this definition, the beginning and the end of
 the line count as Unicode whitespace.
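
To check the two amended definitions against the examples above, here is a Python sketch of the flanking test with the `U+FE0F` conditions added. It only looks at the character before and after a delimiter run. The whitespace and punctuation helpers approximate the spec's definitions (ASCII punctuation plus the Unicode P categories), and all names are illustrative:

```python
import string
import unicodedata

VS16 = "\uFE0F"

def is_whitespace(ch: str) -> bool:
    # "" stands for the beginning or end of the line, which counts as whitespace.
    return ch == "" or ch in "\t\n\f\r" or unicodedata.category(ch) == "Zs"

def is_punctuation(ch: str) -> bool:
    # Approximation: ASCII punctuation or any Unicode P* category.
    return ch != "" and (
        ch in string.punctuation or unicodedata.category(ch).startswith("P")
    )

def flanking(before: str, after: str) -> tuple[bool, bool]:
    """Return (left_flanking, right_flanking) under the amended rules."""
    left = not is_whitespace(after) and (
        (not is_punctuation(after) and after != VS16)  # amended 2a
        or (is_punctuation(after)
            and (is_whitespace(before) or is_punctuation(before)))  # 2b
    )
    right = not is_whitespace(before) and (
        (not is_punctuation(before) and after != VS16)  # amended 2a
        or (is_punctuation(before)
            and (is_whitespace(after) or is_punctuation(after)))  # 2b
    )
    return left, right
```

With these rules, the opening `*` in the first example is no longer left-flanking and the closing `*` in the second example is no longer right-flanking, so neither pair can form emphasis.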

Crissov · 2020-07-04T19:33:32Z

Indeed, it is less of a problem for digits and # than it is for *.

ghost · 2020-10-28T17:42:35Z

I wanted to mention that I personally think the best way to handle this in the specification level is to work with the text as grapheme clusters, rather than as code points.

Of course, implementations that do not wish to implement the whole segmentation algorithm can use ad‐hoc criteria like @wooorm’s.

rsc · 2021-09-04T15:58:18Z

@zamfofex are there instances ~~other than this one~~ where the two approaches would differ?

Edited: Struck out "other than this one" because they don't differ here. The question was whether they ever differ. If not, that's a good sign because an implementation can do whichever is more convenient.

tats-u · 2024-10-13T12:39:45Z

Here is a simple corner case:

**foo**️⃣

Bad (current): foo️⃣
Bad: **foo**️⃣
Good: *foo*️⃣

> I wanted to mention that I personally think the best way to handle this in the specification level is to work with the text as grapheme clusters, rather than as code points.

How about peeking two next codepoints? All we have to do is treat * followed by the following sequences as neither-flanking:

U+FE0F U+20E3 (*️⃣: emoji)
U+FE0E U+20E3 (*︎⃣: text symbol)
U+20E3 (can be either)

I think this can't be fixed with the current delimiter run; we have to exclude such * from delimiter run of *.
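
The two-code-point lookahead could look roughly like this in Python. It is only a sketch of the idea above, not code from any parser, and the function names are made up:

```python
KEYCAP = "\u20E3"
VARIATION_SELECTORS = ("\uFE0E", "\uFE0F")

def starts_keycap(text: str, i: int) -> bool:
    """True if text[i] is '*' followed by U+20E3, optionally after U+FE0E or U+FE0F."""
    if text[i:i + 1] != "*":
        return False
    nxt = text[i + 1:i + 2]
    if nxt == KEYCAP:
        return True
    return nxt in VARIATION_SELECTORS and text[i + 2:i + 3] == KEYCAP

def keycap_star_positions(text: str) -> list[int]:
    """Indices of asterisks that would be excluded from delimiter runs."""
    return [i for i in range(len(text)) if starts_keycap(text, i)]
```

A tokenizer would consult `starts_keycap` before adding a `*` to a delimiter run and emit the asterisk as literal text when it returns true.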

wooorm mentioned this issue Oct 7, 2024

Emphasis with CJK punctuation #650

Open
