Spaces around East Asian punctuations in decorated text should not be required #1076

Closed
ikedas opened this issue Jun 19, 2017 · 5 comments

ikedas commented Jun 19, 2017

Problem Description

In East Asian text in general, word separators (spaces) are never written explicitly. So

> 前の**文字列**の後

should be rendered as

前の文字列の後

(Image: ex000)

and in practice this works as expected.

However, if the text fragment to be decorated ends and/or starts with punctuation:

> 前の**前の「文字列」**の後、前の**「文字列」の後**の後、そのあと。
> 
> 前の**「文字列」**の後。

they won't render as expected:

前の**前の「文字列」の後、前の「文字列」**の後、そのあと。

前の**「文字列」**の後。

(Image: ex001)

A possible workaround is to insert a space before or after the punctuation ("␣" means space):

> 前の**前の「文字列」**␣の後、前の␣**「文字列」の後**の後、そのあと。
> 
> 前の␣**「文字列」**␣の後。

but that produces ugly text with an extra space before or after the punctuation:

前の前の「文字列」 の後、前の 「文字列」の後の後、そのあと。

前の 「文字列」 の後。

(Image: ex002)

Suggested modification

East Asian punctuation should be treated in the same way as ordinary East Asian characters (Chinese ideographs and so on).
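To make the suggestion concrete, here is a rough sketch of one way such a rule could be expressed. It assumes the renderer decides what counts as punctuation from ASCII punctuation plus the Unicode general categories Pc, Pd, Pe, Pf, Pi, Po and Ps, and it uses the East Asian Width property (W or F) as a stand-in for "East Asian punctuation". Both function names are hypothetical, and this is not how any existing renderer behaves.

```python
import unicodedata

# Assumed baseline: ASCII punctuation plus Unicode P* general categories.
PUNCT_CATEGORIES = {"Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps"}
ASCII_PUNCT = set("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")


def is_punctuation_baseline(ch):
    """Punctuation test under the assumed baseline rule."""
    return ch in ASCII_PUNCT or unicodedata.category(ch) in PUNCT_CATEGORIES


def is_punctuation_relaxed(ch):
    """Hypothetical variant: wide/fullwidth punctuation counts as ordinary text."""
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return False
    return is_punctuation_baseline(ch)


for ch in "「」、。\".":
    print(ch, is_punctuation_baseline(ch), is_punctuation_relaxed(ch))
```

With this variant, 「, 」, 、 and 。 would no longer be treated as punctuation, while ASCII characters such as " and . would be unaffected. Halfwidth forms (East Asian Width H) would still count as punctuation under this particular heuristic.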

FYI: Almost all East Asian punctuation characters are listed here:


kivikakk commented Jun 20, 2017

👋 Thanks for the report. Please note that the github/markup repository's issues are really just for issues regarding the github-markup gem itself, which doesn't have anything to do with Markdown processing. You'd be better off contacting our support team with these kinds of issues in future, because we have lots of support staff but only a couple busy engineers who monitor this repo.

For this issue specifically, the root cause is in the CommonMark specification, which we adhere to. The section of the specification on emphasis states:

A left-flanking delimiter run is a delimiter run that is (a) not followed by Unicode whitespace, and (b) either not followed by a punctuation character, or preceded by Unicode whitespace or a punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.

A right-flanking delimiter run is a delimiter run that is (a) not preceded by Unicode whitespace, and (b) either not preceded by a punctuation character, or followed by Unicode whitespace or a punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.

"punctuation character" is defined as "an ASCII punctuation character or anything in the Unicode classes Pc, Pd, Pe, Pf, Pi, Po, or Ps", and 「 and 」 are in the Ps and Pe categories respectively.
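For anyone who wants to check the categories themselves, a quick sketch with Python's standard `unicodedata` module (the set name `PUNCT_CATEGORIES` is just for illustration):

```python
import unicodedata

# General categories that the quoted definition counts as punctuation.
PUNCT_CATEGORIES = {"Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps"}

for ch in "「」、。":
    cat = unicodedata.category(ch)
    print(f"U+{ord(ch):04X} {ch} category={cat} punctuation={cat in PUNCT_CATEGORIES}")
```

The corner brackets come out as Ps and Pe, and the ideographic comma and full stop as Po, so all four count as punctuation under that definition.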

The problem here is that this definition of punctuation character makes sense in the context of the specification if we assume "Unicode whitespace" is a part of the text used (as with most Latin alphabet-derived languages); we expect to see The cat is called "Nodoka". but not 猫は「のどか」という。, where the latter has no space or punctuation character separating the 「」 from the surrounding text.

Hence, when we add emphasis (e.g. around "Nodoka"), we get: The cat is called **"Nodoka"**. but not 猫は**「のどか」**という。

With the English text, the opening ** satisfies the definition of a "left-flanking delimiter run": it is (a) not followed by Unicode whitespace ("), and (b) preceded by Unicode whitespace. The closing ** satisfies the definition of a "right-flanking delimiter run": it is (a) not preceded by Unicode whitespace ("), and (b) followed by a punctuation character (.).

With the Japanese text, however, the opening ** does not satisfy the definition of a "left-flanking delimiter run": it is (a) not followed by Unicode whitespace (「), but (b) it is followed by a punctuation character, and it is not preceded by Unicode whitespace or a punctuation character (は). Likewise, the closing ** does not satisfy the definition of a "right-flanking delimiter run": it is (a) not preceded by Unicode whitespace (」), but (b) it is preceded by a punctuation character, and it is not followed by Unicode whitespace or punctuation (と).
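The two walkthroughs above can be reproduced with a small sketch that transcribes the quoted definitions directly. This is not the code of any real CommonMark parser. The helper names are made up for illustration, and "Unicode whitespace" is approximated here as category Zs plus tab, line feed, form feed and carriage return, which is an assumption on my part.

```python
import unicodedata

PUNCT_CATEGORIES = {"Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps"}
ASCII_PUNCT = set("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")
ASCII_SPACE = {"\t", "\n", "\f", "\r"}


def is_whitespace(ch):
    # None stands for the beginning or end of the line, which counts as whitespace.
    return ch is None or ch in ASCII_SPACE or unicodedata.category(ch) == "Zs"


def is_punctuation(ch):
    return ch is not None and (
        ch in ASCII_PUNCT or unicodedata.category(ch) in PUNCT_CATEGORIES
    )


def left_flanking(before, after):
    return not is_whitespace(after) and (
        not is_punctuation(after) or is_whitespace(before) or is_punctuation(before)
    )


def right_flanking(before, after):
    return not is_whitespace(before) and (
        not is_punctuation(before) or is_whitespace(after) or is_punctuation(after)
    )


def flanking_of_runs(text, delim="**"):
    """Return (left_flanking, right_flanking) for each occurrence of delim."""
    results = []
    start = text.find(delim)
    while start != -1:
        end = start + len(delim)
        before = text[start - 1] if start > 0 else None
        after = text[end] if end < len(text) else None
        results.append((left_flanking(before, after), right_flanking(before, after)))
        start = text.find(delim, end)
    return results


print(flanking_of_runs('The cat is called **"Nodoka"**.'))
# [(True, False), (True, True)]
print(flanking_of_runs("猫は**「のどか」**という。"))
# [(False, True), (True, False)]
```

In the English sentence the first run is left-flanking and the second is right-flanking, so they can open and close strong emphasis. In the Japanese sentence the first run is not left-flanking and the second is not right-flanking, so the asterisks stay literal.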

In short, this is a deficiency with the CommonMark specification's handling of East Asian text in general, because of the way the specification assumes interaction between punctuation characters and whitespace characters. I'll raise this issue (along with all the above information) in the CommonMark Discussion forum and work toward a solution.

Thanks for your patience and for the report!

ikedas commented Jun 21, 2017

@kivikakk thanks. I'll comment on the new thread.


kivikakk commented Sep 5, 2018

It's been over a year and we still haven't had movement here; pinging the upstream repo now.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the Stale label Dec 11, 2024
@github-actions github-actions bot closed this as not planned Dec 19, 2024