Syntax parse fails with Japanese punctuation (`、`), strong syntax and code syntax #2531

KSR-Yasuda · 2022-07-12T00:55:01Z

Marked version:

v4.0.18

Describe the bug

In the cases below, marked does not parse the syntax correctly.

% cat test.md
* ×: あれ、**`foo`これ**、それ
* ○: あれ、 **`foo`これ**、それ
* ×: あれ、**`foo`これ** 、それ

* ○: あれ、**fooこれ**、それ
* ○: あれ、 **fooこれ**、それ
* ○: あれ、**fooこれ** 、それ

% npx marked --version
4.0.18

% npx marked < test.md
<ul>
<li><p>×: あれ、**<code>foo</code>これ**、それ</p>
</li>
<li><p>○: あれ、 <strong><code>foo</code>これ</strong>、それ</p>
</li>
<li><p>×: あれ、**<code>foo</code>これ** 、それ</p>
</li>
<li><p>○: あれ、<strong>fooこれ</strong>、それ</p>
</li>
<li><p>○: あれ、 <strong>fooこれ</strong>、それ</p>
</li>
<li><p>○: あれ、<strong>fooこれ</strong> 、それ</p>
</li>
</ul>

When Japanese punctuation (、), strong syntax (**), and code syntax (`) are combined,
an extra space is needed for them to be parsed correctly (the first 3 examples).

Without code syntax, however, no extra space is required (the last 3 examples).

So it isn't a syntax parsing problem with CJK symbol characters?

To Reproduce

As above.

Expected behavior

Parse the syntax correctly, as Pandoc does.

% pandoc --version
pandoc.exe 2.18
Compiled with pandoc-types 1.22.2, texmath 0.12.5, skylighting 0.12.3,
citeproc 0.7, ipynb 0.2, hslua 2.2.0
Scripting engine: Lua 5.4
User data directory: C:\Users\yasuda\AppData\Roaming\pandoc
Copyright (C) 2006-2022 John MacFarlane. Web:  https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

% pandoc < test.md
<ul>
<li><p>×: あれ、<strong><code>foo</code>これ</strong>、それ</p></li>
<li><p>○: あれ、 <strong><code>foo</code>これ</strong>、それ</p></li>
<li><p>×: あれ、<strong><code>foo</code>これ</strong> 、それ</p></li>
<li><p>○: あれ、<strong>fooこれ</strong>、それ</p></li>
<li><p>○: あれ、 <strong>fooこれ</strong>、それ</p></li>
<li><p>○: あれ、<strong>fooこれ</strong> 、それ</p></li>
</ul>


UziTech · 2022-07-12T16:09:02Z

Looks like the issue is that 、 is not included as punctuation for the left delimiter.

According to the spec, the punctuation should include:

an ASCII punctuation character or anything in the general Unicode categories Pc, Pd, Pe, Pf, Pi, Po, or Ps.
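
To make the rule above concrete, here is a minimal sketch of the spec's punctuation test together with the left-flanking check for a delimiter run. The names `SPEC_PUNCTUATION`, `isSpecPunctuation`, `isAsciiPunctuation`, `isLineEdgeOrSpace`, and `isLeftFlanking` are made up for illustration and are not marked's internals. The ASCII-only variant is an assumption about what an ASCII-limited check would do, not a copy of the v4.0.18 code.

```javascript
// Illustrative sketch only; none of these names come from marked.

// ASCII punctuation plus the Unicode categories Pc, Pd, Pe, Pf, Pi, Po, Ps.
// \p{P} is the union of exactly those seven categories.
const SPEC_PUNCTUATION = /^[!-\/:-@\[-`{-~\p{P}]$/u;
const ASCII_PUNCTUATION = /^[!-\/:-@\[-`{-~]$/;

// `ch` is a single character, or undefined at the start/end of the line.
const isSpecPunctuation = (ch) => ch !== undefined && SPEC_PUNCTUATION.test(ch);
const isAsciiPunctuation = (ch) => ch !== undefined && ASCII_PUNCTUATION.test(ch);

// The start and end of the line count as whitespace for this rule.
const isLineEdgeOrSpace = (ch) => ch === undefined || /^[\p{Zs}\t\n\f\r]$/u.test(ch);

// A delimiter run is left-flanking if it is not followed by whitespace, and
// either is not followed by punctuation, or is followed by punctuation and
// preceded by whitespace or punctuation.
function isLeftFlanking(before, after, isPunct = isSpecPunctuation) {
  if (isLineEdgeOrSpace(after)) return false;
  if (!isPunct(after)) return true;
  return isLineEdgeOrSpace(before) || isPunct(before);
}

// `、**` followed by a backtick: depends on 、 being seen as punctuation.
console.log(isLeftFlanking('、', '`'));                     // spec punctuation
console.log(isLeftFlanking('、', '`', isAsciiPunctuation)); // ASCII-only check
// `、**` followed by a letter: left-flanking either way.
console.log(isLeftFlanking('、', 'f', isAsciiPunctuation));
```

Under these assumptions, the `**` in `あれ、**` followed by a backtick only opens strong emphasis when 、 counts as punctuation, while the variant followed by a plain letter opens either way. That matches the difference between the first three and last three examples in the report.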

KSR-Yasuda · 2022-07-13T00:49:06Z

So currently only ASCII punctuation is supported, right?

The character 、 (U+3001, Ideographic Comma) is in the Unicode Po category,
so it is a 'Unicode punctuation character'.

Could you support such Unicode punctuation?

KSR-Yasuda · 2022-07-13T00:52:36Z

Also, 　 (U+3000, Ideographic Space) is a 'Unicode whitespace character', since it is in the Zs category.

I think it should also be supported as a space character alongside space (U+0020) and tab (U+0009), if it isn't already.
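
For reference, a small sketch of a whitespace test along the lines of the CommonMark definition (Zs category, plus tab, line feed, form feed, and carriage return). The names `UNICODE_WHITESPACE` and `isUnicodeWhitespace` are hypothetical and do not come from marked.

```javascript
// Illustrative sketch only; these names are not part of marked.

// Zs general category, or tab (U+0009), line feed (U+000A),
// form feed (U+000C), or carriage return (U+000D).
const UNICODE_WHITESPACE = /^[\p{Zs}\t\n\f\r]$/u;

const isUnicodeWhitespace = (ch) => UNICODE_WHITESPACE.test(ch);

console.log(isUnicodeWhitespace('\u3000')); // Ideographic Space
console.log(isUnicodeWhitespace(' '));      // U+0020
console.log(isUnicodeWhitespace('\t'));     // U+0009
console.log(isUnicodeWhitespace('、'));     // Ideographic Comma is punctuation, not whitespace
```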

azmy60 · 2023-05-19T02:41:07Z

Hi @UziTech can I work on this too? This one looks interesting 😀 . I might need to add some tests for Japanese and Chinese text too.

UziTech · 2023-05-19T02:42:52Z

@azmy60 ya you can take any that you think you can help with

azmy60 · 2023-05-20T06:49:15Z

There is an exhaustive collection of Unicode punctuation in CommonMark. Do you think we should add all of it @UziTech ? I'm not really sure how to write the tests though. Adding the Ideographic Comma (as @KSR-Yasuda suggested) to the punctuation list works just fine with his example.

[UPDATE]
There is a Stack Overflow answer listing the punctuation codes. It only goes up to 4 hex digits, since JavaScript's `\uXXXX` escape only supports up to \uFFFF.

Apparently, adding the rest of the Unicode punctuation also fixes #2041, since it includes \uFF01.
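
As a side note on the escape limit, here is a hedged sketch of how a regex with the `u` flag handles both the four-digit case and punctuation beyond \uFFFF. It is not necessarily the approach taken in the fix, and `isAnyUnicodePunct` is a made-up name.

```javascript
// Illustrative sketch only; not taken from marked or the linked answer.

// With the `u` flag, \p{P} covers Pc, Pd, Pe, Pf, Pi, Po, and Ps
// across all planes, not just code points up to U+FFFF.
const isAnyUnicodePunct = (ch) => /^\p{P}$/u.test(ch);

// Fullwidth exclamation mark, expressible with a four-digit escape.
const fullwidthBang = '\uFF01';

// U+10100 AEGEAN WORD SEPARATOR LINE (category Po) needs the
// code point escape form \u{...}; it is stored as a surrogate pair.
const aegeanSeparator = '\u{10100}';

console.log(isAnyUnicodePunct(fullwidthBang));
console.log(aegeanSeparator.length);
console.log(isAnyUnicodePunct(aegeanSeparator));
```

An explicit list of `\uXXXX` ranges avoids depending on Unicode property escapes, at the cost of missing punctuation outside the Basic Multilingual Plane.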

KSR-Yasuda mentioned this issue Jul 12, 2022

[Bug] Fails with Japanese punctuation (、), strong syntax and code syntax volca/markdown-preview#135

Closed

UziTech added the "L2 - annoying" (similar to "L1 - broken", but there is a known workaround available for the issue) and "category: mixed content" labels Jul 12, 2022

UziTech mentioned this issue Mar 28, 2023

The Chinese colon broke sibling parsing #2765

Closed

azmy60 mentioned this issue May 21, 2023

fix: Add Unicode punctuations #2811

Merged

5 tasks

UziTech closed this as completed in #2811 May 30, 2023
