Syntax parse fails with Japanese punctuation (、), strong syntax and code syntax #2531

Closed
KSR-Yasuda opened this issue Jul 12, 2022 · 6 comments · Fixed by #2811
Labels: category: mixed content; L2 - annoying (similar to L1 - broken, but there is a known workaround available for the issue)


@KSR-Yasuda

Marked version:

  • v4.0.18

Describe the bug
A clear and concise description of what the bug is.

Copied from volca/markdown-preview#135.

In the case below, the syntax is not parsed correctly.

% cat test.md
* ×: あれ、**`foo`これ**、それ
* ○: あれ、 **`foo`これ**、それ
* ×: あれ、**`foo`これ** 、それ

* ○: あれ、**fooこれ**、それ
* ○: あれ、 **fooこれ**、それ
* ○: あれ、**fooこれ** 、それ

% npx marked --version
4.0.18

% npx marked < test.md
<ul>
<li><p>×: あれ、**<code>foo</code>これ**、それ</p>
</li>
<li><p>○: あれ、 <strong><code>foo</code>これ</strong>、それ</p>
</li>
<li><p>×: あれ、**<code>foo</code>これ** 、それ</p>
</li>
<li><p>○: あれ、<strong>fooこれ</strong>、それ</p>
</li>
<li><p>○: あれ、 <strong>fooこれ</strong>、それ</p>
</li>
<li><p>○: あれ、<strong>fooこれ</strong> 、それ</p>
</li>
</ul>

When Japanese punctuation (、), strong syntax (**), and code syntax (`) are combined,
an extra space is needed for them to be parsed correctly (the first three examples).

Without code syntax, however, no extra space is required (the last three examples).

So this isn't a general parsing problem with CJK symbol characters, is it?

To Reproduce
Steps to reproduce the behavior:

As above.

Expected behavior
A clear and concise description of what you expected to happen.

Parse the syntax correctly, as Pandoc does.

% pandoc --version
pandoc.exe 2.18
Compiled with pandoc-types 1.22.2, texmath 0.12.5, skylighting 0.12.3,
citeproc 0.7, ipynb 0.2, hslua 2.2.0
Scripting engine: Lua 5.4
User data directory: C:\Users\yasuda\AppData\Roaming\pandoc
Copyright (C) 2006-2022 John MacFarlane. Web:  https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

% pandoc < test.md
<ul>
<li><p>×: あれ、<strong><code>foo</code>これ</strong>、それ</p></li>
<li><p>○: あれ、 <strong><code>foo</code>これ</strong>、それ</p></li>
<li><p>×: あれ、<strong><code>foo</code>これ</strong> 、それ</p></li>
<li><p>○: あれ、<strong>fooこれ</strong>、それ</p></li>
<li><p>○: あれ、 <strong>fooこれ</strong>、それ</p></li>
<li><p>○: あれ、<strong>fooこれ</strong> 、それ</p></li>
</ul>
@UziTech
Member

UziTech commented Jul 12, 2022

It looks like the issue is that 、 is not treated as punctuation when checking the left delimiter.

According to the spec, punctuation should include:

an ASCII punctuation character or anything in the general Unicode categories Pc, Pd, Pe, Pf, Pi, Po, or Ps.
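
The effect on the six examples can be sketched with the CommonMark left-flanking rule. This is a minimal illustration, not marked's actual tokenizer: the helper names `isPunctuation` and `isLeftFlanking` and the `unicodeAware` switch are hypothetical, introduced only to contrast an ASCII-only punctuation check with one that follows the spec. It assumes a JavaScript engine with Unicode property escapes (`\p{...}` with the `u` flag).

```javascript
// ASCII punctuation as listed by CommonMark: ! through /, : through @, [ through `, { through ~
const ASCII_PUNCT = /[!-\/:-@\[-`{-~]/;
// \p{P} covers exactly the general categories Pc, Pd, Pe, Pf, Pi, Po and Ps
const UNICODE_PUNCT = /\p{P}/u;
// Unicode whitespace per CommonMark: category Zs, tab, line feed, form feed, carriage return
const UNICODE_SPACE = /[\p{Zs}\t\n\f\r]/u;

// Hypothetical helper: unicodeAware = false mimics an ASCII-only punctuation check.
function isPunctuation(ch, unicodeAware = true) {
  return ASCII_PUNCT.test(ch) || (unicodeAware && UNICODE_PUNCT.test(ch));
}

// Hypothetical helper: `before` and `after` are the characters around a delimiter
// run such as **; an empty string stands for the start or end of the line.
function isLeftFlanking(before, after, unicodeAware = true) {
  const isSpace = (ch) => ch === "" || UNICODE_SPACE.test(ch);
  if (isSpace(after)) return false;                        // followed by whitespace
  if (!isPunctuation(after, unicodeAware)) return true;    // followed by a letter, e.g. "f"
  // Followed by punctuation (here the backtick): must be preceded by
  // whitespace or punctuation to count as left-flanking.
  return isSpace(before) || isPunctuation(before, unicodeAware);
}

console.log(isLeftFlanking("、", "`", false)); // false: 、**`foo` with ASCII-only punctuation
console.log(isLeftFlanking("、", "`", true));  // true:  、 recognized as Po
console.log(isLeftFlanking(" ", "`", false));  // true:  the space workaround
console.log(isLeftFlanking("、", "f", false)); // true:  no code span, so no problem
```

Under these assumptions, the opening `**` is rejected only when it is followed by a backtick and preceded by a character that the punctuation check does not recognize, which matches the reported pattern.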

@KSR-Yasuda
Author

So currently only ASCII punctuation is supported, right?

The character 、 (U+3001, Ideographic Comma) is in the Unicode Po category,
so it is a 'Unicode punctuation character'.

Could you support such Unicode punctuation as well?

@KSR-Yasuda
Author

Also, U+3000 (Ideographic Space) is a 'Unicode whitespace character', since it is in the Zs category.

I think it should also be supported as a whitespace character, alongside space (U+0020) and tab (U+0009), if it isn't already.
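
As a rough sketch of that check, assuming the CommonMark definition of Unicode whitespace (any code point in category Zs, plus tab, line feed, form feed, and carriage return), the difference from a space-and-tab test looks like this. The function names are hypothetical and do not come from marked.

```javascript
// Hypothetical narrow check: only space (U+0020) and tab (U+0009)
function isSpaceOrTab(ch) {
  return ch === " " || ch === "\t";
}

// Hypothetical spec-style check: category Zs, tab, line feed, form feed, carriage return
function isUnicodeWhitespace(ch) {
  return /[\p{Zs}\t\n\f\r]/u.test(ch);
}

const ideographicSpace = "\u3000";

console.log(isSpaceOrTab(ideographicSpace));        // false
console.log(isUnicodeWhitespace(ideographicSpace)); // true
console.log(isUnicodeWhitespace("、"));             // false: punctuation, not whitespace
```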

@azmy60
Contributor

azmy60 commented May 19, 2023

Hi @UziTech, can I work on this too? This one looks interesting 😀. I might need to add some tests for Japanese and Chinese text too.

@UziTech
Member

UziTech commented May 19, 2023

@azmy60 ya you can take any that you think you can help with

@azmy60
Contributor

azmy60 commented May 20, 2023

There is an exhaustive collection of Unicode punctuation in CommonMark. Do you think we should add all of it, @UziTech? I'm not really sure how to write the tests, though. Adding the Ideographic Comma (as @KSR-Yasuda suggested) to the punctuation list works just fine with his example.

[UPDATE]
There is a Stack Overflow answer listing the punctuation codes. It only goes up to 4 hex digits, since JavaScript's \uXXXX escape only reaches \uFFFF.

Apparently, adding the rest of the Unicode punctuation also fixes #2041, since it includes \uFF01.
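
For comparison, here is a small sketch of the two ways to express such a list in a JavaScript regular expression. It is only an illustration, not the change made in marked: the three-character class and the example character U+10100 (Aegean Word Separator Line, category Po) are choices made for this sketch, and the second pattern assumes an engine that supports the `u` flag with Unicode property escapes.

```javascript
// A hand-written class using four-digit escapes: 、 (U+3001), 。 (U+3002), ！ (U+FF01)
const listedPunct = /[\u3001\u3002\uFF01]/;

// A property escape: \p{P} covers Pc, Pd, Pe, Pf, Pi, Po and Ps, including
// code points above U+FFFF, which a four-digit \uXXXX escape cannot name directly.
const anyPunct = /\p{P}/u;

const astralPunct = "\u{10100}"; // example character outside the Basic Multilingual Plane

console.log(listedPunct.test("、"));        // true
console.log(listedPunct.test(astralPunct)); // false
console.log(anyPunct.test("\uFF01"));       // true
console.log(anyPunct.test(astralPunct));    // true
```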
