Inconsistent with Commonmark Spec #306

ikatyang · 2017-10-06T16:42:28Z

Thanks for the awesome package! I was able to implement markdown support for prettier using remark-parse (prettier/prettier#2943).

While I was implementing the pretty printer, I found some cases that are parsed incorrectly according to the CommonMark spec 0.25:

(The commonmark option is linked to CommonMark spec 0.25, so I guess it's based on 0.25?)

wooorm · 2017-12-03T13:16:10Z

Yes, that’s unfortunately correct. There’s a few cases where CommonMark differs.

remark-html tests for CommonMark compliance, but skips failing tests. You can run its tests to see the differences.

Over the time CommonMark has been developed, a lot has changed, so more problems arose and I haven’t been able to keep up. Initially, when I added CommonMark support, complete CommonMark compatibility wasn’t a goal (because CommonMark wasn’t all that common).

CommonMark is still in beta (it has no major semver version yet). Semver states:

> Major version zero (0.y.z) is for initial development. Anything may change at any time. The public API should not be considered stable.

I’d like to add 100% CommonMark compat though, but that involves a rewrite of the parser, and that takes a lot of time. In the future I envision supporting just CommonMark with remark-parse, and moving the GitHub extension (now that it has moved to CommonMark as a base) to another project (remark-parse-github, remark-gfm?).

Anyway, I don’t have the bandwidth to do it myself currently, but I’d like to assist anyone who’s interested in attempting it!

geyang · 2018-01-29T00:18:06Z

@wooorm Also, there are quite a few CommonMark decisions that make it hard to treat it as an AST. It would totally make sense to start another standard called "GoodMark" that makes CommonMark more regular and less contextual, and have a standard interpolation with embedded HTML, jsx, LaTeX and other languages.

For example, CommonMark's handling of html tags is quite annoying. It doesn't treat markdown and html as a tree, but as segments of html string that are later strung together and then parsed by the html parser. This means <pre> tags and <div> tags are treated differently, and text in between html tags is sometimes parsed as markdown but sometimes not.
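To make the <pre> versus <div> difference concrete, here is a minimal sketch of two of the HTML block start conditions in the CommonMark spec: condition 1 (raw tags such as `<pre>`, which end at a closing raw tag) and condition 6 (block tags such as `<div>`, which end at a blank line). The function names, the abbreviated tag list, and the line-array input are assumptions made for illustration; this is not how remark or any reference parser is implemented.

```javascript
// Abbreviated subset of the tag names CommonMark lists for start condition 6
// (the spec lists many more; this subset is an assumption for the sketch).
const BLOCK_TAGS = ['address', 'article', 'aside', 'blockquote', 'div', 'p', 'section', 'table', 'ul']

// Decide which HTML block start condition (1 or 6) a line satisfies, or 0 for neither.
function htmlBlockCondition(line) {
  const text = line.replace(/^ {0,3}/, '') // up to three spaces of indentation are allowed
  if (/^<(script|pre|style)(?=[\s>]|$)/i.test(text)) return 1
  const match = /^<\/?([a-z][a-z0-9]*)(?=\s|\/?>|$)/i.exec(text)
  if (match && BLOCK_TAGS.includes(match[1].toLowerCase())) return 6
  return 0
}

// Count how many lines, starting at lines[0], belong to the HTML block.
function htmlBlockLength(lines) {
  if (lines.length === 0) return 0
  const condition = htmlBlockCondition(lines[0])
  if (condition === 0) return 0
  for (let index = 0; index < lines.length; index++) {
    // Condition 1 ends on the first line that contains a closing raw tag.
    if (condition === 1 && /<\/(script|pre|style)>/i.test(lines[index])) return index + 1
    // Condition 6 ends before the first blank line.
    if (condition === 6 && lines[index].trim() === '') return index
  }
  return lines.length // an unclosed block runs to the end of the input
}

const divLines = ['<div>', '*hello*', '', '*world*', '</div>']
const preLines = ['<pre>', '*hello*', '', '*world*', '</pre>']
console.log(htmlBlockLength(divLines)) // 2: "*world*" falls outside the block and is parsed as markdown
console.log(htmlBlockLength(preLines)) // 5: everything up to </pre> stays raw HTML
```

With the same content, the `<div>` block stops at the blank line while the `<pre>` block continues to its closing tag, which is the asymmetry described above.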

mb21 · 2019-09-17T16:58:42Z

+1 for Commonmark compliance!

To counter @episodeyang's comment, since it's got a few upvotes:

> there are quite a few CommonMark decisions that make it hard to treat it as an AST.

uh, no? copying from https://spec.commonmark.org/0.29/#about-this-document

> this document describes how Markdown is to be parsed into an abstract syntax tree

> It would totally make sense to start another standard called "GoodMark" that makes CommonMark more regular and less contextual

well, then it wouldn't have a lot in common with markdown anymore though. I agree that markdown is not the easiest language to parse... so yes, we could all just switch to RST or something, but that's not the point of this discussion.

> have a standard interpolation with embedded HTML, jsx, LaTeX and other languages

There are definitely markdown parsers that have extensions that do that very well, for example https://pandoc.org/MANUAL.html#generic-raw-attribute

> For example, CommonMark's handling of html tags is quite annoying.

I concede that raw HTML inside markdown sometimes parses to surprising and weird results. But that's always been the case with markdown, in all markdown parsers, and the point of CommonMark is exactly that at least different parsers could agree on which weird way. It's quite tricky to come up with a solution that works for most of the markdown out in the wild.

> It doesn't treat markdown and html as a tree, but as segments of html string that are later strung together and then parsed by the html parser.

CommonMark conceptually does parse markdown to a tree (see the quote above about the AST), although implementations may choose to not materialize that tree. And yes, that tree does not include all the HTML elements as specific nodes (otherwise, markdown would have to be a superset of the entire HTML specification), but instead has a node type for raw HTML blocks.

wooorm · 2020-08-22T15:01:49Z

Heya, just wanted to give an update about micromark: it’s sort of a new motor that we’ll soon use in remark to parse markdown. It’s not yet 100% ready but will be relatively soon. The good news is, it fixes this issue! (P.S. see this twitter thread for some more info!)

geyang · 2020-08-27T05:13:36Z

I have switched from physics to machine learning. So hopefully next time we discuss this, I will be training a sequence model that reads the CommonMark spec, and automatically induces this parser :)


wooorm · 2020-10-01T15:17:48Z

Sorry for the wait! I just wanted to share that there’s now a PR that solves this issue: #536.

This is a giant change for remark. It replaces the 5+ year old internals with a new low-level parser: <https://github.com/micromark/micromark>. The old internals have served billions of users well over the years, but markdown has changed over that time. micromark comes with 100% CommonMark (and GFM as an extension) compliance, and (WIP) docs on parsing rules for how to tokenize markdown with a state machine: <https://github.com/micromark/common-markup-state-machine>. micromark, and micromark in remark, is a good base for the future.

`remark-parse` now defers its work to [`micromark`][micromark] and [`mdast-util-from-markdown`][from-markdown]. `micromark` is a new, small, complete, and CommonMark compliant low-level markdown parser. `from-markdown` turns its tokens into the previously (and still) used syntax tree: [mdast][]. Extensions to `remark-parse` work differently: they’re a two-part act. See for example [`micromark-extension-footnote`][micromark-footnote] and [`mdast-util-footnote`][from-markdown-footnote].

* change: `commonmark` is no longer an option, it’s the default
* move: `gfm` is no longer an option, it moved to `remark-gfm`
* remove: `pedantic` is no longer an option, as this legacy and buggy flavor of markdown is no longer widely used
* remove: `blocks` is no longer an option, as it’s no longer suggested to change the internal list of HTML “block” tag names

`remark-stringify` now defers its work to [`mdast-util-to-markdown`][to-markdown]. It’s a new and better serializer with powerful features to ensure serialized markdown represents the syntax tree (mdast), no matter what plugins do. Extensions to it work differently: see for example [`mdast-util-footnote`][to-markdown-footnote].

* change: `commonmark` is no longer an option, it’s the default
* change: `emphasis` now defaults to `*`
* change: `bullet` now defaults to `*`
* move: `gfm` is no longer an option, it moved to `remark-gfm`
* move: `tableCellPadding` moved to `remark-gfm`
* move: `tablePipeAlign` moved to `remark-gfm`
* move: `stringLength` moved to `remark-gfm`
* remove: `pedantic` is no longer an option, as this legacy and buggy flavor of markdown is no longer widely used
* remove: `entities` is no longer an option, as with CommonMark there is almost never a need to use character references (character escapes are preferred)
* new: `quote` lets you prefer single quotes (`'`) over double quotes (`"`) in titles

All of these are for CommonMark compatibility. Most of them are inconsequential.

* **notable**: references (as in, links `[text][id]` and images `![alt][id]`) are no longer present as such in the syntax tree if they don’t have a corresponding definition (`[id]: example.com`). The reason for this is that CommonMark requires `[text *emphasis start][undefined] emphasis end*` to be emphasis.
* **notable**: it is no longer possible to use two blank lines between two lists or a list and indented code. CommonMark prohibits it. For a solution, use an empty comment to end lists (`<!---->`)
* inconsequential: whitespace at the start and end of lines in paragraphs is now ignored
* inconsequential: `<mailto:foobarbaz>` is now correctly parsed, and the scheme is part of the tree
* inconsequential: indented code can now follow a block quote w/o blank line
* inconsequential: trailing indented blank lines after indented code are no longer part of that code
* inconsequential: character references and escapes are no longer present as separate text nodes
* inconsequential: character references which HTML allows but CommonMark doesn’t, such as `&copy` w/o the semicolon, are no longer recognized
* inconsequential: the `indent` field is no longer available on `position`
* fix: multiline setext headings
* fix: lazy lists
* fix: attention (emphasis, strong)
* fix: tabs
* fix: empty alt on images is now present as an empty string
* …plus a ton of other minor previous differences from CommonMark

* get folks to use this and report problems!
* make `remark-gfm`
* start making next branches for plugins
* get types into {from,to}-markdown and use them here

Closes GH-218. Closes GH-306. Closes GH-315. Closes GH-324. Closes GH-398. Closes GH-402. Closes GH-407. Closes GH-439. Closes GH-450. Closes GH-459. Closes GH-493. Closes GH-494. Closes GH-497. Closes GH-504. Closes GH-517. Closes GH-521. Closes GH-523. Closes remarkjs/remark-lint#111.

[micromark]: https://github.com/micromark/micromark
[from-markdown]: https://github.com/syntax-tree/mdast-util-from-markdown
[to-markdown]: https://github.com/syntax-tree/mdast-util-to-markdown
[micromark-footnote]: https://github.com/micromark/micromark-extension-footnote/blob/main/index.js
[to-markdown-footnote]: https://github.com/syntax-tree/mdast-util-footnote/blob/main/to-markdown.js
[from-markdown-footnote]: https://github.com/syntax-tree/mdast-util-footnote/blob/main/from-markdown.js
[mdast]: https://github.com/syntax-tree/mdast
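
The first notable change in the PR description (a `[text][id]` reference only counts as a reference when a matching `[id]: …` definition exists) can be sketched roughly as follows. This is a simplified illustration, not remark's or micromark's implementation: the function names are hypothetical, the regular expressions only handle single-line definitions and full references, and `toLowerCase` stands in for the Unicode case folding the spec asks for.

```javascript
// Normalize a label roughly as CommonMark does: trim it, collapse internal
// whitespace, and fold case (toLowerCase is an approximation).
function normalizeLabel(label) {
  return label.trim().replace(/\s+/g, ' ').toLowerCase()
}

// Collect the labels of definitions such as `[id]: example.com`.
function collectDefinitions(markdown) {
  const defined = new Set()
  const definition = /^ {0,3}\[([^\]]+)\]:\s*\S+/gm
  let match
  while ((match = definition.exec(markdown)) !== null) {
    defined.add(normalizeLabel(match[1]))
  }
  return defined
}

// For each full reference `[text][id]` or `![alt][id]`, report whether a
// matching definition exists, i.e. whether it would stay a reference.
function classifyReferences(markdown) {
  const defined = collectDefinitions(markdown)
  const reference = /!?\[([^\]]*)\]\[([^\]]+)\]/g
  const results = []
  let match
  while ((match = reference.exec(markdown)) !== null) {
    results.push({source: match[0], isReference: defined.has(normalizeLabel(match[2]))})
  }
  return results
}

const sample = '[a][ID] and [b][missing]\n\n[id]: example.com'
console.log(classifyReferences(sample))
```

In this sample, `[a][ID]` has a definition (labels match case-insensitively) while `[b][missing]` does not, so only the former would be kept as a reference.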

wooorm · 2020-10-14T08:56:44Z

This is now released in remark@13.0.0

fk mentioned this issue Nov 14, 2017

[gatsby-remark-responsive-image] Allow per-image responsivity gatsbyjs/gatsby#2609

Closed

wooorm added the 🐛 type/bug, future, and remark-parse labels Dec 3, 2017

tmcw mentioned this issue Dec 6, 2017

Does the dx-spec data model need to inherit from unist/mdast? tmcw/dx-spec#25

Open

a-ignatov-parc mentioned this issue Dec 18, 2017

Feature request: Add a commonmark option remarkjs/react-markdown#129

Closed

This was referenced Jan 29, 2018

Markdown: Excessive escaping of underscore and asterisk prettier/prettier#3836

Closed

Excessive escaping of asterisk prettier/prettier#3837

Open

ikatyang mentioned this issue Apr 24, 2018

markdown bug when converting underscores to asterisks prettier/prettier#4362

Closed

fk mentioned this issue Jun 11, 2018

Add a note to the top of parts 5, 6, and 7 that they make sense only if you start with part 4 and go all the way through. gatsbyjs/gatsby#5612

Closed

wooorm mentioned this issue Oct 21, 2018

Emphasis takes precedence over inline code span #75

Closed

wooorm added the 🙉 open/needs-info label and removed the future label Aug 12, 2019

This was referenced May 7, 2020

Backslash + Special Char = Shifted Highlight davidlday/vscode-languagetool-linter#132

Closed

Escape Character prosegrinder/annotatedtext-remark#26

Closed

Source from Original Text prosegrinder/annotatedtext#15

Closed

ChristianMurphy mentioned this issue Jun 15, 2020

Bold italics adjacent to italics parsed incorrectly #504

Closed

ChristianMurphy mentioned this issue Jul 25, 2020

Strikethrough does not work with a space just inside of it #520

Closed

wooorm mentioned this issue Oct 1, 2020

Change to use micromark #536

Merged

wooorm closed this as completed in #536 Oct 13, 2020

wooorm added the ⛵️ status/released label and removed the 🙉 open/needs-info label Oct 14, 2020

wooorm added the 💪 phase/solved label Aug 4, 2021
