Add block/inline HTML parsing #23

Open

davexunit

At Spritely, we'd really like to be able to embed arbitrary HTML in the Markdown files we use for our Haunt website. This is also a longstanding issue with guile-commonmark, first reported in 2018: #8

The fundamental difficulty, as I understand it, is that since the CommonMark format allows embedding any arbitrary HTML (even garbage), the resulting CommonMark AST does not necessarily reflect the shape of the HTML node tree. So, you cannot directly convert a CommonMark AST to SXML when block/inline HTML nodes are present. You have to serialize to HTML first and then use an HTML to SXML parser.
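To illustrate the mismatch, here is a toy sketch in Python. It is not guile-commonmark code: the node shapes and the names `to_html`, `TreeBuilder`, and `html_to_tree` are made up for the example, and attributes are ignored for brevity. An opening `<div>` and its closing `</div>` separated by blank lines become two sibling HTML blocks in the AST, and the nesting only appears after serializing to HTML text and running an HTML parser over it.

```python
from html import escape
from html.parser import HTMLParser

# Toy AST for this Markdown source (node shapes are hypothetical):
#   <div class="note">
#
#   *hello*
#
#   </div>
# The two HTML blocks and the paragraph are flat siblings.
ast = [
    ("html-block", '<div class="note">'),
    ("paragraph", [("emphasis", "hello")]),
    ("html-block", "</div>"),
]

def render_inline(node):
    kind, text = node
    if kind == "emphasis":
        return "<em>" + escape(text) + "</em>"
    return escape(text)

def to_html(blocks):
    """Serialize the toy AST to HTML text, passing HTML blocks through raw."""
    out = []
    for kind, payload in blocks:
        if kind == "html-block":
            out.append(payload)
        elif kind == "paragraph":
            inner = "".join(render_inline(n) for n in payload)
            out.append("<p>" + inner + "</p>")
    return "\n".join(out)

class TreeBuilder(HTMLParser):
    """Build a nested list tree (tag name first, children after)."""
    def __init__(self):
        super().__init__()
        self.root = ["*TOP*"]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag]  # attributes dropped to keep the sketch short
        self.stack[-1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].append(data)

def html_to_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

html = to_html(ast)
tree = html_to_tree(html)
print(html)
print(tree)
```

The AST has three top-level siblings, while the parsed tree has a single `div` containing the paragraph. That nesting cannot be read off the AST alone, which is why the conversion has to go through HTML text.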

This pull request does the following:

  1. Adds new html-block and inline-html node types in (commonmark node).

  2. Adds support for parsing block and inline HTML to (commonmark blocks) and (commonmark inlines).

  3. Adds support for direct conversion of CommonMark AST to HTML text with a new commonmark->html procedure in a new (commonmark html) module.

  4. For compatibility with existing behavior, HTML nodes are converted to simple text nodes in commonmark->sxml, which means they will be escaped in the output as if they weren't parsed in the first place.

I think item 4 is particularly important because it will allow guile-commonmark to continue to work as it does today, without support for embedded HTML. The new commonmark->html interface will allow users to directly serialize to HTML (which is enough for many use-cases) or use their preferred HTML parser to convert it to SXML, such as guile-lib's (htmlprag) (which is what I'd want to do with Haunt). This avoids adding dependencies to guile-commonmark and punts on the complicated subject of HTML parsing.
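The difference between the two output paths can be sketched with a toy model in Python. This is only an analogy: `render_html` and `render_as_text` are hypothetical names standing in for the behavior described for commonmark->html and commonmark->sxml, not their actual implementations or signatures.

```python
from html import escape

# Toy inline nodes for the Markdown text: press <kbd>Ctrl</kbd>
nodes = [
    ("text", "press "),
    ("inline-html", "<kbd>"),
    ("text", "Ctrl"),
    ("inline-html", "</kbd>"),
]

def render_html(nodes):
    """Direct serialization: HTML nodes pass through raw, text is escaped."""
    return "".join(v if k == "inline-html" else escape(v) for k, v in nodes)

def render_as_text(nodes):
    """Compatibility path: HTML nodes are treated as plain text and escaped."""
    return "".join(escape(v) for _, v in nodes)

print(render_html(nodes))     # press <kbd>Ctrl</kbd>
print(render_as_text(nodes))  # press &lt;kbd&gt;Ctrl&lt;/kbd&gt;
```

The second result is what a reader of the rendered page would see as literal angle brackets, matching the existing behavior where embedded HTML is not interpreted.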

The test suite file I added incorporates all 64 tests of inline/block HTML included in the CommonMark specification. Additionally, I tested that my fork of guile-commonmark can successfully parse all of the existing Spritely blog posts, serialize them to HTML, and then parse them again using html->shtml in (htmlprag).

(The test suite as a whole is not green, though: some tests already fail on master. In any case, this change does not make the situation worse.)
