Additional paragraph when using Markdown in raw HTML #595

jannschu · 2017-11-15T15:56:02Z

First, thank you for this implementation! 👍

Here is a minimal working example of some unexpected behavior I encountered:

import markdown
md = markdown.Markdown(extensions=['extra'])
html = md.convert("<div markdown><p>Hello _World!_</p></div>")
print(html)

The output was

<div>
<p><p>Hello <em>World!</em></p></p>
</div>

But I expected

<div>
<p>Hello <em>World!</em></p>
</div>

I tried using two line breaks as a workaround (edit: this does not actually work):

html = md.convert("<div markdown>\n\n<p>Hello _World!_</p></div>")

This does not include the additional paragraph, but the Markdown is not converted either.

waylan · 2017-11-15T16:18:25Z

Actually this is the correct behavior. The problem is that the Markdown-in-raw-HTML behavior is not well documented.

For example, it may help to understand that a markdown="1" attribute does not apply to the nested <p> element. It only applies to the <div> it is assigned to. Any content inside the <div> is now regular Markdown content and must follow the rules of Markdown.

As a reminder, the rules state:

The only restrictions are that block-level HTML elements — e.g. <div>, <table>, <pre>, <p>, etc. — must be separated from surrounding content by blank lines, and the start and end tags of the block should not be indented with tabs or spaces. Markdown is smart enough not to add extra (unwanted) <p> tags around HTML block-level tags.

That rule is strictly enforced. You only avoid the extra (unwanted) <p> tags if those three conditions are met (block-level element, blank lines and no indentation). Of course, if you add a markdown="1" attribute to a tag, then the rule doesn't apply to that tag, but it still applies to any nested tags.

Therefore, this is what you want as input:

<div markdown="1">

<p  markdown="1">Hello _World!_</p>

</div>
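
The conditions in the quoted rule can be illustrated with a small standalone check. This is only a sketch of the rule as worded above, not how Python-Markdown itself detects raw HTML: the `BLOCK_TAGS` set and the `standalone_block_tags` helper are made up for illustration, and for brevity the check only looks at the blank line before a tag, not the one after it.

```python
import re

# Illustrative subset of block-level tags (assumption, not the library's list).
BLOCK_TAGS = {"div", "table", "pre", "p"}

# An opening tag at column 0; match() anchors at the start of the line,
# so an indented tag will not match.
OPEN_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b")


def standalone_block_tags(text):
    """Return the block-level tags that start a line, are not indented,
    and are preceded by a blank line (or the start of the input)."""
    lines = text.split("\n")
    found = []
    for i, line in enumerate(lines):
        match = OPEN_TAG.match(line)
        if match is None:
            continue
        tag = match.group(1).lower()
        preceded_by_blank = i == 0 or lines[i - 1].strip() == ""
        if tag in BLOCK_TAGS and preceded_by_blank:
            found.append(tag)
    return found


one_line = "<div markdown><p>Hello _World!_</p></div>"
spaced = '<div markdown="1">\n\n<p markdown="1">Hello _World!_</p>\n\n</div>'

print(standalone_block_tags(one_line))  # ['div']
print(standalone_block_tags(spaced))    # ['div', 'p']
```

In the one-line input the nested `<p>` never starts a line of its own, so under the rule it is treated as ordinary Markdown content inside the `<div>`.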

jannschu · 2017-11-15T16:25:08Z

I had already suspected that this might be a user error. Thanks for your answer with details about the background.

jannschu · 2017-11-15T16:44:40Z

Is the following expected to work?

html = md.convert("""<div markdown="1">

<div markdown="1">

<p markdown="1">Hello _World!_<p>

</div>

</div>""")

I get an IndexError running this.

Edit: This is probably #584.

waylan · 2017-11-15T16:58:48Z

We have a policy that an error should never be raised when parsing, which makes this a bug regardless of what behavior is expected there. Could you provide the error?

Unfortunately, we don't use a proper HTML parser, but a simplistic set of regex (for historical reasons I won't get into here). For standard Markdown, that is sufficient, but when using the markdown="1" syntax, it gets much more complicated pretty quickly. I don't think we ever envisioned anyone using such deeply nested raw HTML with markdown processing. If you want complicated/deeply nested raw HTML, the assumption is that it will be all raw HTML.

facelessuser · 2017-11-15T17:00:17Z

Is the issue because the opening and closing tags of <p markdown="1">Hello _World!_<p> are not on separate lines?

jannschu · 2017-11-15T17:03:39Z

I just tried it with the tags being on separate lines and I got the same error. The error is

Traceback (most recent call last):
  File "test.py", line 14, in <module>
    </div>""")
  File "/home/jannschu/.venv/lib/python3.6/site-packages/markdown/__init__.py", line 371, in convert
    root = self.parser.parseDocument(self.lines).getroot()
  File "/home/jannschu/.venv/lib/python3.6/site-packages/markdown/blockparser.py", line 65, in parseDocument
    self.parseChunk(self.root, '\n'.join(lines))
  File "/home/jannschu/.venv/lib/python3.6/site-packages/markdown/blockparser.py", line 80, in parseChunk
    self.parseBlocks(parent, text.split('\n\n'))
  File "/home/jannschu/.venv/lib/python3.6/site-packages/markdown/blockparser.py", line 98, in parseBlocks
    if processor.run(parent, blocks) is not False:
  File "/home/jannschu/.venv/lib/python3.6/site-packages/markdown/extensions/extra.py", line 130, in run
    block = self._process_nests(element, block)
  File "/home/jannschu/.venv/lib/python3.6/site-packages/markdown/extensions/extra.py", line 98, in _process_nests
    block[nest_index[-1][1]:], True)                      # nest
  File "/home/jannschu/.venv/lib/python3.6/site-packages/markdown/extensions/extra.py", line 104, in run
    tag = self._tag_data[self.parser.blockprocessors.tag_counter]
IndexError: list index out of range

facelessuser · 2017-11-15T17:06:40Z

Cool. That raw markdown parsing is a bit of a mess. I'll take a look at it and at least figure out why it is failing. I still haven't had time to come up with a final solution on this though.

waylan · 2017-11-15T18:25:18Z

I'm wondering if it makes sense to replace the contents of the RawHtml preprocessor with something like this, which uses the HTMLParser in the standard lib. What I have there is a rough proof-of-concept, but I expect we could get that to work. The concern is how to handle autolinks and invalid raw HTML.
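
To give a rough idea of what an `HTMLParser`-based approach could look like, here is a minimal sketch. It is not the proof-of-concept referred to above (that link did not survive here), and the `MarkdownAttrFinder` class and `find_markdown_elements` helper are hypothetical names. It only records which elements carry a `markdown` attribute and how deeply they are nested, and it ignores void elements, autolinks, and other edge cases.

```python
from html.parser import HTMLParser


class MarkdownAttrFinder(HTMLParser):
    """Record each element with a `markdown` attribute and its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.found = []  # entries are (tag, depth, attribute value)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "markdown" in attrs:
            # A bare `markdown` attribute (no value) is reported as None.
            self.found.append((tag, self.depth, attrs["markdown"]))
        self.depth += 1

    def handle_endtag(self, tag):
        # Stray end tags in malformed input must not push the depth below zero.
        self.depth = max(0, self.depth - 1)


def find_markdown_elements(source):
    finder = MarkdownAttrFinder()
    finder.feed(source)
    finder.close()
    return finder.found


nested = (
    '<div markdown="1">\n\n<div markdown="1">\n\n'
    '<p markdown="1">Hello _World!_<p>\n\n</div>\n\n</div>'
)
print(find_markdown_elements(nested))
# [('div', 0, '1'), ('div', 1, '1'), ('p', 2, '1')]
```

Note that the unclosed `<p>` in the nested input from this thread is handled without raising: the parser simply reports tags as it sees them and leaves the balancing to the caller.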

facelessuser · 2017-11-15T18:32:45Z

Maybe. I haven't played with your implementation on this, but I'd love anything that is easier to navigate than what we currently have for raw processing. It is one of the reasons #585 is still open; I'm dreading digging into that code so I keep procrastinating.

facelessuser · 2017-11-25T23:06:24Z

Just some investigation (based on my experience with raw HTML), this works:

md.convert("""<div markdown="1">
<div markdown="1">
<p markdown="1">

Hello _World!_
</p>
</div>
</div>""")

The parsing of HTML has always been a little janky. It is really sensitive to spacing of the HTML elements and such. Even the markdown content spacing between elements can be weird.

Anyways, I haven't dug deep enough into the parser to find the actual failure yet, but I plan to. To be honest rewriting all of this HTML handling and block handling is going to be key to fixing a lot of issues with Python Markdown.

The current block processors and the HTML processing really need an overhaul as they don't handle things very well. Nested indented code blocks lose newlines when they have multiple consecutive newlines. Raw HTML is kind of funky.

I kind of feel HTML processing should be a block processor. Or maybe a line processor (which doesn't currently exist). Python Markdown sorely needs a way to identify the beginning of a block and be able to process the lines until it knows where the block's end is instead of relying on \n\n to denote a line end. And sometimes, those \n\n that we split on and throw away are useful, like in the case of an indented code block in a list. I feel the \n\n that we break on should always be appended to the previous block so they can be processed if the block wants to. Maybe a line processor is a suitable replacement for block processors. I haven't put much thought into this yet.
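
As a rough illustration of the idea of appending the separators to the previous block, compare a plain split on `\n\n` (as in the `text.split('\n\n')` call visible in the traceback above) with a split that keeps each run of blank lines attached to the block before it. The `split_keeping_blank_lines` helper is a hypothetical sketch, not existing Python-Markdown code.

```python
import re

# A list item followed by an indented code block containing an extra blank line.
text = "- item\n\n        code line 1\n\n\n        code line 2"

# Plain split: the separators belong to no block, and the third newline of
# the run ends up as a stray leading "\n" on the following block.
plain = text.split("\n\n")
print(plain)
# ['- item', '        code line 1', '\n        code line 2']


def split_keeping_blank_lines(source):
    """Split into blocks, appending each run of blank lines to the block before it."""
    # The capture group makes re.split return the separators as well:
    # [block, separator, block, separator, ..., block]
    parts = re.split(r"(\n{2,})", source)
    blocks = []
    for i in range(0, len(parts), 2):
        separator = parts[i + 1] if i + 1 < len(parts) else ""
        if parts[i] or separator:  # skip an empty trailing piece
            blocks.append(parts[i] + separator)
    return blocks


kept = split_keeping_blank_lines(text)
print(kept)
# ['- item\n\n', '        code line 1\n\n\n', '        code line 2']
```

With the second form, a block can see exactly how many newlines followed it and decide for itself whether they matter, and concatenating the blocks gives back the original text.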

Anyways, hopefully this issue will be patchable, but I feel the next iteration needs an overhaul in this area.

waylan · 2017-11-25T23:55:21Z

> I kind of feel HTML processing should be a block processor. Or maybe a line processor (which doesn't currently exist).

I agree. In fact, this was part of the original plan for 3.0. However, I just don't have the time to do the work right now and don't expect to be able to any time in the foreseeable future.

jannschu closed this as completed Nov 15, 2017

waylan added the bug label ("Bug report.") Nov 15, 2017

waylan reopened this Nov 15, 2017

waylan added the someday-maybe ("Approved low priority request.") and core ("Related to the core parser code.") labels Oct 23, 2018

waylan mentioned this issue Sep 15, 2020

Refactor HTML Parser #803

Merged

waylan closed this as completed in b701c34 Sep 22, 2020
