Additional paragraph when using Markdown in raw HTML #595

Closed
jannschu opened this issue Nov 15, 2017 · 11 comments
Labels
bug Bug report. core Related to the core parser code. someday-maybe Approved low priority request.

Comments

@jannschu

jannschu commented Nov 15, 2017

First, thank you for this implementation! 👍

I provide a minimal working example of some unexpected behavior I encountered:

import markdown
md = markdown.Markdown(extensions=['extra'])
html = md.convert("<div markdown><p>Hello _World!_</p></div>")
print(html)

The output was

<div>
<p><p>Hello <em>World!</em></p></p>
</div>

But I expected

<div>
<p>Hello <em>World!</em></p>
</div>

Edit: adding two line breaks,

html = md.convert("<div markdown>\n\n<p>Hello _World!_</p></div>")

avoids the additional paragraph, but then the Markdown inside the <p> is not converted.

@waylan
Member

waylan commented Nov 15, 2017

Actually this is the correct behavior. The problem is that the Markdown-in-raw-HTML behavior is not well documented.

For example, it may help to understand that a markdown="1" attribute does not apply to the nested <p> element. It only applies to the <div> it is assigned to. Any content inside the <div> is now regular Markdown content and must follow the rules of Markdown.

As a reminder, the rules state:

The only restrictions are that block-level HTML elements — e.g. <div>, <table>, <pre>, <p>, etc. — must be separated from surrounding content by blank lines, and the start and end tags of the block should not be indented with tabs or spaces. Markdown is smart enough not to add extra (unwanted) <p> tags around HTML block-level tags.

That rule is strictly enforced. You only avoid the extra (unwanted) <p> tags if those three conditions are met (block-level element, blank lines and no indentation). Of course, if you add a markdown="1" attribute to a tag, then the rule doesn't apply to that tag, but it still applies to any nested tags.

Therefore, this is what you want as input:

<div markdown="1">

<p  markdown="1">Hello _World!_</p>

</div>
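
The three conditions above (block-level element, blank lines, no indentation) can be sketched as a small standalone check. This is only an illustration and not Python-Markdown's actual implementation: the helper name `starts_standalone_block`, the abbreviated `BLOCK_TAGS` set, and the choice to test only the opening side of the block are all assumptions made for the example.

```python
import re

# Assumed, deliberately partial list of block-level tags.
BLOCK_TAGS = {"div", "table", "pre", "p", "ul", "ol", "blockquote"}

# Matches an opening tag name at the very start of a line.
OPEN_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b")


def starts_standalone_block(lines, i):
    """Return True if lines[i] opens a block-level tag with no indentation
    and is preceded by a blank line (or is the first line)."""
    # match() anchors at column 0, so any leading spaces or tabs fail here.
    match = OPEN_TAG.match(lines[i])
    if match is None or match.group(1).lower() not in BLOCK_TAGS:
        return False
    # The tag must be separated from preceding content by a blank line.
    return i == 0 or lines[i - 1].strip() == ""


# The suggested input: the <p> sits on its own line after a blank line.
suggested = '<div markdown="1">\n\n<p markdown="1">Hello _World!_</p>\n\n</div>'.split("\n")
# A variant without the blank line before <p>.
crowded = '<div markdown="1">\n<p>Hello _World!_</p>\n</div>'.split("\n")

print(starts_standalone_block(suggested, 2))
print(starts_standalone_block(crowded, 1))
```

In the original one-line input the `<p>` does not even start a line, so it could never satisfy the rule under this simplified reading.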

@jannschu
Author

I had already suspected that this might be a user error. Thanks for your answer with details about the background.

@jannschu
Author

jannschu commented Nov 15, 2017

Is the following expected to work?

html = md.convert("""<div markdown="1">

<div markdown="1">

<p markdown="1">Hello _World!_<p>

</div>

</div>""")

I get an IndexError running this.

Edit: This is #584 probably.

@waylan waylan added the bug Bug report. label Nov 15, 2017
@waylan
Member

waylan commented Nov 15, 2017

We have a policy that an error should never be raised when parsing, which makes this a bug regardless of what behavior is expected there. Could you provide the error?

Unfortunately, we don't use a proper HTML parser, but a simplistic set of regex (for historical reasons I won't get into here). For standard Markdown, that is sufficient, but when using the markdown="1" syntax, it gets much more complicated pretty quickly. I don't think we ever envisioned anyone using such deeply nested raw HTML with markdown processing. If you want complicated/deeply nested raw HTML, the assumption is that it will be all raw HTML.

@waylan waylan reopened this Nov 15, 2017
@facelessuser
Collaborator

Is the issue because the opening and closing tags of <p markdown="1">Hello _World!_<p> are not on separate lines?

@jannschu
Author

I just tried it with the tags being on separate lines and I got the same error. The error is

Traceback (most recent call last):
  File "test.py", line 14, in <module>
    </div>""")
  File "/home/jannschu/.venv/lib/python3.6/site-packages/markdown/__init__.py", line 371, in convert
    root = self.parser.parseDocument(self.lines).getroot()
  File "/home/jannschu/.venv/lib/python3.6/site-packages/markdown/blockparser.py", line 65, in parseDocument
    self.parseChunk(self.root, '\n'.join(lines))
  File "/home/jannschu/.venv/lib/python3.6/site-packages/markdown/blockparser.py", line 80, in parseChunk
    self.parseBlocks(parent, text.split('\n\n'))
  File "/home/jannschu/.venv/lib/python3.6/site-packages/markdown/blockparser.py", line 98, in parseBlocks
    if processor.run(parent, blocks) is not False:
  File "/home/jannschu/.venv/lib/python3.6/site-packages/markdown/extensions/extra.py", line 130, in run
    block = self._process_nests(element, block)
  File "/home/jannschu/.venv/lib/python3.6/site-packages/markdown/extensions/extra.py", line 98, in _process_nests
    block[nest_index[-1][1]:], True)                      # nest
  File "/home/jannschu/.venv/lib/python3.6/site-packages/markdown/extensions/extra.py", line 104, in run
    tag = self._tag_data[self.parser.blockprocessors.tag_counter]
IndexError: list index out of range

@facelessuser
Collaborator

Cool. That raw markdown parsing is a bit of a mess. I'll take a look at it and at least figure out why it is failing. I still haven't had time to come up with a final solution on this though.

@waylan
Member

waylan commented Nov 15, 2017

I'm wondering if it makes sense to replace the contents of the RawHtml preprocessor with something like this, which uses the HTMLParser in the standard lib. What I have there is a rough proof-of-concept, but I expect we could get that to work. The concern is how to handle autolinks and invalid raw HTML.
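
As a rough idea of what the standard library parser offers here, the sketch below uses `html.parser.HTMLParser` to find every start tag that carries a `markdown` attribute, along with its line number and nesting depth. It is not the proof-of-concept referred to above and not Python-Markdown code; the class name `MarkdownAttrFinder`, the short `VOID_TAGS` set, and the naive depth counter are assumptions for illustration, and invalid raw HTML would need more care than this.

```python
from html.parser import HTMLParser

# Assumed, partial set of void elements that never get an end tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class MarkdownAttrFinder(HTMLParser):
    """Collect (tag, line, depth) for each start tag with a `markdown` attribute."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a bare `markdown` has value None.
        if any(name == "markdown" for name, _ in attrs):
            line, _offset = self.getpos()
            self.found.append((tag, line, self.depth))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        # Naive: assumes well-formed nesting and never goes below zero.
        self.depth = max(0, self.depth - 1)


source = (
    '<div markdown="1">\n\n'
    "<div markdown>\n\n"
    '<p markdown="1">Hello _World!_</p>\n\n'
    "</div>\n\n"
    "</div>"
)

finder = MarkdownAttrFinder()
finder.feed(source)
finder.close()
print(finder.found)
```

Because the parser reports positions and nesting itself, the caller does not have to rely on blank-line splitting to locate nested tags.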

@facelessuser
Collaborator

Maybe, I haven't played with your implementation on this, but I'd love anything that is easier to navigate than what we currently have for raw processing. It is one of the reasons #585 is still open; I'm dreading digging into that code so I keep procrastinating.

@facelessuser
Collaborator

Just some investigation (based on my experience with raw HTML), this works:

md.convert("""<div markdown="1">
<div markdown="1">
<p markdown="1">

Hello _World!_
</p>
</div>
</div>""")

The parsing of HTML has always been a little janky. It is really sensitive to spacing of the HTML elements and such. Even the markdown content spacing between elements can be weird.

Anyways, I haven't dug deep enough into the parser to find the actual failure yet, but I plan to. To be honest rewriting all of this HTML handling and block handling is going to be key to fixing a lot of issues with Python Markdown.

The current block processors and the HTML processing really need an overhaul as they don't handle things very well. Nested indented code blocks lose new lines when they have multiple consecutive new lines. Raw HTML is kind of funky.

I kind of feel HTML processing should be a block processor. Or maybe a line processor (which doesn't currently exist). Python Markdown sorely needs a way to identify the beginning of a block and be able to process the lines until it knows where the block's end is instead of relying on \n\n to denote a line end. And sometimes, those \n\n that we split on and throw away are useful, like in the case of an indented code block in a list. I feel the \n\n that we break on should always be appended to the previous block so they can be processed if the block desires to. Maybe a line processor is a suitable replacement for block processors. I haven't put much thought into this yet.
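
To make the idea of appending the separators to the previous block concrete, here is a minimal sketch. It is not how Python-Markdown's block parser works; the function name `split_keep_separators` and the sample text are made up for the example. It compares a plain `split('\n\n')` with a split that leaves each run of blank lines attached to the block before it, so a processor could still see how many there were.

```python
import re

# A list item followed by an indented code block containing an extra blank line.
text = "- item\n\n        code 1\n\n\n        code 2\n"

# Plain splitting drops the separators; each block no longer knows what followed it.
plain_blocks = text.split("\n\n")


def split_keep_separators(text):
    """Split on runs of two or more newlines, keeping each run on the preceding block."""
    # The capture group makes re.split return the separators as well:
    # [block, sep, block, sep, ..., block]
    parts = re.split(r"(\n{2,})", text)
    blocks = []
    for i in range(0, len(parts), 2):
        sep = parts[i + 1] if i + 1 < len(parts) else ""
        block = parts[i] + sep
        if block:  # skip the empty tail produced when text ends with a separator
            blocks.append(block)
    return blocks


kept_blocks = split_keep_separators(text)
print(plain_blocks)
print(kept_blocks)
```

With the second form, joining the blocks with an empty string reproduces the input exactly, and the block holding `code 1` still ends with all three newlines.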

Anyways, hopefully this issue will be patchable, but I feel the next iteration needs an overhaul in this area.

@waylan
Member

waylan commented Nov 25, 2017

I kind of feel HTML processing should be a block processor. Or maybe a line processor (which doesn't currently exist).

I agree. In fact, this was part of the original plan for 3.0. However, I just don't have the time to do the work right now and don't expect to be able to any time in the foreseeable future.

@waylan waylan added someday-maybe Approved low priority request. core Related to the core parser code. labels Oct 23, 2018
@waylan waylan closed this as completed in b701c34 Sep 22, 2020