Problems parsing HTML code in Markdown file #353

eksperimental · 2020-06-21T04:08:27Z

Originally reported here: elixir-lang/ex_doc#1189 (comment)

When parsing HTML code, it wlil leave the closing tag as part of the text,
here's an example.

iex(1)> string = """
...(1)> XXX
...(1)> 
...(1)> <h1 align="center">Extra Page with HTML</h1>
...(1)> 
...(1)> <p align="center"><img src="image.svg"/></p>
...(1)> 
...(1)> Elixir & Erlang
...(1)> 
...(1)> ## Section One
...(1)> 
...(1)> more text
...(1)> 
...(1)> <h1>Second Main Title</h1>
...(1)> 
...(1)> more text
...(1)> """
"XXX\n\n<h1 align=\"center\">Extra Page with HTML</h1>\n\n<p align=\"center\"><img src=\"image.svg\"/></p>\n\nElixir & Erlang\n\n## Section One\n\nmore text\n\n<h1>Second Main Title</h1>\n\nmore text\n"
iex(2)> Earmark.as_ast
as_ast/1    as_ast/2    
iex(2)> Earmark.as_ast(string)
{:ok,
 [
   {"p", [], ["XXX"]},
   {"h1", [{"align", "center"}], ["Extra Page with HTML</h1>"]},
   {"p", [{"align", "center"}], ["<img src=\"image.svg\"/></p>"]},
   {"p", [], ["Elixir & Erlang"]},
   {"h2", [], ["Section One"]},
   {"p", [], ["more text"]},
   {"h1", [], ["Second Main Title</h1>"]},
   {"p", [], ["more text"]}
 ], []}

You can see it in

   {"h1", [{"align", "center"}], ["Extra Page with HTML</h1>"]},
   {"p", [{"align", "center"}], ["<img src=\"image.svg\"/></p>"]},

elixir-lang/ex_doc#1190 is the PR in ExDoc trying to close this bug.
Thank you.

The text was updated successfully, but these errors were encountered:

RobertDober · 2020-06-21T08:24:10Z

Thanks for the report, bad regression here 🤕

Refs: #353; Setting up tests WIP [skip-ci]

eksperimental · 2020-06-21T15:40:51Z

Awesome.

eksperimental · 2020-06-21T15:56:42Z

hi @RobertDober,
Is there a way to parse recursively all the HTML passed to the string? such as in this example

iex(5)> string = """
...(5)> XXX
...(5)> 
...(5)> <h1 align="center">Extra <span class="bold">Page with HTML</span></h1>
...(5)> 
...(5)> <p align="center"><img src="image.svg"/></p>
...(5)> 
...(5)> Elixir & Erlang
...(5)> 
...(5)> ## Section One
...(5)> 
...(5)> more text
...(5)> * 1
...(5)> * 2
...(5)>   - A
...(5)>   - B
...(5)> * 3
...(5)> 
...(5)> <h1>Second Main Title</h1>
...(5)> 
...(5)> more text
...(5)> """
"XXX\n\n<h1 align=\"center\">Extra <span class=\"bold\">Page with HTML</span></h1>\n\n<p align=\"center\"><img src=\"image.svg\"/></p>\n\nElixir & Erlang\n\n## Section One\n\nmore text\n* 1\n* 2\n  - A\n  - B\n* 3\n\n<h1>Second Main Title</h1>\n\nmore text\n"
iex(6)> Earmark.as_ast(string)
{:ok,
 [
   {"p", [], ["XXX"]},
   {"h1", [{"align", "center"}],
    ["Extra <span class=\"bold\">Page with HTML</span>"],
    %{meta: %{verbatim: true}}},
   {"p", [{"align", "center"}], ["<img src=\"image.svg\"/>"],
    %{meta: %{verbatim: true}}},
   {"p", [], ["Elixir & Erlang"]},
   {"h2", [], ["Section One"]},
   {"p", [], ["more text"]},
   {"ul", [],
    [
      {"li", [], ["1"]},
      {"li", [], ["2", {"ul", [], [{"li", [], ["A"]}, {"li", [], ["B"]}]}]},
      {"li", [], ["3"]}
    ]},
   {"h1", [], ["Second Main Title"], %{meta: %{verbatim: true}}},
   {"p", [], ["more text"]}
 ], []}

I would like to get an HTML tree in {"h1", [{"align", "center"}], ["Extra <span class=\"bold\">Page with HTML</span>"],

Thank you.

RobertDober · 2020-06-21T17:28:38Z

😨 Everything is possible 😉

But when and how is the question, this looks like quite a massive rewrite to me and I have still some bugs. What bugs (stupid pun) me is that I cannot myself estimate the priorities for ex_doc

So maybe Milestone 1.5, maybe only with a new option which might even allow to not parse HTML at all!!!

Personally I ~~do not like~~ hate, HTML inside markdown, but I am sure you have good use cases, would you care telling me a little bit more please?

RobertDober · 2020-06-21T17:29:51Z

Please do so in a new ticket, I prefer that I see who had the idea of an enhancement so I would prefer if you could open a new ticket. Thx

eksperimental · 2020-06-21T17:36:56Z

fearful Everything is possible wink

But when and how is the question, this looks like quite a massive rewrite to me and I have still some bugs. What bugs (stupid pun) me is that I cannot myself estimate the priorities for ex_doc

So maybe Milestone 1.5, maybe only with a new option which might even allow to not parse HTML at all!!!

Personally I ~~do not like~~ hate, HTML inside markdown, but I am sure you have good use cases, would you care telling me a little bit more please?

Well. main case is that before using AST in ExDoc it was supported, so it's a backward compatibility issue. Not a big one since we haven't reached v1.0, but something to have in mind.

The main reason for having it is because we need to escape special chars such as < and >, and the html would end up being escaped instead of being rendered. I was just asking because maybe there was an option in Earmark to allow this, and I wasn't aware of this. But it is a feature that I can Implement in the Earkmark module in ExDoc. I have something in the cook. I will post it here.

I will create a new issue with the feature request. Thank you!

RobertDober · 2020-06-21T18:39:21Z

Actually that is not really so, the inner html was just parsed as text and rendered as text, do you do something fancy with the text nodes that are just html when rendering them?

I might see another problem here, the missing verbatim: true for oneline tags, which I have fixed, so if you escaped the html in the text nodes because of that missing meta information, which you should I guess, now you should be quite backwards compatible...

Have you actually tested that already, or is that not the problem?

eksperimental · 2020-06-21T18:53:46Z

I might see another problem here, the missing verbatim: true for oneline tags, which I have fixed, so if you escaped the html in the text nodes because of that missing meta information, which you should I guess, now you should be quite backwards compatible...

Have you actually tested that already, or is that not the problem?

The master version of doesnt work with ExDoc, because it expects 3-element tuples, but the verbartim info is in a 4th element.
I left two comments in your commit.

Actually that is not really so, the inner html was just parsed as text and rendered as text, do you do something fancy with the text nodes that are just html when rendering them?
I only want to escape everything that is not HTML code.

RobertDober · 2020-06-21T19:00:45Z

The master version of doesnt work with ExDoc, because it expects 3-element tuples, but the > verbartim info is in a 4th element.
I left two comments in your commit.

yep I replied there thx, all parsed html tags have this meta information, I forgot it for oneliners

eksperimental mentioned this issue Jun 21, 2020

Fix escaping of special HTML chars elixir-lang/ex_doc#1190

Closed

RobertDober self-assigned this Jun 21, 2020

RobertDober added bug Priority: HIGH labels Jun 21, 2020

RobertDober added this to the 1.4.6 milestone Jun 21, 2020

RobertDober added a commit that referenced this issue Jun 21, 2020

Refs: #353; Setting up tests WIP [skip-ci]

37b7d2e

RobertDober linked a pull request Jun 21, 2020 that will close this issue

Refs: #353; Setting up tests WIP [skip-ci] #355

Merged

RobertDober added a commit that referenced this issue Jun 21, 2020

Fixes: #353;

a2a85bc

RobertDober closed this as completed in #355 Jun 21, 2020

RobertDober added a commit that referenced this issue Jun 21, 2020

Merge pull request #355 from pragdave/i353-oneline-tags-broken

49842a5

Refs: #353; Setting up tests WIP [skip-ci]

This was referenced Jun 22, 2020

Parse HTML recursively #358

Open

Document how HTML parsing works #357

Closed

RobertDober mentioned this issue Apr 10, 2021

Parse HTML recursively RobertDober/earmark_parser#7

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Problems parsing HTML code in Markdown file #353

Problems parsing HTML code in Markdown file #353

eksperimental commented Jun 21, 2020

RobertDober commented Jun 21, 2020

eksperimental commented Jun 21, 2020

eksperimental commented Jun 21, 2020 •

edited

Loading

RobertDober commented Jun 21, 2020 •

edited

Loading

RobertDober commented Jun 21, 2020

eksperimental commented Jun 21, 2020

RobertDober commented Jun 21, 2020 •

edited

Loading

eksperimental commented Jun 21, 2020

RobertDober commented Jun 21, 2020 •

edited

Loading

Problems parsing HTML code in Markdown file #353

Problems parsing HTML code in Markdown file #353

Comments

eksperimental commented Jun 21, 2020

RobertDober commented Jun 21, 2020

eksperimental commented Jun 21, 2020

eksperimental commented Jun 21, 2020 • edited Loading

RobertDober commented Jun 21, 2020 • edited Loading

RobertDober commented Jun 21, 2020

eksperimental commented Jun 21, 2020

RobertDober commented Jun 21, 2020 • edited Loading

eksperimental commented Jun 21, 2020

RobertDober commented Jun 21, 2020 • edited Loading

eksperimental commented Jun 21, 2020 •

edited

Loading

RobertDober commented Jun 21, 2020 •

edited

Loading

RobertDober commented Jun 21, 2020 •

edited

Loading

RobertDober commented Jun 21, 2020 •

edited

Loading