Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problems parsing HTML code in Markdown file #353

Closed
eksperimental opened this issue Jun 21, 2020 · 9 comments · Fixed by #355
Closed

Problems parsing HTML code in Markdown file #353

eksperimental opened this issue Jun 21, 2020 · 9 comments · Fixed by #355
Assignees

Comments

@eksperimental
Copy link
Contributor

Originally reported here: elixir-lang/ex_doc#1189 (comment)

When parsing HTML code, it wlil leave the closing tag as part of the text,
here's an example.

iex(1)> string = """
...(1)> XXX
...(1)> 
...(1)> <h1 align="center">Extra Page with HTML</h1>
...(1)> 
...(1)> <p align="center"><img src="image.svg"/></p>
...(1)> 
...(1)> Elixir & Erlang
...(1)> 
...(1)> ## Section One
...(1)> 
...(1)> more text
...(1)> 
...(1)> <h1>Second Main Title</h1>
...(1)> 
...(1)> more text
...(1)> """
"XXX\n\n<h1 align=\"center\">Extra Page with HTML</h1>\n\n<p align=\"center\"><img src=\"image.svg\"/></p>\n\nElixir & Erlang\n\n## Section One\n\nmore text\n\n<h1>Second Main Title</h1>\n\nmore text\n"
iex(2)> Earmark.as_ast
as_ast/1    as_ast/2    
iex(2)> Earmark.as_ast(string)
{:ok,
 [
   {"p", [], ["XXX"]},
   {"h1", [{"align", "center"}], ["Extra Page with HTML</h1>"]},
   {"p", [{"align", "center"}], ["<img src=\"image.svg\"/></p>"]},
   {"p", [], ["Elixir & Erlang"]},
   {"h2", [], ["Section One"]},
   {"p", [], ["more text"]},
   {"h1", [], ["Second Main Title</h1>"]},
   {"p", [], ["more text"]}
 ], []}

You can see it in

   {"h1", [{"align", "center"}], ["Extra Page with HTML</h1>"]},
   {"p", [{"align", "center"}], ["<img src=\"image.svg\"/></p>"]},

elixir-lang/ex_doc#1190 is the PR in ExDoc trying to close this bug.
Thank you.

@RobertDober
Copy link
Collaborator

Thanks for the report, bad regression here 🤕

@RobertDober RobertDober self-assigned this Jun 21, 2020
@RobertDober RobertDober added this to the 1.4.6 milestone Jun 21, 2020
@RobertDober RobertDober linked a pull request Jun 21, 2020 that will close this issue
RobertDober added a commit that referenced this issue Jun 21, 2020
RobertDober added a commit that referenced this issue Jun 21, 2020
@eksperimental
Copy link
Contributor Author

Awesome.

@eksperimental
Copy link
Contributor Author

eksperimental commented Jun 21, 2020

hi @RobertDober,
Is there a way to parse recursively all the HTML passed to the string? such as in this example

iex(5)> string = """
...(5)> XXX
...(5)> 
...(5)> <h1 align="center">Extra <span class="bold">Page with HTML</span></h1>
...(5)> 
...(5)> <p align="center"><img src="image.svg"/></p>
...(5)> 
...(5)> Elixir & Erlang
...(5)> 
...(5)> ## Section One
...(5)> 
...(5)> more text
...(5)> * 1
...(5)> * 2
...(5)>   - A
...(5)>   - B
...(5)> * 3
...(5)> 
...(5)> <h1>Second Main Title</h1>
...(5)> 
...(5)> more text
...(5)> """
"XXX\n\n<h1 align=\"center\">Extra <span class=\"bold\">Page with HTML</span></h1>\n\n<p align=\"center\"><img src=\"image.svg\"/></p>\n\nElixir & Erlang\n\n## Section One\n\nmore text\n* 1\n* 2\n  - A\n  - B\n* 3\n\n<h1>Second Main Title</h1>\n\nmore text\n"
iex(6)> Earmark.as_ast(string)
{:ok,
 [
   {"p", [], ["XXX"]},
   {"h1", [{"align", "center"}],
    ["Extra <span class=\"bold\">Page with HTML</span>"],
    %{meta: %{verbatim: true}}},
   {"p", [{"align", "center"}], ["<img src=\"image.svg\"/>"],
    %{meta: %{verbatim: true}}},
   {"p", [], ["Elixir & Erlang"]},
   {"h2", [], ["Section One"]},
   {"p", [], ["more text"]},
   {"ul", [],
    [
      {"li", [], ["1"]},
      {"li", [], ["2", {"ul", [], [{"li", [], ["A"]}, {"li", [], ["B"]}]}]},
      {"li", [], ["3"]}
    ]},
   {"h1", [], ["Second Main Title"], %{meta: %{verbatim: true}}},
   {"p", [], ["more text"]}
 ], []}

I would like to get an HTML tree in {"h1", [{"align", "center"}], ["Extra <span class=\"bold\">Page with HTML</span>"],

Thank you.

@RobertDober
Copy link
Collaborator

RobertDober commented Jun 21, 2020

😨 Everything is possible 😉

But when and how is the question, this looks like quite a massive rewrite to me and I have still some bugs. What bugs (stupid pun) me is that I cannot myself estimate the priorities for ex_doc

So maybe Milestone 1.5, maybe only with a new option which might even allow to not parse HTML at all!!!

Personally I do not like hate, HTML inside markdown, but I am sure you have good use cases, would you care telling me a little bit more please?

@RobertDober
Copy link
Collaborator

Please do so in a new ticket, I prefer that I see who had the idea of an enhancement so I would prefer if you could open a new ticket. Thx

@eksperimental
Copy link
Contributor Author

fearful Everything is possible wink

But when and how is the question, this looks like quite a massive rewrite to me and I have still some bugs. What bugs (stupid pun) me is that I cannot myself estimate the priorities for ex_doc

So maybe Milestone 1.5, maybe only with a new option which might even allow to not parse HTML at all!!!

Personally I do not like hate, HTML inside markdown, but I am sure you have good use cases, would you care telling me a little bit more please?

Well. main case is that before using AST in ExDoc it was supported, so it's a backward compatibility issue. Not a big one since we haven't reached v1.0, but something to have in mind.

The main reason for having it is because we need to escape special chars such as < and >, and the html would end up being escaped instead of being rendered. I was just asking because maybe there was an option in Earmark to allow this, and I wasn't aware of this. But it is a feature that I can Implement in the Earkmark module in ExDoc. I have something in the cook. I will post it here.

I will create a new issue with the feature request. Thank you!

@RobertDober
Copy link
Collaborator

RobertDober commented Jun 21, 2020

Actually that is not really so, the inner html was just parsed as text and rendered as text, do you do something fancy with the text nodes that are just html when rendering them?

I might see another problem here, the missing verbatim: true for oneline tags, which I have fixed, so if you escaped the html in the text nodes because of that missing meta information, which you should I guess, now you should be quite backwards compatible...

Have you actually tested that already, or is that not the problem?

@eksperimental
Copy link
Contributor Author

I might see another problem here, the missing verbatim: true for oneline tags, which I have fixed, so if you escaped the html in the text nodes because of that missing meta information, which you should I guess, now you should be quite backwards compatible...

Have you actually tested that already, or is that not the problem?

The master version of doesnt work with ExDoc, because it expects 3-element tuples, but the verbartim info is in a 4th element.
I left two comments in your commit.

Actually that is not really so, the inner html was just parsed as text and rendered as text, do you do something fancy with the text nodes that are just html when rendering them?
I only want to escape everything that is not HTML code.

@RobertDober
Copy link
Collaborator

RobertDober commented Jun 21, 2020

The master version of doesnt work with ExDoc, because it expects 3-element tuples, but the > verbartim info is in a 4th element.
I left two comments in your commit.

yep I replied there thx, all parsed html tags have this meta information, I forgot it for oneliners

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants