Introduce Markdown.AST #1196

eksperimental · 2020-06-26T01:53:59Z

Introduce ExDoc.Markdown.AST and t:Markdown.AST.html/0.
Replaces ExDoc.Formatter.HTML.ast_to_html/1 with ExDoc.Markdown.AST.to_html/2 which is a port of Earmark.Transform
It uses a more recent git version of Earkmark since due to a bug in the library.
It adds Floki to deal with HTML code in Markdown docs.

josevalim · 2020-06-26T08:07:12Z

Thank you @eksperimental, this looks amazing but I am afraid it is nudging us in the wrong direction. Earmark is already capable of parsing HTML, so why do we need to reimplement it here and rely on Floki?

In particular, I would expect markdown |> Earmark.to_html and markdown |> Earmark.as_ast |> simple_ast_to_html to be enough. If it isn't, I would say Earmark.as_ast is not functioning correctly and we should push those fixes upstream.

eksperimental · 2020-06-26T14:14:47Z

The use of Floki was a last resort. Actually this PR was done when I added more HTML tests to realize that we were still not covering them.
Earkmarks has two main limitations:

Currently it does not parse HTML recursively. Parse HTML recursively pragdave/earmark#358
HTML needs to be written in a certain way. Document how HTML parsing works pragdave/earmark#357, https://github.com/pragdave/earmark/pull/360/files

Boths PRs stemmed from reports I have made upstream by working in this PR.
I have reported many others, but still I don't think all cases will be covered unless they use something like Floki.
Floki is exclusively used in exactly one line, in one case, when verbatim: true, and I'm in the talks to see if can get a copy of the original HTML code in the Earmark AST pragdave/earmark#356 (comment)

josevalim · 2020-06-26T14:22:00Z

Right, so I would rather push those concerns to Earmark rather than in ExDoc. It is not our job to solve them.

Perhaps one solution you can adopt in Earmark is to have a copy of floki_mochi_html, properly renamed to earmark_mochi_html. It doesn't bring a whole dependency and it will import only the parts that you need (which is effectively the parser).

eksperimental · 2020-06-26T14:27:08Z

@RobertDober What do you think can be done about this?

RobertDober · 2020-06-26T14:52:38Z

@eksperimental has had a tough time discovering how basic and also broken HTML parsing is in Earmark. Now I believe that I can come up with a pretty good HTML parser in 1.5 but I admit that

Floki already does this and probably better than Earmark will ever
personally I am totally confused how HTML should be parsed in Markdown and therefor try to orient myself on GFM but that is a high bar
I really liked @eksperimental's idea to provide the pure HTML string in the AST, (it would be a great safety net, I mean Earmark is not really professionally maintained, I am very much aware of my shortcomings) and that I could push into 1.4.6 I guess
I'll have a look at https://github.com/philss/floki/blob/master/src/floki_mochi_html.erl and come back to you on that

All that said I am working hard on the 1.4.6 release with the new AST format and most of the gross errors fixed (ETA Sunday 😨 ) do we agree that this is the first priority? If not I can rearrange my time.

Please note too that pragdave/earmark#358 is downwards incompatible and therefore scheduled for 1.5 (ETA????)

josevalim · 2020-06-26T14:56:59Z

My understanding is that the issues described here are "old issues", so I would keep your current priority.

@eksperimental if we close this, we should still merge #1190, right?

eksperimental · 2020-06-26T15:12:50Z

My understanding is that the issues described here are "old issues", so I would keep your current priority.

@eksperimental if we close this, we should still merge #1190, right?

no, never. That PR will render the Epub unusable whenever a < or > is not escaped

wojtekmach · 2020-07-17T16:56:20Z

My understanding is in this issue we're discussing and trying to tackle:

Breaking change in 0.22: HTML entity support #1168 (note: it's already fixed)
Rendering regression in 0.22 #1189
other improvements to html parsing (via floki/floki_mochi/etc)

does that sound about right? @eksperimental could we just tackle 2. for now in this PR and leave 3. for the future?

eksperimental · 2020-07-17T17:05:21Z

We could just leave out the Floki call which is one or two lines, and leave what Earmark gives as is, and merge this PR.

josevalim · 2020-07-17T17:22:33Z

Right, it is not our job to parse html and adding markdown specific logic is going to be counter intuitive to direction we are taking ExDoc which is to support multiple input formats. :)

eksperimental · 2020-07-17T17:28:58Z

Right, it is not our job to parse html and adding markdown specific logic is going to be counter intuitive to direction we are taking ExDoc which is to support multiple input formats. :)

I know, but we have no other options, and we still don't know whether this will be fixed in Earkmark or not

josevalim · 2020-07-17T17:43:00Z

If we have no other options, then we will stick with no other options. We have lived without those so far and I really don't want to be responsible for handling this complexity. :)

- Introduce ExDoc.Markdown.AST and t:Markdown.AST.html/0. - Replaces ExDoc.Formatter.HTML.ast_to_html/1 with ExDoc.Markdown.AST.to_html/2 which is a port of Earmark.Transform - It uses a more recent git version of Earkmark since due to a bug in the library. - It adds Floki to deal with HTML code in Markdown docs. Closes elixir-lang#1168, elixir-lang#1189 Supersedes elixir-lang#1190

eksperimental · 2020-07-17T19:42:23Z

I have rebased to master, meaning now we work with EarmarkParser. I have added some temporary fixes until EarkmarkParser always works with 4-element tuples in the AST.
Floki has been removed and the comments that are not passing due to the limitation in nested HTML code have been commented out.

eksperimental · 2020-07-17T19:49:28Z

If we have no other options, then we will stick with no other options. We have lived without those so far and I really don't want to be responsible for handling this complexity. :)

For the record, this limitation recently came in with the introduction of AST if I'm not mistaken.

I think the easiest solution would be to store the original HTML in the metadata, as I have suggested to @RobertDober and we can inject it back in on our side.

josevalim · 2020-07-17T19:53:52Z

But why do we need the original HTML? Doesn’t markdown say we should parse any markdown inside the HTML too? So returning the original HTML could override the nested parsing.

josevalim · 2020-07-17T19:55:12Z

Btw, thanks for the clarification! :D

eksperimental · 2020-07-17T20:03:53Z

But why do we need the original HTML? Doesn’t markdown say we should parse any markdown inside the HTML too? So returning the original HTML could override the nested parsing.

Earmark is returning a crippled version of the original HTML or something like that (I can't recall exactly). As of when I wrote this PR we couldn't fix this limitation, as we were not able to reconstruct the HTML that we passed to it. I will look later if something changed in EarmarkParser

eksperimental · 2020-07-17T20:24:59Z

Sorry, the crippled version it was a bug that has been fixed. We are getting the HTML string. The issue is that we cannot escape the text within the HTML elements. This PR and also the current version of ExDoc escape the special characters of the whole string

RobertDober · 2020-07-17T20:53:39Z

@josevalim Almost no Markdown parser parses content inside HTML https://babelmark.github.io/?text=%3Cdiv%3E+%0A++-+a%0A++-+b%0A%3C%2Fdiv%3E , nor does EaramrkParser

I try to orient Earmark as much as GFM as possible, GFM does not parse Markdown inside HTML.

josevalim · 2020-07-17T21:03:22Z

Ah, apologies for the confusion then. So storing the original HTML should work, if that’s what we want.

eksperimental · 2020-07-17T21:31:52Z

Sorry, since Earkmark was parsing the first level of the HTML I thought the intention was to build the whole tree.
What we need is to escape the content of those html tags, but not the tags themselves.

josevalim · 2020-07-17T21:33:16Z

Why do we need to escape them? Maybe they shouldn’t be unescaped in the first place?

RobertDober · 2020-07-18T07:24:07Z

@eksperimental parsing the whole tree is planned for 1.5, scheduled for 2020-08-30

and to be clear

parsing the whole tree, leaving all text nodes untransformed using Floki as standard

iex(3)> Floki.parse ("<div> a & b < </div>")         
{"div", [], [" a & b < "]}

eksperimental · 2020-08-12T18:24:02Z

Regardless the parsing of the HTML code which is just two lines and I will remove them later today., I would like to know what is your opinion on this PR

josevalim · 2020-08-12T18:41:28Z

I think it is too early for us to formalize the AST. We also don't really care about Markdown AST per se as we will likely introduce our own so it works with both Markdown and the Erlang docs AST. So I would rather focus on fixing bugs instead of trying to formalize something that will definitely change.

eksperimental force-pushed the markdown_ast branch 2 times, most recently from 414359c to 401ac0f Compare June 26, 2020 02:13

eksperimental added 3 commits July 17, 2020 14:03

Remove Floki

2c7bbef

Comment out test that won't pass due to limitation in Earkmark

d67bb34

eksperimental force-pushed the markdown_ast branch from 401ac0f to 46fd405 Compare July 17, 2020 19:40

Fix code to deal with AST with irregular format in EarkMarkParser

46fd405

eksperimental closed this Sep 3, 2020

eksperimental deleted the markdown_ast branch September 3, 2020 13:57

Introduce Markdown.AST #1196

Introduce Markdown.AST #1196

Uh oh!

Conversation

eksperimental commented Jun 26, 2020

Uh oh!

josevalim commented Jun 26, 2020

Uh oh!

eksperimental commented Jun 26, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

josevalim commented Jun 26, 2020

Uh oh!

eksperimental commented Jun 26, 2020

Uh oh!

RobertDober commented Jun 26, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

josevalim commented Jun 26, 2020

Uh oh!

eksperimental commented Jun 26, 2020

Uh oh!

wojtekmach commented Jul 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eksperimental commented Jul 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

josevalim commented Jul 17, 2020

Uh oh!

eksperimental commented Jul 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

josevalim commented Jul 17, 2020

Uh oh!

eksperimental commented Jul 17, 2020

Uh oh!

eksperimental commented Jul 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

josevalim commented Jul 17, 2020

Uh oh!

josevalim commented Jul 17, 2020

Uh oh!

eksperimental commented Jul 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eksperimental commented Jul 17, 2020

Uh oh!

RobertDober commented Jul 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

josevalim commented Jul 17, 2020

Uh oh!

eksperimental commented Jul 17, 2020

Uh oh!

josevalim commented Jul 17, 2020

Uh oh!

RobertDober commented Jul 18, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eksperimental commented Aug 12, 2020

Uh oh!

josevalim commented Aug 12, 2020

Uh oh!

Uh oh!

eksperimental commented Jun 26, 2020 •

edited

Loading

RobertDober commented Jun 26, 2020 •

edited

Loading

wojtekmach commented Jul 17, 2020 •

edited

Loading

eksperimental commented Jul 17, 2020 •

edited

Loading

eksperimental commented Jul 17, 2020 •

edited

Loading

eksperimental commented Jul 17, 2020 •

edited

Loading

eksperimental commented Jul 17, 2020 •

edited

Loading

RobertDober commented Jul 17, 2020 •

edited

Loading

RobertDober commented Jul 18, 2020 •

edited

Loading