Floki using the built in parser does not handle the optional closing p tag #395

Open
derek-zhou opened this issue Mar 23, 2022 · 5 comments

@derek-zhou
Contributor

Description

According to the HTML5 spec, the closing </p> tag is optional when the paragraph is immediately followed by another <p> (among other elements), i.e.:

<p>p1
<p>p2

is equivalent to:

<p>p1</p>
<p>p2</p>

However, Floki with the built-in parser does not handle this correctly.

To Reproduce

  • Using Floki v0.32.0
  • Using Elixir v1.12.3
  • Using Erlang OTP v24
  • With this code:
iex> Floki.parse_document("<p>p1<p>p2")
{:ok, [{"p", [], ["p1", {"p", [], ["p2"]}]}]}
iex> Floki.parse_document("<p>p1</p><p>p2</p>")
{:ok, [{"p", [], ["p1"]}, {"p", [], ["p2"]}]}

It looks like Floki fills in the missing </p> only at the end of the document, so the second <p> ends up nested inside the first.

Expected behavior

A <p> element must not contain another <p>; the two paragraphs should be parsed as siblings.
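
Until the parser handles this, the expected shape can be approximated by post-processing the parsed tree. The following is only a minimal sketch under stated assumptions: `PUnnest` and `unnest/1` are hypothetical names, not part of Floki, and the code covers only a <p> nested directly inside another <p>. It does not implement the other implied end tag rules from the spec.

```elixir
defmodule PUnnest do
  @moduledoc "Hypothetical helper, not part of Floki."

  # Walks a list of sibling nodes and lifts any <p> found directly
  # inside another <p> so that it becomes a following sibling.
  def unnest(nodes) when is_list(nodes), do: Enum.flat_map(nodes, &unnest_node/1)

  defp unnest_node({"p", attrs, children}) do
    children = unnest(children)
    # Everything before the first nested <p> stays in this paragraph;
    # the nested <p> and whatever follows it are hoisted one level up.
    {own, hoisted} = Enum.split_while(children, &(not p?(&1)))
    [{"p", attrs, own} | hoisted]
  end

  defp unnest_node({tag, attrs, children}) when is_binary(tag) and is_list(children),
    do: [{tag, attrs, unnest(children)}]

  # Text nodes, comments, doctypes and so on pass through unchanged.
  defp unnest_node(other), do: [other]

  defp p?({"p", _attrs, _children}), do: true
  defp p?(_), do: false
end

# Tree taken from the first parse result in the reproduction steps.
tree = [{"p", [], ["p1", {"p", [], ["p2"]}]}]
IO.inspect(PUnnest.unnest(tree))
```

Applied to the tree from the reproduction, this yields the same structure as the explicitly closed markup.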

derek-zhou added the Bug label Mar 23, 2022

@philss
Owner

philss commented Mar 24, 2022

Yeah, this is a bug :/
It won't be fixed easily because of #37.
But at least we are halfway there: https://github.com/philss/floki/projects/2

@derek-zhou
Contributor Author

Do you mean that the mochiweb-based parser is too fragile to fix, and a brand-new parser is on the way?

@philss
Owner

philss commented Mar 24, 2022

@derek-zhou It's not that it is too fragile, but I think the HTML parsing state machine is too damn complicated to fix when the parser never followed the specs 😅

I plan to finish the built-in parser one day. In the meantime, I suggest you give the html5ever parser a try (https://github.com/philss/floki#using-html5ever-as-the-html-parser). It now comes with precompiled NIFs, so you don't need Rust to use it anymore.

@derek-zhou
Contributor Author

I am not afraid of a little Rust toolchain. However, I need to do some ad-hoc XML parsing in the same application, and I am afraid that the html5ever parser could be too strict about things.

@philss
Owner

philss commented Mar 25, 2022

@derek-zhou I see. You can use both if you need to. Just pass the parser as an option to parse_document.
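
As a rough sketch of what that could look like: the snippet below assumes the `:html_parser` option and the `Floki.HTMLParser.Html5ever` / `Floki.HTMLParser.Mochiweb` module names described in the Floki README, and it assumes the `:html5ever` package has been added to the project's deps. Check the docs for your Floki version before relying on it.

```elixir
html = "<p>p1<p>p2"

# Spec-following parse for real HTML.
# Assumption: the :html5ever package is listed in mix.exs deps.
{:ok, html_tree} =
  Floki.parse_document(html, html_parser: Floki.HTMLParser.Html5ever)

# The built-in (mochiweb-based) parser stays available for more lenient,
# ad-hoc input; it is selected explicitly here for clarity.
{:ok, lenient_tree} =
  Floki.parse_document(html, html_parser: Floki.HTMLParser.Mochiweb)

IO.inspect(Floki.find(html_tree, "p"), label: "html5ever")
IO.inspect(Floki.find(lenient_tree, "p"), label: "built-in")
```

Choosing the parser per call keeps the application-wide default untouched, so only the call sites that need spec-compliant HTML handling have to opt in.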
