Both start and end tags of html / head / body elements can't be omitted #98
Comments
This is correct. In an initial revision of the HTML handling system, the lexer did automatically insert html/head/body tags whenever needed. After thinking about this for a while, I decided to remove this, as ultimately it leads to unexpected behaviour. To explain this: when parsing XML/HTML there are two kinds of inputs, documents and fragments.
Nokogiri supports this distinction in the form of
The problem with this is that it complicates using the library. One has to think "am I parsing a document or a fragment?" every time they want to do something with HTML/XML. This distinction also complicates the lexing phase, as the lexer now has to include extra support based on some sort of flag (e.g.
If one is not aware of (or simply does not expect) the above distinction, this leads to unexpected behaviour. For example, say somebody is parsing the following snippet and wants to remove the
They then serialize the document back to XML and lo and behold they get this:
This is very different compared to just receiving
One of the goals I have is that Oga does not return unexpected output. For example, Oga does not automatically add doctypes (unlike Nokogiri) or XML declarations. For that exact same reason I opted to not automatically add html/body/head tags even if the HTML5 specification says otherwise. I intend to document this choice, but it seems you beat me to it before I could write it down :)
Do you think it makes sense to insert start tags if end tags are present (in situations where it should be done according to the HTML spec)? For example:
Oga.parse_html('</html>')
@abotalov This is currently not possible, and I don't think I'll be adding this any time soon. Oga only tracks the names of opening tags (https://github.com/YorickPeterse/oga/blob/7d9604fd932ac9a5f78e68908390f758e12ed543/lib/oga/xml/lexer.rb#L413 vs https://github.com/YorickPeterse/oga/blob/7d9604fd932ac9a5f78e68908390f758e12ed543/lib/oga/xml/lexer.rb#L479). Changing this would introduce a pretty hefty performance penalty (due to extra string allocations) and I'd rather not do that any time soon. Besides this, I can't really think of any use cases where this would be useful.
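To make the point about tracking only opening tag names more concrete, here is a minimal Ruby sketch. It is a hypothetical toy lexer and not Oga's actual code: the class name `ToyLexer`, its regular expression, and the token format are all assumptions made purely for illustration. It shows that when the name of a closing tag is never read, a stray closing tag such as `</html>` carries no name from which a matching start tag could be created.

```ruby
# Hypothetical toy lexer for illustration only; this is NOT how Oga is
# implemented. It ignores text, attributes, and self-closing tags.
class ToyLexer
  # Matches either a closing tag (its name is deliberately not captured)
  # or an opening tag whose name is captured in group 1.
  TAG = %r{</[^>]*>|<([A-Za-z][A-Za-z0-9]*)[^>]*>}

  def lex(input)
    open_names = [] # names of the elements that are currently open
    tokens = []

    input.scan(TAG) do
      name = Regexp.last_match(1)

      if name
        # Opening tag: remember its name.
        open_names.push(name)
        tokens << [:open, name]
      else
        # Closing tag: its own name is never extracted, so it is assumed
        # to close whichever element was opened last (nil if none).
        tokens << [:close, open_names.pop]
      end
    end

    tokens
  end
end

lexer = ToyLexer.new

p lexer.lex('<p><b>text</b></p>')
# prints [[:open, "p"], [:open, "b"], [:close, "b"], [:close, "p"]]

p lexer.lex('</html>')
# prints [[:close, nil]]
```

In this sketch, supporting implied start tags would require capturing the name of every closing tag as well, which means one extra string per closing tag. That is the kind of additional allocation the comment above refers to.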
Nokogiri was driving me crazy by assuming too much: either adding tags when using full docs or removing them when using fragments.
The HTML5, HTML5.1, WHATWG HTML specs say:
However, that "feature" isn't supported. The html / head / body elements don't seem to be inserted into the DOM by Oga if both start and end tags are omitted.