-
-
Notifications
You must be signed in to change notification settings - Fork 905
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[bug] parsing a fragment in-context may yield a different DOM than parsing the same document due to quirks mode handling #2646
Comments
Thanks for opening this issue! Great question. I dug in a bit and here's a code snippet that explains a bit about what's happening, which is that #! /usr/bin/env ruby
require "nokogiri"
tags = "<p>A <table><tr><td>B</table> C"
doc = Nokogiri::HTML5.parse("<div>"+tags)
doc.at_css("div").to_s # => "<div><p>A <table><tbody><tr><td>B</td></tr></tbody></table> C</p></div>"
doc = Nokogiri::HTML5.parse("<div>")
doc.at_css("div").inner_html = tags
doc.at_css("div").to_s # => "<div><p>A </p><table><tbody><tr><td>B</td></tr></tbody></table> C</div>"
# making explicit the calls which implement Node#inner_html=
doc = Nokogiri::HTML5.parse("<div>")
frag = Nokogiri::HTML5::DocumentFragment.new(doc, tags, doc.at_css("div"))
frag.to_s # => "<p>A </p><table><tbody><tr><td>B</td></tr></tbody></table> C"
doc.at_css("div").children = frag
doc.at_css("div").to_s # => "<div><p>A </p><table><tbody><tr><td>B</td></tr></tbody></table> C</div>" I expect that the DOM structure should be identical, whether parsed as a document or as a fragment with a I'll schedule some time to dig into the gumbo fragment parsing. Pinging @stevecheckoway for awareness and advice. |
But the strange thing is that this isn't just about fragment parsing. require "nokogiri"
tags = "<p>A <table><tr><td>B</table> C"
base = Nokogiri::HTML5.fragment("<div>")
frag = Nokogiri::HTML5::DocumentFragment.new(base.document, tags, base.at_css("div"))
frag.to_s #=> "<p>A <table><tbody><tr><td>B</td></tr></tbody></table> C</p>" In this case the result is correct even though we're using fragment parsing. The bug seems to occur only when parsing a fragment WHILE using a document that was previously created via document parsing. |
OK, I've narrowed this down to whether the parser is in "quirks mode" or not. Minimal repro#! /usr/bin/env ruby
require "bundler/inline"
gemfile do
source "https://rubygems.org"
gem "nokogiri", path: "."
end
require "nokogiri"
tags = "<p><table></table>"
doc = Nokogiri::HTML5.parse("<div>"+tags)
doc.at_css("div").to_s # => "<div><p><table></table></p></div>"
doc = Nokogiri::HTML5.parse("<div>")
doc.at_css("div").inner_html = tags
doc.at_css("div").to_s # => "<div><p></p><table></table></div>" DiagnosisIn the C code for // Quirks mode.
VALUE doc = rb_funcall(doc_fragment, rb_intern_const("document"), 0);
VALUE dtd = rb_funcall(doc, internal_subset, 0);
if (NIL_P(dtd)) {
quirks_mode = GUMBO_DOCTYPE_NO_QUIRKS;
} else {
VALUE dtd_name = rb_funcall(dtd, name, 0);
VALUE pubid = rb_funcall(dtd, rb_intern_const("external_id"), 0);
VALUE sysid = rb_funcall(dtd, rb_intern_const("system_id"), 0);
quirks_mode = gumbo_compute_quirks_mode(
NIL_P(dtd_name) ? NULL : StringValueCStr(dtd_name),
NIL_P(pubid) ? NULL : StringValueCStr(pubid),
NIL_P(sysid) ? NULL : StringValueCStr(sysid)
);
} What's happening in this case is that if we parse if (tag_is(token, kStartTag, GUMBO_TAG_TABLE)) {
if (
get_document_node(parser)->v.document.doc_type_quirks_mode
!= GUMBO_DOCTYPE_QUIRKS
) {
maybe_implicitly_close_p_tag(parser, token);
} However, when we parse Possible solutions@stevecheckoway I wanted to ask you about this behavior. Rather than try to re-compute quirks mode during fragment parsing initialization, would it be preferable to persist the quirks mode as an attribute on the This is what I'm suggesting: diff --git a/ext/nokogiri/gumbo.c b/ext/nokogiri/gumbo.c
index c4211d3..5150145 100644
--- a/ext/nokogiri/gumbo.c
+++ b/ext/nokogiri/gumbo.c
@@ -361,6 +361,7 @@ parse_continue(VALUE parse_args)
build_tree(doc, (xmlNodePtr)doc, output->document);
VALUE rdoc = noko_xml_document_wrap(args->klass, doc);
rb_iv_set(rdoc, "@url", args->url_or_frag);
+ rb_iv_set(rdoc, "@quirks_mode", INT2NUM(output->document->v.document.doc_type_quirks_mode));
args->doc = NULL; // The Ruby runtime now owns doc so don't delete it.
add_errors(output, rdoc, args->input, args->url_or_frag);
return rdoc;
@@ -517,7 +518,10 @@ fragment(
// Quirks mode.
VALUE doc = rb_funcall(doc_fragment, rb_intern_const("document"), 0);
VALUE dtd = rb_funcall(doc, internal_subset, 0);
- if (NIL_P(dtd)) {
+ VALUE document_quirks_mode = rb_iv_get(doc, "@quirks_mode");
+ if (!NIL_P(document_quirks_mode)) {
+ quirks_mode = NUM2INT(document_quirks_mode);
+ } else if (NIL_P(dtd)) {
quirks_mode = GUMBO_DOCTYPE_NO_QUIRKS;
} else {
VALUE dtd_name = rb_funcall(dtd, name, 0); The result of this patch is that the fragment "inherits" the quirks mode from the document, and applies it, and so the result is consistent: #! /usr/bin/env ruby
require "bundler/inline"
gemfile do
source "https://rubygems.org"
gem "nokogiri", path: "."
end
require "nokogiri"
tags = "<p><table></table>"
doc = Nokogiri::HTML5.parse("<div>"+tags)
doc.at_css("div").to_s # => "<div><p><table></table></p></div>"
doc = Nokogiri::HTML5.parse("<div>")
doc.at_css("div").inner_html = tags
doc.at_css("div").to_s # => "<div><p><table></table></p></div>" If that seems like the right thing to do, I'm happy to write up some tests and submit a PR. Let me know if you have other ideas? Update: I guess one more idea is simply to have the fragment default to |
coerce
<p>
is different when a context is specified via coerce
<p>
is different when a context is specified via coerce
I'm sorry I didn't respond earlier. I missed this issue entirely. The underlying bug is the See "Anything else" in The "initial" insertion mode.
I think this is exactly the right approach (modulo fixing my bug when there's no
If the context node comes from a real document, then we should do what the standard says and copy the quirks. If there is no real context node, then this is a judgment call but I think we should go with no quirks mode for two reasons.
This is the only place in the parser that I'm aware of that the quirks mode actually matters. (There's a limited quirks mode that never seems to matter.) irb(main):001:0> Nokogiri::HTML5.parse('<!DOCTYPE html><p><table></table>').to_s
=> "<!DOCTYPE html><html><head></head><body><p></p><table></table></body></html>"
irb(main):002:0> Nokogiri::HTML5.parse('<p><table></table>').to_s
=> "<html><head></head><body><p><table></table></p></body></html>" |
@stevecheckoway Thanks for the complete and thoughtful response. I'll open a PR in the next day or so and ask for your review to make sure the implementation matches this description. |
and make sure the behavior of quirks mode matches the discussion at issue #2646.
see PR at #2736 |
and make sure the behavior of quirks mode matches the discussion at issue #2646.
Will be fixed in the v1.14.0 release! |
This more completely implements the guidance originally agreed upon in issue #2646, as there is no "context node com[ing] from a real document" when a tag name is provided.
…ext (#3246) **What problem is this PR intended to solve?** Coming from the discussion at #3203, I wanted to improve the fragment parsing API - as discussed in #2646, parse in no-quirks mode if a context element name is provided (not a context `Node`, just the name) - allow passing `:context` kwarg to `DocumentFragment.new` and `.parse` - deprecate the positional options hash to `.parse` per notes at #3200 **Have you included adequate test coverage?** Included additional coverage for the API changes **Does this change affect the behavior of either the C or the Java implementations?** HTML5 is only in CRuby.
I'm not sure if the title above is correct, but please consider these two ways of parsing a
<table>
within a<p>
:output:
I'm fairly certain the output should not be different. In both cases we're parsing the same tags, and the parent element is a
<div>
. Note thatNokogiri::HTML5.fragment(tags)
produces the same structure asNokogiri::HTML5.parse(tags)
. So this bug/behavior looks specific tocoerce
somehow. But in the code sample above if we replaceHTML5.parse
byHTML5.fragment
, the twoputs doc
produce the same output. So honestly I'm not sure what's going on.The text was updated successfully, but these errors were encountered: