fix: fragment context parsing #2736

Merged 3 commits on Dec 21, 2022
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -52,6 +52,7 @@ We welcome feedback on this API at [#2360](https://github.com/sparklemotion/noko

* `Node#wrap` and `NodeSet#wrap` now also accept a `Node` type argument, which will be `dup`ed for each wrapper. For cases where many nodes are being wrapped, creating a `Node` once using `Document#create_element` and passing that `Node` multiple times is significantly faster than re-parsing markup on each call. [[#2657](https://github.com/sparklemotion/nokogiri/issues/2657)]
* [CRuby] Invocation of custom XPath or CSS handler functions may now use the `nokogiri` namespace prefix. Historically, the JRuby implementation _required_ this namespace but the CRuby implementation did not support it. It's recommended that all XPath and CSS queries use the `nokogiri` namespace going forward. Invocation without the namespace is planned for deprecation in v1.15.0 and removal in a future release. [[#2147](https://github.com/sparklemotion/nokogiri/issues/2147)]
* `HTML5::Document#quirks_mode` and `HTML5::DocumentFragment#quirks_mode` expose the quirks mode used by the parser.


### Improved
@@ -94,6 +95,7 @@ We welcome feedback on this API at [#2360](https://github.com/sparklemotion/noko
* [CRuby] `Nokogiri::HTML5::Document#url` now correctly returns the URL passed to the constructor method. Previously it always returned `nil`. [[#2583](https://github.com/sparklemotion/nokogiri/issues/2583)]
* [CRuby] `HTML5` encoding detection is now case-insensitive with respect to `meta` tag charset declaration. [[#2693](https://github.com/sparklemotion/nokogiri/issues/2693)]
* [CRuby] `HTML5` fragment parsing in context of an annotation-xml node now works. Previously this rarely-used path invoked rb_funcall with incorrect parameters, resulting in an exception, a fatal error, or potentially a segfault. [[#2692](https://github.com/sparklemotion/nokogiri/issues/2692)]
* [CRuby] `HTML5` quirks mode during fragment parsing more closely matches document parsing. [[#2646](https://github.com/sparklemotion/nokogiri/issues/2646)]
* [JRuby] Fixed a bug with adding the same namespace to multiple nodes via `#add_namespace_definition`. [[#1247](https://github.com/sparklemotion/nokogiri/issues/1247)]
* [JRuby] `NodeSet#[]` now raises a TypeError if passed an invalid parameter type. [[#2211](https://github.com/sparklemotion/nokogiri/issues/2211)]

7 changes: 6 additions & 1 deletion ext/nokogiri/gumbo.c
@@ -361,6 +361,7 @@ parse_continue(VALUE parse_args)
build_tree(doc, (xmlNodePtr)doc, output->document);
VALUE rdoc = noko_xml_document_wrap(args->klass, doc);
rb_iv_set(rdoc, "@url", args->url_or_frag);
rb_iv_set(rdoc, "@quirks_mode", INT2NUM(output->document->v.document.doc_type_quirks_mode));
args->doc = NULL; // The Ruby runtime now owns doc so don't delete it.
add_errors(output, rdoc, args->input, args->url_or_frag);
return rdoc;
@@ -517,8 +518,11 @@ fragment(
// Quirks mode.
VALUE doc = rb_funcall(doc_fragment, rb_intern_const("document"), 0);
VALUE dtd = rb_funcall(doc, internal_subset, 0);
- if (NIL_P(dtd)) {
+ VALUE doc_quirks_mode = rb_iv_get(doc, "@quirks_mode");
+ if (NIL_P(ctx) || NIL_P(doc_quirks_mode)) {
quirks_mode = GUMBO_DOCTYPE_NO_QUIRKS;
+ } else if (NIL_P(dtd)) {
+ quirks_mode = GUMBO_DOCTYPE_QUIRKS;
} else {
VALUE dtd_name = rb_funcall(dtd, name, 0);
VALUE pubid = rb_funcall(dtd, rb_intern_const("external_id"), 0);
@@ -565,6 +569,7 @@ fragment_continue(VALUE parse_args)
args->doc = NULL; // The Ruby runtime owns doc so make sure we don't delete it.
xmlNodePtr xml_frag = extract_xml_node(doc_fragment);
build_tree(xml_doc, xml_frag, output->root);
rb_iv_set(doc_fragment, "@quirks_mode", INT2NUM(output->document->v.document.doc_type_quirks_mode));
add_errors(output, doc_fragment, args->input, rb_utf8_str_new_static("#fragment", 9));
return Qnil;
}
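
The branch order that `fragment()` now follows can be restated in plain Ruby. This is a minimal sketch rather than Nokogiri code: `fragment_quirks_mode` and its keyword arguments are hypothetical names, the constants only mirror the values of `HTML5::QuirksMode`, and the final branch, which the diff cuts off, is represented by a placeholder symbol instead of a guess at the doctype logic.

```ruby
# Standalone mirror of two values from Nokogiri::HTML5::QuirksMode (see this PR).
SKETCH_NO_QUIRKS = 0
SKETCH_QUIRKS = 1

# Hypothetical helper modelling the decision order in fragment().
#   context:         the context node, or nil when parsing without a context
#   doc_quirks_mode: the document's @quirks_mode, nil if the parser never ran
#   dtd:             the document's internal subset (doctype), or nil
def fragment_quirks_mode(context:, doc_quirks_mode:, dtd:)
  # No context node, or a document that was never parsed: no-quirks.
  return SKETCH_NO_QUIRKS if context.nil? || doc_quirks_mode.nil?

  # Parsed document without a doctype: quirks.
  return SKETCH_QUIRKS if dtd.nil?

  # Otherwise the C code inspects the doctype's name and identifiers.
  # That logic is not shown in the diff, so it stays a placeholder here.
  :derived_from_doctype
end

# Context node in a parsed document that has no doctype.
p fragment_quirks_mode(context: :div, doc_quirks_mode: SKETCH_QUIRKS, dtd: nil)

# Same document, but no context node.
p fragment_quirks_mode(context: nil, doc_quirks_mode: SKETCH_QUIRKS, dtd: nil)
```

The first two branches correspond to the "in context" and "no context" groups in `test/html5/test_quirks_mode.rb`.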
20 changes: 20 additions & 0 deletions lib/nokogiri/html5/document.rb
@@ -21,6 +21,18 @@

module Nokogiri
module HTML5
# Enum for the HTML5 parser quirks mode values. Values returned by HTML5::Document#quirks_mode
#
# See https://dom.spec.whatwg.org/#concept-document-quirks for more information on HTML5 quirks
# mode.
#
# Since v1.14.0
module QuirksMode
NO_QUIRKS = 0 # The document was parsed in "no-quirks" mode
QUIRKS = 1 # The document was parsed in "quirks" mode
LIMITED_QUIRKS = 2 # The document was parsed in "limited-quirks" mode
end

# Since v1.12.0
#
# 💡 HTML5 functionality is not available when running JRuby.
@@ -29,6 +41,13 @@ class Document < Nokogiri::HTML4::Document
# Document.read_memory
attr_reader :url

# Get the parser's quirks mode value. See HTML5::QuirksMode.
#
# This method returns `nil` if the parser was not invoked (e.g., `Nokogiri::HTML5::Document.new`).
#
# Since v1.14.0
attr_reader :quirks_mode

class << self
# :call-seq:
# parse(input)
@@ -110,6 +129,7 @@ def do_parse(string_or_io, url, encoding, options)
def initialize(*args) # :nodoc:
super
@url = nil
@quirks_mode = nil
end

# :call-seq:
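
Because `quirks_mode` returns a bare integer, or `nil` when the parser was never invoked, a caller may want a readable label for logging. The following is a sketch under stated assumptions: `QuirksModeSketch` is a standalone copy of the values in `HTML5::QuirksMode` so the example runs without Nokogiri, and `quirks_mode_label` is a hypothetical helper, not part of the library.

```ruby
# Standalone copy of the values defined in Nokogiri::HTML5::QuirksMode.
module QuirksModeSketch
  NO_QUIRKS = 0
  QUIRKS = 1
  LIMITED_QUIRKS = 2
end

# Hypothetical helper: turn a quirks_mode value into a human-readable label.
def quirks_mode_label(value)
  case value
  when nil then "parser not invoked"
  when QuirksModeSketch::NO_QUIRKS then "no-quirks"
  when QuirksModeSketch::QUIRKS then "quirks"
  when QuirksModeSketch::LIMITED_QUIRKS then "limited-quirks"
  else raise ArgumentError, "unknown quirks mode: #{value.inspect}"
  end
end

puts quirks_mode_label(QuirksModeSketch::LIMITED_QUIRKS)
puts quirks_mode_label(nil)
```

With Nokogiri loaded, the same `case` could compare against `Nokogiri::HTML5::QuirksMode` directly and take `document.quirks_mode` as its argument.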
7 changes: 7 additions & 0 deletions lib/nokogiri/html5/document_fragment.rb
@@ -28,6 +28,13 @@ class DocumentFragment < Nokogiri::HTML4::DocumentFragment
attr_accessor :document
attr_accessor :errors

# Get the parser's quirks mode value. See HTML5::QuirksMode.
#
# This method returns `nil` if the parser was not invoked (e.g., `Nokogiri::HTML5::DocumentFragment.new(doc)`).
#
# Since v1.14.0
attr_reader :quirks_mode

# Create a document fragment.
def initialize(doc, tags = nil, ctx = nil, options = {})
self.document = doc
119 changes: 119 additions & 0 deletions test/html5/test_quirks_mode.rb
@@ -0,0 +1,119 @@
# encoding: utf-8
# frozen_string_literal: true

require "helper"

describe Nokogiri::HTML5 do
  describe "Document#quirks_mode" do
    let(:document) { Nokogiri::HTML5::Document.parse(html) }

    describe "without parsing anything" do
      it "returns nil" do
        assert_nil(Nokogiri::HTML5::Document.new.quirks_mode)
      end
    end

    describe "on a document with a doctype" do
      let(:html) { "<!DOCTYPE html><p>hello</p>" }

      it "returns NO_QUIRKS" do
        assert_equal(Nokogiri::HTML5::QuirksMode::NO_QUIRKS, document.quirks_mode)
      end
    end

    describe "on a document without a doctype" do
      let(:html) { "<html><p>hello</p>" }

      it "returns QUIRKS" do
        assert_equal(Nokogiri::HTML5::QuirksMode::QUIRKS, document.quirks_mode)
      end
    end
  end

  describe "DocumentFragment#quirks_mode" do
    let(:input) { "<p><table>" }
    let(:no_quirks_output) { "<p></p><table></table>" }
    let(:quirks_output) { "<p><table></table></p>" }

    describe "without parsing anything" do
      let(:fragment) { Nokogiri::HTML5::DocumentFragment.new(Nokogiri::HTML5::Document.new) }

      it "returns nil" do
        assert_nil(fragment.quirks_mode)
      end
    end

    describe "in context" do
      describe "document did not invoke the parser" do
        let(:document) { Nokogiri::HTML5::Document.new }

        it "parses the fragment in no-quirks mode" do
          context_node = document.create_element("div")
          fragment = context_node.fragment(input)

          assert_equal(Nokogiri::HTML5::QuirksMode::NO_QUIRKS, fragment.quirks_mode)
          assert_equal(no_quirks_output, fragment.to_html)
        end
      end

      describe "document has a doctype" do
        let(:document) { Nokogiri::HTML5::Document.parse("<!DOCTYPE html><div>") }

        it "parses the fragment in no-quirks mode" do
          context_node = document.at_css("div")
          fragment = context_node.fragment(input)

          assert_equal(Nokogiri::HTML5::QuirksMode::NO_QUIRKS, fragment.quirks_mode)
          assert_equal(no_quirks_output, fragment.to_html)
        end
      end

      describe "document does not have a doctype" do
        let(:document) { Nokogiri::HTML5::Document.parse("<div>") }

        it "parses the fragment in quirks mode" do
          context_node = document.at_css("div")
          fragment = context_node.fragment(input)

          assert_equal(Nokogiri::HTML5::QuirksMode::QUIRKS, fragment.quirks_mode)
          assert_equal(quirks_output, fragment.to_html)
        end
      end
    end

    describe "no context" do
      describe "document did not invoke the parser" do
        let(:document) { Nokogiri::HTML5::Document.new }

        it "parses the fragment in no-quirks mode" do
          fragment = document.fragment(input)

          assert_equal(Nokogiri::HTML5::QuirksMode::NO_QUIRKS, fragment.quirks_mode)
          assert_equal(no_quirks_output, fragment.to_html)
        end
      end

      describe "document has a doctype" do
        let(:document) { Nokogiri::HTML5::Document.parse("<!DOCTYPE html><div>") }

        it "parses the fragment in no-quirks mode" do
          fragment = document.fragment(input)

          assert_equal(Nokogiri::HTML5::QuirksMode::NO_QUIRKS, fragment.quirks_mode)
          assert_equal(no_quirks_output, fragment.to_html)
        end
      end

      describe "document does not have a doctype" do
        let(:document) { Nokogiri::HTML5::Document.parse("<div>") }

        it "parses the fragment in no-quirks mode" do
          fragment = document.fragment(input)

          assert_equal(Nokogiri::HTML5::QuirksMode::NO_QUIRKS, fragment.quirks_mode)
          assert_equal(no_quirks_output, fragment.to_html)
        end
      end
    end
  end
end if Nokogiri.uses_gumbo?