Skip to content

Commit

Permalink
Merge pull request #2111 from doriantaylor/2110-pathname-special-case
Browse files Browse the repository at this point in the history
Pathname special case on parse fixes #2110, #1821
  • Loading branch information
flavorjones authored Nov 13, 2020
2 parents e56de28 + 88479cb commit f7edd3a
Show file tree
Hide file tree
Showing 5 changed files with 80 additions and 12 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -90,6 +90,7 @@ This release ends support for:
* Remove unnecessary array allocations from Node serialization methods [[#1911](https://github.com/sparklemotion/nokogiri/issues/1911)] (Thanks, [@ashmaroli](https://github.com/ashmaroli)!)
* Avoid creation of unnecessary zero-length String objects. [[#1970](https://github.com/sparklemotion/nokogiri/issues/1970)] (Thanks, [@ashmaroli](https://github.com/ashmaroli)!)
* Always compile libxml2 and libxslt with '-O2' [[#2022](https://github.com/sparklemotion/nokogiri/issues/2022), [#2100](https://github.com/sparklemotion/nokogiri/issues/2100)] (Thanks, [@ilyazub](https://github.com/ilyazub)!)
* {HTML,XML}::Document#parse now accept `Pathname` objects. Previously this worked only if the referenced file was less than 4096 bytes long; longer files resulted in undefined behavior because the `read` method would be repeatedly invoked. [[#1821](https://github.com/sparklemotion/nokogiri/issues/1821), [#2110](https://github.com/sparklemotion/nokogiri/issues/2110)] (Thanks, [@doriantaylor](https://github.com/doriantaylor) and [@phokz](https://github.com/phokz)!)
* [JRuby] Lots of code cleanup and performance improvements. [[#1934](https://github.com/sparklemotion/nokogiri/issues/1934)] (Thanks, [@kares](https://github.com/kares)!)
* [JRuby] Clean up deprecated calls into JRuby. [[#2027](https://github.com/sparklemotion/nokogiri/issues/2027)] (Thanks, [@headius](https://github.com/headius)!)

Expand Down
15 changes: 12 additions & 3 deletions lib/nokogiri/html/document.rb
Original file line number Diff line number Diff line change
@@ -1,4 +1,7 @@
# frozen_string_literal: true

require 'pathname'

module Nokogiri
module HTML
class Document < Nokogiri::XML::Document
Expand Down Expand Up @@ -161,19 +164,25 @@ class << self
# Nokogiri::XML::ParseOptions::RECOVER. See the constants in
# Nokogiri::XML::ParseOptions.
def parse string_or_io, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML

options = Nokogiri::XML::ParseOptions.new(options) if Integer === options
# Give the options to the user

yield options if block_given?

url ||= string_or_io.respond_to?(:path) ? string_or_io.path : nil

if string_or_io.respond_to?(:encoding)
unless string_or_io.encoding.name == "ASCII-8BIT"
encoding ||= string_or_io.encoding.name
end
end

if string_or_io.respond_to?(:read)
url ||= string_or_io.respond_to?(:path) ? string_or_io.path : nil
if string_or_io.is_a?(Pathname)
# resolve the Pathname to the file and open it as an IO object, see #2110
string_or_io = string_or_io.expand_path.open
url ||= string_or_io.path
end

unless encoding
# Libxml2's parser has poor support for encoding
# detection. First, it does not recognize the HTML5
Expand Down
24 changes: 17 additions & 7 deletions lib/nokogiri/xml/document.rb
Original file line number Diff line number Diff line change
@@ -1,4 +1,7 @@
# frozen_string_literal: true

require 'pathname'

module Nokogiri
module XML
##
Expand Down Expand Up @@ -44,9 +47,11 @@ class Document < Nokogiri::XML::Node
#
def self.parse string_or_io, url = nil, encoding = nil, options = ParseOptions::DEFAULT_XML
options = Nokogiri::XML::ParseOptions.new(options) if Integer === options
# Give the options to the user

yield options if block_given?

url ||= string_or_io.respond_to?(:path) ? string_or_io.path : nil

if empty_doc?(string_or_io)
if options.strict?
raise Nokogiri::XML::SyntaxError.new("Empty document")
Expand All @@ -56,12 +61,17 @@ def self.parse string_or_io, url = nil, encoding = nil, options = ParseOptions::
end

doc = if string_or_io.respond_to?(:read)
url ||= string_or_io.respond_to?(:path) ? string_or_io.path : nil
read_io(string_or_io, url, encoding, options.to_i)
else
# read_memory pukes on empty docs
read_memory(string_or_io, url, encoding, options.to_i)
end
if string_or_io.is_a?(Pathname)
# resolve the Pathname to the file and open it as an IO object, see #2110
string_or_io = string_or_io.expand_path.open
url ||= string_or_io.path
end

read_io(string_or_io, url, encoding, options.to_i)
else
# read_memory pukes on empty docs
read_memory(string_or_io, url, encoding, options.to_i)
end

# do xinclude processing
doc.do_xinclude(options) if options.xinclude?
Expand Down
26 changes: 25 additions & 1 deletion test/html/test_document.rb
Original file line number Diff line number Diff line change
Expand Up @@ -91,8 +91,9 @@ def test_document_parse_method
end

def test_document_parse_method_with_url
doc = Nokogiri::HTML "<html></html>", "http:/foobar.foobar/", "UTF-8"
doc = Nokogiri::HTML "<html></html>", "http://foobar.example.com/", "UTF-8"
refute_empty doc.to_s, "Document should not be empty"
assert_equal "http://foobar.example.com/", doc.url
end

###
Expand Down Expand Up @@ -627,6 +628,29 @@ def test_parse_can_take_io
html = Nokogiri::HTML(f)
}
assert html.html?
assert_equal HTML_FILE, html.url
end

def test_parse_works_with_an_object_that_responds_to_path
html = String.new("<html><body>hello</body></html>")
def html.path
"/i/should/be/the/document/url"
end

doc = Nokogiri::HTML.parse(html)

assert_equal "/i/should/be/the/document/url", doc.url
end

# issue #1821, #2110
def test_parse_can_take_pathnames
assert(File.size(HTML_FILE) > 4096) # file must be big enough to trip the read callback more than once

doc = Nokogiri::HTML.parse(Pathname.new(HTML_FILE))

# an arbitrary assertion on the structure of the document
assert_equal 166, doc.css("a").length
assert_equal HTML_FILE, doc.url
end

def test_html?
Expand Down
26 changes: 25 additions & 1 deletion test/xml/test_document.rb
Original file line number Diff line number Diff line change
Expand Up @@ -674,7 +674,7 @@ def test_XML_function

def test_url
assert @xml.url
assert_equal XML_FILE, URI.unescape(@xml.url).sub("file:///", "")
assert_equal XML_FILE, @xml.url
end

def test_document_parent
Expand All @@ -695,6 +695,7 @@ def test_parse_can_take_io
xml = Nokogiri::XML(f)
}
assert xml.xml?
assert_equal XML_FILE, xml.url
set = xml.search("//employee")
assert set.length > 0
end
Expand All @@ -719,6 +720,29 @@ def read(*args)
assert_equal "foo", doc.at_css("div").content
end

def test_parse_works_with_an_object_that_responds_to_path
xml = String.new("<root><sub>hello</sub></root>")
def xml.path
"/i/should/be/the/document/url"
end

doc = Nokogiri::XML.parse(xml)

assert_equal "/i/should/be/the/document/url", doc.url
end

# issue #1821, #2110
def test_parse_can_take_pathnames
assert(File.size(XML_ATOM_FILE) > 4096) # file must be big enough to trip the read callback more than once

doc = Nokogiri::XML.parse(Pathname.new(XML_ATOM_FILE))

# an arbitrary assertion on the structure of the document
assert_equal 20, doc.xpath("/xmlns:feed/xmlns:entry/xmlns:author",
"xmlns" => "http://www.w3.org/2005/Atom").length
assert_equal XML_ATOM_FILE, doc.url
end

def test_search_on_empty_documents
doc = Nokogiri::XML::Document.new
ns = doc.search("//foo")
Expand Down

0 comments on commit f7edd3a

Please sign in to comment.