Releases: sparklemotion/nokogiri
1.12.0.rc1 / 2021-07-09
Notable Addition: HTML5 Support (CRuby only)
HTML5 support has been added (to CRuby only) by merging Nokogumbo into Nokogiri. The Nokogumbo public API has been preserved, so this functionality is available under the Nokogiri::HTML5 namespace. [#2204]
Please note that HTML5 support is not available for JRuby in this version. However, we feel it is important to think about JRuby and we hope to work on this in the future. If you're interested in helping with HTML5 support on JRuby, please reach out to the maintainers by commenting on issue #2227.
Many thanks to Sam Ruby, Steve Checkoway, and Craig Barnes for creating and maintaining Nokogumbo and supporting the Gumbo HTML5 parser. They're now Nokogiri core contributors with all the powers and privileges pertaining thereto. 🙌
Notable Change: Nokogiri::HTML4 module and namespace
Nokogiri::HTML has been renamed to Nokogiri::HTML4, and Nokogiri::HTML is aliased to preserve backwards-compatibility. Nokogiri::HTML and Nokogiri::HTML4 parse methods still use libxml2's (or NekoHTML's) HTML4 parser in the v1.12 release series.
Take special note that if you rely on the class name of an object in your code, objects will now report a class of Nokogiri::HTML4::Foo where they previously reported Nokogiri::HTML::Foo. Instead of relying on the string returned by Object#class, prefer Class#===, Object#is_a?, or Object#instance_of?.
Future releases of Nokogiri may deprecate HTML methods or otherwise change this behavior, so please start using HTML4 in place of HTML.
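To illustrate why comparing class-name strings breaks under this kind of rename, here is a minimal sketch using stand-in modules. The names Markup, Markup::HTML4, and Markup::HTML4::Document are hypothetical and are not part of Nokogiri; the sketch only assumes that the alias behaves like a plain Ruby constant assignment.

```ruby
# Stand-in for a library that renamed a namespace and kept the old
# name as an alias. All names here are hypothetical, not Nokogiri's.
module Markup
  module HTML4
    class Document; end
  end

  # The old constant now points at the renamed module.
  HTML = HTML4
end

doc = Markup::HTML::Document.new

# The class name reports the new namespace, even when the object was
# created through the old constant.
class_name = doc.class.name

# Fragile: comparing against the old string no longer matches.
fragile_match = (class_name == "Markup::HTML::Document")

# Robust: these checks work through either constant.
robust_is_a     = doc.is_a?(Markup::HTML::Document)
robust_case_eq  = Markup::HTML4::Document === doc
robust_instance = doc.instance_of?(Markup::HTML::Document)

puts class_name
puts "fragile string match: #{fragile_match}"
puts "is_a?: #{robust_is_a}, ===: #{robust_case_eq}, instance_of?: #{robust_instance}"
```

The same pattern should apply to code that currently matches on strings such as "Nokogiri::HTML::Document": switching to is_a? or === keeps it working under both names.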
Added
- [CRuby] Nokogiri::VERSION_INFO["libxslt"]["datetime_enabled"] is a new boolean value which describes whether libxslt (or, more properly, libexslt) has compiled-in datetime support. This is generally going to be true, but some distros ship without this support (e.g., some mingw UCRT-based packages, see msys2/MINGW-packages#8957). See #2272 for more details.
Changed
- Introduce a new constant, Nokogiri::XML::ParseOptions::DEFAULT_XSLT, which adds the libxslt-preferred options of NOENT | DTDLOAD | DTDATTR | NOCDATA to ParseOptions::DEFAULT_XML. Nokogiri.XSLT parses stylesheets using ParseOptions::DEFAULT_XSLT, which should make some edge-case XSL transformations match libxslt's default behavior. [#1940]
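The way DEFAULT_XSLT layers extra options on top of DEFAULT_XML can be sketched with plain bit flags. The Flags module below is a stand-in, not Nokogiri's ParseOptions: its numeric values are assumed to mirror libxml2's xmlParserOption enum, and DEFAULT_XML is assumed to be RECOVER | NONET, which may differ between Nokogiri versions.

```ruby
# Stand-in flag values, assumed to mirror libxml2's xmlParserOption enum.
module Flags
  RECOVER = 1 << 0
  NOENT   = 1 << 1
  DTDLOAD = 1 << 2
  DTDATTR = 1 << 3
  NONET   = 1 << 11
  NOCDATA = 1 << 14

  # Assumption: the XML default is "recover from errors, no network access".
  DEFAULT_XML  = RECOVER | NONET
  # The XSLT default adds the libxslt-preferred options on top.
  DEFAULT_XSLT = DEFAULT_XML | NOENT | DTDLOAD | DTDATTR | NOCDATA
end

# True when every bit of `flag` is set in `options`.
def flag_set?(options, flag)
  (options & flag) == flag
end

# The bits that DEFAULT_XSLT has and DEFAULT_XML lacks.
added = Flags::DEFAULT_XSLT & ~Flags::DEFAULT_XML

puts "NOENT in DEFAULT_XML?  #{flag_set?(Flags::DEFAULT_XML, Flags::NOENT)}"
puts "NOENT in DEFAULT_XSLT? #{flag_set?(Flags::DEFAULT_XSLT, Flags::NOENT)}"
puts "added bits: #{added}"
```

With Nokogiri loaded, the same kind of check could be made against the real constants, for example by masking Nokogiri::XML::ParseOptions::DEFAULT_XSLT with Nokogiri::XML::ParseOptions::NOENT.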
Fixed
- [CRuby] Namespaced attributes are handled properly when their parent node is reparented into another document. Previously, the namespace may have gotten dropped. [#2228]
- [CRuby] Reparented nodes no longer inherit their parent's namespace. Previously, a node without a namespace was forced to adopt its parent's namespace. [#1712]
Improved
- [CRuby] Speed up (slightly) the compile time of packaged libraries libiconv, libxml2, and libxslt by using autoconf's --disable-dependency-tracking option. ("ruby" platform gem only.)
Deprecated
- Deprecating Nokogumbo's Nokogiri::HTML5.get. This method will be removed in a future version of Nokogiri.
Dependencies
- [CRuby] Upgrade mini_portile2 dependency from ~> 2.5.0 to ~> 2.6.1. ("ruby" platform gem only.)
Checksums:
cb38e1023d5e1b6a33a1b5c7659b68ce7c88449eb69430db128d4d53731b1638 gems/nokogiri-1.12.0.rc1.gem
b5e8e912013cc73e78a1817c5b131cdbc3e224dd4c3158063b562f0a89cb9adb gems/nokogiri-1.12.0.rc1-java.gem
598b9ed6b98fea43dfc74dbd0cbe24994a26fb1e3dff1a727ba79392495d40d5 gems/nokogiri-1.12.0.rc1-x64-mingw32.gem
7a11a5d911d98a8ddc6a88e712aae82a953fe291f9bb150d4cfe34539489792a gems/nokogiri-1.12.0.rc1-x86-mingw32.gem
41ace0fcff1901a8d6661cac815fa573d934d9e8280e21e2ec16dd1bd3a6ff7a gems/nokogiri-1.12.0.rc1-x86-linux.gem
5843752b3d989954ace6fee40ba0634c615b8c579f885c70ff067a8fcc62fa69 gems/nokogiri-1.12.0.rc1-x86_64-linux.gem
8e0ecef0dd76a640f4e1ba4dd9b5df8c5ee352ec944ad7f6beedb89c0b49bfcb gems/nokogiri-1.12.0.rc1-arm64-darwin.gem
ae56204ca3d8154c46c9fc55f526ff8a71b9a3f4bc879dca26674f4714d7dff6 gems/nokogiri-1.12.0.rc1-x86_64-darwin.gem
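One way to use the published checksums is to compare them against the SHA-256 digest of a downloaded gem file. This is a minimal sketch using Ruby's standard digest library; the helper name checksum_matches? and the local file path in the usage comment are hypothetical.

```ruby
require "digest"

# Returns true when the file's SHA-256 digest matches the published value.
# The comparison ignores letter case and surrounding whitespace.
def checksum_matches?(path, expected_hex)
  actual = Digest::SHA256.file(path).hexdigest
  actual.casecmp?(expected_hex.strip)
end

# Usage, assuming the gem file was downloaded into the current directory:
#
#   expected = "cb38e1023d5e1b6a33a1b5c7659b68ce7c88449eb69430db128d4d53731b1638"
#   checksum_matches?("nokogiri-1.12.0.rc1.gem", expected)
```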
1.11.7 / 2021-06-02
- [CRuby] Backporting an upstream fix to XPath recursion depth limits which impacted some users of complex XPath queries. This issue is present in libxml 2.9.11 and 2.9.12. [#2257]
Checksums
SHA256:
4976a9c9e796527d51dc6c311b9bd93a0233f6a7962a0f569aa5c782461836ef nokogiri-1.11.7.gem
9d69f57f6c024d86e358a8aef7a273f574721e48a6b2e1426cca007827325413 nokogiri-1.11.7-java.gem
6017dee25feb80292b04554cc1bf8a0a2ede3b6c3daeac811902157bbc6a3bdc nokogiri-1.11.7-x64-mingw32.gem
38892350c1e695eab9bd77483300d681c32a22714d0e2d04d10a4c343b424bdd nokogiri-1.11.7-x86-mingw32.gem
1d15603cd878fa2b710a3ba3028a99d9dd0c14b75711faebf9fb6ff40bac3880 nokogiri-1.11.7-x86-linux.gem
7ad9741e7a2fee1ffb4a4b2e20b00e87992c9efd969f557ca3b83fb2653b9bfc nokogiri-1.11.7-x86_64-linux.gem
c93d66d9413ea7c37d30f95e2c54606fec638e556d454e08124d9a33b7fa82c8 nokogiri-1.11.7-arm64-darwin.gem
8761d9c7baacb26546869ed56dbc78d3eb3cabf49b85d91b1cd827cd6e94fb25 nokogiri-1.11.7-x86_64-darwin.gem
1.11.6 / 2021-05-26
Fixed
- [CRuby] DocumentFragment#path now does proper error-checking to handle behavior introduced in libxml > 2.9.10. In v1.11.4 and v1.11.5, calling DocumentFragment#path could result in a segfault.
1.11.5 / 2021-05-19
Fixed
[Windows CRuby] Work around segfault at process exit on Windows when using libxml2 system DLLs.
libxml 2.9.12 introduced new behavior to avoid memory leaks when unloading libxml2 shared libraries (see libxml/!66). Early testing caught this segfault on non-Windows platforms (see #2059 and libxml@956534e) but it was incompletely fixed and is still an issue on Windows platforms that are using system DLLs.
We work around this by configuring libxml2 in this situation to use its default memory management functions. Note that if Nokogiri is not on Windows, or is not using shared system libraries, it will continue to configure libxml2 to use Ruby's memory management functions. Nokogiri::VERSION_INFO["libxml"]["memory_management"] will allow you to verify when the default memory management functions are being used. [#2241]
Added
Nokogiri::VERSION_INFO["libxml"] now contains the key "memory_management" to declare whether libxml2 is using its default memory management functions, or whether it uses the memory management functions from ruby. See above for more details.
1.11.4 / 2021-05-14
Security
[CRuby] Vendored libxml2 upgraded to v2.9.12 which addresses:
Note that two additional CVEs were addressed upstream but are not relevant to this release. CVE-2021-3516 via xmllint is not present in Nokogiri, and CVE-2020-7595 has been patched in Nokogiri since v1.10.8 (see #1992).
Please see nokogiri/GHSA-7rrm-v45f-jp64 or #2233 for a more complete analysis of these CVEs and patches.
Dependencies
- [CRuby] vendored libxml2 is updated from 2.9.10 to 2.9.12. (Note that 2.9.11 was skipped because it was superseded by 2.9.12 a few hours after its release.)
1.11.3 / 2021-04-07
Fixed
- [CRuby] Passing non-Node objects to Document#root= now raises an ArgumentError exception. Previously this likely segfaulted. [#1900]
- [JRuby] Passing non-Node objects to Document#root= now raises an ArgumentError exception. Previously this raised a TypeError exception.
- [CRuby] arm64/aarch64 systems (like Apple's M1) can now compile libxml2 and libxslt from source (though we continue to strongly advise users to install the native gems for the best possible experience)
1.11.2 / 2021-03-11
Fixed
- [CRuby] NodeSet may now safely contain Node objects from multiple documents. Previously the GC lifecycle of the parent Document objects could lead to nodes being GCed while still in scope. [#1952]
- [CRuby] Patch libxml2 to avoid "huge input lookup" errors on large CDATA elements. (See upstream GNOME/libxml2#200 and GNOME/libxml2!100.) [#2132]
- [CRuby+Windows] Enable Nokogumbo (and other downstream gems) to compile and link against nokogiri.so by including LDFLAGS in Nokogiri::VERSION_INFO. [#2167]
- [CRuby] {XML,HTML}::Document.parse now invokes #initialize exactly once. Previously #initialize was invoked twice on each object.
- [JRuby] {XML,HTML}::Document.parse now invokes #initialize exactly once. Previously #initialize was not called, which was a problem for subclassing such as done by Loofah.
Improved
- Reduce the number of object allocations needed when parsing an HTML::DocumentFragment. [#2087] (Thanks, @ashmaroli!)
- [JRuby] Update the algorithm used to calculate Node#line to be wrong less-often. The underlying parser, Xerces, does not track line numbers, and so we've always used a hacky solution for this method. [#1223, #2177]
- Introduce --enable-system-libraries and --disable-system-libraries flags to extconf.rb. These flags provide the same functionality as --use-system-libraries and the NOKOGIRI_USE_SYSTEM_LIBRARIES environment variable, but are more idiomatic. [#2193] (Thanks, @eregon!)
- [TruffleRuby] --disable-static is now the default on TruffleRuby when the packaged libraries are used. This is more flexible and compiles faster. (Note, though, that the default on TR is still to use system libraries.) [#2191, #2193] (Thanks, @eregon!)
Changed
Nokogiri::XML::Path is now a Module (previously it was a Class). It has been acting solely as a Module since v1.0.0. See 8461c74.
v1.11.1 / 2021-01-06
Fixed
- [CRuby] If libxml-ruby is loaded before nokogiri, the SAX and Push parsers no longer call libxml-ruby's handlers. Instead, they defensively override the libxml2 global handler before parsing. [#2168]
SHA-256 Checksums of published gems
a41091292992cb99be1b53927e1de4abe5912742ded956b0ba3383ce4f29711c nokogiri-1.11.1-arm64-darwin.gem
d44fccb8475394eb71f29dfa7bb3ac32ee50795972c4557ffe54122ce486479d nokogiri-1.11.1-java.gem
f760285e3db732ee0d6e06370f89407f656d5181a55329271760e82658b4c3fc nokogiri-1.11.1-x64-mingw32.gem
dd48343bc4628936d371ba7256c4f74513b6fa642e553ad7401ce0d9b8d26e1f nokogiri-1.11.1-x86-linux.gem
7f49138821d714fe2c5d040dda4af24199ae207960bf6aad4a61483f896bb046 nokogiri-1.11.1-x86-mingw32.gem
5c26111f7f26831508cc5234e273afd93f43fbbfd0dcae5394490038b88d28e7 nokogiri-1.11.1-x86_64-darwin.gem
c3617c0680af1dd9fda5c0fd7d72a0da68b422c0c0b4cebcd7c45ff5082ea6d2 nokogiri-1.11.1-x86_64-linux.gem
42c2a54dd3ef03ef2543177bee3b5308313214e99f0d1aa85f984324329e5caa nokogiri-1.11.1.gem
v1.11.0 / 2021-01-03
Notes
Faster, more reliable installation: Native Gems for Linux and OSX/Darwin
"Native gems" contain pre-compiled libraries for a specific machine architecture. On supported platforms, this removes the need for compiling the C extension and the packaged libraries. This results in much faster installation and more reliable installation, which as you probably know are the biggest headaches for Nokogiri users.
We've been shipping native Windows gems since 2009, but starting in v1.11.0 we are also shipping native gems for these platforms:
- Linux: x86-linux and x86_64-linux -- including musl platforms like alpine
- OSX/Darwin: x86_64-darwin and arm64-darwin
We'd appreciate your thoughts and feedback on this work at #2075.
Dependencies
Ruby
This release introduces support for Ruby 2.7 and 3.0 in the precompiled native gems.
This release ends support for:
- Ruby 2.3, for which official support ended on 2019-03-31 [#1886] (Thanks @ashmaroli!)
- Ruby 2.4, for which official support ended on 2020-04-05
- JRuby 9.1, which is the Ruby 2.3-compatible release.
Gems
- Explicitly add racc as a runtime dependency. [#1988] (Thanks, @voxik!)
- [MRI] Upgrade mini_portile2 dependency from ~> 2.4.0 to ~> 2.5.0 [#2005] (Thanks, @alejandroperea!)
Security
See note below about CVE-2020-26247 in the "Changed" subsection entitled "XML::Schema parsing treats input as untrusted by default".
Added
- Add Node methods for manipulating "keyword attributes" (for example, class and rel): #kwattr_values, #kwattr_add, #kwattr_append, and #kwattr_remove. [#2000]
- Add support for CSS queries a:has(> b), a:has(~ b), and a:has(+ b). [#688] (Thanks, @jonathanhefner!)
- Add Node#value? to better match expected semantics of a Hash-like object. [#1838, #1840] (Thanks, @MatzFan!)
- [CRuby] Add Nokogiri::XML::Node#line= for use by downstream libs like nokogumbo. [#1918] (Thanks, @stevecheckoway!)
- nokogiri.gemspec is back after a 10-year hiatus. We still prefer you use the official releases, but master is pretty stable these days, and YOLO.
Performance
- [CRuby] The CSS ~= operator and the class selector (.) are about 2x faster. [#2137, #2135]
- [CRuby] Patch libxml2 to call strlen from xmlStrlen rather than the naive implementation, because strlen is generally optimized for the architecture. [#2144] (Thanks, @ilyazub!)
- Improve performance of some namespace operations. [#1916] (Thanks, @ashmaroli!)
- Remove unnecessary array allocations from Node serialization methods [#1911] (Thanks, @ashmaroli!)
- Avoid creation of unnecessary zero-length String objects. [#1970] (Thanks, @ashmaroli!)
- Always compile libxml2 and libxslt with '-O2' [#2022, #2100] (Thanks, @ilyazub!)
- [JRuby] Lots of code cleanup and performance improvements. [#1934] (Thanks, @kares!)
- [CRuby] RelaxNG.from_document no longer leaks memory. [#2114]
Improved
- [CRuby] Handle incorrectly-closed HTML comments as WHATWG recommends for browsers. [#2058] (Thanks to HackerOne user mayflower for reporting this!)
- {HTML,XML}::Document#parse now accept Pathname objects. Previously this worked only if the referenced file was less than 4096 bytes long; longer files resulted in undefined behavior because the read method would be repeatedly invoked. [#1821, #2110] (Thanks, @doriantaylor and @phokz!)
- [CRuby] Nokogumbo builds faster because it can now use header files provided by Nokogiri. [#1788] (Thanks, @stevecheckoway!)
- Add frozen_string_literal: true magic comment to all lib files. [#1745] (Thanks, @oniofchaos!)
- [JRuby] Clean up deprecated calls into JRuby. [#2027] (Thanks, @headius!)
Fixed
- HTML Parsing in "strict" mode (i.e., the RECOVER parse option not set) now correctly raises a XML::SyntaxError exception. Previously the value of the RECOVER bit was being ignored by CRuby and was misinterpreted by JRuby. [#2130]
- The CSS ~= operator now correctly handles non-space whitespace in the class attribute. commit e45dedd
- The switch to turn off the CSS-to-XPath cache is now thread-local, rather than being shared mutable state. [#1935]
- The Node methods add_previous_sibling, previous=, before, add_next_sibling, next=, after, replace, and swap now correctly use their parent as the context node for parsing markup. These methods now also raise a RuntimeError if they are called on a node with no parent. [nokogumbo#160]
- [JRuby] XML::Schema XSD validation errors are captured in XML::Schema#errors. These errors were previously ignored.
- [JRuby] Standardize reading from IO like objects, including StringIO. [#1888, #1897]
- [JRuby] Fix how custom XPath function namespaces are inferred to be less naive. [#1890, #2148]
- [JRuby] Clarify exception message when custom XPath functions can't be resolved.
- [JRuby] Comparison of Node to Document with Node#<=> now matches CRuby/libxml2 behavior.
- [CRuby] Syntax errors are now correctly captured in Document#errors for short HTML documents. Previously the SAX parser used for encoding detection was clobbering libxml2's global error handler.
- [CRuby] Fixed installation on AIX with respect to vasprintf. [#1908]
- [CRuby] On some platforms, avoid symbol name collision with glibc's canonicalize. [#2105]
. [#2105] - [Windows Visual C++] Fixed compiler warnings and errors. [#2061, #2068]
- [CRuby] Fixed Nokogumbo integration which broke in the v1.11.0 release candidates. [#1788] (Thanks, @stevecheckoway!)
- [JRuby] Fixed document encoding regression in v1.11.0 release candidates. [#2080, #2083] (Thanks, @thbar!)
Removed
- The internal method Nokogiri::CSS::Parser.cache_on= has been removed. Use .set_cache if you need to muck with the cache internals.
- The class method Nokogiri::CSS::Parser.parse has been removed. This was originally deprecated in 2009 in 13db61b. Use Nokogiri::CSS.parse instead.
Changed
XML::Schema input is now "untrusted" by default
Address CVE-2020-26247.
In Nokogiri versions <= 1.11.0.rc3, XML Schemas parsed by Nokogiri::XML::Schema were trusted by default, allowing external resources to be accessed over the network, potentially enabling XXE or SSRF attacks.
This behavior is counter to the security policy intended by Nokogiri maintainers, which is to treat all input as untrusted by default whenever possible.
Please note that this security fix was pushed into a new minor version, 1.11.x, rather than a patch release to the 1.10.x branch, because it is a breaking change for some schemas and the risk was assessed to be "Low Severity".
More information and instructions for enabling "trusted input" behavior in v1.11.0.rc4 and later is available at the [publi...
v1.11.0.rc4 / 2020-12-29
Latest is v1.11.0.rc4 (2020-12-29). To try out release candidates, use gem install --prerelease or gem install nokogiri -v1.11.0.rc4
If you're using bundler, try updating your Gemfile with:
gem "nokogiri", "~> 1.11.0.rc4"
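To see which versions a constraint like the one above admits, RubyGems' own requirement classes can be queried directly. This is a small sketch using only the rubygems library that ships with Ruby; the helper name allowed? and the list of candidate versions are chosen for illustration.

```ruby
require "rubygems"

requirement = Gem::Requirement.new("~> 1.11.0.rc4")

# True when the given version string satisfies the requirement.
def allowed?(requirement, version)
  requirement.satisfied_by?(Gem::Version.new(version))
end

candidates = %w[1.11.0.rc3 1.11.0.rc4 1.11.0 1.11.7 1.12.0]
results = candidates.map { |v| [v, allowed?(requirement, v)] }.to_h

# The constraint admits rc4 and every later 1.11.x release, but neither
# the earlier rc3 nor anything from 1.12 onward.
results.each { |version, ok| puts "#{version}: #{ok}" }
```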
Delta since v1.11.0.rc3:
Notes
- Added precompiled native gem support for Darwin (OSX) platform arm64-darwin
Dependencies
Ruby
- End of support for Ruby 2.4, for which official support ended on 2020-04-05
Gems
Security
See note below about CVE-2020-26247 in the "Changed" subsection entitled "XML::Schema parsing treats input as untrusted by default".
Performance
- [CRuby] The CSS ~= operator and the class selector (.) are about 2x faster. [#2137, #2135]
- [CRuby] Patch libxml2 to call strlen from xmlStrlen rather than the naive implementation, because strlen is generally optimized for the architecture. [#2144] (Thanks, @ilyazub!)
- Always compile libxml2 and libxslt with '-O2' [#2022, #2100] (Thanks, @ilyazub!)
- [CRuby] RelaxNG.from_document no longer leaks memory. [#2114]
Improved
- [CRuby] Handle incorrectly-closed HTML comments as WHATWG recommends for browsers. [#2058] (Thanks to HackerOne user mayflower for reporting this!)
- {HTML,XML}::Document#parse now accept Pathname objects. Previously this worked only if the referenced file was less than 4096 bytes long; longer files resulted in undefined behavior because the read method would be repeatedly invoked. [#1821, #2110] (Thanks, @doriantaylor and @phokz!)
- [CRuby] Nokogumbo builds faster because it can now use header files provided by Nokogiri. [#1788] (Thanks, @stevecheckoway!)
- [JRuby] Clean up deprecated calls into JRuby. [#2027] (Thanks, @headius!)
Fixed
- HTML Parsing in "strict" mode (i.e., the RECOVER parse option not set) now correctly raises a XML::SyntaxError exception. Previously the value of the RECOVER bit was being ignored by CRuby and was misinterpreted by JRuby. [#2130]
- The CSS ~= operator now correctly handles non-space whitespace in the class attribute. commit e45dedd
- The Node methods add_previous_sibling, previous=, before, add_next_sibling, next=, after, replace, and swap now correctly use their parent as the context node for parsing markup. These methods now also raise a RuntimeError if they are called on a node with no parent. [nokogumbo#160]
- [JRuby] XML::Schema XSD validation errors are captured in XML::Schema#errors. These errors were previously ignored.
- [JRuby] Fix how custom XPath function namespaces are inferred to be less naive. [#1890, #2148]
- [JRuby] Clarify exception message when custom XPath functions can't be resolved.
- [JRuby] Comparison of Node to Document with Node#<=> now matches CRuby/libxml2 behavior.
- [CRuby] Syntax errors are now correctly captured in Document#errors for short HTML documents. Previously the SAX parser used for encoding detection was clobbering libxml2's global error handler.
- [CRuby] On some platforms, avoid symbol name collision with glibc's canonicalize. [#2105]
- [CRuby] Fixed Nokogumbo integration which broke in the v1.11.0 release candidates. [#1788] (Thanks, @stevecheckoway!)
- [JRuby] Fixed document encoding regression in v1.11.0 release candidates. [#2080, #2083] (Thanks, @thbar!)
Changed
XML::Schema input is now "untrusted" by default
Address CVE-2020-26247.
In Nokogiri versions <= 1.11.0.rc3, XML Schemas parsed by Nokogiri::XML::Schema were trusted by default, allowing external resources to be accessed over the network, potentially enabling XXE or SSRF attacks.
This behavior is counter to the security policy intended by Nokogiri maintainers, which is to treat all input as untrusted by default whenever possible.
Please note that this security fix was pushed into a new minor version, 1.11.x, rather than a patch release to the 1.10.x branch, because it is a breaking change for some schemas and the risk was assessed to be "Low Severity".
More information and instructions for enabling "trusted input" behavior in v1.11.0.rc4 and later is available at the public advisory.
HTML parser now obeys the strict or norecover parsing option
(Also noted above in the "Fixed" section) HTML Parsing in "strict" mode (i.e., the RECOVER parse option not set) now correctly raises a XML::SyntaxError exception. Previously the value of the RECOVER bit was being ignored by CRuby and was misinterpreted by JRuby.
If you're using the default parser options, you will be unaffected by this fix. If you're passing strict or norecover to your HTML parser call, you may be surprised to see that the parser now fails to recover and raises a XML::SyntaxError exception. Given the number of HTML documents on the internet that libxml2 would consider to be ill-formed, this is probably not what you want, and you can omit setting that parse option to restore the behavior that you have been relying upon.
Apologies to anyone inconvenienced by this breaking bugfix being present in a minor release, but I felt it was appropriate to introduce this fix because it's straightforward to fix any code that has been relying on this buggy behavior.
VersionInfo, the output of nokogiri -v, and related constants
This release changes the metadata provided in Nokogiri::VersionInfo which also affects the output of nokogiri -v. Some related constants have also been changed. If you're using VersionInfo programmatically, or relying on constants related to underlying library versions, please read the detailed changes for Nokogiri::VersionInfo at #2139 and accept our apologies for the inconvenience.