v1.17.0 / 2024-12-08
Dependencies
- [CRuby] Vendored libxml2 is updated to v2.13.5. @flavorjones
- [CRuby] Vendored libxslt is updated to v1.1.42. @flavorjones
- [CRuby] Minimum supported version of libxml2 raised to v2.9.2 (released 2014-10-16) from v2.6.21. [#3232, #3287] @flavorjones
- [JRuby] Minimum supported version of Java raised to 8 (released 2014-03-18) from 7. [#3134] @flavorjones
- [CRuby] Update to rake-compiler-dock v1.5.1 for building precompiled native gems. [#3216] @flavorjones
Notable changes
SAX Parsers
The XML and HTML4 SAX parsers have received a lot of attention in this release, and we've fixed multiple long-standing bugs with encoding and entity handling. In addition, libxml2 v2.13 has made some underlying fixes and improvements to encoding and entity handling.
We're shipping these fixes in a minor release because we firmly believe the resulting behavior is correct and standards-compliant. However, applications that have been depending on the buggy behavior may be impacted.
If your application relies on the SAX parsers, and in particular if you're SAX-parsing documents with parsed entities or incorrect encoding declarations, please read the changelog below carefully.
Fragment parsing
Document fragment parsing has been improved, particularly with respect to handling malformed fragments or fragments with implicit namespace prefixes. Namespace reconciliation still isn't where we want it to be, but it's an improvement.
HTML5 fragment parsing now allows the context node to be specified as a context: keyword argument to the HTML5::DocumentFragment.parse and .new methods, which should allow for more flexible sanitization and future support for the draft HTML Sanitizer API in downstream libraries.
Error handling
In scenarios where multiple errors could be reported by the underlying parser, the errors will be aggregated into a single Nokogiri::XML::SyntaxError that is raised. Previously only the final error reported by libxml2 was raised (which was often misleading if it was only a warning and not the fatal error).
Schema validation
We've resolved many long-standing bugs in the various schema classes, validation methods, and their error reporting. Behavior is now consistent across schema types and input types, as well as parser backends (Xerces and libxml2).
Keyword arguments
The following methods now accept keyword arguments in addition to positional arguments, and use ... parameter forwarding when possible: HTML4(), HTML4.fragment, HTML4.parse, HTML4::Document.parse, HTML4::DocumentFragment#initialize, HTML4::DocumentFragment.parse, HTML5(), HTML5.fragment, HTML5.parse, HTML5::Document.parse, HTML5::Document.read_io, HTML5::Document.read_memory, HTML5::DocumentFragment#initialize, HTML5::DocumentFragment.parse, XML(), XML.fragment, XML.parse, XML::Document.parse, XML::DocumentFragment#initialize, XML::DocumentFragment.parse, XML::Node#canonicalize, XML::Node.parse, XML::Reader(), XML::RelaxNG(), XML::RelaxNG.new, XML::RelaxNG.read_memory, XML::SAX::PushParser#initialize, XML::Schema(), XML::Schema.new, XML::Schema.read_memory, and XSLT().
Special thanks to those contributors who participated in the RubyConf 2024 Hack Day to work on #3323 to help modernize Nokogiri by adding keyword arguments and using parameter forwarding in many methods, and expanding some of the documentation! We intend to continue adding keyword argument support to more methods. #3323 #3324 #3326 #3327 #3329 #3330 #3332 #3333 #3334 #3335 #3336 #3342 #3355 #3356 @infews @matiasow @MattJones @mononoken @openbl @flavorjones
Added
- Introduce support for a new SAX callback XML::SAX::Document#reference, which is called to report some parsed XML entities when XML::SAX::ParserContext#replace_entities is set to the default value false. This is necessary functionality for some applications that were previously relying on incorrect entity error reporting which has been fixed (see below). For more information, read the docs for Nokogiri::XML::SAX::Document. [#1926] @flavorjones
- XML::SAX::Parser#parse_memory and #parse_file now accept an optional encoding argument. When not provided, the parser will fall back to the encoding passed to the initializer, and then fall back to autodetection. [#3288] @flavorjones
- XML::SAX::ParserContext.memory now accepts an optional encoding argument. When not provided, the encoding will be autodetected. [#3288] @flavorjones
- New readonly attributes XML::DocumentFragment#parse_options and HTML4::DocumentFragment#parse_options return the options used to parse the document fragment. @flavorjones
- New method XML::Reader.new is the primary constructor to which XML::Reader() forwards. Both methods now take url:, encoding:, and options: kwargs in addition to the previous calling convention of passing positional parameters. #3326 @infews @flavorjones
- [CRuby] The HTML5 parse methods accept a :parse_noscript_content_as_text keyword argument which will emulate the parsing behavior of a browser which has scripting enabled. [#3178, #3231] @stevecheckoway
- [CRuby] HTML5::DocumentFragment.parse and .new accept a :context keyword argument that is the parse context node or element name. Previously this could only be passed in as a positional argument to .new and not at all to .parse. @flavorjones
- [CRuby] Nokogiri::HTML5::Builder is similar to HTML4::Builder but returns an HTML5::Document. [#3119] @flavorjones
- [CRuby] Attributes in an HTML5 document can be serialized individually, something that has always been supported by the HTML4 serializer. [#3125, #3127] @flavorjones
- [CRuby] Introduce a compile-time option, --disable-xml2-legacy, to remove from libxml2 its dependencies on zlib and liblzma and disable implicit HTTP network requests. These all remain enabled by default, and are present in the precompiled native gems. This option is a precursor for removing these libraries in a future major release, but may be interesting for the security-minded who do not need features like automatic decompression and would like to remove these dependencies. You can read more and give feedback on these plans in #3168. [#3247] @flavorjones
- [CRuby] If errors are returned from schema validation, a new attribute SyntaxError#path will contain the XPath path of the node that caused the validation failure. [#3316] @ryanong
Improved
- Documentation has been improved for XML::RelaxNG, XML::Schema, XML::Reader, HTML5, HTML5::Document, HTML5::DocumentFragment, HTML4::Document, HTML4::DocumentFragment, XML, XML::Document, XML::DocumentFragment. #3355 @flavorjones
- Documentation has been improved for CSS.xpath_for. [#3224] @flavorjones
- Documentation for the SAX parsing classes has been greatly improved, including encoding overrides and the complex entity-handling behavior. [#3265] @flavorjones
- XML::Schema#read_memory and XML::RelaxNG#read_memory are now Ruby methods that call #from_document. Previously these were native functions, but they were buggy on both CRuby and JRuby (but worse on JRuby) and so this is now useful, comparable in performance, and simpler code that is easier to maintain. [#2113, #2115] @flavorjones
- XML::SAX::ParserContext.io's encoding argument is now optional, and can now be an Encoding or an encoding name. When not provided, it defaults to autodetecting the encoding. [#3288] @flavorjones
- [CRuby] The update to libxml2 v2.13 improves "in context" fragment parsing recovery. We removed our hacky workaround for recovery that led to silently-degraded functionality when parsing fragments with parse errors. Specifically, malformed XML fragments that used implicit namespace prefixes will now "link up" to the namespaces in the parent document or node, where previously they did not. [#2092] @flavorjones
- [CRuby] When multiple errors could be detected by the parser and there's no obvious document to save them in (for example, when parsing a document with the recovery parse option turned off), the libxml2 errors are aggregated into a single Nokogiri::XML::SyntaxError. Previously, only the last error recorded by libxml2 was raised, which might be misleading if it's merely a warning and not the fatal error preventing the operation. [#2562] @flavorjones
- [CRuby] The SAX parser context and handler implementation has been simplified and now takes advantage of some of libxml2's default SAX handlers for entities and DTD management. [#3265] @flavorjones
- [CRuby] When compiling packaged libraries from source, allow users' AR and LD environment variables to set the archiver and linker commands, respectively. This augments the existing CC environment variable to set the compiler command. [#3165] @ziggythehamster
- [CRuby] When building from source on MacOS, environment variables AR and RANLIB are now respected when set instead of being overridden to /usr/bin/{ar,ranlib} (which is still the default). [#3338] @joshheinrichs-shopify
Fixed
- Node#clone, NodeSet#clone, and *::Document#clone all properly copy the metaclass of the original as expected. Previously, #clone had been aliased to #dup for these classes (since v1.3.0 in 2009). [#316, #3117] @flavorjones
- CSS queries for pseudo-selectors that cannot be translated into XPath expressions now raise a more descriptive Nokogiri::CSS::SyntaxError when they are parsed. Previously, an invalid XPath expression was evaluated and a hard-to-understand XPath error was raised by the query engine. [#3193] @flavorjones
- Schema#validate returns errors on empty and malformed files. Previously, it would return errors on empty/malformed Documents, but not when reading from files. [#642] @flavorjones
- XML::Builder is now consistent with how it sets block scope. Previously, missing methods with blocks on dynamically-created nodes were always handled by invoking instance_eval(&block) on the Builder, even when the Builder was yielding self for all other missing methods with blocks. [#1041] @flavorjones
- HTML4::DocumentFragment.parse accepts IO input. Previously, it required a string and would raise a TypeError when passed an IO. [#2069] @sharvy
- [CRuby] libgumbo (the HTML5 parser) treats reaching max-depth as EOF. This addresses a class of issues when the parser is interrupted in this way. [#3121] @stevecheckoway
- [CRuby] Update node GC lifecycle to avoid a potential memory leak with fragments in libxml2 v2.13.0 caused by changes in xmlAddChild. [#3156] @flavorjones
- [CRuby] libgumbo correctly prints nonstandard element names in error messages. [#3219] @stevecheckoway
- [CRuby] External entity references no longer cause the SAX parser to register errors. [#1926] @flavorjones
- [JRuby] Fixed entity reference serialization, which rendered both the reference and the replacement text. Incredibly nobody noticed this bug for over a decade. [#3272] @flavorjones
- [JRuby] Fixed some bugs in how Node#attributes handles attributes with namespaces. [#2677, #2679] @flavorjones
- [JRuby] Fix Schema#validate to only return the most recent Document's errors. Previously, if multiple documents were validated, this method returned the accumulated errors of all previous documents. [#1282] @flavorjones
- [JRuby] Fix Schema#validate to not clobber the @errors instance variable. [#1282] @flavorjones
- [JRuby] Empty documents fail schema validation as they should. [#783] @flavorjones
- [JRuby] SAX parsing now respects the #replace_entities attribute, which defaults to false. Previously this flag defaulted to true and was completely ignored. [#614] @flavorjones
- [JRuby] The SAX callback Document#start_element_namespace received a blank string for the URI when a namespace was not present. It now receives nil (as does the CRuby impl). [#3265] @flavorjones
- [JRuby] Reader#outer_xml and #inner_xml encode entities properly. [#1523] @flavorjones
Changed
- [CRuby] Nokogiri::XML::CData.new no longer accepts nil as the content argument, making CData behave like other character data classes (like Comment and Text). This change was necessitated by behavioral changes in libxml2 v2.13.0. If you wish to create an empty CDATA node, pass an empty string. [#3156] @flavorjones
- Internals:
  - The internal CSS::XPathVisitor class now accepts the xpath prefix and the context namespaces as constructor arguments. The prefix: and ns: keyword arguments to CSS.xpath_for cannot be specified if the visitor: keyword argument is also used. CSS::XPathVisitor now exposes #builtins, #doctype, #prefix, and #namespaces attributes. [#3225] @flavorjones
  - The internal CSS selector cache has been extracted into a distinct class, CSS::SelectorCache. Previously it was part of the CSS::Parser class. [#3226] @flavorjones
  - The internal Gumbo.parse and Gumbo.fragment methods now take keyword arguments instead of positional arguments. [#3199] @flavorjones
Deprecated
- The undocumented and unused method Nokogiri::CSS.parse is now deprecated and will generate a warning. The AST returned by this method is private and subject to change and removal in future versions of Nokogiri. This method will be removed in a future version of Nokogiri.
- Passing an options hash to CSS.xpath_for is now deprecated and will generate a warning. Use keyword arguments instead. This will become an error in a future version of Nokogiri.
- Passing libxml2 encoding IDs to SAX::ParserContext methods is now deprecated and will generate a warning. The use of SAX::Parser::ENCODINGS is also deprecated. Use Encoding objects or encoding names instead.
Thank you!
Supporters
The following people and organizations were kind enough to sponsor @flavorjones or the Nokogiri project during the development of v1.17.0:
- via Github sponsors
- renuo @renuo
- Ajaya Agrawalla @ajaya
- Rob Stringer @Mycobee
- Better Stack Community @betterstack-community
- Prowly @prowlycom
- Maxime Gauthier @biximilien
- Harry Lascelles @hlascelles
- Evil Martians @evilmartians
- Typesense @typesense
- YOSHIDA Katsuhiko @kyoshidajp
- Quan Nguyen @qu8n
- Sentry @getsentry
- Codecov @codecov
- Frank Groeneveld @frenkel
- Hiroshi SHIBATA @hsbt
- Nando Vieira @fnando
- Orien Madgwick @orien
- Avo @avo-hq
- Zoran Pesic @zokioki
- @zzak
- Graham Watts @GingerGraham
- Nandang Permana Kusuma @nandangpk
- Mr. Henry @mrhenry
- Götz Görisch @GoetzGoerisch
- Andrew Nesbitt @andrew
- via Thanks.dev
- Sentry @getsentry
- Codecov @codecov
- Keygen @keygen-sh
- Keith Bauson @kwbauson
- Nicco Kunzmann @niccokunzmann
- timhaynes @timhaynes
- via Open Collective
- Airbnb @airbnb
- Nemo @captn3m0
- Velocity Labs @velocity-labs
We'd also like to thank @github who donate a ton of compute time for our CI pipelines!
New Contributors
- @adfoster-r7 made their first contribution in #3090
- @kianmeng made their first contribution in #3166
- @ziggythehamster made their first contribution in #3165
- @myabc made their first contribution in #3194
- @sharvy made their first contribution in #3298
- @ryanong made their first contribution in #3316
- @MattJones made their first contribution in #3328
- @joshheinrichs-shopify made their first contribution in #3338
- @matiasow made their first contribution in #3342
- @mononoken made their first contribution in #3329
- @openbl made their first contribution in #3333
- @infews made their first contribution in #3326
sha256 checksums
95cdf0d33fe29dd2478d6a34656c9dd909e4b7dae9467b24721af67e1944d6e6 nokogiri-1.17.0-aarch64-linux.gem
a0364ad985eb4c0a235e95896324969c20795be941a621fe753734bdee8cfa73 nokogiri-1.17.0-arm64-darwin.gem
f0c1e71e6f4cd64a6efea4761c85e280318a450968262d02bb917c13874c1c48 nokogiri-1.17.0-arm-linux.gem
4200f1c9525ad91b7226d35849f2c7909b20a5e4571ab1204cc3cda1debe59ef nokogiri-1.17.0.gem
21b8f5022c018a72d97bc1841bb67a8a391456491c08c744141bb6a8f39b3d04 nokogiri-1.17.0-java.gem
408ecf5bb34074bc4551f5f41388a3746cb96fdc932b06a686c142038ba7aa38 nokogiri-1.17.0-x64-mingw32.gem
b4dd8ed5f8de6814ec5ee18cb2708e716babed998f5ee7b67e62aec19d5ffbf0 nokogiri-1.17.0-x64-mingw-ucrt.gem
8d9d5bd2db1aa6b41b4ed9c0b890a9e76c33cb031008971b1fd34a35b1f525a5 nokogiri-1.17.0-x86_64-darwin.gem
fd34467481d6c50f800a516e5db029ca3ad3fb8fcdec032bae581a2d80a4a74b nokogiri-1.17.0-x86_64-linux.gem
ac2a4eff755d00d8e8534f2af51cd5622321f3b2481cc4277df4e2cd32fabfc2 nokogiri-1.17.0-x86-linux.gem
c478d7168db29511085630280719fd23a5864ae88a5ed879e7fff2954906e727 nokogiri-1.17.0-x86-mingw32.gem
Full Changelog: v1.16.0...v1.17.0