-
-
Notifications
You must be signed in to change notification settings - Fork 904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTML5 documents should not require namespaces in CSS selector queries #2403
Conversation
15fb707
to
e1cffed
Compare
e1cffed
to
b52dde7
Compare
I could use a review in particular of the libxml2 patch b52dde7, because this feels a little bit risky. Worth noting, though: with that patch applied, libxml2's full test suite still passes. |
I spent a little while looking at the xpath.c patch. Nothing jumps out at me as being wrong, but the file is huge and it's difficult to know what the implications are precisely. I'll try to look more closely this coming week. |
b52dde7
to
f3af677
Compare
@stevecheckoway Thanks for the spot check. I'm going to merge this and accept a little bit of risk, because both libxml2 and nokogiri's full test suites still pass with that libxml2 patch applied. |
I think that makes sense. Sorry for not doing a better review in a reasonable amount of time. |
Oh, please don't apologize! I appreciate your attention. |
This makes it explicit that visitor must be injected. Although this is technically a breaking change, CSS::Node is essentially an internal API and I would be extremely surprised if it was being used directly by anyone. This commit also ensures that CSS::Node, CSS::Parser, and CSS::Tokenizer are omitted from API documentation (to reflect their status as an internal-only and unstable API).
Note that the method signature of internal-only method CSS::Parser.xpath_for has changed to prefer required parameters rather than an options hash with assumed defaults. Also note that I've introduced some XPath constants for XPath query prefixes, which we should start using.
which prevents incorrect cache results being returned in cases where different visitor configurations are used. We also deprecate XPathVisitorAlwaysUseBuiltins and XPathVisitorOptimallyUseBuiltins in favor of a simple XPathVisitor with constructor arguments.
which will allow us to more easily special-case element names in an upcoming commit.
and refactor the XPathVisitor#css_class to not rely on aliasing during object initialization, which feels funny to me
we're about to introduce variations in how CSS selectors are translated into XPath and I want to ensure we have adequate test coverage.
In HTML5, foreign elements have namespaces; but those namespaces should not be considered for the purposes of CSS searches. Unfortunately, this implementation's use of local-name() is ~10x slower than using the normal inline element name matching, which subsequent commits will explore.
Also tweak :nodoc: directives to avoid accidentally excluding important modules (like Nokogiri::CSS) and avoid including code that's not usually relevant to the user API (like Nokogiri::XML:PP, Nokogiri::CSS:Tokenizer, and Nokogiri::HTML4::Document::EncodingReader)
and other small changes to doc strings
This is almost as fast as a standard child-axis search, and much faster than the builtin or using local-name(): //span 18.923k (± 8.9%) i/s - 93.906k in 5.010792s //*[local-name()='span'] 1.849k (± 2.8%) i/s - 9.261k in 5.011560s //*[nokogiri-builtin:local-name-is('span')] 3.191k (± 2.4%) i/s - 16.150k in 5.064798s //*:span 18.016k (± 4.6%) i/s - 89.900k in 5.003444s Comparison: //span: 18922.5 i/s //*:span: 18016.5 i/s - same-ish: difference falls within error //*[nokogiri-builtin:local-name-is('span')]: 3190.6 i/s - 5.93x (± 0.00) slower //*[local-name()='span']: 1849.4 i/s - 10.23x (± 0.00) slower
af58bbc
to
d1a710e
Compare
**What problem is this PR intended to solve?** Before a minor release, I generally review deprecations and look for things we can remove. * Removed `Nokogiri::HTML5.get` which was deprecated in v1.12.0. [#2278] (@flavorjones) * Removed the CSS-to-XPath utility modules `XPathVisitorAlwaysUseBuiltins` and `XPathVisitorOptimallyUseBuiltins`, which were deprecated in v1.13.0 in favor of `XPathVisitor` constructor args. [#2403] (@flavorjones) * Removed `XML::Reader#attribute_nodes` which was deprecated in v1.13.8 in favor of `#attribute_hash`. [#2598, #2599] (@flavorjones) Also we're now specifying version numbers in remaining deprecation warnings. **Have you included adequate test coverage?** Tests have been removed, otherwise no new coverage needed. **Does this change affect the behavior of either the C or the Java implementations?** As documented above.
What problem is this PR intended to solve?
Discussion at #2376 led to the decision to continue to parse HTML5 foreign elements with namespaces, but not to require that namespace in CSS queries.
So, in the canonical example:
Nokogiri will respond to the following searches:
In order to accomplish this, this PR is rather invasive, but leaves a lot of code better factored, better documented, more flexible, and more extensible.
Specific changes:
Nokogiri::CSS.xpath_for
allows aNokogiri::CSS::XPathVisitor
to be injected for fine-grained control over how CSS selectors are translated into XPath queriesHTML5
documents and fragments translate child-axis searches into a wildcard-namespace query if libxml2 is patched; else use a builtin xpath function if libxml2; else use standard XPath local-name() (this order of preference reflects performance, faster-is-more-preferred)Have you included adequate test coverage?
Oh yesssss.
Also here are the benchmarks of the various implementations:
I'll also drop in some graphs of the callstacks below.
Does this change affect the behavior of either the C or the Java implementations?
HTML5 is already only supported by CRuby, and so much of this functionality is simply not going to be turned on in JRuby. That said, if we need this functionality in JRuby, the fallback use of standard XPath
local-name()
would suffice until a need arose for performance optimizations there.