Skip to content

Commit

Permalink
perf: html5 attribute parsing (#3393)
Browse files Browse the repository at this point in the history
## What problem is this PR intended to solve?

Addressing quadratic performance of HTML5 attribute parsing as described
in #2568 and rubys/nokogumbo#144.

Closes #2568


### Some benchmarks

<details><summary>Here is the benchmark script.</summary>

```ruby
#!/usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
  gem "benchmark-ips"
end

def html_with_attributes(count)
  html = "<div"
  count.times do |j|
    html << " fake-attr-#{j}"
  end
  html << ">"
end

html = {}
Benchmark.ips do |x|
  x.warmup = 0

  [ 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384 ].each do |attribute_count|
    html[attribute_count] = html_with_attributes(attribute_count)

    x.report "#{attribute_count.to_s.rjust(7)} attributes" do
      Nokogiri::HTML5(html[attribute_count], max_attributes: 100_000)
    end
  end
end
```

</details>

before:

```
Calculating -------------------------------------
        1 attributes    242.976k (± 9.3%) i/s    (4.12 μs/i) -    936.599k in   4.838811s
        2 attributes    219.456k (± 9.4%) i/s    (4.56 μs/i) -    837.644k in   4.858317s
        4 attributes    191.173k (± 9.3%) i/s    (5.23 μs/i) -    715.392k in   4.882336s
        8 attributes    144.813k (±12.2%) i/s    (6.91 μs/i) -    516.194k in   4.912577s
       16 attributes     83.396k (±17.2%) i/s   (11.99 μs/i) -    238.564k in   4.958331s
       32 attributes     55.800k (± 9.9%) i/s   (17.92 μs/i) -    190.871k in   4.970063s
       64 attributes     27.381k (±14.7%) i/s   (36.52 μs/i) -     91.867k in   4.981936s
      128 attributes     12.707k (± 9.4%) i/s   (78.70 μs/i) -     49.906k in   4.990254s
      256 attributes      4.723k (± 8.1%) i/s  (211.74 μs/i) -     19.657k in   4.995888s
      512 attributes      1.417k (±10.6%) i/s  (705.61 μs/i) -      6.339k in   4.998084s
     1024 attributes    382.078 (±11.5%) i/s    (2.62 ms/i) -      1.811k in   5.000356s
     2048 attributes    119.825 (±10.8%) i/s    (8.35 ms/i) -    582.000 in   5.003555s
     4096 attributes     29.034 (± 6.9%) i/s   (34.44 ms/i) -    144.000 in   5.003602s
     8192 attributes      6.492 (± 0.0%) i/s  (154.04 ms/i) -     33.000 in   5.111381s
    16384 attributes      1.478 (± 0.0%) i/s  (676.68 ms/i) -      8.000 in   5.417232s
```

after:

```
Calculating -------------------------------------
        1 attributes    229.084k (±10.4%) i/s    (4.37 μs/i) -    868.071k in   4.846385s
        2 attributes    213.788k (± 9.9%) i/s    (4.68 μs/i) -    804.683k in   4.859819s
        4 attributes    181.602k (±10.7%) i/s    (5.51 μs/i) -    652.740k in   4.886700s
        8 attributes    137.811k (± 9.9%) i/s    (7.26 μs/i) -    500.690k in   4.911176s
       16 attributes     95.462k (± 9.9%) i/s   (10.48 μs/i) -    338.045k in   4.938699s
       32 attributes     61.388k (± 9.5%) i/s   (16.29 μs/i) -    216.911k in   4.961587s
       64 attributes     34.368k (±10.2%) i/s   (29.10 μs/i) -    121.985k in   4.976219s
      128 attributes     17.315k (±10.9%) i/s   (57.75 μs/i) -     60.716k in   4.987542s
      256 attributes      9.011k (± 9.2%) i/s  (110.98 μs/i) -     32.443k in   4.992878s
      512 attributes      4.552k (±11.8%) i/s  (219.69 μs/i) -     16.522k in   4.995834s
     1024 attributes      2.255k (±13.6%) i/s  (443.38 μs/i) -      8.303k in   4.997861s
     2048 attributes      1.091k (±17.6%) i/s  (916.91 μs/i) -      4.132k in   4.998710s
     4096 attributes    544.840 (±22.0%) i/s    (1.84 ms/i) -      2.101k in   5.006943s
     8192 attributes    282.789 (±22.3%) i/s    (3.54 ms/i) -      1.089k in   5.000604s
    16384 attributes    134.991 (±27.4%) i/s    (7.41 ms/i) -    542.000 in   5.001360s
```

In graphical form:


![image](https://github.com/user-attachments/assets/c07ac065-0069-43ce-85e7-ed1cb4a86542)

Just the "after", showing linear performance now:


![image](https://github.com/user-attachments/assets/3051bcda-1b55-4270-8f14-f24eba0fb327)


### Analysis

Analyzing the parsing of a tag with 16000 attributes (using `samply`),
we see two hotspots:


![image](https://github.com/user-attachments/assets/3a2409e6-c394-4a73-a1b2-3373654e760d)

1. the `strlen`/`memcmp` checks in libgumbo's
`tokenizer.c:finish_attribute_name` which checks for duplicates in the
attribute list:
https://github.com/sparklemotion/nokogiri/blob/729c96c05f692d69e4d1d3be6d4b9524bdbe9322/gumbo-parser/src/tokenizer.c#L804-L820
2. libxml2's `xmlNewPropInternal` traversing the singly-linked
properties list to append each new property (to maintain parse order)

This PR addresses (1) by introducing a hashmap and using it if there are
16 or more attributes.

This PR addresses (2) by inlining most of `xmlNewNsProp` but keeping a
pointer to the last property to avoid re-traversing the linked list each
time.

After those changes, the flamegraph looks much better, with no obvious
hotspots:


![image](https://github.com/user-attachments/assets/36859e83-b35b-4489-af24-7f8c2f52935c)


### hashmap

Keeping in mind we may still someday want to publish our libgumbo fork
as a separate library, I opted to bring in a new self-contained hashmap
library from https://github.com/tidwall/hashmap.c

I'm using [xxHash3](https://github.com/Cyan4973/xxHash) which seems to
be pretty fast, but I'm open to other opinions on what hash function to
use.


## Have you included adequate test coverage?

Yes.


## Does this change affect the behavior of either the C or the Java
implementations?

HTML5 is CRuby-only.
  • Loading branch information
flavorjones authored Dec 29, 2024
2 parents 9ddc4dd + c29c920 commit 1d83082
Show file tree
Hide file tree
Showing 15 changed files with 1,504 additions and 21 deletions.
8 changes: 8 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,14 @@ Nokogiri follows [Semantic Versioning](https://semver.org/), please see the [REA

---

## next / unreleased

### Improved

* [CRuby] The HTML5 parser now has linear performance when parsing many attributes. Previously performance was quadratic due to two hotspots, one in detecting duplicate attributes and one in constructing the libxml2 data structures. (#3393) @flavorjones



## v1.18.0 / 2024-12-25

### Notable Changes
Expand Down
29 changes: 29 additions & 0 deletions LICENSE-DEPENDENCIES.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ Note that this document is broken into multiple sections, each of which describe
* [Native WindowsⓇ platform releases ("x64-mingw-ucrt")](#native-windows%E2%93%A1-platform-releases-x64-mingw-ucrt)
* [JavaⓇ (JRuby) platform release ("java")](#java%E2%93%A1-jruby-platform-release-java)
- [Appendix: Dependencies' License Texts](#appendix-dependencies-license-texts)
* [hashmap.c](#hashmapc)
* [libgumbo](#libgumbo)
* [libxml2](#libxml2)
* [libxslt](#libxslt)
Expand Down Expand Up @@ -112,6 +113,34 @@ This section contains a subsection for each potentially-distributed dependency,
Please see previous sections to understand which of these potential dependencies is actually distributed in the gem file you're downloading and using.


### hashmap.c

MIT

https://github.com/tidwall/hashmap.c/blob/master/LICENSE

The MIT License (MIT)

Copyright (c) 2020 Joshua J Baker

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


### libgumbo

Apache 2.0
Expand Down
1 change: 1 addition & 0 deletions ext/nokogiri/extconf.rb
Original file line number Diff line number Diff line change
Expand Up @@ -1133,6 +1133,7 @@ def compile
have_func("xmlCtxtSetOptions") # introduced in libxml2 2.13.0
have_func("xmlCtxtGetOptions") # introduced in libxml2 2.14.0
have_func("xmlSwitchEncodingName") # introduced in libxml2 2.13.0
have_func("xmlAddIDSafe") # introduced in libxml2 2.13.0
have_func("rb_category_warning") # introduced in Ruby 3.0 but had trouble resolving this symbol in truffleruby

other_library_versions_string = OTHER_LIBRARY_VERSIONS.map { |k, v| [k, v].join(":") }.join(",")
Expand Down
70 changes: 69 additions & 1 deletion ext/nokogiri/gumbo.c
Original file line number Diff line number Diff line change
Expand Up @@ -113,6 +113,71 @@ set_line(xmlNodePtr node, size_t line)
}
}

// This function is essentially xmlNewNsProp, but we skip the full list traversal to append by
// providing the last property in the linked list as a parameter.
static xmlAttrPtr
append_property(xmlNodePtr node, xmlNsPtr ns, const xmlChar *name, const xmlChar *value, xmlAttrPtr last_prop)
{
xmlAttrPtr cur = (xmlAttrPtr) xmlMalloc(sizeof(xmlAttr));
if (cur == NULL) {
return NULL;
}
memset(cur, 0, sizeof(xmlAttr));
cur->type = XML_ATTRIBUTE_NODE;
cur->parent = node;
xmlDocPtr doc = node->doc;
cur->doc = doc;
cur->ns = ns;

if ((doc != NULL) && (doc->dict != NULL)) {
cur->name = (xmlChar *) xmlDictLookup(doc->dict, name, -1);
} else {
cur->name = xmlStrdup(name);
}
if (cur->name == NULL) {
goto error;
}

if (value != NULL) {
cur->children = xmlNewDocText(doc, value);
if (cur->children == NULL) {
goto error;
}
cur->last = NULL;
xmlNodePtr tmp = cur->children;
while (tmp != NULL) {
tmp->parent = (xmlNodePtr) cur;
if (tmp->next == NULL) {
cur->last = tmp;
}
tmp = tmp->next;
}

if (doc != NULL) {
int res = xmlIsID(doc, node, cur);
if (res < 0) {
goto error;
}
if ((res == 1) && (xmlAddIDSafe(cur, value) < 0)) {
goto error;
}
}
}

if (node->properties == NULL) {
node->properties = cur;
} else {
last_prop->next = cur;
cur->prev = last_prop;
}

return cur;

error:
xmlFreeProp(cur);
return (NULL);
}

// Construct an XML tree rooted at xml_output_node from the Gumbo tree rooted
// at gumbo_node.
static void
Expand Down Expand Up @@ -200,6 +265,7 @@ build_tree(
xmlAddChild(xml_node, xml_child);

// Add the attributes.
xmlAttrPtr last_prop = NULL;
const GumboVector *attrs = &gumbo_child->v.element.attributes;
for (size_t i = 0; i < attrs->length; i++) {
const GumboAttribute *attr = attrs->data[i];
Expand All @@ -220,7 +286,9 @@ build_tree(
default:
ns = NULL;
}
xmlNewNsProp(xml_child, ns, (const xmlChar *)attr->name, (const xmlChar *)attr->value);

// We micromanage the attribute list for performance reasons.
last_prop = append_property(xml_child, ns, (const xmlChar *)attr->name, (const xmlChar *)attr->value, last_prop);
}

// Add children for this element.
Expand Down
12 changes: 12 additions & 0 deletions ext/nokogiri/libxml2_polyfill.c
Original file line number Diff line number Diff line change
Expand Up @@ -112,3 +112,15 @@ xmlSwitchEncodingName(xmlParserCtxtPtr ctxt, const char *encoding)
return (xmlSwitchToEncoding(ctxt, handler));
}
#endif

#ifndef HAVE_XMLADDIDSAFE
int
xmlAddIDSafe(xmlAttrPtr attr, const xmlChar *value)
{
xmlIDPtr id = xmlAddID(NULL, attr->doc, value, attr);
if (id) {
return 1;
}
return 0;
}
#endif
3 changes: 3 additions & 0 deletions ext/nokogiri/nokogiri.h
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,9 @@ int xmlCtxtGetOptions(xmlParserCtxtPtr ctxt);
#ifndef HAVE_XMLSWITCHENCODINGNAME
int xmlSwitchEncodingName(xmlParserCtxtPtr ctxt, const char *encoding);
#endif
#ifndef HAVE_XMLADDIDSAFE
int xmlAddIDSafe(xmlAttrPtr attr, const xmlChar *value);
#endif

#define XMLNS_PREFIX "xmlns"
#define XMLNS_PREFIX_LEN 6 /* including either colon or \0 */
Expand Down
2 changes: 2 additions & 0 deletions gumbo-parser/src/Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -13,9 +13,11 @@ gumbo_objs := \
char_ref.o \
error.o \
foreign_attrs.o \
hashmap.o \
parser.o \
string_buffer.o \
string_piece.o \
string_set.o \
svg_attrs.o \
svg_tags.o \
tag.o \
Expand Down
Loading

0 comments on commit 1d83082

Please sign in to comment.