Optimize #at_css and #css initialization #14

Merged
merged 15 commits into from
Dec 27, 2024
82 changes: 44 additions & 38 deletions README.md
@@ -2,7 +2,7 @@

[![CI](https://github.com/serpapi/nokolexbor/actions/workflows/ci.yml/badge.svg)](https://github.com/serpapi/nokolexbor/actions/workflows/ci.yml)

Nokolexbor is a drop-in replacement for Nokogiri. It's 5.2x faster at parsing HTML and up to 997x faster at CSS selectors.
Nokolexbor is a drop-in replacement for Nokogiri. It's 4.7x faster at parsing HTML and up to 1352x faster at CSS selectors.

It's a performance-focused HTML5 parser for Ruby based on [Lexbor](https://github.com/lexbor/lexbor/). It supports both CSS selectors and XPath. Nokolexbor's API is designed to be 1:1 compatible as much as possible with [Nokogiri's API](https://github.com/sparklemotion/nokogiri).

@@ -106,75 +106,81 @@ end

## Benchmarks

Benchmark parsing google result page (368 KB) and selecting nodes using CSS and XPath. Run on MacBook Pro (2019) 2.3 GHz 8-Core Intel Core i9.
Benchmarks of parsing Google search result page (367 KB) and finding nodes using CSS selectors and XPath.

CPU: AMD Ryzen 5 5600 (Ubuntu 20.04 on Windows 10 WSL 2).

Run with: `ruby bench/bench.rb`

| | Nokolexbor (iters/s) | Nokogiri (iters/s) | Diff |
| ---------- | ------------- | ----------- | -------------- |
| parsing | 487.6 | 93.5 | 5.22x faster |
| at_css | 50798.8 | 50.9 | 997.87x faster |
| css | 7437.6 | 52.3 | 142.11x faster |
| at_xpath | 57.077 | 53.176 | same-ish |
| xpath | 51.523 | 58.438 | same-ish |
| ---------- | ------------- | ------------ | --------------- |
| parsing | 994.8 | 211.8 | 4.70x faster |
| at_css | 202963.7 | 150.1 | 1352.33x faster |
| css | 9787.9 | 150.0 | 65.27x faster |
| at_xpath | 153.2 | 154.6 | same-ish |
| xpath | 153.2 | 154.3 | same-ish |

<details>
<summary>Raw data</summary>

```
Warming up --------------------------------------
Nokolexbor parse 56.000 i/100ms
Nokogiri parse 8.000 i/100ms
Nokolexbor parse (367 KB)
100.000 i/100ms
Nokogiri parse (367 KB)
20.000 i/100ms
Calculating -------------------------------------
Nokolexbor parse 487.564 (±10.9%) i/s - 9.688k in 20.117173s
Nokogiri parse 93.470 (±21.4%) i/s - 1.736k in 20.024163s
Nokolexbor parse (367 KB)
994.773 (± 0.9%) i/s - 19.900k in 20.006124s
Nokogiri parse (367 KB)
211.793 (±12.3%) i/s - 4.180k in 20.093299s

Comparison:
Nokolexbor parse: 487.6 i/s
Nokogiri parse: 93.5 i/s - 5.22x (± 0.00) slower
Nokolexbor parse (367 KB): 994.8 i/s
Nokogiri parse (367 KB): 211.8 i/s - 4.70x (± 0.00) slower

Warming up --------------------------------------
Nokolexbor at_css 5.548k i/100ms
Nokogiri at_css 6.000 i/100ms
Nokolexbor at_css 20.195k i/100ms
Nokogiri at_css 15.000 i/100ms
Calculating -------------------------------------
Nokolexbor at_css 50.799k (±13.8%) i/s - 987.544k in 20.018481s
Nokogiri at_css 50.907 (±35.4%) i/s - 828.000 in 20.666258s
Nokolexbor at_css 202.964k (± 0.7%) i/s - 4.059M in 20.000626s
Nokogiri at_css 150.084 (± 0.7%) i/s - 3.015k in 20.089207s

Comparison:
Nokolexbor at_css: 50798.8 i/s
Nokogiri at_css: 50.9 i/s - 997.87x (± 0.00) slower
Nokolexbor at_css: 202963.7 i/s
Nokogiri at_css: 150.1 i/s - 1352.33x (± 0.00) slower

Warming up --------------------------------------
Nokolexbor css 709.000 i/100ms
Nokogiri css 4.000 i/100ms
Nokolexbor css 977.000 i/100ms
Nokogiri css 15.000 i/100ms
Calculating -------------------------------------
Nokolexbor css 7.438k (±14.7%) i/s - 145.345k in 20.083833s
Nokogiri css 52.338 (±36.3%) i/s - 816.000 in 20.042053s
Nokolexbor css 9.788k (± 0.4%) i/s - 196.377k in 20.063658s
Nokogiri css 149.956 (± 0.7%) i/s - 3.000k in 20.006363s

Comparison:
Nokolexbor css: 7437.6 i/s
Nokogiri css: 52.3 i/s - 142.11x (± 0.00) slower
Nokolexbor css: 9787.9 i/s
Nokogiri css: 150.0 i/s - 65.27x (± 0.00) slower

Warming up --------------------------------------
Nokolexbor at_xpath 2.000 i/100ms
Nokogiri at_xpath 4.000 i/100ms
Nokolexbor at_xpath 15.000 i/100ms
Nokogiri at_xpath 15.000 i/100ms
Calculating -------------------------------------
Nokolexbor at_xpath 57.077 (±31.5%) i/s - 920.000 in 20.156393s
Nokogiri at_xpath 53.176 (±35.7%) i/s - 876.000 in 20.036717s
Nokolexbor at_xpath 153.190 (± 0.7%) i/s - 3.075k in 20.073628s
Nokogiri at_xpath 154.588 (± 0.6%) i/s - 3.105k in 20.086664s

Comparison:
Nokolexbor at_xpath: 57.1 i/s
Nokogiri at_xpath: 53.2 i/s - same-ish: difference falls within error
Nokogiri at_xpath: 154.6 i/s
Nokolexbor at_xpath: 153.2 i/s - same-ish: difference falls within error

Warming up --------------------------------------
Nokolexbor xpath 3.000 i/100ms
Nokogiri xpath 3.000 i/100ms
Nokolexbor xpath 15.000 i/100ms
Nokogiri xpath 15.000 i/100ms
Calculating -------------------------------------
Nokolexbor xpath 51.523 (±31.1%) i/s - 903.000 in 20.102568s
Nokogiri xpath 58.438 (±35.9%) i/s - 852.000 in 20.001408s
Nokolexbor xpath 153.159 (± 0.7%) i/s - 3.075k in 20.077580s
Nokogiri xpath 154.322 (± 1.3%) i/s - 3.090k in 20.026288s

Comparison:
Nokogiri xpath: 58.4 i/s
Nokolexbor xpath: 51.5 i/s - same-ish: difference falls within error
Nokogiri xpath: 154.3 i/s
Nokolexbor xpath: 153.2 i/s - same-ish: difference falls within error
```
</details>
10 changes: 1 addition & 9 deletions ext/nokolexbor/extconf.rb
@@ -64,14 +64,6 @@ def which(cmd)
append_cflags("-DLEXBOR_STATIC")
append_cflags("-DLIBXML_STATIC")

def sys(cmd)
puts "-- #{cmd}"
unless ret = xsystem(cmd)
raise "ERROR: '#{cmd}' failed"
end
ret
end

# Thrown when we detect CMake is taking too long and we killed it
class CMakeTimeout < StandardError
end
@@ -138,7 +130,7 @@ def apply_patch(patch_file, chdir)

Dir.chdir("build") do
run_cmake(10 * 60, ".. -DCMAKE_INSTALL_PREFIX:PATH=#{INSTALL_DIR} #{lexbor_cmake_flags.join(' ')}")
sys("#{MAKE} install")
system("#{MAKE}", "install")
end
end

2 changes: 1 addition & 1 deletion ext/nokolexbor/libxml/tree.h
@@ -23,7 +23,7 @@ extern "C" {
#endif

static size_t tmp_len;
#define NODE_NAME(node) lxb_dom_node_name_qualified((node), &tmp_len)
#define NODE_NAME(node) lxb_dom_node_name_qualified((lxb_dom_node_t *)(node), &tmp_len)
#define NODE_NS_HREF(node) ((node)->prefix ? lxb_ns_by_id((node)->owner_document->ns, (node)->ns, &tmp_len) : NULL)
#define NODE_NS_PREFIX(node) lxb_ns_by_id((node)->owner_document->prefix, (node)->prefix, &tmp_len)

8 changes: 4 additions & 4 deletions ext/nokolexbor/nl_attribute.c
@@ -141,7 +141,7 @@ nl_attribute_parent(VALUE self)
if (attr->owner == NULL) {
return Qnil;
}
return nl_rb_node_create(attr->owner, nl_rb_document_get(self));
return nl_rb_node_create((lxb_dom_node_t *)attr->owner, nl_rb_document_get(self));
}

/**
@@ -158,7 +158,7 @@ nl_attribute_previous(VALUE self)
if (attr->prev == NULL) {
return Qnil;
}
return nl_rb_node_create(attr->prev, nl_rb_document_get(self));
return nl_rb_node_create((lxb_dom_node_t *)attr->prev, nl_rb_document_get(self));
}

/**
@@ -175,7 +175,7 @@ nl_attribute_next(VALUE self)
if (attr->next == NULL) {
return Qnil;
}
return nl_rb_node_create(attr->next, nl_rb_document_get(self));
return nl_rb_node_create((lxb_dom_node_t *)attr->next, nl_rb_document_get(self));
}

static VALUE
@@ -189,7 +189,7 @@ nl_attribute_inspect(VALUE self)

return rb_sprintf("#<%" PRIsVALUE " %s=\"%s\">", c,
lxb_dom_attr_qualified_name(attr, &len),
attr_value == NULL ? "" : attr_value);
attr_value == NULL ? "" : (char *)attr_value);
}

void Init_nl_attribute(void)
32 changes: 16 additions & 16 deletions ext/nokolexbor/nl_document.c
@@ -7,7 +7,7 @@ VALUE cNokolexborDocument;

#ifdef HAVE_PTHREAD_H
#include <pthread.h>
pthread_key_t p_key_parser;
pthread_key_t p_key_html_parser;
#endif

static void
@@ -51,23 +51,23 @@ nl_document_parse(VALUE self, VALUE rb_string_or_io)
size_t html_len = RSTRING_LEN(rb_html);

#ifdef HAVE_PTHREAD_H
lxb_html_parser_t *g_parser = (lxb_html_parser_t *)pthread_getspecific(p_key_parser);
lxb_html_parser_t *html_parser = (lxb_html_parser_t *)pthread_getspecific(p_key_html_parser);
#else
lxb_html_parser_t *g_parser = NULL;
lxb_html_parser_t *html_parser = NULL;
#endif
if (g_parser == NULL) {
g_parser = lxb_html_parser_create();
lxb_status_t status = lxb_html_parser_init(g_parser);
if (html_parser == NULL) {
html_parser = lxb_html_parser_create();
lxb_status_t status = lxb_html_parser_init(html_parser);
if (status != LXB_STATUS_OK) {
nl_raise_lexbor_error(status);

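Raising here leaks the parser allocated by `lxb_html_parser_create`, unless the status is `LXB_STATUS_ERROR_OBJECT_IS_NULL` (in which case creation itself failed and there is nothing to free). For any other status it also leaks whatever `lxb_html_parser_init` allocated before it failed. The parser object additionally leaks in the non-`HAVE_PTHREAD_H` case, because cleanup relies on the pthread key destructor, which is never registered in that build.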
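
One way to avoid the leak on the error path is to destroy the parser before reporting the failure, since the error helper ends in `rb_raise` and does not return. The sketch below only illustrates that ordering. It uses hypothetical stand-ins (`demo_parser_t`, `demo_parser_create`, `demo_parser_init`, `demo_parser_destroy`, `demo_parser_acquire`, and the `demo_live_allocs` counter) so that it compiles without lexbor or Ruby; it assumes the real destroy call is safe on a partially initialized parser, which should be checked against lexbor before applying the pattern.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for lxb_html_parser_t; not the lexbor type. */
typedef struct {
    void *tree; /* allocated by init, like the parser's internal state */
} demo_parser_t;

/* Counts outstanding allocations so a leak is observable. */
static int demo_live_allocs = 0;

static demo_parser_t *demo_parser_create(void)
{
    demo_parser_t *p = calloc(1, sizeof(*p));
    if (p != NULL) {
        demo_live_allocs++;
    }
    return p;
}

/* Returns 0 on success. With fail == true it fails after allocating,
 * which models an init that errors out part-way through. */
static int demo_parser_init(demo_parser_t *p, bool fail)
{
    if (p == NULL) {
        return -1; /* analogous to an "object is NULL" status */
    }
    p->tree = malloc(16);
    if (p->tree == NULL) {
        return -1;
    }
    demo_live_allocs++;
    return fail ? -1 : 0;
}

/* Safe on NULL and on partially initialized parsers; returns NULL so the
 * caller can overwrite its pointer, as the PR does with the real destroy. */
static demo_parser_t *demo_parser_destroy(demo_parser_t *p)
{
    if (p == NULL) {
        return NULL;
    }
    if (p->tree != NULL) {
        free(p->tree);
        demo_live_allocs--;
    }
    free(p);
    demo_live_allocs--;
    return NULL;
}

/* Returns a ready parser, or NULL with *status set. On failure everything
 * is released BEFORE the caller reports the error, because in the
 * extension the report is a raise that never returns. */
static demo_parser_t *demo_parser_acquire(bool fail_init, int *status)
{
    assert(status != NULL);
    demo_parser_t *p = demo_parser_create();
    *status = demo_parser_init(p, fail_init);
    if (*status != 0) {
        return demo_parser_destroy(p);
    }
    return p;
}
```

In `nl_document_parse` the equivalent change would be to call the destroy function on the freshly created parser before `nl_raise_lexbor_error(status)`. For the non-`HAVE_PTHREAD_H` build, the same idea suggests destroying the parser after `lxb_html_parse` returns, since nothing else owns it there.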

}
g_parser->tree->scripting = true;
html_parser->tree->scripting = true;
#ifdef HAVE_PTHREAD_H
pthread_setspecific(p_key_parser, g_parser);
pthread_setspecific(p_key_html_parser, html_parser);
#endif
}

lxb_html_document_t *document = lxb_html_parse(g_parser, (const lxb_char_t *)html_c, html_len);
lxb_html_document_t *document = lxb_html_parse(html_parser, (const lxb_char_t *)html_c, html_len);

if (document == NULL) {
rb_raise(rb_eRuntimeError, "Error parsing document");
@@ -104,7 +104,7 @@ static VALUE
nl_document_get_title(VALUE self)
{
size_t len;
lxb_char_t *str = lxb_html_document_title(nl_rb_document_unwrap(self), &len);
lxb_char_t *str = lxb_html_document_title((lxb_html_document_t *)nl_rb_document_unwrap(self), &len);
return str == NULL ? rb_str_new("", 0) : rb_utf8_str_new(str, len);
}

@@ -126,7 +126,7 @@ nl_document_set_title(VALUE self, VALUE rb_title)
{
const char *c_title = StringValuePtr(rb_title);
size_t len = RSTRING_LEN(rb_title);
lxb_html_document_title_set(nl_rb_document_unwrap(self), (const lxb_char_t *)c_title, len);
lxb_html_document_title_set((lxb_html_document_t *)nl_rb_document_unwrap(self), (const lxb_char_t *)c_title, len);
return rb_title;
}

@@ -143,18 +143,18 @@ nl_document_root(VALUE self)
}

static void
free_parser(void *data)
free_html_parser(void *data)
{
lxb_html_parser_t *g_parser = (lxb_html_parser_t *)data;
if (g_parser != NULL) {
g_parser = lxb_html_parser_destroy(g_parser);
lxb_html_parser_t *html_parser = (lxb_html_parser_t *)data;
if (html_parser != NULL) {
html_parser = lxb_html_parser_destroy(html_parser);
}
}

void Init_nl_document(void)
{
#ifdef HAVE_PTHREAD_H
pthread_key_create(&p_key_parser, free_parser);
pthread_key_create(&p_key_html_parser, free_html_parser);
#endif

cNokolexborDocument = rb_define_class_under(mNokolexbor, "Document", cNokolexborNode);