[Performance] RegExp uses undue amount of memory on Chromium-based browsers #3193

Closed
gorhill opened this issue Nov 2, 2017 · 3 comments

Comments

@gorhill
Owner

gorhill commented Nov 2, 2017

In commit bacf502, I refactored how the hostnames specified in the domain= option of a network static filter are handled.

Based on the results of the set-vs-regexp.html benchmark, I decided to use a regexp to quickly look up whether a hostname is part of the set of hostnames specified in a domain= option.
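To make the two approaches being compared concrete, here is a minimal sketch of a regexp-based and a Set-based hostname lookup. The helper names are hypothetical and this is not uBO's actual code; it also assumes that a hostname matches when it equals a listed hostname or is a subdomain of one.

```javascript
// Hypothetical helpers for illustration only, not uBO's implementation.

// Escape characters that have a special meaning in a regexp.
const escapeRegExp = s => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Build one regexp out of all hostnames: matches the hostname itself
// or any subdomain of it (assumed matching semantics).
function makeRegExpMatcher(hostnames) {
    const alternatives = hostnames.map(escapeRegExp).join('|');
    const re = new RegExp('(?:^|\\.)(?:' + alternatives + ')$');
    return hostname => re.test(hostname);
}

// Same semantics with a Set: strip the leftmost label until a hit
// is found or no label is left.
function makeSetMatcher(hostnames) {
    const set = new Set(hostnames);
    return hostname => {
        let hn = hostname;
        for (;;) {
            if (set.has(hn)) { return true; }
            const pos = hn.indexOf('.');
            if (pos === -1) { return false; }
            hn = hn.slice(pos + 1);
        }
    };
}

const hostnames = ['candyreader.com', 'likesblog.com'];
const byRegExp = makeRegExpMatcher(hostnames);
const bySet = makeSetMatcher(hostnames);
console.log(byRegExp('www.candyreader.com'), bySet('www.candyreader.com'));
```

Note that the Set version creates a new string for each label it strips, while the regexp version does a single test() call per lookup.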

However, as revealed by the "Take heap snapshot" memory tool in Chromium, the amount of memory used by RegExp instances on Chromium-based browsers is surprisingly large. Chromium allocates the internals of a RegExp instance lazily: the memory is allocated only when the exec() method is called on the instance.

However, as shown in the following screenshot, a lot of filters with the domain= option end up having their regexp executed earlier than expected. The heap snapshot was taken after launching uBO and visiting only the links on the front page of https://news.ycombinator.com/news:

[Screenshot: heap snapshot listing RegExp instances by memory use]

The top RegExp by memory use comes from the filter $script,third-party,domain=123videos.tv|171gifs.com|1proxy.de|... in EasyList. Such a filter will always end up being executed because it applies to any network request of type script. The domain= option of that specific filter contains 732 distinct hostnames.

As seen in the screenshot, even with a minimalist browsing session, all these RegExp instances add up to a good amount of memory. Pretty much all these memory-expensive RegExps are related to the domain= option in network static filtering.

Even a small EasyList filter such as |https://$script,third-party,xmlhttprequest,domain=candyreader.com|likesblog.com|projectfreetv.at|projectfreetv.sc|projectfreetv.us|projectwatchseries.com|shupebrothers.com|watchseriesonline.info -- which also always ends up being executed -- has a memory footprint of 6,880 bytes to represent just the eight distinct hostnames specified in its 144-character long domain= option.
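To put that footprint in perspective, the figures above can be turned into a per-hostname and per-character cost. The short sketch below only does arithmetic on the numbers quoted in this comment; the variable names are made up for illustration.

```javascript
// The domain= option of the filter quoted above.
const domainOpt =
    'candyreader.com|likesblog.com|projectfreetv.at|projectfreetv.sc|' +
    'projectfreetv.us|projectwatchseries.com|shupebrothers.com|' +
    'watchseriesonline.info';

// Footprint of the corresponding RegExp, as reported by the heap snapshot.
const footprintBytes = 6880;

const hostnameCount = domainOpt.split('|').length;        // 8 hostnames
const bytesPerHostname = footprintBytes / hostnameCount;  // 860 bytes
const bytesPerChar = footprintBytes / domainOpt.length;   // ~47.8 bytes

console.log(hostnameCount, domainOpt.length, bytesPerHostname, bytesPerChar.toFixed(1));
```

In other words, the RegExp costs 860 bytes per hostname, or close to 48 bytes for each character of the original option string.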

As shown in the benchmark, RegExp is reportedly considerably faster than Set when it comes to looking up whether a specific hostname is part of the set.

This issue is to document and address this domain=-related RegExp memory issue.

@gorhill
Owner Author

gorhill commented Nov 2, 2017

First step: the benchmark was revised to fix what I saw as flaws in it:
gorhill/obj-vs-set-vs-map@1809cc1

@lespea

lespea commented Nov 2, 2017

Just going to throw this out there, and this may be more work than you're willing to do and/or use too much memory in and of itself, but maybe using one of these regex-trie libraries to generate just a single regex for hostname matching would work?

I've only used a similar type of library in Perl, so I'm not sure how well either performs.

gorhill added a commit that referenced this issue Nov 2, 2017
gorhill added a commit to gorhill/obj-vs-set-vs-map that referenced this issue Nov 2, 2017
@gorhill
Owner Author

gorhill commented Nov 2, 2017

All of the regexps seen in the screenshot in the opening comment are from FilterOriginHitSet objects, whose purpose is to implement the domain= filter option. Below are the heap snapshots for FilterOriginHitSet using RegExp and HNTrie, respectively:

RegExp:
[Screenshot: heap snapshot of FilterOriginHitSet using RegExp]

HNTrie:
[Screenshot: heap snapshot of FilterOriginHitSet using HNTrie]

So by the look of the heap snapshots, HNTrie requires 10% of the memory required by RegExp, for the same underlying domain= option.

I added HNTrie to the benchmark.

Creation is slower with HNTrie, but this is no reason to avoid it:

  • Creation is a one-time event
  • Creation is done lazily -- triggered when the set of hostnames needs to be evaluated (not at launch time).
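The lazy, one-time creation described above can be sketched as follows. This is only an illustration under assumptions: the class and property names are hypothetical, and a plain Set with exact matching stands in for the real lookup structure.

```javascript
// Hypothetical sketch of lazy, one-time creation; not uBO's actual code.
class LazyHostnameSet {
    constructor(domainOpt) {
        // Only the raw option string is kept at filter creation time.
        this.domainOpt = domainOpt;
        this.lookup = null;
        this.buildCount = 0; // for demonstration purposes only
    }

    // Build the lookup structure (a Set stands in for it here).
    build() {
        this.buildCount += 1;
        return new Set(this.domainOpt.split('|'));
    }

    // The costly structure is built the first time it is needed,
    // then reused for every later call.
    matches(hostname) {
        if (this.lookup === null) {
            this.lookup = this.build();
        }
        return this.lookup.has(hostname);
    }
}

const lazySet = new LazyHostnameSet('candyreader.com|likesblog.com');
console.log(lazySet.buildCount);                  // nothing built yet
console.log(lazySet.matches('likesblog.com'));
console.log(lazySet.matches('example.org'));
console.log(lazySet.buildCount);                  // built exactly once
```

With this pattern, a filter that is never evaluated during a session never pays the creation cost.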

On my side, with Chromium, results improve for small sets. For medium and large sets, there is a small performance decrease relative to using regexp. The difference is not enough to worry about given the gain in memory efficiency; also keep in mind that regexps may incur other costs not measured by the benchmark. For example, there is no memory churning with HNTrie.matches, which might not be the case with regexps.

With Firefox (and Firefox for Android), there is a performance improvement in all cases -- Firefox is good at optimizing JavaScript code that works with TypedArray.
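For readers unfamiliar with the general idea, here is a minimal sketch of a hostname trie that is walked from the last character of the hostname to the first. It is not the HNTrie implementation: it uses nested Map objects rather than a TypedArray, the names are hypothetical, and it assumes a hostname matches when it equals a stored hostname or is a subdomain of one.

```javascript
// Illustrative only: a simplified hostname trie, not uBO's HNTrie.
const END = -1;   // marker key: a stored hostname ends at this node
const DOT = 0x2E; // character code of '.'

class SimpleHostnameTrie {
    constructor() {
        this.root = new Map();
    }

    // Store the hostname one character at a time, right to left.
    add(hostname) {
        let node = this.root;
        for (let i = hostname.length - 1; i >= 0; i--) {
            const c = hostname.charCodeAt(i);
            let next = node.get(c);
            if (next === undefined) {
                next = new Map();
                node.set(c, next);
            }
            node = next;
        }
        node.set(END, true);
    }

    // Walk the hostname right to left without creating any new string.
    matches(hostname) {
        let node = this.root;
        for (let i = hostname.length - 1; i >= 0; i--) {
            const c = hostname.charCodeAt(i);
            // A stored hostname ending at a label boundary is a match.
            if (c === DOT && node.has(END)) { return true; }
            node = node.get(c);
            if (node === undefined) { return false; }
        }
        return node.has(END);
    }
}

const trie = new SimpleHostnameTrie();
trie.add('candyreader.com');
trie.add('likesblog.com');
console.log(trie.matches('www.candyreader.com'), trie.matches('notlikesblog.com'));
```

Because the lookup only reads character codes and follows existing nodes, it allocates nothing per call, which is the property alluded to above regarding memory churning.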
