[static filter syntax] irregularities when no wildcard is used #1065

Closed
360up opened this issue Dec 14, 2015 · 8 comments

Comments

@360up

360up commented Dec 14, 2015

browser version/ublock version: iw-38.4.0esr / uBO-1.4.1b2
open this exact URL: https://github.com
do this: let's try to filter https://api.github.com/_private/browser/stats

None of the following filters work:
||api.git
^browse
browser
stats
_privat (for some reason _private works though)

All of the filters work in ABP (even a single character does), but in uBO it seems you have to add at least one wildcard if you don't:

  1. specify a full domain (e.g. github.com)
  2. enclose the filter in separator placeholders (e.g. ^browser^)
  3. include an enclosed separator character (e.g. browser/stats)
    ... but why is _private working then?

Not that those examples make much sense, I'm just trying to understand in which cases the syntax doesn't comply with ABP.

@gorhill
Owner

gorhill commented Dec 14, 2015

in uBO it seems you have to add at least

By design. Unlike ABP, uBO does not use RegExp internally by default for most filters, only when it cannot be avoided, for efficiency reasons. An asterisk is required to explicitly declare that any sequence of characters is allowed at a given spot, and this includes the beginning/end of a filter; otherwise the beginning/end is assumed to be a word boundary (where a word is defined as a sequence of characters from the set %0-9A-Za-z).
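
The rule described above can be sketched in a few lines of JavaScript. This is only an illustration of the word-boundary idea, not uBO's actual code: the function name `matchesPlainFilter` is hypothetical, only a leading or trailing `*` is handled, the `||`, `|` and `^` syntax and any other parsing rules are ignored, and it assumes a boundary is only required where the filter itself starts or ends with a word character.

```javascript
// Characters treated as "word" characters, per the set quoted above.
const WORD_CHAR = /^[%0-9A-Za-z]$/;

function isWordChar(ch) {
  // `undefined` (before the start / after the end of the URL) is not a word character.
  return ch !== undefined && WORD_CHAR.test(ch);
}

// Hypothetical helper: plain substring match with implied word boundaries.
function matchesPlainFilter(filter, url) {
  const leadingWildcard = filter.startsWith('*');
  const trailingWildcard = filter.length > 1 && filter.endsWith('*');
  const needle = filter.slice(
    leadingWildcard ? 1 : 0,
    trailingWildcard ? -1 : undefined
  );
  if (needle === '') return true;

  // Try every occurrence of the needle in the URL.
  for (let i = url.indexOf(needle); i !== -1; i = url.indexOf(needle, i + 1)) {
    const before = url[i - 1];
    const after = url[i + needle.length];
    // Assumption: a boundary is only needed when the needle's edge is a word character.
    const startOk = leadingWildcard || !isWordChar(needle[0]) || !isWordChar(before);
    const endOk = trailingWildcard ||
      !isWordChar(needle[needle.length - 1]) || !isWordChar(after);
    if (startOk && endOk) return true;
  }
  return false;
}

const statsUrl = 'https://api.github.com/_private/browser/stats';
console.log(matchesPlainFilter('_privat', statsUrl));   // false: "t" is followed by "e"
console.log(matchesPlainFilter('_private', statsUrl));  // true: "e" is followed by "/"
console.log(matchesPlainFilter('_privat*', statsUrl));  // true: the asterisk cancels the boundary
```

Under these assumptions the sketch reproduces the `_privat` versus `_private` difference reported in the opening post; it does not try to model the other filters listed there.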

@gorhill gorhill closed this as completed Dec 14, 2015
@lewisje

lewisje commented Dec 15, 2015

Could you put that into the "Static filter syntax" section of the wiki? I too was under the impression that blocking rules had implicit wildcards at beginning and end (with the exception that if the rule would otherwise start and end with a forward slash /, an asterisk * needed to be prepended or appended, usually the latter, to keep it from being interpreted as a regex).

This actually makes uBO more efficient than ABP and the like, because it's possible to specify that a filter begins and ends on a word boundary without using a regex.

@gorhill
Owner

gorhill commented Dec 15, 2015

I too was under the impression that blocking rules had implicit wildcards at beginning and end

They have implicit wildcards, except after implying word boundaries.

This actually makes uBO more efficient than ABP

The kind of filters seen in the OP are quite rare in EasyList/EasyPrivacy. They lead to inefficient filtering because they are broad and can't be tokenized (if following ABP rules), so every single URL must be checked against every single one of these untokenizable filters. Given their rarity in EasyList/EasyPrivacy, this is not really the explanation of why uBO's static network filtering engine is more efficient.

For other cases where the start or end is a word character, it's clear the intended function was not to have a wildcard interpretation. For example, in EasyList:

120x500.gif|

I really doubt that the purpose of this filter is to block all GIF images whose URL ends with 120x500.gif, but rather all GIF images whose size is 120 x 500, not 5120 x 500 or 10120 x 500. (Correction: OK, I realize the filter could be meant to match zebra120x500.gif and lion120x500.gif -- I may choose to remove the word boundary assumption... reopening.)

Implicit word boundaries are pretty much what always happens anyway in EasyList/EasyPrivacy. I am sure that nowadays, for performance considerations regarding ABP, official filter list maintainers avoid filters like those in the OP.

Regarding the evaluation of static network filters, the upper hand of uBO over ABP is not due to the differing interpretation of leading/trailing wildcards, but rather because the majority of filters are evaluated without a regex in uBO (this would still be true if wildcards were implied as in ABP), while all filters are translated into regexes in ABP (someone correct me if I am wrong).

Typically a regex will cause the whole URL to be scanned. With uBO, the token extracted from the URL is first used to look up a bucket of potentially matching filters and, key to the performance of static network filter evaluation, the position of the token is used as an anchor to perform a plain string comparison, with no need to scan the whole URL:

filter:  /top728x90.
tokens:   top728x90
offset:   1

URL:     https://www.example.com/images/top728x90.png?t=145009383
tokens:  https   www example com images top728x90 png   145009383
offset:  0       8   12      20  24     31        41    47

So uBO matches the anchor of the filter token to the anchor of the URL token:

tokens:                                 top728x90
offset:                                 1
tokens:  https   www example com images top728x90 png   145009383
offset:  0       8   12      20  24     31        41    47

The segment of the URL to compare against the filter string starts at position 31 - 1 = 30:

filter:                                /top728x90.
URL:     https://www.example.com/images/top728x90.png?t=145009383
offset:                                30

In the end it's a mere string comparison: URL.startsWith('/top728x90.', 30).
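
The token lookup and anchored comparison walked through above could be sketched roughly as follows. This is a simplified illustration, not uBO's implementation: `tokenize`, `addFilter` and `matchUrl` are hypothetical names, the token for each filter is supplied by hand, and word-boundary checks and filter options are left out.

```javascript
// Split a URL into tokens (runs of %0-9A-Za-z), keeping each token's offset.
function tokenize(url) {
  const tokens = [];
  for (const m of url.matchAll(/[%0-9A-Za-z]+/g)) {
    tokens.push({ token: m[0], offset: m.index });
  }
  return tokens;
}

// Store a filter in the bucket of its token, with the token's offset in the filter.
function addFilter(buckets, filter, token) {
  const tokenOffset = filter.indexOf(token);
  if (tokenOffset === -1) throw new Error('token not found in filter');
  if (!buckets.has(token)) buckets.set(token, []);
  buckets.get(token).push({ filter, tokenOffset });
}

// For each URL token, look up its bucket and compare at the anchored position only.
function matchUrl(buckets, url) {
  for (const { token, offset } of tokenize(url)) {
    const bucket = buckets.get(token);
    if (bucket === undefined) continue;
    for (const { filter, tokenOffset } of bucket) {
      const start = offset - tokenOffset;
      if (start >= 0 && url.startsWith(filter, start)) return filter;
    }
  }
  return null;
}

const buckets = new Map();
addFilter(buckets, '/top728x90.', 'top728x90');

const url = 'https://www.example.com/images/top728x90.png?t=145009383';
console.log(matchUrl(buckets, url)); // '/top728x90.' (compared at offset 31 - 1 = 30)
```

The point of the design is that only the filters sharing a token with the URL are ever looked at, and each of those costs one `startsWith` call at a known position rather than a scan of the whole URL.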

There are other optimizations for static network filter evaluation too, such as using the request type/party as a hash key to narrow the number of filters to evaluate, or the compact storage of plain hostname-based filters, which are the most common kind of filter in uBO.

@gorhill gorhill reopened this Dec 15, 2015
@gorhill gorhill closed this as completed Dec 15, 2015
@gorhill
Owner

gorhill commented Dec 15, 2015

I prefer to leave it as is; it just makes more sense to explicitly use a wildcard if we want to cancel word boundaries (i.e. *120x500.gif using the above example).

@gorhill
Owner

gorhill commented Feb 11, 2016

Here is a real case where implying a wildcard causes an issue: http://www.858.photos/.

The images in the carousel at the top of the page are prevented from loading, because their URL is something like http://www.858.photos/wp-content/uploads/2015/10/Blue-Hour-Exterior-18512-HDR-1160x600.jpg, while there is the filter 160x600.jpg| in EasyList.
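
To make the difference concrete, here is a small regex-based comparison of the two interpretations for the end-anchored filter 160x600.jpg|. It is only an illustration: `endAnchoredRegExp` is a hypothetical helper, not code from ABP or uBO, and the second test URL is made up.

```javascript
// Escape regex metacharacters so the filter body is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build a regex for an end-anchored filter body such as "160x600.jpg".
// With impliedWordBoundary, a body starting with a word character
// must not be preceded by another word character (%0-9A-Za-z).
function endAnchoredRegExp(body, impliedWordBoundary) {
  const wordChar = '[%0-9A-Za-z]';
  const startsWithWordChar = new RegExp('^' + wordChar).test(body);
  const prefix = impliedWordBoundary && startsWithWordChar
    ? '(?<!' + wordChar + ')'
    : '';
  return new RegExp(prefix + escapeRegExp(body) + '$');
}

const photoUrl =
  'http://www.858.photos/wp-content/uploads/2015/10/Blue-Hour-Exterior-18512-HDR-1160x600.jpg';

const impliedWildcard = endAnchoredRegExp('160x600.jpg', false);
const impliedBoundary = endAnchoredRegExp('160x600.jpg', true);

console.log(impliedWildcard.test(photoUrl)); // true: "1160x600.jpg" ends with the filter text
console.log(impliedBoundary.test(photoUrl)); // false: the match is preceded by the word character "1"
console.log(impliedBoundary.test('http://example.com/ads/160x600.jpg')); // true: preceded by "/"
```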

@ghost

ghost commented Mar 16, 2016

Is this the reason why the filter /adblockpopunder*$script works on NSFW but /adblockpopunder$script only works in ABP? I have been wondering about this for some weeks and just stumbled upon this issue, so I want to make sure this is the same case and intentional.

@gorhill
Owner

gorhill commented Mar 16, 2016

Yes, /adblockpopunder is interpreted differently in uBO compared to ABP. I believe ABP should change this and adopt the word-boundary interpretation; it's far more sensible: needing no word boundary is the exception in the majority of cases, and when no word boundary is required, the asterisk can be used to specify this (/adblockpopunder*). Currently, if someone really wants to force a word boundary, there is nothing in ABP to specify this.

@gwarser
Contributor

gwarser commented Jun 3, 2017

Interesting note: the asterisk is not required to be at the end of the string: https://gitlab.com/xuhaiyang1234/AAK-Cont/issues/17#note_31436319
