Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad regex in filters.txt #8280

Closed
timsayshey opened this issue Nov 27, 2020 · 23 comments
Closed

Bad regex in filters.txt #8280

timsayshey opened this issue Nov 27, 2020 · 23 comments

Comments

@timsayshey
Copy link

There is a regex in filters.txt on line 24380 and it's not clear what the regex filter is doing but it is randomly blocking legit scripts on different websites that I visit.

/^https?:\/\/[a-z]{7,15}\.(?:com|pw)\/[0-9a-zA-Z]{11,17}\/\d{4,5}$/$script,3p

There is no comment explaining what it does or why it is there.

I propose that it be removed.

Please advise.

@uBlock-user
Copy link
Contributor

but it is randomly blocking legit scripts on different websites that I visit.

Example ?

timsayshey added a commit to timsayshey/uAssets that referenced this issue Nov 27, 2020
There is a regex in filters.txt on line 24380 and it's not clear what the regex filter is doing but it is randomly blocking legit scripts on different websites that I visit.
@mapx-
Copy link
Contributor

mapx- commented Nov 27, 2020

reason: easylist/easylist#6476

Provide ALL the pages you found where legitimate stuff is blocked

@timsayshey
Copy link
Author

Okay, here's one where there should be an ecard widget displaying on the page but it's being blocked arbitrarily by this mess of a regex. There will be many unintended consequences for a regex like this.

Here's the site:
https://americanpatriotsunsung.com/thank-a-vet/

@mapx-
Copy link
Contributor

mapx- commented Nov 27, 2020

@Yuki2718 first FP

@Yuki2718
Copy link
Contributor

@timsayshey Sincere apology for inconvenience. It was added to beat revolving ad scripts but I now see it should be adjusted more carefully.
@mapx- Can you turn it off temporary?

@Yuki2718
Copy link
Contributor

@timsayshey Hi, can you give us some more examples of breakage? Fixing the regex to avoid the breakage on americanpatriotsunsung will be possible, but I want to ensure the fix doesn't break other pages. The thing is these ad server the filter targeted changes so frequently that chasing them is not trivial.

@mapx-
Copy link
Contributor

mapx- commented Nov 28, 2020

@Yuki2718 test this 1:

/^https?:\/\/[a-z]{7,15}\.(?:com|pw)\/([a-z]+[0-9]+[A-Z]).{11,17}\/\d{4,5}/$script,3p

@Yuki2718
Copy link
Contributor

@mapx- Sorry, I have a working solution of /^https?:\/\/[a-z]{7,15}\.(?:com|pw)\/(?=[0-9A-Z]{0,16}[a-z])(?=[0-9a-z]{0,16}[A-Z])[0-9a-zA-Z]{11,17}\/\d{4,5}$/$script,3p it's best because doesn't care which order they appear, and also there's a case not including any number. Just waiting the OP's example of other breakage.

@mapx-
Copy link
Contributor

mapx- commented Nov 28, 2020

yours does not work:

image

@Yuki2718
Copy link
Contributor

Yuki2718 commented Nov 28, 2020

@mapx- You're right, and apparently the reason is somehow uBO's regex interpretation is case-insensitive. I also observed the case insensitivity on easylist/easylist#6537 Pinging @gorhill for a possible bug.
As an evidence, /^https?:\/\/[a-z]{7,15}\.(?:com|pw)\/(?=[0-9A-Z]{0,16}[a-z])(?=[a-zA-Z]{0,16}[0-9])[0-9a-zA-Z]{11,17}\/\d{4,5}$/$script,3p
(includes at least one number, instead of capital letter) correctly allow the script.

@mapx-
Copy link
Contributor

mapx- commented Nov 28, 2020

So, again, did you test my filter ?

@Yuki2718
Copy link
Contributor

Yes, that in turn misses some ad scripts. e.g. http://tamilyogimovie.co/ misses http://hubmaydaybrow.com/fEzlGv0Im0sb/27969.

@mapx-
Copy link
Contributor

mapx- commented Nov 28, 2020

it seems yours is good, some bug in uBO, tested your on regex101 is ok, see the internal discussion

@gorhill
Copy link
Member

gorhill commented Nov 28, 2020

for a possible bug

It's not a bug, that's by design. All matching is meant to be case insensitive. There used to be a match-case filter option, which uBO never implemented, but it was also dropped by ABP eventually.

@gorhill
Copy link
Member

gorhill commented Nov 28, 2020

So what is the purpose of that rather broad filter?

If it's meant to address something on tamilyogimovie.co, then please narrow it to that site.

@mapx-
Copy link
Contributor

mapx- commented Nov 28, 2020

There are legit scripts (all chars lowercase) and "bad" ones containing lowercase, uppercase and numeric chars (there is some case with only lowercase and uppercase). The logger is presenting the real situation.

The filter above is correct for the real world but not in the case of first replacing upper by lower case (as in uBO's case). When - for example - we have (bad script) lower+upper case will have no means to distinguish legit by crap scripts (all lowercase)

@gorhill
Copy link
Member

gorhill commented Nov 28, 2020

and "bad" ones containing lowercase

But which sites are using these bad ones? Surely we shouldn't assume they can be present on all sites?

@mapx-
Copy link
Contributor

mapx- commented Nov 28, 2020

It's a large category of such crap:
easylist/easylist#6476

13x4.com/view/Rsj9ZdxbHk => crap: https://beakedpissod.com/r8lEgmDzyqTglNtG/17604
http://tamilyogimovie.co/ => http://hulkflugarb.com/rWiSeTtqQ9d/27967 and http://hubmaydaybrow.com/fEzlGv0Im0sb/27969
https://uptomega.me/49rf3ow09i12 => https://ledmophemp.com/rnkHzsMoBCNP5k/12790

https://1filmy4wap.com => https://emolapnay.com/rlnzLzvjGRKBY/20600

Last case without numeric chars

@gorhill
Copy link
Member

gorhill commented Nov 28, 2020

It's still a limited set of sites, there are filters with domain= option with hundreds of hostnames in EasyList. Such risky broad filters meant to apply on a limited set of sites should be narrowed to those sites only, I consider it's worst to risk breaking a lot of legitimate sites than to miss blocking on a limited set of fishy sites. The kind of breakage reported here makes uBO look bad and makes it more difficult to argue that uBO is an install-and-forget blocker.

@gorhill
Copy link
Member

gorhill commented Nov 28, 2020

Now regarding the case insensitivity issue, if it's something really needed, then this should go into a new issue -- I may choose to support a match-case option but that would be only for regex-based filters.

@gorhill
Copy link
Member

gorhill commented Nov 28, 2020

but it was also dropped by ABP eventually

Source: https://issues.adblockplus.org/ticket/7318/

But I see now that it seems they never went ahead with this change, so apparently ABP still support match-case.

@Yuki2718
Copy link
Contributor

I won't open issue at least for this, since there's no guarantee that the case sensitive filter doesn't cause any FP I think the mentioned approach of listing all the domains the filter is useful will be safer. But then we have denyallow, so probably no need for regex. It was worth trying though, as these sites come and go one after another.

However, EasyList issue 6537 will be worth remembered. TL;DR is that an EL filter
/^https?:\/\/.*\/.*sw[0-9._].*/$domain=spankwire.com
wrongly trapped a benign request which happened to include sW. It was fixed by narrowing the scope to $script,xhr but such case can occur even in right scope. All filter authors know matching is case-insensitive but probably tend to assume regex is an exception, in fact EL and reginal lists often put A-Z along with a-z. Problems like this will still be prevented if regex is created carefully though.

@gorhill
Copy link
Member

gorhill commented Nov 28, 2020

I decided I will add match-case, but it will be meaningful only for regex-based filters.

gorhill added a commit to gorhill/uBlock that referenced this issue Nov 28, 2020
`match-case`
------------

Related issue:
- uBlockOrigin/uAssets#8280 (comment)

The new filter option `match-case` can be used only for
regex-based filters. Using `match-case` with any other
sort of filters will cause uBO to discard the filter.

`redirect=`
-----------

Related issue:
- uBlockOrigin/uBlock-issues#1366

`redirect=` filters with unresolvable resource token at
runtime will be discarded.

Additionally, the implicit priority is now set to 1
(was 0). The idea is to allow custom `redirect=` filters
to be used strictly as fallback `redirect=` filters in case
another `redirect=` filter is not picked up.

For example, one might create a `redirect=click2load.html:0`
filter, to be taken if and only if the blocked resource is
not already being redirected by another "official" filter
in one of the enabled filter lists.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants