Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issue with backtracking caused by some regexes #332

Closed
elukey opened this issue Jul 30, 2018 · 4 comments · Fixed by #368
Closed

Performance issue with backtracking caused by some regexes #332

elukey opened this issue Jul 30, 2018 · 4 comments · Fixed by #368
Labels

Comments

@elukey
Copy link

elukey commented Jul 30, 2018

Hi! We are happy users of uap-core via https://github.com/ua-parser/uap-python, but we have recently noticed that some regexes may lead to excessive backtracking.

The regex that caused some issue for us is the following:

^(.*)/(\d+).?(\d+)?.?(\d+)?.?(\d+)? CFNetwork

A simple python script like the following hangs on my laptop for several minutes:

user_agent = "Mozilla/5.0 (X11; Linux x86_64_128) AppleWebKit/11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111.111111111111111111111111111111111111111111111111111111111111111111111111 (KHTML, like Gecko) Linux/222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222"
import re
pattern = re.compile(r'^(.*)/(\d+)\.?(\d+)?.?(\d+)?.?(\d+)? CFNetwork')
pattern.search(user_agent)

The main problem seems to be backtracking caused by the various blocks with .?, that doesn't happen if the regex is split into multiple ones for example:

^(.*)/(\d+) CFNetwork
^(.*)/(\d+)\.(\d+)? CFNetwork
^(.*)/(\d+)\.(\d+)\.(\d+)? CFNetwork
^(.*)/(\d+)\.(\d+)\.(\d+)\.?(\d+)? CFNetwork

The example that I brought up is of course a corner case, but even shorter strings with the same format can cause slow downs in parsing or hanging. It would be great to find a solution that is probably more verbose but that doesn't lead to these issues.

Thanks in advance!

Luca

@bcaller
Copy link

bcaller commented Aug 20, 2018

For me it's getting stuck at Bots containing spider|scrape|bot(but not CUBOT)|Crawl:

((?:[A-z0-9]+|[A-z\-]+ ?)?(?: the )?(?:[Ss][Pp][Ii][Dd][Ee][Rr]|[Ss]crape|[A-Za-z0-9-]*(?:[^C][^Uu])[Bb]ot|[Cc][Rr][Aa][Ww][Ll])[A-z0-9]*)(?:(?:[ /]| v)(\d+)(?:\.(\d+)(?:\.(\d+))?)?)?

My timing results when doubling the input user agent string length of a string of repeated letters:

> %time a = user_agent_parser.ParseUserAgent('a' * 100)
CPU times: user 48 ms, sys: 0 ns, total: 48 ms
Wall time: 47.2 ms

> %time a = user_agent_parser.ParseUserAgent('a' * 200)
CPU times: user 124 ms, sys: 4 ms, total: 128 ms
Wall time: 131 ms

> %time a = user_agent_parser.ParseUserAgent('a' * 400)
CPU times: user 612 ms, sys: 12 ms, total: 624 ms
Wall time: 626 ms

> %time a = user_agent_parser.ParseUserAgent('a' * 800)
CPU times: user 4.83 s, sys: 8 ms, total: 4.84 s
Wall time: 4.83 s

> %time a = user_agent_parser.ParseUserAgent('a' * 1600)
CPU times: user 38 s, sys: 104 ms, total: 38.1 s
Wall time: 38 s

> %time a = user_agent_parser.ParseUserAgent('a' * 3200)
CPU times: user 5min 3s, sys: 560 ms, total: 5min 4s
Wall time: 5min 3s

@commenthol
Copy link
Contributor

Problem should be solved. Could you please do a retest with a version greater 0.6.0?
If things got solved please close issue.

@bcaller
Copy link

bcaller commented Dec 26, 2018

How does replacing x? by (?:x|) prevent catastrophic backtracking?

It doesn't. In both python and javascript the same REDoS performance problems still exist (confirm by trying the above examples).

I assume it's because of the way safe-regex uses a heuristic test to find problematic regexes. I'm sorry but to me it looks like what #363 does is simply find a nice way to bypass safe-regex's test without actually fixing the regular expressions!

@commenthol
Copy link
Contributor

Hi @bcaller, Could you please retest with latest version 0.6.3? Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants