Optimize yara timeout value #201

Closed · qkaiser opened this issue Jan 26, 2022 · 9 comments · Fixed by #226

Labels: enhancement (New feature or request)

Comments

qkaiser (Contributor) commented Jan 26, 2022

When running unblob on very large files, we can reach a timeout condition triggered by Yara due to the fixed timeout of 60 seconds set in https://github.com/IoT-Inspector/unblob/blob/main/unblob/finder.py#L127:

def search_yara_patterns(
    yara_rules: yara.Rule, handler_map: Dict[str, Handler], full_path: Path
) -> List[YaraMatchResult]:
    """Search with the compiled YARA rules and identify the handler which defined the rule."""
    # YARA uses a memory mapped file internally when given a path
    yara_matches: List[yara.Match] = yara_rules.match(str(full_path), timeout=60)

This timeout value is not optimal for large files.

qkaiser added the "enhancement" label on Jan 26, 2022
vlaci (Contributor) commented Jan 27, 2022

Also, this should be covered in #100, as there can potentially be inefficient Yara rules as well.

vlaci (Contributor) commented Jan 27, 2022

Side note: do we really want a timeout feature in the first place? I'd like to see real-world firmware images that actually cause significant slowdowns. And I don't think we should set a timeout for Yara either.

kukovecz self-assigned this on Jan 27, 2022
qkaiser (Contributor, Author) commented Jan 27, 2022

Challenge accepted!

qkaiser changed the title from "Implement a timeout option" to "Optimize yara timeout value" on Jan 27, 2022
qkaiser (Contributor, Author) commented Jan 27, 2022

I edited the issue title and description. Ideally we should benchmark our ruleset against random files of different sizes and derive an optimal timeout value from it.

One limitation is that we can't make any assumptions about the resources of the machine unblob is running on, so there has to be some "breathing space" added to that timeout.

vlaci (Contributor) commented Jan 28, 2022

Do we need a timeout at all at this level? Given that all our rules are well-behaved and don't cause backtracking in the regular expression engine, they should all scale linearly with file size. If there is a "badly performing" pattern, we should change it. On a related note, I am not particularly happy with Yara's performance in general.

kukovecz removed their assignment on Jan 28, 2022
qkaiser (Contributor, Author) commented Feb 2, 2022

Yara scanning timed out on 15 files out of more than 400,000 files during testing (setup: 4 cores, 16GB of memory, though Yara probably uses a single process). I'll investigate which rules triggered this on which exact files, and whether a small increase in the timeout value would fix it.

I'd like to fix it without increasing the timeout value, because at this point timeouts only happen on 0.00375% of our corpus.

qkaiser (Contributor, Author) commented Feb 2, 2022

After closer inspection and a manual re-run, only three files time out. The others probably timed out because my system was overloaded. The ones that still time out are the biggest samples, at 1.5GB, 2.5GB, and 3.6GB respectively.

The first thing to note is that Yara scanning with our current rules scales linearly on files with fully random content:

poetry run python3 test.py
/tmp/1K.bin : 0 milliseconds
/tmp/1M.bin : 13 milliseconds
/tmp/10M.bin : 145 milliseconds
/tmp/100M.bin : 1267 milliseconds
/tmp/1G.bin : 13137 milliseconds
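
For reference, a minimal sketch of what such a timing harness could look like, using a single stand-in rule instead of our full compiled ruleset (the rule, file sizes, and chunked writing here are only illustrative):

import os
import time

import yara  # yara-python bindings

# Single stand-in rule; the actual measurement used unblob's compiled ruleset.
RULES = yara.compile(
    source="rule zip_header { strings: $magic = { 50 4B 03 04 } condition: $magic }"
)

SIZES = {"1K": 1 << 10, "1M": 1 << 20, "10M": 10 << 20, "100M": 100 << 20, "1G": 1 << 30}

for name, size in SIZES.items():
    path = f"/tmp/{name}.bin"
    # Fill the file with random content in 64MB chunks to keep memory usage bounded.
    with open(path, "wb") as f:
        remaining = size
        while remaining:
            chunk = min(remaining, 64 << 20)
            f.write(os.urandom(chunk))
            remaining -= chunk

    start = time.monotonic()
    RULES.match(path, timeout=600)  # generous timeout so the measurement itself never aborts
    elapsed_ms = int((time.monotonic() - start) * 1000)
    print(f"{path} : {elapsed_ms} milliseconds")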

Another thing to note is that the increased processing time seems to come from context switching on my system. If nothing else is going on (no network traffic, no distributed unblob processes, etc.), Yara can scan 1GB of data from one of our samples in ~15 seconds.

The impact of system activity will differ widely based on the system unblob runs on, which is why I think we should increase the current timeout value.

My guesstimation math would be, with:

  • T - average time in seconds it takes Yara to scan a 1GB file
  • M - maximum file size in GB we want unblob to support
  • F - factor representing the potential impact of system activity

timeout = T * M * F

For example: timeout = 15 * 10 * 2 = 300
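
A minimal sketch of how a timeout derived this way could replace the hard-coded 60 in search_yara_patterns; the constant names are only illustrative (not part of unblob), and the values plug in the example figures above:

import yara  # yara-python bindings

# Illustrative constants, not part of the unblob codebase.
SECONDS_PER_GB = 15     # T: average time for Yara to scan a 1GB file
MAX_FILE_SIZE_GB = 10   # M: maximum file size in GB we want to support
LOAD_FACTOR = 2         # F: breathing space for system activity

YARA_TIMEOUT = SECONDS_PER_GB * MAX_FILE_SIZE_GB * LOAD_FACTOR  # 300 seconds


def match_with_budget(rules: yara.Rules, path: str) -> list:
    """Run the compiled rules against a file, aborting after the derived timeout."""
    # yara-python raises yara.TimeoutError when the timeout is exceeded.
    return rules.match(path, timeout=YARA_TIMEOUT)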

kissgyorgy added this to the "v1.0 - extraction" milestone on Feb 2, 2022
qkaiser (Contributor, Author) commented Feb 2, 2022

Another possibility, as mentioned by @vlaci, would be to just get rid of the timeout. We now know for a fact that it's highly unlikely the scan time will exceed 60 seconds for our own usage. Removing the timeout would allow any unblob user to scan a 256GB full disk image if they want to have fun over the weekend.

vlaci (Contributor) commented Feb 2, 2022

I mean, if we properly indicate the processing progress, the user can decide whether they are willing to wait for however long it takes or kill it. At the end of the day, a limited total allowed processing time makes sense, but we already have the timeout command for that.
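
For illustration, the same idea sketched from Python: wrap the whole unblob invocation in an overall wall-clock budget, which is what prefixing the command with coreutils' timeout already gives on the shell (the budget value and CLI invocation below are assumptions):

import subprocess
import sys

# Overall wall-clock budget for the whole scan, in seconds; purely illustrative.
TOTAL_BUDGET = 30 * 60

try:
    # Assumes unblob is installed and takes a single input file argument.
    subprocess.run(["unblob", sys.argv[1]], timeout=TOTAL_BUDGET, check=True)
except subprocess.TimeoutExpired:
    print(f"unblob did not finish within {TOTAL_BUDGET} seconds", file=sys.stderr)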

qkaiser linked a pull request (#226) on Feb 2, 2022 that will close this issue