Optimize yara timeout value #201

Closed · qkaiser opened this issue Jan 26, 2022 · 9 comments · Fixed by #226

Labels: enhancement (New feature or request)

Comments

qkaiser (Contributor) commented Jan 26, 2022

When running unblob on very large files, we can reach a timeout condition triggered by Yara due to the fixed timeout of 60 seconds set in https://github.com/IoT-Inspector/unblob/blob/main/unblob/finder.py#L127:

def search_yara_patterns(
    yara_rules: yara.Rule, handler_map: Dict[str, Handler], full_path: Path
) -> List[YaraMatchResult]:
    """Search with the compiled YARA rules and identify the handler which defined the rule."""
    # YARA uses a memory mapped file internally when given a path
    yara_matches: List[yara.Match] = yara_rules.match(str(full_path), timeout=60)

This timeout value is not optimal for large files.

qkaiser added the "enhancement" label on Jan 26, 2022
vlaci (Contributor) commented Jan 27, 2022

Also, this should be covered in #100, as there can potentially be inefficient Yara rules as well.

vlaci (Contributor) commented Jan 27, 2022

Side note: do we really want a timeout feature in the first place? I'd like to see real-world firmware images that actually cause significant slowdowns. And I don't think we should set a timeout for Yara either.

kukovecz self-assigned this on Jan 27, 2022
qkaiser (Contributor, Author) commented Jan 27, 2022

Challenge accepted!

qkaiser changed the title from "Implement a timeout option" to "Optimize yara timeout value" on Jan 27, 2022
qkaiser (Contributor, Author) commented Jan 27, 2022

I edited the issue title and description. Ideally we should benchmark our ruleset against random files of different sizes and derive an optimal timeout value from it.

One limitation is that we can't make any assumptions about the resources of the machine unblob is running on, so there has to be some "breathing space" added to that timeout.

vlaci (Contributor) commented Jan 28, 2022

Do we need a timeout at all at this level? Given that all our rules are well-behaved and don't cause backtracking in the regular expression engine, they should all scale linearly with file size. If there is a "badly performing" pattern, we should change it. On a related note, I am not particularly happy with Yara's performance in general.

kukovecz removed their assignment on Jan 28, 2022
qkaiser (Contributor, Author) commented Feb 2, 2022

Yara scanning timed out on 15 files out of more than 400,000 files during testing (setup: 4 cores, 16GB of memory, though Yara probably uses a single process). I'll investigate which rules triggered this on which exact files, and whether a small increase in the timeout value would fix it.

I'd like to fix it without increasing the timeout value, because at this point timeouts only happen on 0.00375% of our corpus.

qkaiser (Contributor, Author) commented Feb 2, 2022

After closer inspection and a manual re-run, only three files time out. The others probably timed out because my system was overloaded. The ones that still time out are the biggest samples, at 1.5GB, 2.5GB, and 3.6GB respectively.

The first thing to note is that Yara scanning with our current rules scales linearly on files with fully random content:

poetry run python3 test.py
/tmp/1K.bin : 0 milliseconds
/tmp/1M.bin : 13 milliseconds
/tmp/10M.bin : 145 milliseconds
/tmp/100M.bin : 1267 milliseconds
/tmp/1G.bin : 13137 milliseconds
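
For reference, a minimal sketch of what such a timing harness could look like, using a single stand-in rule instead of our full compiled ruleset (the rule, file sizes, and chunked writing here are only illustrative):

import os
import time

import yara  # yara-python bindings

# Single stand-in rule; the actual measurement used unblob's compiled ruleset.
RULES = yara.compile(
    source="rule zip_header { strings: $magic = { 50 4B 03 04 } condition: $magic }"
)

SIZES = {"1K": 1 << 10, "1M": 1 << 20, "10M": 10 << 20, "100M": 100 << 20, "1G": 1 << 30}

for name, size in SIZES.items():
    path = f"/tmp/{name}.bin"
    # Fill the file with random content in 64MB chunks to keep memory usage bounded.
    with open(path, "wb") as f:
        remaining = size
        while remaining:
            chunk = min(remaining, 64 << 20)
            f.write(os.urandom(chunk))
            remaining -= chunk

    start = time.monotonic()
    RULES.match(path, timeout=600)  # generous timeout so the measurement itself never aborts
    elapsed_ms = int((time.monotonic() - start) * 1000)
    print(f"{path} : {elapsed_ms} milliseconds")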

Another thing to note is that the increased processing time seems to come from context switching on my system. If nothing else is going on (no network traffic, no distributed unblob processes, etc.), Yara can scan 1GB of data from one of our samples in ~15 seconds.

The impact of system activity will differ widely based on the system unblob runs on, which is why I think we should increase the current timeout value.

My guesstimation math would be, with:

  • T - average time in seconds it takes Yara to scan a 1GB file
  • M - maximum file size in GB we want unblob to support
  • F - factor representing the potential impact of system activity

timeout = T * M * F

For example: timeout = 15 * 10 * 2 = 300
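
A minimal sketch of how a timeout derived this way could replace the hard-coded 60 in search_yara_patterns; the constant names are only illustrative (not part of unblob), and the values plug in the example figures above:

import yara  # yara-python bindings

# Illustrative constants, not part of the unblob codebase.
SECONDS_PER_GB = 15     # T: average time for Yara to scan a 1GB file
MAX_FILE_SIZE_GB = 10   # M: maximum file size in GB we want to support
LOAD_FACTOR = 2         # F: breathing space for system activity

YARA_TIMEOUT = SECONDS_PER_GB * MAX_FILE_SIZE_GB * LOAD_FACTOR  # 300 seconds


def match_with_budget(rules: yara.Rules, path: str) -> list:
    """Run the compiled rules against a file, aborting after the derived timeout."""
    # yara-python raises yara.TimeoutError when the timeout is exceeded.
    return rules.match(path, timeout=YARA_TIMEOUT)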

kissgyorgy added this to the "v1.0 - extraction" milestone on Feb 2, 2022
qkaiser (Contributor, Author) commented Feb 2, 2022

Another possibility, as mentioned by @vlaci, would be to just get rid of the timeout. We now know for a fact that it's highly unlikely the scan time will exceed 60 seconds for our own usage. Removing the timeout would allow any unblob user to scan a 256GB full disk image if they want to have fun over the weekend.

vlaci (Contributor) commented Feb 2, 2022

I mean, if we properly indicate the processing progress, the user can decide whether they are willing to wait for however long it takes or kill it. At the end of the day, a limited total allowed processing time makes sense, but we already have the timeout command for that.
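
For illustration, the same idea sketched from Python: wrap the whole unblob invocation in an overall wall-clock budget, which is what prefixing the command with coreutils' timeout already gives on the shell (the budget value and CLI invocation below are assumptions):

import subprocess
import sys

# Overall wall-clock budget for the whole scan, in seconds; purely illustrative.
TOTAL_BUDGET = 30 * 60

try:
    # Assumes unblob is installed and takes a single input file argument.
    subprocess.run(["unblob", sys.argv[1]], timeout=TOTAL_BUDGET, check=True)
except subprocess.TimeoutExpired:
    print(f"unblob did not finish within {TOTAL_BUDGET} seconds", file=sys.stderr)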

qkaiser linked a pull request (#226) on Feb 2, 2022 that will close this issue