Optimize yara timeout value #201
Comments
Also, this should be covered in #100, as there can be potentially inefficient Yara rules as well.
Sidenote: do we really want a timeout feature in the first place? I'd like to see real-world firmware images that cause significant slowdowns before we add one. I also don't think we should set a timeout for Yara either.
Challenge accepted!
I edited the issue title and description. Ideally, we should benchmark our ruleset against random files of different sizes and derive an optimal timeout value from that. One limitation is that we can't make any assumptions about the resources of the machine unblob is running on, so there has to be some "breathing space" added to that timeout.
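A minimal sketch of what such a benchmark could look like, assuming the yara-python bindings and a combined ruleset file; the `unblob.yar` path, the file sizes, and the output format are illustrative assumptions, not anything from the discussion:

```python
import os
import tempfile
import time

import yara  # yara-python

# Hypothetical path to the combined ruleset; point it at wherever the rules live.
rules = yara.compile(filepath="unblob.yar")

# Time a scan of fully random files of increasing size to see how scan
# time grows with input size (it should be roughly linear).
for size_mb in (64, 256, 1024, 4096):
    with tempfile.NamedTemporaryFile(delete=False) as sample:
        # Write random data in 64 MB chunks to avoid holding gigabytes in memory.
        for _ in range(size_mb // 64):
            sample.write(os.urandom(64 * 1024 * 1024))
        path = sample.name
    try:
        start = time.monotonic()
        rules.match(path, timeout=600)  # generous timeout so every run finishes
        print(f"{size_mb:>5} MB -> {time.monotonic() - start:.1f} s")
    finally:
        os.unlink(path)
```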
Do we need a timeout at all at this level? Given that all our rules are well-behaved and don't cause backtracking in the regular expression engine, they should all scale linearly with file size. If there is a badly performing pattern, we should change it. On a related note, I am not particularly happy with Yara's performance in general.
Yara scanning timed out on 15 files out of more than 400,000 files during testing (setup: 4 cores, 16 GB memory, though Yara probably uses a single process). I'll investigate which rules on which exact files triggered this and whether a small increase in the timeout value would fix it. I'd like to fix it without increasing the timeout value, because at this point timeouts only happen on 0.00375% of our corpus.
After a closer inspection and a manual re-run, only three files time out. The others probably timed out because my system was clogged. The ones that still time out are the biggest samples, at 1.5 GB, 2.5 GB, and 3.6 GB respectively. The first thing to note is that Yara scanning with our current rules scales linearly with file size against fully random content.
Another thing to note is that the increased processing time seems to come from context switching on my system. If nothing else is going on (no network traffic, no unblob distributed processes, etc.), Yara can parse 1 GB of data from one of our samples in ~15 seconds. The impact of system activity will differ widely based on the system unblob runs on, and that's why I think we should increase the current timeout value. My guesstimation math would start from that ~15 seconds per GB figure and add some breathing space on top, for example:
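(A sketch of one way to turn that observation into a size-dependent timeout; the safety factor and the 60 second floor are assumptions for illustration, not values agreed on in this thread:)

```python
def estimate_yara_timeout(file_size_bytes: int,
                          seconds_per_gb: float = 15.0,
                          safety_factor: float = 4.0,
                          minimum: int = 60) -> int:
    """Derive a Yara scan timeout from the file size.

    Starts from the observed ~15 s/GB scan rate on an idle machine and
    multiplies it by a safety factor to leave breathing space for slower
    or busier systems, never going below the current 60 second default.
    """
    size_gb = file_size_bytes / (1024 ** 3)
    return max(minimum, round(size_gb * seconds_per_gb * safety_factor))

# e.g. a 3.6 GB sample would get roughly 216 seconds instead of the fixed 60
print(estimate_yara_timeout(int(3.6 * 1024 ** 3)))
```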
Another possibility, as mentioned by @vlaci, would be to just get rid of the timeout. We now know that it's highly unlikely for a scan to take more than 60 seconds in our own usage. Removing the timeout would allow any unblob user to scan a 256 GB full disk image if they want to have fun over the weekend.
I mean, if we properly indicate the processing progress, the user can decide whether they are willing to wait for however long it takes or kill it. At the end of the day, a limited total allowed processing time makes sense, but we already have the
When running unblob on very large files, we can reach a timeout condition triggered by Yara due to the fixed timeout of 60 seconds set in https://github.com/IoT-Inspector/unblob/blob/main/unblob/finder.py#L127.
This timeout value is not optimal for large files.
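For context, a rough sketch of that pattern, assuming the yara-python bindings: a fixed `timeout` is passed to `match()`, and a file that cannot be scanned within it makes yara-python raise `yara.TimeoutError`. The function and constant names below are illustrative and not unblob's actual code:

```python
import yara  # yara-python

YARA_TIMEOUT = 60  # fixed value, as currently set in unblob/finder.py

def scan(rules: yara.Rules, path: str):
    try:
        # On very large inputs the scan may not finish within the fixed
        # timeout, in which case yara-python aborts it with a TimeoutError.
        return rules.match(path, timeout=YARA_TIMEOUT)
    except yara.TimeoutError:
        print(f"Yara scan of {path} exceeded {YARA_TIMEOUT}s and was aborted")
        return []
```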